Compare commits

6248 Commits

1eb6146d96 Add manual simple retry to ECR login (#71287)
Summary:
The current retry with AWS_MAX_ATTEMPTS does not seem to work, as we still get failures: https://github.com/pytorch/pytorch/runs/4806177738?check_suite_focus=true

This should hopefully alleviate the problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71287

Reviewed By: malfet, seemethere

Differential Revision: D33573788

Pulled By: janeyx99

fbshipit-source-id: 300fde9a9fa5a2da3e9d18b7989a3676500d8011
2022-01-18 10:56:53 -08:00
2bb6a4f437 Generate aten_interned_strings.h automatically (#69407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69407

This generates `aten_interned_strings.h` from `native_functions.yaml`,
which is closer to how it was originally done. The items deleted from
`interned_strings.h` are duplicates that need to be removed in order
for the code to compile; some of the remaining items may still be out
of date, but even then the effect is fairly benign.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32923636

Pulled By: albanD

fbshipit-source-id: a0fd6b3714e70454c5f4ea9b19da5e047d2a4687
2022-01-18 08:29:54 -08:00
d665097cad allow Bazel to build without glog and gflags (#70850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70850

We support both, so we want to ensure both continue to work.
ghstack-source-id: 146960552

Test Plan: Tested manually. A subsequent diff adds this test configuration to CI.

Reviewed By: malfet

Differential Revision: D33297464

fbshipit-source-id: 70e1431d0907d480c576239af93ef57036d5e4d7
2022-01-18 08:08:46 -08:00
ffdc6b4994 extract //c10/macros to its own package (#70849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70849

ghstack-source-id: 146960563

Test Plan: Bazel CI tests will protect this.

Reviewed By: malfet

Differential Revision: D33297235

fbshipit-source-id: 6504a977e82ad2f2232a74233b96cdea8bf94a20
2022-01-18 08:08:42 -08:00
8d0e354191 fix CAFFE2_BUILD_MAIN_LIB to the correct C10_BUILD_MAIN_LIB (#70848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70848

This is the C10 library, so that's the main lib we are building
here. While here, use `local_defines` instead of `copts` for this
definition. Both `copts` and `local_defines` apply only to the
compilation units in the library, and not transitively.
ghstack-source-id: 146998039

Test Plan: We are relying on CI to verify this doesn't cause any problems.

Reviewed By: malfet

Differential Revision: D33429420

fbshipit-source-id: b3fc84c0588bd43346e3f9f77e851d293bde9428
2022-01-18 08:05:20 -08:00
fd9e08df5d Make Demux serializable with lambda function (#71311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71311

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33584552

Pulled By: ejguan

fbshipit-source-id: 52324faf5547f9f77582ec170ec91ce3114cfc61
2022-01-18 06:47:54 -08:00
f0db15122f [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33629127

fbshipit-source-id: 47befcd98cfa544a4d822161d8bfbe8d7a788e4d
2022-01-18 01:50:08 -08:00
d17f340a2e The Cacherator (#71350)
Summary:
This PR adds a persistent filesystem cache for jitted kernels. The cache is disabled on Windows because it relies on POSIX headers.

The cache writes, by default, to `~/.cache/torch/kernels`, but the location can be controlled by setting the `PYTORCH_KERNEL_CACHE_PATH` environment variable. A separate environment variable, `USE_PYTORCH_KERNEL_CACHE`, will disable all caching logic when set to zero.

The use of a persistent filesystem cache dramatically lowers the "first call time" for an operator AFTER it has been compiled, because it skips (most of) the jit compilation process. On systems where we're compiling only to ptx, that ptx still has to be just-in-time compiled by the driver API, so an additional latency of around 10 milliseconds is expected at first call time. On systems which compile to SASS, the additional first call time latency is about one millisecond. This compares with times of 150+ milliseconds for just-in-time kernel compilation.

Files in the cache use a mostly human-readable name that includes an SHA1 hash of the CUDA C string used to generate them. Note that this is not an SHA1 hash of the file's contents, because the contents are the compiled ptx or SASS. No verification is done when the file is loaded to ensure the kernel is what's expected, but it's far more likely you'll be struck by a meteor than observe two file names conflict. Using SHA1 hashes to generate unique ids this way is a common practice (GitHub does it, too).

This cache design could be reused by other fusion systems and should allow us to jiterate more operations without fear of regressing the "incremental development" scenario where users are tweaking or extending programs slightly, rerunning them, and then repeating that process again and again. Without a cache, each run of the program would have to recompile every jitted kernel, but with this cache we expect a negligible impact to the user experience.
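
A rough Python sketch of the lookup scheme described above (the actual implementation is C++ inside the jit compilation path; the function names, file-name layout, and `compile_fn` hook here are illustrative assumptions):

```
import hashlib
import os

def kernel_cache_path(cuda_source, kernel_name):
    # Hypothetical naming scheme: a readable kernel name plus an SHA1
    # hash of the CUDA C source (not of the compiled ptx/SASS contents).
    cache_dir = os.environ.get(
        "PYTORCH_KERNEL_CACHE_PATH",
        os.path.expanduser("~/.cache/torch/kernels"),
    )
    digest = hashlib.sha1(cuda_source.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, "%s_%s" % (kernel_name, digest))

def load_or_compile(cuda_source, kernel_name, compile_fn):
    if os.environ.get("USE_PYTORCH_KERNEL_CACHE") == "0":
        return compile_fn(cuda_source)          # caching disabled entirely
    path = kernel_cache_path(cuda_source, kernel_name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()                     # cached ptx or SASS
    binary = compile_fn(cuda_source)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:                 # no content verification on load
        f.write(binary)
    return binary
```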

cc kshitij12345, xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71350

Reviewed By: ngimel

Differential Revision: D33626671

Pulled By: mruberry

fbshipit-source-id: d55df53416fbe46348623846f699f9b998e6c318
2022-01-17 23:52:14 -08:00
7b9fff90d2 empty_generic: Remove redundant device argument (#70612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70612

The device information is embedded in the `DataPtr` returned from the
allocator, so this argument is completely ignored.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33623681

Pulled By: ngimel

fbshipit-source-id: bea64707bb17d46debb0ed7c1175493df56fee77
2022-01-17 20:18:43 -08:00
f93ffc9ea8 Sparse CSR: Handle zero matrix consistently for triangular_solve (#71304)
Summary:
This PR enables `test_block_triangular` tests on the CPU.
These tests revealed a problem with how the nnz == 0 case was handled. We now return a tensor filled with NaNs on both CUDA and CPU.

cc nikitaved pearu cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71304

Reviewed By: davidberard98

Differential Revision: D33600482

Pulled By: cpuhrsch

fbshipit-source-id: d09cb619f8b6e54b9f07eb16765ad1c183c42487
2022-01-17 13:47:49 -08:00
17540c5c80 [warnings][Caffe2] Suppress warnings in non-c10 headers (#71370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71370

Round out suppressing warnings in `caffe2` headers

Test Plan: CI check

Reviewed By: r-barnes

Differential Revision: D33613084

fbshipit-source-id: 9306d480bd796aeae4d887ad26b6ddc2c571c9e4
2022-01-17 10:09:31 -08:00
cf47338191 [Caffe2][warnings] Suppress -Wimplicit-int-float-conversion in TypeSafeSignMath.h for clang (#71369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71369

Suppress `-Wimplicit-int-float-conversion` in `TypeSafeSignMath.h` when building with clang

Test Plan: CI check

Reviewed By: r-barnes

Differential Revision: D33612983

fbshipit-source-id: cff1239bc252d4a2f54a50a2bbcd48aeb8bf31ca
2022-01-17 10:05:21 -08:00
ddf97a59ca Remove the dependency of pytorch nightly. (#71323)
Summary:
This PR removes the PyTorch nightly dependencies of TorchBench CI. Instead, it relies on the bisection script to install TorchBench dependencies (https://github.com/pytorch/benchmark/pull/694).
This will unblock TorchBench CI users when the nightly build fails (e.g., https://github.com/pytorch/pytorch/issues/71260)

RUN_TORCHBENCH: resnet18
TORCHBENCH_BRANCH: xz9/optimize-bisection

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71323

Reviewed By: wconstab

Differential Revision: D33591713

Pulled By: xuzhao9

fbshipit-source-id: f1308ea33ece1f18196c993b40978351160ccc0c
2022-01-17 09:52:36 -08:00
a383d01774 [fbcode][warnings] Suppress warnings in caffe2/c10 (#71356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71356

Suppress remaining header based warnings in `caffe2/c10` when building with `clang`

Test Plan: CI pass

Reviewed By: r-barnes

Differential Revision: D33600097

fbshipit-source-id: e1c0d84a0bad768eb03e047d62b5379cf28b48e2
2022-01-15 18:34:08 -08:00
1ecfa1d61a Load zip file in deploy interpreter (#71072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71072

This PR replaces the old logic of loading frozen torch through cpython by directly loading zipped torch modules directly onto deploy interpreter. We use elf file to load the zip file as its' section and load it back in the interpreter executable. Then, we directly insert the zip file into sys.path of the each initialized interpreter. Python has implicit ZipImporter module that can load modules from zip file as long as they are inside sys.path.
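
For reference, a minimal sketch of the standard-library behavior this relies on: once an archive is on `sys.path`, Python's zipimport machinery resolves imports from inside it (the archive path and module below are hypothetical):

```
import sys
import zipfile

# Build a toy archive containing one module (hypothetical path).
with zipfile.ZipFile("/tmp/frozen_modules.zip", "w") as zf:
    zf.writestr("mymod.py", "VALUE = 42\n")

# With the archive on sys.path, the import system loads from it.
sys.path.insert(0, "/tmp/frozen_modules.zip")
import mymod
print(mymod.VALUE)  # 42
```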

Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy

Reviewed By: shunting314

Differential Revision: D32442552

fbshipit-source-id: 627f0e91e40e72217f3ceac79002e1d8308735d5
2022-01-15 14:39:59 -08:00
08d8f81704 [quant][fix][fx][graphmode] Fix qconfig setting for fused modules (#71254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71254

When we configure linear and relu with the same qconfig, we currently have utility functions to also
generate a qconfig for the fused linear-relu module, but this code was not called in the correct order,
which resulted in unexpected behaviors. This PR fixes the issue. Please see the test case for more details.
(Test case is from Supriya.)

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_fused_module_qat_swap

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33558321

fbshipit-source-id: d95114dc4b77264e603c262c2da02a3de4acba69
2022-01-14 23:31:11 -08:00
bb49352354 caffe2/torch/csrc/jit/frontend/tree_views: workaround nvcc compiler error
Test Plan:
Move it outside the header so it's not seen by nvcc

```
$ buck2 build -c fbcode.platform=platform010 fbcode//accelerators/pytorch/lib/cuda:ngram_repeat_block_cuda
Downloading buck2...
[======================================================================]

watchman fresh instance event, clearing cache
Using disallowed linker flag 'arvr/third-party/toolchains/platform009/build/mesa/lib/libGL.so' in library rule 'fbsource//third-party/toolchains:opengl'
Using disallowed linker flag 'arvr/third-party/freeglut/3.0.0/libs/x64-linux/libglut.a' in library rule 'fbsource//third-party/toolchains:GLUT'
Action Failed for fbcode//accelerators/pytorch/lib/cuda:ngram_repeat_block_cuda (ovr_config//platform/linux:x86_64-fbcode-platform010-clang-6dbc4bb1b9a32829)#5:
cxx_compile ngram_repeat_block_cuda_kernel.cu (pic) failed with non-zero exit code 1
debug information: action_digest=b2bda91d24dad53e960c740ef9a412cee1902d86:94
stdout:
stderr:
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h: In instantiation of 'static torch::jit::Maybe<T> torch::jit::Maybe<T>::create(const torch::jit::SourceRange&, const T&) [with T = torch::jit::List<torch::jit::Property>]':
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:505:117:   required from here
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:220:33: error: cannot convert 'const torch::jit::List<torch::jit::Property>' to 'torch::jit::TreeList&&' {aka 'c10::SmallVector<c10::intrusive_ptr<torch::jit::Tree>, 4>&&'}
  220 |     return Maybe<T>(Compound::create(TK_OPTION, range, {value}));
      |                ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
fbcode/caffe2/torch/csrc/jit/frontend/tree.h:144:1: note:   initializing argument 3 of 'static torch::jit::TreeRef torch::jit::Compound::create(int, const torch::jit::SourceRange&, torch::jit::TreeList&&)'
  143 |       const SourceRange& range_,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~
  144 |       TreeList&& trees_) {
      | ^
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h: In instantiation of 'static torch::jit::Maybe<T> torch::jit::Maybe<T>::create(const torch::jit::SourceRange&, const T&) [with T = torch::jit::List<torch::jit::Assign>]':
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:505:171:   required from here
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:220:33: error: cannot convert 'const torch::jit::List<torch::jit::Assign>' to 'torch::jit::TreeList&&' {aka 'c10::SmallVector<c10::intrusive_ptr<torch::jit::Tree>, 4>&&'}
  220 |     return Maybe<T>(Compound::create(TK_OPTION, range, {value}));
      |                ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
fbcode/caffe2/torch/csrc/jit/frontend/tree.h:144:1: note:   initializing argument 3 of 'static torch::jit::TreeRef torch::jit::Compound::create(int, const torch::jit::SourceRange&, torch::jit::TreeList&&)'
  143 |       const SourceRange& range_,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~
  144 |       TreeList&& trees_) {
      | ^
cc1plus: note: unrecognized command-line option '-Wno-ignored-optimization-argument' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ambiguous-reversed-operator' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ignored-optimization-argument' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ambiguous-reversed-operator' may have been intended to silence earlier diagnostics
command: buck-out/v2/gen/fbcode/999b02f9444004c1/tools/build/__wrap_nvcc.py__/wrap_nvcc.py -_NVCC_BIN_ fbcode ...<omitted>... ors/pytorch/lib/cuda/__ngram_repeat_block_cuda__/__objects__/ngram_repeat_block_cuda_kernel.cu.pic.o (rerun with -v to view the untruncated command)

```

Reviewed By: zhxchen17

Differential Revision: D33592885

fbshipit-source-id: a36dcb3c8265d009b2287f0a479695d1ddbf85aa
2022-01-14 21:58:31 -08:00
4bf1be898d caffe: fix warning: overloaded virtual function "torch::jit::Function::call" is only partially overridden in class "torch::jit::GraphFunction"
Summary:
Need to bring in all signatures

https://www.internalfb.com/code/fbsource/[36035b9e4e41813e215ffd5f4377d65b7259237e]/fbcode/caffe2/aten/src/ATen/core/function.h?lines=91-101

Test Plan:
```
Action Failed for fbcode//accelerators/pytorch/lib/cuda:ngram_repeat_block_cuda (ovr_config//platform/linux:x86_64-fbcode-platform010-clang-6dbc4bb1b9a32829)#5:
cxx_compile ngram_repeat_block_cuda_kernel.cu (pic) failed with non-zero exit code 1
debug information: action_digest=988629a726bc4eabcaf334db2317a969958d5fd2:94
stdout:
stderr:
fbcode/caffe2/torch/csrc/jit/api/function_impl.h(11): warning: overloaded virtual function "torch::jit::Function::call" is only partially overridden in class "torch::jit::GraphFunction"

fbcode/caffe2/torch/csrc/jit/api/function_impl.h(11): warning: overloaded virtual function "torch::jit::Function::call" is only partially overridden in class "torch::jit::GraphFunction"

fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h: In instantiation of 'static torch::jit::Maybe<T> torch::jit::Maybe<T>::create(const torch::jit::SourceRange&, const T&) [with T = torch::jit::List<torch::jit::Property>]':
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:505:117:   required from here
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:220:33: error: cannot convert 'const torch::jit::List<torch::jit::Property>' to 'torch::jit::TreeList&&' {aka 'c10::SmallVector<c10::intrusive_ptr<torch::jit::Tree>, 4>&&'}
  220 |     return Maybe<T>(Compound::create(TK_OPTION, range, {value}));
      |                ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
fbcode/caffe2/torch/csrc/jit/frontend/tree.h:144:1: note:   initializing argument 3 of 'static torch::jit::TreeRef torch::jit::Compound::create(int, const torch::jit::SourceRange&, torch::jit::TreeList&&)'
  143 |       const SourceRange& range_,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~
  144 |       TreeList&& trees_) {
      | ^
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h: In instantiation of 'static torch::jit::Maybe<T> torch::jit::Maybe<T>::create(const torch::jit::SourceRange&, const T&) [with T = torch::jit::List<torch::jit::Assign>]':
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:505:171:   required from here
fbcode/caffe2/torch/csrc/jit/frontend/tree_views.h:220:33: error: cannot convert 'const torch::jit::List<torch::jit::Assign>' to 'torch::jit::TreeList&&' {aka 'c10::SmallVector<c10::intrusive_ptr<torch::jit::Tree>, 4>&&'}
  220 |     return Maybe<T>(Compound::create(TK_OPTION, range, {value}));
      |                ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
fbcode/caffe2/torch/csrc/jit/frontend/tree.h:144:1: note:   initializing argument 3 of 'static torch::jit::TreeRef torch::jit::Compound::create(int, const torch::jit::SourceRange&, torch::jit::TreeList&&)'
  143 |       const SourceRange& range_,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~
  144 |       TreeList&& trees_) {
      | ^
cc1plus: note: unrecognized command-line option '-Wno-ignored-optimization-argument' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ambiguous-reversed-operator' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ignored-optimization-argument' may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option '-Wno-ambiguous-reversed-operator' may have been intended to silence earlier diagnostics
command: buck-out/v2/gen/fbcode/999b02f9444004c1/tools/build/__wrap_nvcc.py__/wrap_nvcc.py -_NVCC_BIN_ fbcode ...<omitted>... ors/pytorch/lib/cuda/__ngram_repeat_block_cuda__/__objects__/ngram_repeat_block_cuda_kernel.cu.pic.o (rerun with -v to view the untruncated command)
```

Differential Revision: D33579670

fbshipit-source-id: 9acb443732feb3e921ce0fa5f38f21ed44f64114
2022-01-14 20:27:09 -08:00
3ed27a96ed [BE] Refactor repetitions into `TorchVersion._cmp_wrapper` (#71344)
Summary:
First step towards https://github.com/pytorch/pytorch/issues/71280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71344

Reviewed By: b0noI

Differential Revision: D33594463

Pulled By: malfet

fbshipit-source-id: 0295f0d9f0342f05a390b2bd4aa0a5958c76579b
2022-01-14 19:57:55 -08:00
c43e0286a9 [PyTorch][Lazy] Make hashing null optionals cheap (#71290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71290

The existing code called an out-of-line hash function on a constant. This is just going to get the same random-looking 64-bit integer every time, so I just changed the constant to an integer I generated with `hex(random.randint(0x1000000000000000, 0xFFFFFFFFFFFFFFFF))` to get the same effect but without the runtime hashing.
ghstack-source-id: 146991945

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D33574676

fbshipit-source-id: d6ce1e1cc0db67dfede148b7e3173508ec311ea8
2022-01-14 17:13:50 -08:00
a138aad6e6 [jit][edge] Return a no-op nullptr for UnionType on mobile for backward compatibility. (#71341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71341

Old models containing UnionType need to be loaded even if they don't actually use Unions.
This is not the best solution, we need to catch this error on the compiler side instead, but before doing that we can land this first to at least mitigate model loading crash issues.
ghstack-source-id: 147056684

Test Plan:
CI
Verified with jaebong on his device locally.

Differential Revision: D33593276

fbshipit-source-id: fac4bc85c652974c7c10186a29f36e3e411865ad
2022-01-14 17:06:13 -08:00
b7222e15b6 [fix] max_pool1d: composite compliance (#70900)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/69991

Not sure if this is a good idea as this increases the number of operators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70900

Reviewed By: wenleix

Differential Revision: D33585964

Pulled By: zou3519

fbshipit-source-id: 11bfa2e00ee123a6d36f7d4cccdf0c1a3e664d8c
2022-01-14 15:36:27 -08:00
fcbc34a5eb [PyTorch][Static Runtime] Avoid recomputing input size in dict_unpack (#71252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71252

Same old problem, same old solution.

Interestingly, I tried using c10::irange instead, but that caused really bad assembly to be generated -- we lost inlining for lots of the loop body!
ghstack-source-id: 146939573

Test Plan:
CI

Spot-checked assembly before/after and confirmed that loop termination value was recomputed before and not after

Reviewed By: mikeiovine

Differential Revision: D33558118

fbshipit-source-id: 9fda2f1f89bacba2e8b5e61ba432871e973201fe
2022-01-14 14:33:56 -08:00
bf82d2012e [PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247

Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
ghstack-source-id: 146933314

Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?

Reviewed By: mikeiovine

Differential Revision: D33556198

fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
2022-01-14 14:32:40 -08:00
94ed61eb5c Pin numba to 0.54.1 (#71327)
Summary:
Not sure what is going on, but numba==0.55.0, currently installed in (for example) 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.7-clang9:0d18ad2827487386d2a7864b11fec5bc83de6545, is built against a newer version of numpy; this was apparently silently fixed on the PyPI side (the latest numba download is numba-0.55.0-1-cp37-cp37m-manylinux2014_x86_64.manylinux_2_17_x86_64.whl).
Fixes https://github.com/pytorch/pytorch/issues/71320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71327

Reviewed By: suo, seemethere, atalman, janeyx99

Differential Revision: D33589002

Pulled By: malfet

fbshipit-source-id: d362a2b2fd045bc1720cd7fdc4c7b18b7d607fc4
2022-01-14 14:06:15 -08:00
d74bb42f7a Add a missing precondition to DistributedSampler docstring (#70104)
Summary:
DistributedSampler assigns different indices to different processes. In doing so, it assumes that the data is the same across the board and in the same order. This may seem trivial; however, there are times when users don't guarantee the order their items will have, because they rely on something such as the order in which the filesystem lists a directory (which is not guaranteed and may vary across computers), or the order in which a `set` is iterated.

I think it's better to make it clearer.
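
An illustrative sketch (not from the PR) of satisfying this precondition by forcing a deterministic ordering; the dataset class, directory path, and explicit `num_replicas`/`rank` values are assumptions made so the snippet runs standalone:

```
import os
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class FileDataset(Dataset):
    def __init__(self, root):
        # os.listdir order is filesystem-dependent; sort so that every
        # process builds the dataset in exactly the same order.
        self.paths = sorted(os.listdir(root))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.paths[i]

dataset = FileDataset("/data/images")  # hypothetical directory
# num_replicas/rank are normally inferred from the process group;
# passed explicitly here so the sketch runs without init_process_group.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
loader = DataLoader(dataset, sampler=sampler)
```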

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70104

Reviewed By: bdhirsh

Differential Revision: D33569539

Pulled By: rohan-varma

fbshipit-source-id: 68ff028cb360cadaee8c441256c1b027a57c7089
2022-01-14 13:55:12 -08:00
2faccc2f5d [quant] Remove some redundant entries in backend_config_dict for TensorRT (#70971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70971

"root_module" and "reference_quantized_module_for_root" are only used in convert, removed
them for fused module and qat module swapping configurations
We may be able to remove some other fields as well.

Test Plan:
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRTOps

Imported from OSS

Reviewed By: andrewor14

Differential Revision: D33470739

fbshipit-source-id: 67e6d58d7a3ec9fbd8c13527e701c06119aeb219
2022-01-14 12:43:25 -08:00
d793cc1993 Revert "Pin numba to 0.54.1"
This reverts commit ac7f188c64805f2f9dd134f5781d3b584688e677 that was
landed accidentally.
2022-01-14 12:32:39 -08:00
ac7f188c64 Pin numba to 0.54.1
As the newer one is incompatible with the numpy version we are using.
Fixes https://github.com/pytorch/pytorch/issues/71320
2022-01-14 12:25:47 -08:00
680d61daab [LT] Remove torch::lazy::convertShapes (#71291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71291

This commit removes torch::lazy::convertShapes since it's no longer used.
In addition, it replaces the numel logic within LTCTensorImpl.

Test Plan:
./build/bin/test_lazy
CI in lazy_tensor_staging branch

Reviewed By: wconstab

Differential Revision: D33575084

Pulled By: alanwaketan

fbshipit-source-id: b104ef39fd552822e1f4069eab2cb942d48423a6
2022-01-14 12:06:39 -08:00
c7d1501e4d fractional_maxpool3d: port to structured kernel (#70414)
Summary:
Port fractional maxpool 3d to structured kernel

Fixes https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70414

Reviewed By: zdevito, wenleix

Differential Revision: D33572110

Pulled By: bdhirsh

fbshipit-source-id: 1f89eb511335f51cc7abbb0230e165da8752f9fc
2022-01-14 12:01:16 -08:00
a4196a9abf Remove unused optimizers variable in test (#70668)
Summary:
In `TestLRScheduler._test()`, an unused variable `optimizers` is created. This PR is a minor refactoring that removes the variable and the loop block that populates the set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70668

Reviewed By: wenleix

Differential Revision: D33586236

Pulled By: albanD

fbshipit-source-id: cabf870a8221f144df9d3e2f2b564cdc5c255f5a
2022-01-14 11:59:49 -08:00
054b90f0d6 add channels last support for ChannelShuffle (#50247)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50247

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26007052

Pulled By: VitalyFedyunin

fbshipit-source-id: 08f737d64a65791c8002ffd56b79b02cf14d6159
2022-01-14 11:55:21 -08:00
e531646955 Fix docstring for nn.MultiHeadAttention (#71100)
Summary:
Fixes nn.MultiHeadAttention's docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71100

Reviewed By: mruberry

Differential Revision: D33531726

Pulled By: albanD

fbshipit-source-id: d2aa8fa44d0f6b166a809b7e5ceee26efcbccf36
2022-01-14 10:29:18 -08:00
17bb68618f Copy: Fix CPU transpose path ignoring neg and conj bits (#69026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69026

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33064533

Pulled By: anjali411

fbshipit-source-id: 98c25586a1707ac2324f69f652ce5a14dd59c0ad
2022-01-14 10:13:33 -08:00
84b1c9798c add BFloat16 support for AvgPool2d on CPU (#66927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66927

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33353198

Pulled By: VitalyFedyunin

fbshipit-source-id: 1aeaa4bb90ac99210b8f6051c09d6995d06ce3a1
2022-01-14 07:59:10 -08:00
88012c7daf [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33577744

fbshipit-source-id: 7ecc8367998ee1dffde54c2f4dd3cfafe19a53c9
2022-01-14 06:10:57 -08:00
3a0c680a14 Jiterates exp2, erfc, erfinv and entr and refactors code_template.h to ATen (#71295)
Summary:
Per title.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71295

Reviewed By: ngimel

Differential Revision: D33575885

Pulled By: mruberry

fbshipit-source-id: bc841b46fc0b5458a26a4d4465b18a7a54cd5a5b
2022-01-13 23:58:51 -08:00
d068849cc0 - Fixed memory leak in ir_simplifier.cpp (#71285)
Summary:
The leak was causing long-running inference loops to exhaust system memory. I tracked down the issue and noted that ModRound can be copied by value without worrying about a performance hit.

I originally branched from release/1.10 and made these changes. This commit includes the same changes but from master as requested in the original PR https://github.com/pytorch/pytorch/pull/71077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71285

Reviewed By: wenleix

Differential Revision: D33575821

Pulled By: ZolotukhinM

fbshipit-source-id: 64333f6cbb2c222f05481499c9cae4c7e0116af6
2022-01-13 22:29:06 -08:00
910c01020e add BFloat16 support for AdaptiveMaxPool2d on CPU (#66929)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66929

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33353199

Pulled By: VitalyFedyunin

fbshipit-source-id: d402d5deb7ca766259ca42118ddc16625e134c4c
2022-01-13 20:00:42 -08:00
9e45c89891 remove skips from determinant tests (#70034)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67512.

The accuracy requirement for non-contiguous inputs when using complex64 was too high, so I reduced it to up to 1e-3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70034

Reviewed By: anjali411

Differential Revision: D33530382

Pulled By: mruberry

fbshipit-source-id: 057daf75dc5feca5bb2f4428922eb7489435da60
2022-01-13 19:13:28 -08:00
356af8f857 Do not use ssize_t in python_arg_parser.[cpp|h] (#71250)
Summary:
Use `Py_ssize_t` when calling the Python API
Use `c10::irange` to automatically infer the loop type
Use `size_t` or `unsigned` for unsigned types

Partially addresses https://github.com/pytorch/pytorch/issues/69948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71250

Reviewed By: atalman

Differential Revision: D33569724

Pulled By: malfet

fbshipit-source-id: c9eb75be9859d586c00db2f824c68840488a2822
2022-01-13 19:10:30 -08:00
675acfc1f4 Remove unwanted comma (#71193)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71193

Reviewed By: ngimel

Differential Revision: D33542841

Pulled By: mruberry

fbshipit-source-id: 0f2f1218c056aea7ecf86ba4036cfb10df6e8614
2022-01-13 19:09:05 -08:00
558622642b Fix torch.dsplit docs dim specification (#70557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70445.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70557

Reviewed By: ngimel

Differential Revision: D33542864

Pulled By: mruberry

fbshipit-source-id: c3a7929bfcd964da99225ad715f4546f1fc8002a
2022-01-13 19:04:51 -08:00
5f2b4be3b9 [jit] Split DynamicType conformance test into smaller pieces. (#71275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71275

Currently it takes more than 10 minutes to run the conformance test. Instead, we should use a parametrized test to shard it into segments so that they can run in parallel.
ghstack-source-id: 146990608

Test Plan:
```
[zhxchen17@devbig560.ftw3 /data/users/zhxchen17/fbsource/fbcode] buck test mode/dev-tsan //caffe2/test/cpp/jit:jit -- -r 'LiteInterpreterDynamicTypeTestFixture'
Building... 34.9 sec (99%) 12110/12111 jobs, 0/12111 updated
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: ebea52b3-7c7f-46be-9f69-18e2e7b040cc
Trace available for this run at /tmp/tpx-20220113-113635.717778/trace.log
RemoteExecution session id: reSessionID-ebea52b3-7c7f-46be-9f69-18e2e7b040cc-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4222124735827748
    ✓ ListingSuccess: caffe2/test/cpp/jit:jit : 431 tests discovered (11.173)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/0 (51.331)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/1 (65.614)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/3 (76.875)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/5 (77.271)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/4 (78.871)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/6 (78.984)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/7 (84.068)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/2 (85.198)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/8 (88.815)
    ✓ Pass: caffe2/test/cpp/jit:jit - Conformance/LiteInterpreterDynamicTypeTestFixture.Conformance/9 (90.332)
Summary
  Pass: 10
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124735827748
```

Reviewed By: qihqi

Differential Revision: D33570442

fbshipit-source-id: 5c49e03b0f88068d444c84b4adeaaf45433ce1fa
2022-01-13 18:22:55 -08:00
81f693d509 [ONNX] minor clarifications of docstrings (#69260) (#69549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69549

[ONNX] minor clarifications of docstrings

1. Make the description of ONNX_ATEN_FALLBACK more accurate (after #67460).
2. Specify minimum and maximum values for opset_version. This is pretty
   important information and we shouldn't make users dig through source
   code to find it.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32994267

Pulled By: msaroufim

fbshipit-source-id: ba641404107baa23506d337eca742fc1fe9f0772
2022-01-13 18:03:27 -08:00
d555d3f0d0 Update generated header to use flatbuffer v1.12; (#71279)
Summary:
Update the generated header to use flatbuffer v1.12;
also pin the flatbuffer repo to v1.12.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71279

Test Plan:
unittest

Reviewed By: gmagogsfm

Differential Revision: D33572140

Pulled By: qihqi

fbshipit-source-id: 319efc70f6c491c66a3dfcd7cad1f7defe69916b
2022-01-13 17:23:30 -08:00
e47771cca0 [ao] Removing unused allow list arguments from propagate_qconfig and helper (#71104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71104

This shouldn't change any functionality given that those
variables were not used. It should be noted that a similar variable is
used in add_observer which is why it wasn't removed from there.
ghstack-source-id: 146940043

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33510352

fbshipit-source-id: c66ed72c2b71a6e1822f9311467adaa1f4b730d0
2022-01-13 16:07:29 -08:00
e7c87e8b44 [quant] fix dropout in FX graph mode quantization (#71043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71043

Fixes issue #68250: dropout breaks FX graph mode quantization.
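
A minimal repro sketch of that pattern (illustrative only; the FX quantization API shown matches roughly this era of PyTorch, and newer releases also take an `example_inputs` argument in `prepare_fx`):

```
import torch
import torch.nn as nn
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import convert_fx, prepare_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        return self.dropout(self.linear(x))

m = M().eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}
prepared = prepare_fx(m, qconfig_dict)   # insert observers
prepared(torch.randn(2, 4))              # calibrate
quantized = convert_fx(prepared)         # previously broke in dropout's presence
print(quantized)
```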

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33490176

fbshipit-source-id: 155546505b28ffc635ada65a1464b9d622dbc235
2022-01-13 15:59:59 -08:00
eac3decf93 ModuleList concatenation (#70887)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70441.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70887

Reviewed By: ejguan

Differential Revision: D33555431

Pulled By: albanD

fbshipit-source-id: ce42459ee46a611e98e89f02686acbac16b6b668
2022-01-13 15:31:07 -08:00
2981534f54 [nn] cross_entropy: no batch dim support (#71055)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71055

Reviewed By: anjali411

Differential Revision: D33567403

Pulled By: jbschlosser

fbshipit-source-id: 4d0a311ad7419387c4547e43e533840c8b6d09d8
2022-01-13 14:48:51 -08:00
e4d522a3cf More informative messages for None types comparisons (#69802)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69802

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555886

Pulled By: Gamrix

fbshipit-source-id: 3045cbe04de22f05db41a99ad3dda90c5271aa0f
2022-01-13 13:59:28 -08:00
ed9804088a Adding support for loops (#70209)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70209

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555889

Pulled By: Gamrix

fbshipit-source-id: f6c0c9d517849e3679e07ac1c8cf3bf367e91882
2022-01-13 13:59:25 -08:00
18d91a97e4 Adding custom device type change rules (#69051)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69051

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555884

Pulled By: Gamrix

fbshipit-source-id: c38812277d0e2aa008903a4328cb72e34bc6e1e6
2022-01-13 13:59:21 -08:00
03c4d2b9e3 Adding support for Ifs in Device Type Analysis (#69050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69050

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555887

Pulled By: Gamrix

fbshipit-source-id: f7f057c5985f8b6e7a9fe5702a944b2b4cc4d5b5
2022-01-13 13:59:18 -08:00
4a8aa971cc Building a TensorProperty AbstractBaseClass (#71184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71184

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555890

Pulled By: Gamrix

fbshipit-source-id: 694f7b5327b93257010b0abeed3310b0b816c0a8
2022-01-13 13:59:15 -08:00
dabcbb2726 Testing for Default Inference for Device Type (#69052)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69052

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555888

Pulled By: Gamrix

fbshipit-source-id: dbd43ebfc1bea4b17a96bdd378ea730ccf5944b2
2022-01-13 13:59:12 -08:00
ade83ed90c Building Default Inference for Device Type (#69049)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69049

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33555885

Pulled By: Gamrix

fbshipit-source-id: 7364066cbc544ab8442a47c82ea89f0e73eaaa06
2022-01-13 13:57:08 -08:00
b64946cbc1 [acc_normalizer] Delete is_wrapped after normalization (#71046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71046

att

Test Plan:
Added test coverage.

yinghai verifying locally for issue.

Reviewed By: kflu, 842974287

Differential Revision: D33487868

fbshipit-source-id: 5da615f66f50500b30bae84592859305b2971e1e
2022-01-13 13:33:01 -08:00
71b274d34d [pytorch] move ATen/CUDAGeneratorImpl.h to ATen/cuda (#71224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71224

Pull Request resolved: https://github.com/facebookresearch/FBTT-Embedding/pull/19

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/860

This patch follows up D33414890 (5cae40c169).

This patch removes an alias header "`ATen/CUDAGeneratorImpl.h`" since it has been moved to `ATen/cuda/CUDAGeneratorImpl.h`. This change should have already been propagated.

Test Plan: Internal and external CI

Reviewed By: jianyuh

Differential Revision: D33534276

fbshipit-source-id: 368177784ec84f003aad911cf4dd4da4a6e8e3d4
2022-01-13 13:29:44 -08:00
1de830a985 Use ptrdiff_t rather than ssize_t (#71271)
Summary:
`diff_type` kind of naturally should be `ptrdiff_t`, as `ssize_t` is actually defined [here](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_types.h.html) as :
> The type ssize_t shall be capable of storing values at least in the range [-1, {SSIZE_MAX}].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71271

Reviewed By: atalman

Differential Revision: D33569304

Pulled By: malfet

fbshipit-source-id: 57dafed5fc42a1f91cdbed257e76cec4fdfbbebe
2022-01-13 12:41:53 -08:00
83b45fe166 [ao] disabling dynamic conv/convT ops (#71110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71110

As mentioned in https://github.com/pytorch/pytorch/issues/70480, the dynamic conv ops are currently missing a key feature needed to bring their performance in line with other dynamic ops. This diff disables conv/convT from being automatically quantized during dynamic convert.

Test Plan: buck test //caffe2/test:quantization --test-selectors test_quantized_module#TestDynamicQuantizedModule

Reviewed By: vkuzo

Differential Revision: D33511152

fbshipit-source-id: 50618fbe734c898664c390f896e70c68f1df3208
2022-01-13 11:28:02 -08:00
37eaf7640f Revert "Revert D33480077: .github: Re-enable xla test config" (#71202)
Summary:
This reverts commit 14922a136f940e2f9bc9d04d7963b8141138efa0.

Re-enable xla test config since PTXLA head is back to green -- https://app.circleci.com/pipelines/github/pytorch/xla.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71202

Reviewed By: wenleix

Differential Revision: D33569109

Pulled By: seemethere

fbshipit-source-id: ee0985768d1dfaa6c28865ae5b3dbce2a4a340f7
2022-01-13 11:19:18 -08:00
40eb004da5 Use nightly-binary instead of nightly to deduplicate refs for nightlies (#71270)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71270

Reviewed By: seemethere

Differential Revision: D33568858

Pulled By: janeyx99

fbshipit-source-id: 03de185af987e5cb3b021d842be20c4a353b1033
2022-01-13 10:10:35 -08:00
003c94c790 [Quant] Templatize activationLimits function (#71220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71220

This is to allow using this function for uint8 as well as int8

Test Plan:
buck test caffe2/test:quantization
This primarily tests T=uint8

Reviewed By: kimishpatel

Differential Revision: D33520713

fbshipit-source-id: 9640cf0a446e4c4e76887d643d72b767945bae76
2022-01-13 09:31:16 -08:00
4a26624670 [Quant] Add a guard against shapes for qnnpack qadd (#71219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71219

qnnpack kernel does not support broadcasting

Test Plan: buck test caffe2/test:quantization

Reviewed By: kimishpatel

Differential Revision: D33520613

fbshipit-source-id: 93c5226d53cb7b90ed495ff7b14158f7171d25bf
2022-01-13 09:31:12 -08:00
e1b9d5854a [Quant] Add quantized input tensor data type checks (#71218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71218

This asserts quint8 support, and fails with a helpful error message when attempted to use with a different qdtype

Test Plan: buck test caffe2/test:quantization

Reviewed By: kimishpatel

Differential Revision: D33455785

fbshipit-source-id: 6ec728f59bb707c2d941b50e6375a698c66284c0
2022-01-13 09:29:55 -08:00
188b744390 Make docker build cron once a week and not every hour on Wed (#71255)
Summary:
Running it many times a day was probably not intentional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71255

Reviewed By: suo, atalman

Differential Revision: D33559155

Pulled By: janeyx99

fbshipit-source-id: c8703cea6f3188c9bcb0867b895261808d3164ee
2022-01-13 08:26:57 -08:00
1e3893ecbb [DataPipe] Removing deprecated DataPipes (#71161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161

Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33532272

Pulled By: NivekT

fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
2022-01-13 07:37:48 -08:00
60632a00fe [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33561057

fbshipit-source-id: 79873717c45c8bbe6d0ae760e718770fd960185d
2022-01-13 03:27:06 -08:00
ff78c73286 [ONNX] Remove f arg from export_to_pretty_string (#69045) (#69546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69546

The arg is not used and was previously deprecated.

Also remove torch.onnx._export_to_pretty_string. It's redundant with the
public version.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32994270

Pulled By: msaroufim

fbshipit-source-id: f8f3933b371a0d868d9247510bcd73c31a9d6fcc
2022-01-12 21:31:36 -08:00
3cc34a4502 [PyTorch][Static Runtime] s/toObject/toObjectRef/ in native ops (#71238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71238

Saves a refcount bump for these.
ghstack-source-id: 146927203

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D33554385

fbshipit-source-id: b2f8d5afdc0eb80c8765d88560d0e547376f28d1
2022-01-12 18:44:40 -08:00
ffdc0e23af [SR] Add various missing native ops (#71113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113

This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.

Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`

Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D33510756

fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
2022-01-12 18:40:31 -08:00
f6b804ba9f Fallback to server JIT type for type checking.
Summary:
T109800703
At runtime, fall back to the server JIT type if a DynamicType is parsed.

Test Plan: local headset

Reviewed By: scramsby

Differential Revision: D33557763

fbshipit-source-id: f5fe7dabf668de2f55cc26f9ebe8bcbccd570ce3
2022-01-12 17:59:54 -08:00
84d4087874 Fix trt const_fold as output use case (#71194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71194

Reviewed By: jfix71, khabinov

Differential Revision: D33541168

fbshipit-source-id: dd5787430b272977963323a6ce38b3e15e979278
2022-01-12 16:57:19 -08:00
1bbea3c3a2 [PyTorch][JIT] Support mayContainAlias(Value*, ArrayRef<Value*>) (#69853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853

We can implement this overload more efficiently.
ghstack-source-id: 146924693

Test Plan:
patched alias_analysis tests

Time reported to initialize a predictor by static runtime when given ctr_mobile_feed local_ro net is 9.5s instead of 10.5s.

Reviewed By: mikeiovine

Differential Revision: D33039731

fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397
2022-01-12 16:53:54 -08:00
cd253938a9 [PyTorch][SR][easy] s/input_or_constant_aliases/external_aliases/ (#69852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69852

Looks like a stale comment.
ghstack-source-id: 146924694

Test Plan: review

Reviewed By: hlu1

Differential Revision: D33033264

fbshipit-source-id: aa0eff463c42716bdd7142d4662d8668af439f68
2022-01-12 16:52:26 -08:00
1bc3571078 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#70201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70201

Included functions:
save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer module object

Compared to previous attempts, this diff only adds flatbuffer to cmake target and leaves fbcode/xplat ones unchanged.

Test Plan: unittest

Reviewed By: malfet, gmagogsfm

Differential Revision: D33239362

fbshipit-source-id: b9ca36b83d6af2d78cc50b9eb9e2a6fa7fce0763
2022-01-12 16:30:39 -08:00
7a93d8bb2d Revert D32374542: Implement the patterns module for the multi subgraph rewriter.
Test Plan: revert-hammer

Differential Revision:
D32374542 (de62bcac66)

Original commit changeset: 4ae8da575976

Original Phabricator Diff: D32374542 (de62bcac66)

fbshipit-source-id: 901e41d6abb202c5b1c6a3a84b060b2677b5bbe1
2022-01-12 15:50:58 -08:00
9ca367d48b [nnc] Use given kernel function name while emitting code (#67781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67781

Update `LLVMCodeGen` in NNC to use the given kernel function name while emitting code.

This was earlier committed as D31445799 (c30dc52739) and got reverted as part of a stack of diffs that included a cache for `PyTorchLLVMJIT`, which was the likely culprit.

Test Plan:
```
buck test mode/opt //caffe2/test/cpp/tensorexpr:tensorexpr -- --exact 'caffe2/test/cpp/tensorexpr:tensorexpr - LLVM.CodeGenKernelFuncName'
```

Reviewed By: ZolotukhinM, bdhirsh

Differential Revision: D32145958

fbshipit-source-id: 5f4e0400c4fa7cabce5b91e6de2a294fa0cad88e
2022-01-12 15:49:17 -08:00
67941c8a94 Document torch.cuda.ExternalStream, torch.cuda.caching_allocator_alloc and torch.cuda.caching_allocator_delete (#70126)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67414. Fixes https://github.com/pytorch/pytorch/issues/70117.

cc brianjo mruberry ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70126

Reviewed By: mruberry

Differential Revision: D33542910

Pulled By: ngimel

fbshipit-source-id: 4b870f4dceca6ee4cc8fba58819f1cb18ac9f857
2022-01-12 15:44:40 -08:00
ad803936d1 Codegen: ADInplaceOrViewType only include operators registered (#68692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68692

ADInplaceOrViewType is a sharded file, so by only including specific
operator headers, we ensure that changing one (non-method) operator
only needs one shard to be re-compiled.

This also ports the generated code over to the `at::_ops` interface,
and the code generator itself to using `write_sharded` instead of
re-implementing its own version of sharding.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33217916

Pulled By: albanD

fbshipit-source-id: 90f1868f72644f1b5aa023cefd6a102bbbec95af
2022-01-12 15:34:45 -08:00
cc55da8a9b [caffe2/server quant] use new depthwise conv fbgemm interface (#71166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71166

Remove the use of deprecated old interface

Test Plan: CI

Reviewed By: jiyuanzFB

Differential Revision: D33533494

fbshipit-source-id: 930eb93cd67c7a9bb77708cc48914aa0c9f1c841
2022-01-12 15:29:07 -08:00
de62bcac66 Implement the patterns module for the multi subgraph rewriter. (#71181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71181

This diff introduces the patterns module that defines a pattern-replacement pair for the experimental multi subgraph rewriter.

Test Plan: Tested locally. Unit test suite forthcoming.

Reviewed By: ajauhri

Differential Revision: D32374542

fbshipit-source-id: 4ae8da575976e96b02c5c33c6ae2a0943fc7f126
2022-01-12 15:12:05 -08:00
3c0c5bde0e [cmake] Uncomment binaries (#71157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71157

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33528259

Pulled By: IvanKobzarev

fbshipit-source-id: b8c216558ca612bedd4c37205f38ed29c2c82b3c
2022-01-12 15:01:44 -08:00
e1f01d2c01 .ci: Add nightly trigger, remove CircleCI linux binary builds (#70957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70957

Adds nightly trigger for github actions using a workflow that will pull
down viable/strict and tag it as `nightly` and then re-push it up to the
repository.

Also removes CircleCI linux binary builds since they will now be
outmoded in favor of our new GHA workflow

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D33535609

Pulled By: seemethere

fbshipit-source-id: ca6402df37db46e1872ff25befe96afa12e7b1af
2022-01-12 14:31:51 -08:00
6c1be299c1 caffe2/c10/core/TensorImpl.h: adapt to clang 12 (#70973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70973

clang12 builds fail like this:

  caffe2/c10/core/TensorImpl.h:2615:1: error: static_assert failed due to requirement 'sizeof(void *) != sizeof(long) || sizeof(c10::TensorImpl) == sizeof(long) * 24' "You changed the size of TensorImpl on 64-bit arch.See Note [TensorImpl size constraints] on how to proceed."

Yet eliciting the size of that struct with this one-line addition:

  char (*__show_sizeof)[sizeof( TensorImpl )] = 1;

reports that its size is indeed 192 (aka 8 * 24):

  caffe2/c10/core/TensorImpl.h:2615:8: error: cannot initialize a variable of type 'char (*)[192]' with an rvalue of type 'int'

On closer inspection we determined that failures were occurring because TensorImpl was sometimes of size 208 and other times of size 192. The 192 size was expected and TensorImpl was hard-coded to raise an error for any other case on a 64-bit system, including the one we found where the size was 208.

Additional investigation revealed that systems using GCC 11 and CUDA 11040 with either C++ 201402 or 201703 would sometimes yield TensorImpl sizes of 208, whereas newer systems without CUDA would always yield sizes of 192.

The difference turned out to be that `std::unique_ptr` on NVCC systems is sometimes of size 16 and other times of size 8, accounting fully for the observed difference in TensorImpl sizes. We have not yet been able to find a set of preprocessor macros that predict when each size will occur.

To handle the situation, we've added extensive debugging information to the TensorImpl size-checking logic. A number of preprocessing definitions capture compiler versions and other information to help understand what changes might have affected the size of TensorImpl. The size of each member of TensorImpl is now individually checked, along with the total size. Template-based comparison functions are used to provide compile-time outputs about the system state as well as the observed and expected sizes of each item considered.

The template-based comparison functions cause the code to break if it's run on a 32-bit system because the templates and their associated static_asserts are compiled whether or not they'll ultimately be used. In C++17 we could prevent this using `if constexpr`; however, PyTorch is pinned to C++14, so we cannot. Instead, we check pointer size (`#if UINTPTR_MAX == 0xFFFFFFFF`) to determine which system we're on and provide separate checks for 32 vs 64-bit systems.

A final wrinkle is that 32-bit systems have some variations in data size as well. We handle these by checking that the relevant items are `<=` the expected values.

In summary...

Improvements over the previous situation:
* Added checks for 32-bit systems
* The sizes of individual fields are now checked
* Compile-time size results (expected versus observed) are provided
* Compile-time compiler and system info is provided
* Landing this diff will actually enable checks of TensorImpl size; they are currently disabled to expedite LLVM-12 + newer CUDA upgrade efforts.

Some work that could still be done:
* Figure out what preprocessor flags (if any) predict the size of `std::unique_ptr` for 64-bit systems and of various elements of 32-bit systems.

Test Plan: Building no longer triggers that static_assert failure.

Reviewed By: luciang

Differential Revision: D32749655

fbshipit-source-id: 481f84da6ff61b876a5aaba89b8589ec54d59fbe
2022-01-12 14:27:16 -08:00
385773cb77 add BFloat16 support for MaxPool2d on CPU (#56903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56903

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836791

Pulled By: VitalyFedyunin

fbshipit-source-id: e03d55cc30dfa3628f096938fbad34b1031948af
2022-01-12 14:20:20 -08:00
de902b5d02 [FX] Add a default_value arg to Graph.placeholder and fix split_module (#71016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71016

I found out that `split_module` doesn't preserve default values for arguments. In trying to fix that, I noticed that `Graph.placeholder` doesn't make it easy to add a default argument when making a placeholder. This PR addresses both of those issues.
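
A small sketch of the new argument (illustrative; a scalar default is used since it codegens straightforwardly):

```
import torch
from torch.fx import Graph, GraphModule

g = Graph()
# The placeholder carries a default, so the generated forward()
# has a defaulted parameter.
x = g.placeholder("x", default_value=3)
g.output(x)
gm = GraphModule(torch.nn.Module(), g)

print(gm())   # falls back to the default: 3
print(gm(7))  # an explicit argument wins: 7
```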

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D33482218

Pulled By: jamesr66a

fbshipit-source-id: 57ebcdab25d267333fb1034994e08fc1bdb128ee
2022-01-12 14:03:17 -08:00
5749be4678 Fix the shape inconsistency of out and elem tensor (#71065)
Summary:
See bug report  https://github.com/pytorch/pytorch/issues/71063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71065

Reviewed By: anjali411

Differential Revision: D33549921

Pulled By: ejguan

fbshipit-source-id: bc43f5f9a88f7dcd8729d0e0f4b90d20f40b3064
2022-01-12 13:57:19 -08:00
2290976880 ci: Comment out pull_request trigger for binary builds (#71244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71244

Binary builds add a lot of skipped jobs to the default ciflow workflow,
so we're commenting out the pull_request trigger for now until the new
ciflow mechanism becomes available

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D33555049

Pulled By: seemethere

fbshipit-source-id: 2d0d4704e7297d5931b2c9705ee4dfb26760736e
2022-01-12 13:48:10 -08:00
bfe1abd3b5 torch/monitor: add pybind (#69567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69567

This exposes torch.monitor events and stats via pybind11 to the underlying C++ implementation.

* The registration interface is a tad different since it takes a lambda function in Python, whereas in C++ it's a full class.
* This has a small number of changes to the counter interfaces since there's no way to create an initializer list at runtime; they now also take a vector.
* Only double-based stats are provided in Python since it's intended more for high-level stats where float imprecision shouldn't be an issue. This can be changed down the line if the need arises.

```
from datetime import datetime

from torch.monitor import Event, log_event, register_event_handler

events = []

def handler(event):
    events.append(event)

handle = register_event_handler(handler)

log_event(Event(name="torch.monitor.TestEvent", timestamp=datetime.now(), data={"foo": 1.0}))
```

D32969391 is now included in this diff.
This cleans up the naming for events: `type` is now `name`, `message` is gone, and `metadata` is renamed to `data`.

Test Plan: buck test //caffe2/test:monitor //caffe2/test/cpp/monitor:monitor

Reviewed By: kiukchung

Differential Revision: D32924141

fbshipit-source-id: 563304c2e3261a4754e40cca39fc64c5a04b43e8
2022-01-12 13:35:11 -08:00
90ef54f8ea [PyTorch] Remove buggy ExclusivelyOwnedTraits<intrusive_ptr<T>> (#70647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70647

It wasn't checking for the null state and it wasn't used.
ghstack-source-id: 146819525

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D33414728

fbshipit-source-id: 7fcd648577cbfc35320c5c3ca9a19a14bd4d6858
2022-01-12 12:19:52 -08:00
479ce1c3a0 [PyTorch] Add isUndefined to ExclusivelyOwnedTraits<TensorBase> debug msg (#70638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70638

We are seeing these assertions fire infrequently. Add more information to aid in debugging when they fire.
ghstack-source-id: 146819527

Test Plan: CI

Reviewed By: bdhirsh

Differential Revision: D33412651

fbshipit-source-id: 7e35faf9f4eeaa5f2455a4392e00f62fe692811c
2022-01-12 12:18:33 -08:00
4d28cef03a Added AutocastCPU string (#70013)
Summary:
Description:
- Added "AutocastCPU" string repr into `toString` method

Before
```
std::cout << c10::DispatchKey::AutocastCPU;
> UNKNOWN_TENSOR_TYPE_ID
```
and now:
```
std::cout << c10::DispatchKey::AutocastCPU;
> AutocastCPU
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70013

Reviewed By: ejguan

Differential Revision: D33550777

Pulled By: bdhirsh

fbshipit-source-id: b31e15e6d52fc1768af085e428328117d588f283
2022-01-12 12:06:46 -08:00
7884143dff Legacy support for embedded interpreter (#71197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71197

Adds back legacy support for the embedded interpreter to use the .data section in internal use cases. Specifically, this allows for dynamic loading of Python extension files.

Test Plan: buck test mode/opt //caffe2/torch/csrc/deploy:test_deploy_gpu_legacy

Reviewed By: shunting314

Differential Revision: D33542636

fbshipit-source-id: b49f94163c91619934bc35595304b9e84d0098fc
2022-01-12 11:48:27 -08:00
a71b4dc164 Update nightly wheels to ROCm4.5.2 (#71064)
Summary:
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71064

Reviewed By: malfet, janeyx99

Differential Revision: D33552643

Pulled By: seemethere

fbshipit-source-id: 3754f69188864f6b3639818a4b9013ed255a2d7d
2022-01-12 11:41:55 -08:00
fd0d4bef03 Edit cron to make the docker jobs run hopefully (#71232)
Summary:
Our Docker builds have not been running with our previous cron; this changes it so they hopefully run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71232

Reviewed By: ejguan

Differential Revision: D33552231

Pulled By: janeyx99

fbshipit-source-id: 1a3e1607b03d37614eedf04093d73f1b96698840
2022-01-12 11:37:03 -08:00
70951884d4 Add option to load historic operators in IR when the operator is deprecated (#71148)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71148

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D33521300

Pulled By: tugsbayasgalan

fbshipit-source-id: a0607dba5e7233590384326537017eb0b18da419
2022-01-12 11:07:04 -08:00
8f4cec2231 [warnings][Caffe2] Suppress warnings in caffe2 headers (#71196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71196

`caffe2` headers contain code that can elicit warnings when built with strict compiler flags.  Rather than force downstream/consuming code to weaken their compiler flags, suppress those warnings in the header using `#pragma clang diagnostic` suppressions.

Test Plan: CI Pass

Reviewed By: malfet

Differential Revision: D33536233

fbshipit-source-id: 74404e7a5edaf244f79f7a0addd991a84442a31f
2022-01-12 10:16:35 -08:00
149f5ffa36 Fix inconsistency between new and old upgrader design (#71185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71185

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D33539191

Pulled By: tugsbayasgalan

fbshipit-source-id: 721093793574663d56a8080c6a488024620266a1
2022-01-12 09:54:31 -08:00
54fe2741a1 [fx2trt] break down div (#71172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71172

Break down div into smaller ops to make those div ops look like all the other elementwise ops.

Use operator div instead of torch.div where possible, to avoid converting literal numbers to torch tensors (as in the following).
```
a = 1
b = 2

# `c` would be 0.5
c = a / b

# `c` would be torch.tensor([0.5])
c = torch.div(a, b)
```

The problem we saw on ShuffleNet is that there's a size op followed by a div op, which results in int64 tensors in the acc-traced graph (the acc tracer turns operator.div into acc_ops.div, which uses torch.div). The TRT splitter then splits out the reshape op that consumes the div op, because we have a rule to split out ops that take int64 tensors as inputs.

Test Plan: Unit tests.

Reviewed By: wushirong

Differential Revision: D33482231

fbshipit-source-id: 508a171520c4e5b4188cfc5c30c1370ba9db1c55
2022-01-12 09:46:46 -08:00
6a40bb0fdf [DataPipe] Update deprecation warning (#71171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71171

Editing two warnings to more accurately portray the deprecation plan for the DataPipes

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33535785

Pulled By: NivekT

fbshipit-source-id: b902aaa3637ade0886c86a57b58544ff7993fd91
2022-01-12 09:34:53 -08:00
706777bf56 Disable the output invocation in jit (#71138)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71138

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D33521059

Pulled By: eellison

fbshipit-source-id: eaf20eaa6e62159dff9369a7b75e6d6009fb45d0
2022-01-12 09:11:37 -08:00
5480deb183 Add support for permutting dynamic fusion group outputs to channels last format (#70656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70656

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D33458650

Pulled By: eellison

fbshipit-source-id: f0c7d20743deac7a87f7c9176e60da8100aefe41
2022-01-12 09:11:34 -08:00
39be20f259 [JIT][NNC] Add handling of strides to dynamic shape support. (#70464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464

Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
  S_ONE, // STRIDE_ONE: packed
  S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
  S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
  S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and then two additional specializations for a) a contiguous tensor and b) a channels-last tensor. Channels-last is a common case and we should optimize for it. Additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it fast to check whether a tensor follows this pattern.
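
As a quick illustration of that last point, contiguity per memory format is recorded on the tensor itself, so the dedicated specializations can be guarded cheaply (a minimal sketch):

```python
import torch

t = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(t.is_contiguous())                                   # False
print(t.is_contiguous(memory_format=torch.channels_last))  # True
```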

Output striding will be done in a follow up.

The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debuggability and to make use of the ability to store IValues as attributes on nodes.

As an example:

```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]]](%x, %24, %23, %22, %21)
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D33458649

Pulled By: eellison

fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
2022-01-12 09:11:31 -08:00
975e7d246e Remove ignore shapes arg (#71144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71144

This wasn't being used anywhere. It was originally intended for the SR flow but we're doing something else now.

Test Plan: Imported from OSS

Reviewed By: navahgar, ZolotukhinM

Differential Revision: D33521061

Pulled By: eellison

fbshipit-source-id: 0574698a2b7409df6feb703f81e806d886225307
2022-01-12 09:09:49 -08:00
97585ae1e7 Simplify forward / backward AD for linalg.eigh and add checks (#70528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70528

This PR adds checks for the backward of `linalg.eigh`, similar to those
deduced in https://github.com/pytorch/pytorch/pull/70253

It also makes its the implementation parallel that of the (fwd/bwd) derivative of
`torch.linalg.eig` and it makes most OpInfo tests pass.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33530149

Pulled By: albanD

fbshipit-source-id: 1f368b8d450d4e9e8ae74d3881c78513c27eb956
2022-01-12 08:35:52 -08:00
061be8d600 Correct forward AD for linalg.eig and add checks (#70527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70527

This PR adds checks for the backward of `linalg.eig`, similar to those
deduced in https://github.com/pytorch/pytorch/pull/70253

It also modifies the function so that it does not save the input matrix,
as it's not necessary.

It also corrects the forward AD formula for it to be correct. Now all
the tests pass for `linalg.eig` and `linalg.eigvals`.

It also updates the docs to reflect better what's going on here.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33530148

Pulled By: albanD

fbshipit-source-id: 984521a04f81ecb28ac1c4402b0243c63dd6959d
2022-01-12 08:30:55 -08:00
e1aea9b968 Add retry to disabled tests file download (#71030)
Summary:
Helps with spotty disabling brought up in https://github.com/pytorch/pytorch/issues/70877 and https://github.com/pytorch/pytorch/issues/70875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71030

Reviewed By: malfet, atalman

Differential Revision: D33486379

Pulled By: janeyx99

fbshipit-source-id: 56c4d56c2bd8be47a51dee19373aac6c9c5d1691
2022-01-12 08:20:44 -08:00
928ca95ff0 fix TensorLikePair origination (#70304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70304

Without this patch, `TensorLikePair` will try to instantiate everything, although it should only do so for tensor-likes. This is problematic if it is used before a different pair type that could handle the inputs but never gets the chance, because `TensorLikePair` bails out first.

```python
from torch.testing._comparison import assert_equal, TensorLikePair, ObjectPair

assert_equal("a", "a", pair_types=(TensorLikePair, ObjectPair))
```

```
ValueError: Constructing a tensor from <class 'str'> failed with
new(): invalid data type 'str'.
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542995

Pulled By: mruberry

fbshipit-source-id: 77a5cc0abad44356c3ec64c7ec46e84d166ab2dd
2022-01-12 06:44:00 -08:00
49a5b33a74 add a equality comparison helper for assert_close internals (#69750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69750

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542993

Pulled By: mruberry

fbshipit-source-id: 0de0559c33ec0f1dad205113cb363a652140b62d
2022-01-12 06:43:57 -08:00
b0a10a709f add explanation of quantized comparison strategy in assert_close (#68911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68911

Closes #68548.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542997

Pulled By: mruberry

fbshipit-source-id: 78accf20a83cd72254ae0036dc23f9e5376a4c65
2022-01-12 06:43:53 -08:00
802dd2b725 change sparse COO comparison strategy in assert_close (#68728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68728

This removes the ability for `assert_close` to `.coalesce()` the tensors internally. Additionally, we now also check `.sparse_dim()`. Sparse team: please make sure that is the behavior you want for all sparse COO comparisons in the future. #67796 will temporarily keep BC by always coalescing, but in the future `TestCase.assertEqual` will no longer do that.
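
A minimal sketch of a comparison that passes under the new strategy, assuming both inputs are pre-coalesced so no internal `.coalesce()` is needed:

```python
import torch
from torch.testing import assert_close

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1.0, 2.0])
a = torch.sparse_coo_tensor(indices, values, (2, 2)).coalesce()
b = torch.sparse_coo_tensor(indices, values, (2, 2)).coalesce()

# Passes: same sparse_dim, same coalesced state, equal indices and values.
assert_close(a, b)
```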

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542996

Pulled By: mruberry

fbshipit-source-id: a8d2322c6ee1ca424e3efb14ab21787328cf28fc
2022-01-12 06:43:50 -08:00
8d05174def make meta tensor data access error message more expressive in assert_close (#68802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68802

Without this patch, the error message of comparing meta tensors looks like this after #68722 was merged:

```python
>>> t = torch.empty((), device="meta")
>>> assert_close(t, t)
NotImplementedError: Could not run 'aten::abs.out' with arguments from the 'Meta' backend. [...]
[...]
The above exception was the direct cause of the following exception:
[...]
RuntimeError: Comparing

TensorLikePair(
    id=(),
    actual=tensor(..., device='meta', size=()),
    expected=tensor(..., device='meta', size=()),
    rtol=1.3e-06,
    atol=1e-05,
    equal_nan=False,
    check_device=True,
    check_dtype=True,
    check_layout=True,
    check_stride=False,
    check_is_coalesced=True,
)

resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.
```

Thus, we follow our own advice and turn it into an expected exception until #68592 is resolved:

```python
>>> t = torch.empty((), device="meta")
>>> assert_close(t, t)
ValueError: Comparing meta tensors is currently not supported
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542999

Pulled By: mruberry

fbshipit-source-id: 0fe1ddee15b5decdbd4c5dd84f03804ca7eac95b
2022-01-12 06:43:47 -08:00
b652887ad7 improve documentation of comparison internals (#68977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68977

Follow-up to #68722 to address the review comments that were left open before merge.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33542998

Pulled By: mruberry

fbshipit-source-id: 23c567cd328f83ae4df561ac8ee6c40c259408c9
2022-01-12 06:42:30 -08:00
523d448968 Remove deprecated cuDNN convolution ops (#71128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71128

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33517677

Pulled By: jbschlosser

fbshipit-source-id: 1690fd38a38ee7cf16865209280a9c457c5f70ff
2022-01-12 06:34:42 -08:00
93b2399c6c [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33544281

fbshipit-source-id: 4f0b5d6d490e6fcb967550cfb1dc0111b1770f73
2022-01-12 04:16:43 -08:00
4a8d4cde65 Fix for tensor in list return added to wildcard set (#71170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71170

As with an output in a tuple return, an output in a list return will not have any further uses that would make adding it directly to the list's contained elements incorrect. This unblocks a use case in op authoring.

cc Chillee

Test Plan: Imported from OSS

Reviewed By: d1jang

Differential Revision: D33535608

Pulled By: eellison

fbshipit-source-id: 2066d28e98c2f5d1b3d7e0206c7e39a27b3884b1
2022-01-11 22:12:39 -08:00
9bccb31306 Remove precise tuple construct flag (#71121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71121

Test Plan: Imported from OSS

Reviewed By: d1jang

Differential Revision: D33515234

Pulled By: eellison

fbshipit-source-id: 57cfe171b583a6bb4d3493a34b159061e97a11b8
2022-01-11 22:12:36 -08:00
47ad6628f1 add optional refining (#69776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69776

If we have a node output which is an optional type, but both if-blocks produce a non-optional value, we can try to refine the if output type, which can open up further optimization opportunities.
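
A TorchScript-level sketch of the pattern the pass targets (the pass itself operates on the IR; this is only an illustration):

```python
import torch
from typing import Optional

@torch.jit.script
def f(x: torch.Tensor, flag: bool) -> torch.Tensor:
    y: Optional[torch.Tensor] = None
    if flag:
        y = x + 1  # both branches produce a Tensor,
    else:
        y = x - 1  # so the if-output can be refined to a non-optional Tensor
    return y

print(f(torch.ones(2), True))  # tensor([2., 2.])
```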

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33515235

Pulled By: eellison

fbshipit-source-id: 34f6ab94ac4238498f9db36a1b673c5d165e832e
2022-01-11 22:12:34 -08:00
772b3e92bf Parse symbolic shapes (#69775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69775

Adds parsing for Symbolic Shapes.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33515233

Pulled By: eellison

fbshipit-source-id: 7ebb22c0ab37d78e459ebcab67bb86f731d00376
2022-01-11 22:12:31 -08:00
97e8dcba5e Fix mis-specified device arg name (#69645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69645

As noted in the code comment:
the existing device operator is registered with input name `a`, which prevents torch.device(type="cuda") from working. Add a shim layer here.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33515231

Pulled By: eellison

fbshipit-source-id: c04af8158a9568a20cd5fbbbd573f6efab98fd60
2022-01-11 22:11:24 -08:00
9465c24245 [jit][edge] Use dynamic type instead of union types for schema parsers. (#70509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70509

TypeFactory will construct DynamicType when building on Edge platforms. We use this facility to make FunctionSchema return DynamicType all the time for OptionalType. We don't explicitly use DynamicTypeFactory everywhere because that requires too many changes and will split the entire aten codebase.
ghstack-source-id: 146818621

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D33306737

fbshipit-source-id: d7ce00b438f7c03b43945d578280cfd254b1f634
2022-01-11 20:14:25 -08:00
40121456af Sparse CSR: Add torch.randn_like (#68083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68083

This PR adds support for `torch.randn_like(sparse_csr_tensor)`.
It creates a new sparse csr tensor with same indices but different values that are normally distributed.

In addition `.normal_()` and `torch.empty_like` were implemented because `randn_like` is a composite of these two functions.
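
A minimal sketch of the added behavior (the sparsity pattern is preserved while the values are redrawn):

```python
import torch

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
vals = torch.tensor([1.0, 2.0, 3.0, 4.0])
csr = torch.sparse_csr_tensor(crow, col, vals)

r = torch.randn_like(csr)  # same indices, normally distributed values
assert torch.equal(r.crow_indices(), csr.crow_indices())
assert torch.equal(r.col_indices(), csr.col_indices())
```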

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33511280

Pulled By: cpuhrsch

fbshipit-source-id: 6129083e8bc6cc5af2e0191294bd5e4e864f6c0e
2022-01-11 18:29:24 -08:00
831c129e85 fx quant: fix test_fx_acc_tracer::test_quantized_batch_norm2d (#71175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71175

D33330022 was landed with a Meta test failure (ghstack clobbered the fix),
resubmitting the Meta-only part to fix CI.

Test Plan:
```
buck test mode/opt //caffe2/test:test_fx_acc_tracer -- --exact 'caffe2/test:test_fx_acc_tracer - test_quantized_batch_norm2d (fx_acc.test_acc_tracer.AccTracerTest)' --run-disabled
```

Reviewed By: HDCharles

Differential Revision: D33531994

fbshipit-source-id: 39dc945c54fb9a7205c9d4114ede6b5ab99c5012
2022-01-11 17:38:00 -08:00
410e91adee Performance and memory improvements to batched torch.linalg.solve (#69752)
Summary:
Previously, for a single input matrix A and a batched matrix B, matrix A was expanded and cloned before computing the LU decomposition and solving the linear system.

With this PR, the LU decomposition is computed once for the single matrix and then expanded and cloned only if required by the backend library call that solves the linear system.

Here's a basic comparison:
```python
# BEFORE THE PR
In [1]: import torch
In [2]: a = torch.randn(256, 256)
In [3]: b = torch.randn(1024, 256, 2)
In [4]: %%timeit
   ...: torch.linalg.solve(a, b)
   ...:
   ...:
329 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# WITH THIS PR
In [1]: import torch
In [2]: a = torch.randn(256, 256)
In [3]: b = torch.randn(1024, 256, 2)
In [4]: %%timeit
   ...: torch.linalg.solve(a, b)
   ...:
   ...:
21.4 ms ± 23 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69752

Reviewed By: albanD

Differential Revision: D33028236

Pulled By: mruberry

fbshipit-source-id: 7a0dd443cd0ece81777c68b29438750f6524ac24
2022-01-11 16:14:16 -08:00
786f946098 [Profiler] Add glue layer to reduce the use of #ifdef USE_KINETO in the profiler code. (#69798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69798

One of the major sources of complexity in `profiler_kineto.cpp` is that Kineto may or may not be available. The code (including the types) follows two related but often distinct codepaths, and large sections may or may not be `#ifdef`'d out.

Optimizing such code while preserving correctness is quite difficult; at one point I realized that I had broken the non-Kineto case, because moving work into the finalize step ran afoul of a very large `#ifdef` around the finalize logic.

In order to make optimization more tractable, I gathered all of the calls to Kineto APIs and isolated them in the `kineto_shim.h/.cpp` files: the header allows callers to pretend as though Kineto is always available (mostly), and the cpp file hides most of the horrible `#ifdef`s so they don't pollute the main profiler code.

Test Plan: Unit tests.

Reviewed By: aaronenyeshi

Differential Revision: D32690568

fbshipit-source-id: 9a276654ef0ff9d40817c2f88f95071683f150c5
2022-01-11 15:57:46 -08:00
a3b7dd7b78 Enable nested default hooks (#70932)
Summary:
When default hooks are set, they are pushed onto a stack. When context managers are nested, only the innermost hooks are applied.
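
A minimal sketch of the nesting behavior, assuming `torch.autograd.graph.saved_tensors_hooks` is the context manager that installs the default hooks:

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

def tagged_hooks(name):
    def pack(t):
        print(f"packed by {name}")
        return t
    return saved_tensors_hooks(pack, lambda t: t)

x = torch.randn(3, requires_grad=True)
with tagged_hooks("outer"):
    with tagged_hooks("inner"):
        y = (x * x).sum()  # tensors saved here print "packed by inner"
y.backward()
```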

There is special care needed to update the TLS code. See also https://github.com/pytorch/pytorch/issues/70940 (i.e. do we need to be storing the enabled flag as well?)

Fixes https://github.com/pytorch/pytorch/issues/70134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70932

Reviewed By: mruberry

Differential Revision: D33530370

Pulled By: albanD

fbshipit-source-id: 3197d585d77563f36c175d3949115a0776b309f4
2022-01-11 15:03:49 -08:00
433cf44b79 delete ecr_gc_docker job (#71178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71178

This should no longer be needed as we now set a lifecycle policy on ECR
and we also don't generate lots of temporary containers anymore.

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D33537851

Pulled By: suo

fbshipit-source-id: b97b7525be6f62ec8771dfb6a7ee13b22b78ac5a
2022-01-11 14:53:31 -08:00
e7634f83ce [jit][edge] Migrate base types to DynamicType on mobile. (#70233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70233

Make type parser to produce DynamicType for all base types which don't have type arguments, and return DynamicType pointer for IValue::type().
ghstack-source-id: 146818622

Test Plan: no behavior change.

Reviewed By: iseeyuan

Differential Revision: D33137219

fbshipit-source-id: 1612c924f5619261ebb21359936309b41b2754f5
2022-01-11 13:53:29 -08:00
ecb6defa36 Fixed docs for forward_ad.make_dual (#71159)
Summary:
Minor docs change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71159

Reviewed By: mruberry

Differential Revision: D33530031

Pulled By: albanD

fbshipit-source-id: e0bbe3a29a7de675fa4c9bf90976616f0e093f74
2022-01-11 13:47:09 -08:00
2c8cb8a964 Speed up quantized upsampling for channels last (#70903)
Summary:
Moving the calls to `q_zero_point()` outside the for loop considerably speeds up upsampling for the channels-last format.

This fix is very similar to https://github.com/pytorch/pytorch/pull/66525 but applies it to the channels-last format.

Fixes https://github.com/pytorch/pytorch/issues/70902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70903

Reviewed By: mruberry

Differential Revision: D33531805

Pulled By: vkuzo

fbshipit-source-id: e723f1e3d53bdd66529c1326dccba889402a126c
2022-01-11 13:28:10 -08:00
edf15ebbc2 Adding python 3.10 binary workflows (#71132)
Summary:
Testing python 3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71132

Reviewed By: mruberry

Differential Revision: D33534609

Pulled By: atalman

fbshipit-source-id: 561412735fb6d1269fca3db0fac5afd437a0bde2
2022-01-11 13:18:18 -08:00
7d6535cab3 Make Kineto + distributed a warning rather than an error (#71120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71120

D33283314 (681e78bace) is causing jobs to fail when profiled, which is not ideal.

Test Plan:
pyper-online-cli launch 306587531 AI_INFRA ads_global_pyper_sla oncall_model_store --training_package_version training_platform:9344fe410969bdf614bc89cff0280281 --training_stage ONLINE --training_environment DEV --timeout 1728000

(Courtesy of yanjzhou)

Reviewed By: xw285cornell

Differential Revision: D33437773

fbshipit-source-id: 5c492f83146ff82557cfc1142aade3432cf73ca5
2022-01-11 12:50:17 -08:00
45b0bafb38 Drop more unused variables (#71123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71123

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33511656

fbshipit-source-id: b53565b589720cce9fdfe3bc222853dba8645aff
2022-01-11 12:46:24 -08:00
6c03f8d9e5 Drop unused variables and add some const (#71106)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71106

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33490855

fbshipit-source-id: 9fc4a4e4a7ad5e6c31f394ec6d8221b964fdf043
2022-01-11 12:38:59 -08:00
1c8b167327 Move implementation of empty_like for sparse COO (#71103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71103

Previously, the implementation of empty_like for sparse COO was a conditional path in the generic implementation.
This PR makes use of the dispatcher and moves the implementation into a separate function.
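
The user-visible behavior is unchanged; a minimal sketch of the call that is now routed through the dispatcher:

```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])
s = torch.sparse_coo_tensor(i, v, (2, 2))

e = torch.empty_like(s)  # now handled by a dedicated sparse COO kernel
print(e.layout)          # torch.sparse_coo
```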

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33511240

Pulled By: cpuhrsch

fbshipit-source-id: 9a84f82a27e3cf0ac819d867b86df6d10ddf7fa7
2022-01-11 12:30:39 -08:00
a8612cd72a Skip failing tests in test_nn if compiled without LAPACK (#70913)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70913

Reviewed By: mruberry

Differential Revision: D33534840

Pulled By: albanD

fbshipit-source-id: 0facf5682140ecd7a78edb34b9cd997f9319e084
2022-01-11 12:21:18 -08:00
14922a136f Revert D33480077: .github: Re-enable xla test config
Test Plan: revert-hammer

Differential Revision:
D33480077 (18e1e1d4d3)

Original commit changeset: a2e720c55d0e

Original Phabricator Diff: D33480077 (18e1e1d4d3)

fbshipit-source-id: e4e114a9a6d7940491ac0741e94f455a490f077a
2022-01-11 12:12:15 -08:00
940b89b03f Disable Python-3.6 binary builds (#71163)
Summary:
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71163

Reviewed By: anjali411

Differential Revision: D33532813

Pulled By: malfet

fbshipit-source-id: ab0833c2db187c452681a17907583599ff1cb481
2022-01-11 11:25:45 -08:00
4f35b9144c [jit][edge] Migrate ListType to DynamicType on mobile. (#70212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70212

Use DynamicType instead of ListType all over the place in Lite Interpreter. Namely we need to modify the following places:
1. Type parser which produces the Type constants.
2. IValue::type() which returns reflected Type from IValues.
3. Helper functions to construct the container value.
4. Typechecks which test whether a type instance is a particular container type.
ghstack-source-id: 146818619

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D33176931

fbshipit-source-id: 9144787f5fc4778538e5c665946974eb6171a2e6
2022-01-11 10:57:53 -08:00
18e1e1d4d3 .github: Re-enable xla test config (#71008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71008

This reverts commit 6f83841582d8d818129dc4ce82a8478f221b32d7.

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D33480077

Pulled By: seemethere

fbshipit-source-id: a2e720c55d0e1995e2b6cf2da7c801f377d52b3f
2022-01-11 10:49:20 -08:00
85c6489cdc ci: unquote env variables (#71139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71139

These variables were being interpreted as quoted in the GITHUB_ENV file, meaning they didn't register correctly when running the actual binary_upload.sh, so binaries weren't actually getting uploaded.

This remedies that.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, b0noI

Differential Revision: D33519952

Pulled By: seemethere

fbshipit-source-id: 727f6d4e5dbdfd0a3e2c76058bee9430b2c717a9
2022-01-11 10:21:11 -08:00
cf61738097 Drop unused variables; make things const; use some auto (#71107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71107

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33490773

fbshipit-source-id: 0d259db9c58c9b33aecc560075f6dcfa78883467
2022-01-11 08:55:54 -08:00
3c2ae2b47c Revert D32994274: [ONNX] Link to the wiki (#68505)
Test Plan: revert-hammer

Differential Revision:
D32994274 (a606ea73d6)

Original commit changeset: 34d54f935799

Original Phabricator Diff: D32994274 (a606ea73d6)

fbshipit-source-id: 81fc96c2aff9d14efb5e092fffd0685e507837e6
2022-01-11 07:40:14 -08:00
1b496cf158 Fixes doc errors in Tensor.triu(), Tensor.tril(), Tensor.ravel(). (#71057)
Summary:
Hi, PyTorch Team!
I am very much interested in starting up my contribution to PyTorch. I made several contributions in NumPy and CuPy, but this is my first PR towards PyTorch. I aim to contribute more in the upcoming future.

The PR fixes https://github.com/pytorch/pytorch/issues/70972  https://github.com/pytorch/pytorch/issues/70975.

#### Aim of PR
Functions like `Tensor.ravel`, `Tensor.tril`, `Tensor.tril_`, `Tensor.triu`, and `Tensor.triu_` had a couple of typos in their docs. The PR aims to resolve that.

I'm looking forward to your viewpoints. Thanks!

cc: kshitij12345 vadimkantorov Lezcano TestSomething22

cc brianjo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71057

Reviewed By: preeti1205

Differential Revision: D33502911

Pulled By: mruberry

fbshipit-source-id: 8ce0b68a29658a5a0be79bc807dfa7d71653532d
2022-01-11 07:34:59 -08:00
ac0d131291 Deprecating routed decoder (#70990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70990

Releasing the `decode` API for domains to let them implement a custom `decode` DataPipe for now.

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33477620

Pulled By: ejguan

fbshipit-source-id: d3c30ba55c327f4849d56f42d328a932a31777ed
2022-01-11 06:56:48 -08:00
d6b7d69d8b Python3.10 migration adding to binary linux tests (#71130)
Summary:
Python 3.10 migration: adding it to the binary Linux tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71130

Reviewed By: seemethere, janeyx99

Differential Revision: D33518787

Pulled By: atalman

fbshipit-source-id: 53c2c1b96e7a530a2af9ae7d5840bf8398b870e5
2022-01-11 05:54:07 -08:00
fb8a9732d9 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33524330

fbshipit-source-id: 112291a23e2efe2d573bee86ead8ce2fc3957e5b
2022-01-11 04:33:21 -08:00
fdda7b5e8a [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D33525225

fbshipit-source-id: 973eb9f9a5dfbd70bf0127f44089237969c2bb68
2022-01-11 04:20:46 -08:00
40b80aa490 [jit][edge] Migrate TupleType to DynamicType on mobile. (#70205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70205

Use DynamicType instead of TupleType all over the place in Lite Interpreter. Namely we need to modify the following places:
1. Type parser which produces the Type constants.
2. IValue::type() which returns reflected Type from IValues.
3. Helper functions to construct the container value.
4. Typechecks which test whether a type instance is a particular container type.
ghstack-source-id: 146818620

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D33176925

fbshipit-source-id: 00f7a5db37ba772c912643c733db6c52dfdc695d
2022-01-11 01:01:48 -08:00
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00
33a5905cc6 [quant] fix reduce_range warning (#71027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71027

Fix issue #61054: remove the warning for reduce_range=True, which caused the message "UserWarning: Please use quant_min and quant_max to specify the range for observers".

Test Plan:
python test/test_quantization.py TestFakeQuantizeOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D33484341

fbshipit-source-id: 97c3d4658926183f88a0c4665451dd7f913d30e6
2022-01-10 20:05:36 -08:00
59e166feb2 [Quant][DBR] Add test for serialization (#70078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70078

This commit adds a serialization test for DBR.

Test Plan:
python test/test_quantization.py TestQuantizeDBR.test_serialization

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33405192

fbshipit-source-id: 39c4cca49aff8b960f4dec6c272fbd0da267fa95
2022-01-10 17:50:05 -08:00
043e84b3d2 Per-overload torch.ops API (#67254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67254

Fixes https://github.com/pytorch/pytorch/issues/65997

BC breaking:
`output = torch.ops._test.leaky_relu(self=torch.tensor(-1.0))` now fails with the error `TypeError: __call__() got multiple values for argument 'self'` since we call into `OpOverloadBundle`'s `__call__` method that has `self` bound to it as its first argument.
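
A minimal sketch of the per-overload access this enables, using `aten.add` as an example (the overload names `Tensor` and `Scalar` come from its schema):

```python
import torch

t = torch.ones(3)
print(torch.ops.aten.add.Tensor(t, t))  # address the `Tensor` overload directly
print(torch.ops.aten.add.Scalar(t, 1))  # address the `Scalar` overload directly
print(torch.ops.aten.add(t, t))         # the bundle still dispatches across overloads
```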

Follow up work:
1. disallow `default` as an overload name for aten operators.
2. Add a method to obtain a list of all overloads (exclude the ones registered by JIT)
3. Add methods/properties to `OpOverload` to access more schema information (types of input and output args etc)

cc ezyang gchanan

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D33469839

Pulled By: anjali411

fbshipit-source-id: c3fc43460f1c7c9651c64b4d46337be21c400621
2022-01-10 17:29:06 -08:00
b12ca69179 [jit][edge] Migrate DictType to DynamicType on mobile. (#70202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70202

Use DynamicType instead of DictType all over the place in Lite Interpreter. Namely we need to modify the following places:
1. Type parser which produces the Type constants.
2. IValue::type() which returns reflected Type from IValues.
3. Helper functions to construct the container value.
4. Typechecks which test whether a type instance is a particular container type.
ghstack-source-id: 146735648

Test Plan: no behavior change.

Reviewed By: iseeyuan

Differential Revision: D33137257

fbshipit-source-id: 971bf431658c422ea9353cc32cdab66e98876e9d
2022-01-10 15:55:29 -08:00
a606ea73d6 [ONNX] Link to the wiki (#68505) (#69544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69544

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32994274

Pulled By: msaroufim

fbshipit-source-id: 34d54f935799fa94516a541a241900ec205c7427

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2022-01-10 15:51:04 -08:00
7397683b57 Add forward AD formulas for mv, scatter_add, _s_where (#70468)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70468

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33405364

Pulled By: soulitzer

fbshipit-source-id: 7681c33fb264a7a3ec6436ebb7c5bb07cd5ffc3d
2022-01-10 13:54:10 -08:00
78994d13c0 Add forward AD formulas for {batch,layer,group}_norm (#70355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70355

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33405362

Pulled By: soulitzer

fbshipit-source-id: 55a92e88a04e7b15a0a223025d66c14f7db2a190
2022-01-10 13:52:16 -08:00
7a08030903 Fix fx2trt CI test trigger condition (#71014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71014

Replace test trigger with test_config matching.

Test Plan:
CI
https://github.com/pytorch/pytorch/runs/4746717568?check_suite_focus=true

Reviewed By: janeyx99

Differential Revision: D33480971

fbshipit-source-id: 9513e464753343a7ae47fcfaf48119f34bae94c5
2022-01-10 13:37:24 -08:00
80659b71a5 Hoisting common expressions out of If blocks [retry] (#65645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65645

This is a retry of PR: https://github.com/pytorch/pytorch/pull/59492

Latest Changes: Added more tests, added the getOrCreateDB pattern, updated logic to remove unnecessary checks
addressed all comments.

Adding code to find common expressions from the two subblocks of an if
operation and hoist them before the if block.
This also allows Dead Code Elimination to
then eliminate some if blocks.

Test Plan: python test_jit.py TestIfHoisting

Reviewed By: eellison

Differential Revision: D33302065

Pulled By: Gamrix

fbshipit-source-id: a5a184a480cf07354359aaca344c6e27b687a3c2
2022-01-10 13:28:17 -08:00
569aeec1bc fix typo in debugging_hooks.py (#70956)
Summary:
I just fixed a small typo in the debugging_hooks documentation

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70956

Reviewed By: jbschlosser

Differential Revision: D33508898

Pulled By: dagitses

fbshipit-source-id: fc5935e5a2e2ddc45657a22d3b33a11aba378d9b
2022-01-10 12:59:42 -08:00
49ed097ebe Add documentation for lowering (#71116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71116

As title, add more inline documentation for code.

Test Plan:
no
pingpoke

Reviewed By: 842974287

Differential Revision: D33465611

fbshipit-source-id: 6b5529893098e5591470c2f41a0d8989e3cfccb9
2022-01-10 12:56:59 -08:00
3fbff80bea ci: Move MAX_JOBS to not set on Darwin (#71122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71122

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33515392

Pulled By: seemethere

fbshipit-source-id: 376608c9a6e2e685a07d5010ce443a3f02475ee5
2022-01-10 12:49:14 -08:00
cfc1117591 Update sparse.rst to warn about _values() (#71088)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71088

Reviewed By: jbschlosser

Differential Revision: D33511207

Pulled By: cpuhrsch

fbshipit-source-id: 9d0c5445842ed96999eb88445cbea7ae284b1a6f
2022-01-10 12:43:46 -08:00
30699cbfd5 Reland D33284352: [jit][edge] Do not reuse mobile type parser for all unpicklers. (#71048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71048

reland D33284352 (0a921ba0d0)
ghstack-source-id: 146735646

Test Plan: All Github CI: ciflow rerun -l ciflow/all

Reviewed By: gmagogsfm

Differential Revision: D33489731

fbshipit-source-id: 3e160209a1abb193ad3eed3018054aa7d331025e
2022-01-10 12:42:23 -08:00
fb66f561b1 Add copy out to the fallback path in SR invocation of composed op (#70871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70871

We had previously handled reusing memory in the optimized kernel execution path, but not yet handled it if we hit the unoptimized fallback.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33458652

Pulled By: eellison

fbshipit-source-id: 4eb62181ed02c95813a99638f5e2d0f9347b5c08
2022-01-10 12:16:38 -08:00
c8332256ee [JIT] Refactor SR invocation of fusion (#70508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70508

We can create the code object at compile time instead of at runtime to speed it up. This also makes the compilation cache unnecessary. TODO: figure out if there's a way to cache the InterpreterState object

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33458648

Pulled By: eellison

fbshipit-source-id: 710389741e7c6210528f2f96ab496fcd533d942a
2022-01-10 12:16:35 -08:00
0adc7cc546 Inline Fallback Functions For Debugging (#70463)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70463

Fix for https://github.com/pytorch/pytorch/issues/52940

When we inline a fallback function, we insert the runtime-optimized version of its graph.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, davidberard98

Differential Revision: D33458651

Pulled By: eellison

fbshipit-source-id: fd7e5e2b5273a1677014ba1a766538c3ee9cad76
2022-01-10 12:15:11 -08:00
840459a269 [ONNX] Relax constant_fold gather with indices rank > 1 (#68140) (#68493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68493

Fixes #66786.

`index_select` only supports a 1-D `index` tensor, while `ONNX::Gather` allows `index` to have rank `q`. Abort constant folding of `ONNX::Gather` if the `index` rank is larger than 1.
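
A minimal sketch of the rank restriction that motivates bailing out:

```python
import torch

x = torch.arange(12.0).reshape(3, 4)

print(torch.index_select(x, 0, torch.tensor([2, 0])))  # 1-D index: supported

# A rank-2 index is valid for ONNX Gather but not for index_select:
# torch.index_select(x, 0, torch.tensor([[0, 1], [2, 0]]))  # raises RuntimeError
```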

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D32483826

Pulled By: msaroufim

fbshipit-source-id: a8e8389d85287a859d32abf8d8d98852290b0a03

Co-authored-by: BowenBao <bowbao@microsoft.com>
2022-01-10 11:55:02 -08:00
4b47047dae [ONNX] Add support for shrink ops (#66969) (#68492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68492

* Initial commit

* Fix flake issue

* Add test tags

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D32483827

Pulled By: msaroufim

fbshipit-source-id: 41c623712524465b877d0fe0e2f4001d475bf2ce
2022-01-10 11:38:31 -08:00
62441157e3 Have getFilesToLevels return a reference (#71047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71047

The copy induced by getFilesToLevels is currently consuming 3,457,470,000 cycles per day. A reference might fix that.

Reference:
```
["Inline torch::jit::JitLoggingConfig::getFilesToLevels[abi:cxx11] @ caffe2/torch/csrc/jit/jit_log.cpp:54"]
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33479180

fbshipit-source-id: 05d306ad9ea23e2f30348a08d547ebe274eb0c10
2022-01-10 11:32:32 -08:00
87484d67e3 .github: Enable linux binary builds (#68388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68388

Updates the gpu architectures as well as adding a trigger for
on_pull_request for the binary build workflows so that we can iterate on
this later

TODO:
* Create follow up PR to enable nightly linux GHA builds / disable CircleCI nighlty linux builds

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D33462294

Pulled By: seemethere

fbshipit-source-id: 5fa30517550d36f504b491cf6c1e5c9da56d8191
2022-01-10 11:30:45 -08:00
e9a8bb59b4 Move the apply_tensor_props into its own function for more public use (#67786)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67786

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32175962

Pulled By: Gamrix

fbshipit-source-id: caefe1c849277632d976a6b5513f72b47595f2c0
2022-01-10 11:26:03 -08:00
3ef10da97d add support for pickle v4 (#70642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70642

Review history on https://github.com/pytorch/pytorch/pull/70014

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D33414364

Pulled By: PaliC

fbshipit-source-id: 7e7ed491c6f16d4fac3a03f7e403935823c03aa6
2022-01-10 11:13:41 -08:00
118bd82dde detect mocked module on saving pass (#70641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70641

Raises a NotImplementedError if we attempt to pickle an object which uses a mocked module. Now we no longer have to load the object to get this check; it instead happens right on the saving path.

Review History is on https://github.com/pytorch/pytorch/pull/69793 PR was moved to a different branch due to original branch getting corrupted.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D33414365

Pulled By: PaliC

fbshipit-source-id: 6d72ddb05c47a3d060e9622ec0b6e5cd6c6c71c8
2022-01-10 11:11:55 -08:00
c4400fc431 Retire repeat_test_for_types (#71033)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69865

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71033

Reviewed By: mruberry

Differential Revision: D33486370

Pulled By: janeyx99

fbshipit-source-id: 71f9383dbc1e00b572f26eb4f04d0a94c6759e35
2022-01-10 09:13:54 -08:00
e1b84e1b6b fix loading of older models that don't have maximize (#71023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71023

Reviewed By: jbschlosser

Differential Revision: D33483687

Pulled By: albanD

fbshipit-source-id: 2f3c6e97a9579be9ba15eca0756fc1f2c466fbb6
2022-01-10 06:01:24 -08:00
b27dfa70c4 caffe2: disable TensorImpl static_assert (temporary)
Test Plan: buck2 build -c cxx.modules=false -c fbcode.platform=platform010 fbcode//caffe2/aten:ATen-cu

Reviewed By: singhsrb, meyering

Differential Revision: D33501636

fbshipit-source-id: a1a5bbb2b160eba8eb5abba4f6ae1929a58e11e9
2022-01-09 23:11:17 -08:00
fca8a0acaa Prevent import race condition that leaves torch.package.PackagePickler with unwanted dispatch table entries. (#71025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71025

TL;DR: In some cases:
1) user imports `dill`, which mutates `_Pickler.dispatch`,
2) user imports lib that imports `torch.package`
3) `PackagePickler.dispatch = _Pickler.dispatch.copy()` makes a copy of the mutated table
4) user calls `dill.extend(use_dill=False)` to reset `_Pickler.dispatch`, expecting everything to be okay
5) `PackagePickler` is used to pickle something like `ModuleDict`. `PackagePickler.dispatch` has stale entries to dill pickle functions like `save_module_dict`, which sometimes hard-code calls to `StockPickler.save_global`, which is unaware of torch.package module prefixes.
6) Exception is raised, e.g. `Got unhandled exception Can't pickle <class '<torch_package_2>.caffe2.mylib'>: it's not found as <class '<torch_package_2>.caffe2.mylib'>`
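
A simplified, dependency-free sketch of the hazard above (plain dicts stand in for the pickler dispatch tables; all names are illustrative):

```python
dispatch = {"int": "stock_save_int"}   # stands in for _Pickler.dispatch

dispatch["module_dict"] = "dill_save"  # (1) dill mutates the shared table
snapshot = dispatch.copy()             # (3) PackagePickler copies it on import
del dispatch["module_dict"]            # (4) dill.extend(use_dill=False) resets

assert "module_dict" in snapshot       # (5) the copy still holds the stale entry
```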

Differential Revision: D33483672

fbshipit-source-id: d7cd2a925bedf27c02524a6a4c3132a262f5c984
2022-01-09 15:13:39 -08:00
2bed616e0f [Dist tests] Make event_listener work for all dist tests (#70628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628

The event_listener thread is used to log process tracebacks when a timed-out process sends it a request for its traceback. However, this thread is created in the `_run` function, which is overridden by some classes such as `TestDistBackend`, so those tests did not have this feature. Move the event_listener setup logic to `run_test`, which is called by all distributed test classes, enabling it for all distributed tests. Also modify the logger setup to ensure that logging.info calls are printed in the subprocess.
ghstack-source-id: 146714642

Test Plan: CI

Reviewed By: jaceyca, fduwjj

Differential Revision: D33410613

fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
2022-01-09 14:54:09 -08:00
9267fd8d73 [WIP] [ATen] Add native_multi_attention_self_attention CPU + GPU implementation (#70649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70649

As described in https://fb.quip.com/oxpiA1uDBjgP

This implements the first parts of the RFC and is a rough draft showing the approach. The idea is that for the first cut we can maintain very close (identical, I believe, in this diff) numerical equivalence to the existing nn.MHA implementation, which is what this diff attempts to do. In subsequent iterations, once we have a working and adopted native self-attention implementation, we could then explore alternatives, etc.

The current implementation is similar to existing dedicated implementations such as LightSeq/FasterTransformer/DeepSpeed, and for MHA on both CPUs and GPUs is between 1.2x and 2x faster depending on the setting. It makes some approximations/restrictions (doesn't handle masking in masked softmax, etc), but these shouldn't materially impact performance.
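
For reference, a minimal sketch of the existing module whose numerics the native kernels aim to match (shapes here are illustrative):

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(4, 16, 8)  # (batch, seq, embed)
out, _ = mha(x, x, x)      # the self-attention case targeted here
print(out.shape)           # torch.Size([4, 16, 8])
```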

This does the first few items:

* add native_multi_head_attention(...) , native_multi_head_attention_backward(..) to native_functions.yaml
* Implement native_multi_head_attention(..) on GPU, extracting bits and pieces out of LS/DS/FT as appropriate
* Implement native_multi_head_attention(..) on CPU

The backward implementation is still WIP, but the idea would be to:

* Hook these up in derivatives.yaml
Implement native_multi_head_attention_backward(..) on GPU, extracting out bits and pieces out of LS/DS (not FT since it’s inference only)
* Implement native_multi_head_attention_backward(..) on CPU
* In torch.nn.functional.multi_head_attention_forward 23321ba7a3/torch/nn/functional.py (L4953), add some conditionals to check if we are being called in a BERT/ViT-style encoder fashion, and invoke the native function directly.

Test Plan: TODO

Reviewed By: mikekgfb

Differential Revision: D31829981

fbshipit-source-id: c430344d91ba7a5fbee3138e50b3e62efbb33d96
2022-01-08 21:50:41 -08:00
785b6905de reduce plan generation log spam (#70880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70880

Change the log level to `debug` in caffe2's `optimizer.py` when logging the rowwise Adagrad engine.

Test Plan: CI + sandcastle

Reviewed By: boryiingsu

Differential Revision: D33439337

fbshipit-source-id: b158249b8df771c0ec8b642210ede39972929b00
2022-01-08 10:07:06 -08:00
49a07c8922 Suppress some unused variable warnings in Sorting.cu and TensorTopK.cu (#70999)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70999

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D33470240

fbshipit-source-id: 906932cb5f497c77465b70ec9bc6fcb0705719de
2022-01-08 00:41:58 -08:00
d1e049c306 Fix some unused variable warnings and make some stuff const in ReplicationPadding.cu (#70998)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70998

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D33460035

fbshipit-source-id: bdf70fd04cce40a2a8d60c2c405f8d6cee9127e5
2022-01-08 00:40:51 -08:00
11aa1961c1 Use (void)error_unused to avoid unused warning (#71000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71000

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33470600

fbshipit-source-id: 868a6ee33a04846bd1efbe06ab306fbaad3bf9db
2022-01-07 23:39:30 -08:00
704af23ee4 Use a reference in GetSingleArgument (#71007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71007

A string copy at Line 417 is currently consuming 125,749,287,000 cycles/day. I suspect the issue is with a copy-on-return, but we can experiment with introducing a reference in the middle to see if that produces a good savings without changing the interface.

Reference
```
["Inline caffe2::ArgumentHelper::GetSingleArgument @ caffe2/caffe2/utils/proto_utils.cc:417"]
```

Test Plan: Sandcastle

Reviewed By: xw285cornell

Differential Revision: D33478883

fbshipit-source-id: e863e359c0c718fcd0d52fd4b3c7858067de0670
2022-01-07 20:18:56 -08:00
9762aa0fdc Revert D33284352: [jit][edge] Do not reuse mobile type parser for all unpicklers.
Test Plan: revert-hammer

Differential Revision:
D33284352 (0a921ba0d0)

Original commit changeset: 997c4f110b36

Original Phabricator Diff: D33284352 (0a921ba0d0)

fbshipit-source-id: af316727442a64f1ae40d53d7a9d26ec550d634e
2022-01-07 19:58:03 -08:00
f626bef598 Fix docstring for nn.Hardshrink (#71012)
Summary:
Fixes nn.Hardshrkink's docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71012

Reviewed By: dagitses

Differential Revision: D33482333

Pulled By: jbschlosser

fbshipit-source-id: 00eea76299676fc97c5cc31421af9c73665bfcf4
2022-01-07 18:56:47 -08:00
0a921ba0d0 [jit][edge] Do not reuse mobile type parser for all unpicklers. (#70338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70338

Today, Unpickler is used by both server and mobile for deserializing models, and it always falls back to the mobile parser when no type resolver is provided by the user. However, this is not intended, as the server and mobile type parsers support different things. In this diff we provide a default fallback using the script parser and opt out of it for all mobile cases.
ghstack-source-id: 146727330

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D33284352

fbshipit-source-id: 997c4f110b36eee6596e8f23f6a87bf91a4197ed
2022-01-07 18:35:32 -08:00
3f3eae6737 [jit] Split Tensor type implementations to separate file. (#70121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70121

Code move all TensorType dependencies into a separate `tensor_type.cpp`, so that we don't link with it in the min runtime accidentally.
ghstack-source-id: 146727331

(Note: this ignores all push blocking failures!)

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D33102286

fbshipit-source-id: e9fe176201bd2696cb8c65c670fcf225e81e8908
2022-01-07 18:35:29 -08:00
53b9c0f12d [jit] Polymorphic IValue::type() for DynamicType. (#70120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70120

Before the change:
```
c10::Type t = ivalue.type();
```
After the change:
```
c10::Type t = ivalue.type();
c10::DynamicType d = ivalue.type<c10::DynamicType>(); // new path
```
The new path will be adopted in the PyTorch Lite Interpreter to support lightweight type reflection. Note that type getters are selected at compile time, so there is no performance overhead. The benefits of having a DynamicType will be elaborated in a separate document, but in short, DynamicType provides an isolated type system for controlling binary size bloat and shrinks ~20 supported Type symbols down into one, so that the size taken by specializations and function name symbols is greatly reduced.

Lite Interpreter should only use the `<DynamicType>` variant of the interfaces from aten, to reduce binary size.
ghstack-source-id: 146727334

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: gmagogsfm

Differential Revision: D33102276

fbshipit-source-id: c5354e7d88f9de260c9b02636214b40fe15f8a10
2022-01-07 18:35:26 -08:00
62909facb3 [jit] Decouple ivalue.h from jit_type.h (#70119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70119

JIT type and IValue have a mutual dependency today for various reasons. It is made worse by `jit_type.h` and `ivalue.h` mutually including each other, causing non-deterministic name resolution in different translation units and preventing us from safely using symbols from `jit_type.h` in `ivalue.h`. This diff doesn't address the mutual dependency between JIT type and IValue at the linking level, but at the header level.

We choose to remove the include of `ivalue.h` from `jit_type.h` because it's much harder to make a type-free header for IValue. We achieve this by removing EnumType (the only type depending on IValue in JIT types) from `jit_type.h` and letting downstream users include an explicit `enum_type.h` as needed. We also move some IValue inline member function definitions back to `ivalue_inl.h` so that `jit_type.h` doesn't need the IValue definition to be present.
We also remove a seemingly accidental include of `jit_type.h` from `ATen/core/List_inl.h` so that `ivalue.h` can include `jit_type.h` directly; otherwise, due to another mutual inclusion between `ivalue.h` and `List_inl.h`, we can still get nondeterministic behavior.
ghstack-source-id: 146727333

(Note: this ignores all push blocking failures!)

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D33155792

fbshipit-source-id: d39d24688004c2ec16c50dbfdeedb7b55f71cd36
2022-01-07 18:34:17 -08:00
0eb2fc608c [fx_acc] ensure all acc ops args to be keyword arguments (#70952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70952

ATT

Test Plan: test_plan_no

Reviewed By: wushirong

Differential Revision: D33456343

fbshipit-source-id: 26b0c1042de6072ff8741617dd3523edc4a9b5fd
2022-01-07 17:53:36 -08:00
0cd474b2ce fix op not scriptable
Summary: Fix torch.sort, min/max, and torch.numel not being scriptable after quantization

Test Plan: python3 test/test_quantization.py TestQuantizeFxOps.test_general_shape_ops

Reviewed By: jerryzh168

Differential Revision: D33467184

Pulled By: terrychenism

fbshipit-source-id: 13775ab36d4007978df48c9af71d83398fce5161
2022-01-07 16:55:28 -08:00
d26e5ced72 Add missing docstrings for ONNX converter API. Fixes #67393 (#67640) (#68489)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68489

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D32483783

Pulled By: msaroufim

fbshipit-source-id: 512e4495040a6a9833d501de2301f1709b0352b9
2022-01-07 16:43:09 -08:00
c59c86706e [quant] Add back README.md for backend_config (#70964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70964

Accidentally deleted before; adding this back. We'll make this more complete after
the structure is finalized.

Test Plan:
no test needed

Imported from OSS

Reviewed By: dagitses

Differential Revision: D33470738

fbshipit-source-id: 00459a4b00514d3d0346de68788fab4cad8a5d12
2022-01-07 15:44:51 -08:00
00e5610914 FX quant: allow duplicate named_modules during fbgemm lowering (#70927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70927

Earlier, when replacing deq-ref-quant modules, we only got non-duplicate named modules. When the model contains duplicate names, the lowering fails the second time.
This PR allows duplicates when getting the named modules.
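
For reference, a small sketch of the relevant API knob (assuming the standard `nn.Module.named_modules` signature with its `remove_duplicate` flag):

```
import torch.nn as nn

shared = nn.Linear(2, 2)
model = nn.Sequential(shared, nn.ReLU(), shared)  # same module appears twice

# Default: duplicates are filtered, so the shared Linear is visited once.
print(len(list(model.named_modules())))                        # 3
# With remove_duplicate=False every occurrence is visited, which is what
# the lowering pass needs when a model reuses modules under several names.
print(len(list(model.named_modules(remove_duplicate=False))))  # 4
```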

Test Plan: buck test //caffe2/torch/fb/model_transform/quantization/tests:fx_quant_api_test

Reviewed By: jerryzh168

Differential Revision: D33440028

fbshipit-source-id: f2fabd49a293beb90d7b4bf471610cde6279fd66
2022-01-07 15:43:31 -08:00
ad88354e25 torch.futures doc formatting (#70630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70630

The params section is incorrectly formatted [here](https://pytorch.org/docs/master/futures.html?highlight=future#:~:text=way%20as%20then().-,Parameters,-callback%20(Future)%20%E2%80%93%20a):

![image](https://user-images.githubusercontent.com/14858254/148119877-6c719851-4edd-4126-8ef7-e6c1920304cf.png)

Updated docs:

https://docs-preview.pytorch.org/70630/futures.html?highlight=future#:~:text=way%20as%20then().-,Parameters,-callback%20(Future)%20%E2%80%93%20a

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: dagitses, mrshenli

Differential Revision: D33478214

Pulled By: H-Huang

fbshipit-source-id: 8cd7022ae79a8e6fe8b5fa8b767c55903c9ac368
2022-01-07 15:22:22 -08:00
d583eca8c3 Add workflow to sync fbsync->master (#71013)
Summary:
The main logic of the workflow is implemented in the `syncbranches.py` script,
which computes patch-ids of the divergent history (as determined by `git
merge-base`) and treats all patches present in the sync branch with
non-matching patch-ids as ones missing from the target branch.
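
For reference, a sketch of computing a patch-id in Python via `git patch-id --stable` (the helper name below is ours, not necessarily the script's):

```
import subprocess

def patch_id(rev: str) -> str:
    # `git patch-id --stable` reads a diff on stdin and prints
    # "<patch-id> <commit-id>"; the patch-id is identical for patches whose
    # diff content matches, even if their commit hashes differ.
    diff = subprocess.check_output(["git", "show", rev])
    out = subprocess.check_output(["git", "patch-id", "--stable"], input=diff)
    return out.split()[0].decode()
```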

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71013

Reviewed By: bigfootjon

Differential Revision: D33480885

Pulled By: malfet

fbshipit-source-id: bd72c061720d0cba49c6754ec4e94437d8a5c262
2022-01-07 15:09:23 -08:00
d7db5fb462 ctc loss no batch dim support (#70092)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70092

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33280068

Pulled By: george-qi

fbshipit-source-id: 3278fb2d745a396fe27c00fb5f40df0e7f584f81
2022-01-07 14:33:22 -08:00
9032d73f3b Disable cpp tests in multigpu job (#71015)
Summary:
See if this fixes the timeouts described in https://github.com/pytorch/pytorch/issues/70015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71015

Reviewed By: dagitses

Differential Revision: D33483762

Pulled By: suo

fbshipit-source-id: 09bf93e73669a1211b200b4b272bfaa0d78a21d2
2022-01-07 14:32:01 -08:00
0721fc6474 Decouple MapDataPipe from Dataset (#70991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70991

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33477680

Pulled By: ejguan

fbshipit-source-id: d3e89492e921a96791319f35052a229684ddf7cf
2022-01-07 14:28:41 -08:00
3febe0d986 Remove backward op for 3d depthwise convolution (#70462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70462

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33340495

Pulled By: jbschlosser

fbshipit-source-id: a180951680aef8fb123463af098582ef6cf9bbdb
2022-01-07 14:24:34 -08:00
704fbc29ae Remove backward op for 2d depthwise convolution (#70461)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70461

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33340494

Pulled By: jbschlosser

fbshipit-source-id: f2d8b2fcf9ad0f42b644b1dba51a694d83975566
2022-01-07 14:23:15 -08:00
a70297e7cb NNAPI: quant logistic fix (#70847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70847

NNAPI needs a fixed zero point and scale for sigmoid (logistic)
ghstack-source-id: 146555935

Test Plan: LIBNEURALNETWORKS_PATH="/path/to/libneuralnetworks.so" pytest test/test_nnapi.py

Reviewed By: dreiss

Differential Revision: D33237918

fbshipit-source-id: 05ef3a81bf1589ad44b599a19bce4066531c432b
2022-01-07 13:36:33 -08:00
ed50a35cf8 [Model Averaging] Update the documentation of PeriodicModelAverager (#70974)
Summary:
Here 20 is a bad example, since the warmup step is set to 100; 200 iterations makes much more sense.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974

Reviewed By: dagitses

Differential Revision: D33474576

Pulled By: rohan-varma

fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
2022-01-07 13:20:42 -08:00
c8b897333c [rnn/gru] no batch dim (#70977)
Summary:
Reference https://github.com/pytorch/pytorch/issues/60585

Reland: https://github.com/pytorch/pytorch/pull/70442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70977

Reviewed By: dagitses, george-qi

Differential Revision: D33477256

Pulled By: jbschlosser

fbshipit-source-id: 2035c2d00b2f627c7046fd9b13c71b9360cd6fad
2022-01-07 13:14:41 -08:00
338eb1b2b3 [LTC] Export torch::lazy::GetBackendDevice() (#70963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70963

This commit exports torch::lazy::GetBackendDevice().

Test Plan: CI in the lazy_tensor_staging branch.

Reviewed By: wconstab

Differential Revision: D33468938

Pulled By: alanwaketan

fbshipit-source-id: f65599c9238bf6b4f4ffbd5194befdc267272831
2022-01-07 13:13:18 -08:00
0a002f879e Actually clean on clean workspace, including hidden files (#71018)
Summary:
The workspace should be totally empty before checking out PyTorch; this
is especially important with non-ephemeral runners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71018

Reviewed By: robieta

Differential Revision: D33482985

Pulled By: suo

fbshipit-source-id: cafa123d2b893bfbdad62295586b5b79f1542b3a
2022-01-07 13:04:54 -08:00
bc026c0577 [jit] Split Union type and Optional type to separate impl file. (#69483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69483

To avoid accidental linking to Union type and Optional type in Edge runtimes, we can separate these types into different files, so that we don't accidentally link with them in type.cpp.
ghstack-source-id: 146670525

Test Plan: just code move.

Reviewed By: ejguan

Differential Revision: D32264607

fbshipit-source-id: c60b6246f21f3eb0a67f827a9782f70ce5200da7
2022-01-07 11:23:15 -08:00
1011ac188f [jit][edge] Create DynamicType for OptionalType in mobile. (#68137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68137

A small step toward replacing existing OptionalType usage with DynamicType in the Edge runtime.
ghstack-source-id: 146670520

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D32264617

fbshipit-source-id: 62d3ffad40901842deac19ca2098ea5ca132e718
2022-01-07 11:23:12 -08:00
0517e719ac [jit] Add conformance test for DynamicType with server JIT types. (#69482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69482

Add a test to enumerate a number of JIT type combinations and see if their subtyping behavior is preserved in the new DynamicType system.
ghstack-source-id: 146670526

Test Plan: buck test mode/opt //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.DynamicType'

Reviewed By: gmagogsfm

Differential Revision: D32891263

fbshipit-source-id: 728211b39778e93db011b69b0a4047df78a8fc5b
2022-01-07 11:23:09 -08:00
649dda9fee [jit] Implement DynamicType for TorchScript runtime. (#68136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68136

DynamicType is an extension to existing server JIT types. Today using normal server types on Edge is a bit problematic because in embedded environments we don't need the full spectrum of types but we still build with these unneeded dependencies.

Is it possible to just get rid of unneeded JIT types from Edge builds? It's not easy to do so at the moment. For example, on Edge we don't support Union type, but we have to pull in the dependency on Union type because Optional type, which inherits from Union type, is supported, so Union type has to be included in the build. Although we could split Union type and Optional type, it could be argued that the root cause is that every time we use anything inheriting from `c10::Type`, we don't have direct evidence of how much dependency we pull in, because we make virtual calls and we don't know what exactly we're calling with server JIT types. If we don't know, it's highly likely that the linker doesn't know either, so it cannot effectively strip unused methods.

To address this problem, one option is to implement a separate `DynamicType` which has simpler behavior and doesn't store different types as different symbols in the binary but rather as raw data (or a "tag"). This could increase the binary size by several KBs, so I included several binary size reductions in the same stack, hoping at least not to regress the binary size.

Currently `DynamicType` inherits from `c10::Type` because I want to reduce the migration cost of `DynamicType` by making it interface with existing server JIT types. In the future `DynamicType` should be implemented as a separate class, without relying on `c10::Type`, to make things both simpler and leaner.
ghstack-source-id: 146670522

Test Plan: in the next diff.

Reviewed By: VitalyFedyunin

Differential Revision: D32264615

fbshipit-source-id: 180eb0998a14eacc1d8b28db39870d84fcc17d5b
2022-01-07 11:23:07 -08:00
0408449244 [jit] Reclaim some binary size. (#68038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68038

Replace const std::function& with c10::function_ref, because the former uses type erasure, adds 5-10 KB of size overhead, and adds another level of indirection to calls of the underlying functions. In contrast, a non-owning c10::function_ref will compile down to just a raw function pointer, which should be much smaller.
ghstack-source-id: 146670523

Test Plan: eyes

Reviewed By: iseeyuan, mrshenli

Differential Revision: D32264619

fbshipit-source-id: 558538fd882b8e1f4e72c4fd5e9d36d05f301e1e
2022-01-07 11:21:46 -08:00
dd1121435b SequentialLR update _last_lr on step (#70558)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68956.
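
For reference, a minimal sketch of the affected behavior (the scheduler choices and milestones below are illustrative, not taken from the issue):

```
import torch
from torch.optim.lr_scheduler import ConstantLR, ExponentialLR, SequentialLR

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = SequentialLR(
    opt,
    schedulers=[ConstantLR(opt, factor=0.5, total_iters=2),
                ExponentialLR(opt, gamma=0.9)],
    milestones=[2],
)

for _ in range(4):
    opt.step()
    sched.step()
    # With this fix, get_last_lr() reflects the most recent step()
    # instead of returning a stale value.
    print(sched.get_last_lr())
```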

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70558

Reviewed By: dagitses

Differential Revision: D33430213

Pulled By: albanD

fbshipit-source-id: 446f182610de32db224d55b244d76c3076e8080f
2022-01-07 10:36:35 -08:00
195181d4df Revert "add very dumb retry to ecr gc"
This reverts commit 22f528043342ea06d00835616e8447e2b8c94adb.
2022-01-07 10:29:13 -08:00
c6e727d05b Fix adamw formula doc (#68587)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68587

Reviewed By: dagitses, jbschlosser

Differential Revision: D33478646

Pulled By: albanD

fbshipit-source-id: 4e6419829c3faa7449c041e7d467a6dab30fe917
2022-01-07 10:15:16 -08:00
08074c8f2d Update gradcheck.py (#70950)
Summary:
Following https://github.com/pytorch/pytorch/pull/64837#discussion_r779870974

Changed torch.equal to torch.allclose, as exact comparison could be flaky.
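
For intuition, a tiny standalone example of the difference (values are illustrative):

```
import torch

a = torch.tensor([1.0, 2.0])
b = a + 1e-6  # tiny numerical noise, e.g. from a different reduction order

print(torch.equal(a, b))     # False: requires exact bitwise equality
print(torch.allclose(a, b))  # True: equality within rtol/atol tolerances
```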

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70950

Reviewed By: albanD

Differential Revision: D33462426

Pulled By: anjali411

fbshipit-source-id: aeaba9d2a98d1d0af04fa2cab8c495c23ec0a9cc
2022-01-07 09:29:10 -08:00
8dfff8b2e2 Fix scatter for empty indexes (#70662)
Summary:
This PR fixes an issue with `scatter` where the output is garbage for zero-sized indexes.

```py
import torch

null_index = torch.zeros((0, 4), dtype=torch.int64)
null_arr = torch.zeros((0, 4))
zeros_arr = torch.zeros((1, 4))

result = zeros_arr.scatter(0, null_index, null_arr)

print(null_index)
print(null_arr)
print(zeros_arr)
print(result)
```

```
tensor([], size=(0, 4), dtype=torch.int64)
tensor([], size=(0, 4))
tensor([[0., 0., 0., 0.]])
tensor([[1.7036e+19, 2.9965e+32, 3.9133e-14, 1.3585e-19]])
```

The out array was never filled when the `index` arg has 0 elements; with the fix, `result` simply equals a copy of `zeros_arr`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70662

Reviewed By: dagitses

Differential Revision: D33476807

Pulled By: albanD

fbshipit-source-id: 97dbdd9c0133899e58828c43ecba81838807b8af
2022-01-07 09:20:43 -08:00
4e7e8f2826 [PyTorch] Outline destructor of CppFunction (#63688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63688

CppFunction is used for function registration, so it's not performance-sensitive. Outlining the destructor should reduce code size.
ghstack-source-id: 146648927

Test Plan: Mobile buildsizebot

Reviewed By: dhruvbird

Differential Revision: D30462640

fbshipit-source-id: de410f933bf936c16769a10a52092469007c8487
2022-01-07 09:16:23 -08:00
40c512f52c split cuda for all 11.X (#70899)
Summary:
The code didn't support 11.5 or above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70899

Reviewed By: ngimel

Differential Revision: D33469544

Pulled By: janeyx99

fbshipit-source-id: ea38de36b025051f76322fe840e3851408195160
2022-01-07 09:11:16 -08:00
2378421340 Implement torch.allclose for sharded tensor. (#70331)
Summary:
Implement torch.allclose op for sharded tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70331

Test Plan:
Automated test added.
pritamdamania87
Fixes https://github.com/pytorch/pytorch/issues/67112

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Reviewed By: pritamdamania87

Differential Revision: D33339137

Pulled By: kumpera

fbshipit-source-id: 4263e468eaa117317b190f69877bf3f8bbac5658
2022-01-07 08:37:04 -08:00
997fa8671d Fix docstring for nn.Hardsigmoid (#70987)
Summary:
Fixes nn.Hardsigmoid's docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70987

Reviewed By: dagitses

Differential Revision: D33476974

Pulled By: albanD

fbshipit-source-id: bf3a1c485dd2c369c56981f9afbfe45aa9cee2cc
2022-01-07 08:13:53 -08:00
f135438d3b Dispatch to at::convolution intead of at::_convolution in _convolution_double_backward (#70661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70661

Dispatching to at::convolution can make Lazy Tensor trace the right convolution op.

Test Plan: pytest test/test_nn.py -k test_conv_double_backward_strided_with_3D_input_and_weight

Reviewed By: wconstab, jbschlosser

Differential Revision: D33428780

Pulled By: desertfire

fbshipit-source-id: 899e4135588ea33fff23d16103c25d9bcd3f902c
2022-01-07 07:53:46 -08:00
9ad21091dd [SR] Give VarStackNodeWrapper an iterator (#69922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69922

D32596934 (65f54bc000) made the serial stack implementation a bit brittle. It introduced a new container type: `VarStackNodeWrapper`. This type was used as a template parameter in the serial stack implementation.

The other type used in the serial stack implementation is `at::ArrayRef<at::Tensor>`. Ideally, the interface of `VarStackNodeWrapper` should be as close as possible to this other type. However, because the new container type did not have an iterator, expressions like this would fail to compile:
```
for (const auto& tensor : tensors) {
  // do something
}
```
Introducing this iterator will make the code easier to maintain going forward.

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

I consider this a `VarStack` implementation detail, so I'd prefer not to test it directly. We can test it implicitly by adding some code to the serial stack implementation that uses the iterator.

Reviewed By: swolchok

Differential Revision: D33101489

fbshipit-source-id: 7cf44c072d230c41bd9113cf2393bc6a6645a5b5
2022-01-07 07:24:47 -08:00
6e16c9bb1d Add support for deleteKey for FileStore (#69953)
Summary:
torch_ucc uses `deleteKey`, and trying to run PyTorch tests with torch_ucc leads to a failure saying `deleteKey not implemented for FileStore`.
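
The Python binding is `delete_key`; a small sketch of the now-supported call:

```
import tempfile
import torch.distributed as dist

with tempfile.TemporaryDirectory() as d:
    store = dist.FileStore(d + "/store", 1)
    store.set("key", "value")
    store.delete_key("key")  # previously raised: not implemented for FileStore
```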

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69953

Reviewed By: ngimel

Differential Revision: D33458457

Pulled By: H-Huang

fbshipit-source-id: f46afd59f950722ae594d9aafb8843f14019e930
2022-01-07 06:20:59 -08:00
d697bb4220 Adapt llvm_codegen.cpp to LLVM TOT (#70810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70810

Adapt to LLVM top-of-tree APIs.

For context: LLVM is moving towards opaque pointers for IR values: https://llvm.org/docs/OpaquePointers.html

I also changed some `value->getScalarType()->getPointerElementType()`  expressions to directly reference relevant types. This is simpler and more in line with the intentions of the opaque IR pointers. (In fact I would expect those expressions to break in the future). I did not fix places where the relevant type wasn't obvious to me though.

Test Plan:
-
```
$ cd fbsource/fbcode
$ tp2_update_fbcode llvm-fb --branch=staging
# symlinks point to d9c037cf2b4f0268cb1897b99c8c87c5d0232616 TP2 revision
$ buck build mode/opt-clang-thinlto unicorn:index_server -c unicorn.hfsort="1" -c cxx.profile="fbcode//fdo/autofdo/unicorn/index_server:autofdo" -c cxx.modules=False -c cxx.extra_cxxflags="-Wforce-no-error"
```
- Check sandcastle jobs

Reviewed By: modiking

Differential Revision: D33431503

fbshipit-source-id: 33f39d0a0c0f4b805ab877a811ea0a670f834abf
2022-01-07 05:07:25 -08:00
87139d8532 [LTC] Sync LazyGraphExecutor and LazyTensor with the staging branch (#70867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70867

This commit syncs LazyGraphExecutor and LazyTensor with the staging branch's
latest changes.

Test Plan: CI in the lazy_tensor_staging branch.

Reviewed By: wconstab, desertfire

Differential Revision: D33440005

Pulled By: alanwaketan

fbshipit-source-id: 0dd72643dbf81a87fc4b05019b6564fcb28f1979
2022-01-07 01:51:53 -08:00
1cdc643714 [TensorExpr] Add a pass for trimming JIT graphs. (#66847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66847

Trimming means that we try to remove a small portion of the graph while
keeping it valid, and we repeat this step N times. This is
useful for debugging when we try to find a minimal example reproducing
the issue at hand.

Differential Revision: D31751397

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 07d8ba1435af8fd2d7b8cf00db6685543fe97a85
2022-01-07 01:03:59 -08:00
8223ef1cd8 [TensorExpr] Clean-up logic for copying input tensors and remove some dead code. (#70535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70535

This also fixes handling of inputs that happen to be outputs (they
require copy).

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D33399116

Pulled By: ZolotukhinM

fbshipit-source-id: 9845838eb653b82ae47b527631b51893990d5319
2022-01-07 01:03:56 -08:00
5d7cc8f22a [TensorExpr] Add some graph-rewrite passes to prepare models for AOT compilation. (#66515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66515

These passes should not be used generally as they change API of the
model's forward method, but they help experimenting with the model and
ironing out all the kinks before it can be compiled properly. In the
long run ideally we should provide a better way to enable such
experiments.

Differential Revision: D31590862

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 74ded34c6c871d4cafa29f43dc27c7e71daff8fc
2022-01-07 01:03:53 -08:00
cdbf83b0c3 [TensorExpr] Add helper passes for AOT pipeline. (#66514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66514

These passes will
1) help us analyze the graph before trying to compile it
and report errors upfront if it's not possible,
2) fill in missing strides/dtype/device info in JIT IR. Ideally, this
should be done by a dedicated JIT pass, but until it's available, we'll
be using a hack-around defined here.

Differential Revision: D31590860

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Pulled By: ZolotukhinM

fbshipit-source-id: fe8fdefbeacae8079958dd0b4b27809cc0acb34b
2022-01-07 01:02:31 -08:00
a311cfa800 Revert D33460427: [pytorch][PR] [rnn/gru] : no batch dim
Test Plan: revert-hammer

Differential Revision:
D33460427 (6eba936082)

Original commit changeset: c64d9624c305

Original Phabricator Diff: D33460427 (6eba936082)

fbshipit-source-id: 9a5000e202c5f383b03dd6caad9399e46e4ce80e
2022-01-06 23:37:28 -08:00
1622546050 use irange for loops (#70248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70248

Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;x++)
```
to the format
```
for(const auto var: irange(xmax))
```

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, and with a number of reversions and unused-variable suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32813863

fbshipit-source-id: 527244b4a2b220fdfe7f17dee3599603f492a2ca
2022-01-06 23:14:29 -08:00
36d9e03ab7 Reserve vector in gather_ranges_to_dense_op.h (#70478)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70478

Test Plan: Sandcastle

Reviewed By: xw285cornell

Differential Revision: D33339890

fbshipit-source-id: 50330e18e344f872d03f146cea0ed11eef4f506e
2022-01-06 23:10:28 -08:00
df6eb9bbab Fixed to_folder not saving dtype (#69983)
Summary:
As above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69983

Reviewed By: pbelevich, ngimel

Differential Revision: D33466529

Pulled By: Chillee

fbshipit-source-id: 2d2f0ad5b8e2492aba4c19fa034c8b6c0848a568
2022-01-06 22:15:56 -08:00
23f902f7e4 Fix incorrect variable in autograd docs (#70884)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68362.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70884

Reviewed By: mruberry

Differential Revision: D33463331

Pulled By: ngimel

fbshipit-source-id: 834ba9c450972710e0424cc92af222551f0b4a4a
2022-01-06 20:53:10 -08:00
22f5280433 add very dumb retry to ecr gc 2022-01-06 20:29:39 -08:00
c18e6b790e Adding elu,selu,softsign support for fx2trt (#70811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70811

Add support of the above ops in fx2trt.

Reviewed By: 842974287

Differential Revision: D33407911

fbshipit-source-id: 8c635ddbd1cae6b0a0a04d345b0e0347111a6619
2022-01-06 19:42:24 -08:00
70b18b9511 Fix comment indentation issue (#70227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70227

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33251107

Pulled By: tugsbayasgalan

fbshipit-source-id: 293ffe5dde38480ea13963a2d7e1eb99dc594d22
2022-01-06 19:14:39 -08:00
32bf5e0ef9 Add native impl of gelu for QuantizedCPU (#69968)
Summary:
Add native implementation of gelu for quantized CPU.

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69968

Reviewed By: ejguan

Differential Revision: D33187095

Pulled By: vkuzo

fbshipit-source-id: 4c4bf0eb47d2d9c2b8827174f2ccdea41986148a
2022-01-06 19:01:26 -08:00
6eba936082 [rnn/gru] no batch dim (#70442)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60585

TODO:
* [x] Doc updates
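
For reference, a minimal sketch of the unbatched call this enables:

```
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20)

# Unbatched input: (seq_len, input_size) instead of (seq_len, batch, input_size)
x = torch.randn(5, 10)
out, h = gru(x)
print(out.shape)  # torch.Size([5, 20])
print(h.shape)    # torch.Size([1, 20])
```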

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70442

Reviewed By: zou3519

Differential Revision: D33460427

Pulled By: jbschlosser

fbshipit-source-id: c64d9624c305d90570c79d11a28557f9ec667b27
2022-01-06 18:39:09 -08:00
880a5b9ea6 [PyTorch] Move prim string ops to JIT op registry (#70501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70501

This PR migrates prim string ops to be registered into the JIT op registry instead of the dispatcher. Since the implementations of these ops are backend-agnostic, there's no need to go through the dispatcher. Relying on `test_jit_string.py` to verify the correctness of these ops. I'm also adding tests to make sure all the operators are covered.

Test Plan: Rely on `test_jit_string.py`.

Reviewed By: iseeyuan

Differential Revision: D33351638

fbshipit-source-id: ecc8359da935a32d3a31add2c395a149a0d8892f
2022-01-06 18:26:28 -08:00
ddea6980fe [PyTorch][JIT] Don't refcount Type singletons (#69579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69579

This should help us avoid reference counting overhead on singleton Type subclasses without a major rewrite of the Type subsystem.
ghstack-source-id: 146643993

Test Plan:
Ran //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark with arguments `--op empty -niter 40 --stressTestRecordFunction --captureRecordFunctionInputs` on devbig with turbo off.

Before:
```
I1206 13:47:15.037441 1201670 bench.cpp:144] Mean 0.737675
I1206 13:47:15.037463 1201670 bench.cpp:145] Median 0.736725
I1206 13:47:15.037468 1201670 bench.cpp:146] Min 0.722897
I1206 13:47:15.037473 1201670 bench.cpp:147] stddev 0.00508187
I1206 13:47:15.037482 1201670 bench.cpp:148] stddev / mean 0.00688903
```

After:
```
I1206 13:48:16.830123 1205612 bench.cpp:144] Mean 0.66988
I1206 13:48:16.830150 1205612 bench.cpp:145] Median 0.663956
I1206 13:48:16.830157 1205612 bench.cpp:146] Min 0.65986
I1206 13:48:16.830164 1205612 bench.cpp:147] stddev 0.0335928
I1206 13:48:16.830171 1205612 bench.cpp:148] stddev / mean 0.0501475
```

Static runtime startup is also improved; for CMF local_ro, time to initialize a predictor went from 10.01s to 9.59s.

(Note: I wish I had a production workload to demonstrate the advantage of this on. I tried ctr_mobile_feed local_ro net but it was neutral. Anything that manipulates types or List/Dict a lot might be promising.)

Reviewed By: suo

Differential Revision: D32923880

fbshipit-source-id: c82ed6689b3598e61047fbcb2149982173127ff0
2022-01-06 17:39:16 -08:00
e6befbe85c Add flag to optionally average output attention weights across heads (#70055)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47583
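
A sketch of the new knob, assuming the flag is named `average_attn_weights` and defaults to the old averaging behavior:

```
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)

_, avg_w = mha(x, x, x, average_attn_weights=True)    # (N, L, S), old behavior
_, head_w = mha(x, x, x, average_attn_weights=False)  # (N, num_heads, L, S)
print(avg_w.shape, head_w.shape)
```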

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70055

Reviewed By: bhosmer

Differential Revision: D33457866

Pulled By: jbschlosser

fbshipit-source-id: 17746b3668b0148c1e1ed8333227b7c42f1e3bf5
2022-01-06 17:32:37 -08:00
cc7382dd92 Enable upgraders in TS server (#70539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70538

ghstack-source-id: 146384458

Test Plan: python test/test_jit.py TestUpgraders

Reviewed By: gmagogsfm

Differential Revision: D33375195

fbshipit-source-id: 170960b409175bb987cf9dbb65ffed3283e5f6f9
2022-01-06 17:10:30 -08:00
7b8f73dd32 No-batch-dim support for ConvNd (#70506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70506

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33355034

Pulled By: jbschlosser

fbshipit-source-id: 5a42645299b1d82cee7d461826acca1c5b35a71c
2022-01-06 16:53:50 -08:00
6896b2d734 [NNC Testing] Randomized loop nest infrastructure (#70410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70410

Trying again after #70174 was reverted. Earlier the env
variable was read into a static var in C++, causing state to be retained
and causing test failures. The static qualifier is removed in this PR.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33321435

fbshipit-source-id: 6d108eb00cac9150a142ccc3c9a65a1867dd7de4
2022-01-06 16:21:42 -08:00
b7742b437a Allow RNN hidden_size to be 0 (#70556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56767.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70556

Reviewed By: ngimel

Differential Revision: D33455156

Pulled By: jbschlosser

fbshipit-source-id: 5dc57b09d7beb6ae81dfabc318e87c109bb4e6ae
2022-01-06 14:18:36 -08:00
e7602a1e30 Fix multiplication of 0-D sparse tensors (#70749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70749

Fixes https://github.com/pytorch/pytorch/issues/65396 and a clang-tidy error.

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33439136

Pulled By: cpuhrsch

fbshipit-source-id: 45ec58de7c18db183f891431d4a26e98fd0e924a
2022-01-06 13:36:46 -08:00
4fa70a2483 [pytorch] fix hipify_python (#70619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70619

This Diff improves `hipify_python`, which is needed for AMD GPUs.

Change 1:
```
if (c == "," or ind == len(kernel_string) - 1) and closure == 0:
```
This is needed to deal with the following case (ex: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/cuda_vectorized_test.cu#L111)
```
kernel<<<val, func()>>>(...)
// In this case, kernel_string is "val, func()"
// so closure gets 0 when ind == len(kernel_string) - 1.
```

Change 2:
```
mask_comments()
```
This is needed to deal with a case where "<<<" is included in a comment or a string literal (ex: https://github.com/pytorch/pytorch/blob/master/torch/csrc/deploy/interpreter/builtin_registry.cpp#L71)
```
abc = "<<<XYZ>>>"
// Though this <<<XYZ>>> is irrelevant to CUDA kernels,
// the current script attempts to hipify this and fails.
```

Test Plan:
This patch fixes errors I encountered by running
```
python3 tools/amd_build/build_amd.py
```

I confirmed, with Linux `diff`, that this patch does not change HIP code that was generated successfully with the original script.

Reviewed By: hyuen

Differential Revision: D33407743

fbshipit-source-id: bec822e040a154be4cda1c294536792ca8d596ae
2022-01-06 13:27:43 -08:00
9c455d7086 dbr quant: add limited support for torch.nn.ModuleList (#70372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70372

Enables basic support for `torch.nn.ModuleList` in DBR quant
by stopping it from being a leaf.  For now, we
require the user to check for `AutoQuantizationState` if they are
looping over the contents without any bounds checking.

In future PRs, we can explore how to solve this without requiring
user code changes.
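
A hypothetical sketch of the user-side guard described above (the class-name check stands in for however `AutoQuantizationState` is imported):

```
import torch.nn as nn

class SketchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.module_list = nn.ModuleList([nn.Linear(4, 4), nn.ReLU()])

    def forward(self, x):
        # After DBR prepare, module_list may also hold the attached
        # AutoQuantizationState bookkeeping module, so guard the loop.
        for module in self.module_list:
            if type(module).__name__ == "AutoQuantizationState":
                continue  # skip DBR bookkeeping, not one of our layers
            x = module(x)
        return x
```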

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_module_list
```

Reviewed By: VitalyFedyunin

Differential Revision: D33302329

Pulled By: vkuzo

fbshipit-source-id: 1604748d4b6c2b9d14b50df46268246da807d539
2022-01-06 13:25:13 -08:00
c3f0c77b64 dbr quant support for custom leaf modules, part 3/x (#70349)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70349

Makes sure that child modules of non traceable leaf modules
do not participate in quantization swaps.  This should feature complete
the `non_traceable_module_class` feature.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_prepare_custom_config_dict_non_traceable_module_class
```

Reviewed By: VitalyFedyunin

Differential Revision: D33296246

Pulled By: vkuzo

fbshipit-source-id: 08287429c89ee6aa42d13ca3060a74679a478181
2022-01-06 13:25:10 -08:00
423d8aabbd dbr quant: support for custom leaf modules, part 2/x (#70335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70335

Adds test case that functions are not quantized inside custom leaf modules.
No logic change needed as it already works correctly.

Note: FX scripting rewriter does not go into modules without auto-quant,
which is why we are using torch.jit.trace to look at the graph.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_prepare_custom_config_dict_non_traceable_module_class
```

Reviewed By: jerryzh168

Differential Revision: D33286370

Pulled By: vkuzo

fbshipit-source-id: 26c81c9e1ce7c4d38ddc1e318730cf1eaa25ff69
2022-01-06 13:25:07 -08:00
b12852eb41 dbr quant: support for custom leaf modules, part 1/x (#70330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70330

Starts adding support for custom leaf modules, part 1/x.
In this PR, we ensure that leaf modules and all of their children
do not get `AutoQuantizationState` objects attached to them.
The API matches prepare_fx, using the `prepare_custom_config_dict`
argument and the `non_traceable_module_class` key within that dict.

The next couple of PRs will ensure that modules and functions in
leaves do not get quantized; keeping it separate makes the PRs smaller.
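
A sketch of the config shape, mirroring the prepare_fx convention (`MyNonTraceableModule` is a stand-in for a user-defined class; the DBR prepare entry point itself is elided):

```
import torch

class MyNonTraceableModule(torch.nn.Module):  # stand-in user module
    def forward(self, x):
        return x

prepare_custom_config_dict = {
    # Modules of these classes (and all of their children) are treated as
    # leaves: no AutoQuantizationState is attached to them or their children.
    "non_traceable_module_class": [MyNonTraceableModule],
}
```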

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_prepare_custom_config_dict_non_traceable_module_class
```

Reviewed By: jerryzh168

Differential Revision: D33285310

Pulled By: vkuzo

fbshipit-source-id: 532025fda5532b420fad0a4a0847074d1ac4ad93
2022-01-06 13:25:04 -08:00
a8929c3278 dbr quant: unbreak case when child module not returning any outputs (#70329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70329

Fixes a crash in DBR when a child module does not return any tensors.
This happens sometimes in user models.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_child_module_does_not_return_tensor
```

Reviewed By: VitalyFedyunin

Differential Revision: D33285309

Pulled By: vkuzo

fbshipit-source-id: 42b8cffb5ee02ce171a3e6c64d140bb5f217225a
2022-01-06 13:25:01 -08:00
f742853838 dbr quant: support functional linear without bias (#70328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70328

Currently, functional linear without bias crashes the DBR convert step; this PR fixes it.
This unbreaks testing DBR on some customer models.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBRIndividualOps.test_linear_functional_nobias
```

Reviewed By: jerryzh168

Differential Revision: D33285311

Pulled By: vkuzo

fbshipit-source-id: 757c7270be9e3ff9cdf2609b1e426e9fd34e50ff
2022-01-06 13:24:58 -08:00
c21a540866 dbr quant: support dynamic linear (#70257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70257

Makes dynamic quantization for linear module work in DBR quant.

Coverage for more ops and functionals will be in future PRs.

Test Plan:
```
python test/test_quantization.py -k DBR
```

Reviewed By: jerryzh168

Differential Revision: D33262300

Pulled By: vkuzo

fbshipit-source-id: c1cb0f9dd3f42216ad6ba19f4222b171ff170174
2022-01-06 13:24:55 -08:00
dfb807d65e dbr quant: do not attach auto_quant_state to observers (#70256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70256

Somewhere in previous PRs we started attaching AutoQuantState
to observers. This PR removes that, as it has no purpose
and makes model debugging more complicated.

Test Plan:
```
python test/test_quantization.py -k DBR
```

Reviewed By: jerryzh168

Differential Revision: D33262299

Pulled By: vkuzo

fbshipit-source-id: a3543b44c517325d57f5ed03b961a8955049e682
2022-01-06 13:23:43 -08:00
524bbb1442 [LTC] Sync gen_lazy_tensor.py from the staging branch (#70385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70385

This commit syncs gen_lazy_tensor.py from the lazy_tensor_staging branch
to master.

Test Plan: CI in the lazy_tensor_staging branch.

Reviewed By: wconstab

Differential Revision: D33306232

Pulled By: alanwaketan

fbshipit-source-id: a15c72b22418637f851a6cd4901a9f5c4be75449
2022-01-06 13:12:37 -08:00
81b52c290f Adding leaky_relu support for fx2trt (#70799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70799

Add op support in fx2trt for leaky_relu
1. add support in acc_ops and corresponding unit test
2. add support in acc_ops_converters and corresponding unit test

Reviewed By: 842974287

Differential Revision: D33399095

fbshipit-source-id: 978340e64b35ffefabdc48273ddfa86b5ee1816e
2022-01-06 12:40:14 -08:00
19f04da21e GHA: Make WORKFLOW_ID not a concatenation of run_id and run_num (#70938)
Summary:
![image](https://user-images.githubusercontent.com/31798555/148432431-f990a26b-55d4-414e-9abd-8cdb4b4e9844.png)

Since both GITHUB_RUN_ID and GITHUB_RUN_NUM are unchanged in rerun attempts, there's little reason to track both. It ends up just being confusing and also hard to use in joins in queries.

Currently, the only places the concatenated WORKFLOW_ID are used are for our test stats jsons in S3 and in our binary size stats in Scuba, code posted respectively:
https://github.com/pytorch/pytorch/blob/master/tools/stats/print_test_stats.py#L824
https://github.com/pytorch/pytorch/blob/master/tools/stats/upload_binary_size_to_scuba.py#L58
And I don't think we use the WORKFLOW_IDs in either stats in any queries yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70938

Reviewed By: seemethere, ngimel

Differential Revision: D33458655

Pulled By: janeyx99

fbshipit-source-id: 885b125a978fa0cc51553b08b8c63d5fdcf354d0
2022-01-06 12:34:10 -08:00
10b55648f5 CI: remove unused yaml and make upload_binary_size_to_scuba script work with GHA (#70643)
Summary:
Removes unused pytorch-job-specs.yml

It looks like the recent Android GHA jobs use upload_binary_size_to_scuba.py, but a portion of the script was still using CircleCI-only variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70643

Reviewed By: ngimel

Differential Revision: D33455659

Pulled By: janeyx99

fbshipit-source-id: cfe79a674641ed3327c7650d2107ace2a5050983
2022-01-06 10:05:27 -08:00
578fe11673 [pytorch][aten][cuda] fix LpNormFunctor (#70601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70601

`&` has lower precedence than `==`, so `==` will be evaluated first. This behavior is surely unintended; this patch fixes it.

Test Plan: 🧐  Carefully check the change.

Reviewed By: hyuen

Differential Revision: D33397964

fbshipit-source-id: e3ac5b04e4688dfbf9d8ac3e5c4aa72282bf6ee9
2022-01-06 09:50:34 -08:00
c00d33033c Remove repeat test for types in test nn (#70872)
Summary:
Helps fix a part of https://github.com/pytorch/pytorch/issues/69865

The first commit just migrates everything as is.

The second commit uses the "device" variable instead of passing "cuda" everywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70872

Reviewed By: jbschlosser

Differential Revision: D33455941

Pulled By: janeyx99

fbshipit-source-id: 9d9ec8c95f1714c40d55800e652ccd69b0c314dc
2022-01-06 09:20:02 -08:00
bc514cb425 Skip distributed tests if built with USE_DISTRIBUTED=0 (#70677)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70676
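
A sketch of the kind of guard involved (the exact decorator used in the test suite may differ):

```
import unittest
import torch.distributed as dist

@unittest.skipIf(not dist.is_available(),
                 "torch.distributed unavailable (built with USE_DISTRIBUTED=0)")
class SketchDistributedTest(unittest.TestCase):
    def test_something(self):
        self.assertTrue(dist.is_available())
```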

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70677

Reviewed By: albanD

Differential Revision: D33439808

Pulled By: janeyx99

fbshipit-source-id: 7f9971eb564dbbb6625fe5f78328c3abe3808719
2022-01-06 08:55:05 -08:00
ff408fca7f Forward AD formulas for activation backwards (#70460)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70460

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33405363

Pulled By: soulitzer

fbshipit-source-id: f68b59857a609ff593e9e399b9287d58dacef9e2
2022-01-06 08:41:17 -08:00
3051aabd0e Add forward AD formulas for convolution and some others (#69956)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69956

Test Plan: Imported from OSS

Reviewed By: albanD, bdhirsh

Differential Revision: D33235974

Pulled By: soulitzer

fbshipit-source-id: ea60d687edc5d62d92f3fd3cb6640421d32c908c
2022-01-06 08:39:51 -08:00
4916a21f10 quantization: fix scale+zp serialization of quantized BatchNorm{2|3}d (#70432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70432

Scale and zero_point need to be buffers for serialization to work
on them properly.  This PR moves them to buffers.  This is BC breaking,
but the "before" state was completely broken (scale + zp were not
serialized at all) so there is no value in trying to handle it.
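
A minimal sketch of the buffer pattern this change relies on (attribute names assumed to mirror the module's):

```
import torch

class SketchQuantizedBN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Buffers, unlike plain attributes, are included in state_dict(),
        # so scale and zero_point now survive save/load round trips.
        self.register_buffer("scale", torch.tensor(1.0))
        self.register_buffer("zero_point", torch.tensor(0, dtype=torch.long))

print(SketchQuantizedBN().state_dict().keys())
# odict_keys(['scale', 'zero_point'])
```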

Test Plan:
```
python test/test_quantization.py TestStaticQuantizedModule.test_batch_norm2d_serialization
python test/test_quantization.py TestStaticQuantizedModule.test_batch_norm3d_serialization
```

Imported from OSS

Differential Revision: D33330022

Reviewed By: jerryzh168

Pulled By: vkuzo

fbshipit-source-id: 673c61f1a9f8f949fd9e6d09a4dbd9e5c9d5fd04
2022-01-06 08:26:20 -08:00
6773589a06 Drop some unused variables (#70879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70879

Sandcastle from layer_norm_kernel.cu

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D33439040

fbshipit-source-id: e7d0e37ab25d62c63f675da3b6eff670fd93b26a
2022-01-06 08:11:25 -08:00
748790588c Upgrading the loop to use irange (#70326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70326

See D24145988 for context: it allows loops such as for(int i=0;i<10;i++) to be expressed as for(const auto i : c10::irange(10)). This is nice because it auto-types the loops and adds const-safety to the iteration variable.

Test Plan: buck run //caffe2/torch/fb/sparsenn:test

Reviewed By: r-barnes

Differential Revision: D33243400

fbshipit-source-id: b1f1b4163f4bf662031baea9e5268459b40c69a3
2022-01-06 07:06:53 -08:00
b0fdca8855 Bump version number to 7 and compile old operators with old schema (#68358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68358

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33433730

Pulled By: tugsbayasgalan

fbshipit-source-id: 202c58365bae13195d3545cefcb0da9162b02151
2022-01-05 23:57:22 -08:00
8bdbe94344 Add forward compatability tests in CI (#64139)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64139

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30626912

Pulled By: tugsbayasgalan

fbshipit-source-id: 781a88386701b42e2e86daaca0a779d1fc1c4df3
2022-01-05 23:40:06 -08:00
402f2934bf Revert D33262228: Per-overload torch.ops API
Test Plan: revert-hammer

Differential Revision:
D33262228 (8e6d1738a4)

Original commit changeset: 600dbf511514

Original Phabricator Diff: D33262228 (8e6d1738a4)

fbshipit-source-id: 238fa88ea9c4f26c7511334765c07452fbca9655
2022-01-05 22:10:11 -08:00
884aa2baad ci: Make linux.*xlarge non-ephemeral (#70869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70869

Makes Linux runners non-ephemeral to reduce the number of
CreateInstance calls we make towards AWS, as well as the number of
GitHub API calls we make to create new instances.

Should help alleviate some of the queuing issues we may observe due to
AWS / GitHub rate limits.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D33436874

Pulled By: seemethere

fbshipit-source-id: b2736fb4c9d175b1b0e2efb5017dcb4a8d4c05f4
2022-01-05 22:04:21 -08:00
2367face24 Prefer maybe_multiply when multiplying by a constant (#68185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68185

As per title

We also fix the first input to `handle_r_to_c` for `rsub`, as it was
flipped for the two inputs.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684855

Pulled By: mruberry

fbshipit-source-id: ffeab8d561e657105b314a883260f00d0ae59bbf
2022-01-05 20:33:43 -08:00
1a061c7fe1 Merge index_{add,fill,copy,select} sampling (#68184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68184

This was in the TODO list, as these operations are very similar.
I did this because one of them was failing in the noncontig tests and I wanted
to make sure that all of them were tested properly, as they all appear
in each other's derivative formulas.

After this PR, these operations do pass the noncontiguous tests.

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684854

Pulled By: mruberry

fbshipit-source-id: 5db58be8d1e1fce434eab9cdf410cbf1024bbdf9
2022-01-05 20:33:40 -08:00
baeca11a21 Remove random_fullrank_matrix_distinc_singular_value (#68183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68183

We do so in favour of
`make_fullrank_matrices_with_distinct_singular_values`, as this latter
one not only has an even longer name, but also generates inputs
correctly so that they work with the PR later in this stack that
tests noncontiguous inputs.

We also heavily simplified the generation of samples for the SVD, as it was
fairly convoluted and was not generating the inputs correctly for
the noncontiguous test.

To do the transition, we also needed to fix the following issue, as it was popping
up in the tests:

Fixes https://github.com/pytorch/pytorch/issues/66856

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684853

Pulled By: mruberry

fbshipit-source-id: e88189c8b67dbf592eccdabaf2aa6d2e2f7b95a4
2022-01-05 20:33:37 -08:00
08ef4ae0bc Remove unnecessary sync in linalg.det (#67014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67014

LAPACK functions return negative infos when there was an unexpected
input. This happens (for example) when the user does not specify
matrices of the correct size. We already check all these things on the
PyTorch end, so this check, which induces a synchronization, is
unnecessary.

I also took this chance to avoid some code repetition in the computation
of the determinant of `P`, and replaced the use of `ExclusivelyOwned<Tensor>`
with regular `Tensor`s + moving into the tuple, which should be as efficient or more.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684851

Pulled By: mruberry

fbshipit-source-id: dc046d1cce4c07071d16c4e2eda36412bd734e0f
2022-01-05 20:33:34 -08:00
4d4e81d869 Make linalg.lu_factor structured (#66934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66934

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684856

Pulled By: mruberry

fbshipit-source-id: 1675448da9a8677c8420005ce753972234e7accc
2022-01-05 20:33:31 -08:00
012c38e04d Add contiguous_strides as a correct replacement of defaultStride (#67789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67789

`at::defaultStride` was added in https://github.com/pytorch/pytorch/pull/18779.
As it was noted in that PR, it differs from the actual computation of
the default strides when one or more of the dimensions of the tensor are
zero. See https://github.com/pytorch/pytorch/pull/18779#discussion_r272296140

We add two functions, `contiguous_strides` and `contiguous_strides_vec`
which correct this issue and we replace the previous (wrong) uses of
`defaultStride`.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684852

Pulled By: mruberry

fbshipit-source-id: 62997a5a97a4241a12e73e2be2e192b80b491cb1
2022-01-05 20:33:28 -08:00
a35b4b49d2 Add linalg.lu_factor (#66933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66933

This PR exposes `torch.lu` as `torch.linalg.lu_factor` and
`torch.linalg.lu_factor_ex`.

This PR also adds support for matrices with zero elements both in
the size of the matrix and the batch. Note that this function simply
returns empty tensors of the correct size in this case.

We add a test and an OpInfo for the new function.

This PR also adds documentation for this new function in line of
the documentation in the rest of `torch.linalg`.

Fixes https://github.com/pytorch/pytorch/issues/56590
Fixes https://github.com/pytorch/pytorch/issues/64014
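
For reference, a small usage sketch of the new function:

```
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)

LU, pivots = torch.linalg.lu_factor(A)

# The (LU, pivots) pair can be reused to solve multiple systems cheaply.
x = torch.lu_solve(b, LU, pivots)
print(torch.allclose(A @ x, b, atol=1e-5))  # True, up to numerical tolerance
```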

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D32834069

Pulled By: mruberry

fbshipit-source-id: 51ef12535fa91d292f419acf83b800b86ee9c7eb
2022-01-05 20:32:12 -08:00
3f53365086 define get_dot_graph (#70541)
Summary:
In the [docstring](https://github.com/pytorch/pytorch/blob/master/torch/fx/passes/graph_drawer.py#L54-L60) we mention `get_dot_graph` but it is not defined, so I defined it here.
Not sure if this is preferred, or whether we should update the docstring to use `get_main_dot_graph`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70541

Test Plan:
```
            g = FxGraphDrawer(symbolic_traced, "resnet18")
            with open("a.svg", "w") as f:
                f.write(g.get_dot_graph().create_svg())
```

Reviewed By: khabinov

Differential Revision: D33378080

Pulled By: mostafaelhoushi

fbshipit-source-id: 7feea2425a12d5628ddca15beff0fe5110f4a111
2022-01-05 20:00:20 -08:00
917d56a7e4 Copy: Fix conj bit being ignored on type mismatch (#68963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68963

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33064492

Pulled By: anjali411

fbshipit-source-id: 043f927d6bfff46bf5f8ea6fce9409f250bf8ff8
2022-01-05 17:59:32 -08:00
cfc5519661 Support Sparse CSR transpose. Fix clang-tidy warnings. (#70582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70582

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33414446

Pulled By: cpuhrsch

fbshipit-source-id: dd0888d9dd3885579e853643a60d13373b5d6b15
2022-01-05 17:41:51 -08:00
3a21f38a2e Integrate multi_tensor zero_grad into Optimizer base class (#69936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69936

Currently, the optimizers in `torch/optim/_multi_tensor/` all override the base Optimizer class' implementation of `zero_grad` with the same foreach zero_grad implementation (e.g. [here](https://github.com/pytorch/pytorch/blob/master/torch/optim/_multi_tensor/adadelta.py#L93-L114)). There is a TODO that indicates that this should be refactored to the base class once the foreach ops are in good shape. This PR is intended to address that TODO.
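
A minimal sketch of the foreach-based pattern being consolidated (assuming the `torch._foreach_zero_` op; the real method also handles details such as `set_to_none`):

```
import torch

def sketch_zero_grad(params):
    # Collect the existing grads and zero them with one fused foreach op
    # instead of a separate .zero_() call per parameter.
    grads = [p.grad for p in params if p.grad is not None]
    if grads:
        torch._foreach_zero_(grads)
```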

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D33346748

Pulled By: mikaylagawarecki

fbshipit-source-id: 6573f4776aeac757b6a778894681868191a1b4c7
2022-01-05 15:46:23 -08:00
8e6d1738a4 Per-overload torch.ops API (#67254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67254

Fixes https://github.com/pytorch/pytorch/issues/65997

TODO: disallow `default` as an overload name for aten operators.

BC breaking:
`output = torch.ops._test.leaky_relu(self=torch.tensor(-1.0))` now fails with the error `TypeError: __call__() got multiple values for argument 'self'` since we call into `OpOverloadBundle`'s `__call__` method that has `self` bound to it as its first argument.
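
For reference, a small sketch of the resulting API surface (using `aten::add`, which has a `Tensor` overload):

```
import torch

x = torch.randn(3)

y1 = torch.ops.aten.add(x, x)         # the overload packet, as before
y2 = torch.ops.aten.add.Tensor(x, x)  # new: address one overload directly
print(torch.equal(y1, y2))  # True
```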

cc ezyang gchanan

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33262228

Pulled By: anjali411

fbshipit-source-id: 600dbf511514ea9b41aea3e6b1bc1102dab08909
2022-01-05 15:17:41 -08:00
f9e1a1c97f Increase tolerance for test_adadelta (#69919)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69919

Reviewed By: cpuhrsch

Differential Revision: D33286427

Pulled By: jbschlosser

fbshipit-source-id: a2ca90683c14b6669f9b1804881ac675ba925fc5
2022-01-05 15:02:10 -08:00
ce409d8f50 docs: clarify smooth l1 == l1 when beta == 0 (#70673)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68558.
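
A quick numerical check of the clarified claim:

```
import torch
import torch.nn.functional as F

x, y = torch.randn(4), torch.randn(4)

# With beta == 0, smooth L1 loss degenerates to plain L1 loss.
print(torch.allclose(F.smooth_l1_loss(x, y, beta=0.0), F.l1_loss(x, y)))  # True
```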

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70673

Reviewed By: albanD

Differential Revision: D33430267

Pulled By: jbschlosser

fbshipit-source-id: db92187ff4f2799b19a6c4a5a6b653e9211c3aca
2022-01-05 14:35:35 -08:00
2431218ee4 Jiterates more ops (#70663)
Summary:
This PR jiterates:

- lcm
- i0e
- i1e
- ndtri
- erfcx
- digamma
- trigamma
- lgamma

It also adds TODOs to jiterate `kaiser_window`, `igamma`, `igammac` and `polygamma`, but jiterating those ops requires more features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70663

Reviewed By: ngimel

Differential Revision: D33420854

Pulled By: mruberry

fbshipit-source-id: 6f32ac3cf24eda051bf19b6d20e94cdf81f50761
2022-01-05 13:57:25 -08:00
a5bc44422a [PyTorch] Remove the List/Dict move operations (#69370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69370

These operations are likely slower than copying because they perform a heap allocation and a reference count bump, whereas copying is just a reference count bump. This diff is up to see 1) if anything breaks and 2) if we can measure any improvements.
ghstack-source-id: 146468907

Test Plan:
Ran //sigrid/lib/features/tests:pytorch_feature_conversion_benchmark before/after

```
swolchok@devbig032 ~/f/fbcode> for x in (seq 5); sudo scripts/bertrand/noise/denoise.sh /tmp/pytorch_feature_conversion_benchmark.Dec7Stable ; end
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.43us  410.68K
PyTorchFeatureConversionIdListBenchmark                      3.74us  267.65K
PyTorchFeatureConversionIdScoreListBenchmark                 4.98us  200.81K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.43us  410.75K
PyTorchFeatureConversionIdListBenchmark                      3.75us  266.92K
PyTorchFeatureConversionIdScoreListBenchmark                 4.98us  200.97K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.44us  410.43K
PyTorchFeatureConversionIdListBenchmark                      3.75us  266.75K
PyTorchFeatureConversionIdScoreListBenchmark                 5.04us  198.23K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.43us  411.17K
PyTorchFeatureConversionIdListBenchmark                      3.74us  267.60K
PyTorchFeatureConversionIdScoreListBenchmark                 5.00us  199.84K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.44us  410.19K
PyTorchFeatureConversionIdListBenchmark                      3.73us  267.89K
PyTorchFeatureConversionIdScoreListBenchmark                 4.96us  201.46K
============================================================================
swolchok@devbig032 ~/f/fbcode> for x in (seq 5); sudo scripts/bertrand/noise/denoise.sh /tmp/pytorch_feature_conversion_benchmark.Dec8RemoveListAndDictMove ; end
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.47us  405.12K
PyTorchFeatureConversionIdListBenchmark                      3.60us  278.07K
PyTorchFeatureConversionIdScoreListBenchmark                 4.87us  205.44K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.45us  407.39K
PyTorchFeatureConversionIdListBenchmark                      3.63us  275.56K
PyTorchFeatureConversionIdScoreListBenchmark                 4.95us  202.17K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.47us  405.49K
PyTorchFeatureConversionIdListBenchmark                      3.63us  275.58K
PyTorchFeatureConversionIdScoreListBenchmark                 4.88us  205.05K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.52us  396.13K
PyTorchFeatureConversionIdListBenchmark                      3.59us  278.29K
PyTorchFeatureConversionIdScoreListBenchmark                 4.88us  204.94K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.46us  406.77K
PyTorchFeatureConversionIdListBenchmark                      3.62us  276.17K
PyTorchFeatureConversionIdScoreListBenchmark                 4.92us  203.07K
============================================================================
```

Reviewed By: suo, hlu1

Differential Revision: D32836701

fbshipit-source-id: 6e1c3d81f1b4ee13156320263dac17f5256c1462
2022-01-05 13:49:22 -08:00
b283b1de39 Cleaning code in fbcode/caffe2/c10/core/TensorImpl.h (#70588)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70588

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D33399751

fbshipit-source-id: 3e507973f7a8f58635f3446650e85d0f959254c0
2022-01-05 13:40:59 -08:00
395f853770 Parallelize docker dependency builds (#70866)
Summary:
Those scripts are run on 8 vCPU instances, so passing at least `-j6` makes sense

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70866

Reviewed By: atalman

Differential Revision: D33435083

Pulled By: malfet

fbshipit-source-id: c879ed928da0b77346a92976d2fe9ad92ba01b5e
2022-01-05 13:34:27 -08:00
be298212a6 reduce igamma instantiations (#70666)
Summary:
Don't compile scalar versions of the kernel (there is no scalar overload), and combine the igamma and igammac kernels.
Igamma cubin size drops from 10 MB to 2 MB on V100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70666

Reviewed By: malfet

Differential Revision: D33431359

Pulled By: ngimel

fbshipit-source-id: 440998f751251be274f40dd035efba08b8969192
2022-01-05 13:06:24 -08:00
6c4437118b Deprecating Python 3.6 (#70493)
Summary:
Deprecates Python 3.6 in the documentation and in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70493

Reviewed By: suo

Differential Revision: D33433118

Pulled By: atalman

fbshipit-source-id: c3adc7b75714efdb5b6acda5d4cddc068fb4a145
2022-01-05 11:46:32 -08:00
025cd69a86 [AMD] Fix some legacy hipify script (#70594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70594

Pull Request resolved: https://github.com/facebookincubator/gloo/pull/315

Fix some outdated hipify scripts:
* python -> python3 (fb internal)
* rocblas return code
* gloo makefile for hip clang

Test Plan: Sandcastle + OSS build

Reviewed By: malfet, shintaro-iwasaki

Differential Revision: D33402839

fbshipit-source-id: 5893039451bcf77bbbb1b88d2e46ae3e39caa154
2022-01-05 11:34:25 -08:00
34c49d3d3b Document torch.quantile interpolation kwarg (#70637)
Summary:
clone of https://github.com/pytorch/pytorch/pull/59397

This PR documents the `interpolation` kwarg added in https://github.com/pytorch/pytorch/issues/49267. Now that the forward compatibility period is over, we can expose this parameter.
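
A small usage sketch of the now-documented kwarg (the list of modes is assumed from the NumPy-style choices):

```python
import torch

x = torch.arange(5, dtype=torch.float)  # 0., 1., 2., 3., 4.

# interpolation controls how a quantile that falls between two data
# points is computed.
for mode in ("linear", "lower", "higher", "nearest", "midpoint"):
    print(mode, torch.quantile(x, 0.4, interpolation=mode).item())
```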

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70637

Reviewed By: jbschlosser

Differential Revision: D33411707

Pulled By: anjali411

fbshipit-source-id: f5f2d0a6739b3a855bbdf58fc671ac2f0342ce69
2022-01-05 11:02:13 -08:00
616afcf981 [jit] [shape analysis] Move constant tensors out of fused subgraphs during generalization (#70320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70320

ghstack-source-id: 146514368

Test Plan: `buck test mode/dev-nosan //caffe2/test/cpp/jit:jit`

Reviewed By: eellison

Differential Revision: D33280508

fbshipit-source-id: fe4291d7c49f0a498b330de96b698e99f6f6a505
2022-01-05 10:19:14 -08:00
b60b1b100f Set cuDNN deterministic flag for test_conv_double_backward_cuda (#69941)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69941

Reviewed By: george-qi

Differential Revision: D33430727

Pulled By: jbschlosser

fbshipit-source-id: 4a250bd0e5460ee631730afe0ab68ba72f37d292
2022-01-05 10:05:56 -08:00
93c7504438 [PyTorch] Improve StorageImpl::set_data_ptr (#65432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65432

There is no reason to do an extra write to the input DataPtr (via `std::swap`) before returning a new DataPtr.
ghstack-source-id: 146471376

Test Plan:
Inspected assembly for this function to verify that we are
really getting fewer instructions generated. I don't have a specific
application for this at the moment, but it's clearly better IMO.

Reviewed By: mikeiovine

Differential Revision: D31097807

fbshipit-source-id: 06ff6f5fc675df0f38b0315b4147ed959243b6d0
2022-01-05 09:46:35 -08:00
70d3b2700f [LTC] Fix stride accessors in LTCTensorImpl (#70623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70623

Strides on lazy tensor should only be read after calling setup_size_properties. This fixes a failure in hf_Longformer.

Test Plan: CI on the lazy_tensor_staging branch

Reviewed By: wconstab, alanwaketan

Differential Revision: D33410142

Pulled By: desertfire

fbshipit-source-id: ccb2ba8d258bdb88f6b51be6196563f9c4c06cbf
2022-01-05 09:31:41 -08:00
6f473c80a5 Enable fx2trt CI test (#70658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70658

The '--exclude-distributed-test' config was intended to disable the fx2trt test in the normal Docker test suite, but the test is auto-disabled now. Remove the config.

Test Plan:
CI
https://github.com/pytorch/pytorch/actions/runs/1656375648

Reviewed By: houseroad

Differential Revision: D33417803

fbshipit-source-id: 9dfb4cbd6fa9ad18a4be989ee86d1f8a298347f9
2022-01-05 09:28:58 -08:00
4cbe140ec5 Add CI config to test USE_PER_OPERATOR_HEADERS=0 (#69907)
Summary:
The CMake build defaults to `USE_PER_OPERATOR_HEADERS = 1` which
generates extra headers in the `ATen/ops` folder that don't exist
otherwise. In particular, fb-internal builds using buck don't support
these headers and so all includes must be guarded with
`#ifdef AT_PER_OPERATOR_HEADERS`.

This adds a CI run which builds with `USE_PER_OPERATOR_HEADERS = 0` so
open source contributions don't have to wait for their PR to be
imported to find out it doesn't work in fb-internal. This flag
shouldn't affect runtime behavior though, so I don't run any tests.

cc seemethere malfet pytorch/pytorch-dev-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69907

Reviewed By: malfet, atalman

Differential Revision: D33411864

Pulled By: seemethere

fbshipit-source-id: 18b34d7a83dc81cf8a6c396ba8369e1789f936e9
2022-01-05 09:18:06 -08:00
e1e43c4e71 Prevent sum overflow in broadcast_object_list (#70605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605

broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
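
A standalone sketch of the failure mode (illustrative only, not the actual collective code):

```python
import torch

# Two object lengths whose sum exceeds INT32_MAX.
lengths = torch.tensor([2**31 - 100, 200], dtype=torch.long)

total_as_int32 = lengths.sum().to(torch.int32)  # wraps to a negative value
total_as_int64 = lengths.sum()                  # stays correct

print(total_as_int32.item())  # negative, like the -2147482417 in the error below
print(total_as_int64.item())  # 2147483748
```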

Test Plan:
Add a Tensor with a >2GB storage requirement (in distributed_test.py) to the object broadcast.

This Tensor is only added if the tests are running at Meta, as the GitHub runners will OOM.

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test  mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object

Reviewed By: r-barnes

Differential Revision: D33405741

fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
2022-01-05 09:07:39 -08:00
8ba27c576c Upgrade CI to ROCm4.5.2 (#69886)
Summary:
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69886

Reviewed By: albanD, seemethere

Differential Revision: D33429299

Pulled By: malfet

fbshipit-source-id: c3d6d9e45e30d0149b04e59ea255d88bc0e933f2
2022-01-05 08:48:46 -08:00
20489ebdc9 Increase tensor size for mem check tests (#70603)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70603

Reviewed By: mruberry

Differential Revision: D33410439

Pulled By: janeyx99

fbshipit-source-id: e94615ece6d0fdf230de5297118678b70f34a18c
2022-01-05 08:27:48 -08:00
1aa98c7540 [docs] multi_head_attention_forward no-batch dim support (#70590)
Summary:
No-batch-dim support was added in https://github.com/pytorch/pytorch/issues/67176.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70590

Reviewed By: VitalyFedyunin

Differential Revision: D33405283

Pulled By: jbschlosser

fbshipit-source-id: 86217d7d540184fd12f3a9096605d2b1e9aa313e
2022-01-05 08:26:25 -08:00
e228b71dae remove unnecessary skips in rsub OpInfo (#69973)
Summary:
Skips are unnecessary as https://github.com/pytorch/pytorch/issues/53797 was fixed

Thanks Lezcano for finding the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69973

Reviewed By: mruberry

Differential Revision: D33161663

Pulled By: anjali411

fbshipit-source-id: 06b75bc5fc0cf90239f17835c07b86b2282ec846
2022-01-05 08:22:38 -08:00
216ae7bc91 [docs] Transformer: no batch dim support doc update (#70597)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70597

Reviewed By: VitalyFedyunin

Differential Revision: D33405284

Pulled By: jbschlosser

fbshipit-source-id: 04f37e8b9798ded7fcedac48629645843a0e3a28
2022-01-05 08:20:51 -08:00
5543b7ce16 Fix docstring for nn.Softplus (#70576)
Summary:
Fixes nn.Softplus' docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70576

Reviewed By: VitalyFedyunin

Differential Revision: D33407444

Pulled By: albanD

fbshipit-source-id: 7f1f438afb1a1079d30e0c4741aa609c5204329f
2022-01-05 08:12:15 -08:00
657a7e74ed Fix docstring for nn.Tanh (#70577)
Summary:
Fixes nn.Tanh's docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70577

Reviewed By: VitalyFedyunin

Differential Revision: D33408564

Pulled By: albanD

fbshipit-source-id: 2008cb55ef72b4b057d8d68e4505956aaf6cc3fa
2022-01-05 07:56:57 -08:00
adceb13da1 Copy: Avoid extra dispatch in type-mismatch case (#68950)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68950

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33064447

Pulled By: anjali411

fbshipit-source-id: 82bf4e144c1e629e30226eedc9d26ca63cfb4431
2022-01-05 07:32:47 -08:00
e1aa5db108 Bazel: Only run ATen codegen once (#70147)
Summary:
Due to a merge conflict, the new bazel cuda build does something
rather obnoxious. It runs ATen codegen with `--per-operator-headers`
enabled and extracts a subset of the generated files; then calls it
again without the flag to extract the CUDA files.

This PR instead calls the codegen once but keeps track of what is
CPU and what is CUDA in separate lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70147

Reviewed By: VitalyFedyunin

Differential Revision: D33413020

Pulled By: malfet

fbshipit-source-id: 4b502c38a209d1aa63d715e2336df6fc5aac2212
2022-01-05 06:56:52 -08:00
1681323ddc DOC: Merge extraheader block from theme instead of override (#70187)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70185

The extraheader block in docs/source/_templates/layout.html overrides the one from the pytorch theme. The theme block adds Google Analytics, so analytics were missing from the `master` documentation. This came up in PR pytorch/pytorch.github.io#899.

brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70187

Reviewed By: bdhirsh

Differential Revision: D33248466

Pulled By: malfet

fbshipit-source-id: b314916a3f0789b6617cf9ba6bd938bf5ca27242
2022-01-05 06:42:38 -08:00
aea3d3ced7 dbr quant: stop calling eager quant convert (#70247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70247

Stops calling the eager mode quantization `convert` function
from DBR quant convert, and instead implements the module swaps
manually.  This will make it easier to support quantization types
other than static int8 in future PRs.

Test Plan:
```
python test/test_quantization.py -k DBR
```

Reviewed By: jerryzh168

Differential Revision: D33255924

Pulled By: vkuzo

fbshipit-source-id: afdfd61d71833d987bb38aa4d8c3d214f900c03e
2022-01-05 06:36:44 -08:00
4e90fa6a8c dbr quant: break up test class into multiple classes (#70246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70246

Breaks up the large `TestQuantizeDBR` test case into
1. `TestQuantizeDBRIndividualOps` for testing functionality of ops
2. `TestQuantizeDBRMultipleOps` for testing non-fusion interactions between ops
3. `TestQuantizeDBR` for everything else

We may need to refactor this more in the future, but this should
unblock things for the near future.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
python test/test_quantization.py TestQuantizeDBRIndividualOps
python test/test_quantization.py TestQuantizeDBRMultipleOps
```

Reviewed By: jerryzh168

Differential Revision: D33255925

Pulled By: vkuzo

fbshipit-source-id: 82db1a644867e9303453cfedffed2d81d083c9cd
2022-01-05 06:36:41 -08:00
5b20052857 dbr quant: start recording ops which are not quantizeable (#70200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70200

Adds the logic to record not just the subgraphs which are quantizeable,
but also the set of ops (outside of subgraphs) which are not quantizeable. This changes
the information recorded during tracing as follows (an example):

```
// before
1. subgraph of conv1 -> conv2
2. no other information about other ops

// after
1. subgraph of conv1 -> conv2
2. set of types of ops which were not quantizeable but were encountered during tracing
```

This has two uses:
1. easier development of DBR quant to cover more ops, as now the ops which are not being quantized are easier to inspect
2. easier understanding for the user of what DBR quant is doing or not doing for a model

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_unsupported_ops_recorded
```

Reviewed By: VitalyFedyunin

Differential Revision: D33240997

Pulled By: vkuzo

fbshipit-source-id: 3168eae286387e6cb01df3ae60dc13620fb784d5
2022-01-05 06:36:38 -08:00
80e685e2c0 dbr quant: start reusing static quant module mappings (#70196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70196

Deletes the custom DBR static quant module mapping, and reuses
the global ones.

Test coverage for all the ops will be in future PRs.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33240998

Pulled By: vkuzo

fbshipit-source-id: da248b28d7b681794fa0494ff31fd065680f6fef
2022-01-05 06:35:11 -08:00
45f5a3ceb8 Fix generating files for Vulkan on Windows (#69696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69696

Using `find` is not portable as it won't be there on Windows for example. We can use `glob` with the recursive option added in Python 3.5 instead.
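
For illustration, a recursive `glob` along these lines can replace a shell `find` invocation (the pattern and path here are assumptions, not the actual build paths):

```python
import glob

# '**' with recursive=True (Python 3.5+) walks subdirectories portably,
# including on Windows where `find` is unavailable.
shader_files = glob.glob("aten/src/ATen/native/vulkan/glsl/**/*.glsl", recursive=True)
print(shader_files)
```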

Test Plan: CircleCI

Reviewed By: xta0

Differential Revision: D32994229

fbshipit-source-id: 4a755c4313300142c051f533d0b3876dc9035da0
2022-01-05 05:32:13 -08:00
c468e35d83 [caffe2] don't use __FUNCSIG__ when building for Windows with clang (#70561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70561

When building with strict(er) compiler warnings on Windows, clang complains that `__FUNCSIG__` is a proprietary language extension. When using clang, it seems we can use `__PRETTY_FUNCTION__` instead, like we do on other platforms. This is also in line with the logic on L100:127.

Test Plan: CI

Reviewed By: kalman5

Differential Revision: D33386400

fbshipit-source-id: d45afa92448042ddcd1f68adc7a9ef4643276b31
2022-01-04 23:44:56 -08:00
12653be434 [PyTorch] Optimize no input NVTX collection (#70133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70133

We were creating a `stringstream` and doing string concatenations via `getNvtxStr` even when there were no inputs, wasting precious time. This diff avoids the `stringstream` when there is no input, to squeeze out performance: a 60% reduction in overhead.

Test Plan:
Before
```
I1214 22:48:07.964118 2971180 bench.cpp:154] Mean 0.970494
I1214 22:48:07.964139 2971180 bench.cpp:155] Median 0.969054
I1214 22:48:07.964144 2971180 bench.cpp:156] Min 0.962247
I1214 22:48:07.964148 2971180 bench.cpp:157] stddev 0.00774841
I1214 22:48:07.964154 2971180 bench.cpp:158] stddev / mean 0.00798398
```

After
```
I1214 22:59:00.039872 3437853 bench.cpp:154] Mean 0.384333
I1214 22:59:00.039896 3437853 bench.cpp:155] Median 0.384886
I1214 22:59:00.039899 3437853 bench.cpp:156] Min 0.370235
I1214 22:59:00.039902 3437853 bench.cpp:157] stddev 0.00435907
I1214 22:59:00.039907 3437853 bench.cpp:158] stddev / mean 0.0113419
```

Reviewed By: aaronenyeshi, robieta

Differential Revision: D33137501

fbshipit-source-id: ce0e8cf9aef7ea22fd8aed927e76be4ca375efc3
2022-01-04 23:40:22 -08:00
44283c2766 NNAPI: Add qint16 support via int16 (#70621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70621

PyTorch doesn't have support for qint16 yet. Add an option to handle qint16 via the int16 & qint32 data types.

* For qint16 tensors in NNAPI, the user sends a qint32 tensor. We convert the qint32 to int16 for the converter and set the zero point and scale for NNAPI
    * inputs to the model have to have a fixed scale and zero point and are only supported for testing
* Added a flag use_int16_for_qint16 which will be used to maintain backwards compatibility in the converter when true qint16 is supported in PyTorch
ghstack-source-id: 146507483

Test Plan: pytest test/test_nnapi.py

Reviewed By: dreiss

Differential Revision: D33285124

fbshipit-source-id: b6376fa1bb18a0b9f6a18c545f600222b650cb66
2022-01-04 23:12:38 -08:00
10b40acbdb [PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122

See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.

If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
ghstack-source-id: 146482212

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

--do_profile output for local_ro (updated Dec 10):

```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```

for local: (also Dec 10):

```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)

==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```

Reviewed By: hlu1

Differential Revision: D32318674

fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
2022-01-04 22:36:13 -08:00
4d8fc8693c [PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223

ghstack-source-id: 146482215

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

Reviewed By: hlu1

Differential Revision: D31776259

fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
2022-01-04 22:36:10 -08:00
1507ce90b2 [PyTorch][Static Runtime] Avoid managed output tensor DCHECK (#67221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67221

Update memory leak checks to not require that output tensors are cleaned up.
ghstack-source-id: 146464297

Test Plan: Tests should still pass;  reviewers to confirm that this is OK in principle

Reviewed By: d1jang

Differential Revision: D31847567

fbshipit-source-id: bb7ff2f2ed701e2d7de07d8032a1281fccabd6a9
2022-01-04 22:36:07 -08:00
99a10c371f [PyTorch][Static Runtime] Fix dtype changing between iterations for to() (#67394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394

ghstack-source-id: 146464294

Test Plan:
Added new test, which failed but now passes.

Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral

```
Stable, local
========================================

I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1
I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081
I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636
I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557
I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982
I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795
I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164
0

FixToDtypeChanges, local
========================================
I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66
I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257
I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847
I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.18688. Iters per second: 161.632
I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787
I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706
```

Reviewed By: hlu1

Differential Revision: D31972722

fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15
2022-01-04 22:34:49 -08:00
ab7d0df449 Support cloning CSR tensors (#70581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70581

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33413992

Pulled By: cpuhrsch

fbshipit-source-id: 3a576d2c2f26d1edcc8f6932b2dbe2c7c11e9593
2022-01-04 21:41:18 -08:00
d1dbcb1780 Change to use current LLVM APIs (#70625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70625

In llvm-13, deprecated APIs were removed. These APIs were just wrappers around APIs present in llvm-9+. Changed to use the underlying APIs.

Test Plan: buck build mode/opt-clang-thinlto -j 70 unicorn/topaggr:top_aggregator_server -c unicorn.hfsort="1" -c cxx.extra_cxxflags="-Wforce-no-error -fbracket-depth=300" -c cxx.profile="fbcode//fdo/autofdo/unicorn/topaggr/top_aggregator_server:autofdo" -c cxx.modules=False

Reviewed By: WenleiHe

Differential Revision: D33169593

fbshipit-source-id: c8923991b351a893ef8f6c0d01858149b63c0d33
2022-01-04 20:25:58 -08:00
f8eaebc978 Avoid adding torch::deploy interpreter library to the data section (#70208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70208

Create a custom section ".embedded_interpreter" to store the interpreter instead of putting it in .data, which increases the amount of memory that can be used by the other sections of the executable (such as .text/.data/.bss) by 33% (1.5GB -> 2.0GB). This also removes memory limitations on the interpreter and some tech debt.

Test Plan:
buck test mode/opt //caffe2/torch/csrc/deploy:test_deploy
readelf -S ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/test_deploy
check the size of the .data section
Apply the fix and check the size of the .data section again. It should be reduced by the size of the interpreter.so

The output of `readelf -S ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/test_deploy` is as follows. The .data section is now 0.0015415GB and the .torch_deploy_payXXX section is 0.605125GB

```
(pytorch) [sahanp@devvm4333.vll0 ~/local/fbsource/fbcode] readelf -S buck-out/gen/caffe2/torch/csrc/deploy/test_deploy
There are 55 section headers, starting at offset 0x24bac82b0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000200350  00000350
       0000000000000028  0000000000000000   A       0     0     1
  [ 2] .note.ABI-tag     NOTE             0000000000200378  00000378
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .note.gnu.build-i NOTE             0000000000200398  00000398
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .dynsym           DYNSYM           00000000002003c0  000003c0
       0000000000d07a48  0000000000000018   A       9     1     8
  [ 5] .gnu.version      VERSYM           0000000000f07e08  00d07e08
       0000000000115f86  0000000000000002   A       4     0     2
  [ 6] .gnu.version_r    VERNEED          000000000101dd90  00e1dd90
       0000000000000510  0000000000000000   A       9    15     4
  [ 7] .gnu.hash         GNU_HASH         000000000101e2a0  00e1e2a0
       00000000003b4fb0  0000000000000000   A       4     0     8
  [ 8] .hash             HASH             00000000013d3250  011d3250
       0000000000457e20  0000000000000004   A       4     0     4
  [ 9] .dynstr           STRTAB           000000000182b070  0162b070
       0000000004ef205a  0000000000000000   A       0     0     1
  [10] .rela.dyn         RELA             000000000671d0d0  0651d0d0
       0000000000110b80  0000000000000018   A       4     0     8
  [11] .rela.plt         RELA             000000000682dc50  0662dc50
       00000000000093f0  0000000000000018   A       4    35     8
  [12] .rodata           PROGBITS         0000000006837040  06637040
       00000000034067a8  0000000000000000 AMS       0     0     64
  [13] fb_build_info     PROGBITS         0000000009c3d7f0  09a3d7f0
       00000000000002ee  0000000000000000   A       0     0     16
  [14] .gcc_except_table PROGBITS         0000000009c3dae0  09a3dae0
       00000000014a9340  0000000000000000   A       0     0     4
  [15] .eh_frame_hdr     PROGBITS         000000000b0e6e20  0aee6e20
       00000000004abf54  0000000000000000   A       0     0     4
  [16] .eh_frame         PROGBITS         000000000b592d78  0b392d78
       000000000200e344  0000000000000000   A       0     0     8
  [17] .text             PROGBITS         000000000d5a2000  0d3a2000
       000000001e55944e  0000000000000000  AX       0     0     256
  [18] .init             PROGBITS         000000002bafb450  2b8fb450
       0000000000000017  0000000000000000  AX       0     0     4
  [19] .fini             PROGBITS         000000002bafb468  2b8fb468
       0000000000000009  0000000000000000  AX       0     0     4
  [20] .never_hugify     PROGBITS         000000002bafb480  2b8fb480
       0000000000000db3  0000000000000000  AX       0     0     16
  [21] text_env          PROGBITS         000000002bafc240  2b8fc240
       0000000000002e28  0000000000000000  AX       0     0     16
  [22] .plt              PROGBITS         000000002baff070  2b8ff070
       00000000000062b0  0000000000000000  AX       0     0     16
  [23] .tdata            PROGBITS         000000002bb06000  2b906000
       0000000000000b20  0000000000000000 WAT       0     0     8
  [24] .tbss             NOBITS           000000002bb06b40  2b906b20
       0000000000007cb8  0000000000000000 WAT       0     0     64
  [25] .fini_array       FINI_ARRAY       000000002bb06b20  2b906b20
       0000000000000028  0000000000000000  WA       0     0     8
  [26] .init_array       INIT_ARRAY       000000002bb06b48  2b906b48
       0000000000008878  0000000000000000  WA       0     0     8
  [27] .data.rel.ro      PROGBITS         000000002bb0f3c0  2b90f3c0
       0000000000029ce0  0000000000000000  WA       0     0     64
  [28] .ctors            PROGBITS         000000002bb390a0  2b9390a0
       0000000000000010  0000000000000000  WA       0     0     8
  [29] .dynamic          DYNAMIC          000000002bb390b0  2b9390b0
       0000000000000340  0000000000000010  WA       9     0     8
  [30] .got              PROGBITS         000000002bb393f0  2b9393f0
       000000000001f040  0000000000000000  WA       0     0     8
  [31] .bss.rel.ro       NOBITS           000000002bb58440  2b958430
       0000000000000c40  0000000000000000  WA       0     0     32
  [32] .data             PROGBITS         000000002bb5a000  2b959000
       0000000000194188  0000000000000000  WA       0     0     4096
  [33] .tm_clone_table   PROGBITS         000000002bcee188  2baed188
       0000000000000000  0000000000000000  WA       0     0     8
  [34] .probes           PROGBITS         000000002bcee188  2baed188
       0000000000000002  0000000000000000  WA       0     0     2
  [35] .got.plt          PROGBITS         000000002bcee190  2baed190
       0000000000003168  0000000000000000  WA       0     0     8
  [36] .bss              NOBITS           000000002bcf1300  2baf02f8
       00000000005214f0  0000000000000000  WA       0     0     128
  [37] .nvFatBinSegment  PROGBITS         000000002c213000  2baf1000
       0000000000002850  0000000000000000   A       0     0     8
  [38] .nv_fatbin        PROGBITS         000000002c216000  2baf4000
       0000000052baed38  0000000000000000  WA       0     0     8
  [39] .comment          PROGBITS         0000000000000000  7e6a2d38
       00000000000001dc  0000000000000000  MS       0     0     1
  [40] .debug_aranges    PROGBITS         0000000000000000  7e6a2f20
       0000000001266c00  0000000000000000           0     0     16
  [41] .debug_info       PROGBITS         0000000000000000  7f909b20
       000000007b21de49  0000000000000000           0     0     1
  [42] .debug_abbrev     PROGBITS         0000000000000000  fab27969
       000000000179f365  0000000000000000           0     0     1
  [43] .debug_line       PROGBITS         0000000000000000  fc2c6cce
       00000000176954ac  0000000000000000           0     0     1
  [44] .debug_str        PROGBITS         0000000000000000  11395c17a
       0000000039dc32b0  0000000000000001  MS       0     0     1
  [45] .debug_ranges     PROGBITS         0000000000000000  14d71f430
       0000000026a2d930  0000000000000000           0     0     16
  [46] .debug_types      PROGBITS         0000000000000000  17414cd60
       000000000b211ff5  0000000000000000           0     0     1
  [47] .debug_loc        PROGBITS         0000000000000000  17f35ed55
       000000009ca80c7e  0000000000000000           0     0     1
  [48] .debug_macinfo    PROGBITS         0000000000000000  21bddf9d3
       000000000000151c  0000000000000000           0     0     1
  [49] .note.stapsdt     NOTE             0000000000000000  21bde0ef0
       0000000000001b3c  0000000000000000           0     0     4
  [50] .debug_macro      PROGBITS         0000000000000000  21bde2a2c
       0000000000040e6a  0000000000000000           0     0     1
  [51] .torch_deploy_pay PROGBITS         0000000000000000  21be23896
       0000000026ba5d28  0000000000000000           0     0     1
  [52] .symtab           SYMTAB           0000000000000000  2429c95c0
       00000000020ce0c8  0000000000000018          54   863985     8
  [53] .shstrtab         STRTAB           0000000000000000  244a97688
       000000000000025c  0000000000000000           0     0     1
  [54] .strtab           STRTAB           0000000000000000  244a978e4
       00000000070309c6  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)
```

Reviewed By: shunting314

Differential Revision: D33243703

fbshipit-source-id: 09a798113766c716297458cea7a74f074268dc82
2022-01-04 19:57:06 -08:00
2292520bdc Fix genSparseCSRTensor: generate non-trivial values for uint8 dtype. (#70580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70580

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33413597

Pulled By: cpuhrsch

fbshipit-source-id: 313b08e1bd96ffb8d5c7a0fda9384502325e5d08
2022-01-04 18:02:36 -08:00
29ff596dca [CUDA graphs] Changes batchnorm to increment num_batches_tracked in place for improved graph safety (#70444)
Summary:
This PR was not my worst debugging annoyance, nor my smallest in lines changed, but it has the highest `debugging annoyance/lines changed` ratio.

The current pattern
```
self.num_batches_tracked = self.num_batches_tracked + 1
```
, if captured, deletes an eagerly-allocated tensor and overwrites it with a captured tensor. Replays read from the (deallocated) original tensor's address.
This can cause
1. an IMA on graph replay
2. failure to actually increment `num_batches_tracked` during graph replay, because every replay reads from the old location without adding to it
3. numerical corruption if the allocator reassigns the original tensor's memory to some unrelated tensor
4. combinations of 1, 2, and 3, depending on global allocation patterns and if/when the BN module is called eagerly sometimes between replays

(ask me how I know).
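
A standalone sketch of the two patterns (the names are illustrative; the real code lives in the BatchNorm module):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_batches_tracked = torch.zeros(1, dtype=torch.long, device=device)

# Out-of-place: rebinds the name to a freshly allocated tensor. Under CUDA
# graph capture, replays keep reading the stale original allocation.
num_batches_tracked = num_batches_tracked + 1

# In-place: mutates the existing storage, so a captured graph replays
# correctly against the same address.
num_batches_tracked += 1  # equivalently: num_batches_tracked.add_(1)
```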

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70444

Reviewed By: albanD

Differential Revision: D33342203

Pulled By: ngimel

fbshipit-source-id: 5f201cc25030517e75af010bbaa88c452155df21
2022-01-04 17:06:46 -08:00
14457bb8cb Remove backward op for slow 3d transposed convolution (#69933)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69933

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33131343

Pulled By: jbschlosser

fbshipit-source-id: 4300c66f0f4811c949f82c62d17c7b5200cd15a3
2022-01-04 16:55:43 -08:00
1adb70c6f0 Revert D33409880: [pytorch][PR] Deprecating Python 3.6
Test Plan: revert-hammer

Differential Revision:
D33409880 (d95be99561)

Original commit changeset: 4f9123398960

Original Phabricator Diff: D33409880 (d95be99561)

fbshipit-source-id: 32dc1c3c07ef99a04fab7d0fb742cf4e6c4b718a
2022-01-04 16:37:09 -08:00
8369a46417 [maskrcnn] use stable sort in mask rcnn caffe2 ops (#70510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70510

Pull Request resolved: https://github.com/facebookresearch/detectron2/pull/3838

Pull Request resolved: https://github.com/facebookresearch/mobile-vision/pull/58

Pull Request resolved: https://github.com/fairinternal/detectron2/pull/567

D32694315 changes the implementation of sorting in NMS to a stable sort, while the C2 operators are still using a non-stable sort. This causes test failures such as:
- mobile-vision/d2go/tests:fb_test_meta_arch_rcnn - test_export_caffe2 (d2go.tests.fb.test_meta_arch_rcnn.TestFBNetV2MaskRCNNFP32) (architecture: x86_64, buildmode: dev-nosan, buildsystem: buck, compiler: clang, sanitizer: none) https://www.internalfb.com/intern/testinfra/diagnostics/7318349463675961.562949999530318.1640814509/
- mobile-vision/d2go/tests:fb_test_meta_arch_rcnn - test_export_torchscript_mobile_c2_ops (d2go.tests.fb.test_meta_arch_rcnn.TestFBNetV2MaskRCNNFP32) (architecture: x86_64, buildmode: dev-nosan, buildsystem: buck, compiler: clang, sanitizer: none) https://www.internalfb.com/intern/testinfra/diagnostics/7318349463675961.844424980844724.1640814504/

To illustrate, in the failed test_export_caffe2 test, the inputs of BoxWithNMSLimit are:
```
(Pdb) ws.FetchBlob("246")
array([[0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568, 0.01234568, 0.01234568, 0.01234568, 0.01234568,
        0.01234568]], dtype=float32)
(Pdb) ws.FetchBlob("248")
array([[ 0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,
         0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0.,
        10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10.,
        20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,  0.,  0., 10., 20.,
         0.,  0., 10., 20.,  0.,  0., 10., 20.]], dtype=float32)
(Pdb) ws.FetchBlob("249")
array([1.], dtype=float32)
```
This contains 81 boxes (representing 81 classes) with equal scores; a stable sort returns class id 0, while the non-stable sort returns class id 50.

This diff changes the sorting in the BoxWithNMSLimit op to a stable sort.
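
A minimal illustration of the tie-breaking difference, using `torch.sort`'s `stable` flag as a stand-in for the NMS-internal sort:

```python
import torch

# 81 identical scores, as in the BoxWithNMSLimit inputs above.
scores = torch.full((81,), 0.01234568)

# A stable descending sort preserves the original order among ties,
# so the top index is deterministically class id 0.
_, idx = torch.sort(scores, descending=True, stable=True)
print(idx[0].item())  # 0
```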

Test Plan:
The D2Go (401a6b682b) tests can pass after this change.
```
buck test mode/dev-nosan //mobile-vision/d2go/tests:fb_test_meta_arch_rcnn -- --run-disabled
```
https://www.internalfb.com/intern/testinfra/testrun/4785074687594820

Reviewed By: newstzpz

Differential Revision: D33355251

fbshipit-source-id: 9f3fc230b852a5e43f0e3cb8fa9093cbaf53e8b6
2022-01-04 16:33:10 -08:00
b16b444828 don't unsqueeze every stack arg if possible (#70288)
Summary:
Fixes T98738497
Use `cat` and `view` if possible, instead of unsqueezing every arg. Helps perf when there are a lot of small arguments to `stack`.
Benchmark:
```
import torch
from torch.utils.benchmark import Timer

inputs = [torch.randn([1, 128]) for _ in range(500)]

def stack_cat(inputs):
    # concatenate along dim=1, then view the result back to the stacked shape
    cat_result = torch.concat(inputs, dim=1)
    return cat_result.view([1, 500, 128])

timer_stack = Timer(stmt="torch.stack(inputs, dim=1)", globals=globals())
timer_cat = Timer(stmt="stack_cat(inputs)", globals=globals())
print("stack ", timer_stack.blocked_autorange().median)
print("cat ", timer_cat.blocked_autorange().median)
```
Before:
```
stack  0.00023390522226691247
cat  7.437262553721667e-05
```
After
```
stack  7.397504318505526e-05
cat  7.37407322973013e-05
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70288

Reviewed By: robieta, mruberry

Differential Revision: D33289789

Pulled By: ngimel

fbshipit-source-id: b57dcb8ec66e767f552c08deeba330f31ae6c3d0
2022-01-04 16:07:30 -08:00
f8f96d4858 Copy: Re-use existing neg and conj kernel implementations (#68949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68949

This reuses the existing `neg_kernel` and `conj_kernel`
implementations for copy, saving some binary size and compile time.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33064390

Pulled By: anjali411

fbshipit-source-id: eb0ee94ed3db44ae828ea078ba616365f97a7ff5
2022-01-04 15:30:31 -08:00
95a1952633 add SparseXPU to dispatch key set autogradother_backends (#70443)
Summary:
According to the dispatch table computation logic, if no kernel is
registered to a certain dispatch key, the CompositeExplicitAutograd
backend kernel will be used, so we need to add the SparseXPU key to the alias key pool.

Signed-off-by: Ma, Jing1 <jing1.ma@intel.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70443

Reviewed By: jbschlosser

Differential Revision: D33406004

Pulled By: bdhirsh

fbshipit-source-id: 009037739c818676901b10465632d3fef5ba14f2
2022-01-04 15:16:46 -08:00
a60adc7f8a fractional_max_pool2d_backward: port to structured kernel (#68245)
Summary:
Ports fractional_max_pool2d_backward to a structured kernel.

Ref https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68245

Reviewed By: jbschlosser

Differential Revision: D33405521

Pulled By: bdhirsh

fbshipit-source-id: 4930e870d4025485317208df751bc3721ecdb7eb
2022-01-04 15:15:29 -08:00
7e58b1dd7b Sets device guard in _cudnn_impl functions (#70406)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70406

Reviewed By: mruberry

Differential Revision: D33407972

Pulled By: ngimel

fbshipit-source-id: 6bf97602ea13f8eaaff95d9f412a2eeaa0e6ba10
2022-01-04 15:11:17 -08:00
6089a0f14a Extend checkout for pytorch/builder (#70644)
Summary:
https://www.torch-ci.com/minihud shows 2 recent failures due to timing out. Increasing the timeout to 30m to see if this alleviates the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70644

Reviewed By: suo, malfet, seemethere, atalman

Differential Revision: D33413604

Pulled By: janeyx99

fbshipit-source-id: 756a7ad94c589e39b8567acbfc3e769dc0b9113f
2022-01-04 14:55:47 -08:00
7b8c43cd7c Revert "Revert D32498570: make codegen'd device guards not cuda-specific. Allow them to be used in external codegen" (#69951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69951

This reverts commit 0ef523633fddf2d63e97d5028b00af10ff344561.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33113543

Pulled By: bdhirsh

fbshipit-source-id: b28073ee0870b413ea9f617f27671ae5c6f3c696
2022-01-04 14:53:21 -08:00
bb5b4cceb6 Revert "Revert D32498569: allow external backend codegen to toggle whether to generate out= and inplace kernels" (#69950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69950

This reverts commit f6cad53443704dfe5a20cc62bee14d91e3bffcaa.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33113545

Pulled By: bdhirsh

fbshipit-source-id: d6590294662588d36c09662dea65919ad4e1e288
2022-01-04 14:52:00 -08:00
d95be99561 Deprecating Python 3.6 (#70493)
Summary:
Deprecates Python 3.6 in the documentation and in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70493

Reviewed By: malfet

Differential Revision: D33409880

Pulled By: atalman

fbshipit-source-id: 4f912339896096be95b344724a4d9ae88cdf1a8f
2022-01-04 14:41:27 -08:00
4d08db0cb2 Flaky tests reporting: use GITHUB_RUN_ID instead of concatenated value (#70604)
Summary:
I did not realize the WORKFLOW_ID variable in our GHA scripts concatenated RUN_ID and RUN_NUMBER.

For flaky test collection, we should only be using RUN_ID, which makes it easier for us to write queries on the data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70604

Reviewed By: suo

Differential Revision: D33409503

Pulled By: janeyx99

fbshipit-source-id: 932405989dc1a406dfe9da9a7f513ca127c8d436
2022-01-04 14:36:13 -08:00
0ece9a49d7 Revert D33198155: Bump version number to 7 and compile old operators with old schema
Test Plan: revert-hammer

Differential Revision:
D33198155 (d35fc409ad)

Original commit changeset: 38a1185f9ecb

Original Phabricator Diff: D33198155 (d35fc409ad)

fbshipit-source-id: 411aaeb4e047aad9202db50d4d0f2ff35bc51f9d
2022-01-04 13:44:59 -08:00
61b562206b Fix docstring for nn.ELU (#70574)
Summary:
Fixes nn.ELU's docstring problem reported at https://github.com/pytorch/pytorch/issues/70498.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70574

Reviewed By: VitalyFedyunin

Differential Revision: D33404696

Pulled By: albanD

fbshipit-source-id: 1ffcba3fdeadf88a4433e9168c42ddb252e833e9
2022-01-04 13:27:59 -08:00
9cf0de509f DispatchStub: Improve type mismatch errors (#67880)
Summary:
Currently, when you register a kernel implementation to a dispatch stub,
it takes the function signature from the function pointer you pass in.
That means if you get the signature wrong, the build fails at link time
with a link error instead of failing during compilation. This also means
that when registering nullptr you need to manually specify the type.

Instead, taking the type from `DispatchStub::FnPtr` means a quicker
failure signal and better error messages. The only downside is that
you need to actually include the DispatchStub declaration, which for
some CPU kernels was missing, so I've had to add them here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67880

Reviewed By: mrshenli

Differential Revision: D33400922

Pulled By: ngimel

fbshipit-source-id: 2da22f053ef82da5db512986e5b968d97a681617
2022-01-04 11:00:47 -08:00
f64906f470 ibm z14/15 SIMD support (#66407)
Summary:
https://github.com/pytorch/pytorch/issues/66406
Implements z14/z15 vector SIMD additions. So far, all types besides bfloat have their SIMD implementation.

It has 99% coverage and is currently passing the local tests. It is concise, and the main SIMD file is only one header file. It mostly uses template metaprogramming, but a few macros are left, with the intention of not modifying PyTorch much. Sleef supports z15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66407

Reviewed By: mrshenli

Differential Revision: D33370163

Pulled By: malfet

fbshipit-source-id: 0e5a57f31b22a718cd2a9ac59753fb468cdda140
2022-01-04 09:40:18 -08:00
8dcfdf39e7 [DataPipe] Renaming FileLoader to FileOpener with deprecation warning for FileLoader (#70367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367

This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
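
A migration sketch (the exact import path, `mode` argument, and local `example.txt` file are assumptions based on the datapipes API):

```python
from torch.utils.data.datapipes.iter import IterableWrapper, FileOpener

# New name; FileLoader remains as a deprecated alias for now.
dp = FileOpener(IterableWrapper(["example.txt"]), mode="rb")
for path, stream in dp:
    print(path, stream.read())
```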

Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.

cc VitalyFedyunin ejguan NivekT pmeier Nayef211

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33301648

Pulled By: NivekT

fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
2022-01-04 09:14:50 -08:00
7c7eb351c3 Populate __name__ for torch.nn.modules.utils.{_single,_pair,...} (#70459)
Summary:
This helps with debug printouts and Python-level graph analysis.
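
For example (assuming the populated name simply matches the helper's alias):

```python
from torch.nn.modules.utils import _pair

print(_pair(3))        # (3, 3)
print(_pair.__name__)  # "_pair", useful in debug printouts and graph tools
```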

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70459

Reviewed By: wconstab

Differential Revision: D33340032

Pulled By: jansel

fbshipit-source-id: 24d3fdf31e9e5e92bb47f0db30339cf373a1d4d4
2022-01-04 08:37:12 -08:00
1150046d29 NNAPI: Add runtime flexible shapes & return shapes (#70334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70334

* Use 0 for load-time flexible shapes
* Use -1 for runtime flexible shapes
* NNAPI needs return shapes for flexible outputs

Test Plan: Tested via upcoming ops

Reviewed By: dreiss

Differential Revision: D33237922

fbshipit-source-id: 50afdd8e3c6401dfb79b4bc09513c9882a09e5d5
2022-01-04 08:37:09 -08:00
a825351c13 GHA Windows: Propagate exit code from .bat to calling bash script (#70011)
Summary:
The Windows 1st shard was silently failing to run (more details here: https://github.com/pytorch/pytorch/issues/70010) because the code to run the tests was never reached. It went unnoticed because our CI still returned green for those workflow jobs: the exit code from the batch script DID NOT PROPAGATE to the calling bash script.

The key here is that even though we have
```
if ERRORLEVEL 1 exit /b 1
```

The exit code 1 was NOT propagating back to the bash script: the `exit /b 1` was within an `if` statement, and the batch script was actually run in a cmd shell, so the bash script win-test.sh continued without erroring. Moving the `exit /b 1` to be standalone fixes it.

More details for this can be found in this stack overflow https://stackoverflow.com/a/55290133

Evidence that now a failure in the .bat would fail the whole job:
https://github.com/pytorch/pytorch/runs/4621483334?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70011

Reviewed By: seemethere, samdow

Differential Revision: D33303020

Pulled By: janeyx99

fbshipit-source-id: 8920a43fc6c4b67fecf90f3fca3908c314522cd6
2022-01-04 08:35:49 -08:00
d35fc409ad Bump version number to 7 and compile old operators with old schema (#68358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68358

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D33198155

Pulled By: tugsbayasgalan

fbshipit-source-id: 38a1185f9ecb34a33f737ad0b060b3490956300c
2022-01-04 01:31:25 -08:00
d9106116aa nnapi: Add int32 type torchscript expressions (#70197)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70197

Test Plan:
* `pytest test/test_nnapi.py`
* Testing via ops following this commit

Reviewed By: anshuljain1, dreiss

Differential Revision: D33237917

fbshipit-source-id: f0493620f28a62ad9fe0b97b67d1e25059d50c24
2022-01-03 19:00:38 -08:00
1b66915f39 Have type_parser return const reference (#70477)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70477

Test Plan: Sandcastle

Reviewed By: cccclai

Differential Revision: D33340030

fbshipit-source-id: b2a295b7c1c01e86971f6b9bbdd7d3718a2d3f0c
2022-01-03 16:18:28 -08:00
bc3246453b Added explicit build command for Windows and clarification on obtaining (#70190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70190

C++ build tools to readme.md

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33245438

Pulled By: ikriv

fbshipit-source-id: ef863d68926bd7416d0e10d24197d19392c124de
2022-01-03 14:33:59 -08:00
1e67570f3a Drop omp simd from batch_permutation_op.cc (#70579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70579

Fixes
```
     36 stderr: caffe2/caffe2/operators/batch_permutation_op.cc:25:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
      3 caffe2/caffe2/operators/batch_permutation_op.cc:25:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
```

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D33378925

fbshipit-source-id: 5ae3bfb8fadfa91a13ff0dcf5fae2ce7864ea90e
2022-01-03 08:45:50 -08:00
ab49d41bb5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33393329

fbshipit-source-id: 728d47e62e8d81c5243c62917d88e54c4b4a1db2
2022-01-02 17:30:39 -08:00
fa09099ba3 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D33336948

Pulled By: albanD

fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
2022-01-02 13:09:19 -08:00
779f41a78a [quant] Add a e2e test for standalone module + custom backend_config_dict (#70152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70152

This demonstrates that our backend_config_dict works for one of
our internal use cases.

Test Plan:
python test/fx2trt/test_quant_trt.py

Imported from OSS

Reviewed By: vkuzo, raghuramank100

Differential Revision: D33205161

fbshipit-source-id: dca8570816baaf85a79f2be75378d46c3af0e454
2022-01-02 11:20:50 -08:00
ce86881afa [quant][graphmode][fx] Add qat module mapping support in backend_config_dict (#70287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70287

This PR adds support for configuring QAT modules for fused/non-fused modules.
TODO: there are some redundant configs, especially for fused op patterns; we can clean them up later.

Test Plan:
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRTOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33274057

fbshipit-source-id: b2e6a078211320d97c41ffadd3ecedfab57e3b77
2021-12-30 23:30:34 -08:00
65faf1a7eb [fx2trt] Add version check for ProfilingVerbosity bulider config (#70286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70286

att

Test Plan:
python test/fx2trt/test_quant_trt.py

Imported from OSS

Reviewed By: soulitzer

Differential Revision: D33274058

fbshipit-source-id: c7657f9ba8b578d40d6fc1793b8b363898700eee
2021-12-30 19:59:25 -08:00
6bc06ec3c2 [PyTorch Edge][QNNPack] Tighten Step Height for Indirection Buffers (#70530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70530

```kernel_size + (output_width * step_width - 1) * kernel_height``` is more space than needed, and ```kernel_size + (output_width - 1) * step_width * kernel_height``` is just enough.
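
A quick numeric check of the two formulas above, with illustrative (assumed) dimensions:

```
kernel_height, kernel_width = 3, 9          # assumed example dims
kernel_size = kernel_height * kernel_width  # 27
output_width, step_width = 8, 3             # assumed example values

old = kernel_size + (output_width * step_width - 1) * kernel_height  # 96
new = kernel_size + (output_width - 1) * step_width * kernel_height  # 90
print(old - new)  # 6 == (step_width - 1) * kernel_height saved per buffer
```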

Test Plan: Phabricator Tests

Reviewed By: kimishpatel

Differential Revision: D32553599

fbshipit-source-id: 30f6d191705bcb25dc9bb7a91c6d7b99c3a348e5
2021-12-30 14:57:33 -08:00
7bfaa230be [nn] adaptive_avg_pool{1/2/3}d : Error on negative output_size (#70488)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70232
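
A rough sketch of the newly rejected call (exact error message may differ):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
try:
    F.adaptive_avg_pool2d(x, output_size=(-1, 4))  # negative size now raises
except RuntimeError as err:
    print("rejected:", err)
```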

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70488

Reviewed By: H-Huang

Differential Revision: D33367289

Pulled By: jbschlosser

fbshipit-source-id: 6b7b89d72c4e1e049ad6a0addb22a261c28ddb4c
2021-12-30 14:42:11 -08:00
e6c3aa3880 Remove backward ops for mkldnn convolution (#70467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70467

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33342476

Pulled By: jbschlosser

fbshipit-source-id: 9811d02b16adea0dd1dd2500261f4b3b294d2dee
2021-12-30 14:29:22 -08:00
cfc71f56e4 [quant][fx][graphmode] Support standalone module in _convert_do_not_use (#70151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70151

This supports converting an observed standalone module to a quantized standalone module
in the new convert flow (converting observers to quant-dequant operators).

Test Plan:
```
python test/test_quant_trt.py TestConvertFxDoNotUse
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33205163

fbshipit-source-id: 01ea44fb2a8ffe30bec1dd5678e7a72797bafafc
2021-12-30 12:31:03 -08:00
401a6b682b add BFloat16 support for AdaptiveAvgPool2d on CPU (#56902)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56902

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836789

Pulled By: VitalyFedyunin

fbshipit-source-id: caac5e5b15190b8010bbfbc6920aa44032208ee7
2021-12-30 11:58:37 -08:00
bc40fb5639 [Reinstate] Wishart distribution (#70377)
Summary:
Implement https://github.com/pytorch/pytorch/issues/68050
Reopened merged and reverted PR https://github.com/pytorch/pytorch/issues/68588 worked with neerajprad
cc neerajprad

Sorry for the confusion.

TODO:

- [x] Unit Test
- [x] Documentation
- [x] Change constraint of matrix variables with 'torch.distributions.constraints.symmetric' if it is reviewed and merged. Debug positive definite constraints https://github.com/pytorch/pytorch/issues/68720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70377

Reviewed By: mikaylagawarecki

Differential Revision: D33355132

Pulled By: neerajprad

fbshipit-source-id: e968c0d9a3061fb2855564b96074235e46a57b6c
2021-12-30 11:41:46 -08:00
14d3d29b16 make ProcessException pickleable (#70118)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70116

Happy to add tests if you let me know the best place to put them.

cc VitalyFedyunin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70118

Reviewed By: malfet

Differential Revision: D33255899

Pulled By: ejguan

fbshipit-source-id: 41d495374182eb28bb8bb421e890eca3bddc077b
2021-12-30 09:09:55 -08:00
9c742bea59 [PyTorch Edge][QNNPack] Enable Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69315

Uses kernels and setup modifications from earlier diffs in this stack
ghstack-source-id: 146346780

Test Plan:
**Correctness**
- Test using QNNPack Operator-Level Test:
-- Neon Kernel: As in test plan of D32217846, all tests pass
-- SSE2 Kernel: ```buck test xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:pytorch_qnnpack_test```, all tests pass
- Test by Printing Results of Model-Level Test: D32122020

**Performance**

*Operator Level tests from convolution.cc in D32217846*
||Before (V23 of D32217846, without newly added kernel)|After (V48 of D31966574, with newly added kernel)|
|--|--|--|
|depthwise 3x3x3 static|184 ms|134 ms|
|depthwise 3x3x3 runtime|181 ms|134 ms|
|depthwise 3x3x3s2 static|30 ms|22 ms|
|depthwise 3x3x3s2 runtime|30 ms|23 ms|
|depthwise 3x3x3s1x2 static|97 ms|70 ms|
|depthwise 3x3x3s1x2 runtime|96 ms|70 ms|
|depthwise 3x3x3s2x1 static|53 ms|38 ms|
|depthwise 3x3x3s2x1 runtime|53 ms|38 ms|
|depthwise 3x3x3d2 static|104 ms|74 ms|
|depthwise 3x3x3d2 runtime|103 ms|75 ms|
|depthwise 3x3x3d1x2 static|158 ms|116 ms|
|depthwise 3x3x3d1x2 runtime|157 ms|115 ms|
|depthwise 3x3x3d2x1 static|120 ms|86 ms|
|depthwise 3x3x3d2x1 runtime|120 ms|87 ms|
|depthwise 3x3x3 per channel static|182 ms|134 ms|
|depthwise 3x3x3 per channel runtime|184 ms|134 ms|
|depthwise 3x3x3s2 per channel static|30 ms|22 ms|
|depthwise 3x3x3s2 per channel runtime|31 ms|23 ms|
|depthwise 3x3x3s1x2 per channel static|95 ms|70 ms|
|depthwise 3x3x3s1x2 per channel runtime|95 ms|71 ms|
|depthwise 3x3x3s2x1 per channel static|53 ms|39 ms|
|depthwise 3x3x3s2x1 per channel runtime|55 ms|39 ms|
|depthwise 3x3x3d2 per channel static|105 ms|75 ms|
|depthwise 3x3x3d2 per channel runtime|103 ms|75 ms|
|depthwise 3x3x3d1x2 per channel static|158 ms|116 ms|
|depthwise 3x3x3d1x2 per channel runtime|158 ms|116 ms|
|depthwise 3x3x3d2x1 per channel static|118 ms|87 ms|
|depthwise 3x3x3d2x1 per channel runtime|119 ms|87 ms|

Average Change: -36.96%

(Generated with https://www.internalfb.com/intern/anp/view/?id=1371846&revision_id=291376782898627)

*Model Level Test on Synthesized Conv3d Model*

Model Details:
- 21 channels, input size: 9 x 12 x 7, kernel size: 3x3x3
- Config added in D31928710
- Model generated with https://www.internalfb.com/intern/anp/view/?id=1313660&revision_id=248658657303993

```buck run aibench:run_bench -- -b dw_conv_3d_3x3x3_big_2b.json --platform android/arm64 --framework pytorch --remote --devices Pixel-4a-11-30```

- Before (V23 of D32217846): [0.0935 ms](https://our.intern.facebook.com/intern/aibench/details/768298420366437)
- After (V48 of D31966574): [0.0665 ms](https://our.intern.facebook.com/intern/aibench/details/67271954298132)
(29% faster)

*Model Level Test on Video Model-like Inputs (provided by liyilui)*
- D33000199
- 87.5% faster

Reviewed By: kimishpatel

Differential Revision: D31966574

fbshipit-source-id: 6554a878401c1120054f6b02241456e8fb44b152
2021-12-30 08:12:10 -08:00
3d4590d16f [PyTorch Edge][QNNPack] Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69314

Implementation based off of [convolution-operator-tester.h](https://www.internalfb.com/code/fbsource/[679135d62c0a64e3d0fa0c830aa062ac28f292b8]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/convolution-operator-tester.h)

Generated files (caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/q8dwconv/*) made with
- cd caffe2/aten/src/ATen/native/quantized/cpu/qnnpack
- python3 generate-wrapper.py

The math used to compute the ```w_zyxc_ptr``` is explained here:

{F681213069}
ghstack-source-id: 146346784

Test Plan: Test when used in depthwise conv3d later in this diff stack (D31966574)

Reviewed By: kimishpatel

Differential Revision: D32261231

fbshipit-source-id: 8e793696f7c3b0e7cceda88df8099f64f3c69ac4
2021-12-30 08:12:07 -08:00
821c085c9b [PyTorch Edge][QNNPack] Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69313

Allows for depthwise conv3d with 3x3x3 kernel

Implementation based heavily off of [mp8x25-neon-per-channel.c](https://www.internalfb.com/code/fbsource/[679135d62c0a64e3d0fa0c830aa062ac28f292b8]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8dwconv/mp8x25-neon-per-channel.c) (depthwise conv2d with 5x5 kernel)

This supports per-channel convolution, but it works for non per-channel too

Generated files (caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/q8dwconv/*) made with
- cd caffe2/aten/src/ATen/native/quantized/cpu/qnnpack
- python3 generate-wrapper.py
ghstack-source-id: 146346785

Test Plan: Test when used in depthwise conv3d later in this diff stack (D31966574)

Reviewed By: kimishpatel

Differential Revision: D32074096

fbshipit-source-id: 8111926df6ecb89d88ca810deeab87b1c072f55a
2021-12-30 08:12:04 -08:00
15d443326c [PyTorch Edge][QNNPack] Depthwise Conv3d Weight Packing (#69312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69312

Enable packing weights to be compatible with depthwise specific conv3d kernels
ghstack-source-id: 146346778

Test Plan:
- Existing 2d weight packing uses do not break (phabricator tests)
- Test 3d weight packing when used in depthwise conv3d later in this diff stack (D31966574)

Reviewed By: kimishpatel

Differential Revision: D32045036

fbshipit-source-id: a2323f74f7d30d92d4ed91315f59539ecad729ec
2021-12-30 08:12:00 -08:00
db37fd3865 [PyTorch Edge][QNNPack] Depthwise Conv3d Indirection Buffer Setup (#69311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69311

Enable setting up indirection buffer to be compatible with depthwise specific conv3d kernels
ghstack-source-id: 146346788

Test Plan:
- Existing 2d indirection buffer uses do not break (phabricator tests)
- Test 3d indirection buffer when used in depthwise conv3d later in this diff stack (D31966574)

Reviewed By: kimishpatel

Differential Revision: D31999533

fbshipit-source-id: a403d8dcad6e50641b9235e0b574129b2dfb5412
2021-12-30 08:11:57 -08:00
9863cd5741 [PyTorch Edge][QNNPack] Refactor Computing Step Dimensions (#69310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69310

Extract the computation of step height and step width into a helper function and store them in the operator struct, since the same calculation was used in many places before this diff.
ghstack-source-id: 146346783

Test Plan: Phabricator tests

Reviewed By: kimishpatel

Differential Revision: D32553327

fbshipit-source-id: e5bf07416f4c1ccde9975f835767392ad7a851c1
2021-12-30 08:11:54 -08:00
cea3eba617 [PyTorch Edge][QNNPack] Operator-Level Conv3d Tests (#69309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69309

Test correctness of QNNPack Conv3d

- Add Depth dimension to ConvolutionOperatorTester
- Add tests which use it

Includes John's changes in D32388572
ghstack-source-id: 146346786

Test Plan:
Build the Test
- ```cd caffe2/aten/src/ATen/native/quantized/cpu/qnnpack```
- ```./scripts/build-android-arm64.sh```
- Test binary is outputted to ```build/android/arm64-v8a```

Run the Test
- ```test_name=convolution-test```
- ```chmod +x build/android/arm64-v8a/$test_name```
- Send the binary to android device and execute it, ex. connect to one world and ```adb push build/android/arm64-v8a/$test_name /data/local/tmp/$test_name``` then ```adb shell /data/local/tmp/$test_name```

Reviewed By: kimishpatel

Differential Revision: D32217846

fbshipit-source-id: eba200c136894461bf76b2a5416540fe8781d588
2021-12-30 08:10:34 -08:00
35251a5528 [PyTorch] Add Enum to IValue Deepcopy (#69937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69937

This enables ```export_torch_mobile_model``` compatibility with Enum IValues

Test Plan: ModuleAPITest.DeepCopyEnum

Reviewed By: gmagogsfm

Differential Revision: D33104681

fbshipit-source-id: ca2a6d259c312487fe38dd1bed33ab6b7910bc2a
2021-12-30 07:52:22 -08:00
36db501736 softplus_backward: remove output arg (#70296)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69042

Tested with OpInfo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70296

Reviewed By: mikaylagawarecki

Differential Revision: D33349227

Pulled By: albanD

fbshipit-source-id: edeb35cb19ab4434d39df93d4536cb07679218b5
2021-12-30 02:16:36 -08:00
18dd5cdba5 [Operator Versioning][Test] Use hypothesis for better test input data and broader coverage (#70263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70263

Leverage the hypothesis library, as it's a more systematic way of testing. To write a test, you need two parts:

1. A function that looks like a normal test in your test framework of choice but with some additional arguments
2. A given decorator that specifies how to provide those arguments.
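
An illustrative property test in that shape (not the actual upgrader test body; the property checked here is an assumption for demonstration):

```
import torch
from hypothesis import given, strategies as st

# Part 1: a normal-looking test with an extra argument;
# Part 2: @given supplies that argument from a strategy.
@given(st.lists(st.floats(-1e6, 1e6, allow_nan=False), min_size=1))
def test_sum_matches_python(values):
    t = torch.tensor(values, dtype=torch.float64)
    assert torch.isclose(t.sum(), torch.tensor(sum(values)))
```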
ghstack-source-id: 146344955

Test Plan:
```

buck test mode/opt //caffe2/test:jit
python test/test_jit.py TestSaveLoadForOpVersion

```

Reviewed By: iseeyuan

Differential Revision: D33244389

fbshipit-source-id: c93d23f3d9575ebcb4e927a8caee42f4c3a6939d
2021-12-29 20:43:32 -08:00
c627211651 [quant][fx][graphmode][be] Change the type for output of convert to be torch.nn.Module (#69959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69959

GraphModule is an implementation detail; we don't want to expose it in quantization APIs.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_quantized_model_type

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33119103

fbshipit-source-id: d8736ff08b42ee009d6cfd74dcb3f6150f71f3d2
2021-12-29 20:33:32 -08:00
fb78a31916 Add testing across mem_formats to ModuleInfos (#69317)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69317

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33285780

Pulled By: mikaylagawarecki

fbshipit-source-id: 1d19293e640e5581351a9c74892dcac4bcdd3f1d
2021-12-29 14:53:27 -08:00
14f4b91f6e Add Nondeterministic Tol to gradient test in test_modules (#69402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69402

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D33285781

Pulled By: mikaylagawarecki

fbshipit-source-id: f1ab43173d4f558adc943a8acefc13c34cfa5cfa
2021-12-29 14:51:56 -08:00
d2abf3f981 Added antialias flag to interpolate (CPU only, bicubic) (#68819)
Summary:
Description:
- Added antialias flag to interpolate (CPU only)
  - forward and backward for bicubic mode
  - added tests

Previous PR for bilinear, https://github.com/pytorch/pytorch/pull/65142
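
Minimal usage sketch of the new flag (sizes borrowed from the benchmark below):

```
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 906, 438)
y = F.interpolate(x, size=(320, 196), mode="bicubic", antialias=True)
print(y.shape)  # torch.Size([1, 3, 320, 196])
```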

### Benchmarks

<details>
<summary>
Forward pass, CPU. PTH interpolation vs PIL
</summary>

Cases:
- PTH RGB 3 Channels, float32 vs PIL RGB uint8 (apples vs pears)
- PTH 1 Channel, float32 vs PIL 1 Channel Float

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 1
[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                4.5                |          5.2
      channels_last non-contiguous torch.float32  |                4.5                |          5.3

Times are in milliseconds (ms).

[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                5.7                |          6.4
      channels_last non-contiguous torch.float32  |                5.7                |          6.4

Times are in milliseconds (ms).

[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) --------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.0                |          4.0
      channels_last non-contiguous torch.float32  |                2.9                |          4.1

Times are in milliseconds (ms).

[------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                14.7               |          17.1
      channels_last non-contiguous torch.float32  |                14.8               |          17.2

Times are in milliseconds (ms).

[------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.5                |          3.9
      channels_last non-contiguous torch.float32  |                3.5                |          3.9

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               2.4               |          1.8

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               3.1               |          2.2

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) ----------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.6               |          1.4

Times are in milliseconds (ms).

[--------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               7.9               |          5.7

Times are in milliseconds (ms).

[--------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.7               |          1.3

Times are in milliseconds (ms).

```

</details>

Code is moved from torchvision: https://github.com/pytorch/vision/pull/3810 and https://github.com/pytorch/vision/pull/4208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68819

Reviewed By: mikaylagawarecki

Differential Revision: D33339117

Pulled By: jbschlosser

fbshipit-source-id: 6a0443bbba5439f52c7dbc1be819b75634cf67c4
2021-12-29 14:04:43 -08:00
2b00dbbbbc fix typos in torch/csrc/deploy/README.md (#70494)
Summary:
Fixes typo in torch/csrc/deploy/README.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70494

Reviewed By: mikaylagawarecki

Differential Revision: D33354431

Pulled By: H-Huang

fbshipit-source-id: b05757a795d2700eea21d7b881d87a7b239a8b52
2021-12-29 13:52:06 -08:00
8af39b7668 AdaptiveLogSoftmaxWithLoss no_batch_dim support (#69054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69054

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33200166

Pulled By: george-qi

fbshipit-source-id: 9d953744351a25f372418d2a64e8402356d1e9b7
2021-12-29 10:25:26 -08:00
0460324b9b Fix docs rendering for nn.Module.named_modules() (#70491)
Summary:
The documentation rendering for nn.Module.named_modules() is a bit broken, see the description of the last argument [here](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_modules).

This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70491

Reviewed By: mikaylagawarecki

Differential Revision: D33349882

Pulled By: albanD

fbshipit-source-id: a46327c12e8114f7ef2055a8518c4ca9d186e669
2021-12-29 10:08:53 -08:00
fb736c77a4 Remove backward op for slow dilated 3d convolution (#70068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70068

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33172550

Pulled By: jbschlosser

fbshipit-source-id: 72109577c020b33e4b9807064f53f1989475d1c2
2021-12-29 09:46:19 -08:00
2c67621a19 [rnn,gru,lstm]cell : no batch dim (#70236)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585
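
A minimal sketch of the unbatched usage this enables:

```
import torch

cell = torch.nn.LSTMCell(input_size=4, hidden_size=8)
x = torch.randn(4)                        # no leading batch dimension
h0, c0 = torch.randn(8), torch.randn(8)
h1, c1 = cell(x, (h0, c0))
print(h1.shape, c1.shape)                 # torch.Size([8]) torch.Size([8])
```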

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70236

Reviewed By: mikaylagawarecki

Differential Revision: D33338774

Pulled By: jbschlosser

fbshipit-source-id: 7d8d00272e543b3e67060136b5d98a4baefbedd5
2021-12-29 09:27:32 -08:00
9266b2af73 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33347489

fbshipit-source-id: d43ce53c93724f44b587bfe892534f8d13eadaca
2021-12-29 04:06:52 -08:00
103fc5f9a5 Remove unused variable (#70261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70261

ghstack-source-id: 146310591

Test Plan:
```
buck test  fbsource//xplat/caffe2:for_each_prod_ptl_model_test
```

{gif:p014gzft}

Reviewed By: iseeyuan

Differential Revision: D33265656

fbshipit-source-id: 6e303ee304064a61383ba2ae34f2e21077ec9db3
2021-12-28 22:21:29 -08:00
066c9ff08f Deprecating python 3.6 (#70325)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70325

Reviewed By: seemethere

Differential Revision: D33339496

Pulled By: atalman

fbshipit-source-id: 7509cab4f7469dae234bcf3f79e0aabb54577b8a
2021-12-28 18:44:59 -08:00
a0c99a8d3b [Operator Verioning][Edge] Update upgrader codegen with latest change (#70293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70293

```
python /Users/chenlai/pytorch/tools/codegen/operator_versions/gen_mobile_upgraders.py

```
https://github.com/pytorch/pytorch/pull/70161 has landed to resolve a thread safety issue. Accordingly, the upgrader codegen needs to be updated.
ghstack-source-id: 146296324

Test Plan:
```
buck test mode/opt //caffe2/test:upgrader_codegen
buck run mode/opt //caffe2/torch/fb/mobile/upgrader_codegen:upgrader_codegen
python /Users/chenlai/pytorch/tools/codegen/operator_versions/gen_mobile_upgraders.py

```

Reviewed By: iseeyuan

Differential Revision: D33274831

fbshipit-source-id: 0e1d2a81edc9b6111f3c6127dbd5b97e16c93dca
2021-12-28 18:34:31 -08:00
a6eadf9b50 Remove backward op for slow 3d convolution (#69978)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69978

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33131003

Pulled By: jbschlosser

fbshipit-source-id: 097440b2eb501c1eeeb8a666d4bc3508fc5d0cfa
2021-12-28 16:19:23 -08:00
5e113eb24d .github: Add linux.4xlarge executor (#70474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70474

Needed to compile linux wheels for CUDA 11.x since we were OOM'ing with
16GB of RAM

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: atalman

Differential Revision: D33343322

Pulled By: seemethere

fbshipit-source-id: 9f62e07ce2ca229fa25285429c01dc074d63b388
2021-12-28 15:40:28 -08:00
0fb73035f7 [Bootcamp Task] Replace string concatenation by fmt::format (#70366)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69979

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70366

Reviewed By: H-Huang

Differential Revision: D33339291

Pulled By: LynneD

fbshipit-source-id: e4e0535cd2db8e9fa8b0875d17a900be58384367
2021-12-28 14:15:21 -08:00
e96dda15e5 Remove backward op for slow 2d transposed convolution (#70333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70333

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33301402

Pulled By: jbschlosser

fbshipit-source-id: 3cfb3165589fe1620f22479b05139676d20dc493
2021-12-28 12:38:59 -08:00
c732a26e59 Add macro to register CPU kernel for all arch types (#70332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70332

Idea to avoid recompilations: what if we introduce a new macro REGISTER_ALL_CPU_DISPATCH that registers the same kernel across all CPU arch types? We'd call this from native/Convolution*.cpp and wouldn't need to move any logic underneath the native/cpu dir. That would simplify these PRs quite a bit and would also avoid the recompilation. What do you think about this approach?

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33301403

Pulled By: jbschlosser

fbshipit-source-id: d7cc163d4fe23c35c93e512d1f0a8af8c9897933
2021-12-28 12:37:36 -08:00
244730eeea .github: Add needs build for generate-test-matrix (#70456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70456

This job was still running on workflows despite ciflow not being enabled

This makes it so that test matrix generation only occurs before tests
are actually run.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: atalman

Differential Revision: D33338946

Pulled By: seemethere

fbshipit-source-id: 4b83d5fe6572771807708764609a72c4f1c5745d
2021-12-28 10:11:34 -08:00
4ed02748be fix typo in the docs of multiprocessing (#70448)
Summary:
Fix typo in the docs of multiprocessing.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70448

Reviewed By: gchanan

Differential Revision: D33336962

Pulled By: H-Huang

fbshipit-source-id: 1235703b8ddc26c33dcbc34bd25ac36b11a18923
2021-12-28 09:58:47 -08:00
73b5b6792f Adds reduction args to signature of F.multilabel_soft_margin_loss docs (#70420)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70420

Reviewed By: gchanan

Differential Revision: D33336924

Pulled By: jbschlosser

fbshipit-source-id: 18189611b3fc1738900312efe521884bced42666
2021-12-28 09:48:05 -08:00
6f83841582 .github: Temporarily disable xla test config (#70453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70453

Removes the current xla config; downstream `pytorch/xla` is broken for
clang compilation, so we temporarily remove this config until the xla team
can fix the upstream CI.

Context: https://github.com/pytorch/xla/pull/3255/files#r775980035

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zengk95

Differential Revision: D33338463

Pulled By: seemethere

fbshipit-source-id: 1ef332c685d5e2cc7e2eb038e93bd656847fd099
2021-12-28 08:49:01 -08:00
15f14ce0dc fix typo in adam docs (#70387)
Summary:
Fix the typo in [adam docs in master branch](https://pytorch.org/docs/master/generated/torch.optim.Adam.html#torch.optim.Adam)

![image](https://user-images.githubusercontent.com/41060790/147345284-37e180d1-fd06-4a62-9c79-2d17b8aa5cd3.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70387

Reviewed By: H-Huang

Differential Revision: D33309283

Pulled By: albanD

fbshipit-source-id: d20c5d8f2498ac64013f71e202a6b50dcc069f2b
2021-12-28 07:35:39 -08:00
574dbb584d quant tests: fix log spew for HistogramObserver (#70107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70107

Histogram observer used floor division on tensors, which is a deprecated
behavior.  There was a warning printed:

```
/Users/vasiliy/pytorch/torch/ao/quantization/observer.py:905: UserWarning: __floordiv__ is deprecated, and i
ts behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' funct
ion NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use
torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='flo
or').
```

This PR fixes the warning.
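
For reference, the replacements the warning suggests (illustrative values):

```
import torch

a = torch.tensor([-7.0, 7.0])
# a // 2 used to warn and round toward zero; the explicit forms are:
print(torch.div(a, 2, rounding_mode='trunc'))  # tensor([-3., 3.])
print(torch.div(a, 2, rounding_mode='floor'))  # tensor([-4., 3.])
```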

Test Plan:
```
python test/test_quantization.py TestHistogramObserver
```

Reviewed By: ejguan

Differential Revision: D33187926

Pulled By: vkuzo

fbshipit-source-id: 9c37de4c6d6193bee9047b6a28ff37ee1b019753
2021-12-28 06:27:51 -08:00
00df885d4e quant tests: clean up logs about incorrect tensor copy (#70106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70106

Some of quantization tests had log spew like

```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```

This PR cleans up the root cause from the utils. Some other
tests may still hit this warning from other places
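
The recommended copy construction, per the warning (minimal sketch):

```
import torch

src = torch.randn(3, requires_grad=True)
# torch.tensor(src) triggers the warning above; recommended forms:
plain_copy = src.clone().detach()
grad_copy = src.clone().detach().requires_grad_(True)
```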

Test Plan:
```
python test/test_quantization.py TestFakeQuantizeOps
```

this particular warning no longer appears

Reviewed By: soulitzer

Differential Revision: D33187925

Pulled By: vkuzo

fbshipit-source-id: bd1acd77fd72a10dad0c254f9f9f32e513c8a89a
2021-12-28 06:26:40 -08:00
b7b32b56f1 Revert D33281300: Prevent sum overflow in broadcast_object_list
Test Plan: revert-hammer

Differential Revision:
D33281300 (807f9a828c)

Original commit changeset: 1bc83e8624ed

Original Phabricator Diff: D33281300 (807f9a828c)

fbshipit-source-id: beb81a9cbfba405a61b11dfaa8e39c9601f45643
2021-12-27 19:01:53 -08:00
807f9a828c Prevent sum overflow in broadcast_object_list (#70336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336

broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
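
A minimal sketch of the wraparound, with an assumed payload size just over 2 GiB:

```
import ctypes

total_len = 2**31 + 1000                 # assumed combined object bytes
print(ctypes.c_int32(total_len).value)   # -2147482648: a negative "size" is requested
```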

Test Plan:
Increased the size of the Tensor used in object transfers to have a >2GB storage requirement (in distributed_test.py)

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object

Differential Revision: D33281300

fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
2021-12-27 16:17:53 -08:00
5a9ea9e386 Automated submodule update: tensorpipe (#70438)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 52791a2fd2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70438

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: zertosh

Differential Revision: D33331758

fbshipit-source-id: 1e811ddc30e9afa440523c6cb5c4e893eb560978
2021-12-27 15:19:21 -08:00
bf610f08b0 Back out "Make TorchScript Preserve Fully Qualified Class Name for Python Exceptions"
Summary: as title

Test Plan:
```
buck run mode/opt-split-dwarf -c=python.package_style=inplace //ai_infra/distributed_ai/pyper_test_framework/templates:pyper_release_v2 -- --model inline_cvr_post_imp_deterministic_shrunk_pyper_release_v2 --cluster TSCTestCluster --hpc_identity oncall_pyper_oncall --stage prod_offline_training --test_module training_platform
...
############## Start inline_cvr_post_imp_model Test Results Analysis ##############
I1226 22:03:56.789000 3346280 test_driver.py:139  UNKNOWN     ] Test finished in 808.2743511786684 seconds.
+-------------------------+---------+------------------------+-----------------+
| Test Case               | Status  | Message                | Model Entity ID |
+-------------------------+---------+------------------------+-----------------+
| SmallWorld_release_test | Success | finished successfully. | 987987491       |
+-------------------------+---------+------------------------+-----------------+
I1226 22:03:56.790000 3346280 test_driver.py:143  UNKNOWN     ] test_run_id: 3d085f61-28d1-411d-bd27-940ea2554b23 use this id to find your run in scuba pyper_test_framework
I1226 22:03:56.792000 3346280 test_driver.py:160  UNKNOWN     ] Calling cleanup
I1226 22:03:56.792000 3346280 training_platform_test_launcher.py:385  UNKNOWN     ] Stopping launched jobs 1
I1226 22:03:59.563122 3346280 ClientSingletonManager.cpp:100] Shutting down Manifold ClientSingletonManager
```

Reviewed By: seemethere

Differential Revision: D33325936

fbshipit-source-id: 64414bf7061ad77e8ac12eb8abafee4043e0fa1e
2021-12-27 09:11:46 -08:00
4ae71c8d34 Add graph op replacement pass (#69915)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69915

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D33198158

Pulled By: tugsbayasgalan

fbshipit-source-id: f2b924edf9959aaf51f97db994fae031fa062cf8
2021-12-25 13:03:19 -08:00
63e58d262a Extend Graph, CompilationUnit, and schema matching to accept optional operator version number (#69914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69914

Test Plan: Imported from OSS

Reviewed By: qihqi

Differential Revision: D33198157

fbshipit-source-id: b98d9401e515f695d6cf99116f695edc7976bf01
2021-12-25 00:35:33 -08:00
df3cbcff28 Add utility methods to find an upgrader (#68355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68355

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D33198156

Pulled By: tugsbayasgalan

fbshipit-source-id: 68380148f0d9bee96d8090bf01c8dfca8e1f8b12
2021-12-24 12:23:04 -08:00
911d527b87 Make TorchScript Preserve Fully Qualified Class Name for Python Exceptions (#70339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70339

When a python program is translated to TorchScript, the python exception type is dropped. This makes users' lives hard when they need to categorize errors based on more than just the exception message.

Here we make the change so that when we raise a Python exception, we record the fully qualified class name of the exception. Later on, when the TorchScript is interpreted, a special exception CustomJITException is thrown; users can get the Python class name from CustomJITException::getPythonClassName.

Note that this diff does not customize the mapping from C++ exception to Python exception. It's left to the users to do whatever mapping they want.
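
As a rough illustration (hypothetical snippet, not from this diff), a scripted function whose raised exception now carries the name `builtins.ValueError`:

```
import torch

@torch.jit.script
def checked_sqrt(x: float) -> float:
    if x < 0:
        # the fully qualified class name is now recorded with the exception
        raise ValueError("input must be non-negative")
    return x ** 0.5
```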

Code under scripts/shunting is just my own experimental code. I can split it out if requested.
ghstack-source-id: 146221879

Test Plan: buck test mode/opt //caffe2/test:jit

Reviewed By: gmagogsfm

Differential Revision: D33282878

fbshipit-source-id: 910f67a764519f1053a48589d1a34df69001525d
2021-12-24 00:25:40 -08:00
ab4f9862a3 [Compiled Mobilenetv3 Demo] Integrate Compiled Mobilenetv3 into FB4A Playground app (#70370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70370

Demo of Mobilenetv3 compiled with NNC in FB4A Playground app:
- Add compiled ModelConfig in FB4A app
- Enable Camera inputs for Mobilenet processor in the app and add ability to show live outputs
- Use downscaled inputs, which works for both original mobilenetv3 model and the compiled model
- Update nnc_aten_adaptive_avg_pool2d to use adaptive_avg_pool2d instead of adaptive_avg_pool2d_out as the latter is not included in the traced operators of mobilenetv3 model and hence not included in the app.
- Update app dependencies to include nnc_backend_lib and asm binary

Test Plan:
Run `arc playground pytorchscenario` from fbandroid to build and install the app on a connected device.
Live demo with compiled Mobilenetv3 model:
https://pxl.cl/1W1kb

Reviewed By: larryliu0820

Differential Revision: D33301477

fbshipit-source-id: 5d50a0e70a7f7d2157d311d6b1feef46e78e85b6
2021-12-23 23:46:20 -08:00
0ee663d2fa Revert D33234529: [NNC Testing] Randomized loop nest infrastructure
Test Plan: revert-hammer

Differential Revision:
D33234529 (1d094587ea)

Original commit changeset: 9019f1f1d4ca

Original Phabricator Diff: D33234529 (1d094587ea)

fbshipit-source-id: a79deca9f186299bf884587eb7d50af2464979fb
2021-12-23 23:11:23 -08:00
e429a68478 Allow single node fusion for nvfuser (#70000)
Summary:
Setting `PYTORCH_NVFUSER_ONE_OP_FUSION=1` makes nvFuser take every node it supports, instead of waiting for a fusion opportunity.
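
Minimal usage sketch (setting the variable in the environment before the process imports torch is the safe option):

```
import os
os.environ["PYTORCH_NVFUSER_ONE_OP_FUSION"] = "1"
# equivalently, from a shell: PYTORCH_NVFUSER_ONE_OP_FUSION=1 python train.py
import torch  # then run scripted models with nvFuser enabled as usual
```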

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70000

Reviewed By: samdow

Differential Revision: D33292195

Pulled By: davidberard98

fbshipit-source-id: 8ed5ce5e82fbb6737e8ab5ce4223b038eaf47756
2021-12-23 17:07:57 -08:00
5ccf28d066 Do not use ZeroTensor for inplace ops (#69998)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69998

Fixes: https://github.com/pytorch/pytorch/issues/69855

The check for undefined grads for forward AD was not being run because `check_undefined_grads` was only passed as True by OpInfo for backward AD. This PR updates gradcheck to interpret `check_undefined_grads` as possibly for forward or backward AD.

This PR also updates codegen to 1) not use ZeroTensor for `self` when the op is in-place, and 2) only create zeros (either through ZeroTensor or at::zeros) if the tensor itself is not undefined. Previously we would error in this case when calling `.options` on the undefined tensor.

~TODO: undo the skips that are due to the original issue~

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33235973

Pulled By: soulitzer

fbshipit-source-id: 5769b6d6ca123b2bed31dc2bc6bc8e4701581891
2021-12-23 15:52:34 -08:00
3116d87024 Add forward AD formulas for {adaptive_,fractional_,}max_pool{2,3}d_{backward,} (#69884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69884

Also fixes: https://github.com/pytorch/pytorch/issues/69322, https://github.com/pytorch/pytorch/issues/69325

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33093039

Pulled By: soulitzer

fbshipit-source-id: b9a522a00f4e9e85974888de5058de07280f8f66
2021-12-23 15:51:09 -08:00
6925576e88 [acc_ops] No longer mark acc_ops.cat as unary (#70365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70365

We should only mark ops as unary if they have a single fx.Node input. However, `cat` takes a sequence of tensors (`tensors`) as input.

Reviewed By: alexbeloi

Differential Revision: D33299988

fbshipit-source-id: db3581eaee4ad9d2358eed01ec9027825f58f220
2021-12-23 15:09:03 -08:00
133c7f2cf9 Revert D33301254: [pytorch][PR] GHA Windows: Propagate exit code from .bat to calling bash script
Test Plan: revert-hammer

Differential Revision:
D33301254 (6431ac6c7a)

Original commit changeset: 6861dbf0f0a3

Original Phabricator Diff: D33301254 (6431ac6c7a)

fbshipit-source-id: c9d8f72bb198de678456e0a1bcf3264c2ea52874
2021-12-23 15:03:48 -08:00
6431ac6c7a GHA Windows: Propagate exit code from .bat to calling bash script (#70011)
Summary:
The Windows 1st shard was silently failing to run (more details here: https://github.com/pytorch/pytorch/issues/70010) because the code to run it was never reached. The failure was silent because our CI still returned green for those workflow jobs: the exit code from the batch script DID NOT PROPAGATE to the calling bash script.

The key here is that even though we have
```
if ERRORLEVEL 1 exit /b 1
```

The exit code 1 was NOT propagating back to the bash script: the `exit /b 1` was within an `if` statement, and the batch script was actually run in a cmd shell, so the bash script win-test.sh continued without erroring. Moving the `exit /b 1` to be standalone fixes it.

More details for this can be found in this stack overflow https://stackoverflow.com/a/55290133

Evidence that now a failure in the .bat would fail the whole job:
https://github.com/pytorch/pytorch/runs/4621483334?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70011

Reviewed By: malfet

Differential Revision: D33301254

Pulled By: janeyx99

fbshipit-source-id: 6861dbf0f0a34d5baed59f928e34eab15af6f461
2021-12-23 14:09:41 -08:00
ab57f6d12c [LTC] Upstream utils to extract BackendDevice from at::Tensor (#70069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70069

This commit upstreams utils to extract BackendDevice from at::Tensor.

Test Plan: ./build/bin/test_lazy --gtest_filter=BackendDeviceTest.GetBackendDevice*

Reviewed By: samdow

Differential Revision: D33293160

Pulled By: alanwaketan

fbshipit-source-id: 78647239f90b4d04adce84ae6022b8983ad30c09
2021-12-23 12:42:03 -08:00
16e6e1a59e [Easy] Lint wrap.py file (#70341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70341

Per title
ghstack-source-id: 146181936

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33290099

fbshipit-source-id: e4415a42086d9b1b78b0b5f42d4b02f275131dfa
2021-12-23 11:30:36 -08:00
3c231e9bd7 [FSDP] Remove module.wrapper_config support (#70340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70340

Some wrap APIs support module.wrapper_config to specify the FSDP
arguments, though this feature is currently unused in all use cases and there
is no plan to support this API. enable_wrap() and wrap() along with FSDP
constructor wrapping should be enough for all use cases, so get rid of the
unnecessary code.
ghstack-source-id: 146181819

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33290066

fbshipit-source-id: e7f3d8b2f2ff6bdf4a3e5021dbb53adf052ee8dc
2021-12-23 11:29:13 -08:00
d100d98db8 torch.linalg routines return torch.linalg.LinAlgError when a numerical error in the computation is found. (#68571)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/64785 by introducing a `torch.linalg.LinAlgError` for reporting errors caused by bad values in linear algebra routines, which should allow users to easily catch errors caused by numerical issues.
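
A minimal sketch of catching the new exception (a singular input is one such bad value):

```
import torch

try:
    torch.linalg.inv(torch.zeros(2, 2))        # singular input
except torch.linalg.LinAlgError as err:
    print("caught numerical error:", err)
```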

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68571

Reviewed By: malfet

Differential Revision: D33254087

Pulled By: albanD

fbshipit-source-id: 94b59000fdb6a9765e397158e526d1f815f18f0f
2021-12-23 10:53:26 -08:00
6a84449290 [SR] Fast path for VarStack on scalars (#70210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210

Add a fast-path for `VarStack` nodes for when the inputs are scalars.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack`

Reviewed By: hlu1

Differential Revision: D33177498

fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6
2021-12-23 10:31:17 -08:00
cc8b916395 Transformer{DecoderLayer} : no batch dim (#70322)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60585

TransformerDecoder Test Timings (takes about 30s)
<details>

```
pytest test/test_modules.py -k _TransformerDeco --durations=10
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 639 items / 591 deselected / 48 selected

test/test_modules.py ss......ss......ss..ssssssssss..................                                                                                                                                      [100%]

============================================================================================== slowest 10 durations ==============================================================================================
17.13s call     test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerDecoderLayer_cuda_float64
4.13s call     test/test_modules.py::TestModuleCPU::test_gradgrad_nn_TransformerDecoderLayer_cpu_float64
1.22s call     test/test_modules.py::TestModuleCUDA::test_grad_nn_TransformerDecoderLayer_cuda_float64
0.86s call     test/test_modules.py::TestModuleCPU::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cpu_float32
0.73s call     test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float32
0.57s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerDecoderLayer_cuda_float32
0.56s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerDecoderLayer_cuda_float64
0.48s call     test/test_modules.py::TestModuleCPU::test_grad_nn_TransformerDecoderLayer_cpu_float64
0.41s call     test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_TransformerDecoderLayer_cuda_float32
0.40s call     test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float64
============================================================================================ short test summary info =============================================================================================
========================================================================== 32 passed, 16 skipped, 591 deselected, 3 warnings in 29.62s ===========================================================================
```

</details>

Transformer Test Timings (takes about 1m10s)

<details>
```
pytest test/test_modules.py -k _Transformer_ --durations=10
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 639 items / 591 deselected / 48 selected

test/test_modules.py ss......ss......ss..ssssssssss..................                                                                                                                                      [100%]

==================================================================================
============================================================================================== slowest 10 durations ==============================================================================================
46.40s call     test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Transformer_cuda_float64
11.09s call     test/test_modules.py::TestModuleCPU::test_gradgrad_nn_Transformer_cpu_float64
2.48s call     test/test_modules.py::TestModuleCUDA::test_grad_nn_Transformer_cuda_float64
1.03s call     test/test_modules.py::TestModuleCPU::test_grad_nn_Transformer_cpu_float64
0.96s call     test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Transformer_cuda_float32
0.87s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Transformer_cuda_float32
0.85s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Transformer_cuda_float64
0.85s call     test/test_modules.py::TestModuleCPU::test_cpu_gpu_parity_nn_Transformer_cpu_float32
0.65s call     test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Transformer_cuda_float64
0.47s call     test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Transformer_cuda_float32
============================================================================================ short test summary info =============================================================================================
===================================================================== 32 passed, 16 skipped, 591 deselected, 3 warnings in 70.19s (0:01:10) ======================================================================
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70322

Reviewed By: cpuhrsch

Differential Revision: D33286285

Pulled By: jbschlosser

fbshipit-source-id: 46e08cf47f37787733a535f683c3fd21f652486d
2021-12-23 10:13:31 -08:00
4d49af863f GaussianNLLLoss no_batch_dim docs and testing (#69783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69783

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33200486

Pulled By: george-qi

fbshipit-source-id: a2bc2b366772682825f879dae4ac29c1f4d6a5f1
2021-12-23 09:27:53 -08:00
a9c7d626e1 Add the maximize flag to AdamW (#70146)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/68052
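
A minimal usage sketch of the new flag (toy objective; values are illustrative):

```
import torch

p = torch.zeros(1, requires_grad=True)
opt = torch.optim.AdamW([p], lr=0.1, maximize=True)  # ascend instead of descend

objective = -(p - 3.0).pow(2).sum()  # maximized at p = 3
objective.backward()
opt.step()                           # p moves toward 3 rather than away
```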

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70146

Reviewed By: malfet

Differential Revision: D33254561

Pulled By: albanD

fbshipit-source-id: f190c836a4162f936c5953e076747c345df21421
2021-12-23 09:20:29 -08:00
b15212c62b enable backward pass computation and communication overlap by prefetching all gather (#70235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70235

address comments in https://github.com/pytorch/pytorch/pull/69282:
We have fixed a few corner cases for prefetching full parameters in the post-backward hook.

After benchmarking, prefetching full parameters in the pre-backward hook has the best and most stable performance, but at the cost of increased memory; prefetching full parameters in the post-backward hook did not see the expected performance and also failed in a few corner cases (now fixed), although it has no memory increase. The main issue is that the post-backward hook firing order is not guaranteed to be the reverse of the forward computation order, so an incorrectly prefetched all-gather can delay the all-gather that is actually needed in the single NCCL stream and delay some layers' computation.

So these two approaches are kept as two configurable experimental algorithms for now.

prefetch full parameters at pre-backward hook:

It is observed from past traces that all-gather ops are not triggered until the current layer's backward pass starts to compute. Also, for some models, previous layers' reduce-scatter is scheduled before the next layer's all-gather ops; since all-gather and reduce-scatter are in the same NCCL stream, this can result in a backward pass with no overlap between communication and computation.

To explicitly get the next layer's all-gather scheduled while the previous layers' backward computation is running, we can prefetch the next layer's all-gather of full params. This helps 1) deterministically overlap both all-gather and reduce-scatter with computation, and 2) prefetch only one layer's all-gather full parameters at a time, to avoid increasing memory too much.

The implementation borrowed the idea from facebookresearch/fairscale#865, where forward graph order is recorded in the forward pass.

In the backward pass, this PR prefetches the all-gather of full parameters in the current layer's pre-backward hook, instead of in the current layer's post-backward hook as in facebookresearch/fairscale#865. It also makes sure the all-gather streams are synced properly.

Experiments showed a 10% memory increase and a 20% latency speedup for a 1GB RoBERTa model in a slow network environment.
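
As an illustration of the pre-backward prefetch idea, here is a minimal sketch (not the actual FSDP implementation; `record_and_hook` and the `print` placeholder are ours, and tensor grad hooks stand in for FSDP's real pre-backward hooks):

```python
import torch
import torch.nn as nn

forward_order = []  # module execution order, recorded during forward

def pre_backward_prefetch(module):
    # Backward runs roughly in reverse forward order, so the module that ran
    # just before this one in forward is the next one backward will need.
    idx = forward_order.index(module)
    if idx > 0:
        prev = forward_order[idx - 1]
        # Real FSDP would launch prev's all-gather on the communication
        # stream here; this placeholder just logs the decision.
        print(f"prefetch all-gather for {type(prev).__name__}")

def record_and_hook(module, inputs, output):
    forward_order.append(module)
    # A grad hook on the output fires just before this module's backward
    # computes, approximating a pre-backward hook.
    output.register_hook(lambda g: (pre_backward_prefetch(module), g)[-1])

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
for m in model:
    m.register_forward_hook(record_and_hook)

model(torch.randn(2, 4)).sum().backward()
```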

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D33252795

fbshipit-source-id: 4e2f47225ba223e7429b0dcaa89df3634bb70050
2021-12-22 23:02:46 -08:00
1d094587ea [NNC Testing] Randomized loop nest infrastructure (#70174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70174

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33234529

fbshipit-source-id: 9019f1f1d4ca945c92bee401f7ec674b7d987de4
2021-12-22 22:07:39 -08:00
656d2a7bf6 [quant][fx][graphmode] Add backend_config_dict for standalone module (#70150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70150

This PR allows users to specify backend_config_dict for standalone modules, in both the prepare and convert steps.
Adding this now to allow prototyping for some of our customer use cases; a test for the codepath will be added in
a separate PR

Test Plan:
regression tests
```
python test/test_quantization.py TestQuantizeFx
```
A test that specifies backend_config for some module will be added in a separate PR for the use case we have in mind,
since it requires other features

Imported from OSS

**Static Docs Preview: classyvision**
|[Full Site](https://our.intern.facebook.com/intern/staticdocs/eph/D33205162/V9/classyvision/)|

|**Modified Pages**|

Reviewed By: vkuzo

Differential Revision: D33205162

fbshipit-source-id: a657cef8e49d99b6a43653141521dc87c33bfd89
2021-12-22 21:18:39 -08:00
795af1578c Revert D33172665: [LTC] Upstream utils to extract BackendDevice from at::Tensor
Test Plan: revert-hammer

Differential Revision:
D33172665 (121d067999)

Original commit changeset: b334ee358ea7

Original Phabricator Diff: D33172665 (121d067999)

fbshipit-source-id: 8bff43cddfc5d30483ec5cea8eff037aab9d1cfa
2021-12-22 21:12:49 -08:00
12afe2bb84 update poisson_nll_loss opinfo samples (#70300)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67461

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70300

Reviewed By: cpuhrsch

Differential Revision: D33285896

Pulled By: jbschlosser

fbshipit-source-id: ec917ec7d3113dbc4ae03978fa5abb24aa082c01
2021-12-22 19:10:57 -08:00
681e78bace [Profiler] Address issues from profiler bifurcation. (#70327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70327

After D32678163 (7ea86dfdb1), test_rpc_profiler began failing. This was surprising, because it should have been a no-op refactor. However, one change is that a Kineto profiler is no longer also an autograd profiler; the RPC framework was assuming a legacy profiler but when a kineto profiler was active things still kind of worked due to that implementation detail. (But crashed after the class split.)

This diff tidies up a couple of things:
1) Move `getProfilerConfig` into `api.cpp`, since it is no longer correct to static_cast a `KinetoThreadLocalState` to a `ProfilerLegacyThreadLocalState`. (And really the class we want is `ProfilerThreadLocalStateBase` anyway.)

2) Add a mechanism for callers to check if the active profiler is a legacy or kineto profiler. (So callers like RPC can adjust or provide a nice error message.)

3) Fix the RPC test to create a legacy profiler.

Test Plan: `caffe2/torch/fb/training_toolkit/backend/tests:test_rpc_profiler` now passes, and before the fix to `test_rpc_profiler.py`, I verified that the test failed with the error message added to `utils.cpp` rather than just crashing.

Reviewed By: suphoff

Differential Revision: D33283314

fbshipit-source-id: e4fc5b5cfc9ca3b91b8f5e09adea36f38611f90d
2021-12-22 18:50:42 -08:00
121d067999 [LTC] Upstream utils to extract BackendDevice from at::Tensor (#70069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70069

This commit upstreams utils to extract BackendDevice from at::Tensor.

Test Plan: ./build/bin/test_lazy --gtest_filter=BackendDeviceTest.GetBackendDevice*

Reviewed By: wconstab

Differential Revision: D33172665

Pulled By: alanwaketan

fbshipit-source-id: b334ee358ea7b031bbffb0a16fa634715dba83f5
2021-12-22 18:15:45 -08:00
bd8e8e3aaf [GHA] Clean after checkout (#70337)
Summary:
Github's checkout action sometimes leaves untracked files in the repo
Remedy it by running `git clean -fxd`, which should nuke them all

Tentative fix for https://github.com/pytorch/pytorch/issues/70097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70337

Reviewed By: suo

Differential Revision: D33289189

Pulled By: malfet

fbshipit-source-id: 16e3ebe7a61fda1648189c78bdf1b1185247037a
2021-12-22 18:10:23 -08:00
a421ee0e52 [nn] InstanceNorm : no batch dim for modules (#65323)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65323

Reviewed By: davidberard98

Differential Revision: D33285268

Pulled By: jbschlosser

fbshipit-source-id: c5210bb431eaf27190e1cd75c42af3e5bcf83f72
2021-12-22 18:00:36 -08:00
c06b3208d4 Revert D33141012: test //c10/... in CI
Test Plan: revert-hammer

Differential Revision:
D33141012 (0ccccf4ed5)

Original commit changeset: 702000587171

Original Phabricator Diff: D33141012 (0ccccf4ed5)

fbshipit-source-id: 1e30c2dad940f54185dc93912fd7b3e81eec5b63
2021-12-22 17:48:48 -08:00
23ab6ce723 Revert D33141011: extract //c10/macros into its own package
Test Plan: revert-hammer

Differential Revision:
D33141011 (8f4c724bb6)

Original commit changeset: caa97448f922

Original Phabricator Diff: D33141011 (8f4c724bb6)

fbshipit-source-id: 79423ed51f9a43ecf1f716a739c74949b66fadb4
2021-12-22 17:48:45 -08:00
f126501d37 Revert D33141010: allow Bazel to build without glog and gflags
Test Plan: revert-hammer

Differential Revision:
D33141010 (8c41f258f4)

Original commit changeset: d951e5616459

Original Phabricator Diff: D33141010 (8c41f258f4)

fbshipit-source-id: d52ca20ddf4c5a91cb09a32fecb30a00227fc4ae
2021-12-22 17:47:23 -08:00
682fab19d4 [SR] verify_and_correct_memory_overlap handles tensor lists (#69774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69774

We recently ran into a nasty bug caused by incorrect schema annotations on an `aten::split` overload. `verify_and_correct_memory_overlap` is supposed to prevent crashes in this scenario, but it didn't because it did not handle `Tensor[]` outputs.

This change extends the memory correction mechanism to handle tensor lists.
ghstack-source-id: 146152478

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D33022494

fbshipit-source-id: 8d1d41ca1d4fd5dfb7c8a66028c391ba63551eb0
2021-12-22 17:18:18 -08:00
385c12852e [LTC] Upstream LazyTensor <=> at::Tensor utils (#70066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70066

This commit upstreams utils to convert at::Tensors into LazyTensors and
vice versa.

Test Plan:
Covered by test_ptltc on the lazy_tensor_staging branch since TorchScript
Backend hasn't merged yet.

Reviewed By: desertfire

Differential Revision: D33171590

Pulled By: alanwaketan

fbshipit-source-id: b297ff5fc8ca1a02d30e16ad2249985310e836a9
2021-12-22 16:53:07 -08:00
2e94a0d282 Remove backward ops for NNPACK spatial convolution (#70305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70305

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33279223

Pulled By: jbschlosser

fbshipit-source-id: f263012b3edaa87ce5430ffd6204a5453360d5dd
2021-12-22 14:58:12 -08:00
7cdfd86a72 TestMathBits: test with neg and conj bit set (#68948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68948

The case where both the negative and conjugate bits are set
isn't tested currently despite being handled explicitly by `copy`.
In theory this shouldn't matter because neg_bit is only used for real
values, but it does mean the code in copy is untested. So, this just
runs it with a single sample as a sanity check.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33064371

Pulled By: anjali411

fbshipit-source-id: e90c65e311507c4fc618ff74fecc4929599c4fa3
2021-12-22 14:30:35 -08:00
7c690ef1c2 FractionalMaxPool3d with no_batch_dim support (#69732)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69732

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33280090

Pulled By: george-qi

fbshipit-source-id: aaf90a372b6d80da0554bad28d56436676f9cb89
2021-12-22 14:30:32 -08:00
8c41f258f4 allow Bazel to build without glog and gflags (#69995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69995
ghstack-source-id: 146027060

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33141010

fbshipit-source-id: d951e5616459e8aa163ae0741e245f53185580e8
2021-12-22 14:30:30 -08:00
8f4c724bb6 extract //c10/macros into its own package (#69994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69994
ghstack-source-id: 145799968

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33141011

fbshipit-source-id: caa97448f922d7c12980bf01669c1b3ef5c1213b
2021-12-22 14:30:27 -08:00
0ccccf4ed5 test //c10/... in CI (#69993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69993
ghstack-source-id: 145799967

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33141012

fbshipit-source-id: 70200058717189a57858f3f8d94ecc364fb229d6
2021-12-22 14:30:24 -08:00
1bd147b61a Fix masked_softmax's perf for element_size is not 8 (#70271)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70271

Test Plan:
Rebase on top of D32407544 and
buck run mode/opt -c fbcode.enable_gpu_sections=true pytext/fb/tools:benchmark_masked_softmax -- masked-softmax --batch-size=10
to see correct perf data ( PT time = ~2.5x PT native time )

Reviewed By: ngimel

Differential Revision: D33268055

fbshipit-source-id: f48b17852c19c2bc646f9ed8d9d5aac85caa8a05
2021-12-22 14:29:09 -08:00
c34aa715fa AT_MKL_SEQUENTIAL and build changes (#70259)
Summary:
Re-land of  https://github.com/pytorch/pytorch/pull/69419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70259

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33246757

Pulled By: ngimel

fbshipit-source-id: 738f8558d4cad6752be14108f9931ec3514f6682
2021-12-22 13:52:23 -08:00
b37de0a4bb Update flags in nnc lowering (#70306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70306

USE_XNNPACK is the right flag to enable lowering to prepacked XNNPACK-based ops

Test Plan: CI

Reviewed By: ZolotukhinM, priyaramani

Differential Revision: D33279375

fbshipit-source-id: d19ded5643f487f7b58c54a860ad39c8d484ed05
2021-12-22 12:25:35 -08:00
f36b44bb9e Remove ciflow_should_run job (#70204)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66725

This removes the ci_flow_should_run job and puts it in the build stage for the different job templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70204

Reviewed By: malfet

Differential Revision: D33282338

Pulled By: zengk95

fbshipit-source-id: 327ff2bca9720d2a69083594ada5c7788b65adbd
2021-12-22 11:52:42 -08:00
276253b164 Fixed wrong return type in ModuleList getitem (#69083)
Summary:
Fixes typing error:
`Expected type ‘Iterable’ (matched generic type ‘Iterable[_T1]’), got ‘Module’ instead.`

see: https://discuss.pytorch.org/t/modulelist-typing-error-not-an-iterable/138137/5 :

To reproduce (e.g. with mypy/pycharm):

```python
import torch
import torch.nn as nn
class Model(nn.Module):

    def __init__(self):
        super().__init__()
        self.module_list = nn.ModuleList(
            [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 1)]
        )

    def forward(self, batch):
        for i in self.module_list[1:4]:
            pass
        return batch
model = Model()
out = model(torch.randn(1, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69083

Reviewed By: davidberard98

Differential Revision: D33279114

Pulled By: jbschlosser

fbshipit-source-id: 90d74e76602163586b6ff4c49613a2694a9af37c
2021-12-22 11:38:17 -08:00
ce9a2f8ba9 [C++ API] Added missing nearest-exact mode and anti-alias flag (#69318)
Summary:
Description:

Following https://github.com/pytorch/pytorch/pull/65142#issuecomment-981995692 adding missing nearest-exact mode and anti-alias flag to C++ frontend.

- https://github.com/pytorch/pytorch/pull/65142
- https://github.com/pytorch/pytorch/pull/64501

- added tests in pytorch/test/cpp/api/functional.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69318

Reviewed By: davidberard98

Differential Revision: D33278995

Pulled By: jbschlosser

fbshipit-source-id: fa87c0c78df6b398e4f9688cc02111eed187afa7
2021-12-22 11:10:51 -08:00
da63f3f92b Corrected typo in Cross entropy formula (#70220)
Summary:
Changes made to line 1073: the denominator of the formula was EXP(SUM(x)); changed it to SUM(EXP(x))
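
For reference, the corrected formula is the standard cross entropy, with the sum of exponentials in the denominator:

```
\ell(x, \mathrm{class}) = -\log\frac{\exp(x_{\mathrm{class}})}{\sum_j \exp(x_j)}
                        = -x_{\mathrm{class}} + \log\sum_j \exp(x_j)
```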

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70220

Reviewed By: davidberard98

Differential Revision: D33279050

Pulled By: jbschlosser

fbshipit-source-id: 3e13aff5879240770e0cf2e047e7ef077784eb9c
2021-12-22 11:06:21 -08:00
b7259b8660 [quant][be] Add a check in prepare_qat to make sure the model is in training mode (#69879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69879

att

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33080989

fbshipit-source-id: 55a631284365ec9dfd6bd7469688490ab1891d41
2021-12-22 11:00:00 -08:00
2806d821b0 Add conversion of torch.permute to acc_ops.permute (#70294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70294

In order to infer shapes for permute, the node target needs to be converted from torch.permute to acc_ops.permute.

Reviewed By: jfix71

Differential Revision: D33267469

fbshipit-source-id: b77eff1892211eac4a798a2f3e624140e287f4a2
2021-12-22 10:38:39 -08:00
56969bf88a make inverse call linalg_inv (#70276)
Summary:
`linalg.inv` and `inverse` are aliases according to the documentation, yet their implementations have somewhat diverged. This makes `inverse` call into `linalg_inv`.
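
A quick check of the documented equivalence:

```python
import torch

A = torch.tensor([[2., 1.], [1., 2.]])  # an invertible 2x2 matrix
assert torch.allclose(torch.inverse(A), torch.linalg.inv(A))
```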

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70276

Reviewed By: malfet

Differential Revision: D33271847

Pulled By: ngimel

fbshipit-source-id: cf018ddd2c1cee29026dd5f546f03f3a1d3cf362
2021-12-22 10:15:40 -08:00
4db3a8fc0a [nn] TransformerEncoderLayer: no-batch-dim (#69291)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585
TODO:
* [ ] Update docs?
* [x] Generic reference function?

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69291

Reviewed By: davidberard98

Differential Revision: D33278970

Pulled By: jbschlosser

fbshipit-source-id: 8dd5b6d7c0099fa38aa037c186778b10834bdee4
2021-12-22 10:00:09 -08:00
69b37a16f3 Remove unused CUDASolver.h from SparseCUDABlas (#70281)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70281

Reviewed By: ngimel

Differential Revision: D33272704

Pulled By: malfet

fbshipit-source-id: a33a7f9cd1513115a0b9ab75530e85e9913e8dd3
2021-12-22 09:04:34 -08:00
31c7e5d629 Install TensorRT lib on oss docker and enable fx2trt unit test (#70203)
Summary:
CI

Lib installed and unit test run on https://github.com/pytorch/pytorch/actions/runs/1604076060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70203

Reviewed By: malfet

Differential Revision: D33264641

Pulled By: wushirong

fbshipit-source-id: ba30010bbd06e70d31415d8c52086d1779371bcf
2021-12-22 08:50:48 -08:00
b5f71375f5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33275345

fbshipit-source-id: b07a27897680190f9fff86e22d8c68c1c9aff19a
2021-12-22 08:05:39 -08:00
29f1ccc8f0 Fix some Composite Compliance problems with binary_cross_entropy backward (#70198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70198

This PR fixes composite compliance problems with:
- binary_cross_entropy's backward formula
- binary_cross_entropy_with_logits's backward formula
- binary_cross_entropy's double backward formula

It does so by adding checks for areAnyTensorSubclassLike.

Test Plan:
- I tested everything with functorch.
- We are going to do https://github.com/pytorch/pytorch/issues/69530 in
the future so we have a way of testing this in core. I need the
binary_cross_entropy ones for something right now and didn't want to
wait until we come up with a solution for #69530.

Reviewed By: Chillee

Differential Revision: D33246995

Pulled By: zou3519

fbshipit-source-id: 310ed3196b937d01b189870b86a6c5f77f9258b4
2021-12-22 07:24:04 -08:00
75dbe88b05 [DataPipe] removing unbatch_level from .groupby (#70249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70249

IMO, the `unbatch_level` argument is not needed here since users can simply call `.unbatch` before calling `.groupby` if needed. One small step closer to a unified API with other libraries.

Note that we may rename the functional name from `.groupby` to `.group` in the future. TBD.
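
A hypothetical usage sketch of the composition (assuming torchdata-style DataPipes; the exact yield order of groups depends on buffering):

```python
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper([[0, 1, 2], [3, 4, 5]])  # batched source
groups = dp.unbatch().groupby(group_key_fn=lambda x: x % 2)
print(list(groups))  # groups of even and odd values, e.g. [[0, 2, 4], [1, 3, 5]]
```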

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33259104

Pulled By: NivekT

fbshipit-source-id: 490e3b6f5927f9ebe8772d5a5e4fbabe9665dfdf
2021-12-22 07:13:12 -08:00
e02d836cb2 [LTC] Upstream LTCTensorImpl (#70062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70062

This commit upstreams LTCTensorImpl from the lazy_tensor_staging branch.
It inherits from c10::TensorImpl and thus manages the lifetime/storage
of LazyTensor.

Test Plan: ./build/bin/test_lazy --gtest_filter=LazyTensorImplTest.*

Reviewed By: desertfire

Differential Revision: D33171186

Pulled By: alanwaketan

fbshipit-source-id: 6af9f91cc7c7e997f120cb89a7bcd6785c03ace0
2021-12-22 03:21:52 -08:00
633f770c3c [StaticRuntime] Add out-variant support for TensorExprDynamicGroup op (#69479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69479

This diff adds support for out-variant optimization for `TensorExprDynamicGroup` op, which will be used for TensorExpr based fusion in Static Runtime.
ghstack-source-id: 146107008

Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Completed accuracy test on inline_cvr model 294738512 v0. Results:
```
get 1012 prediction values
get 1012 prediction values
pyper_inference_e2e_local_replayer_test.out.132ea03c2 pyper_inference_e2e_local_replayer_test.out.1858bbeb0
max_error:  0 % total:  0
```

Reviewed By: d1jang, mikeiovine

Differential Revision: D32768463

fbshipit-source-id: a3e6c1ea9ff5f3b57eb89095aa79a6d426fbb52a
2021-12-22 00:30:22 -08:00
7d4db93a7d [jit] Handle output tensor being passed in as inputs to TensorExprDynamicGroup (#69478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69478

This diff handles the case when output tensors are being passed in as
inputs to TensorExprDynamicGroup op.

This is in preparation to support out-variant optimizations in Static Runtime.
ghstack-source-id: 146107007

Test Plan: buck test mode/dev-nosan //caffe2/test/cpp/jit:jit

Reviewed By: eellison

Differential Revision: D32823889

fbshipit-source-id: ff18e17fcd09953e55c8da6b892e60756521c2fc
2021-12-22 00:30:19 -08:00
4dec15e6d8 [nnc] Add a run method to TensorExprKernel that takes in output tensors (#69477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69477

This diff adds a new run method to `TensorExprKernel` which takes in
output tensors as inputs and stores the output in those given tensors.
ghstack-source-id: 146107009

Test Plan: buck test mode/dev-nosan //caffe2/test/cpp/tensorexpr:tensorexpr -- --exact 'caffe2/test/cpp/tensorexpr:tensorexpr - Kernel.RunWithAllocatedOutputs'

Reviewed By: ZolotukhinM

Differential Revision: D32823890

fbshipit-source-id: edc1f4839785124048b034060feb71cb8c1be34f
2021-12-22 00:30:15 -08:00
0bdf4702f6 [jit] Add a new op that composes all of the dynamic shape logic (#69476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69476

This diff adds a new op, `TensorExprDynamicGroup`, that composes all of the logic behind running a dynamically shaped fused node. This includes a guard instruction that checks conditions, and a conditional that calls either the fused node or the fallback graph depending on the guard.
ghstack-source-id: 146107006

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/cpp/jit:jit
```

Reviewed By: eellison

Differential Revision: D32320082

fbshipit-source-id: 2bd1a43391ca559837d78ddb892d931abe9ebb73
2021-12-22 00:28:57 -08:00
b613fbdbf2 Back out "[Quant] Added 4 bit support for embedding quantized module" (#70273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70273

Original commit changeset: 73e63383cf60

Original Phabricator Diff: D33152674 (9f512e129b)

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D33268459

fbshipit-source-id: 051bfcbbad3fa083301a3cea508d00946d6db881
2021-12-21 21:28:04 -08:00
47ba28f3b5 Back out "[Quant][Eager] Added 4 bit support for eager mode quantization flow" (#70272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70272

Original commit changeset: 5cdaac5aee9b

Original Phabricator Diff: D33152675 (75718e5059)

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D33268415

fbshipit-source-id: 99eb3209d513149ed23a1d9071d1b1c12174d09a
2021-12-21 21:28:01 -08:00
a86f9806bc Back out "[Quant][fx] Added test for quint4x2 for fx graph mode quantization" (#70274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70274

Original commit changeset: 89951fcd23e7

Original Phabricator Diff: D33152672 (de4e7dece9)

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D33268165

fbshipit-source-id: d667a761d72b9423407ce4d6617e9b6a04b5c9f8
2021-12-21 21:26:46 -08:00
6217fee96b Revert D33246843: [pytorch][PR] Implementation of Wishart distribution
Test Plan: revert-hammer

Differential Revision:
D33246843 (a217a62e73)

Original commit changeset: 825fcddf4785

Original Phabricator Diff: D33246843 (a217a62e73)

fbshipit-source-id: 2c8063e8d10e9d3ac20fa44673e6011ed1160753
2021-12-21 18:55:49 -08:00
2d509ff31b [GHA] Fix doc push jobs (#70269)
Summary:
Home folder in docker images is `/var/lib/jenkins`, rather than `/home/jenkins`
Also, repo secrets cannot start with the `GITHUB_` prefix, according to the [Naming your secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets#naming-your-secrets) guide

Fixes https://github.com/pytorch/pytorch/issues/70211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70269

Reviewed By: suo

Differential Revision: D33271404

Pulled By: malfet

fbshipit-source-id: 044bb34c75a0e8a9f0b2f5790be7aa2397524a24
2021-12-21 18:20:10 -08:00
591ca4d6bc [Operator Versioning][Edge] Reorganize upgrader initialization logic for thread safety (#70225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70225

Thanks to zhxchen17 for the suggestion. This PR moves the operator initialization logic to `upgrader_mobile.cpp`, so that we can leverage a static variable to ensure the operator initialization only happens once.
ghstack-source-id: 146103229

Test Plan:
```

buck test mode/opt //papaya/integration/service/test/analytics/histogram:generic_histogram_system_test -- --exact 'papaya/integration/service/test/analytics/histogram:generic_histogram_system_test - SumHistogramSystemTest.test' --run-disabled
buck test mode/opt //caffe2/test/cpp/jit:jit
buck test mode/dev //papaya/integration/service/test/mnist:mnist_system_test -- --exact 'papaya/integration/service/test/mnist:mnist_system_test - MnistFederatedSystemTest.test'
```

Reviewed By: zhxchen17

Differential Revision: D33247543

fbshipit-source-id: 6c3a87fe909a1be01452fa79649065845b26d805
2021-12-21 17:26:17 -08:00
21c6de9fdc Extend autograd functional benchmarking to run vectorized tasks (#67045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67045

To run: `python benchmarks/functional_autograd_benchmark/functional_autograd_benchmark.py --gpu -1 --model-filter=ppl    _robust_reg --num-iter 100`

```
Results for model ppl_robust_reg on task vjp: 0.0012262486852705479s (var: 2.2107682351446556e-10)
Results for model ppl_robust_reg on task vhp: 0.002099371049553156s (var: 6.906406557760647e-10)
Results for model ppl_robust_reg on task jvp: 0.001860950025729835s (var: 1.1251884146634694e-10)
Results for model ppl_robust_reg on task hvp: 0.003481731517240405s (var: 2.2713633751614282e-10)
Results for model ppl_robust_reg on task jacobian: 0.0012128615053370595s (var: 1.3687526667638394e-09)
Results for model ppl_robust_reg on task hessian: 0.009885427542030811s (var: 9.366265096844018e-09)
Results for model ppl_robust_reg on task hessian_fwdrev: 0.005268776323646307s (var: 2.4293791422991262e-09)
Results for model ppl_robust_reg on task hessian_revrev: 0.002561321249231696s (var: 7.557877101938004e-10)
Results for model ppl_robust_reg on task jacfwd: 0.002619938924908638s (var: 5.109343503839625e-10)
Results for model ppl_robust_reg on task jacrev: 0.0013469004770740867s (var: 3.1857563254078514e-09)
```
Notes:
 - We go through the batched fallback for both
 - ppl_robust_reg takes 3 tensor inputs and returns a single scalar output
   - this means that jacobian is equivalent to doing a vjp, so vmap would not help us (see the sketch after this list)
   - we expect jacfwd to be slower than jacrev
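
A minimal sketch of why the scalar-output case collapses to a single VJP (`f` below is a stand-in, not the actual ppl_robust_reg model):

```python
import torch
from torch.autograd.functional import jacobian, vjp

def f(x):
    return (x ** 2).sum()  # R^n -> R, scalar output like ppl_robust_reg

x = torch.randn(5)
# For a scalar output, the full Jacobian equals one VJP with cotangent 1.0,
# so vmapping over output rows cannot help.
_, v = vjp(f, x, torch.tensor(1.0))
assert torch.allclose(jacobian(f, x), v)
```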

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33265947

Pulled By: soulitzer

fbshipit-source-id: 14f537a1376dea7e5afbe0c8e97f94731479b018
2021-12-21 17:20:29 -08:00
82c5f298ed [shard] fix named_params_with_sharded_tensor (#70228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70228

Fix the named_params_with_sharded_tensor impl: `named_parameters` already loops over the submodules recursively, so we shouldn't call it inside the submodule loop.
ghstack-source-id: 146076471

Test Plan: Added more complicated test cases (that involves multiple submodules) to capture this issue.

Reviewed By: pritamdamania87

Differential Revision: D33251428

fbshipit-source-id: cf24ca7fbe4a5e485fedd2614d00cdea2898239e
2021-12-21 15:29:38 -08:00
74c834e0dc [DataPipe] adding a finally statement to ensure hook is reset (#70214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70214

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33255306

Pulled By: NivekT

fbshipit-source-id: de2fe6bf08328e481c714aaad390db771073469e
2021-12-21 15:21:04 -08:00
23902fb895 Fixed typo in torch check for cdist (#70178)
Summary:
Description:
- Fixed typo in torch check for cdist

cc zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70178

Reviewed By: bdhirsh

Differential Revision: D33236027

Pulled By: zou3519

fbshipit-source-id: e87a982c0dc5fe576db8f2afc4b2010924f047c0
2021-12-21 15:16:39 -08:00
a217a62e73 Implementation of Wishart distribution (#68588)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68050

TODO:
- [x] Unit Test
- [x] Documentation
- [x] Change constraint of matrix variables with 'torch.distributions.constraints.symmetric' if it is reviewed and merged. https://github.com/pytorch/pytorch/issues/68720
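
A quick usage sketch of the added distribution (illustrative; argument names follow the usual `torch.distributions` conventions):

```python
import torch
from torch.distributions import Wishart

W = Wishart(df=torch.tensor(4.0), covariance_matrix=torch.eye(3))
S = W.sample()       # a 3x3 positive-definite matrix
print(W.log_prob(S))
```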

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68588

Reviewed By: bdhirsh

Differential Revision: D33246843

Pulled By: neerajprad

fbshipit-source-id: 825fcddf478555235e7a66de0c18368c41e935cd
2021-12-21 14:07:30 -08:00
0544f975e1 [reland] Support torch.equal for ShardedTensor. (#70145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70145

Added torch.equal support for ShardedTensor. This is really helpful for comparing two ShardedTensors.
ghstack-source-id: 146066939

Test Plan: waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D33201714

fbshipit-source-id: 56adfc36e345d512c9901c56c07759bf658c745b
2021-12-21 13:22:52 -08:00
c321d4c1ca [Operator Versioning] Split the upgrader test to a separate file and cover mobile part (#70090)
Summary:
1. Split the test `test_save_load.py` into two files, moving the operator-versioning-related changes to `test_save_load_for_op_versions.py`.
2. Add mobile-module-related tests to `test_save_load_for_op_versions.py`

How to run:
```
buck test mode/opt //caffe2/test:jit
or
python test/test_jit.py TestSaveLoadForOpVersion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70090

ghstack-source-id: 146103547

Test Plan:
```
buck test mode/opt //caffe2/test:jit
python test/test_jit.py TestSaveLoadForOpVersion
```

Reviewed By: tugsbayasgalan

Differential Revision: D33180767

fbshipit-source-id: dd31e313c81e90b598ea9dd5ad04a853c017f994
2021-12-21 13:08:01 -08:00
a6f953156e [StaticRuntime] Add TensorExpr fusion with dynamic shapes in SR (#69475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69475

This diff adds TensorExpr fusion with dynamic shapes in SR. This includes tracing the input graph with sample inputs, and then performing fusion with generalization to get fused graphs with dynamic shapes.
ghstack-source-id: 146059043

Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Reviewed By: d1jang

Differential Revision: D32320088

fbshipit-source-id: 397f498878ddfcee9dad7a839652f79f034fefe3
2021-12-21 12:41:02 -08:00
c6d1162325 [jit] Add support for dynamic shape fusion in JIT. (#69474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69474

This diff adds support for dynamic shape fusion in JIT. This is done
by performing fusion with the static shapes observed on the first run,
generalizing the fused subgraphs and generating code for the generalized fused
subgraphs with dynamic shapes.
ghstack-source-id: 146059044

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/cpp/jit:jit
```

Reviewed By: eellison

Differential Revision: D32781307

fbshipit-source-id: f821d9f8c271bcb78babcb4783d66f2f0020b0ea
2021-12-21 12:39:44 -08:00
c5333cdfba [nnc] tensorexpr for quantized::add (#70188)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70188

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33238093

Pulled By: IvanKobzarev

fbshipit-source-id: bd4e451bfd7531f31f216def2c3c1ba2f2e566e7
2021-12-21 12:30:56 -08:00
bb51519937 bug fix FractionalMaxPool2d (random_samples dimensions) (#70031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70031

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33200618

Pulled By: george-qi

fbshipit-source-id: 142f224c2cab1008d2d4e9ed333697a92d2d42db
2021-12-21 12:21:54 -08:00
91da2d5fa1 [StaticRuntime] Refactor StaticModule to pass in sample inputs (#69473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473

This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041

Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test

Reviewed By: donaldong

Differential Revision: D32320084

fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
2021-12-21 11:20:25 -08:00
c4a6c7a436 fix cpu binary size increase for clamp (#70168)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70168

Reviewed By: bdhirsh

Differential Revision: D33229811

Pulled By: ngimel

fbshipit-source-id: 3509da766fa327f4103fdcf880d368f64c111496
2021-12-21 10:59:27 -08:00
5504e4ae5c [nnc] Move DispatchParallel to external_functions (#70221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70221

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33249149

Pulled By: IvanKobzarev

fbshipit-source-id: fa6b2535dc09229d72b1c45eaa75434477cdff5e
2021-12-21 10:51:38 -08:00
304efd8e9a Change TH_BLAS_MKL into AT_MKL_ENABLED() (#70219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69419

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33246758

Pulled By: ngimel

fbshipit-source-id: aedef4c9ef97b6aa9f574313c94f774b77df2748
2021-12-21 10:36:55 -08:00
a197f3fe52 [FSDP/Checkpoint] Activation offload support in checkpoint_wrapper (#70165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165

Implements activation offload support in checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint using the save_on_cpu hook for the
former.
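
A minimal sketch of the composition (illustrative, not the actual implementation; `offload_checkpoint_forward` is our name):

```python
import torch
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

def offload_checkpoint_forward(module, *args):
    # Tensors saved for backward inside this region (the checkpointed
    # inputs) are parked in CPU memory; pass pin_memory=True on GPU setups.
    with save_on_cpu():
        return checkpoint(module, *args)

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)
offload_checkpoint_forward(layer, x).sum().backward()
```
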
ghstack-source-id: 146078900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33228820

fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
2021-12-21 10:08:18 -08:00
e428a90553 Android build migrated to GHA. (#68843)
Summary:
All four builds of Android (arm32/64 and x86_32/64) are now migrated to GHA, away from CircleCI. Since this part of the workflow creates the final binary with all architectures in it, it was not possible to do the migration step by step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68843

Reviewed By: malfet

Differential Revision: D33257480

Pulled By: b0noI

fbshipit-source-id: dd280c8268bdd31763754c36f38e4ea12b23cd2e
2021-12-21 10:02:51 -08:00
5e222d08a1 Revert "Revert D32498572: allow external backend codegen to be used without autograd kernels" (#69949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69949

This reverts commit 33363cea64fd4be16975c32cf57e9eb123af371d.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D33113544

Pulled By: bdhirsh

fbshipit-source-id: e219f10d52776498c9ad273e97bca3e3406cf702
2021-12-21 08:19:37 -08:00
8e763cd735 Add explicit OperatorHandle destructor (#70033)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70032

The Windows build of PyTorch doesn't produce the `c10::OperatorHandle::~OperatorHandle(void)` symbol in any of its `*.lib` files. The fix is to define it explicitly in Dispatcher.cpp, so downstream consumers wanting to dllimport it can find it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70033

Reviewed By: jbschlosser

Differential Revision: D33240599

Pulled By: bdhirsh

fbshipit-source-id: 56cc5963043bd5caac30e42c3501a4f48d086b36
2021-12-21 07:39:26 -08:00
adaf383837 dbr quant: better fix for bug with recursion on dequantize (#70128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70128

Previous code disabled torch_function when dequantizing arguments
to an unquantizeable function.  This PR blocklists the dequantize
method from the dequantize hook instead, so we can remove
the previous hack.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: ejguan

Differential Revision: D33194396

Pulled By: vkuzo

fbshipit-source-id: 6175c2da637c1d0c93b3fea0ef8218eaee6a2872
2021-12-21 06:25:37 -08:00
cce9c9aa45 dbr quant: stop overridding tensor getters (#70115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70115

This PR turns off DBR quant __torch_function__ overrides on
tensor attribute getters such as `x.dtype`. This should help
with making the debug logs more readable, and reduce framework
overhead.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: ejguan

Differential Revision: D33189544

Pulled By: vkuzo

fbshipit-source-id: e0d664bb6b76ca9e71c8a439ae985a0849312862
2021-12-21 06:25:34 -08:00
f291708058 dbr quant: clean up logging format (#70114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70114

This PR makes the debug logging for DBR quant be more useful
and easier to read.

New format looks like

```
DEBUG:auto_trace: fqn: _tf_ <function tanhshrink at 0x7fa4d02d4790> out torch.float32 end
```

This will be useful to speed up further work.

Test Plan:
```
// run this with logging enabled, logs easier to read
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33189545

Pulled By: vkuzo

fbshipit-source-id: 20af7e066e710beac5a3871a9d6259ee5518f97d
2021-12-21 06:25:31 -08:00
fb2a6747b8 dbr quant: add test for qconfig_dict and methods (#70109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70109

Adds a test case for DBR quant + qconfig_dict specifying methods
by object_type.  Fixes a bug in the FX rewriter for scripting
to make the test pass.

Full coverage of methods will come in future PRs, this PR is
just to verify qconfig_dict is hooked up correctly.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR.test_qconfig_dict_object_type_method
```

Reviewed By: jerryzh168

Differential Revision: D33188160

Pulled By: vkuzo

fbshipit-source-id: 47ab9dbca8cdb1cf22d6d673d9c15b3bc0d1ec81
2021-12-21 06:24:18 -08:00
78bea1bb66 update example in classification losses (#69816)
Summary:
Just updated a few examples that were either failing or raising deprecation warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69816

Reviewed By: bdhirsh

Differential Revision: D33217585

Pulled By: albanD

fbshipit-source-id: c6804909be74585c8471b8166b69e6693ad62ca7
2021-12-21 02:46:48 -08:00
19f898402d Revert D33241684: [pytorch][PR] Install TensorRT lib on oss docker and enable fx2trt unit test
Test Plan: revert-hammer

Differential Revision:
D33241684 (dab3d3132b)

Original commit changeset: cd498908b00f

Original Phabricator Diff: D33241684 (dab3d3132b)

fbshipit-source-id: d5b2e663b5b0c9e570bd799b9f6111cd2a0de4f7
2021-12-20 23:14:35 -08:00
b376d82caf Remove backward op for slow dilated 2d convolution (#70067)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70067

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33172551

Pulled By: jbschlosser

fbshipit-source-id: 2f1802c77253e543ebb7ee8ee0a12fa4defde311
2021-12-20 19:18:34 -08:00
dab3d3132b Install TensorRT lib on oss docker and enable fx2trt unit test (#70203)
Summary:
CI

Lib installed and unit test run on https://github.com/pytorch/pytorch/actions/runs/1604076060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70203

Reviewed By: janeyx99

Differential Revision: D33241684

Pulled By: wushirong

fbshipit-source-id: cd498908b00f3417bdeb5ede78f5576b3b71087c
2021-12-20 18:51:48 -08:00
123be0e5b7 [fusion] Add ConvTranspose+BN fusion support (#70022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70022

Add support for fusing ConvTranspose{1,2,3}d with BatchNorm{1,2,3}d. This re-uses the existing fusion logic but adds a "transpose" flag to the fusing function which, when enabled, uses the appropriate reshape for ConvTranspose's transposed weights.
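
Hypothetical usage after this change, via the eager-mode fusion API (the module names "0"/"1" are just the Sequential's child names):

```python
import torch.nn as nn
from torch.quantization import fuse_modules

model = nn.Sequential(nn.ConvTranspose2d(3, 8, 3), nn.BatchNorm2d(8)).eval()
fused = fuse_modules(model, [["0", "1"]])  # ConvTranspose2d + BN now fusable
```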

Test Plan: `buck test mode/dev //caffe2/test:quantization -- -r quantization.eager.test_fusion.TestFusion`

Reviewed By: jerryzh168

Differential Revision: D33074405

fbshipit-source-id: 5e9eff1a06d8f98d117e7d18e80da8e842e973b7
2021-12-20 18:42:48 -08:00
24f16de987 [Static Runtime] Support native op split_with_sizes (#69999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999

This adds support for the split_with_sizes operator in static runtime by adding native operators. Those operators have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime).

split_with_sizes can be called directly from the C++ API, or via `torch.split` when `split_sizes` is a list. This diff adds support for both use cases.
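
For reference, the two user-facing spellings (a quick sketch):

```python
import torch

x = torch.arange(10)
# torch.split with a list of sizes routes to split_with_sizes:
a, b, c = torch.split(x, [2, 3, 5])
# Equivalent direct call on the tensor:
a2, b2, c2 = x.split_with_sizes([2, 3, 5])
assert all(torch.equal(u, v) for u, v in zip((a, b, c), (a2, b2, c2)))
```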

Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```

#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms.    1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms.    1.34276%. aten::split_with_sizes (5 nodes, native)
```

Reviewed By: swolchok

Differential Revision: D33141006

fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
2021-12-20 18:32:54 -08:00
6623c4838e Handle the corner case when min == max in L2 search (#70207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70207

In the corner case when min == max, the adjust_hist_to_include_zero() function used in L2 search causes additional_nbins = -2147483648 and initializes bins_f with a negative size.

Test Plan:
Before fix:
f315187213

After fix:
f315471862

Reviewed By: jspark1105

Differential Revision: D33227717

fbshipit-source-id: 7e8a455e51a0703a3a9c5eb7595d9b4d43966001
2021-12-20 17:46:55 -08:00
f17e76b0f2 Expand description of bias_sizes arg for convolution_backward (#70195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70195

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33240155

Pulled By: jbschlosser

fbshipit-source-id: c4f907d6e33e4d1eeb1b5228f1152307c8b27729
2021-12-20 17:33:17 -08:00
3e8ef9a272 Add return type annotation for ShardedTensor (#69945)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69945

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D32502393

fbshipit-source-id: 7bea08762446b211d8ea028d024d2acdabe45479
2021-12-20 17:15:44 -08:00
c555b7bacb GHA: Remove caffe2 check in Windows shard 1 smoke tests (#70010)
Summary:
Windows shard 1 hasn't actually been running any tests because the script that does so exited before running the python tests but did not report an error. This has been happening to all windows tests across the board, for example https://github.com/pytorch/pytorch/runs/4526170542?check_suite_focus=true

Removing the caffe2.python check passes the smoke tests now. You can observe that the run_test.py file is called in the windows cpu job now https://github.com/pytorch/pytorch/runs/4541331717?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70010

Reviewed By: malfet, seemethere

Differential Revision: D33161291

Pulled By: janeyx99

fbshipit-source-id: 85024b0ebb3ac42297684467ee4d0898ecf394de
2021-12-20 16:05:38 -08:00
e6d9bb8d57 reduce the number of instantiations for bernoulli tensor tensor kernel (#70169)
Summary:
Reduces the binary size of DistributionBernoulli.cu from 12282600 to 3946792 bytes.
Tensor-tensor bernoulli kernels are rarely used, so we limit dispatches to a double probability type for a double `self` tensor, and a `float` probability type for everything else. This incurs a minor perf hit if the probability tensor has a different dtype, but given how rarely these kernels are used (and how rarely the probability tensor is not float) this is not a problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70169

Reviewed By: jbschlosser

Differential Revision: D33237890

Pulled By: ngimel

fbshipit-source-id: 185c4b97aba0fb6ae159d572dd5bbb13cf676bb4
2021-12-20 13:46:34 -08:00
79a40b22aa [Checkpoint] Make checkpoint_wrapper an nn.Module (#70164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164

Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
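
The general shape of the nn.Module approach (a sketch, not the actual implementation):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointWrapper(nn.Module):
    # Wrap the module rather than monkey-patching its forward.
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args):
        return checkpoint(self.module, *args)
```
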
ghstack-source-id: 146011215

Test Plan: IC

Reviewed By: mrshenli

Differential Revision: D33214696

fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
2021-12-20 13:22:28 -08:00
fcaecd718a Write flaky tests to rockset (#70136)
Summary:
Try using Rockset as backend for data instead of RDS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70136

Reviewed By: suo

Differential Revision: D33242148

Pulled By: janeyx99

fbshipit-source-id: 8935ceb43717fff4922b634165030cca7e934968
2021-12-20 13:17:21 -08:00
5651e1e3ad Add auto_linear formulas and some others (#69727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69727

Still need to test the backward ones. We would need to update gradgradcheck to check forward over backward.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33031728

Pulled By: soulitzer

fbshipit-source-id: 86c59df5d2196b5c8dbbb1efed9321e02ab46d30
2021-12-20 12:15:25 -08:00
65f54bc000 [SR] Optimize VarStack (#68750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750

There was some room for optimization in static runtime's `prim::VarStack`:

* Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor*>`

* Skip the memory overlap check

* Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`)

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

Reviewed By: swolchok

Differential Revision: D32596934

fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0
2021-12-20 11:46:11 -08:00
a799ffebd2 Create lower code example (#70142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70142

Create lower code example in oss, and run benchmark agaist resnet101

Test Plan: CI

Reviewed By: 842974287

Differential Revision: D33117440

fbshipit-source-id: 359d0c9e65899ab94c8f3eb112db70db5d938504
2021-12-20 11:37:08 -08:00
423ce416d8 Prune osx-arm64 binaries from nightly channel (#70132)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70132

Reviewed By: janeyx99

Differential Revision: D33195431

Pulled By: malfet

fbshipit-source-id: 4579a6788255a6df306862c3e959ae7a9ddd4e45
2021-12-20 11:28:43 -08:00
41959ce77f [JIT] scripting, freezing, serialization for sparse csr (#69555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69555

1. Implement pickling/unpickling
2. Add `test_freeze_sparse_csr, tests_serialize_sparse_csr` tests

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33181367

Pulled By: davidberard98

fbshipit-source-id: a15d5193a7b1b1625a27e4af003cec33cdbc8071
2021-12-20 11:13:34 -08:00
bcb6076099 Sparse CSR tensors: storage access should throw (#70072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70072

Like sparse COO tensors, sparse CSR tensors don't really have an actual storage() that can be accessed, so sparsetensor->storage() should throw.
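
A minimal repro of the intended behavior (after this change, storage access is expected to raise):

```python
import torch

t = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 4]),         # crow_indices
    torch.tensor([0, 1, 0, 1]),      # col_indices
    torch.tensor([1., 2., 3., 4.]))  # values
try:
    t.storage()
except RuntimeError as e:
    print("storage access raised:", e)
```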

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33181309

Pulled By: davidberard98

fbshipit-source-id: 8f1dc4da03073d807e5acee2ac47caeffb94b16c
2021-12-20 11:12:01 -08:00
bcc7dbdf37 Change open source unit test deps (#70167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70167

1. Change the unit test dependency to the open source base class, so that this unit test can run on the OSS CI
2. Remove the usage of typing.Protocol, so that lower can run with Python 3.6

Test Plan:
oss CI
passed with change included in commit:
https://github.com/pytorch/pytorch/actions/runs/1597530689
see test(fx2trt)

Reviewed By: yinghai

Differential Revision: D33228894

fbshipit-source-id: ffe3d40a02a642b3b857a0605101797037a580bb
2021-12-20 10:41:38 -08:00
dd02af6283 Bilinear no_batch_dim (#69539)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69539

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33200105

Pulled By: george-qi

fbshipit-source-id: c674e3937fea95c4ec41a01c5aa6d6890042b288
2021-12-20 09:44:07 -08:00
978089c381 Prevent divide-by-zero errors in Timer (#70050)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70050

Reviewed By: mruberry

Differential Revision: D33168868

Pulled By: robieta

fbshipit-source-id: 7d0ece9e888f6c69a9e0ced581c92d3259fb3540
2021-12-20 09:16:03 -08:00
ad0cd8a76e [DataPipe] Improve inline doc and testing for CollatorIterDataPipe (#70139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70139

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33199107

Pulled By: NivekT

fbshipit-source-id: f96d77490998ac9bc3da8d4ff1a9caa08e9e7f27
2021-12-20 08:05:21 -08:00
8a912014b1 [Operator Versioning][Edge] Initialize upgrader thread safe (#70161)
Summary:
The upgrader should only be initialized once, when the runtime loads the first module. It no longer needs to be initialized afterwards.

Previously, instead of using an atomic variable, the upgrader was initialized depending on whether byteCodeFunctionWithOperator.function.get_code().operators_ was empty. If it's empty, it means the operators from the upgrader are not initialized yet. However, this is not thread safe: when multiple threads load modules together, it's possible that they all consider theirs the first module. Use an atomic variable here to make sure initialization is thread safe.
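
The C++ fix uses an atomic flag; the same once-only initialization pattern, sketched in Python (`register_upgrader_operators` is a hypothetical stand-in):

```python
import threading

_init_lock = threading.Lock()
_initialized = False

def register_upgrader_operators():
    print("operators registered once")  # hypothetical one-time init work

def ensure_upgraders_initialized():
    global _initialized
    if _initialized:            # fast path: already done
        return
    with _init_lock:
        if not _initialized:    # double-checked under the lock
            register_upgrader_operators()
            _initialized = True

threads = [threading.Thread(target=ensure_upgraders_initialized) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # the registration message prints exactly once
```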

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70161

ghstack-source-id: 146012642

Test Plan:
```
buck test mode/opt //papaya/integration/service/test/analytics/histogram:generic_histogram_system_test -- --exact 'papaya/integration/service/test/analytics/histogram:generic_histogram_system_test - SumHistogramSystemTest.test' --run-disabled
buck test mode/opt //caffe2/test/cpp/jit:jit
```

Reviewed By: iseeyuan

Differential Revision: D33220320

fbshipit-source-id: 10f2397c3b358d5a1d39a2ce25457e3fdb640d2c
2021-12-19 20:16:00 -08:00
7ea86dfdb1 [Profiler] Factor common logic into torch/csrc/profiler/api.h (#69459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69459

This change breaks the dependency between the kineto and legacy profiler; instead of `profiler_kineto.h` including `profiler_legacy.h`, they both include `profiler/api.h`. As part of this refactor, I injected some intermediate classes to keep legacy behavior from leaking into the kineto profiler:

1) ProfilerThreadLocalState has become ProfilerThreadLocalStateBase which just handles config and callback handle. Legacy and Kineto profilers inherit this and implement their own very disjoint set of logic.

2) CUDAStubs is a pure virtual class to make the interface more readable, and the "always fail" behavior has been moved to a `DefaultCUDAStubs` class in `api.cpp`.

Test Plan: Ran the overhead ubenchmark.

Reviewed By: aaronenyeshi

Differential Revision: D32678163

fbshipit-source-id: 9b733283e4eae2614db68147de81b72f6094ce6c
2021-12-19 18:40:28 -08:00
181120f7d7 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33229251

fbshipit-source-id: 3a69bb459fa0a65888d6f9c8e70b5de032ddad97
2021-12-19 16:38:25 -08:00
60191196d4 [AutoAccept][Codemod][FBSourceBuckFormatLinter] Daily arc lint --take BUCKFORMAT
Reviewed By: zertosh

Differential Revision: D33229262

fbshipit-source-id: 7c22aa59a2a9eea94d2f403c339eb20abc7d9c41
2021-12-19 16:34:00 -08:00
ef70174f2e Separate c10::Symbol header from list of interned strings (#69406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69406

Most files that include `interned_strings.h` don't actually depend on
anything generated from `FORALL_NS_SYMBOLS` yet because they're in a
single file you need to recompile whenever a new symbol is added. Here
I move the class definition into a separate file so this doesn't
happen.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32923637

Pulled By: albanD

fbshipit-source-id: 6e488cbfcfe2c041a99d9ff22e167dbddf3f46d7
2021-12-19 14:52:26 -08:00
06d0536dad Low precision support for jiterator (#70157)
Summary:
This adds support for bfloat16 and fp16 types for the jiterator by adding at::Half and at::BFloat16 classes to the jiterator code template. The only methods defined in those classes are construction from float and implicit conversion to float. Mathematical operations on them never need to be defined, because the jiterator is written in a way that implicitly upcasts the inputs to the functor, so all math is performed in float only (e.g. the compute part of the kernel is always written as
```
        out[j] = i0<float>(arg0[j]);
```
It also adds support for casting to complex outputs, by adding a similar templated class c10::complex<T>. Originally I planned to only support float -> complex conversion for it, but to compile the fetch_and_cast function we also need complex -> float conversion. We could avoid it by compiling fetch_and_cast for a different subset of types, but I'm not doing that in this PR. Thus, technically, we could compile a kernel that accepts complex inputs and produces wrong results, but we guard against that by static-asserting that none of the functor datatypes are complex, and runtime-checking that none of the inputs are complex.
Adding bfloat16, half and complex support allows us to remove the special handling in type promotion tests for gcd.
i0 (which supports half and bfloat16 inputs) is moved to use the jiterator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70157

Reviewed By: mruberry

Differential Revision: D33221645

Pulled By: ngimel

fbshipit-source-id: 9cfe8aba3498a0604c4ea62c217292ea06c826b1
2021-12-19 11:56:57 -08:00
78f06e0690 fixing conv2d decomposition and tests (#70127)
Summary:
The current implementation has a bug where the decomposed `add_optional` from `conv2d` is placed before its producer node, which causes a linter error on the graph.

Cherry-picked from https://github.com/csarofeen/pytorch/pull/1333
Fixing issue posted in https://github.com/csarofeen/pytorch/issues/1325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70127

Reviewed By: ejguan

Differential Revision: D33199018

Pulled By: jansel

fbshipit-source-id: bce1f14a443811b4d55116a04fd4daa86084cc47
2021-12-19 10:38:23 -08:00
de4e7dece9 [Quant][fx] Added test for quint4x2 for fx graph mode quantization (#69846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69846

Test Plan:
In pytorch main dir, execute

    to run the added test

Reviewed By: jbschlosser

Differential Revision: D33152672

Pulled By: dzdang

fbshipit-source-id: 89951fcd23e7061d6c51e9422540b5f584f893aa
2021-12-19 06:15:26 -08:00
75718e5059 [Quant][Eager] Added 4 bit support for eager mode quantization flow (#69806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69806

Minor modifications were made to support the 4 bit embedding quantized module in the eager mode quantization flow and to allow for testing of the changes.

Test Plan:
In pytorch main dir, execute
```
python test_quantization.py TestPostTrainingStatic.test_quantized_embedding
```
to run the series of tests, including the newly added test_embedding_4bit
function

Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33152675

fbshipit-source-id: 5cdaac5aee9b8850e61c99e74033889bcfec5d9f
2021-12-19 06:14:12 -08:00
9f512e129b [Quant] Added 4 bit support for embedding quantized module (#69769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69769

Added 4 bit support and the corresponding test in the module api. Restructured test_quantized_module for both 4 & 8 bit support.

Test Plan:
In pytorch main dir, execute
```
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_api
```

Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33152674

fbshipit-source-id: 73e63383cf60994ab34cc7b4eedd8f32a806cf7f
2021-12-18 22:26:24 -08:00
b331752314 [Quant] Implemented 4 bit embedding op support; added corresponding test case (#69768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69768

Support for the 4 bit embedding operator has been added. The support is analogous to the preexisting support for the byte/8 bit embedding. A corresponding test case was added to test_quantized_embedding_op.py.
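
A rough sketch of the packed weight format involved; that `torch.quantize_per_channel` accepts `torch.quint4x2` for embedding-style weights is my assumption, not something this commit states:
```
import torch

w = torch.randn(10, 16)                 # (num_embeddings, embedding_dim)
scales = torch.full((10,), 0.1)
zero_points = torch.zeros(10)
qw = torch.quantize_per_channel(w, scales, zero_points, axis=0,
                                dtype=torch.quint4x2)  # two 4 bit values per byte
print(qw.dtype)  # torch.quint4x2
```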

Test Plan:
In pytorch main dir, execute
```
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_api
```
to run the series of tests, including the newly added test_embedding_4bit
function

Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33152673

fbshipit-source-id: bdcc2eb2e37de38fda3461ff3ebf1d2fb5e58071
2021-12-18 22:03:33 -08:00
94abf120c8 [quant][fx][graphmode][be] Use is_qat instead of model.training as a flag for qat (#69878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69878

But we'll still verify that model.training is True when the user calls the prepare_qat API.
Relaxing this condition might also mean changing the api for methods in fuser_method_mapping
to take an additional flag for qat (currently we just have different fusions for training/eval). I don't think
this is P0; we could revisit if there is a need in the future.
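
A minimal sketch of the call pattern that check expects (qconfig name taken from the standard eager-mode API):
```
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 1), torch.nn.ReLU())
model.train()                                   # prepare_qat still verifies this
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare_qat(model)
```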

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33080988

fbshipit-source-id: b13715b91f10454948199323c5d81ef88bb3517f
2021-12-18 00:00:46 -08:00
fb34af1b21 [nnc][quantization] Optimize constructTensors in ext functions (#69856)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69856

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33064756

Pulled By: IvanKobzarev

fbshipit-source-id: 430d850f8591b8e0a0bdba5c41896627a72db88e
2021-12-17 23:45:03 -08:00
84b7832010 Updates CUDA memory leak check to verify against driver API and print more diagnostic information (#69556)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69556

Reviewed By: mrshenli

Differential Revision: D32954770

Pulled By: mruberry

fbshipit-source-id: a6c2ae6f704422c178569980ca4b9c72c4272f55
2021-12-17 23:37:49 -08:00
6c68045f60 [quant][graphmode][fx][be] Fix a typo in quantization/fx/graph_module (#69877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69877

att

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33079525

fbshipit-source-id: dfd3afb916067a628071a59ce95c6b1d228a3c72
2021-12-17 23:33:33 -08:00
9d3a6fa623 [quant][bc-breaking] Remove QConfigDynamic from quantization api (#69875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69875

att

Test Plan:
ci + regression tets:
```
python test/test_quantization.py TestPostTrainingStatic
python test/test_quantization.py TestPostTrainingDynamic
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33079096

fbshipit-source-id: 1e73bb27c518eba62b60f3a8c4b532dddc8367cf
2021-12-17 23:10:06 -08:00
5db711f9d3 [quant][be] Replace QConfigDynamic with QConfig in code (#69864)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69864

att, will have a follow up PR that removes QConfigDynamic in the api

Test Plan:
regression tests
```
python test/test_quantization.py TestPostTrainingStatic
python test/test_quantization.py TestPostTrainingDynamic
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33073235

fbshipit-source-id: 6c1a1647032453803c55cdad7c04154502f085db
2021-12-17 22:30:57 -08:00
c463d50098 [fx2trt] Convert to tuple if output_size of adaptive avg pool is an integer (#70144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70144

It can be an integer, and in this case we need to extend it to a tuple.

Test Plan:
Added a unit test.
```
RemoteExecution session id: reSessionID-d97b46e3-20d1-4f5c-a166-4efcf1579352-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391775638
    ✓ ListingSuccess: caffe2/test/fx2trt/converters:test_adaptive_avgpool - main (9.454)
    ✓ Pass: caffe2/test/fx2trt/converters:test_adaptive_avgpool - test_adaptive_avgpool_with_dynamic_shape (caffe2.test.fx2trt.converters.acc_op.test_adaptive_avgpool.TestAdaptiveAvgPoolConverter) (16.083)
    ✓ Pass: caffe2/test/fx2trt/converters:test_adaptive_avgpool - test_adaptive_avgpool_1 (caffe2.test.fx2trt.converters.acc_op.test_adaptive_avgpool.TestAdaptiveAvgPoolConverter) (16.349)
    ✓ Pass: caffe2/test/fx2trt/converters:test_adaptive_avgpool - test_adaptive_avgpool_2 (caffe2.test.fx2trt.converters.acc_op.test_adaptive_avgpool.TestAdaptiveAvgPoolConverter) (16.543)
    ✓ Pass: caffe2/test/fx2trt/converters:test_adaptive_avgpool - test_adaptive_avgpool_0 (caffe2.test.fx2trt.converters.acc_op.test_adaptive_avgpool.TestAdaptiveAvgPoolConverter) (16.651)
Summary
  Pass: 4
  ListingSuccess: 1
```

Reviewed By: wushirong

Differential Revision: D33200773

fbshipit-source-id: 8c10d644982a4723a78f8615d8bcdbc3968790db
2021-12-17 18:31:25 -08:00
9ee3006d58 [fx-acc][graph-opts] bug fixes for transpose_to_reshape, optimize_quantization, finalize_kwargs_to_concrete
Summary:
Fixes a couple of bugs that surfaced during integration of graph opts into `AcceleratedGraphModule` (D31484770).

2. Fix a bug in the `graph_opt.transpose_to_reshape` implementation that caused it to incorrectly apply the opt for a `permute` op acting on shape `(B, N, N)` with `N > 1` and permutation `(0, 2, 1)`; see the sketch after this list. Fixed the bug and added a test case to cover it.
3. Revert part of D31671833 (0e371e413d), where I made `acc_out_ty` into a required argument
4. Align `graph_opt.transpose_to_reshape` and `graph_opt.optimize_quantization` to not set `acc_out_ty` when adding a new node to graph and instead rely on tensor metadata
5. Run `acc_utils.copy_acc_out_ty_from_meta_to_acc_ops_kwargs()` in `GraphOptsTest.verify_numerics` before running graph on sample inputs.
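
A small worked example of why the bug in item 2 matters: for `(B, N, N)` with `N > 1`, `permute(0, 2, 1)` changes the element order in memory, so it cannot be replaced by a reshape:
```
import torch

x = torch.arange(8).reshape(2, 2, 2)                 # (B, N, N) with N = 2
permuted = x.permute(0, 2, 1).contiguous()
print(torch.equal(permuted.flatten(), x.flatten()))  # False: element order changed,
                                                     # so this permute is not a reshape
```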

Test Plan:
```
buck test mode/opt glow/fb/fx/graph_opts:
```

```
...
Summary
  Pass: 85
  ListingSuccess: 4
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/562950163929022
```

Reviewed By: jfix71

Differential Revision: D31851549

fbshipit-source-id: 602affe2a2a0831d2f17b87025107ca87ecb0e59
2021-12-17 17:35:48 -08:00
bd9983366b [fx2trt] Add support for torch.mean (#70052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70052

As the title. Also refactored a bit to separate out the common part of adding a reduce operator.

This would make mnasnet lowerable without splitter.

Test Plan: Added unit tests.

Reviewed By: wushirong

Differential Revision: D33163950

fbshipit-source-id: 7eb8f8a852cd8e8d9937029c4b4602b036502b3a
2021-12-17 15:48:31 -08:00
9fb199bc12 Add convolution_backward to aten_interned_strings.h (#70112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70112

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33188664

Pulled By: jbschlosser

fbshipit-source-id: 20e565c2fef4c1c3c087ba9b36320b7e539e467e
2021-12-17 15:38:47 -08:00
9b14d93d78 Fix bazel workflows (#70137)
Summary:
Fixes regression after manual rebase of e35bf56461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70137

Reviewed By: pbelevich

Differential Revision: D33197055

Pulled By: malfet

fbshipit-source-id: 21adf7297f75715a59d2a1b3751b4ec8f71c7c03
2021-12-17 14:48:11 -08:00
70ed4f3ffc Try dropping Torch from typeshed_internal (#69926)
Summary:
Removes the internal typeshed for PyTorch and replaces it with PyTorch's own type annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69926

Generated files are in P471601595, P471601643, P471601662

Based on an example in D26410012

Test Plan: Sandcastle

Reviewed By: malfet, pradeep90

Differential Revision: D32292834

fbshipit-source-id: 5223f514cbdccd02c08ef0a027a48d92cdebed2c
2021-12-17 14:08:19 -08:00
e35bf56461 [Bazel] Add CUDA build to CI (#66241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35316
On master, bazel cuda build is disabled due to lack of a proper `cu_library` rule. This PR:
- Add `rules_cuda` to the WORKSPACE and forward `cu_library` to `rules_cuda`.
- Use simple local cuda and cudnn repositories (adopted from TRTorch) for cuda 11.3.
- Fix the currently broken cuda build.
- Enable cuda build in CI, not just for `:torch` target but all the test binaries to catch undefined symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66241

Reviewed By: ejguan

Differential Revision: D31544091

Pulled By: malfet

fbshipit-source-id: fd3c34d0e8f80fee06f015694a4c13a8e9e12206
2021-12-17 13:44:29 -08:00
e0f4e28c69 Skip forward-over-reverse gradgrad check for pinv singular on CUDA fo… (#70123)
Summary:
…r cdouble

Fixes https://github.com/pytorch/pytorch/issues/70046

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70123

Reviewed By: zou3519

Differential Revision: D33193017

Pulled By: soulitzer

fbshipit-source-id: 846f97ad1bf38c7239e9fc40fd5f476e29264f7c
2021-12-17 13:38:57 -08:00
38e026c14d Add tanh_backward to AT symbols (#70071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70071

This commit adds tanh_backward to aten_interned_strings.h as an AT symbol.

Test Plan: CI.

Reviewed By: mruberry

Differential Revision: D33173370

Pulled By: alanwaketan

fbshipit-source-id: e20ed2a807156ce772b7c1e3f434fa895116f4c3
2021-12-17 13:35:05 -08:00
a6b7521428 always use max cmake when cmake3 and cmake are all existed (#69355)
Summary:
Building PyTorch from source with the Ninja generator requires **CMake >= 3.13**, but PyTorch always checks for **cmake3 >= 3.10** first. So when **3.13 > cmake3 >= 3.10**, PyTorch picks cmake3 and reports the error ```Using the Ninja generator requires CMake version 3.13 or greater``` even though a **CMake >= 3.13** is available.

For example: on my centos machine the system cmake3 is ```3.12``` and my conda env's cmake is ```3.19.6```, so the build fails because PyTorch chooses cmake3. I can update cmake3 or create an alias or a symlink to solve this problem, but the more reasonable fix is for ```_get_cmake_command``` to always return the newest CMake executable (unless explicitly overridden with the CMAKE_PATH environment variable).
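
A sketch of the intended selection logic; names and structure are illustrative, not the actual `_get_cmake_command` implementation:
```
import re
import subprocess
from shutil import which

def cmake_version(exe):
    out = subprocess.check_output([exe, "--version"]).decode()
    return tuple(int(v) for v in re.search(r"(\d+)\.(\d+)\.(\d+)", out).groups())

def newest_cmake():
    # consider both executables and pick whichever is newest,
    # instead of unconditionally preferring cmake3
    candidates = [p for p in (which("cmake3"), which("cmake")) if p]
    return max(candidates, key=cmake_version) if candidates else None
```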

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69355

Reviewed By: jbschlosser

Differential Revision: D33062274

Pulled By: malfet

fbshipit-source-id: c6c77ce1374e6090a498be227032af1e1a82d418
2021-12-17 12:53:49 -08:00
254360e182 [ROCm] Skip test_fn_fwgrad_bwgrad_* unexpected success tests (#70124)
Summary:
Skip tests that cause unexpected success for ROCm

Signed-off-by: Kyle Chen <kylechen@amd.com>

additional to this PR:
https://github.com/pytorch/pytorch/pull/70061

skipping 4 more tests that cause unexpected success and fail the CI job for ROCm

log:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.3.1-py3.6-test2/15350/console

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70124

Reviewed By: ejguan

Differential Revision: D33193508

Pulled By: malfet

fbshipit-source-id: 9949910e2e7dc66cbadd23cea874df26e2d4136d
2021-12-17 12:08:47 -08:00
26e32988bd Revert D32596264: Codegen: TraceType only includes operators being registered
Test Plan: revert-hammer

Differential Revision:
D32596264 (e66a8ab4f5)

Original commit changeset: 2f28b62d7b99

Original Phabricator Diff: D32596264 (e66a8ab4f5)

fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12
2021-12-17 11:20:12 -08:00
2f622e87bd Revert D32596274: Codegen: ADInplaceOrViewType only include operators registered
Test Plan: revert-hammer

Differential Revision:
D32596274 (9ad940d982)

Original commit changeset: 400cad023782

Original Phabricator Diff: D32596274 (9ad940d982)

fbshipit-source-id: 5c53195edaae47b9daba373cf166d2382178d01b
2021-12-17 11:02:08 -08:00
60eb1e53b2 Sparse CSR CPU: Add block sparse support for MKL path (#68710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68710

This PR adds support for block sparse (BSR) matrices for functions that
use the Inspector-Executor MKL Sparse API. As of this PR these are (a usage sketch follows the list):
* torch.addmm
* torch.addmv
* torch.triangular_solve (once https://github.com/pytorch/pytorch/pull/62180 is merged)
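
A minimal usage sketch on the plain CSR side (constructing BSR tensors from Python is not covered here; `torch.sparse_csr_tensor` is the existing CSR constructor, and the MKL path is taken on MKL-enabled CPU builds):
```
import torch

crow_indices = torch.tensor([0, 2, 3])
col_indices = torch.tensor([0, 1, 1])
values = torch.tensor([1., 2., 3.])
sparse = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))

dense = torch.randn(2, 2)
bias = torch.zeros(2, 2)
print(torch.addmm(bias, sparse, dense))  # bias + sparse @ dense
```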

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33179486

Pulled By: cpuhrsch

fbshipit-source-id: e1dec0dccdbfed8b280be16b8c11fc9e770d50ae
2021-12-17 10:56:05 -08:00
0cfff65395 Apply contiguous on inputs of cdist backward (#70016)
Summary:
Description:
- Apply contiguous on inputs of cdist backward
- Added a test

Fixes https://github.com/pytorch/pytorch/issues/69997

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70016

Reviewed By: ejguan

Differential Revision: D33187946

Pulled By: albanD

fbshipit-source-id: 645306aa043b2f84c4c2df0306fabfc224d746b6
2021-12-17 10:54:45 -08:00
bc95e5a196 [ROCm] Skip test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (#70061)
Summary:
This PR will skip test_fn_fwgrad_bwgrad_gradient_cuda_complex128 test for ROCm

Signed-off-by: Kyle Chen <kylechen@amd.com>

Related github isssue:
[https://github.com/pytorch/pytorch/issues/70027](https://github.com/pytorch/pytorch/issues/70027)

jithunnair-amd jeffdaily

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70061

Reviewed By: ejguan

Differential Revision: D33189411

Pulled By: malfet

fbshipit-source-id: a60d5b35099d3c8d3ceebb996e91470a8a676f85
2021-12-17 10:47:31 -08:00
de992c6b21 Specify ij indexing when cartesian_prod calls meshgrid (#68753)
Summary:
Currently, `cartesian_prod` calls `meshgrid` without passing an indexing parameter. This causes a warning to be shown when running the `cartesian_prod` example from the docs. This PR simply passes the default value for this indexing parameter instead.
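
For reference, the warning-free call pattern (a small sketch, not code from the PR):
```
import torch

a, b = torch.tensor([1, 2]), torch.tensor([3, 4])
gx, gy = torch.meshgrid(a, b, indexing="ij")  # explicit indexing: no warning
print(torch.cartesian_prod(a, b))             # no longer warns after this PR
# tensor([[1, 3], [1, 4], [2, 3], [2, 4]])
```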

Fixes https://github.com/pytorch/pytorch/issues/68741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68753

Reviewed By: kimishpatel

Differential Revision: D33173011

Pulled By: mruberry

fbshipit-source-id: 667185ec85bd62bda177bc5768d36f56cfc8b9ab
2021-12-17 10:39:44 -08:00
9ad940d982 Codegen: ADInplaceOrViewType only include operators registered (#68692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68692

ADInplaceOrViewType is a sharded file, so by only including specific
operator headers, we ensure that changing one (non-method) operator
only needs one shard to be re-compiled.

This also ports the generated code over to the `at::_ops` interface,
and the code generator itself to using `write_sharded` instead of
re-implementing its own version of sharding.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596274

Pulled By: albanD

fbshipit-source-id: 400cad0237829720f94d60f9db7acd0e918e202e
2021-12-17 10:36:20 -08:00
e66a8ab4f5 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596264

Pulled By: albanD

fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
2021-12-17 10:35:05 -08:00
0d06616c47 Add dict methods to ParameterDict (#69403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68476

We implemented all of the following `dict` methods for `ParameterDict`
- `get `
- `setdefault`
- `popitem`
- `fromkeys`
- `copy`
- `__or__`
- `__ior__`
- `__reversed__`
- `__ror__`

The behavior of these new methods matches the expected behavior of python `dict` as defined by the language itself: https://docs.python.org/3/library/stdtypes.html#typesmapping
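
A short usage sketch of the new methods:
```
import torch
from torch import nn

pd = nn.ParameterDict({"w": nn.Parameter(torch.randn(2))})
pd.setdefault("b", nn.Parameter(torch.zeros(2)))  # inserts only if the key is missing
print(pd.get("missing"))                          # None, like dict.get
merged = pd | nn.ParameterDict({"v": nn.Parameter(torch.ones(1))})
print(list(merged.keys()))                        # ['w', 'b', 'v']
```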

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69403

Reviewed By: albanD

Differential Revision: D33187111

Pulled By: jbschlosser

fbshipit-source-id: ecaa493837dbc9d8566ddbb113b898997e2debcb
2021-12-17 10:15:47 -08:00
35519428a2 Remove backward ops for miopen depthwise convolution (#70064)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70064

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33171169

Pulled By: jbschlosser

fbshipit-source-id: 668ca9baa992d3bb1cfa7b53fd2127ffeb051147
2021-12-17 10:08:49 -08:00
ab2a739851 Remove backward ops for miopen transposed convolution (#70063)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70063

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33171170

Pulled By: jbschlosser

fbshipit-source-id: 4fd6c1cd027f714354644c4ac7694d0f9092c762
2021-12-17 10:07:27 -08:00
ec577300d7 OpInfo: Convert more sample_input_funcs to generators (#69976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69976

These are sample functions that already use generators internally, this just moves the `yield` into the sample function itself.

Re-submit of #69257

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33172953

Pulled By: mruberry

fbshipit-source-id: 7b8bae72df6a225df88a158b7ffa82a71d3c061b
2021-12-17 10:03:59 -08:00
950957f857 Fix jit tests assuming sample_inputs is a list (#69975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69975

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33172952

Pulled By: mruberry

fbshipit-source-id: 1f8bb49179f7fbd0fec5e7344e8c213484518e27
2021-12-17 10:02:50 -08:00
ad79d0dd4b Add ciflow/trunk label (#69575)
Summary:
Which includes all workflows but periodic ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69575

Reviewed By: seemethere

Differential Revision: D32932850

Pulled By: malfet

fbshipit-source-id: 80b58fb3a0d5f8dbc527124be5bf25bd716448b8
2021-12-17 09:57:46 -08:00
de296d526f move torch.testing from prototype to beta (#69668)
Summary:
cc brianjo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69668

Reviewed By: albanD

Differential Revision: D33028213

Pulled By: mruberry

fbshipit-source-id: 3316b887d4c322cc1262feee651464da4124a6de
2021-12-17 09:52:47 -08:00
de2d9e2966 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
2021-12-17 09:45:20 -08:00
1065739781 Fix build on latest main branch of thrust - SoftMax.cu (#70039)
Summary:
Similar to https://github.com/pytorch/pytorch/issues/69985

I don't think there's any other source file which should `#include <thrust/iterator/constant_iterator.h>` as of 73a6c36f1b:

```
mkozuki@mkozuki-srv ~/ghq/github.com/crcrpar/torch-0 master
torch-0 ❯ git rev-parse HEAD; rg -inw make_constant_iterator
73a6c36f1bfbf9aff04ba41cfe6ab06aa99883d9
aten/src/ATen/native/cuda/LegacyThrustHelpers.cu
54:    thrust::make_constant_iterator(1),

aten/src/ATen/native/sparse/cuda/SoftMax.cu
301:      thrust::make_constant_iterator(int64_t(1)),
```

## build error

```console
https://github.com/pytorch/pytorch/issues/22 2048. /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -Iaten/src -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -I../cmake/../third_party/cudnn_frontend/include -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -Iinclude -I../torch/csrc/distributed -I../aten/src/TH -I../aten/src/THC -I../aten/src/ATen/cuda -Icaffe2/aten/src -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -Inccl/include -I../c10/cuda/../.. -I../c10/.. -I../third_party/tensorpipe -Ithird_party/tensorpipe -I../third_party/tensorpipe/third_party/libnop/include -I../torch/csrc/api -I../torch/csrc/api/include -isystem=third_party/gloo -isystem=../cmake/../third_party/gloo -isystem=../cmake/../third_party/googletest/googlemock/include -isystem=../cmake/../third_party/googletest/googletest/include -isystem=../third_party/protobuf/src -isystem=/opt/conda/include -isystem=../third_party/gemmlowp -isystem=../third_party/neon2sse -isystem=../third_party/XNNPACK/include -isystem=../third_party -isystem=../cmake/../third_party/eigen -isystem=/opt/conda/include/python3.8 -isystem=/opt/conda/lib/python3.8/site-packages/numpy/core/include -isystem=../cmake/../third_party/pybind11/include -isystem=/opt/hpcx/ompi/include/openmpi -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include -isystem=/opt/hpcx/ompi/include -isystem=/usr/local/cuda/include -isystem=../third_party/ideep/mkl-dnn/third_party/oneDNN/include -isystem=../third_party/ideep/include -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Xcudafe --diag_suppress=20236 -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -Xcompiler=-fPIC -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD 
-Xcompiler=-Wall,-Wextra,-Wno-unused-parameter,-Wno-unused-variable,-Wno-unused-function,-Wno-unused-result,-Wno-unused-local-typedefs,-Wno-missing-field-initializers,-Wno-write-strings,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-sign-compare,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-error=deprecated-declarations,-Wno-missing-braces,-Wno-maybe-uninitialized -DTORCH_CUDA_BUILD_MAIN_LIB -Xcompiler -pthread -std=c++14 -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SoftMax.cu.o.d -x cu -c ../aten/src/ATen/native/sparse/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SoftMax.cu.o
#22 2048. ../aten/src/ATen/native/sparse/cuda/SoftMax.cu(301): error: namespace "thrust" has no member "make_constant_iterator"
...
#22 2048. 13 errors detected in the compilation of "../aten/src/ATen/native/sparse/cuda/SoftMax.cu".
```

cc xwang233 zasdfgbnm ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70039

Reviewed By: mruberry

Differential Revision: D33166702

Pulled By: ngimel

fbshipit-source-id: 33f3b80095c8562786a9a9b7a0e7eb58201af458
2021-12-17 09:28:44 -08:00
92463573d8 Sanitize string before passing it as shell argument (#70070)
Summary:
Use `c10::printQuotedString` to escape any characters that might cause the
string to be interpreted as more than one argument by the shell.
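
The same idea in Python terms, as a sketch of the general pattern rather than the C++ change itself:
```
import shlex

arg = "foo; rm -rf /"              # would be parsed as two commands if unquoted
print("echo " + shlex.quote(arg))  # echo 'foo; rm -rf /'
```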

Please note that this codepath is deprecated and is not accessible
through typical PyTorch usage workflows.

This issue was discovered by Daniel Lawrence of the Amazon Alexa team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70070

Reviewed By: suo

Differential Revision: D33172721

Pulled By: malfet

fbshipit-source-id: 9dbd17f6eb775aaa1a545da42cbc95864c1189ee
2021-12-17 08:08:28 -08:00
54406314cc Update PULL_REQUEST_TEMPLATE.md (#70105)
Summary:
Many users actually send things like `Fixes #{69696}` which then fails to properly close the corresponding issue.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70105

Reviewed By: ejguan

Differential Revision: D33187501

Pulled By: albanD

fbshipit-source-id: 2080ee42c30b9db45177f049627118a6c3b544b7
2021-12-17 07:53:36 -08:00
b1d5948b34 Remove backward ops for miopen convolution (#69987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69987

Stack from [ghstack](https://github.com/ezyang/ghstack):
* __->__ #69987

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33170379

Pulled By: jbschlosser

fbshipit-source-id: 6bc274f1d457ec5bddc8b52c2f1c44eaae2ff0ed
2021-12-17 07:43:38 -08:00
f045618dab dbr quant: extend qconfig_dict support to functionals, part 2 (#69766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69766

Follow-up on the previous PR, removes the requirement to have a parent
qconfig in order for the object type qconfig to be applied for a function.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33020218

Pulled By: vkuzo

fbshipit-source-id: fa0e10f05ca5f88b48ef74b9d2043ea763506742
2021-12-17 05:59:55 -08:00
a4173fc887 dbr quant: extend qconfig_dict support to functions, part 1 (#69758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69758

Extends DBR quant `qconfig_dict['object_type']` support to function types,
with the restriction that a parent module must have a qconfig.

A future PR will remove the restriction above (it is due to some technical
debt), to keep PR sizes small.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33020217

Pulled By: vkuzo

fbshipit-source-id: ce8a8185f9c87d437e1319ff6f19e8f6adf41e02
2021-12-17 05:59:52 -08:00
c186773d92 dbr quant: make fqn during prepare op hook required (#69726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69726

This is a cleanup; this variable was previously optional
but it always exists, because an op hook can only run
if there is a parent module with an `AutoQuantizationState`
object.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: albanD

Differential Revision: D33003472

Pulled By: vkuzo

fbshipit-source-id: de5769194808d42b025b848667815b4e3d73b6c6
2021-12-17 05:59:49 -08:00
b999f87503 fx quant: move _parent_name to common utils (#69720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69720

This function is also useful for DBR quant, moving it from FX utils
to common utils.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33003473

Pulled By: vkuzo

fbshipit-source-id: 20360682c69d614a645c14fc29d3ee023d6b2623
2021-12-17 05:59:46 -08:00
4f450f44bf dbr quant: initial support of qconfig_dict for modules (#69719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69719

This PR changes the API signature of DBR quant to use `qconfig_dict`,
similar to FX graph mode quantization.  In this first PR, only basic
functionality is implemented:
* qconfig=None or static quantization with quint8 only is tested
* non-default qconfig for modules only is tested
* targeting ops by order is not implemented

Expanding this support will be done in future PRs.
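
For reference, a sketch of the `qconfig_dict` format this aligns with; the keys mirror FX graph mode quantization, and the experimental DBR prepare entry point itself is not shown:
```
import torch.nn as nn
from torch.ao.quantization import default_qconfig

qconfig_dict = {
    "": default_qconfig,                # global qconfig (static, quint8)
    "object_type": [
        (nn.Conv2d, default_qconfig),   # per-type override for modules
        (nn.Linear, None),              # None disables quantization for a type
    ],
}
```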

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D33003475

Pulled By: vkuzo

fbshipit-source-id: f5af81e29c34ea57c2e23333650e44e1758102e4
2021-12-17 05:59:44 -08:00
0f1ceb34ec fx quant: refactor qconfig_dict utils to separate file (#69636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69636

Moves some of the qconfig_dict utilities away from the FX subdirectory
into the quantization subdirectory. These utilities can be reused with
other workflows.

A future PR will start using these utilities in DBR quant.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Reviewed By: albanD

Differential Revision: D33003474

Pulled By: vkuzo

fbshipit-source-id: 34417b198681279469e6d7c43ea311180086d883
2021-12-17 05:58:25 -08:00
7abb7667a6 [tensorexpr] Add memory planning to reuse intermediate buffers (#66452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66452

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31557188

Pulled By: huiguoo

fbshipit-source-id: f18dfeba1df20d5d4f118640fc10782534eb9219
2021-12-17 01:38:02 -08:00
ac92f7cc75 [tensorexpr] Remove the optional argument in LoopNest::prepareForCodeGen (#67144)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67144

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31881150

Pulled By: huiguoo

fbshipit-source-id: af99087722ec71d6deb9049b63b573ae7720c9ec
2021-12-17 01:37:59 -08:00
bbfd7b75ca [tensorexpr] Move the allocation of intermediate buffers from TEK to CodeGen (#67143)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67143

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31881151

Pulled By: huiguoo

fbshipit-source-id: 457e5d4ff8a15f70af9c797c9ab4803d8e779abe
2021-12-17 01:37:56 -08:00
6075ec15b1 [tensorexpr] Add BufMap instruction to reuse the memory of dest buf for src buf (#66451)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66451

Test Plan: Imported from OSS

Reviewed By: navahgar, ZolotukhinM

Differential Revision: D31557190

Pulled By: huiguoo

fbshipit-source-id: 96e08a05cb1c558706c4189e27d5d72efbd9c510
2021-12-17 01:37:53 -08:00
c7e0951524 [tensorexpr] Add a stmt recorder to obtain stmt PCs (#66450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66450

Test Plan: Imported from OSS

Reviewed By: navahgar, ZolotukhinM

Differential Revision: D31557189

Pulled By: huiguoo

fbshipit-source-id: 416d79ddfc46a0109187cdeb919ad9b5abde8030
2021-12-17 01:36:37 -08:00
043098ef7f [quant][graphmode] Rename backend_config_dict folder to backend (#69882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69882

att

Test Plan:
```
python test/fx2trt/test_quant_trt.py
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33081761

fbshipit-source-id: c3178eec5798ac8587be09a963944b570c73e8ea
2021-12-16 21:13:04 -08:00
3d51c88032 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from MapDataPipes (#69561)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69561

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952099

Pulled By: NivekT

fbshipit-source-id: 95b725774a9d04d655e2542760726908f33043f4
2021-12-16 18:11:00 -08:00
b89c283c80 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from IterDataPipes (#69560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69560

cc VitalyFedyunin ejguan NivekT
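
The migration for callers is to bind extra arguments themselves, e.g. with `functools.partial` (a sketch; `IterableWrapper` is the standard in-memory IterDataPipe):
```
from functools import partial
from torch.utils.data.datapipes.iter import IterableWrapper

def scale(x, k):
    return x * k

dp = IterableWrapper(range(5))
# before: dp.map(scale, fn_args=(2,)); after: bind the extra argument yourself
dp = dp.map(partial(scale, k=2))
print(list(dp))  # [0, 2, 4, 6, 8]
```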

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952100

Pulled By: NivekT

fbshipit-source-id: e0cc31408c7cf3220fe274feed1c7202a1aaae70
2021-12-16 18:09:52 -08:00
4a6a5d1630 OpInfos for torch.{flatten, column_stack} (#69237)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69237

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32988956

Pulled By: anjali411

fbshipit-source-id: b7f5c537ff9731f56232aa5647910f03edf4582a
2021-12-16 17:50:58 -08:00
ef6f776e82 [quant][be] Cleanup test cases for eager mode workflow (#69880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69880

Making the test cases more standardized; in general we would like to have
```
TestQuantizeEager,
TestQuantizeEagerOps,
TestQuantizeEagerModels,
```

but currently, since we have separate ptq static, ptq dynamic and qat static apis, we only partially cleaned
up the test cases; we can merge all of them later when we merge all the apis

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33081418

fbshipit-source-id: fcb96559b76bbc51eb1b0625e0d4b193dbb37532
2021-12-16 17:47:30 -08:00
92320dfe6e [shard] remove set device for nccl (#69946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69946

This PR removes the implicit set_device for the nccl pg, per the proposal in https://github.com/pytorch/pytorch/issues/69731
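
With the implicit call gone, callers pin the device themselves; a sketch assuming a launcher that sets `RANK` and one GPU per rank:
```
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
dist.init_process_group("nccl")
torch.cuda.set_device(rank % torch.cuda.device_count())  # now explicit, not implicit
```
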
ghstack-source-id: 145847504

Test Plan: wait for ci

Reviewed By: pritamdamania87

Differential Revision: D33099095

fbshipit-source-id: 3fe9f6a0facf5ea513c267e9f32c6a7fd56cc8a2
2021-12-16 17:16:42 -08:00
9813629500 [reland][quant][fx][graphmode] Add support for conv add pattern in backend_config_dict (#70007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70007

This PR extends fusion pattern support from a simple sequence of ops to a simple
subgraph like conv - add:
```
x - conv ---\
y ---------add ---- output
```
where input x, y and output are observed/quantized
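
A module producing exactly this subgraph, for illustration:
```
import torch

class ConvAdd(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)

    def forward(self, x, y):
        return self.conv(x) + y  # conv output and the extra input feed one add
```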

Test Plan:
```
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRTOps.test_conv_add
```

Imported from OSS

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33144605

fbshipit-source-id: 331fda77bdc431a8cd9abe1caea8347a71776ec2
2021-12-16 17:10:44 -08:00
62809dc062 .github: Volume mount netrc to home directory (#70057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70057

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33169220

Pulled By: seemethere

fbshipit-source-id: 720e5fb946249a26f0505afc34b95530258e53ea
2021-12-16 15:23:45 -08:00
a73c6a45b6 [reland][quant][graphmode][fx] Enable fuse handler for sequence of 3 ops (#70006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70006

reland: fixing some mypy errors that were missed before

This PR enables the fuse handler for a sequence of three ops, and merges all fuse handlers into one.

TODO: we can also move this to backend_config_dict folder

Test Plan:
regression fusion test
```
python test/test_quantization.py TestFuseFx
```

Imported from OSS

Imported from OSS

Reviewed By: supriyar

Differential Revision: D33144606

fbshipit-source-id: ca34f282018a0fb4d04c7e35119eaf2d64258e78
2021-12-16 15:04:16 -08:00
fa582045fc Fix lint/mypy violations (#70059)
Summary:
Introduced by https://github.com/pytorch/pytorch/pull/69194

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70059

Reviewed By: suo, cccclai

Differential Revision: D33170748

Pulled By: malfet

fbshipit-source-id: a2e42f37d04c21a735f6474e42eb6670d2a0c3b9
2021-12-16 14:06:27 -08:00
02c63c3006 extract out c10 targets to the c10 package (#69992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69992

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33141013

fbshipit-source-id: e5edd6bd5b5834ac27390ba940ebed9148512c8d
2021-12-16 13:11:49 -08:00
d459e79500 [jit][edge] Remove usage of shared_ptr<mobile::Code>. (#68037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68037

Right now mobile::Code doesn't outlive its enclosing Function, and all accesses to Code happen inside the interpreter loop, which doesn't outlive the module, so we don't need to use std::shared_ptr here. This should also save us 1-2 KB of binary size, because shared_ptr seems to bloat on arm64 android.
ghstack-source-id: 145818696

Test Plan: eyes.

Reviewed By: qihqi, tugsbayasgalan

Differential Revision: D32264616

fbshipit-source-id: d83f538d6604cf75fd7728a25127b4849ce7ab2a
2021-12-16 13:11:46 -08:00
39f65fee47 [jit] Split ClassType into a separate header. (#68036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68036

For Edge use cases we want to include class_type.h separately, because in the future we want to stop depending on the rest of the JIT types declared inside jit_type.h.
ghstack-source-id: 145818699

Test Plan: no behavior change.

Reviewed By: qihqi, gmagogsfm

Differential Revision: D32264618

fbshipit-source-id: 53dc187772e3dde88ff978b87252c31f3641860b
2021-12-16 13:10:05 -08:00
243e135eb4 Sparse CSR CUDA: Add block sparse support for torch.triangular_solve (#68709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68709

This PR adds support for triangular solver with a block CSR matrix.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33066067

Pulled By: cpuhrsch

fbshipit-source-id: 9eaf1839071e9526be8d8c6d47732b24200f3557
2021-12-16 13:03:42 -08:00
5f3f327a9d update SequentialLR signature (#69817)
Summary:
- ~optimizer isn't required for `SequentialLR` since it's already present in the schedulers. Trying to match the signature of it with `ChainedScheduler`.~
- ~`verbose` isn't really used anywhere so removed it.~

Updated the missing docs and added a small check.
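
A small usage sketch of the signature as documented:
```
import torch

params = [torch.zeros(1, requires_grad=True)]
optimizer = torch.optim.SGD(params, lr=0.1)
warmup = torch.optim.lr_scheduler.ConstantLR(optimizer, factor=0.1, total_iters=5)
decay = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[5])

for _ in range(10):
    optimizer.step()
    scheduler.step()
```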

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69817

Reviewed By: ngimel

Differential Revision: D33069589

Pulled By: albanD

fbshipit-source-id: f015105a35a2ca39fe94c70acdfd55cdf5601419
2021-12-16 12:58:00 -08:00
15b9e5f8a4 Revert D33136054: Remove backward ops for miopen convolution
Test Plan: revert-hammer

Differential Revision:
D33136054 (8b9b819d22)

Original commit changeset: e049168732bd

Original Phabricator Diff: D33136054 (8b9b819d22)

fbshipit-source-id: 2a3cc3df3519d04595795f0bc87a807705d13a13
2021-12-16 12:46:02 -08:00
b199e3c842 Provide functionality to write custom ShardedTensor ops. (#69874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69874

We have a handful of ops supported for ShardedTensor via
``__torch_function__`` dispatch. However, we currently can't cover all torch
operators, and giving users a way to extend this functionality will make
it much more general.

In this PR, I've introduced a custom_sharded_op decorator which can be used to
register a custom sharded op implementation.
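
A sketch of what a registration could look like; the import path and decorator argument are assumptions based on this description, and the handler signature mirrors the ``__torch_function__`` dispatch shown in the traces elsewhere in this log:
```
import torch
# the import path below is an assumption; the commit only names the decorator
from torch.distributed._sharded_tensor import custom_sharded_op

@custom_sharded_op(torch.nn.functional.gelu)  # hypothetical registration target
def sharded_gelu(types, args, kwargs, process_group):
    # apply the op shard-by-shard on the local shards (illustrative only)
    st = args[0]
    for shard in st.local_shards():
        shard.tensor = torch.nn.functional.gelu(shard.tensor)
    return st
```
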
ghstack-source-id: 145841141

Test Plan: waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D33078587

fbshipit-source-id: 5936b7ac25582e613653c19afa559219719ee54b
2021-12-16 12:40:13 -08:00
1f86e0ee2a don't compile pow kernels for non-existent case (#70017)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70017

Reviewed By: malfet

Differential Revision: D33163747

Pulled By: ngimel

fbshipit-source-id: 784c7934428ee896c637662fdd59833c3a395f64
2021-12-16 12:31:30 -08:00
8b9b819d22 Remove backward ops for miopen convolution (#69987)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69987

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33136054

Pulled By: jbschlosser

fbshipit-source-id: e049168732bdfcf590ec8102412f2ef0418f9dcc
2021-12-16 11:49:49 -08:00
b4c4a015d6 Revert D33163841: Revert D33102715: Back out "Revert D32606547: torch/monitor: add C++ events and handlers"
Test Plan: revert-hammer

Differential Revision:
D33163841

Original commit changeset: e262b6d8c80a

Original Phabricator Diff: D33102715 (eb374de3f5)

fbshipit-source-id: 644216036a238a458f0a2198460b36d24fb035f8
2021-12-16 11:12:18 -08:00
96fe82ac3c HANDLE_TH_ERRORS: Move exception translation out of line (#69974)
Summary:
I've noticed that the `HANDLE_TH_ERRORS` macros are actually very expensive in terms of compile time.  Moving the bulk of the catch statements out of line using a lippincott function significantly improves compile times and object file binary sizes. For just the generated autograd bindings, this halves serial build time from 8 minutes to 4 and binary size is more than halved for most files with the biggest difference being `python_variable_methods.cpp` which went from 126 MB to 43 MB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69974

Reviewed By: mruberry

Differential Revision: D33160899

Pulled By: albanD

fbshipit-source-id: fc35fa86f69ffe5a0752557be30b438c8564e998
2021-12-16 11:04:48 -08:00
9ff8c49ed9 Enable cpu scalar arguments for jiterator (#69861)
Summary:
Creates analog of `gpu_kernel_with_scalars` for jiterator kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69861

Reviewed By: mruberry

Differential Revision: D33134013

Pulled By: ngimel

fbshipit-source-id: fd2412e8d6432e15d5721e95a194d29fa70ad92c
2021-12-16 10:58:59 -08:00
ff53ed24d2 fix NameError of docstring in broadcast_object_list (#69810)
Summary:
This PR fixes NameError of docstring in broadcast_object_list.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69810

Reviewed By: kimishpatel

Differential Revision: D33143167

Pulled By: jbschlosser

fbshipit-source-id: 99c076466ae4b4a332763b7546028c5097b417d7
2021-12-16 10:50:45 -08:00
c9e898fef8 delete TH (#69929)
Summary:
Move TH<C>GenerateByteType includes into torch/csrc (the only place they are used), and we can remove TH folder altogether!
The only things left in THC are includes kept for bc compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69929

Reviewed By: mruberry

Differential Revision: D33133013

Pulled By: ngimel

fbshipit-source-id: 78c87cf93d2d641631b0f71051ace318bf4ec3c1
2021-12-16 10:45:30 -08:00
7f7966a888 [Docs] Fix the syntax of documentation (#69958)
Summary:
Fixes the syntax of documentation in the file torch/nn/utils/clip_grad.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69958

Reviewed By: mruberry

Differential Revision: D33160612

Pulled By: albanD

fbshipit-source-id: 2dc199fee345bb4c75632900bc6f73a1ab8192a6
2021-12-16 10:38:39 -08:00
ebc66bfeea [Profiler] Pull helper methods into dedicated file. (And start torch/csrc/profiler folder. (#69255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69255

One thing that I've found as I optimize the profiler is that there's a lot of intermingled code, where the kineto profiler relies on the legacy (autograd) profiler for generic operations. This made optimization hard because I had to manage too many complex dependencies. (Exacerbated by the USE_KINETO #ifdef's sprinkled around.) This PR is the first of several to restructure the profiler(s) so the later optimizations go in easier.

Test Plan: Unit tests

Reviewed By: aaronenyeshi

Differential Revision: D32671972

fbshipit-source-id: efa83b40dde4216f368f2a5fa707360031a85707
2021-12-16 10:33:47 -08:00
b23890177f [Operator Versioning][Edge] Codegen upgrader_mobile.cpp (#69194)
Summary:
From operator version map and upgrader torchscript, generate upgrader_mobile.cpp file. It also includes a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69194

ghstack-source-id: 145819351

Test Plan:
```
buck test mode/opt //caffe2/test:upgrader_codegen
```
```
buck run mode/opt //caffe2/torch/fb/mobile/upgrader_codegen:upgrader_codegen
```
```
python /Users/chenlai/pytorch/tools/codegen/operator_versions/gen_mobile_upgraders.py
```

Reviewed By: iseeyuan

Differential Revision: D32748985

fbshipit-source-id: f8437766edaba459bfc5e7fc7a3ca0520c4edb9a
2021-12-16 10:29:35 -08:00
c4281cc92d Prototype checkpoint_wrapper (#69955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955

Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.

Currently only support for reentrant-based checkpointing is added and only tested with FSDP to unblock a use case.

Future work is to add support for new checkpointing API, add more tests, upstream to torch.utils.checkpoint.
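
A usage sketch; the import path is an assumption based on the description above, since the wrapper had not yet been upstreamed to torch.utils.checkpoint:
```
import torch
# assumed location of the prototype wrapper
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

block = checkpoint_wrapper(torch.nn.Linear(8, 8))  # wrap once, no per-call checkpoint()
out = block(torch.randn(2, 8, requires_grad=True)).sum()
out.backward()
```
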
ghstack-source-id: 145811242

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D33107276

fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
2021-12-16 09:59:19 -08:00
c80b5b8c8f Revert D33102715: Back out "Revert D32606547: torch/monitor: add C++ events and handlers"
Test Plan: revert-hammer

Differential Revision:
D33102715 (eb374de3f5)

Original commit changeset: 3816ff01c578

Original Phabricator Diff: D33102715 (eb374de3f5)

fbshipit-source-id: e262b6d8c80a05f3a67e024fedfbadefdbfe6e29
2021-12-16 09:39:57 -08:00
8c7f4a0d0b [tensorexpr] check for index out of bounds in ir_eval (#68858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68858

when executing with ir_eval, check for index out of bounds.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32657881

Pulled By: davidberard98

fbshipit-source-id: 62dd0f85bb182b34e9c9f795ff761081290f6922
2021-12-16 09:27:45 -08:00
76d282d447 Nvfuser code bump 12 5 (#69964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. Perf tuning of the codegen scheduler that improves performance.
2. Permutation support has been extended beyond contiguous/channels-last. (The improvements can be observed on the PW benchmark.)

Things reverted from local changes:
1. aten::gelu with approximation
2. Local changes that were upstreamed in PR https://github.com/pytorch/pytorch/issues/68804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
2021-12-16 08:28:54 -08:00
a6a1c709ff Fixed libtorch at::Tensor::print() linking error (#69615)
Summary:
There was a declaration of the function at::Tensor::print() in TensorBody.h, left there during the refactoring of Tensor and TensorBase (d701357d921ef167d42c125e65b6f7da6be3ad0f). Removing it from TensorBody.h resolves the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69615

Test Plan:
The code below now compiles and works fine (prints `[CPUFloatType [3, 4, 5, 5, 5]]`):
```
#include <torch/torch.h>

int main()
{
    torch::Tensor tensor = torch::randn({3, 4, 5, 5, 5});
    tensor.print();
}
```

Fixes https://github.com/pytorch/pytorch/issues/69515

Reviewed By: ngimel

Differential Revision: D33020361

Pulled By: albanD

fbshipit-source-id: 190f253fb4101a4205aede3574b6e8acd19e54a1
2021-12-16 07:57:10 -08:00
531da0c43b change asan test shard to 3 (#69843)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68261

This PR changes the number of test shards from 2 to 3 for all ASAN tests, aiming to improve their run time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69843

Reviewed By: janeyx99

Differential Revision: D33160771

Pulled By: xidachen

fbshipit-source-id: dba1d318cc49b923e18704839471d8753cc00eca
2021-12-16 07:22:03 -08:00
fe7b6446d5 [LTC] Upstream LazyTensor and LazyGraphExecutor (#69815)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69815

Test Plan: Imported from OSS

Reviewed By: dagitses, jbschlosser

Differential Revision: D33059774

Pulled By: desertfire

fbshipit-source-id: dd1e3e5f4fd3181517eebd2742f6a5b7b6fb9a7d
2021-12-16 05:44:40 -08:00
28243769f9 [LTC] Upstream several internal ops (#69716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69716

To prepare for the landing of LazyTensor and LazyGraphExecutor,
- arithmetic_ir_ops.h
- cast.h
- device_data.h
- expand.h
- generic.h
- scalar.h

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32999410

Pulled By: desertfire

fbshipit-source-id: 31559dd7a1e525591ae9e2d7f915ee864437c11f
2021-12-16 05:44:37 -08:00
e6a4988b2d [LTC] Upstream utils in computation_client (#69621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69621

Upstream the following utils
- metrics.h
- multi_wait.h
- thread_pool.h
- unique.h

Test Plan: Imported from OSS

Reviewed By: wconstab, VitalyFedyunin

Differential Revision: D32957629

Pulled By: desertfire

fbshipit-source-id: 5f2fb57493856556099b7cda7560a568d1f9ed97
2021-12-16 05:43:09 -08:00
73a6c36f1b Add more details to the known limitations section of torchhub docs (#69970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69970

This is a follow up to https://github.com/pytorch/hub/issues/243

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33124060

Pulled By: NicolasHug

fbshipit-source-id: 298fe14b39a1aff3e0b029044c9a0db8bc82336a
2021-12-16 02:43:48 -08:00
eb374de3f5 Back out "Revert D32606547: torch/monitor: add C++ events and handlers" (#69923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69923

Original commit changeset: fbaf2cc06ad4

Original Phabricator Diff: D32606547 (e61fc1c03b)

This is the same thing as the original diff but just using a normal std::mutex instead of std::shared_timed_mutex which is not available on OSX 10.11. The performance difference should be negligible and easy to change down the line if it does become a bottleneck.

Old failing build: https://github.com/pytorch/pytorch/runs/4495465412?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68783

Test Plan:
buck test //caffe2/test/cpp/monitor:monitor

will add ciflow tags to ensure mac builds are fine

Reviewed By: aivanou

Differential Revision: D33102715

fbshipit-source-id: 3816ff01c578d8e844d303d881a63cf5c3817bdb
2021-12-15 22:51:43 -08:00
5cc4037369 [PyTorch][Distributed] Integrate with ShardedOptimizer in the unit test of ShardedLinear (#69569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69569

Since ShardedOptimizer was added in https://github.com/pytorch/pytorch/pull/68607, we now integrate it into our unit test for ShardedLinear.
ghstack-source-id: 145773749

Test Plan: CI + Unit test

Reviewed By: wanchaol

Differential Revision: D32777020

fbshipit-source-id: eb6b1bb0f6234976f024273833154cab274fed25
2021-12-15 17:55:01 -08:00
dc18048dd8 [PT-D][Fix] Broken sharded embedding and embedding bag test fix (#69725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69725

We added a `no_grad` context manager in the tensor sharding to ensure that the local_shard is a root node. But it turns out that for embedding and embedding_bag, when `max_norm` is specified, row-wise sharding complains, since we use the original `max_norm` of the operators.

Error traces:
```
  File "/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/_sharded_tensor/sharded_embedding#binary,link-tree/torch/overrides.py", line 1389, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/_sharded_tensor/sharded_embedding#binary,link-tree/torch/distributed/_sharded_tensor/api.py", line 554, in __torch_function__
    return sharded_embedding(types, args, kwargs, self._process_group)
  File "/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/_sharded_tensor/sharded_embedding#binary,link-tree/torch/distributed/_sharded_tensor/ops/embedding.py", line 115, in sharded_embedding
    return _handle_row_wise_sharding(
  File "/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/_sharded_tensor/sharded_embedding#binary,link-tree/torch/distributed/_sharded_tensor/ops/embedding.py", line 309, in _handle_row_wise_sharding
    gathered_input_embeddings = torch.nn.functional.embedding(
  File "/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/_sharded_tensor/sharded_embedding#binary,link-tree/torch/nn/functional.py", line 2153, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: A view was created in no_grad mode and its base or another view of its base has been modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).
 exiting process 2 with exit code: 10
```

As a fix, we clone and detach the local shard from the narrow result without using the context manager.
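
A sketch of the fix in isolation:
```
import torch

full_weight = torch.randn(8, 4, requires_grad=True)
# before: narrowing under no_grad created the ambiguous view flagged in the trace
# after: clone and detach the narrowed shard so it is an autograd root
local_shard = full_weight.narrow(0, 0, 4).clone().detach()
local_shard.requires_grad_(True)
```
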
ghstack-source-id: 145773748

Test Plan: CI + Unit test.

Reviewed By: pritamdamania87, wanchaol

Differential Revision: D33000927

fbshipit-source-id: 4d5a93120675e90d4d6d6225a51c4a481d18d159
2021-12-15 17:53:49 -08:00
4d5dd00e61 Remove backward ops for cuDNN transposed convolution (#69902)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69902

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33093795

Pulled By: jbschlosser

fbshipit-source-id: 8b90150bd1996e48c0c888bdab4e95a849d10ef5
2021-12-15 17:48:25 -08:00
3dc3651e0e Remove backward ops for cuDNN convolution (#69901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69901

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33093796

Pulled By: jbschlosser

fbshipit-source-id: f5beab6f3078144b6c8e5c4c51d69823815a9f99
2021-12-15 17:46:49 -08:00
bf15dc22bc Fix build on latest main branch of thrust (#69985)
Summary:
Our internal CI that builds PyTorch with the latest main branch of thrust fails with
```
https://github.com/pytorch/pytorch/issues/22 466.9 /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -Iaten/src -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -I../cmake/../third_party/cudnn_frontend/include -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -Iinclude -I../torch/csrc/distributed -I../aten/src/TH -I../aten/src/THC -I../aten/src/ATen/cuda -Icaffe2/aten/src -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -Inccl/include -I../c10/cuda/../.. -I../c10/.. -I../third_party/tensorpipe -Ithird_party/tensorpipe -I../third_party/tensorpipe/third_party/libnop/include -I../torch/csrc/api -I../torch/csrc/api/include -isystem=third_party/gloo -isystem=../cmake/../third_party/gloo -isystem=../cmake/../third_party/googletest/googlemock/include -isystem=../cmake/../third_party/googletest/googletest/include -isystem=../third_party/protobuf/src -isystem=/opt/conda/include -isystem=../third_party/gemmlowp -isystem=../third_party/neon2sse -isystem=../third_party/XNNPACK/include -isystem=../third_party -isystem=../cmake/../third_party/eigen -isystem=/opt/conda/include/python3.8 -isystem=/opt/conda/lib/python3.8/site-packages/numpy/core/include -isystem=../cmake/../third_party/pybind11/include -isystem=/opt/hpcx/ompi/include/openmpi -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem=/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include -isystem=/opt/hpcx/ompi/include -isystem=/usr/local/cuda/include -isystem=../third_party/ideep/mkl-dnn/third_party/oneDNN/include -isystem=../third_party/ideep/include -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Xcudafe --diag_suppress=20236 -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -Xcompiler=-fPIC -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD 
-Xcompiler=-Wall,-Wextra,-Wno-unused-parameter,-Wno-unused-variable,-Wno-unused-function,-Wno-unused-result,-Wno-unused-local-typedefs,-Wno-missing-field-initializers,-Wno-write-strings,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-sign-compare,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-error=deprecated-declarations,-Wno-missing-braces,-Wno-maybe-uninitialized -DTORCH_CUDA_BUILD_MAIN_LIB -Xcompiler -pthread -std=c++14 -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu.o.d -x cu -c ../aten/src/ATen/native/cuda/LegacyThrustHelpers.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/LegacyThrustHelpers.cu.o
https://github.com/pytorch/pytorch/issues/22 466.9 ../aten/src/ATen/native/cuda/LegacyThrustHelpers.cu(53): error: namespace "thrust" has no member "make_constant_iterator"
https://github.com/pytorch/pytorch/issues/22 466.9
https://github.com/pytorch/pytorch/issues/22 466.9 1 error detected in the compilation of "../aten/src/ATen/native/cuda/LegacyThrustHelpers.cu".
```
The failure is because this file uses `thrust::make_constant_iterator` but doesn't include the header where this function is defined.

cc: xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69985

Reviewed By: jbschlosser

Differential Revision: D33135575

Pulled By: ngimel

fbshipit-source-id: 7a8da56bba609d6c30de4a064669faba12cb7168
2021-12-15 17:08:43 -08:00
98c0fb8b42 [sparsity] More descriptive error message for missing parameters (#69895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69895

sparse.Linear has an error message that doesn't tell the user how to resolve the issue. This adds more info.
ghstack-source-id: 145603212

Test Plan: Not needed -- string change only

Reviewed By: jerryzh168

Differential Revision: D33039278

fbshipit-source-id: b5f7f5d257142eb3e7ad73f7c005755253a329d7
2021-12-15 16:58:31 -08:00
46ace4ac33 Add support for masked_softmax when softmax_elements > 1024 & corresponding unit tests (#69924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69924

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32819181

fbshipit-source-id: 6838a11d3554ec8e1bd48f1c2c7b1ee3a4680995
2021-12-15 16:44:15 -08:00
32ffad17a9 [PyTorch][Easy] make GlobalRecordFunctionCallbacks smallvector (#70002)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70002

Callbacks are limited to 4, so there's no reason for this to be a `std::vector`.

Test Plan: CI

Reviewed By: aaronenyeshi

Differential Revision: D32611294

fbshipit-source-id: 21823248abe40d461579b9b68d53c8c0de2a133d
2021-12-15 16:28:09 -08:00
65ab63310b [PyTorch] use div instead of mul when calculating sampling probability (#70001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70001

Multiply by the inverse of `kLowProb` instead of dividing; this uses the less expensive `mul` instead of `div`.
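The idea sketched in Python (the constant name comes from this summary; the real change is in C++):

```
kLowProb = 0.001
kLowProbInverse = 1.0 / kLowProb  # hoisted: computed once, reused per sample

def scale_div(u: float) -> float:
    return u / kLowProb            # relies on the expensive div

def scale_mul(u: float) -> float:
    return u * kLowProbInverse     # same result via the cheaper mul

assert abs(scale_div(0.5) - scale_mul(0.5)) < 1e-9
```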

Test Plan:
Before
{F682076291}

After
{F682076323}

Reviewed By: robieta

Differential Revision: D32608440

fbshipit-source-id: 7851317a0f7e33813f2bd7a152e5e7f4b5c361b4
2021-12-15 15:28:18 -08:00
66406ee0f7 [PyTorch][Static Runtime] Fix to() w/dtype bool (#69935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935

Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`.
ghstack-source-id: 145661358

Test Plan:
Added test for dtype bool.

Ran CMF local_ro net:

before:

```
I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67
I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11
I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93
I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57
I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76
I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45
I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4
I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43
I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04
I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. Iters per second: 1053.8
I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367
```

after:
```
I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15
I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91
I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96
I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56
I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55
I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11
I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84
I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02
I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81
I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6
I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761
```

Reviewed By: hlu1

Differential Revision: D33106675

fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862
2021-12-15 15:26:56 -08:00
b28a4100ff scripts: Fix manylinux2014 promotion to pypi (#70003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70003

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser, janeyx99

Differential Revision: D33143730

Pulled By: seemethere

fbshipit-source-id: 83a46047fbfe4709e841fbfcaa75e434ff325be5
2021-12-15 14:55:00 -08:00
38cfacd817 Tensor: Define operators override functions in TensorBody.h (#68697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68697

Currently, if you include `Tensor.h` but not `TensorOperators.h` then
using overloaded operators will compile but fail at link time.
Instead, this defines the member functions in `TensorBody.h` and
leaves `TensorOperators.h` as only the free functions.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596269

Pulled By: albanD

fbshipit-source-id: 5ce39334dc3d505865268f5049b1e25bb90af44a
2021-12-15 14:29:38 -08:00
9c7c1b769a Functionalization: Only include headers for required ops (#68690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68690

RegisterFunctionalization.cpp is a shared file, so only including the
required operators means a single operator change only requires 1
shard to be rebuilt instead of all of them.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596275

Pulled By: albanD

fbshipit-source-id: 8b56f48872156b96fbc0a16b542b8bab76b73fd4
2021-12-15 14:29:35 -08:00
7bb4b683b5 Codegen: Registration now only includes the functions used (#68689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68689

Currently Register{DispatchKey}.cpp includes all of
`NativeFunctions.h`, so any operator signature change requires all
backend registrations to be recompiled. However, most backends only
have registrations for a small fraction of operators, so it makes sense
to only include the specific functions required.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596273

Pulled By: albanD

fbshipit-source-id: 11d511f47937fbd5ff9f677c9914277b5d015c25
2021-12-15 14:29:32 -08:00
6ba18ba87e Codegen: Generate static dispatch headers per operator (#68714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68714

This splits the static dispatch headers (e.g. `CPUFunctions.h`)
into per operators headers (e.g. `ops/empty_cpu_dispatch.h`) which is
needed for when `Tensor.h` is compiled with static dispatch enabled.

There are also several places in ATen where the static dispatch
headers are used as an optimization even in dynamic dispatch builds.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596265

Pulled By: albanD

fbshipit-source-id: 287783ef4e35c7601e9d2714ddbc8d4a5b1fb9e5
2021-12-15 14:29:29 -08:00
303d60b8da Add TORCH_ASSERT_ONLY_METHOD_OPERATORS macro (#68688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68688

This adds a new macro `TORCH_ASSERT_ONLY_METHOD_OPERATORS` which
allows `Tensor.h` to be included, but not headers which pull in all
other operators. So, a file that defines this macro needs to use the
fine-grained headers to include only the operators being used.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596267

Pulled By: albanD

fbshipit-source-id: 6fc2ce3d2b0f52ac6d81b3f063193ce26e0d75a3
2021-12-15 14:29:26 -08:00
bab61be43b Codegen: Add root_name property to NativeFunction{,sGroup} (#68687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68687

This adds `NativeFunction.root_name` which is the canonical name
for the operator group, i.e. the BaseOperatorName without inplace or
double-underscores. In the previous PR I referred to this as
`base_name` but confusingly `BaseOperatorName` does potentially
include inplace or double-underscores.

I also add the property to `NativeFunctionsGroup` so that grouped
functions with type `Union[NativeFunction, NativeFunctionsGroup]`
can have the property queried without needing `isinstance` checks.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596271

Pulled By: albanD

fbshipit-source-id: 8b6dad806ec8d796dcd70fc664604670d668cae7
2021-12-15 14:28:10 -08:00
a406a427ae Revert D33004315: Support torch.equal for ShardedTensor.
Test Plan: revert-hammer

Differential Revision:
D33004315 (1c4c81622c)

Original commit changeset: 786fe26baf82

Original Phabricator Diff: D33004315 (1c4c81622c)

fbshipit-source-id: e1dda70fea656834fdf0f2a9f874415f7b460c6e
2021-12-15 14:14:06 -08:00
1c4c81622c Support torch.equal for ShardedTensor. (#69734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69734

Added support for `torch.equal` to ShardedTensor. This is really
helpful in terms of comparing two ShardedTensors.

Will implement `allclose` in a follow-up PR.
ghstack-source-id: 145301451

Test Plan: waitforbuildbot

Reviewed By: fduwjj, wanchaol

Differential Revision: D33004315

fbshipit-source-id: 786fe26baf82e1bb4fecfdbfc9ad4b64e704877f
2021-12-15 13:07:36 -08:00
8a08e70bf4 Revert D32596676: Avoid adding torch::deploy interpreter library to the data section
Test Plan: revert-hammer

Differential Revision:
D32596676 (986d19c0a7)

Original commit changeset: 1ab15b2d3642

Original Phabricator Diff: D32596676 (986d19c0a7)

fbshipit-source-id: da4f02114fd7e41634f116ab659a55cd985cfd7d
2021-12-15 13:02:22 -08:00
24bc3be146 [Profiler] Clean up profiler includes. (#69421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69421

I've hit a lot of build issues in D32671972, and I've come to realize that a lot of it boils down to header hygiene. `function.h` includes `profiler.h` *solely* to transitively include `record_function.h`, which winds up leaking the profiler symbols. Moreover, several files rely on transitive includes to get access to `getTime`. As long as I have to touch all the places that use `getTime`, I may as well also move them to the new namespace.

Test Plan: Unit tests and CI.

Reviewed By: aaronenyeshi, albanD

Differential Revision: D32865907

fbshipit-source-id: f87d6fd5afb784dca2146436e72c69e34623020e
2021-12-15 12:50:24 -08:00
587f8d9924 OperatorEntry: Avoid unnecessarily templated code (#67986)
Summary:
`assertSignatureIsCorrect` is instantiated at minimum once per unique operator signature yet its core logic is independent of the type. So, it makes sense to have a light-weight template that does nothing but call into the non-templated function with the correct `CppSignature` object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67986

Reviewed By: jbschlosser

Differential Revision: D33108600

Pulled By: swolchok

fbshipit-source-id: 7594524d3156ff2422e6edcdffcb263dc67ea346
2021-12-15 12:43:53 -08:00
986d19c0a7 Avoid adding torch::deploy interpreter library to the data section (#69245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69245

Create a custom section ".embedded_interpreter" to store the interpreter instead of .data, in order to increase the amount of memory that can be used by the other sections of the executable (such as .text/.data/.bss) by 33% (1.5GB -> 2.0GB). This also removes memory limitations of the interpreter and tech debt.

Test Plan:
buck test mode/opt //caffe2/torch/csrc/deploy:test_deploy
readelf -S ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/test_deploy
check the size of the .data section
Apply the fix and check the size of the .data section again. It should be reduced by the size of the interpreter.so

The output of `readelf -S ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/test_deploy` is as follows. The .data section is now 0.0015415GB and the .torch_deploy_payXXX section is 0.605125GB

```
(pytorch) [sahanp@devvm4333.vll0 ~/local/fbsource/fbcode] readelf -S buck-out/gen/caffe2/torch/csrc/deploy/test_deploy
There are 55 section headers, starting at offset 0x24bac82b0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000200350  00000350
       0000000000000028  0000000000000000   A       0     0     1
  [ 2] .note.ABI-tag     NOTE             0000000000200378  00000378
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .note.gnu.build-i NOTE             0000000000200398  00000398
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .dynsym           DYNSYM           00000000002003c0  000003c0
       0000000000d07a48  0000000000000018   A       9     1     8
  [ 5] .gnu.version      VERSYM           0000000000f07e08  00d07e08
       0000000000115f86  0000000000000002   A       4     0     2
  [ 6] .gnu.version_r    VERNEED          000000000101dd90  00e1dd90
       0000000000000510  0000000000000000   A       9    15     4
  [ 7] .gnu.hash         GNU_HASH         000000000101e2a0  00e1e2a0
       00000000003b4fb0  0000000000000000   A       4     0     8
  [ 8] .hash             HASH             00000000013d3250  011d3250
       0000000000457e20  0000000000000004   A       4     0     4
  [ 9] .dynstr           STRTAB           000000000182b070  0162b070
       0000000004ef205a  0000000000000000   A       0     0     1
  [10] .rela.dyn         RELA             000000000671d0d0  0651d0d0
       0000000000110b80  0000000000000018   A       4     0     8
  [11] .rela.plt         RELA             000000000682dc50  0662dc50
       00000000000093f0  0000000000000018   A       4    35     8
  [12] .rodata           PROGBITS         0000000006837040  06637040
       00000000034067a8  0000000000000000 AMS       0     0     64
  [13] fb_build_info     PROGBITS         0000000009c3d7f0  09a3d7f0
       00000000000002ee  0000000000000000   A       0     0     16
  [14] .gcc_except_table PROGBITS         0000000009c3dae0  09a3dae0
       00000000014a9340  0000000000000000   A       0     0     4
  [15] .eh_frame_hdr     PROGBITS         000000000b0e6e20  0aee6e20
       00000000004abf54  0000000000000000   A       0     0     4
  [16] .eh_frame         PROGBITS         000000000b592d78  0b392d78
       000000000200e344  0000000000000000   A       0     0     8
  [17] .text             PROGBITS         000000000d5a2000  0d3a2000
       000000001e55944e  0000000000000000  AX       0     0     256
  [18] .init             PROGBITS         000000002bafb450  2b8fb450
       0000000000000017  0000000000000000  AX       0     0     4
  [19] .fini             PROGBITS         000000002bafb468  2b8fb468
       0000000000000009  0000000000000000  AX       0     0     4
  [20] .never_hugify     PROGBITS         000000002bafb480  2b8fb480
       0000000000000db3  0000000000000000  AX       0     0     16
  [21] text_env          PROGBITS         000000002bafc240  2b8fc240
       0000000000002e28  0000000000000000  AX       0     0     16
  [22] .plt              PROGBITS         000000002baff070  2b8ff070
       00000000000062b0  0000000000000000  AX       0     0     16
  [23] .tdata            PROGBITS         000000002bb06000  2b906000
       0000000000000b20  0000000000000000 WAT       0     0     8
  [24] .tbss             NOBITS           000000002bb06b40  2b906b20
       0000000000007cb8  0000000000000000 WAT       0     0     64
  [25] .fini_array       FINI_ARRAY       000000002bb06b20  2b906b20
       0000000000000028  0000000000000000  WA       0     0     8
  [26] .init_array       INIT_ARRAY       000000002bb06b48  2b906b48
       0000000000008878  0000000000000000  WA       0     0     8
  [27] .data.rel.ro      PROGBITS         000000002bb0f3c0  2b90f3c0
       0000000000029ce0  0000000000000000  WA       0     0     64
  [28] .ctors            PROGBITS         000000002bb390a0  2b9390a0
       0000000000000010  0000000000000000  WA       0     0     8
  [29] .dynamic          DYNAMIC          000000002bb390b0  2b9390b0
       0000000000000340  0000000000000010  WA       9     0     8
  [30] .got              PROGBITS         000000002bb393f0  2b9393f0
       000000000001f040  0000000000000000  WA       0     0     8
  [31] .bss.rel.ro       NOBITS           000000002bb58440  2b958430
       0000000000000c40  0000000000000000  WA       0     0     32
  [32] .data             PROGBITS         000000002bb5a000  2b959000
       0000000000194188  0000000000000000  WA       0     0     4096
  [33] .tm_clone_table   PROGBITS         000000002bcee188  2baed188
       0000000000000000  0000000000000000  WA       0     0     8
  [34] .probes           PROGBITS         000000002bcee188  2baed188
       0000000000000002  0000000000000000  WA       0     0     2
  [35] .got.plt          PROGBITS         000000002bcee190  2baed190
       0000000000003168  0000000000000000  WA       0     0     8
  [36] .bss              NOBITS           000000002bcf1300  2baf02f8
       00000000005214f0  0000000000000000  WA       0     0     128
  [37] .nvFatBinSegment  PROGBITS         000000002c213000  2baf1000
       0000000000002850  0000000000000000   A       0     0     8
  [38] .nv_fatbin        PROGBITS         000000002c216000  2baf4000
       0000000052baed38  0000000000000000  WA       0     0     8
  [39] .comment          PROGBITS         0000000000000000  7e6a2d38
       00000000000001dc  0000000000000000  MS       0     0     1
  [40] .debug_aranges    PROGBITS         0000000000000000  7e6a2f20
       0000000001266c00  0000000000000000           0     0     16
  [41] .debug_info       PROGBITS         0000000000000000  7f909b20
       000000007b21de49  0000000000000000           0     0     1
  [42] .debug_abbrev     PROGBITS         0000000000000000  fab27969
       000000000179f365  0000000000000000           0     0     1
  [43] .debug_line       PROGBITS         0000000000000000  fc2c6cce
       00000000176954ac  0000000000000000           0     0     1
  [44] .debug_str        PROGBITS         0000000000000000  11395c17a
       0000000039dc32b0  0000000000000001  MS       0     0     1
  [45] .debug_ranges     PROGBITS         0000000000000000  14d71f430
       0000000026a2d930  0000000000000000           0     0     16
  [46] .debug_types      PROGBITS         0000000000000000  17414cd60
       000000000b211ff5  0000000000000000           0     0     1
  [47] .debug_loc        PROGBITS         0000000000000000  17f35ed55
       000000009ca80c7e  0000000000000000           0     0     1
  [48] .debug_macinfo    PROGBITS         0000000000000000  21bddf9d3
       000000000000151c  0000000000000000           0     0     1
  [49] .note.stapsdt     NOTE             0000000000000000  21bde0ef0
       0000000000001b3c  0000000000000000           0     0     4
  [50] .debug_macro      PROGBITS         0000000000000000  21bde2a2c
       0000000000040e6a  0000000000000000           0     0     1
  [51] .torch_deploy_pay PROGBITS         0000000000000000  21be23896
       0000000026ba5d28  0000000000000000           0     0     1
  [52] .symtab           SYMTAB           0000000000000000  2429c95c0
       00000000020ce0c8  0000000000000018          54   863985     8
  [53] .shstrtab         STRTAB           0000000000000000  244a97688
       000000000000025c  0000000000000000           0     0     1
  [54] .strtab           STRTAB           0000000000000000  244a978e4
       00000000070309c6  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)
```

Reviewed By: shunting314

Differential Revision: D32596676

fbshipit-source-id: 1ab15b2d36422506d8f781d3bbc0c70c44bc3d91
2021-12-15 11:27:57 -08:00
c6bcfb152d [PyTorch][easy] Move GlobalRecordFunctionCallbacks{,Entry} to cpp file (#68483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68483

Doesn't need to be in the header.
ghstack-source-id: 145668417

Test Plan: CI

Reviewed By: chaekit

Differential Revision: D32477113

fbshipit-source-id: 30e7796413e3220e4051544559f9110ab745022d
2021-12-15 09:38:51 -08:00
873585da2b [SR] Improve set_inputs (#69087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087
This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`:

1. Eliminate code duplication between rvalue/lvalue overloads
2. Add type checks
3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs are passed.
4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`.
ghstack-source-id: 145599837

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32711705

fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b
2021-12-15 09:31:19 -08:00
aeedd89d4e [PyTorch] RecordFunction: use SmallVector for ObserverContextList (#68412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68412

These lists have the same size as CallbackHandles, so they should be the same container type.
ghstack-source-id: 145668416

Test Plan:
Run same command as previous diff.

Before: see previous diff, average about 0.46us
After: P467928077, average about 0.43us

Reviewed By: chaekit

Differential Revision: D32454856

fbshipit-source-id: 3a3ff4d381d99f51ef868d4dec4db7c411b5ea56
2021-12-15 09:31:16 -08:00
29914f55bf Skip print_test_stats checks for tests that use repeat_test_for_types (#69872)
Summary:
Once https://github.com/pytorch/pytorch/issues/69865 is fixed, this change should be undone.

This will avoid print_test_stats errors in CI, such as https://github.com/pytorch/pytorch/runs/4501145212?check_suite_focus=true (HUD view fc37e5b3ed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69872

Reviewed By: dagitses, suo

Differential Revision: D33094446

Pulled By: janeyx99

fbshipit-source-id: 7378556d75ea94dd407a2bf9dda37b15c57014f7
2021-12-15 09:29:58 -08:00
d71b8e1a8d More distutils.version.LooseVersion changes (#69947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69947

Reviewed By: seemethere

Differential Revision: D33111996

Pulled By: malfet

fbshipit-source-id: e7d2cc4ed3e39452e809965e360b05f0b409ec0d
2021-12-15 08:07:36 -08:00
6f9844693f Revert D32974907: [quant][graphmode][fx] Enable fuse handler for sequence of 3 ops
Test Plan: revert-hammer

Differential Revision:
D32974907 (bf089840ac)

Original commit changeset: ba205e74b566

Original Phabricator Diff: D32974907 (bf089840ac)

fbshipit-source-id: e47838f3008ba014d884aef53460df654f0cf731
2021-12-15 05:46:49 -08:00
87bc1f4ed8 Revert D33024528: [quant][fx][graphmode] Add support for conv add pattern in backend_config_dict
Test Plan: revert-hammer

Differential Revision:
D33024528 (59000cff91)

Original commit changeset: 5c770c82c8f6

Original Phabricator Diff: D33024528 (59000cff91)

fbshipit-source-id: 7da6f421ef63f47fbffad8b3ad91f6a31d19d867
2021-12-15 05:45:29 -08:00
43b8e833e9 Fix bug in aten::full signature in version_map.h to accurately reflect the current schema (#69860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69860

Previously I made a mistake and checked in aten::full.names for the upgrader of aten::full, so I changed it back to just aten::full.

Test Plan: None

Reviewed By: gmagogsfm

Differential Revision: D33066985

fbshipit-source-id: a5598d60d1bff9b4455f807361388fac0689ba14
2021-12-15 01:09:31 -08:00
5c7817fd43 Add test operator in upgrader entry (#69427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69427

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D32867984

Pulled By: tugsbayasgalan

fbshipit-source-id: 25810fc2fd4b943911f950618968af067c04da5c
2021-12-15 00:40:05 -08:00
47f11730ec Add testing for forward over reverse gradgrad (#69740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69740

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33031727

Pulled By: soulitzer

fbshipit-source-id: 2bcba422b4bcea3bbc936d07ba45171a6531e578
2021-12-14 23:35:10 -08:00
d0fe7db1f6 Add formulas for distributions (#69690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69690

* #69558

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33031726

Pulled By: soulitzer

fbshipit-source-id: 9ae461dc6043d48d5bb8c2bbaa266d06ad99f317
2021-12-14 23:35:07 -08:00
b399a4d7b9 Add some reduction forward AD formulas (#69661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69661

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33020601

Pulled By: soulitzer

fbshipit-source-id: 110da6dcd490e5c3849cace62a777aa1a2b6982e
2021-12-14 23:33:43 -08:00
3b7fc0243c [PyTorch] Make TypePrinter take const Type& (#69412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69412

TypePrinter does not need to take ownership of the Type.

This helps unblock the following diff to stop refcounting Type singletons.
ghstack-source-id: 145671619

Test Plan: CI

Reviewed By: suo

Differential Revision: D32858525

fbshipit-source-id: df58676938fd20c7bae4a366d70b2067a852282d
2021-12-14 23:13:03 -08:00
7a12b5063e [AutoAccept][Codemod][FBSourceBuckFormatLinter] Daily arc lint --take BUCKFORMAT
Reviewed By: zertosh

Differential Revision: D33119794

fbshipit-source-id: ca327caf34560c0bba32511e57d5dc18b71bdfe1
2021-12-14 21:54:41 -08:00
59000cff91 [quant][fx][graphmode] Add support for conv add pattern in backend_config_dict (#69778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69778

This PR extends fusion pattern support from a simple sequence of ops to a simple
subgraph like conv - add
```
x - conv ---\
y ---------add ---- output
```
where the inputs x, y and the output are observed/quantized

Test Plan:
```
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRTOps.test_conv_add
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D33024528

fbshipit-source-id: 5c770c82c8f693fabdac5c69343942a9dfda84ef
2021-12-14 20:46:01 -08:00
408283319a [Operator Versioning][Edge] Change OP to CALL when there is a valid upgrader (#67731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67731

1. Register the upgrader function at the loading stage
2. Change OP to CALL when the operator_version from the model is smaller than the current runtime version and there exists a valid upgrader

The interpreter log is:
```
RUNNING 0 STOREN 1 3
RUNNING 1 DROPR 1
RUNNING 2 LOAD 2
RUNNING 3 LOAD 3
RUNNING 4 CALL 0
RUNNING 0 STOREN 1 2
RUNNING 1 LOAD 1
RUNNING 2 OP 0, aten::is_floating_point
RUNNING 3 JF 3
RUNNING 4 LOADC 1
RUNNING 5 JMP 3
RUNNING 8 STORE 3
RUNNING 9 MOVE 3
RUNNING 10 JF 5
RUNNING 11 LOAD 1
RUNNING 12 LOAD 2
RUNNING 13 OP 1, aten::div.Tensor
RUNNING 14 JMP 5
RUNNING 19 STORE 4
RUNNING 20 DROPR 2
RUNNING 21 DROPR 1
RUNNING 22 MOVE 4
RUNNING 23 RET
RUNNING 5 LOAD 2
RUNNING 6 LOAD 3
RUNNING 7 CALL 0
RUNNING 0 STOREN 1 2
RUNNING 1 LOAD 1
RUNNING 2 OP 0, aten::is_floating_point
RUNNING 3 JF 3
RUNNING 4 LOADC 1
RUNNING 5 JMP 3
RUNNING 8 STORE 3
RUNNING 9 MOVE 3
RUNNING 10 JF 5
RUNNING 11 LOAD 1
RUNNING 12 LOAD 2
RUNNING 13 OP 1, aten::div.Tensor
RUNNING 14 JMP 5
RUNNING 19 STORE 4
RUNNING 20 DROPR 2
RUNNING 21 DROPR 1
RUNNING 22 MOVE 4
RUNNING 23 RET
RUNNING 8 MOVE 2
RUNNING 9 MOVE 3
RUNNING 10 CALL 0
RUNNING 0 STOREN 1 2
RUNNING 1 LOAD 1
RUNNING 2 OP 0, aten::is_floating_point
RUNNING 3 JF 3
RUNNING 4 LOADC 1
RUNNING 5 JMP 3
RUNNING 8 STORE 3
RUNNING 9 MOVE 3
RUNNING 10 JF 5
RUNNING 11 LOAD 1
RUNNING 12 LOAD 2
RUNNING 13 OP 1, aten::div.Tensor
RUNNING 14 JMP 5
RUNNING 19 STORE 4
RUNNING 20 DROPR 2
RUNNING 21 DROPR 1
RUNNING 22 MOVE 4
RUNNING 23 RET
RUNNING 11 TUPLE_CONSTRUCT 3
RUNNING 12 RET
```

The upgrader bytecode is:
```
(STOREN, 1, 2)
(LOAD, 1, 0)
(OP, 0, 0)
(JF, 3, 0)
(LOADC, 1, 0)
(JMP, 3, 0)
(LOAD, 2, 0)
(OP, 0, 0)
(STORE, 3, 0)
(MOVE, 3, 0)
(JF, 5, 0)
(LOAD, 1, 0)
(LOAD, 2, 0)
(OP, 1, 0)
(JMP, 5, 0)
(LOAD, 1, 0)
(LOAD, 2, 0)
(LOADC, 0, 0)
(OP, 2, 0)
(STORE, 4, 0)
(DROPR, 2, 0)
(DROPR, 1, 0)
(MOVE, 4, 0)
(RET, 0, 0)
```
ghstack-source-id: 145635622

Test Plan: describe in summary and CI

Reviewed By: iseeyuan

Differential Revision: D32092517

fbshipit-source-id: 0314b4bda5d2578cdd4e7cfbfd1e3c07fbccf8a3
2021-12-14 19:13:12 -08:00
9e4d60a552 [Operator Versioning][Edge] Use check in cpp source file for upgrader (#67728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67728

1. Check in upgrader_mobile.h and upgrader_mobile.cpp
2. Add test to parse all bytecode from upgrader_mobile.h
ghstack-source-id: 145635621

Test Plan: buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterUpgraderTest.Upgrader'

Reviewed By: iseeyuan

Differential Revision: D32087295

fbshipit-source-id: 21e95aabb5e9db76be27e01adfea8fbc41caeaf6
2021-12-14 19:10:51 -08:00
bf089840ac [quant][graphmode][fx] Enable fuse handler for sequence of 3 ops (#69658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69658

This PR enables the fuse handler for a sequence of three ops, and merges all fuse handlers into one

TODO: we can also move this to backend_config_dict folder

Test Plan:
regression fusion test
```
python test/test_quantization.py TestFuseFx
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32974907

fbshipit-source-id: ba205e74b566814145f776257c5f5bb3b24547c1
2021-12-14 19:04:21 -08:00
102684b252 [SR] Fix stack/concat bug (#68777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68777

Fixed some cases where negative dimensions were not handled correctly

* `_stack_cpu` calls `maybe_wrap_dim`, but `_stack_cpu_out` does not. This is only problematic when `_stack_cpu_out` forwards to the serial kernel: [ref](https://www.internalfb.com/code/fbsource/[1b5af978b48f2e5d308d42b588bde3275869a57b]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=1541-1547).
* concat also needs to wrap its dim (see the sketch after this list)
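A minimal sketch of the wrapping rule, mirroring ATen's `maybe_wrap_dim`:

```
def maybe_wrap_dim(dim: int, ndim: int) -> int:
    # Normalize a possibly-negative dim into [0, ndim) before indexing;
    # this is the step the serial _stack_cpu_out path was missing.
    if dim < -ndim or dim >= ndim:
        raise IndexError(f"dim {dim} out of range for a {ndim}-d tensor")
    return dim + ndim if dim < 0 else dim

assert maybe_wrap_dim(-1, 4) == 3
assert maybe_wrap_dim(2, 4) == 2
```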

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Added new tests to cover this case

Reviewed By: hlu1

Differential Revision: D32604623

fbshipit-source-id: 00aaa42817cd2d3e7606ce75ab5a9744645118cf
2021-12-14 16:26:27 -08:00
ebc35a7ead [JIT] Enable freezing for sparse COO tensors (#69614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69614

Previously sparse COO tensors were ignored during freezing, because
`tryInsertConstant` would fail during `freeze_module.cpp`, and because
hashes weren't implemented for COO tensor IValues.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32954620

Pulled By: davidberard98

fbshipit-source-id: a91f97fdfc2152b417f43a6948100c94970c0831
2021-12-14 15:43:50 -08:00
33363cea64 Revert D32498572: allow external backend codegen to be used without autograd kernels
Test Plan: revert-hammer

Differential Revision:
D32498572 (b83b6f7424)

Original commit changeset: 3e7159c633f6

Original Phabricator Diff: D32498572 (b83b6f7424)

fbshipit-source-id: f93fa444c95a2423eef5975a2ecdb96f14e0c535
2021-12-14 15:28:49 -08:00
f6cad53443 Revert D32498569: allow external backend codegen to toggle whether to generate out= and inplace kernels
Test Plan: revert-hammer

Differential Revision:
D32498569 (aa0cf68c17)

Original commit changeset: ebd932d042b9

Original Phabricator Diff: D32498569 (aa0cf68c17)

fbshipit-source-id: 21a393fa339510d926512a7983d33ece327b743d
2021-12-14 15:27:24 -08:00
0ef523633f Revert D32498570: make codegen'd device guards not cuda-specific. Allow them to be used in external codegen
Test Plan: revert-hammer

Differential Revision:
D32498570 (2e7a91c45f)

Original commit changeset: 0ce6a5614417

Original Phabricator Diff: D32498570 (2e7a91c45f)

fbshipit-source-id: 7c64ce1b5e51a680b4aeae8721e0c9e15c793289
2021-12-14 15:04:10 -08:00
24ee1d13f6 Another attempt to fix version comparison check (#69939)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69939

Reviewed By: atalman

Differential Revision: D33108135

Pulled By: malfet

fbshipit-source-id: cadadfe5b04c4378f149136f8e1f8e8d6266775c
2021-12-14 14:54:15 -08:00
d4f8313497 Add low level torch.profiler.kineto_profile base class (#63302)
Summary:
Refactor torch.profiler.profile by separating it into one low-level class and one high-level wrapper.

The PR includes the following changes:
1. Separate the class torch.profiler.profile into two classes: kineto_profiler and torch.profiler.profile.
2. The former class has the low-level functionality exposed at the C++ level, e.g. prepare_profiler, start_profiler, stop_profiler.
3. The original logic in torch.profiler.profile, including export_chrome_trace, export_stacks, key_averages, events, and add_metadata, is moved into kineto_profiler since it is all exposed by torch.autograd.profiler.
4. The new torch.profiler.profile is fully backward-compatible with the original class since it inherits from torch.profiler.kineto_profiler. Its only responsibility in the new implementation is maintaining the finite state machine of ProfilerAction.

With the refactoring, the responsibility boundary is clear and the new logic is simple to understand.
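A minimal usage sketch of the high-level wrapper; per the summary above, the methods called below would live on the low-level kineto_profiler base class:

```
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(128, 128) @ torch.randn(128, 128)

# Inherited low-level functionality, per the split described above.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
prof.export_chrome_trace("trace.json")
```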

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63302

Reviewed By: albanD

Differential Revision: D33006442

Pulled By: robieta

fbshipit-source-id: 30d7c9f5c101638703f1243fb2fcc6ced47fb690
2021-12-14 14:47:43 -08:00
e8d5c7cf7f [nn] mha : no-batch-dim support (python) (#67176)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

* [x] Update docs
* [x] Tests for shape checking
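A minimal usage sketch of the unbatched-input support; shapes are assumed to follow the batched convention minus the batch dim:

```
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
q = torch.randn(5, 8)            # (L, E): no batch dimension
out, weights = mha(q, q, q)
print(out.shape, weights.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```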

Tests take roughly 20s on the system that I use. Below are the timings for the slowest 20 tests.

```
pytest test/test_modules.py -k _multih --durations=20
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 372 items / 336 deselected / 36 selected

test/test_modules.py ..............ssssssss..............                                                                                                                                                  [100%]

================================================================================================ warnings summary ================================================================================================
../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:73
test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float32
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:73: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ==============================================================================================
8.66s call     test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_MultiheadAttention_cuda_float64
2.02s call     test/test_modules.py::TestModuleCPU::test_gradgrad_nn_MultiheadAttention_cpu_float64
1.89s call     test/test_modules.py::TestModuleCUDA::test_grad_nn_MultiheadAttention_cuda_float64
1.01s call     test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float32
0.51s call     test/test_modules.py::TestModuleCPU::test_grad_nn_MultiheadAttention_cpu_float64
0.46s call     test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiheadAttention_cuda_float32
0.45s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiheadAttention_cuda_float64
0.44s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiheadAttention_cuda_float32
0.21s call     test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_cuda_float64
0.21s call     test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_cuda_float32
0.18s call     test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiheadAttention_cuda_float64
0.17s call     test/test_modules.py::TestModuleCPU::test_non_contiguous_tensors_nn_MultiheadAttention_cpu_float32
0.16s call     test/test_modules.py::TestModuleCPU::test_non_contiguous_tensors_nn_MultiheadAttention_cpu_float64
0.11s call     test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float64
0.08s call     test/test_modules.py::TestModuleCPU::test_pickle_nn_MultiheadAttention_cpu_float32
0.08s call     test/test_modules.py::TestModuleCPU::test_pickle_nn_MultiheadAttention_cpu_float64
0.06s call     test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiheadAttention_cuda_float64
0.06s call     test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiheadAttention_cuda_float32
0.06s call     test/test_modules.py::TestModuleCPU::test_forward_nn_MultiheadAttention_cpu_float32
0.06s call     test/test_modules.py::TestModuleCPU::test_forward_nn_MultiheadAttention_cpu_float64
============================================================================================ short test summary info =============================================================================================
=========================================================================== 28 passed, 8 skipped, 336 deselected, 2 warnings in 19.71s ===========================================================================
```

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67176

Reviewed By: dagitses

Differential Revision: D33094285

Pulled By: jbschlosser

fbshipit-source-id: 0dd08261b8a457bf8bad5c7f3f6ded14b0beaf0d
2021-12-14 13:21:21 -08:00
37ec99c0e4 Open source trt lowering workflow (#69381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69381

Open-source the lowering workflow, related tools, and tests.

Test Plan: CI

Reviewed By: 842974287

Differential Revision: D32815136

fbshipit-source-id: 3ace30833a2bc52e9b02513c5e223cb339fb74a3
2021-12-14 13:00:21 -08:00
930067d129 Build clang builds with -Werror (#69712)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69712

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D32997002

Pulled By: malfet

fbshipit-source-id: 8ebb5a955f8ae2d3fb67bc70636a2b1d66010c84
2021-12-14 12:41:57 -08:00
c76c6e9bd3 [ONNX] Add BFloat16 type support when export to ONNX (#66788)
Summary:
- PyTorch and ONNX both support BFloat16; add this to unblock some mixed-precision training models (see the sketch after this list).
- Support the PyTorch TNLG model's use of BFloat16 tensors for the inputs/outputs of the layers that run on the NPU.
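A hypothetical export sketch of the BFloat16 support; the opset choice is an assumption (ONNX added BFloat16 in opset 13), and the onnx package must be installed:

```
import torch

# BFloat16 module and input; with this change, export no longer rejects the dtype.
model = torch.nn.Linear(4, 4).to(torch.bfloat16)
x = torch.randn(1, 4, dtype=torch.bfloat16)
torch.onnx.export(model, x, "model.onnx", opset_version=13)
```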

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66788

Reviewed By: jansel

Differential Revision: D32283510

Pulled By: malfet

fbshipit-source-id: 150d69b1465b2b917dd6554505eca58042c1262a
2021-12-14 12:23:32 -08:00
800a457b6f [shard] add ShardedOptimizer (#68607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68607

This PR adds ShardedOptimizer and an API to get module parameters along with ShardedTensor params; it allows users to use this optimizer wrapper to construct an optimizer that involves ShardedTensors.

The state_dict support will be a follow-up diff.
ghstack-source-id: 145532834

Test Plan: python test_sharded_optim.py

Reviewed By: pritamdamania87

Differential Revision: D32539994

fbshipit-source-id: a3313c6870d1f1817fc3e08dc2fc27dc43bef743
2021-12-14 12:15:20 -08:00
457ba1dd3e Porting index_add to structured kernels, add an out variant (#65993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65993

This PR attempts to port `index_add` to structured kernels, but does more than that:

* Adds an `out=` variant to `index_add` (see the usage sketch after this list)
* Revises `native_functions.yaml` registrations to not have multiple entries, instead passing a default value for `alpha`.
* Changes in the `derivatives.yaml` file for autograd functioning
* Revises error messages; please see: https://github.com/pytorch/pytorch/pull/65993#issuecomment-945441615
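A usage sketch of the functional form with the new `out=` variant (signature per this summary; `alpha` defaults to 1):

```
import torch

x = torch.zeros(3, 4)
index = torch.tensor([0, 2])
source = torch.ones(2, 4)

result = torch.index_add(x, 0, index, source)  # rows 0 and 2 become ones
out = torch.empty(3, 4)
torch.index_add(x, 0, index, source, out=out)  # the new out= variant
assert torch.equal(result, out)
```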

Follow-up PRs in near future will attempt to refactor the OpInfo test, and will give another look at tests in `test/test_torch.py` for this function. (hence the use of ghstack for this)

~This is WIP because there are tests failing for `Dimname` variant on mobile/android builds, and I'm working on fixing them.~

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32646426

fbshipit-source-id: b035ecf843a9a27d4d1e18b202b035adc2a49ab5
2021-12-14 11:57:13 -08:00
9594a94d80 fix CompositeImplicitAutograd ops improperly labeled (#69863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69863

This reverts commit 41c344d460a941c57f4793690c396f830a992824.

Test Plan: Imported from OSS

Reviewed By: albanD, soulitzer

Differential Revision: D33072958

Pulled By: bdhirsh

fbshipit-source-id: 3d3488f37986256986ab009d6f16476f29cff625
2021-12-14 11:47:07 -08:00
269e92669a [c2] Remove unused private fields (#69709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69709

Fix a logical bug in `caffe2/ideep/operators/conv_op.cc`, which
contained an always-false condition (fusion_type_ == X && fusion_type_ == Y).

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997006

Pulled By: malfet

fbshipit-source-id: 23e4db1b17cf8a77eae6a8691847ffa484d4736c
2021-12-14 11:31:08 -08:00
fef9981998 Update run_test.py (#69920)
Summary:
Do not compare LooseVersion against string

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69920

Reviewed By: atalman

Differential Revision: D33101166

Pulled By: malfet

fbshipit-source-id: a2df9e01d17663262718f11e580c8b009764f7b5
2021-12-14 11:26:56 -08:00
3e43c478a8 [Quant][fx] Lower reference conv[1-3]d module (#69228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69228

Implement lowering logic for reference conv modules,
similar to https://github.com/pytorch/pytorch/pull/65723.
ghstack-source-id: 145058198

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_conv_lowering

Imported from OSS

Reviewed By: anjali411

Differential Revision: D32890743

fbshipit-source-id: 04f2500628c60b0fbc84d22705164215e190aeba
2021-12-14 11:23:39 -08:00
b67eaec853 [DateLoader] more clearly expose 'default_collate' and 'default_convert' to users (#69862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69862

Fixes #69445

cc SsnL VitalyFedyunin ejguan NivekT
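A usage sketch of the newly exposed helper; the `torch.utils.data` import path is assumed from the title:

```
import torch
from torch.utils.data import default_collate  # import path assumed

batch = [(torch.tensor([1.0]), 0), (torch.tensor([2.0]), 1)]
features, labels = default_collate(batch)
print(features.shape, labels)  # torch.Size([2, 1]) tensor([0, 1])
```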

Test Plan: Imported from OSS

Reviewed By: ejguan, ngimel

Differential Revision: D33068792

Pulled By: NivekT

fbshipit-source-id: ef9791acdc23d014b8761fa7420062d454ce8969
2021-12-14 11:18:26 -08:00
1188d89a1d TestMathBits: Call functions with original sample input values (#68947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68947

`_test_math_view` currently calls the operator with different values
than those specified in the `SampleInput`. This is undesirable as it
could break mathematical properties required by the operator. Instead,
this calls `math_op_view(math_op_physical(sample.input))` to get a
view that represents the same value as the original input.

`test_neg_view` already did this by returning `torch._neg_view(-x)`
from `math_op_view` but this moves the handling into `_test_math_view`
to make it apply to all view op tests.
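The identity this relies on, sketched for the neg-view case:

```
import torch

x = torch.randn(3)
v = torch._neg_view(-x)                  # a view whose logical values equal x
assert v.is_neg()                        # the negation bit is set on the view
assert torch.equal(v.resolve_neg(), x)   # same values as the original input
```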

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33064327

Pulled By: anjali411

fbshipit-source-id: 4d87e0c04fc39b95f8dc30dcabda0d554d16a1d8
2021-12-14 11:10:13 -08:00
1a299d8f1b Add support for transformer layout of masked_softmax (#69272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69272

In the transformer encoder and MHA, masked_softmax's mask is a 2D tensor (B, D), while the input is a 4D tensor (B, H, D, D).
This mask could simply be broadcast to (B, H, D, D) like the input and then fed to a regular masked_softmax; however, that produces a non-contiguous mask and consumes more memory.
In this diff, we keep the mask's shape unchanged and compute the corresponding mask element for the input in each CUDA thread.

This new layout is not yet supported on CPU.
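What the naive broadcast would materialize, sketched in eager PyTorch with the shape names from this summary:

```
import torch

B, H, D = 2, 4, 8
scores = torch.randn(B, H, D, D)            # attention scores
mask = torch.zeros(B, D, dtype=torch.bool)  # 2D key-padding mask
mask[:, -2:] = True                         # mask out the last two keys

# The route this diff avoids: expand to (B, H, D, D), which is non-contiguous
# and memory-hungry, then run a regular masked softmax.
expanded = mask[:, None, None, :].expand(B, H, D, D)
out = scores.masked_fill(expanded, float("-inf")).softmax(dim=-1)
print(out.shape, expanded.is_contiguous())  # torch.Size([2, 4, 8, 8]) False
```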

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32605557

fbshipit-source-id: ef37f86981fdb2fb264d776f0e581841de5d68d2
2021-12-14 10:51:58 -08:00
2e7a91c45f make codegen'd device guards not cuda-specific. Allow them to be used in external codegen (#68531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68531

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32498570

Pulled By: bdhirsh

fbshipit-source-id: 0ce6a5614417671313b4d274ea84742c5b81d1b0
2021-12-14 10:25:04 -08:00
aa0cf68c17 allow external backend codegen to toggle whether to generate out= and inplace kernels (#68530)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68530

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32498569

Pulled By: bdhirsh

fbshipit-source-id: ebd932d042b988e19c71aa04a21677db9bdc9f04
2021-12-14 10:25:02 -08:00
b83b6f7424 allow external backend codegen to be used without autograd kernels (#68529)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68529

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32498572

Pulled By: bdhirsh

fbshipit-source-id: 3e7159c633f6a80b60faa068436a4c49ebe731ca
2021-12-14 10:23:12 -08:00
8acd0a8b2f Allow row sizes to support int64/size_t. (#69303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69303

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/792

Follow up to D32715453 (e60fd10659), allowing row size to be 64-bit.

Test Plan:
buck test mode/opt -c fbcode.caffe2_gpu_type=v100,a100 //deeplearning/fbgemm/fbgemm_gpu:quantize_ops_test
   buck test mode/opt -c fbcode.caffe2_gpu_type=none //deeplearning/fbgemm/fbgemm_gpu:quantize_ops_test
   buck test mode/opt //caffe2/test:

Reviewed By: jspark1105, jianyuh

Differential Revision: D32768838

fbshipit-source-id: 9e2b01d8d23e71f8333820e725379c3fc1c0711a
2021-12-14 10:09:08 -08:00
2c9dd886af Modify torch.movedim to handle scalar as no-op (#69537)
Summary:
`torch.movedim` now directly handles the case of a scalar (0-dim) input tensor as a no-op by returning a view of the input tensor (after all the usual checks on the other parameters).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69537

Test Plan:
This code now works fine, and res1 is a view of the tensor:
```
import torch

tensor = torch.rand(torch.Size([]))
res1 = torch.movedim(tensor, 0, 0)
```

Fixes https://github.com/pytorch/pytorch/issues/69432

Reviewed By: jbschlosser

Differential Revision: D33020014

Pulled By: albanD

fbshipit-source-id: b3b2d380d70158bd3b3d6b40c073377104e09007
2021-12-14 09:55:59 -08:00
7503ec58b2 [nnc][fix] xnnpack ifdef (#69870)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69870

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D33075061

Pulled By: IvanKobzarev

fbshipit-source-id: dd53ad8b7d0ff36a68f0864540d6f7dd2284f0e0
2021-12-14 09:50:24 -08:00
f7294cd865 [Static Runtime] Skip ReplaceWithCopy when inputs have writters (#69819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69819

We should skip ReplaceWithCopy if the inputs to the operator can be updated during inference. For a set of tensors that share data, ReplaceWithCopy should not happen to any of them if there exist updates to any of them.

Currently, the existing check misses some cases (suppose there exist updates, and uses <= 1). This diff addresses the missing cases by querying AliasDB.

Test Plan:
- Added test cases, including a one that is problematic before this diff
- CI

Reviewed By: mikeiovine

Differential Revision: D33052562

fbshipit-source-id: 61f87e471805f41d071a28212f2f457e8c6785e7
2021-12-14 09:39:49 -08:00
07767569c9 Properly import LooseVersion (#69904)
Summary:
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/57040

Somehow importing `distutils` from `setuptools` caused an import of
`distutils.version`, which is not a documented dependency and got
changed with the release of
[setuptools-59.6.0](https://github.com/pypa/setuptools/tree/v59.6.0)
We should not rely on that, as
`import distutils` never re-imports `distutils.version`, which one can
see by observing
https://github.com/python/cpython/blob/3.9/Lib/distutils/__init__.py
or by running:
```
% python3 -c "import distutils;print(distutils.__version__, dir(distutils))"
3.7.5 ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'sys']
% python3 -c "from setuptools import distutils;print(distutils.__version__, dir(distutils))"
3.7.5 ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'archive_util', 'ccompiler', 'cmd', 'config', 'core', 'debug', 'dep_util', 'dir_util', 'dist', 'errors', 'extension', 'fancy_getopt', 'file_util', 'filelist', 'log', 'spawn', 'sys', 'sysconfig', 'util', 'version']
```
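A minimal sketch of the explicit-import style the summary argues for (the actual patch may differ):
```python
from distutils.version import LooseVersion  # import the submodule explicitly

print(LooseVersion("1.10.0") > LooseVersion("1.9.1"))  # True
```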

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69904

Reviewed By: albanD, atalman, janeyx99

Differential Revision: D33094453

Pulled By: malfet

fbshipit-source-id: aaf1adb7c6f293c4e376ccff21c64cd6ba625e97
2021-12-14 09:28:19 -08:00
fdcb78df38 print fix in lr_scheduler (#68338)
Summary:
`{:5d}` fails for `CosineAnnealingWarmRestarts` which has float `epoch`
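A quick illustration of the failure mode (assuming standard Python format-spec behavior):
```python
print("{:5d}".format(3))        # integers format fine: '    3'
try:
    print("{:5d}".format(3.5))  # a float epoch, as CosineAnnealingWarmRestarts produces
except ValueError as e:
    print(e)  # Unknown format code 'd' for object of type 'float'
```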

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68338

Reviewed By: jbschlosser

Differential Revision: D33063970

Pulled By: albanD

fbshipit-source-id: 992e987f8d5f6f8f5067924df4671e9725b6d884
2021-12-14 09:05:19 -08:00
f7210f8d90 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33090919

fbshipit-source-id: 78efa486776014a27f280a01a21f9e0af6742e3e
2021-12-14 08:06:58 -08:00
4f81b2adbb Remove if conditioning from some MacOS workflow steps (#69788)
Summary:
Indirectly fixes https://github.com/pytorch/pytorch/issues/69389

These steps shouldn't error out when the credentials aren't set anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69788

Reviewed By: seemethere

Differential Revision: D33061307

Pulled By: janeyx99

fbshipit-source-id: 7db6d15b3e80c3c13ea428248a8b4f8d2d32d4a1
2021-12-14 07:54:15 -08:00
fa615b332d added set_printoptions examples (#68324)
Summary:
Added examples for `torch.set_printoptions`

```
>>> torch.set_printoptions(precision=2)
>>> torch.tensor([1.12345])
tensor([1.12])
>>> torch.set_printoptions(threshold=5)
>>> torch.arange(10)
tensor([0, 1, 2, ..., 7, 8, 9])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68324

Reviewed By: ngimel

Differential Revision: D33063869

Pulled By: anjali411

fbshipit-source-id: 24db99df1419f96ba8ae2b5217cb039b288b630a
2021-12-14 07:40:52 -08:00
d90012689f [DataPipe] Control shuffle settings from DataLoader2 (#65756)
Summary:
Makes `shuffle` DataPipe sensitive to DataLoader(2) `shuffle` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65756

Reviewed By: albanD

Differential Revision: D31344867

Pulled By: VitalyFedyunin

fbshipit-source-id: e0084e0ac193ac784d6298328ca1222745681347
2021-12-14 07:35:26 -08:00
620a1fcb55 OpInfos for: normal, bernoulli, multinomial (#66358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66358

Test Plan: - run tests

Reviewed By: mruberry

Differential Revision: D31551695

Pulled By: zou3519

fbshipit-source-id: cf1b43118a0414a1af9ece9ae8c0598b2701aa0a
2021-12-14 06:59:38 -08:00
4829dcea09 Codegen: Generate separate headers per operator (#68247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68247

This splits `Functions.h`, `Operators.h`, `NativeFunctions.h` and
`NativeMetaFunctions.h` into separate headers per operator base name.
With `at::sum` as an example, we can include:
```cpp
<ATen/core/sum.h>         // Like Functions.h
<ATen/core/sum_ops.h>     // Like Operators.h
<ATen/core/sum_native.h>  // Like NativeFunctions.h
<ATen/core/sum_meta.h>    // Like NativeMetaFunctions.h
```

The umbrella headers are still being generated, but all they do is
include from the `ATen/ops` folder.

Further, `TensorBody.h` now only includes the operators that have
method variants, which means files that only include `Tensor.h` don't
need to be rebuilt when you modify function-only operators. Currently
there are about 680 operators that don't have method variants, so this
is potentially a significant win for incremental builds.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32596272

Pulled By: albanD

fbshipit-source-id: 447671b2b6adc1364f66ed9717c896dae25fa272
2021-12-14 06:40:08 -08:00
badf7b0210 fix typo changing the generated code (#69899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69899

Reviewed By: soulitzer

Differential Revision: D33093461

Pulled By: albanD

fbshipit-source-id: 2c672a2b767f0caed1ef3a1d2afa1cacdfcdc320
2021-12-14 06:36:14 -08:00
51033ec840 Add forward AD layout check for storage numel (#68631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68631

This PR:
- Adds the check that the storage numel of the base and tangent tensors are the same. This is to support the case when as_strided reveals elements that aren't indexable by the input tensor.
- Skips the check when batched tensors are involved, because using as_strided to reveal elements that are not indexable by the input tensor is already not allowed in vmap.
- Adds tests for the above two cases, as well as an edge case regarding conj bit (what about neg bit?)

For functorch:
- we need to copy the batching rule implemented here
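A minimal sketch of the situation the new check guards against, assuming the `torch.autograd.forward_ad` API (the exact error surface may differ):
```python
import torch
import torch.autograd.forward_ad as fwAD

base = torch.randn(4)
# Same sizes and strides as base, but the underlying storage holds 6 elements:
tangent = torch.randn(6)[:4]

with fwAD.dual_level():
    # The layout check compares storage numels, so this mismatch is rejected:
    # as_strided on the dual could reveal tangent elements the base cannot index.
    dual = fwAD.make_dual(base, tangent)
```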

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32899678

Pulled By: soulitzer

fbshipit-source-id: 54db9550dd2c93bc66b8fb2d36ce40799ebba794
2021-12-14 04:34:25 -08:00
6078e12ad6 Add forward AD support for as_strided (#68629)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68629

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32899680

Pulled By: soulitzer

fbshipit-source-id: b80ba4483c06108938923f17dc67278b854515ef
2021-12-14 04:33:05 -08:00
fed9b90ed4 fixing removeProfilingNodes duplicated functions (#1282) (#68804)
Summary:
Unfortunately, there are two versions of the removeProfilingNodes function, and one of them does not clean up profile_ivalue nodes properly. This leads to a dangling profile_ivalue node, which ends up being profiled multiple times and can give us false assert failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68804

Reviewed By: mrshenli

Differential Revision: D32980157

Pulled By: Krovatkin

fbshipit-source-id: cd57c58a941d10ccd01a6cd37aac5c16256aaea6
2021-12-13 22:54:30 -08:00
82075c0a19 Create trt plugin base (#69487)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69487

Writing a customized plugin for TRT requires extending IPluginV2IOExt. This diff extracts functions that should share a common implementation between plugins from IPluginV2IOExt into plugin_base, making it easier for OSS users to write customized plugins.

This diff also fixes a double-creator issue; the root cause is that get_trt_plugin in converters.py looks for plugins by name matching. Switching to the util function from converters_utils.py resolves the issue.

Test Plan: CI

Reviewed By: 842974287

Differential Revision: D32747052

fbshipit-source-id: 7f2e8811c158230f66a0c389af4b84deaf7e2d1f
2021-12-13 21:31:24 -08:00
77a4b89411 Adding windows cuda 11.5 workflows (#69377)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69377

Reviewed By: ngimel

Differential Revision: D33076022

Pulled By: atalman

fbshipit-source-id: aeb2791fc15d7b491976f57a74c1989c6ca61b81
2021-12-13 20:49:02 -08:00
b1ef56d646 [quant][docs] quantized model save/load instructions (#69789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69789

Add details on how to save and load quantized models without hitting errors

Test Plan:
CI autogenerated docs

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D33030991

fbshipit-source-id: 8ec4610ae6d5bcbdd3c5e3bb725f2b06af960d52
2021-12-13 20:23:59 -08:00
2b81ea4f9a [DataPipe] Export ShardingFilter (#69844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69844

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33062183

Pulled By: ejguan

fbshipit-source-id: 6b3f4ad376959c4d2e8c8b2751ae6657527dcd36
2021-12-13 19:30:56 -08:00
603a1de871 Fix inefficient recursive update in ShardedTensor.state_dict hook (#68806)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68805

The bug is described in the linked issue. This PR is an attempt to make the functions `_recurse_update_dict` and `_recurse_update_module` more efficient in how they iterate over the submodules. The previous implementation was suboptimal, as it recursively called the update method on the submodules returned by `module.named_modules()`, while `module.named_modules()` already returns all submodules, including nested ones.
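A hypothetical sketch of the inefficiency (names invented for illustration): since `named_modules()` already yields every nested submodule, recursing over its output revisits the same modules many times, while a single flat pass suffices.
```python
import torch.nn as nn

def _update(module: nn.Module) -> None:
    pass  # placeholder for the per-module state_dict fix-up

def recurse_update_slow(module: nn.Module) -> None:
    _update(module)
    for _, sub in module.named_modules():
        if sub is not module:          # named_modules() yields the module itself first
            recurse_update_slow(sub)   # re-walks each subtree repeatedly

def update_fast(module: nn.Module) -> None:
    for _, sub in module.named_modules():  # already includes nested submodules
        _update(sub)
```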

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68806

Reviewed By: pritamdamania87

Differential Revision: D33053940

Pulled By: wanchaol

fbshipit-source-id: 3e72822f65a641939fec40daef29c806af725df6
2021-12-13 19:22:55 -08:00
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP, so I've moved that into
its own header, but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
8dfdc3df82 [ROCm] Refactor how to specify AMD gpu targets using PYTORCH_ROCM_ARCH (#61706)
Summary:
Remove all hardcoded AMD gfx targets

PyTorch build and Magma build will use rocm_agent_enumerator as
backup if PYTORCH_ROCM_ARCH env var is not defined

PyTorch extensions will use same gfx targets as the PyTorch build,
unless PYTORCH_ROCM_ARCH env var is defined

torch.cuda.get_arch_list() now works for ROCm builds

PyTorch CI dockers will continue to be built for gfx900 and gfx906 for now.

PYTORCH_ROCM_ARCH env var can be a space- or semicolon-separated list of gfx archs, e.g. "gfx900 gfx906" or "gfx900;gfx906"
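For example, on a ROCm build one can now inspect the compiled-in targets from Python (the output below is hypothetical):
```python
import torch

print(torch.cuda.get_arch_list())  # e.g. ['gfx900', 'gfx906']
```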
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61706

Reviewed By: seemethere

Differential Revision: D32735862

Pulled By: malfet

fbshipit-source-id: 3170e445e738e3ce373203e1e4ae99c84e645d7d
2021-12-13 15:41:40 -08:00
c6c3b43498 [SR][easy] Accessors for value array offsets (#69755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69755

Per swolchok's suggestion on D32609915 (1c43b1602c). Hide the value offset indices behind accessors to provide more flexibility if we ever decide to change the layout of the values array.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32838145

fbshipit-source-id: cf805c077672de4c2fded9b41da01eca6d84b388
2021-12-13 15:31:39 -08:00
3d358a7678 Adds a maximize flag to Adam (#68164)
Summary:
Solves the next most important use case in https://github.com/pytorch/pytorch/issues/68052.

I have kept the style as close to that in SGD as seemed reasonable, given the slight differences in their internal implementations.

All feedback welcome!
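A minimal usage sketch, assuming the flag mirrors SGD's `maximize` kwarg:
```python
import torch

param = torch.randn(3, requires_grad=True)
opt = torch.optim.Adam([param], lr=1e-2, maximize=True)

objective = (param ** 2).sum()
objective.backward()
opt.step()  # with maximize=True, param moves to increase the objective
```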

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68164

Reviewed By: VitalyFedyunin

Differential Revision: D32994129

Pulled By: albanD

fbshipit-source-id: 65c57c3f3dbbd3e3e5338d51def54482503e8850
2021-12-13 05:53:53 -08:00
fc37e5b3ed Hook up general convolution to convolution_backward (#69584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69584

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32936380

Pulled By: jbschlosser

fbshipit-source-id: c6fdd88db33bd1a9d0eabea47ae09a4d5b170e92
2021-12-12 17:30:01 -08:00
0420de3539 [SR] Log SR options (#69809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69809

SR options are only printed out once per model per net. Logging them is actually pretty helpful for debugging.

Test Plan: CI

Reviewed By: donaldong

Differential Revision: D33046814

fbshipit-source-id: 536b34e00fbc8a273c5eb4d8ae5caca0dc1f4c24
2021-12-12 16:32:00 -08:00
f0e98dcbd3 General convolution_backward function (#69044)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69044

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD, H-Huang

Differential Revision: D32708818

Pulled By: jbschlosser

fbshipit-source-id: e563baa3197811d8d51553fc83718ace2f8d1b7a
2021-12-12 15:53:38 -08:00
a5b5152d7a Fix typo in aten::full in version_map (#69807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69807

Test Plan: {gif:ursvp75m}

Reviewed By: gmagogsfm

Differential Revision: D33044503

fbshipit-source-id: 14aac66b123d84ca3f35f02c276b15e55015df9e
2021-12-12 14:47:16 -08:00
af7ee9fc01 Forward AD for inplace comparison operators (#69597)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69597

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33020600

Pulled By: soulitzer

fbshipit-source-id: 0c9ab210f7dc952a41fbcaa1f5f7921c2fdeb18b
2021-12-12 00:11:14 -08:00
0dcbd73eee Add some forward AD formulas (#69384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69384

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33020602

Pulled By: soulitzer

fbshipit-source-id: a92dd243f2b5b21fe277b0bb17bcd61dfe5a0d67
2021-12-12 00:11:11 -08:00
baf92f9d5a Fix copy_ forward AD to handle broadcasting (#69592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69592

Currently, the forward AD function for `copy_` (in `VariableTypeManual`) does not handle the broadcasting case. ~EDIT: but that is a design decision, not a bug. In this PR, we make that clear as a comment.~

Note: `broadcast_to` does not have a batching rule in core, so the ops that rely on `copy_` to broadcast will still fail batched forward grad computation.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33020603

Pulled By: soulitzer

fbshipit-source-id: 09cb702bffc74061964a9c05cfef5121f8164814
2021-12-12 00:11:08 -08:00
db32daf4b2 Do not test batched forward grad for inplace ops (#69558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69558

Currently we skip batched forward grad checks completely for certain views that also have inplace variants. This PR allows us to decouple the check.

Alternative: just skip the batched forward checks for inplace ops entirely. I'm okay with this because it was surprising to me these checks are being run in the first place.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33020599

Pulled By: soulitzer

fbshipit-source-id: f8012aadc0e775f80da0ab62b2c11f6645bb1f51
2021-12-12 00:09:45 -08:00
f565167fbd Revert D32606547: torch/monitor: add C++ events and handlers
Test Plan: revert-hammer

Differential Revision:
D32606547 (e61fc1c03b)

Original commit changeset: a00d0364092d

Original Phabricator Diff: D32606547 (e61fc1c03b)

fbshipit-source-id: fbaf2cc06ad4bec606e8a9c6f591d65c04e6fa56
2021-12-11 22:51:03 -08:00
f575179953 [quant][fx][graphmode] Move more patterns to use ModuleReLU fuse handler (#69644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69644

This PR cleans up the init of ModuleReLUFuseHandler and moves all `module - relu`
fusion patterns to use this handler.

It also temporarily disables the additional_fuser_method argument; we will enable it
again after we bring back the simple pattern format.

Test Plan:
```
python test/test_quantize_fx.py TestFuseFx
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32974906

fbshipit-source-id: 23483ea4293d569cb3cec6dadfefd4d9f30921a7
2021-12-11 22:00:06 -08:00
e61fc1c03b torch/monitor: add C++ events and handlers (#68783)
Summary:
This adds a C++ event handler corresponding to the Python one mentioned in the RFC.

This changes the counters a bit to all be push driven instead of being polled. The two window types are "fixed count" and "interval". One is based off the number of logged events and the other is based off of time windows. There's currently no active ticker for interval so it needs a regular stream of events to ensure events are produced. A follow up diff can add support for things like HHWheel / simple ticker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68783

Test Plan: buck test //caffe2/test/cpp/monitor:monitor

Reviewed By: kiukchung

Differential Revision: D32606547

fbshipit-source-id: a00d0364092d7d8a98e0b18e503c0ca8ede2bead
2021-12-11 16:44:46 -08:00
20f7c893c1 Populate runtime with upgrader graph (#68773)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68773

Test Plan: Imported from OSS

Reviewed By: qihqi, gmagogsfm

Differential Revision: D32603258

Pulled By: tugsbayasgalan

fbshipit-source-id: 6fa0b7ee4ebe46c9aa148923c6ef3e1de106ad13
2021-12-11 13:44:24 -08:00
17f3179d60 Back out "[pytorch][PR] Add ability for a mobile::Module to save as flatbuffer" (#69796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69796

(Note: this ignores all push blocking failures!)

Test Plan: External CI + Sandcastle

Reviewed By: zhxchen17

Differential Revision: D33032671

fbshipit-source-id: dbf6690e960e25d6a5f19043cbe792add2acd7ef
2021-12-10 21:29:53 -08:00
3906f8247a clear predict_net field from PredictorExporterMeta stored in the exporter to save memory (#68485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68485

In OSS, the only change is that we make the predict_net field of PredictorExporterMeta nullable.

Test Plan: sandcastle, let CI run

Reviewed By: boryiingsu

Differential Revision: D32467138

fbshipit-source-id: 81bd5fca695462f6a186bcfa927073874cc9c26a
2021-12-10 21:25:36 -08:00
19fecc63e4 [PyTorch][kineto] Remove heap-allocated vectors in saveExtraArgs (#69737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69737

We can use stack allocation instead.
ghstack-source-id: 145312454

Test Plan: Ran internal framework overhead benchmark with --stressTestKinto --kinetoAddFlops, but difference was minimal. Still good to fix.

Reviewed By: chowarfb

Differential Revision: D33007329

fbshipit-source-id: e096312fef5b729cf12580be152c9418683745b8
2021-12-10 20:24:17 -08:00
731c8255b7 Fix the TorchBench CI when running with a benchmark branch. (#69795)
Summary:
Fixes TorchBench CI when user is running with their own branch

Supersedes https://github.com/pytorch/pytorch/pull/69770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69795

Reviewed By: malfet

Differential Revision: D33032886

Pulled By: xuzhao9

fbshipit-source-id: 82baee94df6925bf91bb575143efa058ce98b914
2021-12-10 18:04:43 -08:00
59deee8308 Make c10 tests compilable with -Werror (#69711)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69711

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997005

Pulled By: malfet

fbshipit-source-id: 369194051ece9d213b48584ca84e5d76b3794dae
2021-12-10 16:47:46 -08:00
e305e4d4d8 Suppress common warnings when building by clang (#69710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69710

Namely no range-loop-analysis (that detect when loop variable can not be const reference

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997003

Pulled By: malfet

fbshipit-source-id: dba0e7875e5b667e2cc394c70dd75e2403265918
2021-12-10 16:45:38 -08:00
41c344d460 Revert D32739976: fix CompositeImplicitAutograd ops improperly labeled
Test Plan: revert-hammer

Differential Revision:
D32739976 (195b0d0645)

Original commit changeset: a756dd9e0b87

Original Phabricator Diff: D32739976 (195b0d0645)

fbshipit-source-id: 6e898dd5435f31e604588e6e50be1217fa207a54
2021-12-10 13:04:29 -08:00
77213fa4d3 Fix docker builds for Python-3.6 (#69785)
Summary:
As [conda-4.11](https://anaconda.org/anaconda/conda/files?version=4.11.0) is no longer available for Python-3.6, stick to 4.10 for 3.6 builds

Fixes https://github.com/pytorch/pytorch/issues/69781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69785

Reviewed By: seemethere, atalman

Differential Revision: D33026217

Pulled By: malfet

fbshipit-source-id: d742a1e79634ed62b3a941ba23a7a74f41c2f4cb
2021-12-10 12:29:15 -08:00
a5a7e30943 [DataPipe] Adding interface for MapDataPipes (#69648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69648

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32989066

Pulled By: NivekT

fbshipit-source-id: ef96bcd4ac4d7a576fdd2a3fb4ef52ae6a902e10
2021-12-10 12:06:08 -08:00
81a60b9813 [DataPipe] Adding output types to DataPipe interface file (#69647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69647

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32989067

Pulled By: NivekT

fbshipit-source-id: 2c2e71e9e514e0d584affaa0b71b7b0d07a2ddbf
2021-12-10 12:04:45 -08:00
d026057bb3 [PyTorch] Update SmallVector from LLVM (#69110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69110

I pasted the current LLVM code, reapplied the modifications listed in the code comments, and caught a few more in the diff/build process. The trivially copyable detection is different now; if gcc builds fail, we will try reverting to C10_IS_TRIVIALLY_COPYABLE or copying what LLVM is doing.

The motivation for this change is that, as noted in an existing comment, C10_IS_TRIVIALLY_COPYABLE did the wrong thing for std::unique_ptr, which caused problems with D32454856 / #68412.

ghstack-source-id: 145327773

Test Plan: CI

Reviewed By: bhosmer, mruberry

Differential Revision: D32733017

fbshipit-source-id: 9452ab90328e3fdf457aad23a26f2f6835b0bd3d
2021-12-10 11:57:19 -08:00
1d269e8c15 [PyTorch] Simple refcount bump fixes in standardizeVectorForUnion & callees (#66695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66695

More extra reference counting in this path.
ghstack-source-id: 145125484

Test Plan: CI

Reviewed By: suo

Differential Revision: D31692197

fbshipit-source-id: 126b6c72efbef9410d4c2e61179b6b67459afc23
2021-12-10 11:43:01 -08:00
5374d5d8c9 [shard] fix with_comms wrapper (#69493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69493

When the `with_comms` decorator took arguments, we added a `with_comms_decorator` inner function; `with_comms()` then referred to a function object, so the added parentheses were necessary in test cases.

This PR fixes the `with_comms` wrapper behavior to allow specifying it both with and without arguments in test cases:
```
@with_comms
def test_case(self):
    ...
```
or
```
@with_comms(backend="gloo")
def test_case(self):
    ...
```
ghstack-source-id: 145327066

Test Plan: test_sharded_tensor

Reviewed By: pritamdamania87

Differential Revision: D32897555

fbshipit-source-id: 2f3504630df4f6ad1ea73b8084fb781f21604110
2021-12-10 10:25:54 -08:00
e1c583a691 [JIT] simplify logic for merging types during profiling (#69096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69096

Instead of storing profiling data in a map and then merging at
the end, perform merging directly during profiling.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32772626

Pulled By: davidberard98

fbshipit-source-id: 22622c916a61908b478dd09433815685ce43682a
2021-12-10 09:29:19 -08:00
3219f6a487 Make vec512 bfloat16 map function clang-Wall clean (#69707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69707

`const` modifier for `__m512` return value doesn't make much sense

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997008

Pulled By: malfet

fbshipit-source-id: fb98659713fe2a23cc702252c0655106687f0dbf
2021-12-10 09:11:42 -08:00
a5ad2cdab5 Cleanup ProcessGroup.cpp (#69706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69706

Mostly code modernization; also, do not capture the unused `this` in the
end_handler functor.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32997009

Pulled By: malfet

fbshipit-source-id: ac907f0c6889ad06d4fb0171964cb05133e5e610
2021-12-10 09:11:39 -08:00
7ea5926130 Make blend operations clang-Wall clean (#69705)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69705

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997007

Pulled By: malfet

fbshipit-source-id: cbadc44e1e7373800e94b7b2fd2711530854978c
2021-12-10 09:10:07 -08:00
195b0d0645 fix CompositeImplicitAutograd ops improperly labeled (#69169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69169

I checked `derivatives.yaml`, and it doesn't look like `logical_not/and/xor` are meant to work with autograd. Those 3 ops are currently set as `CompositeImplicitAutograd` though, implying that they do work with autograd. Updating them to be CompositeExplicitAutograd instead.

This came up because I'm trying to improve the error checking in external backend codegen, and these ops being improperly labeled incorrectly triggers my new error checks for XLA (see https://github.com/pytorch/pytorch/pull/67090)

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32739976

Pulled By: bdhirsh

fbshipit-source-id: a756dd9e0b87276368063c8f4934be59dca371d3
2021-12-10 09:03:51 -08:00
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
91d16cb633 [Jit] Fix schema of aten::split int[] version (#69745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69745

Missed in D31935573 (6b44e75f6b).

Reviewed By: d1jang

Differential Revision: D31889867

fbshipit-source-id: 417bd0b15db4891dbd641b35a803553f11d0d756
2021-12-10 02:33:36 -08:00
9962bfb3c9 Remove THTensor (#69040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69040

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872478

Pulled By: ngimel

fbshipit-source-id: f93e16509d64308d91e374744410a6a811e7f4e3
2021-12-10 02:29:11 -08:00
531b045446 [tensorexpr] Fix the buf size of discontiguous tensors (#69657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69657

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32974473

Pulled By: huiguoo

fbshipit-source-id: 52dcd13d0ad7f7e4f1beb69dcaabc8ceb386ffca
2021-12-10 01:26:37 -08:00
aab67c6dff Add native masked_softmax (#69268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69268

This diff enables native masked softmax on CUDA and also expands our current warp_softmax to accept masking.
The mask in this masked softmax has to be the same shape as the input, and has to be contiguous.

In a follow-up diff, I will include the encoder mask layout, where the input is BHDD and the mask is BD.
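Conceptually, the fused kernel computes the equivalent of the following unfused sketch (the native op itself is internal, so this is illustration only):
```python
import torch

x = torch.randn(2, 3)
mask = torch.tensor([[False, True, False],
                     [True, False, False]])
# Masked positions are excluded by filling with -inf before normalizing:
out = torch.softmax(x.masked_fill(mask, float("-inf")), dim=-1)
```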

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32338419

fbshipit-source-id: 48c3fde793ad4535725d9dae712db42e2bdb8a49
2021-12-09 23:29:45 -08:00
a5996a6857 [SR] Wrap check_for_memory_leak with DCHECK (#69588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69588

Code cleanup

Reviewed By: mikeiovine

Differential Revision: D32938333

fbshipit-source-id: d15dc405b281411c4c3c27a1dabf82f430c3ed08
2021-12-09 22:11:21 -08:00
3bb20ae49f Make c10d tests -Werror clean (#69703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69703

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D32997001

Pulled By: malfet

fbshipit-source-id: 38b5f195c04f2b3b920e6883a96fe9a36345b9d2
2021-12-09 22:10:04 -08:00
be757addfa Do not use std::labs (#69704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69704

Instead, compute size diff inside the if statement

Test Plan: Imported from OSS

Reviewed By: zou3519, seemethere

Differential Revision: D32997004

Pulled By: malfet

fbshipit-source-id: a23819240bfe8278a11ebc6bae1e856de162f082
2021-12-09 22:05:14 -08:00
3f02ad09ec [ONNX] shapeValueMap: Represent symbolic shape as value (#68203) (#69545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69545

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32994272

Pulled By: malfet

fbshipit-source-id: 77cbdd78d01712faf4f9703549a2833340954509

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-12-09 22:00:46 -08:00
3d32a0c139 Back out "[wip][quant][graphmode] produce reference pattern for binary ops and then rewrite to quantized op" (#69713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69713

Original commit changeset: 456086b308c4

Original Phabricator Diff: D32537714 (bd8a4a9372)

Reviewed By: jerryzh168

Differential Revision: D32976643

fbshipit-source-id: bea6bf6a2718e42c9efa48a0b0c1dc7fe3893065
2021-12-09 21:55:09 -08:00
7dba88dfdb [nnc][quant] Fix quantized concat (#69596)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69596

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32941108

Pulled By: IvanKobzarev

fbshipit-source-id: 727f608b98625648e2e444396d910838c95f58f2
2021-12-09 18:55:32 -08:00
b2e79ed5ec Remove WindowsTorchApiMacro.h in favor of Export.h (#69585)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095

This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585

Reviewed By: mrshenli

Differential Revision: D32958594

Pulled By: albanD

fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
2021-12-09 17:30:09 -08:00
f87f1d08e8 [SR] assignStorageToManagedTensors returns a vector (#69568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69568

Non-empty vectors should never be passed to `assignStorageToManagedTensors` and `assignStorageToManagedOutputTensors`. Presumably, this out-variant convention was adopted to avoid move-assigning the corresponding attributes in `MemoryPlanner`. But the cost of a vector move-assign is not high, and this function type signature is safer.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: donaldong

Differential Revision: D32729289

fbshipit-source-id: 88f19de8eb89d8a4f1dd8bbd4d9e7f686e41888b
2021-12-09 17:01:48 -08:00
9aa1b3e396 [Static Runtime] [Code Cleanup] Encapsulate function objects within ProcessedFunction (#69595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595

This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it.

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D32908341

fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
2021-12-09 15:11:03 -08:00
41e1ab0785 Introduce isTensorSubclassLike; add special cases to backwards formulas (#69534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69534

Something is TensorSubclassLike if it is a Tensor subclass or if it has
the same problems as Tensor subclasses. Today that just includes Tensor
Subclasses and meta tensors but may include other things in the future.

Some of our backwards formulas are incompatible with TensorSubclassLike
objects. For example, calling .data_ptr() is a problem because many
TensorSubclassLike objects don't have storage. Another problem is
in-place operations: performing `regular_tensor.inplace_(tensor_subclass)`
is a problem.
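A small sketch of the `.data_ptr()` problem using a meta tensor, one of the tensor-subclass-like cases named above (exact error type may differ):
```python
import torch

m = torch.empty(2, 2, device="meta")  # no real storage behind it
try:
    m.data_ptr()
except Exception as e:
    print(e)  # backward formulas that touch data_ptr() break on such tensors
```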

This PR adds special cases to the backward formulas for torch.max and
torch.clamp to handle this. The backward formulas for torch.max and
torch.clamp are not dispatcher operations so they cannot be overridden
and we hesitate to make them dispatcher operations for FC/BC concerns
and performance overhead concerns.

Furthermore, the old concept of "is this inplace operation vmap
compatible?" can be subsumed by the general "is this inplace operation
tensor-subclass compatible" question, so I replaced all instances of
isInplaceVmapCompatible and replaced it with the isTensorSubclassLike
checks.

Test Plan
- I tested the changes using functorch.
- It's possible to write a test for these in core (one has to make
a custom tensor subclass and then send it through the operation and then
invoke autograd), but I wanted to push the work to doing some
generic testing for backward formulas
(https://github.com/pytorch/pytorch/issues/69530) instead of doing some
one-off things now.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32967727

Pulled By: zou3519

fbshipit-source-id: 30fda1a7581da4c55179b7a3ca05069150bbe2dc
2021-12-09 15:03:22 -08:00
d3649309e6 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#69306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69306

Included functions:

save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer
Module object

Test Plan: unittests

Reviewed By: gmagogsfm

Differential Revision: D32806835

fbshipit-source-id: 71913c6650e225634f878946bd16960d377a7f57
2021-12-09 14:53:31 -08:00
193e3c484e .github: Add fbsync to push triggers (#69718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69718

canary is now pushing to fbsync so we should change our workflows to
reflect that.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D32999967

Pulled By: seemethere

fbshipit-source-id: bc4bc9afd2d73c53f91d3af3b81aca1b31f665a4
2021-12-09 14:30:29 -08:00
3e20a74b55 [SR] Update memory planner docs (#69559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69559

We have a lot of special cases. Document them so they're easy to learn about.
ghstack-source-id: 145226542

Test Plan: Spell check? :)

Reviewed By: d1jang

Differential Revision: D32929416

fbshipit-source-id: 2362410f25a27cdb74a4939903446192cef61978
2021-12-09 14:22:33 -08:00
e963b43691 Extend explanation of torch.cholesky_inverse to consider batched inputs. (#69069)
Summary:
While implementing https://github.com/pytorch/pytorch/issues/68720,
we found out empirically that `torch.cholesky_inverse` supports batched inputs, but this is not explained in the docs: [link](https://github.com/pytorch/pytorch/pull/68720#pullrequestreview-817243697)
`torch.cholesky_inverse` was implemented in https://github.com/pytorch/pytorch/issues/50269 and the doc was updated in https://github.com/pytorch/pytorch/issues/31275 but not merged.
neerajprad
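A minimal sketch of the batched usage the updated docs describe:
```python
import torch

A = torch.randn(3, 4, 4)
spd = A @ A.transpose(-2, -1) + 4 * torch.eye(4)  # a batch of SPD matrices
L = torch.linalg.cholesky(spd)
inv = torch.cholesky_inverse(L)  # batched inputs are supported
```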

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69069

Reviewed By: mrshenli

Differential Revision: D32979362

Pulled By: neerajprad

fbshipit-source-id: 0967c969434ce6e0ab15889c240149c23c0bce44
2021-12-09 14:01:31 -08:00
9ad05f2c3a Upgrade oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
Summary:
This PR upgrades oneDNN to [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3) and includes [Graph API preview release](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.2) in one package.

- oneDNN will be located at `pytorch/third_party/ideep/mkl-dnn/third_party/oneDNN`
- The version of oneDNN will be [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3)
  The main changes on CPU:

  - v2.3
    - Extended primitive cache to improve primitive descriptor creation performance.
    - Improved primitive cache performance in multithreaded configurations.
    - Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids).
    - Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    - Improved performance of reduction primitive
    - Improved performance of depthwise convolution primitive with NHWC activations for training cases
  - v2.3.1
    -  Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support
    - Fixed integer overflow for inner product implementation on CPUs
    - Fixed out of bounds access in GEMM implementation for Intel SSE 4.1
  - v2.3.2
    - Fixed performance regression in fp32 inner product primitive for processors with Intel AVX512 support
  - v2.3.3
    - Reverted check for memory descriptor stride validity for unit dimensions
    - Fixed memory leak in CPU GEMM implementation

  More changes can be found in https://github.com/oneapi-src/oneDNN/releases.
- The Graph API provides flexible API for aggressive fusion, and the preview2 supports fusion for FP32 inference.  See the [Graph API release branch](https://github.com/oneapi-src/oneDNN/tree/dev-graph-preview2) and [spec](https://spec.oneapi.io/onednn-graph/latest/introduction.html) for more details. A separate PR will be submitted to integrate the oneDNN Graph API to Torchscript graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63748

Reviewed By: albanD

Differential Revision: D32153889

Pulled By: malfet

fbshipit-source-id: 536071168ffe312d452f75d54f34c336ca3778c1
2021-12-09 13:42:40 -08:00
17641fed2a Revert D32942007: OpInfo: Convert more sample_input_funcs to generators
Test Plan: revert-hammer

Differential Revision:
D32942007 (d21646c432)

Original commit changeset: bb5b253d6d87

Original Phabricator Diff: D32942007 (d21646c432)

fbshipit-source-id: d37c78174f0acea48e4cd4af3ac67ca4ee7ac54d
2021-12-09 10:54:41 -08:00
0ccb1dcdbb Fix inference_mode decorator (#68617)
Summary:
This fixes the case when `torch.inference_mode` is called with `mode=False` (disabled). When used as a decorator, it ignored the argument and enabled inference mode anyway.

`_DecoratorContextManager` is changed so that a new instance is a copy instead of a new instance with default parameters.

I also added more tests to cover this case.

Current behaviour:

```python
>>> import torch
>>> x = torch.ones(1, 2, 3, requires_grad=True)
>>> @torch.inference_mode(mode=False)
... def func(x):
...     return x * x
...
>>> out = func(x)
>>> out.requires_grad
False
```

New behaviour (fixed):

```python
>>> import torch
>>> x = torch.ones(1, 2, 3, requires_grad=True)
>>> @torch.inference_mode(mode=False)
... def func(x):
...     return x * x
...
>>> out = func(x)
>>> out.requires_grad
True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68617

Reviewed By: mrshenli

Differential Revision: D32958434

Pulled By: albanD

fbshipit-source-id: 133c69970ef8bffb9fc9ab5142dedcffc4c32945
2021-12-09 10:45:09 -08:00
afb742382a use irange for loops 10 (#69394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69394

Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;x++)
```
to the format
```
for(const auto var: irange(xmax))
```

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32837991

fbshipit-source-id: fc7c4f76d2f32a17a0faf329294b3fe7cb81df32
2021-12-09 09:49:34 -08:00
2d5b3101c1 Added ScriptFunction pkl exception for issue #61210 #61381 (#67076)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61381, https://github.com/pytorch/pytorch/issues/61210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67076

Reviewed By: jbschlosser

Differential Revision: D32908175

Pulled By: suo

fbshipit-source-id: f6e175793243dc96cde5e44022d92f2623b934eb

Co-authored-by: LucaStubbe <stubbeluca@gmail.com>
Co-authored-by: Kanon Tromp <ktromp1@student.cccd.edu>
2021-12-09 09:44:49 -08:00
d21646c432 OpInfo: Convert more sample_input_funcs to generators (#69257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69257

These are sample functions that already use generators internally, this just moves the `yield` into the sample function itself.
Diff is best viewed ignoring whitespace changes https://github.com/pytorch/pytorch/pull/69257/files?diff=unified&w=1
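A hypothetical sketch of the pattern (names invented; the real sample functions yield SampleInput objects):
```python
import torch

def sample_inputs_foo(op_info, device, dtype, requires_grad, **kwargs):
    # The yield now lives in the sample function itself instead of an
    # inner generator that gets collected into a list.
    for shape in ((), (2,), (2, 3)):
        yield torch.randn(shape, device=device, dtype=dtype,
                          requires_grad=requires_grad)
```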

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32942007

Pulled By: mruberry

fbshipit-source-id: bb5b253d6d87b3495b7059924bed35b09d2768a2
2021-12-09 08:38:51 -08:00
6de9f0fc94 OpInfo: Allow sample_inputs_func to be any iterable (#69256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69256

Closes #52486

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32942008

Pulled By: mruberry

fbshipit-source-id: f5b01b0298c0160b0bec6e86e2b6db8cfe746206
2021-12-09 08:37:26 -08:00
d2917f705a Fix errors in common_utils.py (#69578)
Summary:
This fixes the following error:
```python
Traceback (most recent call last):
  File "/home/gaoxiang/pytorch-ucc2/test/distributed/test_distributed_spawn.py", line 40, in <module>
    run_tests()
  File "/home/gaoxiang/.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 618, in run_tests
    ['--import-slow-tests'] if IMPORT_SLOW_TESTS else List[str]([]))
  File "/usr/lib/python3.9/typing.py", line 680, in __call__
    raise TypeError(f"Type {self._name} cannot be instantiated; "
TypeError: Type List cannot be instantiated; use list() instead
Traceback (most recent call last):
  File "/home/gaoxiang/pytorch-ucc2/test/run_test.py", line 1058, in <module>
    main()
  File "/home/gaoxiang/pytorch-ucc2/test/run_test.py", line 1036, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_spawn failed!
```
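The error comes from calling a subscripted `typing` generic; a minimal reproduction and the shape of the fix:
```python
from typing import List

try:
    List[str]([])           # typing generics cannot be instantiated
except TypeError as e:
    print(e)                # Type List cannot be instantiated; use list() instead

empty: List[str] = list()  # annotate with List[str], construct with list()
```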

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69578

Reviewed By: mrshenli

Differential Revision: D32963113

Pulled By: malfet

fbshipit-source-id: b064e230c5e572e890b4ac66ebdda2707b8c12d7
2021-12-09 07:33:43 -08:00
07932e2735 [sparsity] Convert function for sparse kernels without a context manager (#66778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66778

This removes the hack of the context manager that would communicate the zeros block shape to the quantization convert.
The conversion will assume that the converted modules have `sparse_params` (which is added by the sparsifier).

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31835721

Pulled By: z-a-f

fbshipit-source-id: c5fd2da3b09a728a2296765c00ca69275dbca3b1
2021-12-09 02:58:57 -08:00
b957b82db7 Replace issue templates with new issue forms - v2 (#69361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69361

This PR introduces the new issue forms that replace issue templates.
(This is exactly the same as https://github.com/pytorch/pytorch/pull/65917 which was reverted due to an issue during the import)

This is similar to what was done in torchvision https://github.com/pytorch/vision/pull/4299 and torchaudio, you can see the end result here: https://github.com/pytorch/vision/issues/new/choose (click e.g. on the [bug report](https://github.com/pytorch/vision/issues/new?assignees=&labels=&template=bug-report.yml))

The main new thing is that we can enforce some of the fields to be filled, especially for bug reports. It's also a much cleaner GUI for users IMHO, and we can provide better examples and instructions.

There is still a "blank" template available.

I removed the "Questions" form: we say we close these issues anyway. I replaced it with a direct link to https://discuss.pytorch.org. Since we still have a "blank" template, I think this covers all previous use-cases properly.

Test Plan: Imported from OSS

Reviewed By: albanD, mrshenli

Differential Revision: D32947189

Pulled By: NicolasHug

fbshipit-source-id: f19abe3e7c9c479b0b227969a207916db5bdb6e3
2021-12-09 02:42:29 -08:00
e948856ce7 [sparsity] Add ability to keep sparsity parameters in modules (#66777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66777

Sometimes one might need to keep the sparsity parameters after the sparsifier is detached.
This saves the parameters in the `sparse_params`.
There are two ways of keeping the sparsifier params:

1. Tuple[str, ...]: A tuple of all the parameters that need to be stored.
2. Dict[str, Tuple[str, ...]]: A dict of layer keys and parameters. In this case only specified layers will have the parameters attached to.

For example:

```
>>> # This will keep params in every module
>>> sparsifier.squash_mask(keep_sparse_params=('sparse_block_shape',))
>>> print(model.submodule.linear1.sparse_params)
{'sparse_block_shape': (1, 4)}
>>> print(model.submodule.linear2.sparse_params)
{'sparse_block_shape': (1, 4)}
```

```
>>> # This will keep params only in specific modules
>>> sparsifier.squash_mask(keep_sparse_params={'submodule.linear1': ('sparse_block_shape',)})
>>> print(model.submodule.linear1.sparse_params)
{'sparse_block_shape': (1, 4)}
>>> print(model.submodule.linear2.sparse_params)
AttributeError: 'Linear' object has no attribute 'sparse_params'
```

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31835722

Pulled By: z-a-f

fbshipit-source-id: 20c2d80207eb7ce7291e7f5f655d3fb2a627190f
2021-12-09 02:36:27 -08:00
13faaff54c [Operator Versioning][Edge] Implement register function for upgrader (#67730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67730

This pr implement the register function for upgrader so it can be used at loading stage
ghstack-source-id: 145170986

Test Plan:
```
buck test //caffe2/test/cpp/jit:jit
```

Reviewed By: iseeyuan

Differential Revision: D32092518

fbshipit-source-id: 779b51eb12b8cb162a93a55c1e66fe0becc4cb36
2021-12-09 02:18:09 -08:00
4f5806dee7 [AO] Clear the contents of the torch/ao/__init__.py (#69415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69415

Adding the imports inside the torch/ao/__init__.py has a high chance of causing circular dependencies, especially if sparsity and quantization use each other's resources.
To avoid the dependency issues, we can just keep the __init__ empty.

Notes:
- This means that the user will have to explicitly import `torch.ao.quantization` or `torch.ao.sparsity` instead of `from torch import ao; ao.quantization.???` (see the sketch after these notes).
- The issue of circular dependencies that are caused by the imports with binding submodules is [fixed in Python 3.7](https://docs.python.org/3/whatsnew/3.7.html#other-language-changes), which means this solution will become obsolete at the [3.6's EoL](https://www.python.org/dev/peps/pep-0494/#and-beyond-schedule), which comes [12/23/2022](https://devguide.python.org/#status-of-python-branches).
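A minimal sketch of the import pattern the first note describes:
```python
import torch.ao.quantization  # explicit submodule import works

# By contrast, `from torch import ao; ao.quantization` would raise an
# AttributeError while torch/ao/__init__.py stays empty.
```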

Future options to resolve the circular dependencies (subject to discussion):
1. Use interfaces for binding submodules. For example, have a torch/ao/_nn with all the source code, and an interface torch/ao/nn with only the __init__.py file. The __init__ files inside the torch/ao/_nn will be empty
2. Completely isolate the common code into a separate submodule, s.a. torch/ao/common. The other submodules will not be referencing each other.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32860168

Pulled By: z-a-f

fbshipit-source-id: e3fe77e285992d34c87d8742e1a5e449ce417c36
2021-12-09 01:21:30 -08:00
015e481a41 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32975574

fbshipit-source-id: 66856595c7bc29921f24a2c5c00c72892f262aa1
2021-12-09 00:10:33 -08:00
dc87cf5fe1 Fixes mem_get_info when querying on a device other than the current device (#69640)
Summary:
Also fixes the documentation failing to appear and adds a test to validate that op works with multiple devices properly.
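A small usage sketch, assuming the optional device argument of `torch.cuda.mem_get_info`:
```python
import torch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Query device 1 without making it the current device first:
    free_bytes, total_bytes = torch.cuda.mem_get_info(1)
    print(free_bytes, total_bytes)
```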

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69640

Reviewed By: ngimel

Differential Revision: D32965391

Pulled By: mruberry

fbshipit-source-id: 4fe502809b353464da8edf62d92ca9863804f08e
2021-12-08 23:04:30 -08:00
24d885f5f8 [Vulkan] Thread-safe Vulkan backend for OSS (#69576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69576

Vulkan backend for OSS is also thread-safe by default:
* Removed the `MAKE_VULKAN_THREADSAFE` preprocessor macro and its if-conditions

Test Plan:
Test build on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
Test build on MacOS:
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

Test result on Google Pixel 5:
```
//xplat/caffe2:pt_vulkan_perf_test_binAndroid#android-arm64 buck-out/gen/fe3a39b8/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64
buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64: 1 file pushed, 0 skipped. 145.4 MB/s (826929592 bytes in 5.426s)
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       39.3 ms         10.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       27.1 ms         5.86 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       58.5 ms         11.8 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        5.98 ms        0.803 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        9.14 ms        0.857 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3       32.1 ms         31.3 ms         3000
```

Test result on MacOS:
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 18.89, 29.61, 24.95
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       53.3 ms         39.6 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       28.0 ms         20.7 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       51.8 ms         38.7 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        2.76 ms         1.31 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        2.29 ms         1.11 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3       49.2 ms         41.8 ms         3000
```

Reviewed By: SS-JIA

Differential Revision: D32933891

fbshipit-source-id: d8ebd5394771e1d79230c1f3aa8fbec4472b3197
2021-12-08 21:04:52 -08:00
ecf9c82f24 Reduce binary size of TensorCompare.cu (#68835)
Summary:
This PR does several things
1) eliminates `where` instantiations for deprecated `byte` condition dtype, and casts `condition` to `bool` in this case. This is a perf penalty for people using deprecated calls
2) Makes `clamp_{min/max}.Tensor` overload reuse `clamp_{min/max}.Scalar` kernels if limit argument is cpu scalar, instead of instantiating `gpu_kernel_with_scalars`
3) Unifies all clamp_scalar kernels to use a single kernel with lambda picking the correct operation. I've verified that it doesn't degrade kernel performance.
4) Eliminates redundant TensorIterator construction that `clamp` structured kernel was doing when only `min` or `max` was specified

This reduces the cubin size for TensorCompare.cu on V100 from 15751920 bytes to 7691120 bytes, with corresponding reduction in compile time.
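A sketch of point (1): the deprecated byte mask is now handled by casting to bool rather than by a dedicated kernel instantiation (illustrated here with an explicit cast):
```python
import torch

cond = torch.tensor([1, 0, 1], dtype=torch.uint8)  # deprecated condition dtype
x = torch.ones(3)
y = torch.zeros(3)
out = torch.where(cond.bool(), x, y)  # equivalent of what the cast path does
```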

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68835

Reviewed By: mruberry

Differential Revision: D32839241

Pulled By: ngimel

fbshipit-source-id: 0acde5af10a767264afbdb24684b137c5544b8d9
2021-12-08 20:08:53 -08:00
3e560239e2 [Vulkan] Implement clone operator (#69551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69551

Implemented `clone` operator in the Vulkan backend:
* Supports only <= 4D tensors.
* Internal name is `aten::clone`.
* Vulkan `clone` operator accepts only `c10::MemoryFormat::Preserve` and `c10::MemoryFormat::Contiguous` for the argument `c10::optional<c10::MemoryFormat> optional_memory_format`.
* Throws an exception if the `optional_memory_format` argument is neither `MemoryFormat::Preserve` nor `MemoryFormat::Contiguous`
* CPU implementation: [/aten/src/ATen/native/TensorFactories.cpp::clone()](3e45739543/aten/src/ATen/native/TensorFactories.cpp (L1415))
* MKL-DNN implementation: [/aten/src/ATen/native/mkldnn/TensorShape.cpp::mkldnn_clone()](3e45739543/aten/src/ATen/native/mkldnn/TensorShape.cpp (L58))
* `self.copy_(src)` calls `copy_()` for Vulkan to Vulkan copy operation
```
vTensor::copy_()
vTensor::copy_() X -> Vulkan
vTensor::copy_() CPU -> Vulkan
vTensor::clone()
vTensor::clone() -> MemoryFormat::Preserve
vTensor::clone() -> MemoryFormat::Preserve -> self = at::empty_like(src)
vTensor::clone() self.copy_(src); -> BEFORE
vTensor::copy_()
vTensor::copy_() X -> Vulkan
vTensor::copy_() Vulkan -> Vulkan
vTensor::clone() self.copy_(src); -> AFTER
vTensor::copy_()
vTensor::copy_() Vulkan -> X
vTensor::copy_() Vulkan -> CPU
```
* References:
  * Function `torch.clone` in PyTorch documentation: https://pytorch.org/docs/stable/generated/torch.clone.html
  * Pytorch preferred way to copy a tensor: https://stackoverflow.com/questions/55266154/pytorch-preferred-way-to-copy-a-tensor
  * `torch.memory_format`: https://pytorch.org/docs/stable/tensor_attributes.html?highlight=memory_format#torch.torch.memory_format
  * `c10::MemoryFormat` definition in [/c10/core/MemoryFormat.h](3e45739543/c10/core/MemoryFormat.h (L28))
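A Python-level sketch of the memory-format contract described above (run here on CPU; the Vulkan op enforces the same restriction):
```python
import torch

x = torch.randn(2, 3)
y = x.clone(memory_format=torch.preserve_format)    # accepted
z = x.clone(memory_format=torch.contiguous_format)  # accepted
# Any other memory format raises on the Vulkan backend.
```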

Test Plan:
Build & test on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Build & test on MacOS:
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```
Test result on Android (Google Pixel 5):
```
[ RUN      ] VulkanAPITest.clone_success
[       OK ] VulkanAPITest.clone_success (5 ms)
[ RUN      ] VulkanAPITest.clone_invalidinputs_exceptions
[       OK ] VulkanAPITest.clone_invalidinputs_exceptions (1 ms)
```
Test result on MacOS:
```
[ RUN      ] VulkanAPITest.clone_success
[       OK ] VulkanAPITest.clone_success (19 ms)
[ RUN      ] VulkanAPITest.clone_invalidinputs_exceptions
[       OK ] VulkanAPITest.clone_invalidinputs_exceptions (2 ms)
```

Reviewed By: SS-JIA

Differential Revision: D32923535

fbshipit-source-id: ea29792e1b0080cbbc1c8c7e8bf2beffad9b5c0d
2021-12-08 18:46:56 -08:00
eb2a803406 Run test_embedding_bag_with_no_grad_tensors only for TensorPipe (#69626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69626

Sparse tensors are only supported by the TensorPipe RPC backend. As a
result, moving test_embedding_bag_with_no_grad_tensors to be a TensorPipe
specific test.
ghstack-source-id: 145134888

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D32959952

fbshipit-source-id: d65f2edbb6dad7705475690a8c6293a322299dde
2021-12-08 18:29:38 -08:00
b61c532f96 Make make_dual redispatch (#68630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68630

Constraints:
1) (functorch) if all the inputs to an op have requires_grad=False and don't have tangents, then their VariableType
    kernel should be a no-op, i.e., behave like a redispatch. This is due to functorch's DynamicLayerStack
   having the autograd key by default (which is so that transformations like vmap still work with autograd)
2) (inference mode) inference tensors in inference mode will call straight into the kernel, we should still do something sensible
    inside even if we normally wouldn't redispatch into it.
3) ~Should support potential application of interposition below autograd: `nn.Parameter` is a example of subclassing where the subclass
    is not preserved when an operation is performed. There is an exception though: we want calling `make_dual` on a
    `nn.Parameter` to preserve its parameterness.~
4) Should avoid calls to shallow_copy_and_detach to avoid spurious calls into `__python_dispatch__`.

This PR:
- does not redispatch to `make_dual` from its `ADInplaceOrView` kernel to satisfy (1)
- calls into `alias` from the kernel in the native namespace so that behavior is consistent with other views in inference mode to satisfy (2)
- discussion of (3). We still wouldn't be able to directly override `make_dual` below autograd. In this PR, instead of not redispatching at all, we choose to redispatch into `at::alias` so that one can override `make_dual`. The side effect is that one would not be able to distinguish calls between the two, which can be problematic (though a straightforward but hacky solution would be to create a new `at::alias_for_make_dual` that would allow users to distinguish the two). This isn't ideal but seems to be the simplest way to satisfy (3). We don't pursue that hacky solution here.
- (4) is satisfied because we remove calls to `shallow_copy_and_detach`
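
For context, a minimal usage sketch of the user-facing API this kernel backs, assuming the public `torch.autograd.forward_ad` module (this is illustration, not code from the PR):

```
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.randn(3)

with fwAD.dual_level():
    # make_dual returns a view of primal that carries the tangent
    dual = fwAD.make_dual(primal, tangent)
    out = dual.sin()
    # unpack_dual returns a namedtuple of (primal, tangent)
    p, t = fwAD.unpack_dual(out)
```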

<details>
<summary> A potentially less hacky but more involved solution? (WIP) </summary>

Realizing that make_dual is more like requires_grad, perhaps it shouldn't be autograd explicit? Make make_dual a composite or python-only construct. i.e., it would be a view on the primal followed by something to the effect of primal.set_fw_grad(tangent).

Additional constraints:
5) make_dual needs to be backward-differentiable (I can't think of any applications yet because
   technically, as a higher-order function, jvp's input is the tangent only; "detach" is not applied on
   the tangent, so one would still be able to propagate gradients through it).
6) set_fw_grad needs to raise an error if there is a layout mismatch and base is a forward-differentiable view

Possible plan
- (6) implies that a plain view would not suffice. We need a `detach`-like operation to ensure that set_fw_grad
  knows the view is not forward differentiable.
- (5) implies that is this (new) `detach` would need to be backward differentiable (API TBD).
- (3) is no longer relevant because make_dual is no longer autograd explicit, but perhaps this new detach should behave like the current one? There is a lot of logic to replicate for detach, so this may be hard.
- (1) is satisfied if we use the current detach logic, and (4) is trivial.

I'm not convinced that this is the right solution either, because in the end does (3) still work?

 </details>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32899679

Pulled By: soulitzer

fbshipit-source-id: 98e13ae954e14e1e68dbd03eb5ab3300d5ed2c5e
2021-12-08 17:56:03 -08:00
7956a405ef Make make_dual also return namedtuple when level less than zero (#68628)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68628

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32899681

Pulled By: soulitzer

fbshipit-source-id: 61ed09f4038e19817978a521e9571fdc482b424b
2021-12-08 17:54:40 -08:00
1c43b1602c [SR] Scope exit guard for memory planner deallocation (#68795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795

This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.

Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
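
A minimal C++ sketch of the scope-exit-guard pattern described above; the type and method names are illustrative stand-ins, not the actual Static Runtime classes:

```
// Illustrative stand-in for the real memory planner; names are hypothetical.
struct MemoryPlanner {
  void deallocate() { /* may throw */ }
  void resetState() noexcept { /* simpler, audited non-throwing recovery */ }
};

// Scope exit guard: deallocation runs even when op execution throws.
class Deallocator {
 public:
  explicit Deallocator(MemoryPlanner& planner) : planner_(planner) {}
  ~Deallocator() {  // implicitly noexcept
    try {
      planner_.deallocate();
    } catch (...) {
      // An exception escaping a destructor would call std::terminate,
      // so fall back to the non-throwing recovery path instead.
      planner_.resetState();
    }
  }

 private:
  MemoryPlanner& planner_;
};

int main() {
  MemoryPlanner planner;
  Deallocator guard{planner};  // deallocate() runs when guard leaves scope
  return 0;
}
```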

Test Plan:
**New unit tests**

`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32609915

fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
2021-12-08 16:41:52 -08:00
3b27304d20 Fix typos in ATen README (#69170)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69170

Reviewed By: mrshenli

Differential Revision: D32957504

Pulled By: H-Huang

fbshipit-source-id: d8e613b67a864f95e45b2d45398ee71efde0c567
2021-12-08 14:02:26 -08:00
b10381f42d Port smooth_l1_loss to structured kernels (#67404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67404

Port smooth_l1_loss to structured kernels.

Brian Hirsh authored the part of adding build_borrowing_binary_op_coerce_to_scalar to TensorIterator.

Test Plan: This commit shouldn't change the behavior. So, CI.

Reviewed By: bdhirsh, ngimel

Differential Revision: D31981147

Pulled By: alanwaketan

fbshipit-source-id: a779bb76c848eed8b725dc0e1d56b97a3bd9c158
2021-12-08 12:56:24 -08:00
497ec9d9b8 Getting NS to work with Ferraris (#68908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68908

see description in github

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32928449

fbshipit-source-id: ba7085b823a0ebcd0d9e40f4ac19ca0a2cac1169
2021-12-08 12:26:00 -08:00
51b6981c36 [PyTorch Tests] Split out skip logic, make changes for plugins (#67256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256

To change what tests can be run in various cases, the check logic should be moved to functions and variables that can be changed.

One challenge here is that decorators aren't dynamic: if a value is read at import time and then changed afterwards, the decorator will not actually pick up the change. This means we need to separate out the variables that need to be changed for our use case.

Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.

The use case is to add new backends to the tests and split them into tests that can be run on demand as a separate instance. To do so, you would change DistTestSkipCases after importing it into a launcher or a setup script and then load distributed_test.

Test Plan: Check the signals

Reviewed By: mrshenli

Differential Revision: D31906947

fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
2021-12-08 12:23:15 -08:00
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
7407e3d6fd [fix] cross_entropy : fix weight with ignore_index and label_smoothing (#69511)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69339
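
A minimal sketch of the argument combination this fixes (values are illustrative):

```
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, -100])   # -100 is the default ignore_index
weight = torch.tensor([1.0, 2.0, 0.5])   # per-class weights

# The combination exercised by the fix: weight + ignore_index + label_smoothing.
loss = F.cross_entropy(
    logits, target, weight=weight, ignore_index=-100, label_smoothing=0.1
)
```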

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69511

Reviewed By: mrshenli

Differential Revision: D32951935

Pulled By: jbschlosser

fbshipit-source-id: 482eae851861a32f96bd6231dd3448fb6d44a015
2021-12-08 12:08:33 -08:00
d44d59aa70 [BE] Enable C++ stacktraces for MultiProcessTestCase (#69175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69175

Shows C++ stacktraces for python distributed tests that inherit from
MultiProcessTestCase. Closes https://github.com/pytorch/pytorch/issues/69168
ghstack-source-id: 145085858

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32736872

fbshipit-source-id: 743e870eefa7a9e77c5791d0936e2ebd5c9b1016
2021-12-08 11:57:51 -08:00
adb619a193 Adding hardswish, opinfo tests to custom rules (#69399)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69399

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32937576

Pulled By: Gamrix

fbshipit-source-id: 0e53d9e6669e70abcc744399f022a902214ef213
2021-12-08 11:56:34 -08:00
a0efa48c7b [Operator Versioning][Edge] Have operator version number available at the loading stage (#67729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67729

1. The operator version is needed to decide whether or not to apply an upgrader. This PR makes it available at the loading stage.
2. Swap the order of parsing instructions and operators, because an instruction needs to know its operator first when deciding whether to apply an upgrader (changing `OP` to `CALL` or not).
ghstack-source-id: 145082390

Test Plan:
```
buck test //caffe2/test/cpp/jit:jit
```

Reviewed By: iseeyuan

Differential Revision: D32092516

fbshipit-source-id: 853a68effaf95dca86ae46b7f7f4ee0d8e8767da
2021-12-08 11:50:46 -08:00
2808563e69 Forward fix for failing master (#69625)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69625

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32959635

Pulled By: anjali411

fbshipit-source-id: 4d811c6a05deb991cb2886dd65b3f6059555b395
2021-12-08 11:30:38 -08:00
3e6164449f Add efficient zero tensors (#64837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D32834987

Pulled By: anjali411

fbshipit-source-id: 20ea08ade0db0044ca633d9c1a117a6a2e65d1fd
2021-12-08 10:37:39 -08:00
30bb4e0071 Add nvidia-smi memory and utilization as native Python API (#69104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69104

Add nvidia-smi memory and utilization as native Python API
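
A minimal usage sketch, assuming the wrapper names this commit introduces (`torch.cuda.utilization` and `torch.cuda.memory_usage`, backed by NVML via pynvml):

```
import torch

if torch.cuda.is_available():
    # The same counters nvidia-smi reports, exposed natively:
    print(torch.cuda.utilization())   # GPU utilization, in percent
    print(torch.cuda.memory_usage())  # memory read/write utilization, in percent
```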

Test Plan:
Tested that the functions return the appropriate values.
Unit tests to come.

Reviewed By: malfet

Differential Revision: D32711562

fbshipit-source-id: 01e676203299f8fde4f3ed4065f68b497e62a789
2021-12-08 10:33:23 -08:00
ee60b5ddf3 Improve efficiency of shape hash by not using tostring (#69496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69496

tostring is expensive, and this is equivalent and faster

Test Plan: covered by lazy tensor unit tests

Reviewed By: desertfire, alanwaketan

Differential Revision: D32901050

fbshipit-source-id: 34080f415db5fd5d3817f7f2533f062a6ec07d21
2021-12-08 09:16:00 -08:00
2cb385dd6e OpInfo for nn.functional.dropout2d, revise sample inputs for dropout (#67891)
Summary:
Earlier, we were only testing inputs with shape `(5,)` for `nn.functional.dropout`, but since it's used a lot, I feel it's a good idea to test a few more shapes, including scalars. This PR:

1. Revises sample inputs for `nn.functional.dropout`
2. Adds an OpInfo for `nn.functional.dropout2d`.

A note regarding the documentation:

Looks like `nn.functional.dropout2d` also supports inputs of shape `(H, W)` apart from `(N, C, H, W)` / `(C, H, W)`, but the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Dropout2d.html#torch.nn.Dropout2d) doesn't mention the `(H, W)` case. Should that be revised, or am I missing something here? (Filed an issue here: https://github.com/pytorch/pytorch/issues/67892)

```python
# A 2D tensor is a valid input for Dropout2d
In [11]: tensor = torch.randn((3, 4), device='cpu', dtype=torch.float32)
In [12]: dropout2d = torch.nn.Dropout2d(p=0.5)

In [13]: dropout2d(tensor)
Out[13]:
tensor([[-0.1026, -0.0000, -0.0000, -0.0000],
        [-1.5647,  0.0000, -0.0000, -0.5820],
        [-0.0000, -3.2080,  0.1164, -3.6780]])
```

Issue Tracker: https://github.com/pytorch/pytorch/issues/54261

cc: mruberry zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67891

Reviewed By: mrshenli

Differential Revision: D32628527

Pulled By: mruberry

fbshipit-source-id: 4c9b89550f1d49526e294378ce107eba9f29cabb
2021-12-08 08:54:16 -08:00
f54745a6ff add OpInfo for torch.diagflat (#65680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65680

cc mruberry

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31730001

Pulled By: mruberry

fbshipit-source-id: 487e41da4b043944cc5b26d6081209fb0875f4de
2021-12-08 08:49:45 -08:00
7e49f4638c add OpInfo for torch.nn.functional.kl_div (#65469)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65469

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31111698

Pulled By: mruberry

fbshipit-source-id: 0af41a2ef2b199db3d8c63050277e72213f04565
2021-12-08 08:48:18 -08:00
8b20dde932 add python dispatch test back to CI and fix typo in test (#69565)
Summary:
The error message was changed following a PR comment. And since the test doesn't run on CI, I forgot to update the test to catch the new error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69565

Reviewed By: mrshenli

Differential Revision: D32932982

Pulled By: albanD

fbshipit-source-id: a1da72b0ca735e72b481bc944039233094f1c422
2021-12-08 08:44:49 -08:00
afaa184b44 [Static Runtime] Avoid evaluating expressions of Node* for interpreter fallback op (#69489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69489

This change avoids pulling the `Node*` out of `ProcessedNode*` to evaluate `Node*`-related expressions at op execution time.

A perf gain is expected but not measurable; the purpose of this change is to make SR's code more self-contained (calling more code from SR, not JIT) at execution time.

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D32893265

fbshipit-source-id: f0f397666b3556f985d45112af8fe0b08de22139
2021-12-08 08:40:30 -08:00
fc2614537b Updating quantization documentation (#68907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68907

Added information about symmetric
qschemes and corrected an error in reference to https://github.com/pytorch/pytorch/issues/68540

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32662033

fbshipit-source-id: 9052c597f61991934b86850fea8b6eab78397450
2021-12-08 08:32:33 -08:00
39fb855d91 [DataLoader] Implementing communication processes for Map-style DataPipes (#68549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68549

cc SsnL VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32922676

Pulled By: NivekT

fbshipit-source-id: fd918a342214d617a489ac5acffff15b55e9b255
2021-12-08 07:27:01 -08:00
f3983f9c47 [quant][embdding qat] Re-land Add FX support for QAT EmbeddingBag (#69334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69334

The original PR #68121 broke with an incompatible qengine on Mac OS; this PR re-introduces the changes with a fix.

Add FX support for the QAT EmbeddingBag operator; previously there was only eager mode support.

Test Plan:
pytest test/quantization/fx/test_quantize_fx.py  -v -k "test_qat_embeddingbag_linear"

Imported from OSS

Reviewed By: jingsh

Differential Revision: D32815153

fbshipit-source-id: 33654ce29de6e81920bf3277a75027fe403a1eb2
2021-12-08 05:57:20 -08:00
93aa3603ee [quant][embedding qat] Re-Land Support Embedding QAT via FX API (#69333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69333

The original PR was reverted due to a break with an incompatible qengine on Mac OS; this diff fixes that.

Support the QAT workflow by using the torch.fx QAT API, e.g. `prepare_qat_fx` and `convert_fx`.

Test Plan:
`pytest test/quantization/fx/test_quantize_fx.py -v -k "test_qat_embedding_linear"`

Imported from OSS

Reviewed By: jingsh

Differential Revision: D32814827

fbshipit-source-id: f7a69d2b596f1276dc5860b397c5d5d07e5b9e16
2021-12-08 05:28:07 -08:00
fc8404b5bc histc: Avoid dispatch in parallel region (#68520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68520

Ref #56794

This changes the code from allocating 1 tensor per thread inside the
parallel region, to allocating one larger tensor outside the parallel
region and manually viewing each thread's slice of the histogram.
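
A simplified C++/OpenMP sketch of the allocation pattern (illustrative, not the actual ATen kernel):

```
#include <omp.h>

#include <cstdint>
#include <vector>

// One flat buffer allocated up front; each thread accumulates into its own
// slice (bin indices must lie in [0, nbins)).
std::vector<int64_t> histogram(const std::vector<int64_t>& bin_of_sample,
                               int64_t nbins, int nthreads) {
  std::vector<int64_t> buffer(static_cast<size_t>(nbins) * nthreads, 0);
#pragma omp parallel num_threads(nthreads)
  {
    int64_t* local = buffer.data() +
        static_cast<int64_t>(omp_get_thread_num()) * nbins;
#pragma omp for
    for (int64_t i = 0; i < static_cast<int64_t>(bin_of_sample.size()); ++i) {
      ++local[bin_of_sample[i]];  // no allocation or dispatch in here
    }
  }
  std::vector<int64_t> hist(nbins, 0);
  for (int t = 0; t < nthreads; ++t) {  // reduce the per-thread slices
    for (int64_t b = 0; b < nbins; ++b) {
      hist[b] += buffer[t * nbins + b];
    }
  }
  return hist;
}
```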

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32929365

Pulled By: ngimel

fbshipit-source-id: e28da2736e849a0282b70f34d11526d3355d5bd5
2021-12-08 02:42:43 -08:00
2a38e1a76a Fix TSAN issue in TCPStore (#69590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69590

The variable `callbackRegisteredData_` was written to without
synchronization.
ghstack-source-id: 145066862

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D32938979

fbshipit-source-id: bc9a11a70680db45ece95880ae19ce2026e8a88e
2021-12-07 23:29:08 -08:00
0ce49000db Release GIL during RPC shutdown. (#69586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69586

In certain scenarios during shutdown the following assert failed:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/rpc/rpc_agent.cpp#L39.
This was due to _reset_current_rpc_agent not releasing the GIL.

Fixed this issue by releasing the GIL.
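
A minimal pybind11 sketch of the fix pattern (illustrative; not the exact function body):

```
#include <pybind11/pybind11.h>

// Drop the GIL before touching RPC-agent state that other threads may
// assert on during shutdown; holding it here caused the failing assert.
void reset_current_rpc_agent_example() {
  pybind11::gil_scoped_release no_gil;
  // ... e.g. RpcAgent::setCurrentRpcAgent(nullptr) runs without the GIL ...
}
```
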
ghstack-source-id: 145062265

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D32937687

fbshipit-source-id: 980adbcc1e3799b40206f7bca6e7695ca67f0fc2
2021-12-07 23:24:57 -08:00
c236247826 OpInfo tests for (svd|pca)_lowrank (#69107)
Summary:
As per title.

While working on this I discovered several issues with these methods related to gradient instabilities. I will file them and link them here later. It was quite painful to get these to pass all the tests given the discovered issues; sorry for the delay, mruberry!

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69107

Reviewed By: zou3519

Differential Revision: D32920341

Pulled By: mruberry

fbshipit-source-id: 15b33e2b46acdcbff8a37d8e43e381eb55d1a296
2021-12-07 19:50:12 -08:00
e06af79136 Fix sign op converter (#69580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69580

Fix bug in sign converter

Reviewed By: 842974287

Differential Revision: D32934661

fbshipit-source-id: f21d7c65b07ab2f0a0027939d660e56dacd9cdef
2021-12-07 19:04:51 -08:00
6b950eea27 Remove finput and fgrad_input from slow3d transpose signatures (#68899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68899

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD

Differential Revision: D32655872

Pulled By: jbschlosser

fbshipit-source-id: 963b391a489c639f98d9f634d4f4c668353c799a
2021-12-07 18:24:40 -08:00
05946051f8 [quant][graphmode] initial support for fusion pattern in backend_config_dict (#69335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69335

This PR added support for configuring fusion with:
"pattern", "fuser_method"

This only works for a simple sequence of two-op patterns currently; we will extend this in future PRs.
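
A hypothetical sketch of what such a fusion entry could look like; only the "pattern" and "fuser_method" keys come from this summary, the surrounding structure is an assumption:

```
import torch.nn as nn
import torch.nn.intrinsic as nni

def fuse_linear_relu(linear, relu):
    # "fuser_method": how the matched modules are combined into one module
    return nni.LinearReLU(linear, relu)

backend_config_dict = {
    "configs": [
        {
            "pattern": (nn.ReLU, nn.Linear),  # FX quant matches patterns in reverse
            "fuser_method": fuse_linear_relu,
        },
    ],
}
```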

Test Plan:
regression test on linear-relu fusion:
```
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRTOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32816164

fbshipit-source-id: f300b7b96b36908cb94a50a8a17e0e15032509eb
2021-12-07 16:54:42 -08:00
2d38d37f5f use irange for loops (#69533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69533

Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;x++)
```
to the format
```
for(const auto var: irange(xmax))
```

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
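
For instance, a concrete sketch of the transformation using `c10::irange`:

```
#include <c10/util/irange.h>

#include <cstdint>

void example(int64_t n) {
  // before: for (int64_t i = 0; i < n; i++) { ... }
  for (const auto i : c10::irange(n)) {
    (void)i;  // i is const and deduces its type from n
  }
}
```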

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32837942

fbshipit-source-id: 8663037a38ade8f81bd5e983a614d197ea11f0d1
2021-12-07 16:53:27 -08:00
8a975c0106 [LT] Sync with the lazy_tensor_staging branch (#69527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69527

- Add missing TORCH_API in class/struct declarations;
- Fix internal op declarations in ltc_ops;
- Update lazy_ts_lowering.py

Test Plan: Imported from OSS

Reviewed By: alanwaketan

Differential Revision: D32918929

Pulled By: desertfire

fbshipit-source-id: e956d51aff5ef593fdf4cd5ad2a38e38788913d8
2021-12-07 16:47:35 -08:00
049debd97d [Reland][Autograd/Checkpoint] Checkpoint implementation without reentrant autograd (#69508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69508

Original Phabricator Diff: D32704467 (e032dae329)

Reland; the fix is to not test traditional checkpointing when the input does not require grad, as that is unsupported (as documented).

Original PR body:

Resubmission of https://github.com/pytorch/pytorch/pull/62964 with the
suggestions and tests discussed in
https://github.com/pytorch/pytorch/issues/65537.

Adds a `use_reentrant=False` flag to the `checkpoint` function. When
`use_reentrant=False` is specified, a checkpointing implementation that uses
SavedVariableHooks instead of re-entrant autograd is used. This makes it more
composable with things such as `autograd.grad` as well as DDP (we still need to
add thorough distributed testing).
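
A minimal usage sketch of the new flag (shapes and the checkpointed function are arbitrary):

```
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x).matmul(x.t())

x = torch.randn(8, 8, requires_grad=True)
# Non-reentrant checkpointing: recomputation is driven by saved-variable
# hooks, so it composes with autograd.grad.
y = checkpoint(block, x, use_reentrant=False)
(grad,) = torch.autograd.grad(y.sum(), x)
```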

As discussed in https://github.com/pytorch/pytorch/issues/65537, the tests that we need to add are:

- [x] Gradient hooks are called once
- [x] works when input does require grads but Tensor that require grads are captures (like first layer in a nn)
- [x] works for functions with arbitrary input/output objects
- [x] distributed tests (next PR)

Note that this is only for `torch.utils.checkpoint`, if this approach overall looks good, we will do something similar for `checkpoint_sequential`.
ghstack-source-id: 144948501

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32902634

fbshipit-source-id: 2ee87006e5045e5471ff80c36a07fbecc2bea3fe
2021-12-07 16:31:23 -08:00
3456c2cbc8 Allow build_android.sh to forward Vulkan args (#69332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69332

 ---

## Context

The `build_android.sh` script currently does not forward Vulkan configuration options, which makes it impossible to control them when running `build_pytorch_android.sh`.

## Changes

Slightly change the script to allow Vulkan configuration options to propagate from `build_pytorch_android.sh` to `build_android.sh`

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32840908

Pulled By: SS-JIA

fbshipit-source-id: e55d89c93c996b92b743cf047f5a285bb516bbc4
2021-12-07 16:24:35 -08:00
fa39754e11 [vulkan] Disable shader optimization to avoid Validation Errors (#69331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69331

 ---

## Context

When the optimization flag was turned on, some SPIR-V modules produced from the Vulkan compute shaders were invalid. The Vulkan validation layer raises the following error for these modules:

```
[ UNASSIGNED-CoreValidation-Shader-InconsistentSpirv ] Object: VK_NULL_HANDLE (Type = 0) | SPIR-V module not valid: Header block 52[%52] is contained in the loop construct headed by 44[%44], but it's merge block 47[%47] is not
%52 = OpLabel
```

With the optimization flag turned off, the SPIR-V modules produced no longer report these errors in the Validation layer.

## Changes

Turns off optimization when generating SPIR-V modules to ensure correctness of the modules.

**Note that disabling SPIR-V optimization did not regress inference latency for the several models I tested**.

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32840910

Pulled By: SS-JIA

fbshipit-source-id: 7ccb5691fd0e2d11b9c8c28ad7b83906e8163699
2021-12-07 16:24:32 -08:00
bede33e3f5 [vulkan] Add image format qualifier to glsl files (#69330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69330

 ---

## Context

Previously, our shader files did not declare any [image format qualifiers](https://www.khronos.org/opengl/wiki/Layout_Qualifier_(GLSL)#Image_formats) for image layouts. This causes the SPIR-V modules produced to declare the [StorageImageWriteWithoutFormat](https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#_a_id_capability_a_capability) capability, which requires `shaderStorageImageWriteWithoutFormat` to be enabled in [VkPhysicalDeviceFeatures](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPhysicalDeviceFeatures.html). `shaderStorageImageWriteWithoutFormat` is not available on some devices, causing errors to be reported by the Vulkan validation layer.

## Changes

Vulkan shaders now declare the image format explicitly so that the SPIR-V modules produced are compatible with devices that do not have `shaderStorageImageWriteWithoutFormat` enabled.

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32840909

Pulled By: SS-JIA

fbshipit-source-id: 76e0a0da68b423ebc74ae7e839b9cfaf57d2cd39
2021-12-07 16:23:09 -08:00
e5a1ee0e5a [quant][graphmode] Refactor fusion to use the new Pattern format (#68770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68770

The previous fusion only works for a sequence of ops, which is not general enough for fusion patterns
defined by a subgraph; this PR refactors it to make it more general

Test Plan:
```
python test/test_quantization.py TestFuseFx
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32602637

fbshipit-source-id: a7897c62081b9d71c67fb56e78484cf68deaacf6
2021-12-07 16:12:40 -08:00
1433160a36 use irange for loops 6 (#66742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66742

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705366

fbshipit-source-id: be58222426c192406a7f93c21582c3f6f2082401
2021-12-07 16:07:50 -08:00
9a7732e852 CMake: Support dynamic codegen outputs (#68246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68246

Currently the codegen produces a list of output files at CMake
configuration time and the build system has no way of knowing if the
outputs change. So if that happens, you basically need to delete the
build folder and re-run from scratch.

Instead, this generates the output list every time the code generation
is run and changes the output to be a `.cmake` file that gets included
in the main cmake configuration step. That means the build system
knows to re-run cmake automatically if a new output is added. So, for
example you could change the number of shards that `Operators.cpp` is
split into and it all just works transparently to the user.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32596268

Pulled By: albanD

fbshipit-source-id: 15e0896aeaead90aed64b9c8fda70cf28fef13a2
2021-12-07 15:58:06 -08:00
cd9da3267c Rationalize API exports in torch_python (#68095)
Summary:
This renames `WindowsTorchApiMacro.h` to `Export.h` to mirror the c10 header `c10/macros/Export.h` and also updates it to use `C10_EXPORT`/`C10_IMPORT`. This also removes the `THP_API` macro from `THP_export.h` which appears to serve the same purpose.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68095

Reviewed By: jbschlosser

Differential Revision: D32810881

Pulled By: albanD

fbshipit-source-id: d6949ccd0d80d6c3e5ec1264207611fcfe2503e3
2021-12-07 15:24:37 -08:00
829b49b867 Output UnionType str rep with () instead of [] (#69502)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69502

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32902781

Pulled By: tugsbayasgalan

fbshipit-source-id: 67a73b209575437477cdbd3eb8f685019709e99c
2021-12-07 14:17:06 -08:00
a8232ee1bc Sparse CSR CUDA: Add block torch.addmv when mat is sparse (#68708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68708

This PR adds block CSR matrix times dense vector multiplication.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32647694

Pulled By: cpuhrsch

fbshipit-source-id: a1c120691c4350284b156fe4259eda684b734b66
2021-12-07 14:02:59 -08:00
6df7b75186 skip ORT tensor in TensorIterator because it doesn't have storage (#68705)
Summary:
ORT tensors are similar to XLA tensors in that they don't have storage, so extend the condition to cover ORT tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68705

Reviewed By: zou3519

Differential Revision: D32921378

Pulled By: albanD

fbshipit-source-id: 3bda9bba2ddd95cb561a4d1cff463de652256708
2021-12-07 13:33:54 -08:00
008469c5e2 [SR] Simplify memory re-use algorithm (#68302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302

Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.

Test Plan:
## **Re-use metrics**

`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```

**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```

Reviewed By: hlu1

Differential Revision: D32370424

fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
2021-12-07 13:25:42 -08:00
c309637923 Making cuda 11.5 workflows periodic (#69323)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69323

Reviewed By: gchanan, malfet

Differential Revision: D32812346

Pulled By: atalman

fbshipit-source-id: 081f40802997cfb986742f1621eee4b4565660f0
2021-12-07 13:14:07 -08:00
baac51ff4a Add conda-forge dependency for cuda-11.5 (#69541)
Summary:
[NVIDIA's cudatoolkit=11.5](https://anaconda.org/nvidia/cudatoolkit/files?version=11.5.0) at the time of writing depends on libstdcxx-ng >=9.4.0, but the latest available from the official anaconda channel is [9.3.0](https://anaconda.org/anaconda/libstdcxx-ng/files?version=9.3.0), so add `-c conda-forge` as an extra dependency to resolve the problem

Should resolve problems such as https://app.circleci.com/pipelines/github/pytorch/pytorch/420750/workflows/19d6e3ce-a305-49c6-bac8-11ed43ed2b1e/jobs/16829102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69541

Reviewed By: atalman

Differential Revision: D32921300

Pulled By: malfet

fbshipit-source-id: 09dd3575f968679f545aec739a2791dde85d37c1
2021-12-07 12:58:41 -08:00
358e908162 Add Union type to TorchScript Language Ref (#69514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69514

Reviewed By: tugsbayasgalan

Differential Revision: D32909371

Pulled By: gmagogsfm

fbshipit-source-id: af1c3040cd59ee913dc576cf8a8c759313f1e07f
2021-12-07 12:53:54 -08:00
c21169ea41 [JIT] optimize_for_inference on methods other than forward (#69367)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69367

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D32835529

Pulled By: davidberard98

fbshipit-source-id: d3066c23d071bc2a3bee59b8ab03b6ab0e43efcf
2021-12-07 12:36:47 -08:00
60ca6776e2 [JIT] run frozen optimizations on methods other than forward (#68668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68668

This updates run_frozen_optimizations so that it can run on methods other than forward (see the sketch below)
ghstack-source-id: 143871758
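
A minimal sketch combining this with the `optimize_for_inference` change above, assuming the `other_methods` argument these PRs add:

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 1)
        self.bn = torch.nn.BatchNorm2d(3)

    @torch.jit.export
    def encode(self, x):
        return self.bn(self.conv(x))

    def forward(self, x):
        return self.encode(x)

scripted = torch.jit.script(M().eval())
# other_methods names the extra methods to freeze/optimize besides forward
frozen = torch.jit.optimize_for_inference(scripted, other_methods=["encode"])
```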

Test Plan:
Added test in test_freezing.py
```
python3 test/test_jit.py -- test_conv_bn_folding_not_forward
```

Reviewed By: eellison

Differential Revision: D32567857

fbshipit-source-id: 75e56efad576404dc8d6897861d249573f5ccd7a
2021-12-07 12:35:30 -08:00
63470f9449 Sparse CSR: Implement unary ufuncs (with 0->0 correspondence) (#69292)
Summary:
This PR attempts to add support for unary ufuncs (with 0->0 correspondence) for Sparse CSR Layout.

Ops supported: `['abs', 'asin', 'asinh', 'atan', 'atanh', 'ceil', 'conj_physical', 'floor', 'log1p', 'neg', 'round', 'sin', 'sinh', 'sign', 'sgn', 'signbit', 'tan', 'tanh', 'trunc', 'expm1', 'sqrt', 'angle', 'isinf', 'isposinf', 'isneginf', 'isnan', 'erf', 'erfinv']`
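
A minimal sketch of what this enables (matrix values are arbitrary):

```
import torch

crow_indices = torch.tensor([0, 2, 4])
col_indices = torch.tensor([0, 1, 0, 1])
values = torch.tensor([1.0, -2.0, 3.0, -4.0])
csr = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))

# A 0 -> 0 unary ufunc acts on the values only, so the sparsity pattern
# (and hence the CSR structure) is preserved.
out = csr.abs()
```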

cc nikitaved pearu cpuhrsch IvanYashchuk peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69292

Reviewed By: pbelevich

Differential Revision: D32805514

Pulled By: cpuhrsch

fbshipit-source-id: 9ae20817e77a36d3aa6c5afa532b9dc3b8cf1dd3
2021-12-07 12:07:41 -08:00
1a202b0c39 Docs: Fix broken code syntax in autograd.rst (#69362)
Summary:
The backticks around `nn.Parameters` were not rendered correctly because the word was enclosed in an italics block.
Spotted the issue on https://pytorch.org/docs/stable/notes/autograd.html#locally-disable-grad-doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69362

Reviewed By: zou3519

Differential Revision: D32924093

Pulled By: albanD

fbshipit-source-id: 5a310ac3f3d13a5116f7aa911817b9452eee711d
2021-12-07 12:03:15 -08:00
10229e156b trt engine inspector demo (#66683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66683

Starting from TensorRT 8.2, we have this nice engine inspector, which gives you much more detail about each TRT layer.
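
A minimal sketch of pulling this information out of an already-built engine, assuming the TensorRT 8.2 inspector API names (`create_engine_inspector`, `LayerInformationFormat.JSON`):

```
import tensorrt as trt

def dump_engine_info(engine: "trt.ICudaEngine") -> str:
    # returns a JSON description of every layer in the built engine
    inspector = engine.create_engine_inspector()
    return inspector.get_engine_information(trt.LayerInformationFormat.JSON)
```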

Test Plan:
```
buck run  mode/opt -c python.package_style=inplace scripts/yinghai/test:trt_engine_inspector
```
And you will see something like
```
{"Layers": [{
  "Name": "PWN(PWN(relu_1), add_1)",
  "LayerType": "PointWiseV2",
  "Inputs": [
  {
    "Name": "x",
    "Dimensions": [10,2],
    "Format/Datatype": "Row major linear FP16 format"
  }],
  "Outputs": [
  {
    "Name": "(Unnamed Layer* 1) [ElementWise]_output",
    "Dimensions": [10,2],
    "Format/Datatype": "Row major linear FP16 format"
  }],
  "ParameterType": "PointWise",
  "ParameterSubType": "PointWiseExpression",
  "NbInputArgs": 1,
  "InputArgs": ["arg0"],
  "NbOutputVars": 1,
  "OutputVars": ["var1"],
  "NbParams": 0,
  "Params": [],
  "NbLiterals": 4,
  "Literals": ["0.000000e+00f", "1.000000e+00f", "0.000000e+00f", "0.000000e+00f"],
  "NbOperations": 2,
  "Operations": ["const auto var0 = pwgen::iMax(arg0, literal0);", "const auto var1 = pwgen::iPlus(arg0, var0);"],
  "TacticValue": "0x0"
},{
  "Name": "matmul_1",
  "LayerType": "MatrixMultiply",
  "Inputs": [
  {
    "Name": "(Unnamed Layer* 1) [ElementWise]_output",
    "Dimensions": [10,2],
    "Format/Datatype": "Row major linear FP16 format"
  },
  {
    "Name": "y",
    "Dimensions": [10,2],
    "Format/Datatype": "Row major linear FP16 format"
  }],
  "Outputs": [
  {
    "Name": "output0",
    "Dimensions": [10],
    "Format/Datatype": "Row major linear FP16 format"
  }],
  "ParameterType": "MatrixMultiply",
  "MatrixOpA": "VECTOR",
  "MatrixOpB": "VECTOR",
  "Alpha": 1,
  "Beta": 0,
  "TacticValue": "0x1"
}],
"Bindings": ["x"
,"y"
,"output0"
]}
```

Reviewed By: RoshanPAN, wushirong

Differential Revision: D31681405

fbshipit-source-id: 31f912c37812ac17c6421073e0c35e512463ba6e
2021-12-07 11:50:09 -08:00
aa9fbb9ae9 [JIT] check stack size after calling operator (#68788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68788

In debug mode, this should throw errors for ops that return the wrong number of outputs (i.e. the number of values left on the stack differs from the number shown in the schema)

Test Plan:
Run this in debug mode and verify that it doesn't throw an assert
```
import torch

class Thing(torch.nn.Module):
    @torch.jit.export
    def en(self, x: torch.Tensor):
        return torch.add(x, 2.0)

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        a = torch.mm(x, y)
        b = torch.nn.functional.gelu(a)
        c = self.en(b)
        return c.std_mean()

if __name__ == '__main__':
    unsc = Thing()
    thing = torch.jit.script(unsc)
    x = torch.randn(4, 4)
    y = torch.randn(4, 4)
    std, mean = thing.forward(x, y)
    print(std, mean)
    print(str(thing.forward.graph))
```

Reviewed By: gchanan

Differential Revision: D32625256

Pulled By: davidberard98

fbshipit-source-id: 61d5ec0c5a9f8b43706257119f4f524bb9dbe6f5
2021-12-07 11:43:50 -08:00
bd8d4195a6 [DataPipe] Small change to generation script and update to DataPipe .pyi file (#69392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69392

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849463

Pulled By: NivekT

fbshipit-source-id: b6d419fbe0e4cc9d718f21fb3fe886f721f618d3
2021-12-07 11:40:53 -08:00
fdfdafd1e6 [DataPipe] Removing usage of unbatch_level from .batch interface and DataFrame (#69393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69393

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32849461

Pulled By: NivekT

fbshipit-source-id: 16abbe289ad2092faaa029fd78f3d6924e7b2ff4
2021-12-07 11:40:50 -08:00
357160e68e [DataPipe] Unifying API - removing nesting_level argument from FilterIterDataPipe (#69391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849462

Pulled By: NivekT

fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
2021-12-07 11:40:46 -08:00
4478b14e4c [DataPipe] Unifying API - removing nesting_level argument from MapperIterDataPipe (#69390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.
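
A minimal sketch of the resulting API shape (the pipeline and lambdas are arbitrary):

```
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))
# With nesting_level gone, map/filter apply to stream elements directly;
# flatten any nested structure explicitly before mapping if needed.
dp = dp.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
print(list(dp))  # [0, 6, 12, 18]
```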

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849465

Pulled By: NivekT

fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
2021-12-07 11:39:08 -08:00
9cb52327a8 [quant][refactor] Move pattern type definition to ao/quantization/utils.py (#68769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68769

As the title says: we want to use this type in fuser_method_mapping in later PRs

Test Plan:
no change to logic, just regression test on ci
```
python test/test_quantization.py
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32602636

fbshipit-source-id: 15b95241431dfca9b1088d0920bf75705b37aa9a
2021-12-07 11:00:22 -08:00
976b076715 [iOS] Add LibTorch nightly build (#69341)
Summary:
Add a LibTorch nightly build for use in LibTorchvision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69341

Test Plan:
CI jobs: https://fburl.com/lbyjzpxz
1. Validate lib is uploaded to link https://ossci-ios-build.s3.amazonaws.com/libtorch_ios_nightly_build.zip
2. Download lib from the link and validate `version.txt` is correct
3. Test the lib in HelloWorld demo
Imported from OSS

Reviewed By: xta0

Differential Revision: D32901836

Pulled By: hanton

fbshipit-source-id: 8622c3e6052cec2039bc15dea0d495ec1a8186cb
2021-12-07 10:07:28 -08:00
3edf1b6cee [PyTorch] Avoid no-op shared_ptr dtor when constructing tuple (#69337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69337

See note in code.
ghstack-source-id: 144657751

Test Plan:
Ran PyTorchFeatureConversionBenchmark 5x before/after:

```
swolchok@devbig032 ~/f/fbcode> for x in (seq 5); sudo scripts/bertrand/noise/denoise.sh /tmp/pytorch_feature_conversion_benchmark.Dec2CacheTupleTypes ; end                                                                                                                                                                                              (pytorch-ort-bert)
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.39us  418.75K
PyTorchFeatureConversionIdListBenchmark                      3.59us  278.91K
PyTorchFeatureConversionIdScoreListBenchmark                 5.01us  199.51K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.42us  413.80K
PyTorchFeatureConversionIdListBenchmark                      3.56us  280.60K
PyTorchFeatureConversionIdScoreListBenchmark                 5.05us  198.15K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.41us  414.25K
PyTorchFeatureConversionIdListBenchmark                      3.55us  281.59K
PyTorchFeatureConversionIdScoreListBenchmark                 5.02us  199.09K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.39us  417.68K
PyTorchFeatureConversionIdListBenchmark                      3.55us  281.65K
PyTorchFeatureConversionIdScoreListBenchmark                 5.05us  198.06K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.39us  417.54K
PyTorchFeatureConversionIdListBenchmark                      3.56us  281.03K
PyTorchFeatureConversionIdScoreListBenchmark                 5.05us  198.13K
============================================================================
swolchok@devbig032 ~/f/fbcode> for x in (seq 5); sudo scripts/bertrand/noise/denoise.sh /tmp/pytorch_feature_conversion_benchmark.Dec2TupleConstruction ; end                                                                                                                                                                                            (pytorch-ort-bert)
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.38us  420.38K
PyTorchFeatureConversionIdListBenchmark                      3.53us  282.90K
PyTorchFeatureConversionIdScoreListBenchmark                 4.99us  200.41K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.37us  421.54K
PyTorchFeatureConversionIdListBenchmark                      3.54us  282.27K
PyTorchFeatureConversionIdScoreListBenchmark                 4.99us  200.28K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.38us  420.99K
PyTorchFeatureConversionIdListBenchmark                      3.56us  280.56K
PyTorchFeatureConversionIdScoreListBenchmark                 5.08us  196.91K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.37us  421.48K
PyTorchFeatureConversionIdListBenchmark                      3.54us  282.87K
PyTorchFeatureConversionIdScoreListBenchmark                 5.00us  199.88K
============================================================================
============================================================================
sigrid/lib/features/tests/PyTorchFeatureConversionBenchmark.cpprelative  time/iter  iters/s
============================================================================
PyTorchFeatureConversionDenseBenchmark                       2.38us  419.69K
PyTorchFeatureConversionIdListBenchmark                      3.56us  280.68K
PyTorchFeatureConversionIdScoreListBenchmark                 4.97us  201.23K
============================================================================
```

Looks like maybe around 1% faster?

Reviewed By: hlu1

Differential Revision: D32817592

fbshipit-source-id: 4b015dc993b26a92e45a3673e14fde32105a34fa
2021-12-07 09:39:15 -08:00
617a3bd944 GHA: Re enable mac json uploads (#69387)
Summary:
Removed JSON uploading to S3 for Mac GHA workflows as the AWS credentials were not working.

This PR tries uploading them to GitHub instead, which works https://github.com/pytorch/pytorch/runs/4413940318?check_suite_focus=true

They should show up on the HUD page: hud.pytorch.org/pr/69387 with the name test-jsons after the CI is completed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69387

Reviewed By: seemethere

Differential Revision: D32885204

Pulled By: janeyx99

fbshipit-source-id: 3d25ead6d464144a228fdf8ead5172de3ed8430e
2021-12-07 08:25:51 -08:00
945d2e380c [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32910817

fbshipit-source-id: 60d0cb10412e1a37a0249bb223b75855c5596dbd
2021-12-07 08:11:09 -08:00
4670f0f2c5 Set non-default backend names to lower case (#69400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400

Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.

Test Plan: Check the signals. There is no specific test for this, but all tests should pass.

Reviewed By: mrshenli

Differential Revision: D32836529

fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
2021-12-07 07:58:46 -08:00
2dd46d3aa9 FX: ensure node stack trace survives copying (#69368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69368

Before this PR, copying a node would lose the stack trace. This PR
ensures that the stack trace is preserved across copies.

This is useful because quantization passes would like to start
allowing the user to preserve stack traces, and we use the copy
behavior.
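
A minimal sketch of the behavior this enables; the stack-trace string is an illustrative value, not real trace output:

```
import torch
import torch.fx

def f(x):
    return x.relu() + 1

gm = torch.fx.symbolic_trace(f)
node = next(n for n in gm.graph.nodes if n.op == "call_method")
node.stack_trace = 'File "example.py", line 2, in f'  # illustrative

# graph_copy-based copies now carry node.stack_trace along
new_graph = torch.fx.Graph()
new_graph.graph_copy(gm.graph, val_map={})
copied = next(n for n in new_graph.nodes if n.op == "call_method")
assert copied.stack_trace == node.stack_trace
```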

Test Plan:
```
python test/test_fx.py TestFX.test_stack_traces
```

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D32835248

fbshipit-source-id: 91610fd8d05f5683cfa5e11fb6f9f3feacb8e241
2021-12-07 06:18:38 -08:00
ca945d989a [quant][graphmode][fx] Add default_replay_qconfig for ops like reshape (#69249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69249

This PR added default_replay_qconfig and default_replay_observer, which are used
when we want to configure an operator to reuse the observer from its input: if the input
Tensor for the operator is not observed, we will not observe the output of this operator either;
if the input Tensor is observed, we will observe the output of the operator with the same observer.

e.g.

```
x1 = x0.reshape()
```
if reshape is configured with default_replay_qconfig:
1. if x0 is observed with observer_0, we'll observe x1 with the same observer instance
2. if x0 is not observed, we won't observe x1 either

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_replay_qconfig
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32774723

fbshipit-source-id: 26862b2bc181d0433e2243daeb3b8f7ec3dd33b2
2021-12-06 22:56:14 -08:00
8b1e49635a [JIT] Separate GPU implementation of frozen_conv_add_relu_fusion.cpp (#68149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68149

JIT optimization passes are part of the CPU-only build (i.e. necessary GPU flags are not passed in). This separates the implementation of frozen_conv_add_relu_fusion so that the GPU-enabled implementation is registered at runtime (if it is available)

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32773666

Pulled By: davidberard98

fbshipit-source-id: c83dbb88804bdef23dc60a6299acbfa76d5c1495
2021-12-06 21:06:25 -08:00
e55b939732 Enable build-split for all CUDA-11.x version (#69494)
Summary:
Should fix cu115 wheel binary builds, see https://hud.pytorch.org/ci/pytorch/pytorch/nightly?name_filter=cu115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69494

Reviewed By: atalman

Differential Revision: D32899994

Pulled By: malfet

fbshipit-source-id: bb0e05a30c9360c75d2cfd9d4e0d40ed9a3b2830
2021-12-06 20:39:06 -08:00
bd8a4a9372 [wip][quant][graphmode] produce reference pattern for binary ops and then rewrite to quantized op (#68229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68229

This PR makes BinaryOpQuantizeHandler always produce reference patterns, and we rely on the
subgraph_rewriter to rewrite the reference quantized patterns to quantized ops

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32537714

fbshipit-source-id: 456086b308c4446840d8d37997daa6f8f8068479
2021-12-06 20:20:15 -08:00
bcd0303834 [fx2trt][easy] add sparse flag to TRTInterpreter (#69495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69495

As the title. Separated from D30589161.

Test Plan: Tested in D30589161.

Reviewed By: maratsubkhankulov, wushirong

Differential Revision: D32898927

fbshipit-source-id: 89e18d2eb19b43fbab92b4988d0a21d21cff2d1f
2021-12-06 18:57:08 -08:00
3211588308 [fx2trt] Separate sign from trunc_div and use it for acc_ops.sign (#69486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69486

As the title. Migrate from sign plugin to native trt layers. All the layers are fused into one single PWN kernel in TRT.

```
[TensorRT] VERBOSE: Engine Layer Information:
Layer(PointWiseV2): PWN(sign_1_sign_rhs + sign_1_sign_rhs_broadcast, PWN(PWN(sign_1_floor_div*2_rhs + sign_1_floor_div*2_rhs_broadcast, PWN(PWN(PWN([UNARY]-[acc_ops.sign]-[sign_1_prod_abs], [UNARY]-[acc_ops.sign]-[sign_1_prod_abs_exp]), PWN([UNARY]-[acc_ops.sign]-[sign_1_prod_exp], [ELEMENTWISE]-[acc_ops.sign]-[sign_1_exp_floor_div])), [ELEMENTWISE]-[acc_ops.sign]-[sign_1_floor_div*2])), [ELEMENTWISE]-[acc_ops.sign]-[sign_1_sign])), Tactic: 0, x[Float(2,2,3)] -> output0[Float(2,2,3)]
```

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D32887537

fbshipit-source-id: ac250b5197e340319de29653a27f879a0e1ea9cd
2021-12-06 16:54:44 -08:00
e23827e6d6 [fx2trt] [Prep for release] Add type hints to converters and separate main files (#69458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69458

1. Added type hints to acc ops converters.
2. Moved some of the classes/logic in fx2trt.py into separate files (input_tensor_spec.py, trt_module.py, converter_registry.py).
3. Added imports in `__init__.py` so that users can just call `from torch.fx.experimental.fx2trt import xxx` instead of `experimental.fx2trt.fx2trt`.

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D32884637

fbshipit-source-id: e3e1e597edb9a08b47b4595bd371f570f2f3c9b6
2021-12-06 16:54:41 -08:00
a2d1cadfdb [fx2trt] Add a helper function to generate specs for dynamic batch size (#69405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69405

Add a helper function that will generate input tensor specs with dynamic batch size.

Note that this function currently requires the batch dimension of all these tensors to be the first dimension.

Also add more doc strings.
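
A minimal sketch (assumed names, not the fx2trt API itself) of what such a helper does:
```
from typing import List, NamedTuple, Tuple

import torch

class InputTensorSpec(NamedTuple):
    shape: Tuple[int, ...]  # -1 marks the dynamic batch dimension
    dtype: torch.dtype

def specs_with_dynamic_batch(tensors: List[torch.Tensor]) -> List[InputTensorSpec]:
    # Constraint noted above: batch must be dim 0 of every tensor.
    return [InputTensorSpec((-1, *t.shape[1:]), t.dtype) for t in tensors]

specs = specs_with_dynamic_batch([torch.randn(8, 3, 224, 224)])
print(specs)  # [InputTensorSpec(shape=(-1, 3, 224, 224), dtype=torch.float32)]
```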

Test Plan:
Added unit tests.
```
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7881299413036896
    ✓ ListingSuccess: caffe2/test/fx2trt/core:test_input_tensor_spec - main (7.455)
    ✓ Pass: caffe2/test/fx2trt/core:test_input_tensor_spec - test_from_tensor (caffe2.test.fx2trt.core.test_input_tensor_spec.TestTRTModule) (7.047)
    ✓ Pass: caffe2/test/fx2trt/core:test_input_tensor_spec - test_from_tensors_with_dynamic_batch_size (caffe2.test.fx2trt.core.test_input_tensor_spec.TestTRTModule) (7.066)
    ✓ Pass: caffe2/test/fx2trt/core:test_input_tensor_spec - test_from_tensors (caffe2.test.fx2trt.core.test_input_tensor_spec.TestTRTModule) (7.181)
Summary
  Pass: 3
  ListingSuccess: 1
```

Wait for CI to verify if this unit test can run without RE.

Reviewed By: yinghai, kflu

Differential Revision: D32853947

fbshipit-source-id: 19713e8ad5478c945385c7013f7a1b9894151fea
2021-12-06 16:54:39 -08:00
cfe3cbb392 [fx2trt] Use weights shape as normalize shape in layer norm (#69401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69401

As the title says. In PyTorch, these two shapes are the same. The normalized shape might be retrieved from tensor.size(), and in explicit-batch-dim mode that won't work right now.
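
A quick check of the claim that the two shapes agree in PyTorch, so the weight shape is a safe source for the normalized shape:
```
import torch

ln = torch.nn.LayerNorm([10, 20])
assert tuple(ln.weight.shape) == (10, 20)  # weight shape == normalized_shape
```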

Test Plan:
```
    ✓ ListingSuccess: caffe2/test/fx2trt/converters:test_layernorm - main (7.018)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_with_dynamic_shape_0_1d_normalized_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (22.945)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_0_1d_normalized_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (23.203)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_with_dynamic_shape_1_2d_normalized_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (42.549)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_1_2d_normalized_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (43.544)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_with_dynamic_shape_2_4d_input_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (45.958)
    ✓ Pass: caffe2/test/fx2trt/converters:test_layernorm - test_layer_norm_2_4d_input_shape (caffe2.test.fx2trt.converters.acc_op.test_layernorm.TestLayerNormConverter) (47.027)
Summary
  Pass: 6
  ListingSuccess: 1
```

Reviewed By: yinghai

Differential Revision: D32853359

fbshipit-source-id: 8a122fe3348a1d9ad07b48647ec6166d171d113a
2021-12-06 16:53:29 -08:00
59e98b66ac Revert D32704467: [Autograd/Checkpoint] Checkpoint implementation without reentrant autograd
Test Plan: revert-hammer

Differential Revision:
D32704467 (e032dae329)

Original commit changeset: 6eea1cce6b93

fbshipit-source-id: 1a788c1fd57cee46bba82e216e6162d078359cc2
2021-12-06 16:33:32 -08:00
bc89528931 Initialize upgrader and operator version files (#68772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68772

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D32603257

Pulled By: tugsbayasgalan

fbshipit-source-id: 5a3d9ba4d0a01ddff4ff6ebdf7bb88ec125765b0
2021-12-06 16:27:52 -08:00
9e678446a2 [Pytorch Edge] Add new_empty_strided to tracer (#69492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69492

We already added empty, and this is another weird variation that sometimes pops up. How to trigger it is unclear, so we are just adding it for now.
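
For reference, a quick illustration of the op now covered by the tracer:
```
import torch

base = torch.ones(2, 3, dtype=torch.float64)
t = base.new_empty_strided((2, 3), (3, 1))  # uninitialized; inherits dtype/device
print(t.dtype, t.stride())                  # torch.float64 (3, 1)
```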

Test Plan: ran tracer

Differential Revision: D32896522

fbshipit-source-id: 38627d8efc48ef240100ccdbd94c0e7208b0b466
2021-12-06 15:28:13 -08:00
65b0f389d2 [PyTorch][Distributed] Use auto-grad enabled collections for the shared linear op to enable backward grad calculation (#68096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68096

We replace all c10d APIs with autograd-enabled collectives in the sharded linear op, so that we can enable backward propagation (grad calculation for the sharded linear).
ghstack-source-id: 144882914

Test Plan: Unit test + CI

Reviewed By: pritamdamania87

Differential Revision: D32177341

fbshipit-source-id: 1919e8ca877bdc79f4cdb0dc2a82ddaf6881b9f1
2021-12-06 15:17:08 -08:00
7c2489bdae [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68786

To enable autograd for the sharded linear, we found we needed to make some changes to the current NN function API (the c10d API with autograd enabled). We made the following changes:

1. Added a new API, `reduce_scatter`, since we need it for rowwise sharding (a usage sketch follows the list).
2. Modified the `all_to_all` API to make it consistent with the one in distributed_c10d.py.
3. Found that the C++ signature of `reduce_scatter` was missing an input param; added more unit tests to cover these cases.
4. Synced the NN tests from gloo to nccl.
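
A minimal usage sketch (assumes an initialized process group; the functional signature is assumed to mirror torch.distributed.reduce_scatter):
```
import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_nn

def rowwise_combine(partials: list) -> torch.Tensor:
    # Each rank holds partial results; reduce_scatter sums them and leaves each
    # rank with its own shard. Unlike dist.reduce_scatter, this variant records
    # the op in the autograd graph, so gradients flow back through it.
    out = torch.empty_like(partials[0])
    return dist_nn.reduce_scatter(out, partials, op=dist.ReduceOp.SUM)
```
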
ghstack-source-id: 144860208

Test Plan: CI + Unit Test

Reviewed By: pritamdamania87

Differential Revision: D32569674

fbshipit-source-id: 9bd613f91bbf7a39eede0af32a5a5db0f2ade43b
2021-12-06 13:38:58 -08:00
e032dae329 [Autograd/Checkpoint] Checkpoint implementation without reentrant autograd (#69027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69027

Resubmission of https://github.com/pytorch/pytorch/pull/62964 with the
suggestions and tests discussed in
https://github.com/pytorch/pytorch/issues/65537.

Adds a `use_reentrant=False` flag to the `checkpoint` function. When
`use_reentrant=False` is specified, a checkpointing implementation that uses
SavedVariableHooks instead of re-entrant autograd is used. This makes it more
composable with things such as `autograd.grad` as well as DDP (we still need to
add thorough distributed testing).
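
A short usage sketch of the new flag; the non-reentrant path composes with `autograd.grad`:
```
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.nn.functional.relu(x @ x.t())

x = torch.randn(4, 4, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)
(grad,) = torch.autograd.grad(out.sum(), x)  # works without .backward()
```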

As discussed in https://github.com/pytorch/pytorch/issues/65537, we have added
the following tests:

- [ ] Gradient hooks are called once
ghstack-source-id: 144644859

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D32704467

fbshipit-source-id: 6eea1cce6b935ef5a0f90b769e395120900e4412
2021-12-06 13:29:37 -08:00
4d81175a07 add VSX dispatch for fft_fill_with_conjugate_symmetry_stub (#68914)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68057.

As discussed in https://github.com/pytorch/pytorch/issues/68057, this adds a change to provide the missing dispatch for VSX.
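
For context, a quick numerical check of the Hermitian symmetry this stub exploits (for real input, X[k] == conj(X[N-k]), so half the spectrum can be filled from the other half):
```
import torch

x = torch.randn(8)
X = torch.fft.fft(x)
assert torch.allclose(X[1:], X[1:].flip(0).conj(), atol=1e-6)
```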

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68914

Reviewed By: seemethere

Differential Revision: D32696773

Pulled By: malfet

fbshipit-source-id: f1b70ab85bf9fb1c0119cc70d6125b8801d95669
2021-12-06 13:04:59 -08:00
f87faf3c29 .github: Volume mount local netrc for docs push (#69472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69472

Neglected the fact that the actual push for these variables is happening
inside of a Docker container; this should help resolve that issue.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32889583

Pulled By: seemethere

fbshipit-source-id: d0ef213787694ab1a7e9fb508c58d2f53ff218c3
2021-12-06 12:11:23 -08:00
1859e5f000 [FSDP] Enforce wrapper_cls as a mandatory kwarg in enable_wrap. (#69358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69358

Enforces wrapper_cls as a mandatory kwarg, raising an error earlier if it is
not provided to the enable_wrap() function. Also improves the documentation.
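
A usage sketch (module paths as in recent PyTorch; assumes an initialized distributed process group) of the now-mandatory kwarg:
```
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import enable_wrap, wrap

with enable_wrap(wrapper_cls=FSDP):  # omitting wrapper_cls now raises early
    layer = wrap(nn.Linear(5, 5))    # wrapped as FSDP(nn.Linear(5, 5))
```
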
ghstack-source-id: 144807950

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32826963

fbshipit-source-id: d1b98df021e86d3d87a626e82facf6230b571a55
2021-12-06 12:11:20 -08:00
00245fed96 [FSDP] Kill config_auto_wrap_policy, remove policy from enable_wrap, (#69357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69357

Since we only want to support the enable_wrap() and wrap() manual wrapping
APIs, without them accepting auto_wrap_policy, remove all this now-unneeded code.
ghstack-source-id: 144807951

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32826318

fbshipit-source-id: 6526e700ebdf132cbb10439698f5c97ce083cd3d
2021-12-06 12:11:17 -08:00
c95277e92a [FSDP] Remove auto_wrap() (#69356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69356

Per title
ghstack-source-id: 144807949

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32816150

fbshipit-source-id: 6b4eacc63edd267bc1eb8a1c1d6c753bc581d63a
2021-12-06 12:11:14 -08:00
f333cde14e [FSDP] Make recursive_wrap, wrap APIs independent of ConfigAutoWrap. (#68776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68776

Makes these APIs independent of ConfigAutoWrap so that they can be
used by the FSDP ctor without it knowing about ConfigAutoWrap.

Also gets us one step closer to killing ConfigAutoWrap.recursive_wrap and
auto_wrap(), as we will only support enable_wrap() and wrap() moving forward.

Will test via unittests and FSDP benchmarks to ensure the wrapping still works.
ghstack-source-id: 144807948

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32604021

fbshipit-source-id: 54defc0cd90b16b5185a8c1294b39f75c06ffd21
2021-12-06 12:09:49 -08:00
456139d0ae FX pass: fuse_sparse_matmul_add (#69340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69340

- An FX pass to fuse the ops resulting from addmm(a, b.t()) (a sketch of such a rewrite follows)
- Used to enable structured sparsity using TRT
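
A hedged sketch (illustrative, not the PR's actual pass) of such a rewrite: fold addmm(bias, a, b.t()) into a single linear node that a TRT lowering can map onto a structured-sparse kernel.
```
import torch
import torch.fx

def fuse_sparse_matmul_add(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is torch.addmm:
            bias, mat1, mat2 = node.args
            # Only match when the second operand is an explicit transpose.
            if isinstance(mat2, torch.fx.Node) and mat2.op == "call_method" \
                    and mat2.target == "t":
                with gm.graph.inserting_after(node):
                    # addmm(bias, a, b.t()) == F.linear(a, b, bias)
                    fused = gm.graph.call_function(
                        torch.nn.functional.linear,
                        (mat1, mat2.args[0], bias))
                node.replace_all_uses_with(fused)
                gm.graph.erase_node(node)
    gm.graph.eliminate_dead_code()
    gm.recompile()
    return gm
```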

Reviewed By: 842974287

Differential Revision: D32456684

fbshipit-source-id: 601826af216cea314ee85ed522d5c54a5151d720
2021-12-06 12:07:02 -08:00
68b5c86e65 [Vulkan] Implement slice operator (#69382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69382

Implemented `slice` operator on the Vulkan backend:
* Supports only <= 4D tensors.
* `aten::slice.Tensor` is executed internally when indexing a Tensor.
* Slicing means selecting elements from the tensor using the `:` slice operator; we select elements by their indices.
* Indexing starts at 0, and `end` is exclusive. In the example below, we get the elements from the very start up to end index 4 (exclusive) of the tensor.
```
tensor = torch.tensor([2, 4, 1, 7, 0, 9])
print(tensor[ : 4])
# Outputs- tensor([2, 4, 1, 7])
```
* Generalized input tensors to 4D ones to simplify input/output texture handling. For example, {2, 3} is treated as {1,1,2,3} internally.
* Negative `start` and `end` inputs are allowed.
* CPU implementation: [/aten/src/ATen/native/TensorShape.cpp::slice()](3e45739543/aten/src/ATen/native/TensorShape.cpp (L1262))
* For **width** dimension, use `vkCmdCopyImage` API,
  * input texture size = `{x,y,z}`
  * if `step` is 1, copy a region from the input texture to the output texture once where
    * source offset = `{start,0,0}`
    * destination offset = `{0,0,0}`
    * copy extents = `{end-start,y,z}`
    * call `vkCmdCopyImage` API
  * if `step` is not 1, do for-loop from x=`start` to `end-1` by `step` (also from x_new=`0` to `end-start-1`) where
    * x_max = x
    * copy extents = `{1,y,z}`
    * if (x >= x_max) continue; // out of range
    * source offset = `{x,0,0}`
    * destination offset = `{x_new,0,0}`
    * call `vkCmdCopyImage` API
* For **height** dimension, use `vkCmdCopyImage` API,
  * input texture size = `{x,y,z}`
  * if `step` is 1, copy a region from the input texture to the output texture once where
    * source offset = `{0,start,0}`
    * destination offset = `{0,0,0}`
    * copy extents = `{x,end-start,z}`
    * call `vkCmdCopyImage` API
  * if `step` is not 1, do for-loop from y=`start` to `end-1` by `step` (also from y_new=`0` to `end-start-1`) where
    * y_max = y
    * copy extents = `{x,1,z}`
    * if (y >= y_max) continue; // out of range
    * source offset = `{0,y,0}`
    * destination offset = `{0,y_new,0}`
    * call `vkCmdCopyImage` API
* For **batch** and **feature**(channel) dimensions, we build up shader operations from the output texture point of view to avoid the nondeterministic order of GPU shader operations between texels. See [incoherent memory access](https://www.khronos.org/opengl/wiki/Memory_Model#Incoherent_memory_access)
  * `b,c,h,w` = input tensor dims (NCHW)
  * `b1,c1,h1,w1` = output tensor dims (NCHW)
  * `posIn` = position (x,y,z) for input texture
  * `posOut` = position (x,y,z) for output texture
  * `inval` = input texel value
  * `outval` = output texel value
  * `max_dst_index` = batch size of output tensor * channel size of output tensor
  * `n` = end - start
  * `i` = index of input texel (0...3) and `j` = index of output texel (0..3)
  * Pseudo code:
```
for (uint j = 0; j < 4; ++j) {
  dst_index = posOut.z * 4 + j;
  if (dst_index >= max_dst_index) {
    save outval to output texture at posOut
    break; // out of range
  }

  b1 = int(dst_index / channel size of output tensor);
  c1 = dst_index % channel size of output tensor;
  h1 = posOut.y;
  w1 = posOut.x;

  b=b1
  c=c1
  h=h1
  w=w1

  if (dim==0) { // batch
    b=start+step*b1;
  } else { // feature(channel)
    c=start+step*c1
  }

  src_index = b * channel size of input tensor + c;
  posIn.x = int(w);
  posIn.y = int(h);
  posIn.z = int(src_index / 4);
  i = (src_index % 4);
  read inval from input texture at posIn
  outval[j] = inval[i]
  if (j == 3) {
    save outval to output texture at posOut
  }
}
```
* Error/edge cases:
  * Vulkan backend doesn't support zero-sized slice. It throws an exception when allocating a Vulkan buffer if any dim size is zero.
  * The slice step should be positive.
* Generalized test cases with different dim size tensors for batch, feature, height and width. For example, a 4D tensor slicing by dim=width:
```
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=30, step=1 <-> tensor indexing by [:,:,:,10:30:1]
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=30, step=7 <-> tensor indexing by [:,:,:,10:30:7]
tensor {2, 3, 40, 50} slicing with dim=3, start=10, end=50, step=2 <-> tensor indexing by [:,:,:,10:50:2] with end=out of range
tensor {2, 3, 40, 50} slicing with dim=3, start=-60, end=60, step=2 <-> tensor indexing by [:,:,:,-60:60:2] with start/end=out of range
tensor {2, 3, 40, 50} slicing with dim=3, start=-30, end=-10, step=2 <-> tensor indexing by [:,:,:,-30:-10:2] with negative start/end
tensor {2, 3, 40, 50} slicing with dim=3, start=0, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,0:9223372036854775807:2] with end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=-10, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,-10:9223372036854775807:2] with negative start and end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=INT64_MIN, end=INT64_MAX, step=2 <-> tensor indexing by [:,:,:,-9223372036854775808:9223372036854775807:2] with start=INT64_MIN and end=INT64_MAX
tensor {2, 3, 40, 50} slicing with dim=3, start=empty, end=empty, step=2 <-> tensor indexing by [:,:,:,::2] with empty start/end
```
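
The out-of-range cases above mirror eager PyTorch's clamping semantics, for example:
```
import torch

t = torch.arange(6)
print(t[4:100])   # tensor([4, 5]); an end beyond the size is clamped
print(t[-60:60])  # tensor([0, 1, 2, 3, 4, 5]); out-of-range start/end clamped
```
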
* References:
  * [Slicing PyTorch Datasets](https://lewtun.github.io/blog/til/nlp/pytorch/2021/01/24/til-slicing-torch-datasets.html)
  * [How to Slice a 3D Tensor in Pytorch?](https://www.geeksforgeeks.org/how-to-slice-a-3d-tensor-in-pytorch/)
  * [PyTorch Tensor Indexing API](https://pytorch.org/cppdocs/notes/tensor_indexing.html#translating-between-python-c-index-types)
  * [PyTorch Tensor Indexing](https://deeplearninguniversity.com/pytorch/pytorch-tensor-indexing/)
  * [Slicing and Striding](https://mlverse.github.io/torch/articles/indexing.html#slicing-and-striding)
* Vulkan `slice` operator tensor conversion:
{F684363708}

Test Plan:
Build & test on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Build & test on MacOS:
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```
Test result on Android (Google Pixel 5):
```
[ RUN      ] VulkanAPITest.slice_width_success
[       OK ] VulkanAPITest.slice_width_success (17 ms)
[ RUN      ] VulkanAPITest.slice_height_success
[       OK ] VulkanAPITest.slice_height_success (13 ms)
[ RUN      ] VulkanAPITest.slice_feature_success
[       OK ] VulkanAPITest.slice_feature_success (20 ms)
[ RUN      ] VulkanAPITest.slice_batch_success
[       OK ] VulkanAPITest.slice_batch_success (9 ms)
[ RUN      ] VulkanAPITest.slice_invalidinputs_exceptions
[       OK ] VulkanAPITest.slice_invalidinputs_exceptions (0 ms)
```
Test result on MacOS:
```
[ RUN      ] VulkanAPITest.slice_width_success
[       OK ] VulkanAPITest.slice_width_success (81 ms)
[ RUN      ] VulkanAPITest.slice_height_success
[       OK ] VulkanAPITest.slice_height_success (56 ms)
[ RUN      ] VulkanAPITest.slice_feature_success
[       OK ] VulkanAPITest.slice_feature_success (132 ms)
[ RUN      ] VulkanAPITest.slice_batch_success
[       OK ] VulkanAPITest.slice_batch_success (33 ms)
[ RUN      ] VulkanAPITest.slice_invalidinputs_exceptions
[       OK ] VulkanAPITest.slice_invalidinputs_exceptions (1 ms)
```

Reviewed By: SS-JIA

Differential Revision: D32482638

fbshipit-source-id: 65841fb2d3489ee407f2b4f38619b700787d41b0
2021-12-06 12:05:37 -08:00
a84ed8be6d unify compare kernels (#69111)
Summary:
This unifies 6 compare ops (NE, EQ, LT, LE, GE, GT) into 2 kernels, reducing context size. Performance is ~5% worse for low-width broadcast cases and on par for non-broadcast ones.
With this PR, benchmarks for the contiguous, 1M-MM, 1M-M1, and op-with-scalar cases (size in MB and bandwidth in GB/s):
```
    5.0,   795.9
   10.0,   650.5
   15.0,   706.2
   20.0,   731.6
   25.0,   744.9
   30.0,   758.1
   35.0,   762.6
   40.0,   768.8
   45.0,   775.7
   50.0,   780.7
   55.0,   781.7
   60.0,   783.0
   65.0,   784.8
   70.0,   790.7
   75.0,   789.2
   80.0,   794.4
   85.0,   794.2
   90.0,   797.4
   95.0,   796.3
  100.0,   798.0
    3.0,   363.7     1.0,   122.2     3.0,   385.5
    6.0,   420.4     2.0,   142.9     6.0,   755.5
    9.0,   438.3     3.0,   151.6     9.0,   684.5
   12.0,   449.5     4.0,   156.4    12.0,   702.9
   15.0,   463.7     5.0,   159.6    15.0,   716.8
   18.0,   472.7     6.0,   161.4    18.0,   737.0
   21.0,   477.6     7.0,   162.4    21.0,   745.6
   24.0,   480.9     8.0,   164.1    24.0,   755.4
   27.0,   483.7     9.0,   163.7    27.0,   760.7
   30.0,   487.3    10.0,   165.9    30.0,   770.4
   33.0,   491.4    11.0,   166.3    33.0,   774.3
   36.0,   492.9    12.0,   166.2    36.0,   779.0
   39.0,   494.7    13.0,   166.7    39.0,   782.5
   42.0,   491.3    14.0,   166.7    42.0,   789.0
   45.0,   495.1    15.0,   167.5    45.0,   790.0
   48.0,   499.7    16.0,   167.7    48.0,   791.8
   51.0,   496.2    17.0,   166.9    51.0,   794.0
   54.0,   497.6    18.0,   167.7    54.0,   797.4
   57.0,   497.1    19.0,   167.5    57.0,   798.6
   60.0,   498.8    20.0,   168.8    60.0,   802.1

```
Master
```
    5.0,   743.4
   10.0,   665.7
   15.0,   702.3
   20.0,   727.5
   25.0,   740.7
   30.0,   757.5
   35.0,   760.3
   40.0,   768.5
   45.0,   775.7
   50.0,   776.8
   55.0,   781.1
   60.0,   786.5
   65.0,   786.8
   70.0,   790.1
   75.0,   789.7
   80.0,   789.1
   85.0,   793.2
   90.0,   793.8
   95.0,   795.9
  100.0,   796.0
    3.0,   383.1     1.0,   129.0     3.0,   337.0
    6.0,   445.0     2.0,   149.6     6.0,   670.6
    9.0,   445.3     3.0,   159.6     9.0,   678.6
   12.0,   474.9     4.0,   164.1    12.0,   705.5
   15.0,   480.8     5.0,   167.2    15.0,   718.3
   18.0,   490.3     6.0,   169.1    18.0,   733.3
   21.0,   493.9     7.0,   168.5    21.0,   742.5
   24.0,   503.8     8.0,   171.9    24.0,   756.4
   27.0,   506.7     9.0,   171.3    27.0,   759.8
   30.0,   508.7    10.0,   172.4    30.0,   767.1
   33.0,   515.7    11.0,   174.2    33.0,   773.7
   36.0,   516.7    12.0,   170.4    36.0,   781.7
   39.0,   519.1    13.0,   174.4    39.0,   782.1
   42.0,   515.7    14.0,   174.1    42.0,   787.0
   45.0,   519.2    15.0,   172.7    45.0,   788.1
   48.0,   522.2    16.0,   175.4    48.0,   791.7
   51.0,   519.6    17.0,   175.1    51.0,   795.7
   54.0,   518.5    18.0,   174.8    54.0,   795.8
   57.0,   519.1    19.0,   174.4    57.0,   796.6
   60.0,   521.5    20.0,   175.6    60.0,   800.1
```
<details>
<summary>Benchmarking script </summary>

```
import torch
from matplotlib import pyplot as plt
from torch.utils.benchmark import Timer, Compare
import math
import click
print(torch.cuda.get_device_capability()) # check that we are on Volta (compute capability 7,0)
#torch.cuda.set_device(1)
# don't benchmark on anything too small, you'll see only overhead
@click.command()
@click.option('--op_str', default="torch.gt")
@click.option('--dtype_str', default="float", type=click.Choice(['float', 'half']))
def bench(op_str, dtype_str):
    if dtype_str == "float":
        dtype = torch.float
    elif dtype_str == "half":
        dtype = torch.half

    MB = 1024 * 1024
    size = MB
    results = []
    sizes = []
    for _ in range(20):
        torch.cuda.memory.empty_cache()
        a=torch.randn(int(size), device="cuda", dtype=dtype)
        b=torch.randn(int(size), device="cuda", dtype=dtype)
        t = Timer(stmt=f"{op_str}(a,b)", label = op_str, sub_label=f"{size/MB} MB", description="contiguous", globals = {"a":a, "b":b})
        res = t.blocked_autorange()
        results.append(res)
        sizes.append(size)
        size +=  MB
        del a #to save memory for next iterations
        del b
    c=Compare(results)
    #print(c)
    bw=[]
    bytes=[]
    element_size = torch.tensor([], dtype=dtype).element_size()
    output_element_size = 1
    for res, size in zip(results,sizes):
        bytes_io = 2*size*element_size + output_element_size * size
        bytes.append(bytes_io/MB)
        # we'll report bandwidth in GB/s
        bw.append(bytes_io/res.median * 1e-9)
        print(f"{bytes_io/MB:7.1f}, {bw[-1]:7.1f}")

    sizes = []
    results = [[],[],[]]

    size = MB
    for _ in range(20):
        torch.cuda.memory.empty_cache()
        M = math.floor(math.sqrt(size))
        a=torch.randn(1, M, device="cuda", dtype=dtype)
        b=torch.randn(M, M, device="cuda", dtype=dtype)
        b1 = torch.randn(M, 1, device="cuda", dtype=dtype)
        tb = Timer(stmt=f"{op_str}(a,b)", label = op_str, sub_label=f"{M*M/MB} MB", description="MMM1", globals = {"a":a, "b":b})
        t1 = Timer(stmt=f"{op_str}(a,b1)", label = op_str, sub_label=f"{M*M/MB} MB", description="M11M", globals = {"a":a, "b1":b1})
        ts = Timer(stmt=f"{op_str}(b,1.)", label = op_str, sub_label=f"{M*M/MB} MB", description="scalar", globals = {"a":a, "b":b})

        res = [t.blocked_autorange() for t in (tb, t1, ts)]
        for (rl, r) in zip(results, res):
            rl.append(r)
        sizes.append(M)
        size += MB
        del a #to save memory for next iterations
        del b
    comps = [Compare(r) for r in results]
    #[print(c) for c in comps]
    bw=[[],[],[]]

    for res, res1, ress, size in zip(results[0],results[1],results[2], sizes):
        bytes_io = (size+size*size)*element_size + output_element_size * size*size #(size+size+size*size)*4
        bytes_io1 = (size+size)*element_size + output_element_size * size*size #(size+size+size*size)*4
        bytes_ios = (size*size)*element_size + output_element_size * size * size
        bytes_iol = (bytes_io, bytes_io1, bytes_ios)
        for (bw_elem, bytes_elem, res_elem) in zip(bw, bytes_iol, (res, res1, ress)):
            bw_elem.append(bytes_elem/res_elem.median * 1e-9)
        print(f"{bytes_iol[0]/MB:7.1f}, {bw[0][-1]:7.1f}", f"{bytes_iol[1]/MB:7.1f}, {bw[1][-1]:7.1f}",
        f"{bytes_iol[2]/MB:7.1f}, {bw[2][-1]:7.1f}")

if __name__ == '__main__':
    bench()
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69111

Reviewed By: mruberry

Differential Revision: D32851098

Pulled By: ngimel

fbshipit-source-id: cfb83922b2e8eb6a0ad0621ff07c2dada9c8e626
2021-12-06 11:00:53 -08:00
38c576cfef Clean up CODEOWNERS for .github/ (#69395)
Summary:
Cleans up the CODEOWNERS file to reflect current team

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69395

Test Plan: yeah_sandcastle

Reviewed By: anjali411

Differential Revision: D32885237

Pulled By: seemethere

fbshipit-source-id: a465f2cd0e27d5e53f5af5769d1cad47ec5348e7
2021-12-06 10:50:29 -08:00
bf01cd5228 Move THC_sleep to ATen (#69038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69038

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872479

Pulled By: ngimel

fbshipit-source-id: 97c7592b16eee2ecc66c42507c358aa92cc8ee50
2021-12-06 10:20:43 -08:00
a974699633 Skips failing ROCm test (#69456)
Summary:
ROCm and CUDA type promotion are slightly divergent and need to be updated.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69456

Reviewed By: anjali411, janeyx99

Differential Revision: D32883895

Pulled By: mruberry

fbshipit-source-id: 3b0ba8a9d092c2d7ff20d78da42d4a147b1db12d
2021-12-06 09:12:31 -08:00
b737e09f60 expose return_types in Python (#66614)
Summary:
https://github.com/facebookresearch/functorch/issues/87

TODO:
* [x] Add comments
* [x] Add test
* [x] Fix XLA
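
A short usage sketch of what this exposes: the generated structseq classes (see below) become reachable as torch.return_types.<name> from Python.
```
import torch

res = torch.randn(3, 4).max(dim=0)
print(type(res))                                # <class 'torch.return_types.max'>
print(isinstance(res, torch.return_types.max))  # True
print(res.values.shape, res.indices.shape)      # named-field access
```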

<details>

<summary>Generated python_return_types.cpp</summary>

```cpp
#include <Python.h>

#include <vector>
#include <map>
#include <string>

#include "torch/csrc/autograd/python_return_types.h"
#include "torch/csrc/utils/structseq.h"
#include "torch/csrc/Exceptions.h"

namespace {
PyTypeObject* get__det_lu_based_helper_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"det", ""}, {"lu", ""}, {"pivs", ""},  {nullptr} };
    static PyTypeObject _det_lu_based_helperNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types._det_lu_based_helper", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&_det_lu_based_helperNamedTuple, &desc);
        _det_lu_based_helperNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &_det_lu_based_helperNamedTuple;
}
PyTypeObject* get__fake_quantize_per_tensor_affine_cachemask_tensor_qparams_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"output", ""}, {"mask", ""},  {nullptr} };
    static PyTypeObject _fake_quantize_per_tensor_affine_cachemask_tensor_qparamsNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types._fake_quantize_per_tensor_affine_cachemask_tensor_qparams", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&_fake_quantize_per_tensor_affine_cachemask_tensor_qparamsNamedTuple, &desc);
        _fake_quantize_per_tensor_affine_cachemask_tensor_qparamsNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &_fake_quantize_per_tensor_affine_cachemask_tensor_qparamsNamedTuple;
}
PyTypeObject* get__fused_moving_avg_obs_fq_helper_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"output", ""}, {"mask", ""},  {nullptr} };
    static PyTypeObject _fused_moving_avg_obs_fq_helperNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types._fused_moving_avg_obs_fq_helper", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&_fused_moving_avg_obs_fq_helperNamedTuple, &desc);
        _fused_moving_avg_obs_fq_helperNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &_fused_moving_avg_obs_fq_helperNamedTuple;
}
PyTypeObject* get__lu_with_info_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"LU", ""}, {"pivots", ""}, {"info", ""},  {nullptr} };
    static PyTypeObject _lu_with_infoNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types._lu_with_info", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&_lu_with_infoNamedTuple, &desc);
        _lu_with_infoNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &_lu_with_infoNamedTuple;
}
PyTypeObject* get__unpack_dual_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"primal", ""}, {"tangent", ""},  {nullptr} };
    static PyTypeObject _unpack_dualNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types._unpack_dual", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&_unpack_dualNamedTuple, &desc);
        _unpack_dualNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &_unpack_dualNamedTuple;
}
PyTypeObject* get_aminmax_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"min", ""}, {"max", ""},  {nullptr} };
    static PyTypeObject aminmaxNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.aminmax", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&aminmaxNamedTuple, &desc);
        aminmaxNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &aminmaxNamedTuple;
}

PyTypeObject* get_aminmax_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"min", ""}, {"max", ""},  {nullptr} };
    static PyTypeObject aminmax_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.aminmax_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&aminmax_outNamedTuple1, &desc);
        aminmax_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &aminmax_outNamedTuple1;
}
PyTypeObject* get_cummax_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject cummaxNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.cummax", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&cummaxNamedTuple, &desc);
        cummaxNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &cummaxNamedTuple;
}

PyTypeObject* get_cummax_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject cummax_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.cummax_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&cummax_outNamedTuple1, &desc);
        cummax_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &cummax_outNamedTuple1;
}
PyTypeObject* get_cummin_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject cumminNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.cummin", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&cumminNamedTuple, &desc);
        cumminNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &cumminNamedTuple;
}

PyTypeObject* get_cummin_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject cummin_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.cummin_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&cummin_outNamedTuple1, &desc);
        cummin_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &cummin_outNamedTuple1;
}
PyTypeObject* get_eig_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject eig_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.eig_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&eig_outNamedTuple, &desc);
        eig_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &eig_outNamedTuple;
}

PyTypeObject* get_eig_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject eigNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.eig", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&eigNamedTuple1, &desc);
        eigNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &eigNamedTuple1;
}
PyTypeObject* get_frexp_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"mantissa", ""}, {"exponent", ""},  {nullptr} };
    static PyTypeObject frexpNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.frexp", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&frexpNamedTuple, &desc);
        frexpNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &frexpNamedTuple;
}

PyTypeObject* get_frexp_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"mantissa", ""}, {"exponent", ""},  {nullptr} };
    static PyTypeObject frexp_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.frexp_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&frexp_outNamedTuple1, &desc);
        frexp_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &frexp_outNamedTuple1;
}
PyTypeObject* get_geqrf_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"a", ""}, {"tau", ""},  {nullptr} };
    static PyTypeObject geqrf_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.geqrf_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&geqrf_outNamedTuple, &desc);
        geqrf_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &geqrf_outNamedTuple;
}

PyTypeObject* get_geqrf_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"a", ""}, {"tau", ""},  {nullptr} };
    static PyTypeObject geqrfNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.geqrf", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&geqrfNamedTuple1, &desc);
        geqrfNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &geqrfNamedTuple1;
}
PyTypeObject* get_histogram_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"hist", ""}, {"bin_edges", ""},  {nullptr} };
    static PyTypeObject histogram_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.histogram_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&histogram_outNamedTuple, &desc);
        histogram_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &histogram_outNamedTuple;
}

PyTypeObject* get_histogram_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"hist", ""}, {"bin_edges", ""},  {nullptr} };
    static PyTypeObject histogramNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.histogram", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&histogramNamedTuple1, &desc);
        histogramNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &histogramNamedTuple1;
}
PyTypeObject* get_kthvalue_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject kthvalueNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.kthvalue", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&kthvalueNamedTuple, &desc);
        kthvalueNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &kthvalueNamedTuple;
}

PyTypeObject* get_kthvalue_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject kthvalue_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.kthvalue_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&kthvalue_outNamedTuple1, &desc);
        kthvalue_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &kthvalue_outNamedTuple1;
}
PyTypeObject* get_linalg_cholesky_ex_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"L", ""}, {"info", ""},  {nullptr} };
    static PyTypeObject linalg_cholesky_exNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_cholesky_ex", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_cholesky_exNamedTuple, &desc);
        linalg_cholesky_exNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_cholesky_exNamedTuple;
}

PyTypeObject* get_linalg_cholesky_ex_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"L", ""}, {"info", ""},  {nullptr} };
    static PyTypeObject linalg_cholesky_ex_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_cholesky_ex_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_cholesky_ex_outNamedTuple1, &desc);
        linalg_cholesky_ex_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_cholesky_ex_outNamedTuple1;
}
PyTypeObject* get_linalg_eig_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject linalg_eigNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_eig", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_eigNamedTuple, &desc);
        linalg_eigNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_eigNamedTuple;
}

PyTypeObject* get_linalg_eig_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject linalg_eig_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_eig_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_eig_outNamedTuple1, &desc);
        linalg_eig_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_eig_outNamedTuple1;
}
PyTypeObject* get_linalg_eigh_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject linalg_eighNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_eigh", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_eighNamedTuple, &desc);
        linalg_eighNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_eighNamedTuple;
}

PyTypeObject* get_linalg_eigh_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject linalg_eigh_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_eigh_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_eigh_outNamedTuple1, &desc);
        linalg_eigh_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_eigh_outNamedTuple1;
}
PyTypeObject* get_linalg_inv_ex_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"inverse", ""}, {"info", ""},  {nullptr} };
    static PyTypeObject linalg_inv_exNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_inv_ex", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_inv_exNamedTuple, &desc);
        linalg_inv_exNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_inv_exNamedTuple;
}

PyTypeObject* get_linalg_inv_ex_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"inverse", ""}, {"info", ""},  {nullptr} };
    static PyTypeObject linalg_inv_ex_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_inv_ex_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_inv_ex_outNamedTuple1, &desc);
        linalg_inv_ex_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_inv_ex_outNamedTuple1;
}
PyTypeObject* get_linalg_lstsq_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"residuals", ""}, {"rank", ""}, {"singular_values", ""},  {nullptr} };
    static PyTypeObject linalg_lstsqNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_lstsq", nullptr, NamedTuple_fields, 4 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_lstsqNamedTuple, &desc);
        linalg_lstsqNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_lstsqNamedTuple;
}

PyTypeObject* get_linalg_lstsq_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"residuals", ""}, {"rank", ""}, {"singular_values", ""},  {nullptr} };
    static PyTypeObject linalg_lstsq_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_lstsq_out", nullptr, NamedTuple_fields, 4 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_lstsq_outNamedTuple1, &desc);
        linalg_lstsq_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_lstsq_outNamedTuple1;
}
PyTypeObject* get_linalg_qr_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"Q", ""}, {"R", ""},  {nullptr} };
    static PyTypeObject linalg_qrNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_qr", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_qrNamedTuple, &desc);
        linalg_qrNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_qrNamedTuple;
}

PyTypeObject* get_linalg_qr_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"Q", ""}, {"R", ""},  {nullptr} };
    static PyTypeObject linalg_qr_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_qr_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_qr_outNamedTuple1, &desc);
        linalg_qr_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_qr_outNamedTuple1;
}
PyTypeObject* get_linalg_slogdet_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"sign", ""}, {"logabsdet", ""},  {nullptr} };
    static PyTypeObject linalg_slogdetNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_slogdet", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_slogdetNamedTuple, &desc);
        linalg_slogdetNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_slogdetNamedTuple;
}

PyTypeObject* get_linalg_slogdet_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"sign", ""}, {"logabsdet", ""},  {nullptr} };
    static PyTypeObject linalg_slogdet_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_slogdet_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_slogdet_outNamedTuple1, &desc);
        linalg_slogdet_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_slogdet_outNamedTuple1;
}
PyTypeObject* get_linalg_svd_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"U", ""}, {"S", ""}, {"Vh", ""},  {nullptr} };
    static PyTypeObject linalg_svd_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_svd_out", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_svd_outNamedTuple, &desc);
        linalg_svd_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_svd_outNamedTuple;
}

PyTypeObject* get_linalg_svd_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"U", ""}, {"S", ""}, {"Vh", ""},  {nullptr} };
    static PyTypeObject linalg_svdNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.linalg_svd", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&linalg_svdNamedTuple1, &desc);
        linalg_svdNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &linalg_svdNamedTuple1;
}
PyTypeObject* get_lstsq_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"QR", ""},  {nullptr} };
    static PyTypeObject lstsq_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.lstsq_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&lstsq_outNamedTuple, &desc);
        lstsq_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &lstsq_outNamedTuple;
}

PyTypeObject* get_lstsq_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"QR", ""},  {nullptr} };
    static PyTypeObject lstsqNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.lstsq", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&lstsqNamedTuple1, &desc);
        lstsqNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &lstsqNamedTuple1;
}
PyTypeObject* get_lu_unpack_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"P", ""}, {"L", ""}, {"U", ""},  {nullptr} };
    static PyTypeObject lu_unpackNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.lu_unpack", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&lu_unpackNamedTuple, &desc);
        lu_unpackNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &lu_unpackNamedTuple;
}

PyTypeObject* get_lu_unpack_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"P", ""}, {"L", ""}, {"U", ""},  {nullptr} };
    static PyTypeObject lu_unpack_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.lu_unpack_out", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&lu_unpack_outNamedTuple1, &desc);
        lu_unpack_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &lu_unpack_outNamedTuple1;
}
PyTypeObject* get_max_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject maxNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.max", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&maxNamedTuple, &desc);
        maxNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &maxNamedTuple;
}

PyTypeObject* get_max_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject max_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.max_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&max_outNamedTuple1, &desc);
        max_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &max_outNamedTuple1;
}
PyTypeObject* get_median_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject medianNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.median", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&medianNamedTuple, &desc);
        medianNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &medianNamedTuple;
}

PyTypeObject* get_median_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject median_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.median_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&median_outNamedTuple1, &desc);
        median_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &median_outNamedTuple1;
}
PyTypeObject* get_min_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject minNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.min", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&minNamedTuple, &desc);
        minNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &minNamedTuple;
}

PyTypeObject* get_min_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject min_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.min_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&min_outNamedTuple1, &desc);
        min_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &min_outNamedTuple1;
}
PyTypeObject* get_mode_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject modeNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.mode", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&modeNamedTuple, &desc);
        modeNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &modeNamedTuple;
}

PyTypeObject* get_mode_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject mode_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.mode_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&mode_outNamedTuple1, &desc);
        mode_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &mode_outNamedTuple1;
}
PyTypeObject* get_nanmedian_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject nanmedianNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.nanmedian", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&nanmedianNamedTuple, &desc);
        nanmedianNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &nanmedianNamedTuple;
}

PyTypeObject* get_nanmedian_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject nanmedian_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.nanmedian_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&nanmedian_outNamedTuple1, &desc);
        nanmedian_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &nanmedian_outNamedTuple1;
}
PyTypeObject* get_qr_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"Q", ""}, {"R", ""},  {nullptr} };
    static PyTypeObject qr_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.qr_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&qr_outNamedTuple, &desc);
        qr_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &qr_outNamedTuple;
}

PyTypeObject* get_qr_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"Q", ""}, {"R", ""},  {nullptr} };
    static PyTypeObject qrNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.qr", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&qrNamedTuple1, &desc);
        qrNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &qrNamedTuple1;
}
PyTypeObject* get_slogdet_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"sign", ""}, {"logabsdet", ""},  {nullptr} };
    static PyTypeObject slogdetNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.slogdet", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&slogdetNamedTuple, &desc);
        slogdetNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &slogdetNamedTuple;
}
PyTypeObject* get_solve_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"LU", ""},  {nullptr} };
    static PyTypeObject solveNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.solve", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&solveNamedTuple, &desc);
        solveNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &solveNamedTuple;
}

PyTypeObject* get_solve_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"LU", ""},  {nullptr} };
    static PyTypeObject solve_outNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.solve_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&solve_outNamedTuple1, &desc);
        solve_outNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &solve_outNamedTuple1;
}
PyTypeObject* get_sort_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject sort_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.sort_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&sort_outNamedTuple, &desc);
        sort_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &sort_outNamedTuple;
}

PyTypeObject* get_sort_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject sortNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.sort", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&sortNamedTuple1, &desc);
        sortNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &sortNamedTuple1;
}
PyTypeObject* get_svd_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"U", ""}, {"S", ""}, {"V", ""},  {nullptr} };
    static PyTypeObject svd_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.svd_out", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&svd_outNamedTuple, &desc);
        svd_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &svd_outNamedTuple;
}

PyTypeObject* get_svd_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"U", ""}, {"S", ""}, {"V", ""},  {nullptr} };
    static PyTypeObject svdNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.svd", nullptr, NamedTuple_fields, 3 };
    if (!is_initialized) {
        PyStructSequence_InitType(&svdNamedTuple1, &desc);
        svdNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &svdNamedTuple1;
}
PyTypeObject* get_symeig_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject symeig_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.symeig_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&symeig_outNamedTuple, &desc);
        symeig_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &symeig_outNamedTuple;
}

PyTypeObject* get_symeig_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"eigenvalues", ""}, {"eigenvectors", ""},  {nullptr} };
    static PyTypeObject symeigNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.symeig", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&symeigNamedTuple1, &desc);
        symeigNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &symeigNamedTuple1;
}
PyTypeObject* get_topk_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject topk_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.topk_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&topk_outNamedTuple, &desc);
        topk_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &topk_outNamedTuple;
}

PyTypeObject* get_topk_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"values", ""}, {"indices", ""},  {nullptr} };
    static PyTypeObject topkNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.topk", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&topkNamedTuple1, &desc);
        topkNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &topkNamedTuple1;
}
PyTypeObject* get_triangular_solve_out_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"cloned_coefficient", ""},  {nullptr} };
    static PyTypeObject triangular_solve_outNamedTuple;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.triangular_solve_out", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&triangular_solve_outNamedTuple, &desc);
        triangular_solve_outNamedTuple.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &triangular_solve_outNamedTuple;
}

PyTypeObject* get_triangular_solve_namedtuple() {
    static PyStructSequence_Field NamedTuple_fields[] = { {"solution", ""}, {"cloned_coefficient", ""},  {nullptr} };
    static PyTypeObject triangular_solveNamedTuple1;
    static bool is_initialized = false;
    static PyStructSequence_Desc desc = { "torch.return_types.triangular_solve", nullptr, NamedTuple_fields, 2 };
    if (!is_initialized) {
        PyStructSequence_InitType(&triangular_solveNamedTuple1, &desc);
        triangular_solveNamedTuple1.tp_repr = (reprfunc)torch::utils::returned_structseq_repr;
        is_initialized = true;
    }
    return &triangular_solveNamedTuple1;
}
}

namespace torch {
namespace autograd {

std::map<std::string, PyTypeObject*>& get_namedtuple_types_map() {
  // [NOTE] Non-global map
  // This map calls Python functions during its initialization.
  // If it is a global static variable and in case it is loaded
  // before Python interpreter is ready, then the calls it makes during
  // initialization will SEGFAULT.
  // To avoid this we make it function static variable so that it is
  // initialized only after the Python interpreter is ready.
  static std::map<std::string, PyTypeObject*> namedtuple_types_map = {
    {"_det_lu_based_helper", get__det_lu_based_helper_namedtuple()},
    {"_fake_quantize_per_tensor_affine_cachemask_tensor_qparams", get__fake_quantize_per_tensor_affine_cachemask_tensor_qparams_namedtuple()},
    {"_fused_moving_avg_obs_fq_helper", get__fused_moving_avg_obs_fq_helper_namedtuple()},
    {"_lu_with_info", get__lu_with_info_namedtuple()},
    {"_unpack_dual", get__unpack_dual_namedtuple()},
    {"aminmax", get_aminmax_namedtuple()},
    {"aminmax_out", get_aminmax_out_namedtuple()},
    {"cummax", get_cummax_namedtuple()},
    {"cummax_out", get_cummax_out_namedtuple()},
    {"cummin", get_cummin_namedtuple()},
    {"cummin_out", get_cummin_out_namedtuple()},
    {"eig_out", get_eig_out_namedtuple()},
    {"eig", get_eig_namedtuple()},
    {"frexp", get_frexp_namedtuple()},
    {"frexp_out", get_frexp_out_namedtuple()},
    {"geqrf_out", get_geqrf_out_namedtuple()},
    {"geqrf", get_geqrf_namedtuple()},
    {"histogram_out", get_histogram_out_namedtuple()},
    {"histogram", get_histogram_namedtuple()},
    {"kthvalue", get_kthvalue_namedtuple()},
    {"kthvalue_out", get_kthvalue_out_namedtuple()},
    {"linalg_cholesky_ex", get_linalg_cholesky_ex_namedtuple()},
    {"linalg_cholesky_ex_out", get_linalg_cholesky_ex_out_namedtuple()},
    {"linalg_eig", get_linalg_eig_namedtuple()},
    {"linalg_eig_out", get_linalg_eig_out_namedtuple()},
    {"linalg_eigh", get_linalg_eigh_namedtuple()},
    {"linalg_eigh_out", get_linalg_eigh_out_namedtuple()},
    {"linalg_inv_ex", get_linalg_inv_ex_namedtuple()},
    {"linalg_inv_ex_out", get_linalg_inv_ex_out_namedtuple()},
    {"linalg_lstsq", get_linalg_lstsq_namedtuple()},
    {"linalg_lstsq_out", get_linalg_lstsq_out_namedtuple()},
    {"linalg_qr", get_linalg_qr_namedtuple()},
    {"linalg_qr_out", get_linalg_qr_out_namedtuple()},
    {"linalg_slogdet", get_linalg_slogdet_namedtuple()},
    {"linalg_slogdet_out", get_linalg_slogdet_out_namedtuple()},
    {"linalg_svd_out", get_linalg_svd_out_namedtuple()},
    {"linalg_svd", get_linalg_svd_namedtuple()},
    {"lstsq_out", get_lstsq_out_namedtuple()},
    {"lstsq", get_lstsq_namedtuple()},
    {"lu_unpack", get_lu_unpack_namedtuple()},
    {"lu_unpack_out", get_lu_unpack_out_namedtuple()},
    {"max", get_max_namedtuple()},
    {"max_out", get_max_out_namedtuple()},
    {"median", get_median_namedtuple()},
    {"median_out", get_median_out_namedtuple()},
    {"min", get_min_namedtuple()},
    {"min_out", get_min_out_namedtuple()},
    {"mode", get_mode_namedtuple()},
    {"mode_out", get_mode_out_namedtuple()},
    {"nanmedian", get_nanmedian_namedtuple()},
    {"nanmedian_out", get_nanmedian_out_namedtuple()},
    {"qr_out", get_qr_out_namedtuple()},
    {"qr", get_qr_namedtuple()},
    {"slogdet", get_slogdet_namedtuple()},
    {"solve", get_solve_namedtuple()},
    {"solve_out", get_solve_out_namedtuple()},
    {"sort_out", get_sort_out_namedtuple()},
    {"sort", get_sort_namedtuple()},
    {"svd_out", get_svd_out_namedtuple()},
    {"svd", get_svd_namedtuple()},
    {"symeig_out", get_symeig_out_namedtuple()},
    {"symeig", get_symeig_namedtuple()},
    {"topk_out", get_topk_out_namedtuple()},
    {"topk", get_topk_namedtuple()},
    {"triangular_solve_out", get_triangular_solve_out_namedtuple()},
    {"triangular_solve", get_triangular_solve_namedtuple()},
  };
  return namedtuple_types_map;
}

PyTypeObject* get_namedtuple(std::string name) {
  static auto& namedtuple_types_map = get_namedtuple_types_map();
  return namedtuple_types_map[name];
}

void initReturnTypes(PyObject* module) {
  static struct PyModuleDef def = {
      PyModuleDef_HEAD_INIT, "torch._C._return_types", nullptr, -1, {}};
  PyObject* return_types_module = PyModule_Create(&def);
  if (!return_types_module) {
    throw python_error();
  }

  for (const auto& return_type_pair : get_namedtuple_types_map()) {
    // hold onto the TypeObject for the unlikely case of user
    // deleting or overriding it.
    Py_INCREF(return_type_pair.second);
    if (PyModule_AddObject(
            return_types_module,
            return_type_pair.first.c_str(),
            (PyObject*)return_type_pair.second) != 0) {
      Py_DECREF((PyObject*)return_type_pair.second);
      throw python_error();
    }
  }

  // steals a reference to return_types on success
  if (PyModule_AddObject(module, "_return_types", return_types_module) != 0) {
    Py_DECREF(return_types_module);
    throw python_error();
  }
}

} // namespace autograd
} // namespace torch

```

</details>

<details>

<summary>Eg. updated call in other python_*_functions</summary>

```cpp
// linalg_cholesky_ex
static PyObject * THPVariable_linalg_cholesky_ex(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PyTypeObject* NamedTuple = get_namedtuple("linalg_cholesky_ex");
  static PyTypeObject* NamedTuple1 = get_namedtuple("linalg_cholesky_ex_out");
  static PythonArgParser parser({
    "linalg_cholesky_ex(Tensor input, *, bool upper=False, bool check_errors=False, TensorList[2] out=None)",
  }, /*traceable=*/true);

  ParsedArgs<4> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPLinalgVariableFunctionsModule, "torch.linalg");
  }
  if (_r.isNone(3)) {
    // aten::linalg_cholesky_ex(Tensor self, *, bool upper=False, bool check_errors=False) -> (Tensor L, Tensor info)

    auto dispatch_linalg_cholesky_ex = [](const at::Tensor & self, bool upper, bool check_errors) -> ::std::tuple<at::Tensor,at::Tensor> {
      pybind11::gil_scoped_release no_gil;
      return at::linalg_cholesky_ex(self, upper, check_errors);
    };
    return wrap(NamedTuple, dispatch_linalg_cholesky_ex(_r.tensor(0), _r.toBool(1), _r.toBool(2)));
  } else {
    // aten::linalg_cholesky_ex.L(Tensor self, *, bool upper=False, bool check_errors=False, Tensor(a!) L, Tensor(b!) info) -> (Tensor(a!) L, Tensor(b!) info)
    auto out = _r.tensorlist_n<2>(3);
    auto dispatch_linalg_cholesky_ex_out = [](at::Tensor & L, at::Tensor & info, const at::Tensor & self, bool upper, bool check_errors) -> ::std::tuple<at::Tensor,at::Tensor> {
      pybind11::gil_scoped_release no_gil;
      return at::linalg_cholesky_ex_out(L, info, self, upper, check_errors);
    };
    return wrap(NamedTuple1, dispatch_linalg_cholesky_ex_out(out[0], out[1], _r.tensor(0), _r.toBool(1), _r.toBool(2)));
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}

```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66614

Reviewed By: H-Huang

Differential Revision: D32741134

Pulled By: zou3519

fbshipit-source-id: 27bada30d20e66333ca1be1775608d9f0cbf9f59
2021-12-06 09:05:29 -08:00
78b7a419b2 Enable native_dropout/backward for lazy (#69374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69374

Enables the existing native_dropout operator for use with lazy tensors. Also adds aten interned strings so lazy tensor codegen can refer to the symbols in generated IR classes.

Test Plan: CI for regressions of existing use cases, and manual tests of new Lazy Tensor functionality

Reviewed By: ngimel

Differential Revision: D32837301

fbshipit-source-id: a372a24ec65367fb84ad2e97c7e38cae4ec703a6
2021-12-06 08:14:10 -08:00
b6f41bb848 The Jiterator (#69439)
Summary:
This PR:

- creates the "jiterator" pattern, allowing elementwise unary and binary kernels that don't accept scalars to be jit compiled when called
- ports the gcd and i1 CUDA kernels to use the jiterator
- extends elementwise binary systemic testing to be comparable to elementwise unary systemic testing
- separates one test case from test_out in test_ops.py
- updates more OpInfos to use expected failures instead of skips

The jiterator currently does not support half, bfloat16 or complex dtypes. It also (as mentioned above) doesn't support scalar inputs. In the future we expect to add support for those datatypes and scalars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69439

Reviewed By: ngimel

Differential Revision: D32874968

Pulled By: mruberry

fbshipit-source-id: d44bb9cde4f602703e75400ec5a0b209f085e9b3
2021-12-06 07:32:48 -08:00
3202028ed1 [Core ML] Avoid recompiling models when the OS version is not changed (#69438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69438

We don't need to recompile the model if the OS version is not changed. This could save hundreds of ms when loading the model.

{F683788183}
ghstack-source-id: 144784720
ghstack-source-id: 144821734

Test Plan:
1. Test in the playground app
2. Test in the ig

Reviewed By: hanton

Differential Revision: D32866326

fbshipit-source-id: ae2174f68dda4d2ab89ee328cb710c08d45c4d9a
2021-12-06 00:49:51 -08:00
c97dc9286d Revert D32780415: [Static Runtime] Move implementation details from impl.h into internal.h
Test Plan: revert-hammer

Differential Revision:
D32780415 (999e93e6a8)

Original commit changeset: 119b7aedbf56

fbshipit-source-id: 1aa777e8c1854ab27e86bc625188f7170097fac8
2021-12-04 19:44:07 -08:00
29a45f0009 Revert D32743881: [Core ML] Avoid recompiling models when the OS version is not changed
Test Plan: revert-hammer

Differential Revision:
D32743881 (b97903abb8)

Original commit changeset: 2e94c6035520

fbshipit-source-id: 6cb05c414a23e15604b095c333a92ed8980092bd
2021-12-04 15:57:58 -08:00
999e93e6a8 [Static Runtime] Move implementation details from impl.h into internal.h (#69274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274

`impl.h` is the main header file that defines the interface of Static Runtime to its clients.

However, it is currently filled with implementation details that should not be leaked to our clients: 1) this unnecessarily exposes our internals, which can make them hard to change later; 2) it causes unnecessary merge conflicts when multiple people are touching this enormous impl.cpp file.

To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, that's internally kept without leaking the details to our clients.

This change will be followed by another change to rename `impl.h` into `runtime.h` or anything better since `impl.h` is currently not about implementation but SR's interface.

Note that this change is NOT complete, since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. We also need to work on modularizing our implementation into separate pieces organized into separate files in the near future.

Test Plan: Existing unittests

Reviewed By: donaldong

Differential Revision: D32780415

fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
2021-12-04 14:48:28 -08:00
b97903abb8 [Core ML] Avoid recompiling models when the OS version is not changed (#69234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69234

We don't need to recompile the model if the OS version is not changed. This could save hundreds of ms when loading the model.

{F683788183}
ghstack-source-id: 144784720

Test Plan:
1. Test in the playground app
2. Test in the ig

Reviewed By: hanton

Differential Revision: D32743881

fbshipit-source-id: 2e94c6035520de3eeaf0b61f7cf9082228c8a955
2021-12-04 13:38:27 -08:00
e8f4c9cc40 [LT] Upstream LazyView and view ops IR Nodes (#69277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69277

LazyView is the main class for tracking aliases caused by view
ops. The corresponding IR classes for view ops are hand-written for now,
and we can switch to code-generating them in the future. Certain view
ops also have a reverse IR class to perform in-place updates in the
backward direction along a chain of alias ops.

As part of the future work, we will simplify the logic for LazyView once
the functionalization pass in core is ready to use.

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32820014

Pulled By: desertfire

fbshipit-source-id: d9eb526cb23885f667e4815dc9dd291a7b7e4256
2021-12-04 08:44:54 -08:00
0bbe21b172 [LT] Upstream more util functions (#69098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69098

Add the following utils: helpers, ir_dump_util, and
tensor_util. Some of the util functions might be better organized by
grouping them into different files, but we can leave that for later.

Test Plan: Imported from OSS

Reviewed By: alanwaketan

Differential Revision: D32758480

Pulled By: desertfire

fbshipit-source-id: 2a0707879f0c49573380b4c8227a3c916c99bf9a
2021-12-04 08:42:35 -08:00
bfe5ad28e6 [Linalg] Add a runtime switch to let pytorch prefer a backend impl in linalg functions on GPU (#67980)
Summary:
Per title.

This PR introduces a global flag that lets PyTorch prefer one of the many backend implementations when calling linear algebra functions on GPU.

Usage:
```python
torch.backends.cuda.preferred_linalg_library('cusolver')
```

Available options (str): `'default'`, `'cusolver'`, `'magma'`.

Issue https://github.com/pytorch/pytorch/issues/63992 inspired me to write this PR. No heuristic is perfect on all devices, library versions, matrix shapes, workloads, etc. We can obtain better performance if we can conveniently switch linear algebra backends at runtime.

Performance of linear algebra operators after this PR should be no worse than before. The flag is set to **`'default'`** by default, which makes everything the same as before this PR.

The implementation of this PR is basically following that of https://github.com/pytorch/pytorch/pull/67790.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67980

Reviewed By: mruberry

Differential Revision: D32849457

Pulled By: ngimel

fbshipit-source-id: 679fee7744a03af057995aef06316306073010a6
2021-12-03 19:06:30 -08:00
9663e08674 [Static Runtime] Fix a bug where aten::embedding_bag cannot handle resized input tensors (#69219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219

This change fixes a bug where the `aten::embedding_bag` implementation does not adjust the size of a managed output tensor to match a given input after memory planning starts.
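
For illustration, a minimal sketch of the out-variant contract involved, with a hypothetical `my_op_out` (the real `embedding_bag` out variant is far more involved): a managed output tensor that is reused across iterations must be resized for each new input.

```cpp
// Hedged sketch, not the actual embedding_bag kernel: a reused managed
// output must be resized for the current input before being written.
#include <ATen/ATen.h>

void my_op_out(const at::Tensor& input, at::Tensor& out) {
  out.resize_({input.size(0)}); // without this, a cached `out` keeps its old size
  out.copy_(input.select(/*dim=*/1, /*index=*/0));
}

int main() {
  at::Tensor out = at::empty({0});
  my_op_out(at::randn({4, 8}), out); // out becomes shape [4]
  my_op_out(at::randn({7, 8}), out); // resized input: out must become shape [7]
}
```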

Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change.

Reviewed By: mikeiovine

Differential Revision: D32544399

fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29
2021-12-03 19:01:45 -08:00
6a4fa86026 Add OpInfos for misc nn.functional operators (#68922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68922

Reviewed By: Chillee

Differential Revision: D32842301

Pulled By: saketh-are

fbshipit-source-id: b7166faefb64668fc76cca6c528501b0d360c43b
2021-12-03 17:03:02 -08:00
da023611d7 [CUDA graphs] Fixes make_graphed_callables example typos (#69379)
Summary:
cc mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69379

Reviewed By: mruberry

Differential Revision: D32841260

Pulled By: ngimel

fbshipit-source-id: a7d0b9db0578526907547b201eddd55827812b63
2021-12-03 16:51:14 -08:00
e92b14bf1f Update CUDA version to 11.3 and setup proper environment variables. (#69383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69383

Test Plan:
TorchBench CI

RUN_TORCHBENCH: hf_Bert

Reviewed By: janeyx99

Differential Revision: D32845001

Pulled By: xuzhao9

fbshipit-source-id: 50dff742ad4786e4b4995bd9aa82629b2fc050c5
2021-12-03 16:12:29 -08:00
a3ca4c83a6 [PyTorch] Add torch::jit::toString(const Type&) (#66689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66689

Let's not take an extra refcount bump to stringify types.
ghstack-source-id: 144374720

Test Plan: CI

Reviewed By: suo

Differential Revision: D31691526

fbshipit-source-id: 673d632a83e6179c063530fdbc346c22d5f47d7c
2021-12-03 15:16:08 -08:00
855365e9c4 Clean up dead code (#69296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69296

remove a commented block of code that was accidentally checked in

Test Plan: no testable changes

Reviewed By: alanwaketan

Differential Revision: D32799197

fbshipit-source-id: d3eb05cbafb0f5a4a3f41c17f66ca6d0c2fc60b7
2021-12-03 15:11:38 -08:00
a813ddf5ec CUDACachingAllocator: make an error message more accurate. (#69174)
Summary:
The `TORCH_CHECK` asserts for strictly-greater-than `kLargeBuffer`,
but the exception claims `>=`. Fix the error message to match the
code.
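
For illustration, a minimal self-contained sketch of the mismatch (the surrounding allocator code is elided; `kLargeBuffer` and the message text here are placeholders):

```cpp
#include <cstddef>
#include <stdexcept>

constexpr size_t kLargeBuffer = 1 << 20;

void check_block_size(size_t size) {
  // The predicate is strictly-greater-than...
  if (!(size > kLargeBuffer)) {
    // ...so the message must not claim ">=".
    // Before: "expected size to be >= kLargeBuffer"
    // After: the message matches the predicate:
    throw std::runtime_error("expected size to be greater than kLargeBuffer");
  }
}

int main() { check_block_size(kLargeBuffer + 1); }
```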

Happy to open an issue if it's helpful; I'm hoping the trivial fix doesn't need a separate issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69174

Reviewed By: zou3519

Differential Revision: D32760055

Pulled By: H-Huang

fbshipit-source-id: 1a8ab68f36b326ed62d78afdcb198f4d6572d017
2021-12-03 15:04:59 -08:00
088a4feb41 Update the documentation for AMP with DataParallel (#69218)
Summary:
Following https://github.com/pytorch/pytorch/issues/60540 and pull request https://github.com/pytorch/pytorch/issues/43102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69218

Reviewed By: gchanan

Differential Revision: D32803814

Pulled By: ngimel

fbshipit-source-id: 06fdbbee2c7734153271be70ec4bc24263c8c367
2021-12-03 14:58:47 -08:00
80a67cd33c Limit uploading JSONs to trunk (#69385)
Summary:
Mac workflows on forked PRs don't have the right permissions to upload artifacts :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69385

Reviewed By: malfet, atalman

Differential Revision: D32843252

Pulled By: janeyx99

fbshipit-source-id: e137a6707fe46559771b9d77fbfe44b0a21c914a
2021-12-03 13:20:37 -08:00
a20b9f8d5c add HPU case for backend_to_string function (#69225)
Summary:
Change-Id: If8ed7f1161343a2e494d8b964576e1ee193007f7

Fixes https://github.com/pytorch/pytorch/issues/65609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69225

Reviewed By: gchanan

Differential Revision: D32804545

Pulled By: wconstab

fbshipit-source-id: bdf359bd779113153ebdecc515edba94e47e0ae4
2021-12-03 12:54:15 -08:00
6f7a5ddffc [SR] Use std::vector::reserve in GetLivenessMap (#68884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68884

This diff uses std::vector::reserve in GetLivenessMap to set the capacity of all local containers up front and avoid runtime resizing.

The changes should theoretically improve performance a little.
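
As a minimal sketch of the technique (illustrative; not the actual `GetLivenessMap` code), reserving capacity up front replaces the repeated reallocate-and-move cycles of a growing vector with a single allocation:

```cpp
#include <cstddef>
#include <vector>

int main() {
  const std::size_t num_values = 10000; // known ahead of time
  std::vector<int> liveness;
  liveness.reserve(num_values); // one allocation up front
  for (std::size_t i = 0; i < num_values; ++i) {
    liveness.push_back(static_cast<int>(i)); // never triggers a regrowth
  }
}
```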

Test Plan:
- [x] `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1`
- [x]
```
seq 1 10 | xargs -I{} ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/302008423_0.predictor.disagg.local \
--method_name=local_request_only.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=0 --warmup_iters=0 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/302008423_0.local_ro.inputs.recordio \
--recordio_use_ivalue_format=1
```

### Before
```
I1201 12:04:46.753311 2874563 PyTorchPredictorBenchLib.cpp:336] Took 10.9826 sec to initialize a predictor.
I1201 12:05:00.617139 2875780 PyTorchPredictorBenchLib.cpp:336] Took 11.1078 sec to initialize a predictor.
I1201 12:05:15.279667 2876813 PyTorchPredictorBenchLib.cpp:336] Took 11.7979 sec to initialize a predictor.
I1201 12:05:30.201207 2877554 PyTorchPredictorBenchLib.cpp:336] Took 11.8901 sec to initialize a predictor.
I1201 12:05:44.386926 2879713 PyTorchPredictorBenchLib.cpp:336] Took 11.2722 sec to initialize a predictor.
I1201 12:05:58.003582 2881426 PyTorchPredictorBenchLib.cpp:336] Took 10.8046 sec to initialize a predictor.
I1201 12:06:12.004778 2882604 PyTorchPredictorBenchLib.cpp:336] Took 11.2754 sec to initialize a predictor.
I1201 12:06:26.101241 2884888 PyTorchPredictorBenchLib.cpp:336] Took 11.3355 sec to initialize a predictor.
I1201 12:06:40.364817 2886572 PyTorchPredictorBenchLib.cpp:336] Took 11.401 sec to initialize a predictor.
I1201 12:06:54.483794 2888614 PyTorchPredictorBenchLib.cpp:336] Took 11.3498 sec to initialize a predictor.
```

### After
```
I1201 11:51:53.775239 2818391 PyTorchPredictorBenchLib.cpp:336] Took 10.9113 sec to initialize a predictor.
I1201 11:52:07.412720 2819530 PyTorchPredictorBenchLib.cpp:336] Took 10.8413 sec to initialize a predictor.
I1201 11:52:21.202816 2820359 PyTorchPredictorBenchLib.cpp:336] Took 11.0216 sec to initialize a predictor.
I1201 11:52:35.513288 2821029 PyTorchPredictorBenchLib.cpp:336] Took 11.4216 sec to initialize a predictor.
I1201 11:52:49.145979 2821930 PyTorchPredictorBenchLib.cpp:336] Took 10.8272 sec to initialize a predictor.
I1201 11:53:02.908790 2822859 PyTorchPredictorBenchLib.cpp:336] Took 11.0262 sec to initialize a predictor.
I1201 11:53:16.276015 2823657 PyTorchPredictorBenchLib.cpp:336] Took 10.6893 sec to initialize a predictor.
I1201 11:53:30.103283 2824382 PyTorchPredictorBenchLib.cpp:336] Took 11.1854 sec to initialize a predictor.
I1201 11:53:44.298514 2825365 PyTorchPredictorBenchLib.cpp:336] Took 11.4796 sec to initialize a predictor.
I1201 11:53:58.258708 2826128 PyTorchPredictorBenchLib.cpp:336] Took 11.2652 sec to initialize a predictor.
```

Reviewed By: swolchok

Differential Revision: D32649252

fbshipit-source-id: 5cd296d12b12e5b15e85e4f1a8a236e293f37f9c
2021-12-03 12:18:06 -08:00
ae11264583 Fixed type checking errors in node.py (#68124)
Summary:
Fixes [issue#67](https://github.com/MLH-Fellowship/pyre-check/issues/67)
This PR fixes the type checking errors in PyTorch's torch/fx/node.py.
The variables at 363:20 and 364:20 were declared with type `List[str]` but were assigned a value of `None`, causing an incompatible variable type error. Changing the type from `List[str]` to `Optional[List[str]]` fixes the error.

Signed-off-by: Onyemowo  Agbo
onionymous
0xedward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68124

Reviewed By: gmagogsfm

Differential Revision: D32322414

Pulled By: onionymous

fbshipit-source-id: be11bbbd463715ddf28a5ba78fb4adbf62878c80
2021-12-03 12:03:49 -08:00
6baaec30cd [DataPipe] Adding ShufflerMapDataPipe (#68606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68606

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32813290

Pulled By: NivekT

fbshipit-source-id: 8d1ebd5bc776563c23250f76a2efc1d395f1af9c
2021-12-03 11:36:33 -08:00
3e45739543 [PyTorch][JIT] Use stack.pop_back() instead of pop(stack) for DROP (#69326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69326

Looks like this really is slightly cheaper (see assembly diff screenshot in internal test plan). The problem is that `pop()` returns the value, so we have to spend instructions moving it out of the stack and then destroying it via a local.
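
A minimal sketch of the difference (with a stand-in `Value` type; the real interpreter uses `IValue`): `pop()` must materialize the popped value in a local that is destroyed separately, while `pop_back()` destroys the element in place.

```cpp
#include <utility>
#include <vector>

struct Value { /* refcounted payload elided */ };

Value pop(std::vector<Value>& stack) {
  Value v = std::move(stack.back()); // move into a local...
  stack.pop_back();
  return v; // ...which the caller then has to destroy
}

void drop_via_pop(std::vector<Value>& stack) {
  pop(stack); // extra move plus a separate destruction of the temporary
}

void drop_via_pop_back(std::vector<Value>& stack) {
  stack.pop_back(); // destroys the element in place
}

int main() {
  std::vector<Value> stack(4);
  drop_via_pop(stack);
  drop_via_pop_back(stack);
}
```
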
ghstack-source-id: 144641680

Test Plan:
{F684148304}

CI

Reviewed By: zhxchen17

Differential Revision: D32812841

fbshipit-source-id: e9e43458d3364842f67edd43e43575a1f72e3cb0
2021-12-03 11:09:05 -08:00
2c84b010e6 [PyTorch] Use toObjectRef in JIT interpreter (#69324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69324

This slightly shrinks runImpl.

Before:
- Move pointer out of IValue
- Clear the IValue to none
- Do our thing with the Object
- destroy the intrusive_ptr on the C stack
- destroy the IValue on the C stack (even though it was cleared to None, the destructor has to run anyway)

After:
- Grab the pointer out of IValue
- Do our thing with the Object
- Decref the pointer in the IValue on the JIT stack as we assign over it

We should be saving at least the memory traffic from clearing the IValue and possibly the dtor code as well.
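
A hedged sketch of the two access patterns, using a toy refcounted `IValue` (the real one is far more elaborate): moving the object out clears the IValue and leaves a local to destroy, while borrowing a reference touches no refcounts.

```cpp
#include <memory>
#include <utility>

struct Object { int x = 0; };

struct IValue {
  std::shared_ptr<Object> obj = std::make_shared<Object>();
  std::shared_ptr<Object> toObject() && { return std::move(obj); } // clears *this
  Object& toObjectRef() { return *obj; } // borrows; no refcount traffic
};

void before(IValue& iv) {
  auto o = std::move(iv).toObject(); // iv now holds nullptr
  o->x += 1;
} // `o` destroyed here: an extra decref

void after(IValue& iv) {
  Object& o = iv.toObjectRef(); // iv still owns the object
  o.x += 1;
}

int main() {
  IValue a, b;
  before(a);
  after(b);
}
```
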
ghstack-source-id: 144638920

Test Plan:
Inspected assembly to verify shorter runImpl

Tried to microbenchmark (D32809454) but can't show a difference.

Reviewed By: gchanan

Differential Revision: D32812252

fbshipit-source-id: a3689f061ee51ef01e4696bd4c6ffcbc41c30af5
2021-12-03 11:07:16 -08:00
5a480831e6 .github: Propagate WITH_PUSH to docs jobs (#69372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69372

Docs weren't getting push since this variable wasn't getting propagated
to the docker container

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32837012

Pulled By: seemethere

fbshipit-source-id: 5074d5266a567df2230981186cabffb53c01c634
2021-12-03 11:00:38 -08:00
8f8524a447 Expose is_metal_available in header (#68942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68942

Currently, `at::native::is_metal_available()` is implemented, but it's not exposed in the header, so nobody can use it. It's a useful function and I want to use it, so this change exposes it in the header.

Test Plan: CI

Reviewed By: sodastsai, xta0

Differential Revision: D32675236

fbshipit-source-id: b4e692db7d171dfb872d5c2233cc808d7131f2e9
2021-12-03 10:31:03 -08:00
77ca153d3e Remove columns and ones from slow2d transpose signatures (#68898)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68898

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32655873

Pulled By: jbschlosser

fbshipit-source-id: 810035a745e3851bd5326459b563e4796a074a65
2021-12-03 09:56:18 -08:00
7ca2da14e9 Remove finput and fgrad_input from slow3d signatures (#68897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68897

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32655875

Pulled By: jbschlosser

fbshipit-source-id: 8d04968b2df47e11da1eceb1612d55d00768eeb4
2021-12-03 09:55:02 -08:00
73d2ca20e0 .github: Add credentials for macos test jobs (#69371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69371

macOS jobs need credentials to upload their test stats

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32836893

Pulled By: seemethere

fbshipit-source-id: 0f5a8f1b35f4240d57b08a2120a97a13ba3b3de5
2021-12-03 09:43:41 -08:00
6ed7354435 [SR][Code cleanup] Typedef/default for kwargs (#69164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69164

We have lots of methods that take `std::unordered_map<std::string, c10::IValue>` now. That's kind of ugly and cumbersome to type, so add a `KWargs` typedef.

Also made the `operator()` default `kwargs` to empty. Note that we could have another overload that doesn't take `kwargs` at all, but the perf gain is so minuscule it's probably not worth it.
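
A minimal sketch of the typedef and the defaulted parameter described above (the surrounding runtime class is a stand-in):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct IValue {}; // stand-in for c10::IValue

using KWargs = std::unordered_map<std::string, IValue>;

struct Runtime {
  std::vector<IValue> operator()(
      const std::vector<IValue>& args,
      const KWargs& kwargs = KWargs()) {
    (void)kwargs; // real dispatch elided
    return args;
  }
};

int main() {
  Runtime run;
  run({IValue{}});           // kwargs defaults to empty
  run({IValue{}}, KWargs{}); // explicit empty kwargs
}
```
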
ghstack-source-id: 144691899

Test Plan: CI

Reviewed By: d1jang

Differential Revision: D32734677

fbshipit-source-id: 8d6496a6d1ec2dc71253151d2f6408f1387966cf
2021-12-03 09:27:37 -08:00
b761172406 Revert D32786909: [C10D] [Easy] Use pinned memory for HtoD copies in Reducer:: sync_bucket_indices
Test Plan: revert-hammer

Differential Revision:
D32786909 (dbc8d9c947)

Original commit changeset: a53f96f57e67

fbshipit-source-id: 19599c3a489804bfdbb3062f4256dceb680c143b
2021-12-03 08:31:45 -08:00
e0fb228e03 Revert of adding windows CUDA 11.5 workflow (#69365)
Summary:
This is a partial revert of bb522c9d7a, removing the failing CUDA 11.5 Windows workflows it added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69365

Reviewed By: suo

Differential Revision: D32831418

Pulled By: atalman

fbshipit-source-id: 184346d22623f88594312a4ce2e4d29cc67e8338
2021-12-03 08:00:16 -08:00
21919be96b CMake: Update precompiled header and fix support (#67851)
Summary:
This fixes the `USE_PRECOMPILED_HEADERS` cmake version check which was accidentally inverted, so it was always disabled.

I've also made the precompiled header so it only includes headers used in 95% or more of code, weighted by compile time. This limits it to the standard library, `c10` and a limited subset of `ATen/core`. Crucially, the new pch doesn't depend on `native_functions.yaml` so won't cause as much unnecessary rebuilding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67851

Reviewed By: zou3519

Differential Revision: D32290902

Pulled By: dagitses

fbshipit-source-id: dfc33330028c99b02ff40963926c1f1260d00d00
2021-12-03 06:51:56 -08:00
cc46dc45e1 [SR] Factor logic that determines managed tensors out of MemoryPlanner (#68295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68295

There's no reason we can't figure out what tensors we need to manage at model load time. It's also useful to have the set of ranges available at load time for integrating the ranges algorithm introduced in the previous diff.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D32400593

fbshipit-source-id: 0466b2641166ddc9c14f72774f4ba151407be400
2021-12-03 04:45:27 -08:00
276cb8f501 [Pytorch Edge] Make Tracer support xirp metal segmentation (#69328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69328

Aten_metal_prepack is cpp based and can be safely included here.

Test Plan: "Traced" the xirp model with the script.

Reviewed By: xta0

Differential Revision: D32813686

fbshipit-source-id: 7a428151348dc9d3f576531701926d6b3413de3d
2021-12-02 22:16:19 -08:00
a07ffe8d0e Add OpInfos for combinations, cartesian_prod, sum_to_size, ldexp, as_strided (#68853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68853

Reviewed By: davidberard98

Differential Revision: D32811147

Pulled By: saketh-are

fbshipit-source-id: 941dcf949072f8d10faf4d5a0fa0ef409ac6a7db
2021-12-02 21:22:56 -08:00
834bd3134e Back out "Add efficient zero tensors" (#69327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69327

Original commit changeset: d44096d88265

Original Phabricator Diff: D32144240 (668574af4a)

Test Plan:
CI

original diff failed 175 builds in CI

Reviewed By: airboyang, anjali411

Differential Revision: D32809407

fbshipit-source-id: c7c8e69bcee0274992e2d5da901f035332e60071
2021-12-02 19:11:41 -08:00
c572a603a6 fix for python 3.10 for gradient opinfo (#68113)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/67612 by creating a tensor first and then converting the dtype explicitly with a `.to(dtype)` call.

Looking forward to your feedback and suggestions on this.

cc: kshitij12345 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68113

Reviewed By: zou3519

Differential Revision: D32797329

Pulled By: saketh-are

fbshipit-source-id: 5c34709ab277c82cda316a3ea1cf01e853e4c38b
2021-12-02 19:01:01 -08:00
572c3e3118 Fix some usages of CUDA_VERSION (#69092)
Summary:
See https://pytorch.slack.com/archives/G4Z791LL8/p1638229956006300

I grepped c10, aten, and torch for CUDA_VERSION and checked the usages I saw.
I can't guarantee I made a clean sweep, but this improves the status quo.

cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69092

Reviewed By: zou3519

Differential Revision: D32786919

Pulled By: ngimel

fbshipit-source-id: 1d29827dca246f33118d81e136252ddb5bf3830f
2021-12-02 18:32:47 -08:00
dbc8d9c947 [C10D] [Easy] Use pinned memory for HtoD copies in Reducer:: sync_bucket_indices (#69298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69298

I was exploring adding an invariant that we actually use properly-tracked pinned memory when doing non-blocking copies (to plug various correctness holes), and found this case where we allocate a tensor without pinned memory and then copy it with non_blocking=True.
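
A hedged LibTorch sketch of the pattern being fixed (the actual Reducer code paths differ): a `non_blocking` host-to-device copy only behaves asynchronously, and is only properly trackable, when the host buffer is pinned.

```cpp
#include <torch/torch.h>

int main() {
  if (!torch::cuda::is_available()) return 0;
  auto indices = torch::arange(1024); // pageable CPU memory
  // Problem pattern: a pageable source defeats the point of
  // non_blocking=true and is harder to track for correctness.
  auto d_bad = indices.to(torch::kCUDA, /*non_blocking=*/true);
  // Fixed pattern: stage through pinned host memory first.
  auto pinned = indices.pin_memory();
  auto d_ok = pinned.to(torch::kCUDA, /*non_blocking=*/true);
}
```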

Test Plan: Unit tests cover this code.

Reviewed By: rohan-varma

Differential Revision: D32786909

fbshipit-source-id: a53f96f57e6727238e4cd2164c1a0f04cf270413
2021-12-02 17:34:34 -08:00
e2c7bd08b9 Add operator div (#68528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68528

Add the div operator converter. torch.floor_div is announced to be deprecated by PyTorch; consider removing this after PyTorch completes the deprecation.

Reviewed By: 842974287

Differential Revision: D32497573

fbshipit-source-id: d06c864077f745c295c33fb25639b7116f85ca20
2021-12-02 17:25:40 -08:00
bede18b061 Add support for C++ frontend wrapper on Linux (#69094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69094

Partially addresses https://github.com/pytorch/pytorch/issues/68768

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D32730079

Pulled By: malfet

fbshipit-source-id: 854e4215ff66e087bdf354fed7a17e87f2649c87
2021-12-02 16:47:00 -08:00
33c3c539b6 THPStorage: Prefer intrusive_ptr over owning raw pointers (#69248)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69248

Reviewed By: zou3519

Differential Revision: D32771035

Pulled By: ngimel

fbshipit-source-id: cf9bbcc5563ae9715ecf13631ba56c32240e59e3
2021-12-02 16:33:03 -08:00
9f39a2de0a [fix] add range check for k kthvalue_cpu (#68863)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68813

Long-term, it might make more sense to port it to structured.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68863

Reviewed By: H-Huang

Differential Revision: D32749372

Pulled By: mruberry

fbshipit-source-id: 85a1b2a31e922ff1df0d0f3f576ad219e652aa49
2021-12-02 15:33:06 -08:00
cc85b68984 .github: Fix ci workflows generation (#69329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69329

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32814709

Pulled By: seemethere

fbshipit-source-id: ea83aa0319bebb65623856ca9e34689581dd517b
2021-12-02 15:28:59 -08:00
f786b03f98 ci: Migrate docs push to GHA (#69172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69172

Migrates the docs push jobs to GitHub Actions by implementing a simple
WITH_PUSH switch to do the actual push.

Adds 2 new workflows for GHA:
* linux-docs (on trunk)
* linux-docs-push (on schedule)

linux-docs-push is the only workflow that actually gets access to
credentials so it should be relatively safe.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32767239

Pulled By: seemethere

fbshipit-source-id: 5b100f986cf4023c323f4f96f0fe7942fec49ad2
2021-12-02 15:06:57 -08:00
db5425bcd2 re-enable layer_norm in autodiff (#69007)
Summary:
Turn on layer_norm in autodiff

https://github.com/pytorch/pytorch/issues/67732 should have fixed the issue previously exposed by enabling layer_norm in autodiff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69007

Reviewed By: soulitzer

Differential Revision: D32699108

Pulled By: eellison

fbshipit-source-id: 6951668c0e74e056d3776294f4e1fd3123c763e5
2021-12-02 14:55:00 -08:00
5b2586fe09 [testing] Ignore expected_regex in assertRaisesRegex for non-native device (#68723)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68723

Reviewed By: zou3519

Differential Revision: D32797061

Pulled By: mruberry

fbshipit-source-id: 3bcae6d3d62d180059dbe39be520b0e7f9aea19f
2021-12-02 14:52:27 -08:00
36ba1b6b3a Remove unused _convolution_nogroup op (#68829)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68829

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD

Differential Revision: D32627578

Pulled By: jbschlosser

fbshipit-source-id: 8a4c0ac58aae184a465b1fd40cce880a60d67339
2021-12-02 14:42:08 -08:00
791d5087ed [TensorExpr] Add lowerings for quantized ops: cat, mul, conv1d, relu. (#69055)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69055

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32710325

Pulled By: ZolotukhinM

fbshipit-source-id: 4a7f0ac059ea238463317b6a45a822b8d05610dd
2021-12-02 14:34:21 -08:00
83c4451f60 [TensorExpr] Add a pass to symbolize an input dimension. (#68857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68857

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32632908

Pulled By: ZolotukhinM

fbshipit-source-id: bcee95d83731fcea07ec2f55ed78418ee52f51b6
2021-12-02 14:34:18 -08:00
1e9dcdd2a0 [TensorExpr] TensorExprKernel: support custom-class constants. (#68856)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68856

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32632907

Pulled By: ZolotukhinM

fbshipit-source-id: e4180f8d791ba0cdf82bcb3bd11b61405c2faadd
2021-12-02 14:34:15 -08:00
48d7d585c8 [TensorExpr] IR Eval: add more logging. (#68855)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68855

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32632905

Pulled By: ZolotukhinM

fbshipit-source-id: fef9b019d8d5b8a3ffd4075bfac069d1c81f647d
2021-12-02 14:34:12 -08:00
b6bcf5a0f1 [TensorExpr] Un-const TEK::kernel_func_name to allow recompilation. (#68854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68854

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32632904

Pulled By: ZolotukhinM

fbshipit-source-id: 154e3802ba844e738f09dbc239cf3656b9f8d5fd
2021-12-02 14:33:02 -08:00
a0367f8980 Revert D32404517: [quant][embedding qat] Support Embedding QAT via FX API
Test Plan: revert-hammer

Differential Revision:
D32404517 (abda069ce2)

Original commit changeset: 0484df8c826b

fbshipit-source-id: 4e7d62b9ccdb84eb4d184cd0b3c9506013fd8336
2021-12-02 14:28:35 -08:00
ec4c749024 Revert D32318435: [quant][embdding qat] Add FX support for QAT EmbeddingBag
Test Plan: revert-hammer

Differential Revision:
D32318435 (4484c04513)

Original commit changeset: 8b5d1a5d5422

fbshipit-source-id: e46d431f92a5c3f86c757695164d1eb5b0041298
2021-12-02 14:27:17 -08:00
8dafe6e147 Forward fix merge conflict (#69319)
Summary:
Forward fixes a merge conflict between two commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69319

Reviewed By: seemethere

Differential Revision: D32810884

Pulled By: janeyx99

fbshipit-source-id: 6e2f9fc89d00da979de1430a172673e82c51ba14
2021-12-02 14:05:54 -08:00
52219b1017 Fix ChainedScheduler.get_last_lr() (#69112)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68820

cc vincentqb jbschlosser albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69112

Reviewed By: zou3519

Differential Revision: D32796626

Pulled By: albanD

fbshipit-source-id: bde9d4e473527be4c0a7f21cb57f795a67a99eaa
2021-12-02 13:44:12 -08:00
db30696be8 [pytorch][PR] bug fix for D32374003 (#69278)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69278

Test Plan:
```
fbpkg build -E smart.inference_platform_sp.sigrid_predictor.persistent.bolt --yes
```

Reviewed By: kimishpatel, HDCharles

Differential Revision: D32773910

fbshipit-source-id: a2181fea354f310cf9f6f57b802dc4a148627acc
2021-12-02 13:31:19 -08:00
915c26f588 GHA: preserve downloaded JSONs as artifacts (#69258)
Summary:
Preserves the .json files in the test folder for every test job as an artifact.

Going to hud.pytorch.org/pr/69258 and downloading/unzipping any of the `test-jsons-*.zip` shows that .pytorch-slow-tests.json and .pytorch-disabled-tests.json exist. (Though you won't see them in your file manager as they are hidden files.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69258

Reviewed By: seemethere

Differential Revision: D32807102

Pulled By: janeyx99

fbshipit-source-id: ed1b227cdd32160ed045dd79a7edc55216dcfe53
2021-12-02 13:26:14 -08:00
cafcf599d0 Deprecate torch.triangular_solve (#63570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63570

There is a use of `at::triangular_solve_out` in the file
`torch/csrc/jit/tensorexpr/external_functions.cpp` that I have not dared
to move to `at::linalg_solve_triangular_out`.

**Deprecation note:**

This PR deprecates the `torch.triangular_solve` function in favor of
`torch.linalg.solve_triangular`. An upgrade guide is added to the
documentation for `torch.triangular_solve`.

Note that it DOES NOT remove `torch.triangular_solve`, but
`torch.triangular_solve` will be removed in a future PyTorch release.
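
A hedged migration sketch at the ATen level, based on the entry points named above (the exact C++ signatures are an assumption; note the swapped argument order, and that the new function returns only the solution rather than a `(solution, cloned_coefficient)` pair):

```cpp
#include <ATen/ATen.h>
#include <tuple>

int main() {
  auto A = at::randn({3, 3}).triu(); // upper-triangular coefficient matrix
  auto B = at::randn({3, 2});
  // Deprecated: triangular_solve(B, A) -> (solution, cloned_coefficient)
  auto old_solution = std::get<0>(at::triangular_solve(B, A, /*upper=*/true));
  // Replacement: linalg_solve_triangular(A, B) -> solution
  auto new_solution = at::linalg_solve_triangular(
      A, B, /*upper=*/true, /*left=*/true, /*unitriangular=*/false);
}
```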

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32618035

Pulled By: anjali411

fbshipit-source-id: 0bfb48eeb6d96eff3e96e8a14818268cceb93c83
2021-12-02 13:24:55 -08:00
dde801686b Expose MobileCode to python (#66592)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66592

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D31632600

Pulled By: tugsbayasgalan

fbshipit-source-id: 46a7ac20ddb6b433bd037280ed020481901a15c9
2021-12-02 13:18:46 -08:00
bb522c9d7a Enabling CUDA 11.5 for binary builds, Adding windows workflows for CUDA 11.5 (#69262)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69262

Reviewed By: malfet

Differential Revision: D32804850

Pulled By: atalman

fbshipit-source-id: abac45ad1d49ec7e0e7df6cb9a22a46fbcd905a2
2021-12-02 13:04:43 -08:00
f587267dc7 Revert D31705359: use irange for loops 8
Test Plan: revert-hammer

Differential Revision:
D31705359 (17e5200441)

Original commit changeset: c9ea2fbc0f9c

fbshipit-source-id: 08fff2d12beca953ad30dd0baabf86e39ac84f14
2021-12-02 12:55:08 -08:00
97750e03a4 [torch][edge] Add int to the copy kernel. (#69297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69297

.

Test Plan: CI

Reviewed By: JacobSzwejbka

Differential Revision: D32799822

fbshipit-source-id: c40fdb55a706b3a8eccaa69dbfbc6d7af0b111e4
2021-12-02 12:13:58 -08:00
7142b0b033 .github: Add linux.large to actionlint.yaml (#69304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69304

Don't know why this isn't automatically figured out

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: anjali411, atalman, janeyx99

Differential Revision: D32805380

Pulled By: seemethere

fbshipit-source-id: 2c4805f87ae91388a6b605a6394024887b4bc71e
2021-12-02 11:21:49 -08:00
4056251a18 Add missing spaces to an error message (#69289)
Summary:
Before:
`ValueError: InstanceNorm1d returns 0-filled tensor to 2D tensor.This is because InstanceNorm1d reshapes inputs to(1, N * C, ...) from (N, C,...) and this makesvariances 0.`

After:
`ValueError: InstanceNorm1d returns 0-filled tensor to 2D tensor. This is because InstanceNorm1d reshapes inputs to (1, N * C, ...) from (N, C,...) and this makes variances 0.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69289

Reviewed By: jbschlosser

Differential Revision: D32796035

Pulled By: albanD

fbshipit-source-id: c8e7c5cf6e961ec5f7242b31c7808454104cde02
2021-12-02 11:03:33 -08:00
2ea70a6462 Allow Union of scalars to be NumberType (#66591)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66591

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D31632599

Pulled By: tugsbayasgalan

fbshipit-source-id: 374065da1d91334a19c15c604faf13ebec1681f6
2021-12-02 10:52:02 -08:00
d673b1ec59 .github: Switch ciflow-should-run to self hosted (#69166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69166

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32735493

Pulled By: seemethere

fbshipit-source-id: 9a03cf5245d1dbfe1be86cfbb3f5d1d42dd391c8
2021-12-02 10:42:07 -08:00
14ed4df305 [PyTorch][Static Runtime][easy] give to_copy_functor a name (#67701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67701

I split this out to ease rebasing and review.
ghstack-source-id: 144507288

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D32112523

fbshipit-source-id: dba14e6ada33df02dbcd7025b090a8a18cf438ae
2021-12-02 10:36:26 -08:00
21686923e8 [PyTorch][SR] More debug logging (#67220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67220

Specifically we log AliasDb and same_storage_values, and are chattier about the aliasing logs in the liveness analysis.
ghstack-source-id: 144507289

Test Plan: Used to help develop D31776259

Reviewed By: hlu1

Differential Revision: D31847561

fbshipit-source-id: 8371455d060c17dace91cd90e4034b7618f820a6
2021-12-02 10:36:23 -08:00
b22e4d4aea [PyTorch][SR] Add more to() tests & extend debug logging in testStaticRuntime (#67219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219

I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2.
ghstack-source-id: 144507287

Test Plan: Used during development of D31776259

Reviewed By: hlu1

Differential Revision: D31847566

fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3
2021-12-02 10:34:54 -08:00
84aa1ddedd [quant] Remove warning for quantized Tensor in __dir__ (#69265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69265

This is used in tab completion, so we should not emit a warning here.

Test Plan:
ci

Imported from OSS

Reviewed By: albanD

Differential Revision: D32778736

fbshipit-source-id: f1bec5e09a8238ab41329ac2b64e6f3267799f6a
2021-12-02 10:30:36 -08:00
17e5200441 use irange for loops 8 (#66743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705359

fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
2021-12-02 10:21:29 -08:00
ff3fc37267 [BE] rewrite ProcessGroupNCCLTest to be MultiProcess (#67705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67705

This PR rewrites ProcessGroupNCCLTest to be a MultiProcessTestCase. It was originally written in a single-process multi-GPU fashion; we change it to multi-process to align with other c10d tests.
ghstack-source-id: 144555092

Test Plan: wait for CI

Reviewed By: pritamdamania87, fduwjj

Differential Revision: D32113626

fbshipit-source-id: 613d36aeae36bf441de1c2c83aa4755f4d33df4d
2021-12-02 10:12:05 -08:00
5c816520b3 ns for fx: fix bug in graph matcher (#69238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69238

The NS for FX graph matcher was not properly taking into account
seen_nodes; this allowed a node to be matched twice.

Test Plan:
FB-only testing on real model passes.

Ideally we would have a test case to capture this, but hopefully we can land this soon to unblock production work.

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D32765761

fbshipit-source-id: ed3dff8fd981e399a649fcd406883b4d56cc712a
2021-12-02 09:59:57 -08:00
698c35e743 Add functorch TLS to ATen/ThreadLocalState (#69181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69181

functorch lives out-of-tree. However, it has some TLS that needs to be
propagated. The solution for that is we store a pointer to the TLS
inside pytorch/pytorch and extend FuncTorchTLSBase inside functorch to
include whatever functorch needs.

A previous solution used ThreadLocalDebugInfo. However, all
PyTorch-managed threads (e.g. those spawned by Autograd) receive a
shared_ptr that points to the same ThreadLocalDebugInfo. This leads to
race conditions if multiple threads start modifying the TLS
stored within ThreadLocalDebugInfo without using mutexes.

Test Plan:
- tested with functorch
- The performance impact of this change when functorch is not used is
negligible because we end up manipulating nullptrs.

Reviewed By: albanD

Differential Revision: D32742312

Pulled By: zou3519

fbshipit-source-id: 1a8439a4af06b3d3e50b9a2dbca98a0ba612062a
2021-12-02 09:29:55 -08:00
0de7a618a3 functionalization: update is_aliased() logic (#68881)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68881

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32647614

Pulled By: bdhirsh

fbshipit-source-id: 6bec50d3e54419d1707d0b6c0c6729bcc1ced1f0
2021-12-02 09:19:12 -08:00
4484c04513 [quant][embdding qat] Add FX support for QAT EmbeddingBag (#68121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68121

Add FX support for QAT EmbeddingBag operator, previously only eager mode support.

Test Plan:
pytest test/quantization/fx/test_quantize_fx.py  -v -k "test_qat_embeddingbag_linear"

Imported from OSS

Reviewed By: supriyar

Differential Revision: D32318435

fbshipit-source-id: 8b5d1a5d5422972c49676f9e470d5fbe29dd503b
2021-12-02 09:05:07 -08:00
78ab3cde4a Do not modify type map from getCustomClassTypeImpl() (#69261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69261

As this function is supposed to be called only once per type from
caching getCustomClassType template

Test Plan: Imported from OSS

Reviewed By: suo, lw

Differential Revision: D32776564

Pulled By: malfet

fbshipit-source-id: 218436657e6ad5ad0c87964857114d1e60c57140
2021-12-02 08:53:09 -08:00
113684cf81 Fix crash in checkCustomClassType if arg is null (#69259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69259

Otherwise `checkCustomClassType(nullptr, new Type())` will crash

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D32775297

Pulled By: malfet

fbshipit-source-id: 54b10fd395d734c615dcaf85a5e599a445cee663
2021-12-02 08:51:59 -08:00
668574af4a Add efficient zero tensors (#64837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32144240

Pulled By: anjali411

fbshipit-source-id: d44096d882657c7f9270a16636900e0b73cefa40
2021-12-02 08:47:45 -08:00
abda069ce2 [quant][embedding qat] Support Embedding QAT via FX API (#68296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68296

Support the QAT workflow via the torch.fx QAT APIs, e.g. `prepare_qat_fx` and `convert_fx`.
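
A rough sketch of the intended flow; the qconfig name (`default_embedding_qat_qconfig`) and the qconfig_dict-style `prepare_qat_fx` signature are assumptions and vary across releases:

```python
import torch.nn as nn
from torch.ao.quantization import default_embedding_qat_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_qat_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(self.emb(x))

m = M().train()
# Apply the embedding QAT qconfig only to Embedding modules.
qconfig_dict = {"object_type": [(nn.Embedding, default_embedding_qat_qconfig)]}
prepared = prepare_qat_fx(m, qconfig_dict)
# ... run the QAT training loop on `prepared` ...
quantized = convert_fx(prepared.eval())
```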

Test Plan:
`pytest test/quantization/fx/test_quantize_fx.py -v -k "test_qat_embedding_linear"`

Imported from OSS

Reviewed By: jingsh, supriyar

Differential Revision: D32404517

fbshipit-source-id: 0484df8c826b823b60dfecd9def77bf8cffe0527
2021-12-02 08:42:45 -08:00
3157371bb4 [quant][embedding qat] Fix bug enforcing quant_min <= zero_point <= quant_max for float zeropoint (#68852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68852

When using a float zero_point in FakeQuant, such as for embeddings, it does not need to be between
quant_min and quant_max, as is enforced for integer zero_points.

This is because float zero_points are formulated as per:

```
Xq = Round(Xf * inv_scale + zero_point)
   = Round((Xf - min) * inv_scale),  with zero_point = -min * inv_scale
```
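
A small numeric sketch of that formulation (the 8-bit range here is assumed purely for illustration); note the zero_point is a float and is not clamped to [quant_min, quant_max]:

```python
import torch

x = torch.tensor([-0.5, 0.0, 1.5])
xmin, xmax = x.min(), x.max()
scale = (xmax - xmin) / 255        # assumed 8-bit quantization range
inv_scale = 1.0 / scale
zero_point = -xmin * inv_scale     # float zero_point
print(torch.round(x * inv_scale + zero_point))
print(torch.round((x - xmin) * inv_scale))  # same result
```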

Test Plan:
pytest test/test_quantization.py -v -k "test_fake_quant_per_channel_qparam_range"

Imported from OSS

Reviewed By: supriyar

Differential Revision: D32645014

fbshipit-source-id: 96dc3ca6eef9cee60be6919fceef95c9f2759891
2021-12-02 07:58:03 -08:00
397183f44c Add Lazy Tensor codegen infra (#69020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69020

Merges the lazy tensor codegen infra which has already been used on lazy_tensor_staging.

Test Plan: Test via lazy_tensor_staging branch

Reviewed By: alanwaketan, bdhirsh

Differential Revision: D32570613

fbshipit-source-id: 2cd5698644398bda69669683f8de79fd3b6639b5
2021-12-02 07:51:52 -08:00
28c519961f Follow the undefined Tensor <-> None rule better in torch dispatch (#67793)
Summary:
As per title. This in particular allows to more easily override backward function for which the underlying backend returns `None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67793

Reviewed By: zou3519

Differential Revision: D32242962

Pulled By: albanD

fbshipit-source-id: 6e114def90ee9499161e1303d301ba7fd003ff89
2021-12-02 07:46:56 -08:00
0465f64bb8 [DataPipe] Adding BatcherMapDataPipe (#68197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68197

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32440963

Pulled By: NivekT

fbshipit-source-id: 277cbe8d735afe341a7c189be20e1d334ecf9d4a
2021-12-02 07:27:17 -08:00
00ebbd5ef6 Revert D32010095: [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer
Test Plan: revert-hammer

Differential Revision:
D32010095 (41d35dc201)

Original commit changeset: d763b0557780

fbshipit-source-id: bf746a0389135c9f5f67f00f449435ce08fb5f6d
2021-12-02 06:41:40 -08:00
ed3b73fd4d [Static Runtime] Skip ProcessedNode::verify_no_memory_overlap() for out variants (#68639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639

Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`
- Only enable this check for native and fallback ops that are not inplace or view ops
- Enable ProcessedNode::verify_no_memory_overlap() in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas

fb::expand_dims's schema was not correct after this check is re-enabled. It's fixed in D32556204 (39ab417107)

Reviewed By: mikeiovine

Differential Revision: D32553708

fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
2021-12-02 05:03:12 -08:00
c60232d89a [shard] add back init_from_local_shards_and_global_metadata API (#69226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69226

This adds back the previous init_from_local_shards API, renamed to init_from_local_shards_and_global_metadata. It's a partial revert of D32147888 (35712a8eb4). We now provide two APIs:
1. `init_from_local_shards`: users don't need to provide global metadata; we do an all_gather under the hood.
2. `init_from_local_shards_and_global_metadata`: users need to explicitly construct ShardedTensorMetadata to use this API and must ensure correctness on all ranks, as there's no cross-rank communication/validation.

Both APIs stay private until they stabilize and the UX is proven. The second one can only be called on the `ShardedTensor` class directly, and is not included as a package API for now.

Test Plan:
test_init_from_local_shards_and_global_metadata
test_init_from_local_shards_and_global_metadata_invalid_shards

Reviewed By: dstaay-fb, pritamdamania87

Differential Revision: D32746882

fbshipit-source-id: bafd26ce16c02e2095907f9e59984a5d775c7df5
2021-12-02 01:02:56 -08:00
12621c3a39 support pure fp16 training in FSDP (#68417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68417

1. Since parameter attributes are lazily initialized at the beginning of forward, it makes more sense to init full_param_padded using the parameters' data type during lazy_init rather than during construction, as the data type may be changed after construction and before the training loop.
2. Add a check for whether parameter storage is changed outside FSDP, and handle it properly.
ghstack-source-id: 144479019

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D32458643

fbshipit-source-id: 0e07e5e08270f2e265e8f49124a6648641e42e7a
2021-12-02 00:27:45 -08:00
41d35dc201 Add ability for a mobile::Module to save as flatbuffer (#67351)
Summary:
Included functions:

* save_mobile_module -> saves a mobile::Module to flatbuffer
* load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
* parse_mobile_module -> parses from bytes or deserialized flatbuffer
      Module object

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67351

Reviewed By: iseeyuan

Differential Revision: D32010095

Pulled By: qihqi

fbshipit-source-id: d763b0557780f7c2661b6485105b045e41a5e8f1
2021-12-01 23:58:15 -08:00
40fb28ea87 [JIT] Compute input sym shapes in partial eval graph (#68281)
Summary:
Needed for NNC dynamic shape fusion. Previously, when creating a partially evaluated graph for symbolic shape computation, if an input wasn't used we wouldn't compute it, which led to failures when NNC expected this value to be passed in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68281

Reviewed By: navahgar

Differential Revision: D32401365

Pulled By: eellison

fbshipit-source-id: 97a684e5f1faed5df77c8fd69f9623cdba0781f9
2021-12-01 22:33:35 -08:00
d8a44270d6 [DataPipe] Simplify BatcherIterDataPipe by removing 'unbatch_level' argument and functionality (#68594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68594

Based on my conversation with ejguan [here](https://github.com/pytorch/pytorch/pull/68197#pullrequestreview-809148827), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well.

I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either.

This PR simplifies the DataPipe by removing the argument.
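
A minimal sketch of the recommended composition (`IterableWrapper` and the functional `.unbatch()`/`.batch()` forms assumed available):

```python
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper([[0, 1], [2, 3], [4, 5]])
# Flatten one level of batching first, then form new batches of 3.
rebatched = dp.unbatch().batch(batch_size=3)
print(list(rebatched))  # [[0, 1, 2], [3, 4, 5]] (batches may be DataChunk lists)
```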

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32532594

Pulled By: NivekT

fbshipit-source-id: 7276ce76ba2a3f207c9dfa58803a48e320adefed
2021-12-01 22:00:31 -08:00
ad182479b0 [deploy] docs (#69251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69251

This adds some actual documentation for deploy, which is probably useful
since we told everyone it was experimentally available so they will
probably be looking at what the heck it is.

It also wires up various components of the OSS build to actually work
when used from an external project.

Differential Revision: D32783312

Test Plan: Imported from OSS

Reviewed By: wconstab

Pulled By: suo

fbshipit-source-id: c5c0a1e3f80fa273b5a70c13ba81733cb8d2c8f8
2021-12-01 21:55:18 -08:00
cbe0a38d8c Back out "[CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer" (#69193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69193

Reviewed By: xing-liu, yuchenhao

Differential Revision: D32748570

fbshipit-source-id: bd73d7567f94c70daeace49d4081381b8adf2d77
2021-12-01 19:30:08 -08:00
929f2a750a Back out "[CUDA Pinned Memory] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability" (#69191)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69191

Reviewed By: xing-liu, yuchenhao

Differential Revision: D32748466

fbshipit-source-id: 6abd3265e8a20270305da3f8be25114ad4d67fc2
2021-12-01 19:28:57 -08:00
370d0afc1b Strided masked var. (#68738)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68738

Test Plan: Imported from OSS

Reviewed By: davidberard98

Differential Revision: D32767155

Pulled By: cpuhrsch

fbshipit-source-id: a5c095103405fbfc28b9f4fd624bdbbc45e7f715
2021-12-01 19:19:37 -08:00
291e56eda4 [Pytorch Edge] Update Black Box Api with operator versioning (#68678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68678

Test Plan: I'll update the unit test before landing.

Reviewed By: cccclai

Differential Revision: D32573603

fbshipit-source-id: 19271bcbb68b61d24d6943e61a943f4f75fddb5d
2021-12-01 19:13:32 -08:00
b9738e923e [Operator Versioning][Edge] Add old models and unittest (#67726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67726

1. Check in one model with the old aten::div_tensor op, with unit tests in both C++ and Python. The following two lines are commented out and expected to work after using the upgrader.
```
_helper(mobile_module_v2, div_tensor_0_3)
_helper(current_mobile_module, torch.div)
```

2. Update the commented code accordingly.

Currently there are 6 upgraders. The following old models with operators are added to cover these 6 upgraders:
```
// Tensor x Tensor

test_versioned_div_tensor_v3

// Tensor x Scalar

test_versioned_div_scalar_float_v3
test_versioned_div_scalar_reciprocal_int_v3
test_versioned_div_scalar_inplace_float_v3

// Scalar x Scalar

test_versioned_div_scalar_scalar_v3

// Tensor x Tensor with out kwarg

test_versioned_div_tensor_out_v3

// Tensor x Tensor inplace

test_versioned_div_tensor_inplace_v3

// Tensor x Scalar inplace

test_versioned_div_scalar_inplace_int_v3

```
Note:
In this PR, each model includes the following tests:
1. Model (with old op) load/run test, in both C++ and Python
2. Model (with old op) + upgrader test, in Python
Other tests considered for addition:
1. Per-upgrader bytecode test
2. App-level integration test
ghstack-source-id: 144422418

Test Plan: CI and the added unittest

Reviewed By: iseeyuan

Differential Revision: D32069653

fbshipit-source-id: 96d9567088a1f709bc7795f78beed7a308e71ca9
2021-12-01 18:46:30 -08:00
124bb6a19d RegisterDispatchKey.cpp: remove redundant code (#68983)
Summary:
Remove the line, since line 10 already includes this header file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68983

Reviewed By: samdow

Differential Revision: D32706952

Pulled By: soulitzer

fbshipit-source-id: 98746e12d8d04d64ee2e0449e4aec5153ac723d5
2021-12-01 18:38:19 -08:00
fced51eaf7 [torch][distributed] Check for file existence before invoking cleanup logic in FileStore destructor (#68603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68603

FileStore is frequently used from Python, which is garbage-collected. This means that users of FileStore from Python do not have control over when the FileStore destructor is invoked. If the directory for the file store is created by some external logic that has its own cleanup procedure, that procedure may race with the logic in the FileStore destructor.

The diff adds a check for file existence in the destructor before actually invoking the cleanup. In the long term, it makes sense to move the cleanup logic out of the destructor into a separate method.
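
A sketch of the Python-side lifecycle this protects (the path and world size are hypothetical):

```python
import torch.distributed as dist

# The store's directory may be created and cleaned up by external logic;
# the FileStore destructor runs whenever Python drops the last reference,
# so the two cleanups can race.
store = dist.FileStore("/tmp/some_dir/filestore", 2)
store.set("key", "value")
del store  # destructor cleanup now tolerates the file being already gone
```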

Test Plan:
CI

Stress tests: `buck test mode/dev-nosan //torchrec/examples/dlrm/tests:test_dlrm_main -- --exact 'torchrec/examples/dlrm/tests:test_dlrm_main - torchrec.examples.dlrm.tests.test_dlrm_main.MainTest: test_main_function' --run-disabled --jobs 18 --stress-runs 20 --record-results`

Reviewed By: colin2328

Differential Revision: D32535470

fbshipit-source-id: 6f421f2e7b0d9ac9c884a1db2f7e5a94fc59fc0e
2021-12-01 16:43:42 -08:00
3c1e2ff9eb fixing layer_norm cuda bug (#69210)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69210

Reviewed By: H-Huang

Differential Revision: D32764811

Pulled By: ngimel

fbshipit-source-id: fb4201fe5f2284fcb22e36bc1029eef4a21b09bf
2021-12-01 15:46:47 -08:00
d72d476875 [pyper] add flag to disable clip_ranges_gather fusions (#69198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69198

Add flag --enable_clip_ranges_gather_fusions, which can be set to 0 to disable clip_ranges+gather_ranges fusions.

This fusion happens in static runtime, and it also happens in jit when optimize_sparse_nn_model is used.

Note that clip_ranges+gather_ranges+sigrid_hash fusions use different code that was untouched by D30515441 (01b30922dd), so not disabling it for now.
This also effectively disables ClipRangesGatherSigridHash(graph) (even though it's not explicitly included), because that fusion looks for the clip_ranges_gather_lengths_to_offsets fusion, which won't exist if this flag is on.

Test Plan:
Run ptvsc2_predictor_bench with --enable_clip_ranges_gather_fusions=0 and SR=1
```
Input size: 211
Static runtime ms per iter: 11.9668. Iters per second: 83.5643
Time per node type:
        6.42796 ms.    54.5663%. static_runtime::fused_variadic_sigrid_transforms_torch_bind (1 nodes, out variant)
        1.64969 ms.    14.0041%. fb::quantized_linear (9 nodes, out variant)
       0.475394 ms.    4.03557%. fb::clip_ranges_gather_sigrid_hash_precompute_v3 (158 nodes, out variant)
       0.367554 ms.    3.12013%. aten::argmin (1 nodes, out variant)
       0.358351 ms.    3.04201%. aten::matmul (1 nodes, out variant)
       0.215082 ms.    1.82581%. static_runtime::to_copy (805 nodes, out variant)
       0.214397 ms.    1.81999%. fb::gather_ranges (313 nodes, out variant)
       0.179759 ms.    1.52595%. fb::offsets_to_ranges (655 nodes, out variant)
       0.173236 ms.    1.47058%. fb::lengths_to_offsets (464 nodes, out variant)
       0.151249 ms.    1.28394%. aten::sub (1 nodes, out variant)
        0.14017 ms.    1.18989%. aten::sigmoid (3 nodes, out variant)
       0.136118 ms.    1.15549%. aten::mul (5 nodes, out variant)
       0.130813 ms.    1.11046%. aten::sum (3 nodes, out variant)
       0.124876 ms.    1.06006%. aten::repeat (1 nodes, out variant)
        0.12191 ms.    1.03488%. static_runtime::signed_log1p (1 nodes, out variant)
      0.0922349 ms.   0.782972%. aten::norm (1 nodes, out variant)
      0.0877845 ms.   0.745193%. aten::pow (1 nodes, out variant)
      0.0783335 ms.   0.664966%. fb::batch_box_cox (1 nodes, out variant)
      0.0755047 ms.   0.640951%. fb::clip_ranges (311 nodes, out variant)
      0.0702456 ms.   0.596308%. static_runtime::layer_norm (2 nodes, out variant)
      0.0696762 ms.   0.591475%. fb::quantize_per_tensor (4 nodes)
      0.0556873 ms.   0.472724%. quantized::embedding_bag_byte_prepack (3 nodes, out variant)
      0.0555237 ms.   0.471335%. prim::VarConcat (2 nodes, out variant)
      0.0437336 ms.    0.37125%. static_runtime::dict_unpack (2 nodes, native)
      0.0390592 ms.    0.33157%. static_runtime::dequantize_copy (9 nodes, out variant)
      0.0385823 ms.   0.327521%. fb::concat_add_mul_replacenan_clip (1 nodes, out variant)
      0.0321869 ms.   0.273231%. prim::TupleConstruct (1 nodes, out variant)
      0.0308289 ms.   0.261703%. fb::casted_batch_one_hot_lengths (1 nodes, out variant)
      0.0280272 ms.    0.23792%. static_runtime::reshape_copy (2 nodes, out variant)
      0.0244705 ms.   0.207727%. fb::sigrid_hash_precompute (1 nodes, out variant)
       0.020917 ms.   0.177562%. static_runtime::VarTupleUnpack (1 nodes, native)
      0.0175842 ms.   0.149271%. aten::div (1 nodes, out variant)
      0.0169989 ms.   0.144302%. aten::narrow_copy (4 nodes, out variant)
     0.00818147 ms.  0.0694517%. aten::logit (1 nodes, out variant)
     0.00719822 ms.   0.061105%. prim::VarStack (1 nodes, out variant)
     0.00687292 ms.  0.0583435%. aten::add (1 nodes, out variant)
     0.00328646 ms.  0.0278985%. aten::clamp_min (1 nodes, out variant)
     0.00325073 ms.  0.0275951%. static_runtime::expand_dims_copy (1 nodes, out variant)
     0.00295617 ms.  0.0250946%. static_runtime::flatten_copy (1 nodes, out variant)
     0.00230511 ms.  0.0195679%. aten::expand_as (1 nodes, native)
     0.00182061 ms.   0.015455%. aten::full_like (1 nodes, out variant)
    0.000268152 ms. 0.00227631%. prim::ListConstruct (1 nodes, out variant)
        11.7801 ms. in Total
```

Servicelabs:
AF: https://www.internalfb.com/intern/servicelab/1001770528/
AI: https://www.internalfb.com/intern/servicelab/402342245/
Prospector: https://www.internalfb.com/intern/servicelab/502342630/

Reviewed By: movefast1990

Differential Revision: D32750847

fbshipit-source-id: b809a72a9fbeea86080346962eb17761e71397d8
2021-12-01 15:26:36 -08:00
263125a962 Fix RAdam docstring on LR default value (#69186)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69186

Reviewed By: albanD

Differential Revision: D32759614

Pulled By: H-Huang

fbshipit-source-id: b11819c50156a538cd6003e9cddde0390c853f67
2021-12-01 14:32:07 -08:00
3bf4080fd9 Change misleading MaxUnpool2d example to better demonstrate output_size usage (#68936)
Summary:
At https://github.com/pytorch/pytorch/issues/68873, jbschlosser states that maxunpool2d with the `output_size` argument only works for indices of the same size. This makes sense, but unfortunately it's not what's shown in the example! I've removed the wrong example and replaced it with one where specifying `output_size` is actually necessary -- the unpool call fails without it.
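
A minimal sketch of that scenario (shapes assumed for illustration): pooling a 5x5 input discards the last row/column, and the stored indices still refer to the 5x5 layout, so unpooling needs `output_size` to reconstruct the original shape.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
x = torch.arange(25.0).reshape(1, 1, 5, 5)
out, indices = pool(x)                           # out: 1x1x2x2
y = unpool(out, indices, output_size=x.size())   # back to 1x1x5x5
# unpool(out, indices) without output_size would infer 1x1x4x4 and fail,
# since some indices point past a 4x4 grid.
```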

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68936

Reviewed By: H-Huang

Differential Revision: D32759207

Pulled By: jbschlosser

fbshipit-source-id: 658e1724150a95454a05a771ae7c6e2e736740a7
2021-12-01 14:11:26 -08:00
2eef5e76db add extra_repr for nn.ZeroPad2d (#69206)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69206

Reviewed By: H-Huang

Differential Revision: D32759597

Pulled By: jbschlosser

fbshipit-source-id: abc9ee69fb5e22d45a640993a4e598b016020688
2021-12-01 13:53:19 -08:00
cd043c335f Revert D32329330: [JIT] Separate GPU implementation of frozen_conv_add_relu_fusion.cpp
Test Plan: revert-hammer

Differential Revision:
D32329330 (cfc75c2137)

Original commit changeset: c0f10da4b954

fbshipit-source-id: e81f93a5c1e2bb9b20fde6ccaeef143472a5b900
2021-12-01 12:55:10 -08:00
e6c435bf96 [LTC] Upstream helpers for c10::Device <=> BackendDevice (#69064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69064

This commit upstreams helpers for converting a c10::Device to
BackendDevice and vice versa.

Test Plan: ./build/bin/test_lazy --gtest_filter=BackendDeviceTest.FromAten:BackendDeviceTest.ToAten

Reviewed By: wconstab

Differential Revision: D32732607

Pulled By: alanwaketan

fbshipit-source-id: 0dd233d37a4a30fc4b22dba322ddd85d4cb3635b
2021-12-01 12:15:32 -08:00
92f168941e remove accidentally committed redundant debug print (#68510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68510

remove accidentally committed redundant debug print
ghstack-source-id: 144362817

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D32487736

fbshipit-source-id: 279030f782e6b716a6bbfd591c5ce761de3ddd63
2021-12-01 11:35:34 -08:00
1842364b30 Strided masked normalize. (#68694)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68694

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D32724552

Pulled By: cpuhrsch

fbshipit-source-id: 82f579a86b0b265e0b9b3715a8a327b775dd55e1
2021-12-01 10:45:16 -08:00
23633bdb5c record the datapipe for each piece of Dataset (#67613)
Summary:
Add record_function for each DataPipe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67613

Reviewed By: H-Huang

Differential Revision: D32246672

Pulled By: ejguan

fbshipit-source-id: 02ef7e75748c5b84fdcbb103398532e1f2962fbf
2021-12-01 10:29:06 -08:00
deaf745aee Add kl divergence between normal and laplace distribution. (#68807)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68746
![KL_normal_laplace](https://user-images.githubusercontent.com/35850237/143008244-f304cee1-9583-4de1-b0d0-5751ebdb8188.png)
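
A minimal sketch exercising the newly registered pair:

```python
import torch
from torch.distributions import Laplace, Normal, kl_divergence

p = Normal(torch.tensor(0.0), torch.tensor(1.0))
q = Laplace(torch.tensor(0.0), torch.tensor(1.0))
print(kl_divergence(p, q))  # closed-form KL(Normal || Laplace)
```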

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68807

Reviewed By: H-Huang

Differential Revision: D32750391

Pulled By: neerajprad

fbshipit-source-id: 129e6ef60d6e244d0d6b02b3944bfd5d8b06edcb
2021-12-01 10:22:08 -08:00
486ae5c733 Dataset & IterableDataset attribute errors print the attribute (#69021)
Summary:
The message is the same as a standard AttributeError's; including the attribute name makes it more informative when the error is thrown.
Alternatively, in Python 3.10 one can set the keyword arguments 'name' and 'obj';
reference: https://github.com/python/cpython/blob/3.10/Doc/library/exceptions.rst#concrete-exceptions
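
A sketch of that Python 3.10 alternative (the class and attribute names are made up for illustration):

```python
class Example:
    pass

obj = Example()
# Python 3.10+ accepts keyword-only `name` and `obj` arguments.
raise AttributeError(
    f"'{type(obj).__name__}' object has no attribute 'foo'",
    name="foo",
    obj=obj,
)
```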

Fixes #{?}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69021

Reviewed By: samdow

Differential Revision: D32730362

Pulled By: ejguan

fbshipit-source-id: 7132ba612fa6075aeffb9315ce651828e9a8e0bc
2021-12-01 10:16:31 -08:00
d507fd63f3 Check that block height and width are positive in nn.Fold (#69048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68875

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69048

Reviewed By: samdow

Differential Revision: D32729307

Pulled By: jbschlosser

fbshipit-source-id: 162cafb005873012d900d86997d07640967038c0
2021-12-01 10:08:47 -08:00
c08e95dd9c Introduce IS_LINUX and IS_MACOS global vars (#69093)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69093

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D32730080

Pulled By: malfet

fbshipit-source-id: aa3f218d09814b4edd96b01c7b57b85fd58c47fc
2021-12-01 09:47:38 -08:00
840fe8e4e6 Fix MacOS artifact upload (#69188)
Summary:
Add test shard number and runner name to the test name suffix
Otherwise test report names for shard 1 and shard 2 will be identical
and overwrite each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69188

Reviewed By: janeyx99

Differential Revision: D32747747

Pulled By: malfet

fbshipit-source-id: 149f921d8e420d3ed69ce812bdcd3c034799353a
2021-12-01 08:06:48 -08:00
f9e69af22e Modify LU_backward and lu_solve_backward to use linalg_solve_triangular (#63569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63569

This PR also rewrites `lu_solve_backward` from scratch going from
solving 5 systems of equations to just 2.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32618014

Pulled By: anjali411

fbshipit-source-id: 0e915bcf7045a4db43ffd076d807beac816c8538
2021-12-01 07:34:38 -08:00
478069d6f2 Remove duplicate .DS_Store in gitignore (#68981)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68981

Reviewed By: samdow

Differential Revision: D32707039

Pulled By: soulitzer

fbshipit-source-id: 346f0f3de583d995be34c252db4f9f26cd574ba8
2021-12-01 07:28:33 -08:00
e5e0c19882 OpInfo : embedding_bag (#67252)
Summary:
Adds OpInfo for `embedding_bag`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67252

Reviewed By: VitalyFedyunin

Differential Revision: D32462157

Pulled By: zou3519

fbshipit-source-id: 70303349a718720c4fa47519fa94ae900e052939
2021-12-01 07:00:50 -08:00
1da1707568 Sparse: Implement simple unary ufuncs operators (#68887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68887

Closes #46988, closes #46987, closes #46761

By "simple" I mean operators that map 0->0 so we can implement it by
just re-dispatching on the values tensor. That does mean we have `sin`
but not `cos` for example, but without fill value support this is the
best that can be done.
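
For instance, a small sketch (assuming sparse COO support for `sin` as described):

```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])
s = torch.sparse_coo_tensor(i, v, (2, 2))
# sin maps 0 -> 0, so it can operate on just the stored values.
print(torch.sin(s).to_dense())
```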

Most of these don't support autograd because the derivative formulas
use unsupported operators.

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32734911

Pulled By: cpuhrsch

fbshipit-source-id: 203ab105799f3d2d682b01ca3d6b18e7c994776a
2021-12-01 05:43:19 -08:00
afff381824 Automated submodule update: tensorpipe (#69089)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: ed4bbe52b7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69089

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D32725534

fbshipit-source-id: 73b1e0f67c957ca0220cd47179dd4b350a98fd33
2021-12-01 02:29:18 -08:00
a23d1036ab Add ops for BI (mean) (#68826)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68826

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D32732465

Pulled By: eellison

fbshipit-source-id: e8b185d89e5ecbe5c8e09d576c84a1f0a402a5e0
2021-12-01 00:45:00 -08:00
19b87292fc Add TE fuser ops (#68825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68825

Factoring out the elementwise ops in tensorexpr fuser and adding their corresponding shape functions, since we need shape functions to fuse them with dynamic shapes

Test Plan: Imported from OSS

Reviewed By: samdow

Differential Revision: D32732466

Pulled By: eellison

fbshipit-source-id: 69cacf6fbed8eb97e475f5d55b2eec0384fe8ec1
2021-12-01 00:43:42 -08:00
7fad758e02 [FSDP] AutoWrap Main API (#68155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68155

Per title
ghstack-source-id: 144398229

Test Plan: CI

Reviewed By: pbelevich, mrshenli

Differential Revision: D32327954

fbshipit-source-id: 36bdf06c1c50932a93acbfa97017c549fa490a6c
2021-12-01 00:16:38 -08:00
999e52a795 [FileStore] log timeout in err msg (#69167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69167

Per title
ghstack-source-id: 144378083

Test Plan: Ci

Reviewed By: H-Huang

Differential Revision: D32736119

fbshipit-source-id: f37fd3e4ac393c07eb8bd1f9202841d33d0a8aad
2021-11-30 23:29:09 -08:00
845a82b635 Debug positive definite constraints (#68720)
Summary:
While implementing https://github.com/pytorch/pytorch/issues/68644,
during the testing of 'torch.distributions.constraints.positive_definite', I found an error in the code: [location](c7ecf1498d/torch/distributions/constraints.py (L465-L468))
```
class _PositiveDefinite(Constraint):
    """
    Constrain to positive-definite matrices.
    """
    event_dim = 2

    def check(self, value):
        # Assumes that the matrix or batch of matrices in value are symmetric
        # info == 0 means no error, that is, it's SPD
        return torch.linalg.cholesky_ex(value).info.eq(0).unsqueeze(0)
```

The error occurs when I check the positive definiteness of
`torch.cuda.DoubleTensor([[2., 0], [2., 2]])`,
but it does not occur for
`torch.DoubleTensor([[2., 0], [2., 2]])`.

You may easily reproduce the error by following code:

```
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> const = torch.distributions.constraints.positive_definite
>>> const.check(torch.cuda.DoubleTensor([[2., 0], [2., 2]]))
tensor([False], device='cuda:0')
>>> const.check(torch.DoubleTensor([[2., 0], [2., 2]]))
tensor([True])
```
The cause of the error can be analyzed further by passing 'check_errors=True' as an additional argument to 'torch.linalg.cholesky_ex'.
It seems to be caused by recent changes in 'torch.linalg'.
I suggest modifying the '_PositiveDefinite' class to use the 'torch.linalg.eig' function, as below:

```
class _PositiveDefinite(Constraint):
    """
    Constrain to positive-definite matrices.
    """
    event_dim = 2

    def check(self, value):
        return (torch.linalg.eig(value)[0].real > 0).all(dim=-1)
```

Using the above implementation, I get the following result:
```
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> const = torch.distributions.constraints.positive_definite
>>> const.check(torch.cuda.DoubleTensor([[2., 0.], [2., 2.]]))
tensor(True, device='cuda:0')
>>> const.check(torch.DoubleTensor([[2., 0.], [2., 2.]]))
tensor(True)
```

FYI, I do not know what algorithms are used in 'torch.linalg.eig' and 'torch.linalg.cholesky_ex'. As far as I know, they generally have the same time complexity, O(n^3). With special algorithms or finer parallelization, the time complexity of Cholesky decomposition may be reduced to approximately O(n^2.5). If there is a reason 'torch.distributions.constraints.positive_definite' previously used 'torch.linalg.cholesky_ex' rather than 'torch.linalg.eig', I would like to know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68720

Reviewed By: samdow

Differential Revision: D32724391

Pulled By: neerajprad

fbshipit-source-id: 32e2a04b2d5b5ddf57a3de50f995131d279ede49
2021-11-30 22:27:27 -08:00
8586f374bc [Pytorch Edge] Get Operator Version from model file (#68677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68677

Used in compatibility APIs. Luckily the stream reader mostly does this already, so we just create a wrapper in our compatibility files.

Test Plan: ci

Reviewed By: cccclai

Differential Revision: D32573132

fbshipit-source-id: 86331c03a1eebcd86ed29b9c6cd8a8fd4fe79949
2021-11-30 21:10:21 -08:00
219db3b4e1 Add OpInfo for torch.linalg.tensorsolve (#68810)
Summary:
This PR adds an OpInfo entry for the tensorsolve function.
The keyword argument name differs from NumPy's, so a lambda needs to be passed to `ref=`.
I had to change the dtypes for `test_reference_testing` because NumPy computes internally in double precision for all linear algebra functions (and maybe some other functions). Using `torch.float64` and `torch.complex128` is more reliable for NumPy comparisons.
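
A sketch of the kwarg adapter (shapes follow NumPy's documented tensorsolve example and are assumed here for illustration):

```python
import numpy as np
import torch

# torch.linalg.tensorsolve names the argument `dims`;
# numpy.linalg.tensorsolve names it `axes`.
ref = lambda a, b, dims=None: np.linalg.tensorsolve(a, b, axes=dims)

a = torch.eye(2 * 3 * 4, dtype=torch.float64).reshape(2 * 3, 4, 2 * 3, 4)
b = torch.randn(2 * 3, 4, dtype=torch.float64)
x = torch.linalg.tensorsolve(a, b)
np.testing.assert_allclose(x.numpy(), ref(a.numpy(), b.numpy()))
```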

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68810

Reviewed By: soulitzer

Differential Revision: D32696065

Pulled By: mruberry

fbshipit-source-id: a4305065d3e7d0097503dc05938b3c4784e14996
2021-11-30 20:31:12 -08:00
b05237f5e4 [Pytorch Edge] Add bool to copy kernel (#69106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69106

this kernel sucks.

Test Plan: ci

Reviewed By: shoumikhin, cccclai

Differential Revision: D32729888

fbshipit-source-id: c747d4bf3d5233c8ed15dba5e2c2d244ba7d4b3f
2021-11-30 19:45:42 -08:00
e534c5efd7 CMake: Include instead of copying cpu kernel files (#67656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67656

Currently, each cpu kernel file is copied into the build folder 3 times to give them different compilation flags. This changes it to instead generate 3 files that `#include` the original file. The biggest difference is that updating a copied file requires `cmake` to re-run, whereas include dependencies are natively handled by `ninja`.

A side benefit is that included files show up directly in the build dependency graph, whereas `cmake` file copies don't.

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D32566108

Pulled By: malfet

fbshipit-source-id: ae75368fede37e7ca03be6ade3d4e4a63479440d
2021-11-30 19:13:53 -08:00
f6f1b580f8 Fix mypy in cpp_extension.py (#69101)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69101

Test Plan: Imported from OSS

Reviewed By: atalman, janeyx99

Differential Revision: D32730081

Pulled By: malfet

fbshipit-source-id: 76ace65b51850b74b175a3c4688c05e107873e8d
2021-11-30 16:01:55 -08:00
6953b7e269 [BE] Fix mypy local run on MacOS (#69097)
Summary:
Unversioned python invocations should not be used, as they can be aliased to Python 2.
Also invoke mypy as `python3 -mmypy`, since binary aliases are not always available for user installations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69097

Reviewed By: janeyx99

Differential Revision: D32729367

Pulled By: malfet

fbshipit-source-id: 7539bd0af15f97eecddfb142dba7de7f3587083d
2021-11-30 15:52:23 -08:00
aa2163eba5 .github: Add linux.large instance type (#69165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69165

We're hitting hard concurrency limits for the built-in GitHub runners, so
let's use our own runners and make them non-ephemeral so they'll have
basically constant uptime.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: atalman

Differential Revision: D32735494

Pulled By: seemethere

fbshipit-source-id: c042c6f0fb23fd50acef312d96b0c89d02c93270
2021-11-30 14:45:51 -08:00
e60fd10659 [fbgemm] remove assumption number of rows is in 32 bit (#69066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69066

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/781

Also remove unnecessary looping inside parallel_for, since fbgemm routines support batching multiple rows.

Test Plan: CI

Reviewed By: dskhudia, jianyuh

Differential Revision: D32715453

fbshipit-source-id: 33c3e72f51c8ff5d02dafab4a8947d1230c2d551
2021-11-30 13:38:53 -08:00
ef7ed082ec [PyTorch] Remove StringView from RecordFunction implementation [2/2] (#68411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68411

Avoids heap-allocating a std::string instance in before() each time even if it's not going to be used.
ghstack-source-id: 144287655

Test Plan:
Run //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark before/after this diff with arguments --stressTestRecordFunction --op empty

Before: P467922606
After: P467922626

Reviewed By: chaekit

Differential Revision: D32453846

fbshipit-source-id: 18e1b482dbf5217add14cbaacd447de47cb5877b
2021-11-30 13:22:27 -08:00
1d84d8c5d8 [PyTorch] Remove StringView from RecordFunction interface (1/2) (#68410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68410

First step toward not heap-allocating a string in RecordFunction::before() every time
ghstack-source-id: 144287654

Test Plan: CI

Reviewed By: chaekit

Differential Revision: D32453847

fbshipit-source-id: 080d95095fb568287b65fcc41a4ca6929b5f9a87
2021-11-30 13:20:08 -08:00
22690c2cb6 Use cub::FutureValue to simplify 64bit indexing split of cub scan (#66711)
Summary:
https://github.com/NVIDIA/cub/pull/305 has landed to cub 1.15. This is ready to review and land. This PR contains https://github.com/pytorch/pytorch/pull/66219, please land that PR first before review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66711

Reviewed By: soulitzer

Differential Revision: D32698306

Pulled By: ngimel

fbshipit-source-id: 4cc6b9b24cefd8932f4d421c6d64ea20ea911f52
2021-11-30 13:15:36 -08:00
c48e6f014a [vulkan] Update VMA settings to reduce memory usage (#69088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69088

It was found that the Vulkan backend was consuming a huge amount (~287 MB) of graphics memory when executing a lightweight segmentation model. In fact, the Vulkan backend tends to consume a huge amount of memory in general.

It was found that the reason for this is due to how the backend uses [VMA](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/). When allocating memory, VMA will first allocate a large block of memory, then subdivide that block to use for individual textures and buffers. The pattern is used because Vulkan has a limit on the number of `vkDeviceMemory` allocations that can be active at one time.

It turns out that the Vulkan backend was using custom memory pools with a block size of 64 MiB, meaning that at least 64 MiB of memory would always be used. Furthermore, usage of the [linear allocation algorithm](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/custom_memory_pools.html#linear_algorithm) resulted in minimal reuse of memory, leading to the creation of many more blocks than were actually required and a huge amount of unused memory.

By avoiding the use of custom memory pools and instead simply using the default memory pool provided by VMA, the library seems to have a much easier time minimizing the amount of unused memory. This change reduces memory usage down to 20 MB when running the aforementioned segmentation model.

This diff also reduces the preferred block size to 32 MiB and removes the use of the linear allocation algorithm in case custom memory pools are needed in the future.

Test Plan:
Build and run vulkan_api_test:

```
cd ~/pytorch
BUILD_CUSTOM_PROTOBUF=OFF \
  BUILD_TEST=ON \
  USE_EIGEN_FOR_BLAS=OFF \
  USE_FBGEMM=OFF \
  USE_MKLDNN=OFF \
  USE_NNPACK=OFF \
  USE_NUMPY=OFF \
  USE_OBSERVERS=OFF \
  USE_PYTORCH_QNNPACK=OFF \
  USE_QNNPACK=OFF \
  USE_VULKAN=ON \
  USE_VULKAN_API=ON \
  USE_VULKAN_SHADERC_RUNTIME=ON \
  USE_VULKAN_WRAPPER=OFF \
  MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python3 setup.py develop --cmake && ./build/bin/vulkan_api_test
```

Reviewed By: beback4u

Differential Revision: D32653767

fbshipit-source-id: b063a8ea76d34b57d0e2e6972ca5f6f73f2fd7e5
2021-11-30 12:45:41 -08:00
fcd1375b2b [DDP][BE][Docs] Clarify checkpoint support (#68827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827

Add a note about current checkpoint support with DDP. Note that this
does not include the features enabled with _set_static_graph yet, as it is an
undocumented private API. Once we support static graph as beta feature in OSS
we can add to the note here.
ghstack-source-id: 144285041

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D32624957

fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
2021-11-30 12:37:37 -08:00
994f110a6f Refactor DDP checkpoint tests (#68792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68792

Refactor tests to be more clear what features are supported and
unsupported under certain DDP configs.
ghstack-source-id: 144285040

Test Plan: Ci

Reviewed By: pbelevich

Differential Revision: D32609498

fbshipit-source-id: 5231242054d4ff6cd8e7acc4a50b096771ef23d1
2021-11-30 12:36:14 -08:00
49abda208b [JIT] internal build bug fix (#69061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69061

`warning` breaks this build [D32622152](https://www.internalfb.com/diff/D32622152)

Test Plan: Imported from OSS

Differential Revision: D32712448

Pulled By: makslevental

fbshipit-source-id: c7a70487bd0b95ac8b242522c36597d36072201f
2021-11-30 12:32:11 -08:00
5e0302e1d0 [quant][embedding qat] Set FakeQuant zeropoint dtype matches observer (#68390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68390

Observer zero_point's dtype can be float, in the specific case of `torch.per_channel_affine_float_qparams`.
This change sets FakeQuant's zero_point dtype accordingly.

Test Plan:
`pytest test/quantization/core/test_workflow_module.py  -v -k "embedding"`
`pytest test/quantization/eager/test_quantize_eager_qat.py  -v -k "embedding"`

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32446405

fbshipit-source-id: cca7aade68ff171887eeeae42801f77d934dad4c
2021-11-30 12:21:14 -08:00
8f9f559453 amend tensors.rst and torch.rst for doc generation (#69030)
Summary:
(This is my first contribution to PyTorch) Added missing operations to docs added in https://github.com/pytorch/pytorch/issues/64430. Please let me know if I've done anything wrong.

Fixes https://github.com/pytorch/pytorch/issues/68928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69030

Reviewed By: samdow

Differential Revision: D32706826

Pulled By: soulitzer

fbshipit-source-id: edcc175a8f9bc69450a39059580c05edce699312
2021-11-30 12:04:13 -08:00
0aa9d177fe [fx] remove CPatcher (#69032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69032

I am removing it because, for packaging-related reasons, it's easier if
torch.fx is a pure Python module.

I don't think there is much reason to keep it: this functionality was
experimental, has no known users currently, and we didn't have a clear
path to turning it on by default due to regressions in tracing
performance. Also, it was only ever enabled for `rand` and friends.

Technically the removal of the `enable_cpatching` arguments on
`symbolic_trace` and `Tracer.__init__` are BC-breaking, but the
docstrings clearly state that the argument is experimental and BC is not
guaranteed, so I think it's fine.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D32706344

Pulled By: suo

fbshipit-source-id: 501648b5c3610ae71829b5e7db74e3b8c9e1a480
2021-11-30 11:59:57 -08:00
81246ed01c Markdown was linking to repo rather than pytorch.org website (#68937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68937

Reviewed By: samdow

Differential Revision: D32707264

Pulled By: soulitzer

fbshipit-source-id: c534f008087def33784dde701130769e2058aa9f
2021-11-30 11:51:24 -08:00
251686fc4c Revert D32706197: Sparse: Implement simple unary ufuncs operators
Test Plan: revert-hammer

Differential Revision:
D32706197 (fbaa19a6fa)

Original commit changeset: 65e1acb36457

fbshipit-source-id: 45c4b486f9eee200d5a1f6d46d267617124f8a5e
2021-11-30 10:50:12 -08:00
8fef7c09f5 Remove finput from slow2d signatures (#68896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68896

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32655874

Pulled By: jbschlosser

fbshipit-source-id: 3c9acb106961c40af1432652179edb2bc5a4bfa5
2021-11-30 09:47:24 -08:00
cd3e37cbe4 [Static Runtime] [Code Cleanup] Reduce indentation depth in ops.cpp (#69028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69028

This change converts

```
if (..) {
 ...
} else {
 ...
}
// end of function
```

into

```
if(...) {
  ...
  return;
}
...
```
in ops.cpp to remove the else branch to reduce the indentation depth by 1 for better readability.

Test Plan: N/A

Reviewed By: hlu1

Differential Revision: D32506235

fbshipit-source-id: a4fd5188bd680dba5dcad2b6e873735a54497664
2021-11-30 09:41:46 -08:00
cfc75c2137 [JIT] Separate GPU implementation of frozen_conv_add_relu_fusion.cpp (#68149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68149

JIT optimization passes are part of the CPU-only build (i.e. necessary GPU flags are not passed in). This separates the implementation of frozen_conv_add_relu_fusion so that the GPU-enabled implementation is registered at runtime (if it is available)
ghstack-source-id: 143676384

Test Plan:
In the following script, conv_add_relu fusion is not observed without this change, but is observed when this change is added.
```
from typing import List, Optional

import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.rand((3, 3, 7, 7), device="cuda"))
        self.add_tensor = torch.nn.Parameter(torch.rand((3, 3, 7, 7), device="cuda"))

    def forward(
        self,
        inp: torch.Tensor,
        bias: Optional[torch.Tensor],
        stride: List[int],
        padding: List[int],
        dilation: List[int],
        groups: int,
    ):
        # weight = torch.zeros((3, 3, 7, 7), device="cuda")
        inp = inp.to("cuda")
        conv_result = torch.conv2d(
            inp, self.weight, bias, stride, padding, dilation, groups
        )
        add_result = conv_result.add_(self.add_tensor)
        return add_result.relu_()

    @torch.jit.export
    def make_prediction(self, inp: torch.Tensor):
        bias = None
        groups = 1
        stride = (1, 1)
        padding = (0, 0)
        dilation = (1, 1)

        return self.forward(inp, bias, stride, padding, dilation, groups)

if __name__ == "__main__":
    # generate some sample input
    groups = 1
    channels_in = 3
    channels_out = 3
    kernel_size = (7, 7)
    stride = (1, 1)
    padding = (0, 0)
    dilation = (1, 1)
    inp = torch.rand((64, 3, 432, 432))
    weight = torch.rand(
        (channels_out, channels_in, kernel_size[0], kernel_size[1]), device="cuda"
    )
    bias = None

    model = Model()
    model.eval()
    script = torch.jit.script(model)
    script = torch.jit.freeze(script)
    script = torch.jit.optimize_for_inference(script)

    print("~~~~ FORWARD ~~~~")
    print(script.graph)

    print("with preserved_attrs")
    print(torch.sum(script.forward(inp, bias, stride, padding, dilation, groups)))
```

Reviewed By: cpuhrsch

Differential Revision: D32329330

fbshipit-source-id: c0f10da4b9540c588819efe3ec540baa0fae4b35
2021-11-30 09:31:57 -08:00
7342b654a1 [static runtime] dequantize out variant (#68664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68664

Reland D32187063 (f120335643), fixing lint
Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Test on ctr_mbl_feed model 307210374 on 696 inputs

Before:
0.0569016 ms.   0.296647%. aten::dequantize (10 nodes)

After:
0.0423128 ms.   0.220481%. static_runtime::dequantize_copy (10 nodes, out variant)

Reviewed By: mikeiovine

Differential Revision: D32566429

fbshipit-source-id: b95dfc4c5e4115e083794093bc1571c7b1d72f5b
2021-11-30 09:03:26 -08:00
d3de3546d9 Revert D32099294: Split cuda: list cpp files that go in _cu library explicitly
Test Plan: revert-hammer

Differential Revision:
D32099294 (b47ae9810c)

Original commit changeset: 8a3582944b6b

fbshipit-source-id: eab63e6ba3db3e17f404292a3659823607627576
2021-11-30 08:42:19 -08:00
6fea7499c2 CompositeImplicitAutograd compliance testing (#65819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65819

Related to #61669.

Functions registered as CompositeImplicitAutograd MUST work for most, if
not all, backends. This includes Tensor subclasses.

To achieve this, we (PyTorch) impose a set of constraints on how a
CompositeImplicitAutograd function can be written.

Concretely, this PR adds tests for all OpInfos that checks for
compliance. The things that get tested in this PR apply to composite
ops and are that:
- the op does not change the metadata of a Tensor without performing
dispatches
- the op does not call set_ or resize_
- the op does not directly access the data ptr

The mechanism for the test is to create a new __torch_dispatch__
object, CompositeCompliantTensor. For each operator, we wrap all inputs
in CompositeCompliantTensor, turn on python mode for it,
and send it through the operator.

Non-CompositeImplicitAutograd operators will pass the test because they
perform a dispatch to backend code. Here's how CompositeCompliantTensor
catches problems:

- If it sees set_ or resize_ getting called, it will directly error
out
- After each operation, CompositeCompliantTensor checks to make sure
that its metadata is consistent with that of the thing it is wrapping.
If the CompositeImplicitAutograd op modifies the metadata directly
(through e.g. the TensorImpl API) then the metadata will go out of sync.
- If data_ptr gets called, that returns a nice error (because the
storage is meta).
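
A heavily simplified sketch of that wrapper pattern (not the actual CompositeCompliantTensor; the metadata comparison and the set_/resize_ errors are elided):

```python
import torch
from torch.utils._pytree import tree_map

class WrapperTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        # Meta-like wrapper: no real storage of its own.
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), strides=elem.stride(), dtype=elem.dtype,
            device=elem.device, requires_grad=elem.requires_grad)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        unwrap = lambda t: t.elem if isinstance(t, WrapperTensor) else t
        wrap = lambda t: WrapperTensor(t) if isinstance(t, torch.Tensor) else t
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
        # The real test would compare the wrapper's metadata against elem's here.
        return tree_map(wrap, out)

y = WrapperTensor(torch.randn(3)).relu()  # dispatches through the wrapper
```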

CompositeCompliantTensor is written in an interesting way. First off,
if a view operation occurs (e.g. `B = A.view_op(...)`), then B.storage()
must alias A.storage() where B.storage() is CompositeCompliantTensor's
storage, NOT the storage of the tensor it is wrapping. This is an
invariant in autograd, see #62182 for details. To handle
this we replay the view on A's storage and set it as B's storage.

Secondly, there are cases where the metadata is allowed to go out of
sync. I believe this is only possible with in-place view functions, like
transpose_, t_, squeeze_, unsqueeze_. Those are special cased.

Finally, I added a new section to aten/src/ATen/native/README.md about
what it means to be CompositeImplicitAutograd Compliant

Test Plan: - run tests

Reviewed By: ezyang, bdhirsh

Differential Revision: D31268369

Pulled By: zou3519

fbshipit-source-id: 31634b1cbe1778ab30196013cfc376ef9bd2e8b1
2021-11-30 07:35:22 -08:00
b83e8d560b [LT] Sync LTC branch changes on torch/csrc/lazy/core (#69012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69012

Some changes to torch/csrc/lazy/core were done on the
lazy_tensor_staging branch (https://github.com/pytorch/pytorch/pull/68427).
Merge those back into the trunk.

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32708696

Pulled By: desertfire

fbshipit-source-id: e54b978f2bdb9c7db27880f60246fdf1e8b41019
2021-11-30 07:09:15 -08:00
39ab417107 [Static Runtime] Fix fb::expand_dims schema (#68636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68636

Same old alias problem

Reviewed By: mikeiovine

Differential Revision: D32556204

fbshipit-source-id: 4d380f0110ad1be83f705e6d6910a6aaf818ec08
2021-11-30 06:28:29 -08:00
5b37ac54cb dbr quant overhead [14/x]: cache whether an op is a module (#68877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68877

Saves whether an op type is a module during tracing, so we
can avoid recalculating this when validating the op during inference.
This leads to a small speedup.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

```
// MobileNetV2, 1x3x224x224, function level profiling

// before
validate_cur_op - 1.77%

// after
validate_cur_op - 1.41%

```

Reviewed By: jerryzh168

Differential Revision: D32646149

Pulled By: vkuzo

fbshipit-source-id: 03ebc4fedceb84bb885939dff8dec81d30ba6892
2021-11-30 06:13:06 -08:00
b47ae9810c Split cuda: list cpp files that go in _cu library explicitly (#67216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67216

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32099294

Pulled By: dagitses

fbshipit-source-id: 8a3582944b6b48af1ac31c5df09a7e6e838892c4
2021-11-30 04:24:55 -08:00
174eea8a05 Remove native_functions.yaml dependency from IndexKernel.{cpp,cu} (#66914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66914

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31856105

Pulled By: dagitses

fbshipit-source-id: 8729783b68879b509ae6b66ce145de0af68aad8c
2021-11-30 04:24:52 -08:00
f7d598948a Remove native_functions.yaml dependency from TensorModeKernel.cu (#66913)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66913

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D31856102

Pulled By: dagitses

fbshipit-source-id: 8888a1984adef09104a40ae683d091143cd1f4fa
2021-11-30 04:22:09 -08:00
ec1339a48b [CUDA Pinned Memory] Alternative implementation of pinned memory allocator focusing on multi-threaded scalability (#68906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68906

The existing PyTorch pinned memory allocator has been a challenge for scalability in multi-GPU inference workloads. The existing allocator is mostly designed in the context of training, where in the process-per-GPU setup we have natural sharding of the global locks and lower allocation rates (perhaps O(100) allocs/sec per process). In this setup we might have globally on the order of O(200k) allocs/sec - e.g. 20k QPS and 10 allocs/query. This is a different domain.

In the existing allocator, we observe that tail latencies of cudaEventCreate and cudaEventDestroy (issued while holding the lock) can completely stall all allocations, which is undesirable.

The idea here is to retain a similar design to the existing PyTorch allocator - eager collection of used memory, no lock-free or deferred tricks, identical semantics around events - but to:

a) split up the locks around the various critical data structures,
b) do as little work as possible while holding any process-global mutexes (importantly, no CUDA runtime API calls), and
c) pool CUDA events manually, since CUDA event creation is a bottleneck at high rates from multiple threads (see the sketch below).

This does require a bit of care, but I believe it's correct. In general the threading and state transitions are fairly simple.
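
A minimal Python sketch of the event-pooling idea in (c); the real implementation is C++ inside the caching host allocator, and the class and method names here are illustrative:

```
import threading
import torch

class CudaEventPool:
    def __init__(self):
        self._lock = threading.Lock()
        self._free = []  # recycled torch.cuda.Event objects

    def get(self):
        # Only the list manipulation happens under the lock.
        with self._lock:
            if self._free:
                return self._free.pop()
        # The expensive event creation happens outside the lock,
        # and only on a pool miss.
        return torch.cuda.Event()

    def put(self, event):
        with self._lock:
            self._free.append(event)
```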

With these improvements, microbenchmarks show significant improvements (1.5x-3x). Importantly, real workloads also show significant improvements, especially WRT tail latency and stalls.

Test Plan:
Unit tests all pass.

With a synthetic benchmark such as:

```
// Assumed includes for building this snippet standalone (not part of the original):
#include <cmath>
#include <benchmark/benchmark.h>
#include <folly/Random.h>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CachingHostAllocator.h>

static void BM_copies_baseline(benchmark::State& state) {
  auto N = state.range(0);
  auto scale = state.range(1);
  auto object_size_min = N;
  auto object_size_max = scale * N;

  auto device = at::Device(at::kCUDA, at::cuda::current_device());

  uint64_t bytes_copied = 0;
  uint64_t allocs = 0;
  auto stream = at::cuda::getCurrentCUDAStream();
  for (auto _ : state) {
    auto object_size = static_cast<int64_t>(expf(folly::Random::randDouble(
        logf(object_size_min), logf(object_size_max))));
    auto tensor = at::empty(
        {object_size},
        at::TensorOptions().dtype(at::kByte).pinned_memory(true));
    at::cuda::CachingHostAllocator_recordEvent(
        tensor.storage().data_ptr().get_context(), stream);
    bytes_copied += object_size;
    allocs += 1;
  }
  state.counters["BW"] =
      benchmark::Counter(bytes_copied, benchmark::Counter::kIsRate);
  state.counters["Allocs"] =
      benchmark::Counter(allocs, benchmark::Counter::kIsRate);
}

BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(1)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(4)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(16)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(64)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(128)->UseRealTime();
BENCHMARK(BM_copies_baseline)->Args({1000000, 20})->Threads(256)->UseRealTime();

BENCHMARK_MAIN();  // assumed entry point for a standalone build
```

I observe roughly 1.5-3x improvements.

End to end application testing also sees significant improvements in the contended scenario.

Reviewed By: jianyuh, ngimel

Differential Revision: D32588784

fbshipit-source-id: ee86c3b7ed4da6412dd3c89362f989f4b5d91736
2021-11-30 02:49:43 -08:00
0cdeb586ae [LTC] Upstream some utilities (#69046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69046

This commit upstreams utilities including ExceptionCleanup, MaybeRef,
Iota, ToVector, ToOptionalVector and GetEnumValue.

Test Plan: ./build/bin/test_lazy --gtest_filter=UtilTest.*

Reviewed By: wconstab, Chillee

Differential Revision: D32709090

Pulled By: alanwaketan

fbshipit-source-id: 5147433becd4dbb07be7d36d66b0b8685054d714
2021-11-30 02:44:02 -08:00
fbaa19a6fa Sparse: Implement simple unary ufuncs operators (#68887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68887

Closes #46988, closes #46987, closes #46761

By "simple" I mean operators that map 0->0, so we can implement them by
just re-dispatching on the values tensor. That does mean we have `sin`
but not `cos`, for example, but without fill-value support this is the
best that can be done.

Most of these don't support autograd because the derivative formulas
use unsupported operators.

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32706197

Pulled By: cpuhrsch

fbshipit-source-id: 65e1acb3645737ca7bdb7f2db739d8e118906f4b
2021-11-30 00:30:30 -08:00
3186d36972 [TensorExpr] Supress TracerWarnings in test_unsupported in test_jit_fuser_te.py. (#68757)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68757

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32600951

Pulled By: ZolotukhinM

fbshipit-source-id: 7b9859d7dee1e9803b8fde5d071890a72d30cec9
2021-11-30 00:06:36 -08:00
75ce040620 [TensorExpr] Allow for 'keepdim' argument in aten::mean in NNC's external call. (#68756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68756

That fixes some warnings in our tests.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32600952

Pulled By: ZolotukhinM

fbshipit-source-id: 548eaf3659e20795cce44d8f57e77f4a47d44d98
2021-11-30 00:06:34 -08:00
a93f505ee5 [TensorExpr] IRPrinter: print sizes and name when visiting a Buf. (#68755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68755

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D32600950

Pulled By: ZolotukhinM

fbshipit-source-id: 925da05d958497791cb9176a5d15d8315334aa24
2021-11-30 00:05:10 -08:00
8cc9ec2f6b Add option to get input dtype from user (#68751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68751

Add option to get input dtype from user for AOT compilation

Test Plan:
BI model compiles and runs fine
```
(pytorch)  ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ buck run //caffe2/binaries:aot_model_compiler -- --model=bi.pt --model_name=pytorch_dev_bytedoc --model_version=v1 '--input_dims=1,115;1' --input_types='int64;int64'
Building... 8.3 sec (99%) 7673/7674 jobs, 0/7674 updated
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1116 14:32:44.632536 1332111 TensorImpl.h:1418] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
E1116 14:32:44.673710 1332111 huge_pages_allocator.cc:287] Not using huge pages because not linked with jemalloc
The compiled llvm assembly code was saved to bi.compiled.ll
The compiled model was saved to bi.compiled.pt
```

> An error is thrown when the input_dims and input_types counts don't match

```
(pytorch)  ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ buck run //caffe2/binaries:aot_model_compiler -- --model=bi.pt --model_name=pytorch_dev_bytedoc --model_version=v1 '--input_dims=1,115;1' --input_types='int64;int64;int64'
.
.
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at aot_model_compiler.cc:208] split(';', FLAGS_input_dims).size() == split(';', FLAGS_input_types).size(). Number of input_dims and input_types should be the same
.
.
.
```

Reviewed By: ljk53

Differential Revision: D32477001

fbshipit-source-id: 8977b0b59cf78b3a2fec0c8428f83a16ad8685c5
2021-11-29 21:39:49 -08:00
ac1fe91dc9 Clean up some THC includes (#69024)
Summary:
These seem not to be needed and cause ninja to rebuild the files at every build.

(There is also THCStorage.cu, but hopefully this will go away with https://github.com/pytorch/pytorch/issues/68556 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69024

Reviewed By: soulitzer

Differential Revision: D32705309

Pulled By: ngimel

fbshipit-source-id: 5255297f213fdcf36e7203de7460a71291f8c9a0
2021-11-29 20:55:27 -08:00
ce53baf573 Merging the implementations of ClearProfiling (#67575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67575

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32497548

Pulled By: Gamrix

fbshipit-source-id: fb656b017d405487e25bd2407b069a702769659f
2021-11-29 19:48:56 -08:00
e6a8d15a4c cpu_kernel_vec: Hoist stride checks out of loop (#68962)
Summary:
`cpu_kernel_vec` does stride checks to determine whether to use the vectorized or scalar inner loop. Since it uses a 1d `for_each` loop, it re-does these stride checks after every pass over the inner dimension. For iterators with small inner dimensions, this means a significant proportion of the time may be spent just on stride checks.

This changes it to use a 2d loop so the stride checks are amortized further. With the `copy_` benchmark below, it cuts the callgrind instruction count by 50% (from 28.4 million to 13.5 million) and gives a 30% speedup (from 22.8 us to 16.4 us) on my machine.

```
from torch.utils.benchmark import Timer
import timeit
timer = Timer(
    stmt="b.copy_(a);",
    setup="""
    auto a = at::rand({10000, 8}, at::kComplexDouble).slice(0, 0, -1, 2);
    auto b = at::empty_like(a);
    """,
    num_threads=1,
    language='c++',
    timer=timeit.default_timer
)
# Run it, e.g.:
#   print(timer.blocked_autorange())            # wall time
#   print(timer.collect_callgrind().counts())   # callgrind instruction count
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68962

Reviewed By: mrshenli

Differential Revision: D32684191

Pulled By: ngimel

fbshipit-source-id: 582af038314a0f999f43669e66edace38ff8d2dc
2021-11-29 19:37:58 -08:00
61ea2fc35e Fix device type / dtype handling for parametrized test names (#65217)
Summary:
This PR absolves `_TestParametrizer`s (e.g. `ops`, `modules`, `parametrize`) of the responsibility of adding device type (e.g. `'cpu'`, `'cuda'`, etc.) / dtype (e.g. 'float32') to generated test names. This fixes repeated instances of the device string being added to generated test names (e.g. `test_batch_norm_training_True_cuda_track_running_stats_True_cuda_affine_True_cuda`).

The responsibility for placing device / dtype suffixes is now handled by `instantiate_device_type_tests()` instead so it is added a single time. It will place `<device>_<dtype>` at the end of the test name unconditionally, maintaining the current naming convention.
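
For illustration, a minimal sketch of the resulting convention using these helpers (treat the exact import paths as assumptions):

```
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, parametrize, run_tests

class TestBatchNorm(TestCase):
    @parametrize("training", [True, False])
    @parametrize("affine", [True, False])
    def test_batch_norm(self, device, training, affine):
        pass  # body elided

# Generates names with a single trailing device suffix, e.g.
#   test_batch_norm_affine_True_training_True_cpu
# rather than repeating "_cpu"/"_cuda" after every parameter.
instantiate_device_type_tests(TestBatchNorm, globals())

if __name__ == "__main__":
    run_tests()
```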

As part of this work, I also tightened the semantics through some additional error case handling:
* Composing multiple decorators that each try to handle the same parameter will error out with a nice message. This includes the case of trying to compose `modules` + `ops`, as they each try to handle `dtype`. Similarly, `ops` + `dtypes` is forbidden when both try to handle `dtype`. This required changes in the following test files:
  * `test/test_unary_ufuncs.py`
  * `test/test_foreach.py`
* The `modules` / `ops` decorators will now error out with a nice message if used with `instantiate_parametrized_tests()` instead of `instantiate_device_type_tests()`, since they're not (currently) written to work outside of a device-specific context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65217

Reviewed By: mruberry

Differential Revision: D32627303

Pulled By: jbschlosser

fbshipit-source-id: c2957228353ed46a0b7da8fa1a34c67598779312
2021-11-29 19:02:23 -08:00
933d5b561f Fixed links to RNN docs in comments (#68828)
Summary:
Fixed links to RNN docs in comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68828

Reviewed By: soulitzer

Differential Revision: D32702384

Pulled By: jbschlosser

fbshipit-source-id: 577c88842cde555534d9a39fa7dfd24164d71552
2021-11-29 18:55:53 -08:00
863f321c6d Fix typo in AdaptiveLogSoftmaxWithLoss docs (#68926)
Summary:
Fixes a typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68926

Reviewed By: soulitzer

Differential Revision: D32702366

Pulled By: jbschlosser

fbshipit-source-id: 8975aad3e817dab33359cf29182b4bd1e3aa1299
2021-11-29 18:51:58 -08:00
b8c3693281 Remove autograd-enabled collective APIs from distributed docs (#69011)
Summary:
These APIs are not yet officially released and are still under discussion. Hence, this commit removes those APIs from docs and will add them back when ready.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69011

Reviewed By: fduwjj

Differential Revision: D32703124

Pulled By: mrshenli

fbshipit-source-id: ea049fc7ab6b0015d38cc40c5b5daf47803b7ea0
2021-11-29 18:14:50 -08:00
178010455d Vectorized: Use inline namespace instead of anonymous (#67655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67655

Some of the CPU operators already use the `namespace CPU_CAPABILITY` trick to avoid anonymous namespacing, like [`PowKernel.cpp`](cd51d2a3ec/aten/src/ATen/native/cpu/PowKernel.cpp (L14)). This extends that pattern to the `Vectorized` class, which avoids `-Wsubobject-linkage` warnings like I was getting in #67621.

For many functions, it was necessary to add `inline` because the functions are defined in a header. There were no link errors previously because the anonymous namespace ensured they were not exposed to linkage. Similarly, free functions defined in an anonymous namespace might need the `C10_UNUSED` attribute to silence warnings about the function not being called in the only translation unit that it's defined in. By removing the anonymous namespace, these decorators are no longer necessary.

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D32566109

Pulled By: malfet

fbshipit-source-id: 01d64003513b4946dec6b709bd73bbab05772134

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-11-29 16:54:17 -08:00
1d0416397a [PyTorch] Change from unique_ptr to optional for RecordFunction state (#68397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68397

Now that hot paths can avoid instantiating RecordFunction by using shouldRunRecordFunction, we can improve efficiency for profiling cases by avoiding a large heap allocation.
ghstack-source-id: 144235785

Test Plan:
1) Run //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark before/after this diff with arguments --stressTestRecordFunction --op empty.

Before: P467891381

After: P467902339

2) Run without --stressTestRecordFunction to verify no regression in the regular dispatcher path.

Before: P467902381

After: P467902403

Reviewed By: chaekit

Differential Revision: D32448365

fbshipit-source-id: 2d32a3bd82c60d2bb11fc57bb88bf3f02aa3fa25
2021-11-29 16:35:36 -08:00
7194faed7f [PyTorch] Optimize mergeRunCallbacks for RecordFunction (#68387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68387

Function call overhead on tryRunCallback was notable.
ghstack-source-id: 144235788

Test Plan:
Run //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark before/after this diff with arguments `--stressTestRecordFunction --op empty`.

Before: P467891339
After: P467891381

Reviewed By: chaekit

Differential Revision: D32443863

fbshipit-source-id: c0b3dd40bbd5bca976c2ebb0f21aa62e097b302e
2021-11-29 16:33:36 -08:00
f1a3512b78 Adding Linux cuda 11.5 workflows (#68745)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68745

Reviewed By: janeyx99

Differential Revision: D32707491

Pulled By: atalman

fbshipit-source-id: 100facfdcc0fc2f68e203a696856852faa25ee08
2021-11-29 16:21:00 -08:00
27228656e6 [FX][docs] Document gotcha about training flag (#68915)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68915

Reviewed By: jamesr66a

Differential Revision: D32705410

Pulled By: jubinchheda

fbshipit-source-id: a44c17ab0e62465823ceb0ef983ae330b50fb073
2021-11-29 16:13:32 -08:00
f253370bb9 dbr quant overhead [13/x]: cache results of get_module_hook_type (#68841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68841

Caches the current module's hook type as an attribute on the module.
This relies on the assumption that the module's hook type does not
change during inference, which is an assumption we can commit to.
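
This is essentially memoization on a module attribute; a minimal sketch of the pattern (names are illustrative, not the real DBR internals):

```
def _compute_hook_type(module):
    return type(module).__name__  # stand-in for the real, costlier logic

def get_module_hook_type(module):
    # Cache on the module itself; safe because the hook type is assumed
    # fixed for the duration of inference.
    cached = getattr(module, '_cached_hook_type', None)
    if cached is None:
        cached = _compute_hook_type(module)
        module._cached_hook_type = cached
    return cached
```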

Test Plan:
correctness
```
python test/test_quantization.py TestQuantizeDBR
```

performance
```
// MobileNetV2, 1x3x224x224, function profiling

// before
get_module_hook_type -> 2.58%

// after
get_module_hook_type -> 0.73%
```

Reviewed By: jerryzh168

Differential Revision: D32630881

Pulled By: vkuzo

fbshipit-source-id: 667f2667ef9c5514e5d82e4e7e4c02b8238edc65
2021-11-29 16:10:24 -08:00
2ad4727ad9 dbr quant: fix debugging fqn info for converted model (#68840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68840

Fixes the debugging FQN info for a converted model. Some of this
information was missing because eager mode convert performed
module swaps. This information is only used in debugging and is
not used for inference.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

turn `enable_logging` on in `auto_trace.py`, the FQN is now displayed
for a converted model

Reviewed By: jerryzh168

Differential Revision: D32630884

Pulled By: vkuzo

fbshipit-source-id: be8c43343abfdab9fe0af39499d908ed61a01b78
2021-11-29 16:10:21 -08:00
a03fe9ba61 dbr quant overhead[12/x]: turn off overrides for module convert output hook (#68839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68839

We can assume that there are no overrides needed for the hook which
dequantizes the module outputs, so we can turn them off explicitly.
While this does not lead to a measurable perf win, it makes things
easier to debug by eliminating the no-op overrides.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32630886

Pulled By: vkuzo

fbshipit-source-id: 1719c168f5f21f3e59c80a3b6d0f32ebb1c77ef8
2021-11-29 16:10:18 -08:00
515db56755 dbr quant: remove unnecessary outputs hook (#68838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68838

Removes an unnecessary outputs hook on the top level
module.  The same hook is already called inside the regular
hook flow.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: soulitzer

Differential Revision: D32630882

Pulled By: vkuzo

fbshipit-source-id: aa5f1b1cb866051013195d7311949333b08df4de
2021-11-29 16:10:15 -08:00
e3af582f92 dbr quant overhead[11/x]: speed up module convert hook (#68837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68837

The module convert hook dequantizes the module outputs if the user
requested the module to adhere to a certain dtype for outputs. This
is most commonly used for the assumption that a model's overall return
type is fp32.

This PR precalculates for each module whether this hook will do anything,
and returns early if it does not. This keeps the hook's overhead from
affecting any module which does not need it.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

perf

```
MobileNetV2, 1x3x224x224, function level profiling

// before
outputs_convert_hook - 0.73%

// after
outputs_convert_hook - 0.45%
```

Reviewed By: jerryzh168

Differential Revision: D32630885

Pulled By: vkuzo

fbshipit-source-id: 7ee84de742fc0c752b66d20d097405a754c8b480
2021-11-29 16:10:12 -08:00
be70477a7b dbr quant overhead[10/x]: disable torch_function overrides for leaf nodes (#68836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68836

If we have a leaf module like a `torch.nn.Conv2d`, DBR quant handles
the input and output of the module and should treat the inside of
this module as invisible.  Specifically, there is no need to override
the `F.conv2d` call if the parent module is already being overridden.

Before this PR, `__torch_function__` was still overridden for the insides
of leaf modules, and the override was a no-op.  There was some overhead
in these overrides because they were checking the hook type.

This PR adds a fast global override so we can skip overriding the insides
of leaf modules. This has some performance benefits in the prepared model,
because we now skip overriding all of the inner functions in observers.

Test Plan:
testing
```
python test/test_quantization.py TestQuantizeDBR
```

perf
```
// MobileNetV2, 1x3x224x224, comparing fp32 with dbr quant, Mac OS laptop

// before

fp32: 0.017837 seconds avg
fx_prepared: 0.021963 seconds avg, 0.812143 speedup vs fp32
fx_quantized: 0.012632 seconds avg, 1.412056 speedup vs fp32
dt_prepared: 0.034052 seconds avg, 0.523820 speedup vs fp32
dt_quantized: 0.018316 seconds avg, 0.973829 speedup vs fp32

// after

fp32: 0.020395 seconds avg
fx_prepared: 0.026969 seconds avg, 0.756230 speedup vs fp32
fx_quantized: 0.013195 seconds avg, 1.545611 speedup vs fp32
dt_prepared: 0.033432 seconds avg, 0.610023 speedup vs fp32
dt_quantized: 0.018244 seconds avg, 1.117866 speedup vs fp32

```

Reviewed By: jerryzh168

Differential Revision: D32630883

Pulled By: vkuzo

fbshipit-source-id: 6365e1c514726d8b2a4b3a51f114f5fed3ebe887
2021-11-29 16:08:52 -08:00
1342f19a8c Add ModuleInfo-based device transfer tests (#68092)
Summary:
Continuation of https://github.com/pytorch/pytorch/issues/65488; addresses the problem that got it reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68092

Reviewed By: mruberry

Differential Revision: D32299103

Pulled By: jbschlosser

fbshipit-source-id: bc298aca15368f2acb5082e6fb6eedea60b5d75f
2021-11-29 15:48:40 -08:00
89a145fd91 Sparse CSR CUDA: Add torch.sparse.sampled_addmm (#68007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68007

This PR adds a new function to the sparse module.
`sampled_addmm` computes α*(A @ B) * spy(C) + β*C, where C is a sparse CSR matrix and A, B are dense (strided) matrices.
This function is currently restricted to single 2D matrices; it doesn't support batched input.
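
A minimal usage sketch (shapes and scalars are illustrative; per this PR the kernel is CUDA-only, and the construction details are assumptions):

```
import torch

# C: sparse CSR matrix (here a 3x3 identity), A and B: dense matrices
C = torch.sparse_csr_tensor(
    torch.tensor([0, 1, 2, 3]), torch.tensor([0, 1, 2]),
    torch.ones(3, dtype=torch.float64), size=(3, 3), device="cuda")
A = torch.randn(3, 5, dtype=torch.float64, device="cuda")
B = torch.randn(5, 3, dtype=torch.float64, device="cuda")

# alpha * (A @ B), evaluated only on C's sparsity pattern, plus beta * C
out = torch.sparse.sampled_addmm(C, A, B, beta=0.5, alpha=2.0)
print(out.to_dense())
```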

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32435799

Pulled By: cpuhrsch

fbshipit-source-id: b1ffac795080aef3fa05eaeeded03402bc097392
2021-11-29 15:43:29 -08:00
af49805a73 Port lerp to structured kernels (#68924)
Summary:
Ref https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68924

Reviewed By: jbschlosser

Differential Revision: D32697409

Pulled By: bdhirsh

fbshipit-source-id: b098533e46f8bdbb995c76db0e6a124ab2b076b8
2021-11-29 15:11:30 -08:00
62847a2b9c Fix bug on empty GLOO_SOCKET_IFNAME_ENV (#68933)
Summary:
This PR fixes the "no device" bug that occurs when a user resets `GLOO_SOCKET_IFNAME_ENV` to an empty value with

```bash
export GLOO_SOCKET_IFNAME_ENV=
```

Thank you for your time reviewing this PR :).

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68933

Reviewed By: soulitzer

Differential Revision: D32690633

Pulled By: mrshenli

fbshipit-source-id: f6df2b8b067d23cf1ec177c77cc592dc870bda72
2021-11-29 15:05:38 -08:00
b468566208 Add ModuleInfo-based CPU / GPU parity tests (#68097)
Summary:
Continuation of https://github.com/pytorch/pytorch/issues/64694; fixes issues with the diff there

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68097

Reviewed By: mruberry

Differential Revision: D32300650

Pulled By: jbschlosser

fbshipit-source-id: f3a5e72b019d4eddd7202854999eab61fffc9006
2021-11-29 14:58:07 -08:00
fb63bb60ec Strided masked norm. (#68584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68584

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32581285

Pulled By: cpuhrsch

fbshipit-source-id: 896ee1e58957b46c2f6a16a170adff4cb3b8da62
2021-11-29 14:23:27 -08:00
f776f30780 Keep the sequence or mapping type in default_collate (#68779)
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a 3rd party library that I can't change, and provides extra functionality).

Note it's easy to do when the type can be constructed from an iterable, but that's not always the case (e.g., `range`).

Even though this can be accomplished by using a custom `default_collate`/`default_convert`, 1) this is behavior they should support out-of-the-box IMHO, and 2) `pin_memory` still does it.
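
A minimal sketch of the preserved-type behavior (the `Batch` subclass stands in for the 3rd-party type):

```
import torch
from torch.utils.data.dataloader import default_collate

class Batch(list):  # stand-in for a 3rd-party list subclass
    pass

samples = [Batch([torch.tensor(1), torch.tensor(2)]),
           Batch([torch.tensor(3), torch.tensor(4)])]

collated = default_collate(samples)
print(type(collated))  # Batch rather than plain list after this change
```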

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779

Reviewed By: wenleix

Differential Revision: D32651129

Pulled By: ejguan

fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
2021-11-29 13:14:20 -08:00
d9e7d85390 Remove TH/THC Storage (#68556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67852

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68556

Reviewed By: ejguan

Differential Revision: D32652758

Pulled By: ngimel

fbshipit-source-id: 170956fca112606f9008abe09b92c6ddc411be09
2021-11-29 12:55:20 -08:00
f5fa91ba2e Sparse: Add additional opinfo tests (#68886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68886

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32697933

Pulled By: cpuhrsch

fbshipit-source-id: fffdd1bc663cc1bc49abe8cf3680982d1cb497bc
2021-11-29 12:49:20 -08:00
3bd7dbf119 [Dist CI][BE] Remainder of c10d/store tests run in subprocess (#68822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68822

Per title, we switched over c10d_gloo and nccl and results look good
so far, so switch the rest of them as well. After this, the only dist tests
that won't run in subprocess are the pipe and fsdp tests, which historically
haven't had much flakiness.
ghstack-source-id: 144213522

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32624330

fbshipit-source-id: 469f613e5b0e4529e6b23ef259d948837d4af26b
2021-11-29 10:59:39 -08:00
250d0bd20b [RPC][Dist CI][BE] RPC tests run in subprocess (#68821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68821

Continuing effort to move most distributed tests to run in subprocess
for better reproducibility + reduce flakiness.
ghstack-source-id: 144213520

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32624199

fbshipit-source-id: 04448636320554d7a3ab29ae92bc1ca9fbe37da2
2021-11-29 10:58:08 -08:00
51f4ac40fd ci: Use default blank if no TEST_CONFIG (#69008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69008

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32699051

Pulled By: seemethere

fbshipit-source-id: 9ed12fe8a7f541c6eda77182cfd1b0a733a545f0
2021-11-29 10:05:20 -08:00
ee59a09772 Implement sharding for MacOS jobs (#68784)
Summary:
Do not run distributed tests as part of a separate shard, but keep them inside one of the two shards (to limit concurrency problems).
Fixes https://github.com/pytorch/pytorch/issues/68260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68784

Reviewed By: seemethere, janeyx99

Differential Revision: D32653440

Pulled By: malfet

fbshipit-source-id: ebe5bbc30bdf67e930f2c766c920932700f3a4e4
2021-11-29 09:31:42 -08:00
61a4204d80 Sparse CSR CUDA: Add block torch.addmm when mat1 is sparse (#68707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68707

This PR adds a path for block CSR matrices for `torch.addmm`. cuSPARSE interface is restricted to 32-bit indices and square blocks.
My plan is to make everything work and tests passing using an unsafe constructor first, keeping it all private. Then discuss & implement constructors with block information separately unlocking the functions for wider use. Documentation will come with the update to constructors.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32650366

Pulled By: cpuhrsch

fbshipit-source-id: 430a9627901781ee3d2e2496097b71ec17727d98
2021-11-29 08:58:49 -08:00
9ee5db490b neg_sparse: Fix output dtype (#68885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68885

`torch.neg` should preserve the input dtype but for sparse tensors it
was promoting integers to floating point. This would have been picked
up by the OpInfo-based test, but `neg` wasn't marked with
`supports_sparse=True` so it was never run.
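
A minimal illustration of the fixed behavior:

```
import torch

x = torch.tensor([1, -2, 0]).to_sparse()
print(torch.neg(x).dtype)  # torch.int64 with this fix; previously promoted to float
```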

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32680008

Pulled By: cpuhrsch

fbshipit-source-id: 502f8743c1c33ab802e3d9d097792887352cd220
2021-11-29 08:48:22 -08:00
7b701ce2d4 Add set_to_none option to C++ API (#68801)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68167.

Signed-off-by: Vinnam Kim <vinnam.kim@makinarocks.ai>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68801

Reviewed By: mruberry

Differential Revision: D32625239

Pulled By: jbschlosser

fbshipit-source-id: 5f09b959e23d5448106a47029d06ec20ad094d82
2021-11-29 08:42:39 -08:00
787ded5103 Add lazy::Shape::numel() (#68314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68314

Add a convenience to lazy::Shape for counting the number of elements (by multiplying out the dimensions). This mirrors the numel() method on Tensor, and in switching other lazy tensor shape utils to use aten shape inference, we need numel counts.

Test Plan: add unit tests

Reviewed By: alanwaketan

Differential Revision: D32409138

fbshipit-source-id: 3ae725300f8826d38e45412f46501d5e5f776fb2
2021-11-29 08:38:09 -08:00
3d504ae1b4 [RELAND] Fix Dispatching not considering List[Optional[Tensor]] for dispatch (#68073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68073

Relanding the original PR. Its body was as follows:

Followup to https://github.com/pytorch/pytorch/pull/60787

It turns out that the original PR was wrong for unboxed kernels. We
recently ran into this in
https://github.com/facebookresearch/functorch/issues/124

For unboxed kernels, the correct type for a Tensor?[] argument is
actually `List<optional<Tensor>>`, not `ArrayRef<optional<Tensor>>`
ghstack-source-id: 144204580

Test Plan:
- assert that https://github.com/facebookresearch/functorch/issues/124
actually works

Reviewed By: gchanan

Differential Revision: D32313601

Pulled By: zou3519

fbshipit-source-id: 8028d5f34eecabc53d603bd54d6b6748b5db461a
2021-11-29 08:31:55 -08:00
17ba936da0 .github: Migrate XLA tests to GHA (#64320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64320

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30684490

Pulled By: seemethere

fbshipit-source-id: 5d2657f9aa4c7082591239a5bb095cc85d2cde66
2021-11-29 08:30:57 -08:00
f398320e0d packaging: Include lazy headers in package_data (#68817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68817

Looks like these files are getting used by downstream xla so we need to
include them in our package_data

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32622241

Pulled By: seemethere

fbshipit-source-id: 7b64e5d4261999ee58bc61185bada6c60c2bb5cc
2021-11-29 08:29:48 -08:00
871cd7c5b9 Forward-mode AD support for torch.split, torch.split_with_sizes (#68566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68566

These are just auto-linear as pointed out by Jeffrey.
ghstack-source-id: 143814393

Test Plan: - Run OpInfo tests.

Reviewed By: albanD, soulitzer

Differential Revision: D32520239

Pulled By: zou3519

fbshipit-source-id: 807115157b131e6370f364f61db1b14700279789
2021-11-29 07:50:53 -08:00
3315c4b31e add instructions for unhandled exceptions in assert_close (#68722)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68722

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32684446

Pulled By: mruberry

fbshipit-source-id: 04fe5730721d24e44692cdc9bb327484356ead3f
2021-11-28 21:35:53 -08:00
d095f498a0 Tensor docs (#63308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62146.

Modernizes and clarifies the documentation of torch.tensor and torch.as_tensor, highlighting the distinction in their copying behavior and preservation of autograd history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63308

Reviewed By: albanD, ngimel

Differential Revision: D30338025

Pulled By: mruberry

fbshipit-source-id: 83a0c113e4f8fce2dfe086054562713fe3f866c2
2021-11-28 21:26:12 -08:00
6ae34ea6f8 Revert D32521980: Add linalg.lu_factor
Test Plan: revert-hammer

Differential Revision:
D32521980 (b10929a14a)

Original commit changeset: 26a49ebd87f8

fbshipit-source-id: e1a6bb9c2ece9bd78190fe17e16a46e3358c5c82
2021-11-28 17:22:15 -08:00
b10929a14a Add linalg.lu_factor (#66933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66933

This PR exposes `torch.lu` as `torch.linalg.lu_factor` and
`torch.linalg.lu_factor_ex`.

This PR also adds support for matrices with zero elements both in
the size of the matrix and the batch. Note that this function simply
returns empty tensors of the correct size in this case.

We add a test and an OpInfo for the new function.

This PR also adds documentation for this new function, in line with
the documentation in the rest of `torch.linalg`.
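
A small usage sketch of the new API:

```
import torch

A = torch.randn(3, 3, dtype=torch.float64)
LU, pivots = torch.linalg.lu_factor(A)

# The factorization can be reused for several right-hand sides,
# here via the pre-existing torch.lu_solve:
b = torch.randn(3, 2, dtype=torch.float64)
x = torch.lu_solve(b, LU, pivots)
print(torch.allclose(A @ x, b))  # True
```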

Fixes https://github.com/pytorch/pytorch/issues/56590
Fixes https://github.com/pytorch/pytorch/issues/64014

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32521980

Pulled By: mruberry

fbshipit-source-id: 26a49ebd87f8a41472f8cd4e9de4ddfb7f5581fb
2021-11-27 17:52:48 -08:00
01ddd5dde6 [opinfo] use dtypes instead of dtypesIfCPU (#68732)
Summary:
Reland https://github.com/pytorch/pytorch/issues/67619

Replace usage of dtypesIfCPU with dtypes in OpInfo class and also make it a mandatory argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68732

Reviewed By: jbschlosser

Differential Revision: D32594344

Pulled By: mruberry

fbshipit-source-id: 660b38aef97752ba064228e8989041ed1d5777fe
2021-11-27 16:07:51 -08:00
cffad597ea Tune test_reference_numerics_normal (#68019)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68019

Reviewed By: albanD

Differential Revision: D32482535

Pulled By: mruberry

fbshipit-source-id: 48300a5c6a4484fb81789f9049d3f08272d9f31c
2021-11-26 18:59:31 -08:00
5fdcc20d8d [JIT][Symbolic Shape Analysis] expose op shape functions (#68748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68748

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32598605

Pulled By: makslevental

fbshipit-source-id: c97a06cd0fe143a6ea14db65fc5d3f76abdff312
2021-11-24 17:17:01 -08:00
f14c16e509 Revert D32599540: [pytorch][PR] implemented 'torch.distributions.constraints.symmetric' checking if the tensor is symmetric at last 2 dimension.
Test Plan: revert-hammer

Differential Revision:
D32599540 (bc3bdbc8f4)

Original commit changeset: 9227f7e99318

fbshipit-source-id: edfe7072073d910a49be52e1b8c2d374ef71e9ec
2021-11-24 17:15:31 -08:00
c2e3b92db4 partial revert of D32522826 (#68889)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68889

Reviewed By: cpuhrsch, ejguan

Differential Revision: D32650385

Pulled By: Krovatkin

fbshipit-source-id: 2c4a30cfc729a023b592b6b6e1959bbd2ad6f7cf
2021-11-24 17:05:20 -08:00
4afa5ea0ab native_functions.yaml: remove SparseXPU which is added by accident (#68791)
Summary:
gen_backend_stubs.py will raise an assertion when generating code with
the SparseXPU dispatch key for external backends, if SparseXPU is in
native_functions.yaml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68791

Reviewed By: cpuhrsch, ejguan

Differential Revision: D32646303

Pulled By: bdhirsh

fbshipit-source-id: 64e42cc40468bc8c696a31b4b7c0cc3728866a64
2021-11-24 15:34:17 -08:00
c5f63f859e Add slow path to getCustomClassTypeImpl (#68717)
Summary:
This fixes a custom class registration issue when `typeid` is not guaranteed to be unique across multiple libraries, which is the case for the libc++ runtime on MacOS 11, in particular on M1.
From [libcxx/include/typeinfo](78d6a7767e/include/typeinfo (L139)):
```
// -------------------------------------------------------------------------- //
//                          NonUniqueARMRTTIBit
// -------------------------------------------------------------------------- //
// This implementation of type_info does not assume always a unique copy of
// the RTTI for a given type inside a program. It packs the pointer to the
// type name into a uintptr_t and reserves the high bit of that pointer (which
// is assumed to be free for use under the ABI in use) to represent whether
// that specific copy of the RTTI can be assumed unique inside the program.
// To implement equality-comparison of type_infos, we check whether BOTH
// type_infos are guaranteed unique, and if so, we simply compare the addresses
// of their type names instead of doing a deep string comparison, which is
// faster. If at least one of the type_infos can't guarantee uniqueness, we
// have no choice but to fall back to a deep string comparison.
```

But the `std::type_index` hash is always computed assuming that the implementation is unique.
Adding a slow path fixes this problem in those scenarios.

Fixes https://github.com/pytorch/pytorch/issues/68039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68717

Reviewed By: seemethere

Differential Revision: D32605187

Pulled By: malfet

fbshipit-source-id: 8d50e56885b8c97dad3bc34a69c47ef879456dd1
2021-11-24 15:00:47 -08:00
14dc9759f2 Revert D32650384: OpInfos for torch.{flatten, column_stack}
Test Plan: revert-hammer

Differential Revision:
D32650384 (aceb46e4ce)

Original commit changeset: 9ead83b378d0

fbshipit-source-id: 3ef281e536b1f21a6f13c6c51309021cf92b53b2
2021-11-24 14:55:26 -08:00
96929ea995 Update empty and empty_like examples in docs (#68874)
Summary:
For some reason, the example for `torch.empty` showed the usage of `torch.empty_like` and the other way around. These are now swapped.

Fixes https://github.com/pytorch/pytorch/issues/68799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68874

Reviewed By: wenleix

Differential Revision: D32646645

Pulled By: ejguan

fbshipit-source-id: c8298bcaca450aaa4abeef2239af2b14cadc05b3
2021-11-24 14:01:06 -08:00
d44e610efa [CUDA Pinned Memory] Event recording with non-blocking copies should track the storage context, not the tensor data pointer (#68749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68749

The logic for asynchronous copies (either HtoD or DtoH) using cudaMemcpyAsync relies on recording an event with the caching host allocator to notify it that a given allocation has been used on a stream - and thus it should wait for that stream to proceed before reusing the host memory.

This tracking is based on the allocator maintaining a map from storage allocation pointers to some state.

If we try to record an event for a pointer we don't understand, we will silently drop the event and ignore it (9554ebe44e/aten/src/ATen/cuda/CachingHostAllocator.cpp (L171-L175)).

Thus, if we use the data_ptr of a Tensor instead of the storage allocation, then reasonable code can lead to incorrectness due to missed events.

One way this can occur is simply by slicing a tensor into sub-tensors - which have different values of `data_ptr()` but share the same storage, for example:

```
image_batch = torch.randn(M, B, C, H, W).pin_memory()
for m in range(M):
  sub_batch = image_batch[m].cuda(non_blocking=True)
  # sub_batch.data_ptr() != image_batch.data_ptr() except for m == 0.
  # however, sub_batch.storage().data_ptr() == image_batch.storage().data_ptr() always.
```

Therefore, we instead use the storage context pointer when recording events, as this is the same state that is tracked by the caching allocator itself. This is a correctness fix, although it's hard to determine how widespread this issue is.

Using the storage context also allows us to use a more efficient structure internally to the caching allocator, which will be sent in future diffs.

Test Plan: Test added which demonstrates the issue, although it's hard to demonstrate the race explicitly.

Reviewed By: ngimel

Differential Revision: D32588785

fbshipit-source-id: d87cc5e49ff8cbf59052c3c97da5b48dd1fe75cc
2021-11-24 13:20:22 -08:00
bc3bdbc8f4 implemented 'torch.distributions.constraints.symmetric' checking if the tensor is symmetric at last 2 dimension. (#68644)
Summary:
Implemented submodule for https://github.com/pytorch/pytorch/issues/68050
Opened cleaned, final version of PR for https://github.com/pytorch/pytorch/issues/68240

Explanation:
I am trying to contribute to PyTorch by implementing distributions for symmetric matrices, like the Wishart and Inverse Wishart distributions. Although there is an LKJ distribution for the Cholesky decomposition of correlation matrices, it only corresponds to a restricted form of the Wishart distribution. [https://arxiv.org/abs/1809.04746](https://arxiv.org/abs/1809.04746) Thus, I started implementing the Wishart and Inverse Wishart distributions separately.

I added a short implementation of 'torch.distributions.constraints.symmetric', which was not previously included in 'torch.distributions.constraints'. That module contains constraints like 'positive_definite', but those just assume the input matrix is symmetric. [Link](1adeeabdc0/torch/distributions/constraints.py (L466)) So I think it is better to have a constraint that checks the symmetry of tensors in PyTorch.

We may further utilize it like
`constraints.stack([constraints.symmetric, constraints.positive_definite])`
for the covariance matrix constraint in the Multivariate Normal distribution, for example, to check whether a random matrix is symmetric positive definite.
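
A minimal sketch of such a symmetry check over the last two dimensions (the merged constraint may differ in its details):

```
import torch

def is_symmetric(value, atol=1e-6, rtol=1e-6):
    # Batched check: compare each matrix with its transpose over the
    # last two dimensions, then reduce those two dimensions.
    close = torch.isclose(value, value.transpose(-2, -1), atol=atol, rtol=rtol)
    return close.all(-1).all(-1)

A = torch.randn(4, 4)
print(is_symmetric(A + A.t()))  # tensor(True)
```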

cc fritzo neerajprad alicanb nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68644

Reviewed By: jbschlosser

Differential Revision: D32599540

Pulled By: neerajprad

fbshipit-source-id: 9227f7e9931834a548a88da69e4f2e9af7732cfe
2021-11-24 13:13:28 -08:00
1940cc028e [quant][graphmode][fx] Fork subgraph_rewriter from torch.fx to quantization (#68228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68228

Forking this for now so that we can make changes as we need; the changes can be merged back into torch.fx later.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32537713

fbshipit-source-id: 326598d13645fcc28ef2c66baaac6a077b80fd0c
2021-11-24 10:49:05 -08:00
aceb46e4ce OpInfos for torch.{flatten, column_stack} (#67555)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67555

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D32650384

Pulled By: anjali411

fbshipit-source-id: 9ead83b378d0ece60569e1a0fc7d8849f89566b3
2021-11-24 10:25:37 -08:00
cf54416925 Add docs entry for adjoint. (#68869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68869

As per title.

cc brianjo mruberry anjali411

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32647456

Pulled By: anjali411

fbshipit-source-id: 2cb053a6884e2b22d3decc058e86d10f355fcb84
2021-11-24 10:03:41 -08:00
c7d5e0f53f OpInfos for torch.atleast_{1d, 2d, 3d} (#67355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67355

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32649416

Pulled By: anjali411

fbshipit-source-id: 1b42e86c7124427880fff52fbe490481059da967
2021-11-24 09:55:39 -08:00
b69155f754 Avoid dtype mismatch error in torch.save if storages are unallocated (#68787)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58970

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68787

Reviewed By: mruberry

Differential Revision: D32617425

Pulled By: anjali411

fbshipit-source-id: fe7f2374e4ef4428346a0a202cae8e0d382e03ab
2021-11-24 09:51:29 -08:00
208e109dbf Revert D32633806: Sparse CSR CUDA: Add block torch.addmm when mat1 is sparse
Test Plan: revert-hammer

Differential Revision:
D32633806 (b28ddd72d3)

Original commit changeset: b98db0bd655c

fbshipit-source-id: 1c757628526bb1b88747257fc77d8b9cb996e502
2021-11-24 09:15:17 -08:00
7802953dd5 [nnc][quantization] quantized ops for BI bytedoc via aten (#68790)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68790

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32609427

Pulled By: IvanKobzarev

fbshipit-source-id: de8f4209befe2509f5033888c739554470768290
2021-11-24 08:59:44 -08:00
31d36fd35d fix sccache issue on Windows CPU (#68870)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68796

```
2021-11-24T10:12:40.7634007Z Compile requests                   4312
2021-11-24T10:12:40.7634484Z Compile requests executed          4300
2021-11-24T10:12:40.7634823Z Cache hits                         4227
2021-11-24T10:12:40.7635122Z Cache hits (C/C++)                 4227
2021-11-24T10:12:40.7636139Z Cache misses                         62
2021-11-24T10:12:40.7636930Z Cache misses (C/C++)                 62
2021-11-24T10:12:40.7637333Z Cache timeouts                        0
2021-11-24T10:12:40.7637839Z Cache read errors                     0
2021-11-24T10:12:40.7638161Z Forced recaches                       0
2021-11-24T10:12:40.7638489Z Cache write errors                    0
2021-11-24T10:12:40.7638828Z Compilation failures                  1
2021-11-24T10:12:40.7639180Z Cache errors                         10
2021-11-24T10:12:40.7639490Z Cache errors (C/C++)                 10
2021-11-24T10:12:40.7639856Z Non-cacheable compilations            0
2021-11-24T10:12:40.7640244Z Non-cacheable calls                   0
2021-11-24T10:12:40.7640601Z Non-compilation calls                12
2021-11-24T10:12:40.7640987Z Unsupported compiler calls            0
2021-11-24T10:12:40.7641426Z Average cache write               0.104 s
2021-11-24T10:12:40.7641763Z Average cache read miss           6.000 s
2021-11-24T10:12:40.7642110Z Average cache read hit            0.046 s
2021-11-24T10:12:40.7642485Z Failed distributed compilations       0
```
https://github.com/pytorch/pytorch/runs/4310176911?check_suite_focus=true

cc seemethere malfet pytorch/pytorch-dev-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68870

Reviewed By: ejguan

Differential Revision: D32646289

Pulled By: janeyx99

fbshipit-source-id: bf04446439e55a4ccaf9ce7c77812752ca717a7c
2021-11-24 08:04:59 -08:00
be7e159e71 Remove extraneous logging (#68830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68830

No logical changes; this removes a logging statement that was accidentally committed.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang jjlilley mrzzd

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32628711

Pulled By: H-Huang

fbshipit-source-id: 070190b92f97c8e38d8bb03124c13cb061fc9ec1
2021-11-24 07:15:50 -08:00
7d8a79b6f3 [nnc] llvm_codegen quantization types for vectype (#68736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68736

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32596261

Pulled By: IvanKobzarev

fbshipit-source-id: 0388c3b5ae58eb16921d25d9a784f82f1bb924fc
2021-11-24 01:17:39 -08:00
b28ddd72d3 Sparse CSR CUDA: Add block torch.addmm when mat1 is sparse (#68707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68707

This PR adds a path for block CSR matrices for `torch.addmm`. cuSPARSE interface is restricted to 32-bit indices and square blocks.
My plan is to make everything work and tests passing using an unsafe constructor first, keeping it all private. Then discuss & implement constructors with block information separately unlocking the functions for wider use. Documentation will come with the update to constructors.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32633806

Pulled By: cpuhrsch

fbshipit-source-id: b98db0bd655cce651a5da457e78fca08619a5066
2021-11-23 22:55:46 -08:00
b5b62b3408 Cleanup old TD logic (#68842)
Summary:
Remove `--determine-from` option from run_test.py and remove all
references from corresponding test scripts

Followup after https://github.com/pytorch/pytorch/pull/64921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68842

Reviewed By: seemethere, janeyx99

Differential Revision: D32631418

Pulled By: malfet

fbshipit-source-id: bdb5dd888c1d97dfaf95c1f299bf8073f3de9588
2021-11-23 18:45:42 -08:00
d9f3feb5a2 [SR] Use std::vector::reserve for StaticModule constants (#68834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68834

This diff uses std::vector::reserve for constructing constants in StaticModule. We can also avoid two extra iterations over all the graph nodes.

This diff should improve performance by a tiny bit.

Test Plan: - [x] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1

Reviewed By: mikeiovine

Differential Revision: D32628806

fbshipit-source-id: 99dd2a7a36e86899ca1fe5300f3aa90d30a43726
2021-11-23 18:00:04 -08:00
8fb9ce4927 Update Documentation to Make CUDA Call Explicit (#67973)
Summary:
This clarifies the docs by making the call to cudaStreamWaitEvent explicit.

Fixes https://github.com/pytorch/pytorch/issues/67866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67973

Reviewed By: mruberry

Differential Revision: D32620261

Pulled By: ngimel

fbshipit-source-id: 1fc8beb2062baaddb013ea4d7b10da2baa10f15e
2021-11-23 16:25:37 -08:00
79b67d9a4a [Quant] Refactor handling of FixedQParams operators (#68143)
Summary:
**Summary**: FixedQParams operators do not need fake quantization
in the prepare step. This commit introduces FixedQParamsObserver
and makes FixedQParamsFakeQuantize a simple wrapper around this
observer. It also removes the fake quantize logic in forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68143

Test Plan:
Added two tests:
python3 test/test_quantization.py TestQuantizeFx.test_fixed_qparams_patterns
python3 test/test_quantization.py TestQuantizeFx.test_register_patterns

**Reviewers**: Jerry Zhang

**Subscribers**: Jerry Zhang, Supriya Rao

**Tasks**: T104942885

**Tags**: pytorch

Reviewed By: albanD

Differential Revision: D32484427

Pulled By: andrewor14

fbshipit-source-id: 5a048b90eb4da79074c5ceffa3c8153f8d8cd662
2021-11-23 15:26:10 -08:00
998daf44d6 Allow get_attr nodes to be int64 type (#68818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68818

Operator support was blocking all nodes with dtype int64 from lowering. This diff eases the condition, allowing inputs from get_attr nodes (which are known not to be used for TRT compute) to have dtype int64.

Reviewed By: brad-mengchi, 842974287

Differential Revision: D32609457

fbshipit-source-id: ea255f3281349a4254cb6abdeed671ab2c0216ba
2021-11-23 15:21:47 -08:00
78dce417a1 [BE] Simplify magma installation logic (#68778)
Summary:
The difference between `CUDA_VERSION` and the magma package name is just the dot between major and minor versions.

While refactoring, I discovered that some docker images set `CUDA_VERSION` to include a patch revision, so the pattern was modified to strip it, i.e. `cuda-magma102` would be installed for `CUDA_VERSION=10.2.89` and `cuda-magma113` would be installed for `CUDA_VERSION=11.3.0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68778

Reviewed By: seemethere

Differential Revision: D32605365

Pulled By: malfet

fbshipit-source-id: 43f8edeee5b55fdea6b4d9943874df8e97494ba1
2021-11-23 14:57:44 -08:00
2cd48d14ef Fix test_svd_errors_and_warnings warning message when cuda >= 11.5 (#68683)
Summary:
In SVD cusolverDnXgesvd computations:

When CUDA < 11.5, cusolver raises CUSOLVER_STATUS_EXECUTION_FAILED when the input contains NaN.
When CUDA >= 11.5, cusolver finishes execution normally and sets the info array to indicate a convergence issue.

Related: https://github.com/pytorch/pytorch/issues/68259 #64533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68683

Reviewed By: dagitses

Differential Revision: D32583576

Pulled By: mruberry

fbshipit-source-id: f732872522e0bda2703450ffcc64ae3a0d3f5bc0
2021-11-23 14:16:23 -08:00
8e343ba5db Revert D32611368: [pytorch][PR] Initial version of general convolution_backward
Test Plan: revert-hammer

Differential Revision:
D32611368 (445b31abff)

Original commit changeset: 26d759b7c908

fbshipit-source-id: e91f45f0f31150e60d657a3964b7e42027beff58
2021-11-23 13:39:36 -08:00
84047ff342 Add API usage logging to ShardedTensor and fix a few tests. (#68771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68771

ghstack-source-id: 143974518

Test Plan: waitforbuildbot

Reviewed By: fduwjj, wanchaol

Differential Revision: D32601562

fbshipit-source-id: ed624137efab94fbe556609bb40cca14e69d9bac
2021-11-23 13:30:59 -08:00
959cb03132 Populate operator_input_sizes_ (#68542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68542

title

Test Plan: unittest

Reviewed By: iseeyuan

Differential Revision: D32508159

fbshipit-source-id: 0773a725973a493f19a2e9a340365e559dfdf7f8
2021-11-23 12:18:06 -08:00
c0e6dc9ac7 [pytorch] Fix loading from checkpoint after "maximize" flag was introduced in SGD (#68733)
Summary:
After the 'maximize' flag was introduced in https://github.com/pytorch/pytorch/issues/46480, some jobs fail because they resume training from old checkpoints.

After we load an old checkpoint, the optimizer.step() call during the backward pass fails (torch/optim/sgd.py, line 129) because there is no 'maximize' key in the SGD parameter groups.

To circumvent this, I add a default value via `group.setdefault('maximize', False)` when the optimizer state is restored.
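
A minimal sketch of the approach (the actual fix lives in SGD's own __setstate__; the subclass here is just for illustration):

```
import torch

class PatchedSGD(torch.optim.SGD):
    def __setstate__(self, state):
        super().__setstate__(state)
        # Checkpoints written before the flag existed have no 'maximize' key,
        # so backfill the old default when the optimizer state is restored.
        for group in self.param_groups:
            group.setdefault('maximize', False)
```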

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68733

Reviewed By: albanD

Differential Revision: D32480963

Pulled By: asanakoy

fbshipit-source-id: 4e367fe955000a6cb95090541c143a7a1de640c2
2021-11-23 11:42:16 -08:00
73f494d690 .circleci: Remove migrated CUDA 10.2 build (#68782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68782

These builds are no longer required for slow_gradcheck and should be
removed

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D32606679

Pulled By: seemethere

fbshipit-source-id: e4827a6f217b91c34cfab6c2340e3272f3db1522
2021-11-23 09:50:53 -08:00
23288fdacc Making norms inputs independent (#68526)
Summary:
An update to https://github.com/pytorch/pytorch/issues/67442 to make sure all of the inputs produced are independent.

Updates group_norm and instance_norm (local_response_norm was already producing independent inputs).

Also fixes a bug in one set of instance_norm inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68526

Reviewed By: ngimel

Differential Revision: D32532076

Pulled By: samdow

fbshipit-source-id: 45b9320fd9aecead052b21f838f95887cfb71821
2021-11-23 09:41:36 -08:00
e7e1b76106 Require CMake 3.13 when building with Ninja (#68731)
Summary:
There is a bug in CMake's Ninja generator where files considered inputs to the cmake command can't be generated by another build step. The fix was included in CMake 3.13, but 3.10.3 is still sufficient for other cmake generators, e.g. Makefiles.
For reference, the bug is here: https://gitlab.kitware.com/cmake/cmake/-/issues/18584

This is necessary for https://github.com/pytorch/pytorch/issues/68246 but I'm isolating the change here to make testing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68731

Reviewed By: jbschlosser

Differential Revision: D32604545

Pulled By: malfet

fbshipit-source-id: 9bc0bd8641ba415dd63ce21a05c177e2f1dd9866
2021-11-23 09:34:20 -08:00
3282386aa4 Added additional string to search cpu flags for vnni detection (#67686)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67685

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67686

Reviewed By: ejguan

Differential Revision: D32109038

Pulled By: malfet

fbshipit-source-id: 3ea6e4cc1aa82831fd6277129a67c8241a5591a5
2021-11-23 09:32:53 -08:00
98e51895ef [dist_quant] change op registration to each file instead (#68797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68797

This changes dist quantization op registration to happen in each file instead, allowing the torch deploy test to pass
ghstack-source-id: 143994945

Test Plan: wait for sc

Reviewed By: jbschlosser

Differential Revision: D32610679

fbshipit-source-id: 3ade925286f1ed0f65017939f1ad3f5c539e1767
2021-11-23 09:20:26 -08:00
445b31abff Initial version of general convolution_backward (#65219)
Summary:
Towards [convolution consolidation](https://fb.quip.com/tpDsAYtO15PO).

Introduces the general `convolution_backward` function that uses the factored-out backend routing logic from the forward function.

Some notes:
* `finput` is now recomputed in the backward pass for the slow 2d / 3d kernels instead of being saved from the forward pass. The logic for this is based on the forward computation and is present in the `compute_finput2d` / `compute_finput3d` functions in `ConvUtils.h`.
* Using structured kernels for `convolution_backward` requires extra copying since the backend-specific backward functions return tensors. Porting to structured is left as future work.
* The tests that check the routing logic have been renamed from `test_conv_backend_selection` -> `test_conv_backend` and now also include gradcheck validation using an `autograd.Function` hooking up `convolution` to `convolution_backward`. This was done to ensure that gradcheck passes for the same set of inputs / backends.

The forward pass routing is done as shown in this flowchart (probably need to download it for it to be readable since it's ridiculous):
![conv_routing_graph md](https://user-images.githubusercontent.com/75754324/137186002-5bca75ca-f911-4e61-8245-ec07af841506.png)

![conv_nogroup_routing_graph md](https://user-images.githubusercontent.com/75754324/139731619-9d0d436e-cce3-4bc3-8eaf-d469f667f0d7.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65219

Reviewed By: mruberry

Differential Revision: D32611368

Pulled By: jbschlosser

fbshipit-source-id: 26d759b7c908ab8f19ecce627acea7bd3d5f59ba
2021-11-23 08:19:45 -08:00
a31aea8eaa [quant][graphmode][fx] Add support for specifying reference quantized module mapping in backend_config_dict (#68227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68227

This PR adds two keys to backend_config_dict:
"root_module": the root module for the pattern (since we may have patterns for fused ops)
"reference_quantized_module_for_root": the corresponding reference quantized module for the root

Test Plan:
```
python test/test_quant_trt.py TestQuantizeFxTRTOps
python test/test_quant_trt.py TestConvertFxDoNotUse
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32537711

fbshipit-source-id: 6b8f36a219db7bb6633dac53072b748ede8dfa78
2021-11-22 21:35:04 -08:00
b845b9876b [sparsity] Fix for the failing pruner test (#68794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68794

The pruner `test_constructor` fails because of a typo in the regular expression matching for the error that the pruner throws.
This fixes it.

Test Plan:
Separate test is not needed -- single letter change.
Previous test: `python test/test_ao_sparsity.py -- TestBasePruner`

Reviewed By: ngimel

Differential Revision: D32609589

Pulled By: z-a-f

fbshipit-source-id: 800ef50c8cdbf206087bc6f945d1830e4af83c03
2021-11-22 21:07:24 -08:00
d6a68e0b8d [PyTorch][3/N] Enable the rest forward spec options for ShardedEmbedding and ShardedEmbeddingBag. (#67799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67799

We enabled sharded embedding and embedding bag in https://github.com/pytorch/pytorch/pull/67188 and https://github.com/pytorch/pytorch/pull/66604. We now want to support as many parameters as defined in the docs as possible: https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding_bag.html, https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html.

For the ones that we don't support, we just throw an exception.

Last but not least, we use `.get()` to fetch params instead of indexing directly by key.
ghstack-source-id: 143987066

Test Plan: Unit test & CI

Reviewed By: pritamdamania87

Differential Revision: D31985333

fbshipit-source-id: 3794241b81eecc815bc4390679d0bb0323f4ae72
2021-11-22 20:33:03 -08:00
5d300e761d Add OpInfos for parcel Activation Functions I (#68521)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68521

Reviewed By: jbschlosser

Differential Revision: D32606625

Pulled By: saketh-are

fbshipit-source-id: acf98a07c45bce95b1470bf9856577426265f3d1
2021-11-22 20:01:35 -08:00
74e6d2ce67 fix typos in jit_language_reference.rst (#68706)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68700

- indent problem

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68706

Reviewed By: mruberry

Differential Revision: D32598916

Pulled By: jbschlosser

fbshipit-source-id: 42af216e83fb48bbd311fc3d41fc3e8f5a2fef08
2021-11-22 19:09:06 -08:00
e7d8f096c9 [sparsity] Fix GPU training for sparsity (#66412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66412

GPU training was not supported in the sparsifier: when the sparsifier was
created, the masks would default to the CPU, so attaching a GPU model to
the sparsifier would throw an error.
The solution is to create the masks on the same device as the weight.
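
A minimal sketch of the idea (illustrative, not the sparsifier's actual code):

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
weight = torch.randn(8, 8, device=device)

# Create the mask on the same device as the weight it masks.
mask = torch.ones_like(weight, dtype=torch.bool)
assert mask.device == weight.device
```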

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31590675

Pulled By: z-a-f

fbshipit-source-id: 98c2c1cedc7c60aecea4076e5254ef6b3443139e
2021-11-22 16:49:39 -08:00
0b0674121a Fix strict aliasing rule violation in bitwise_binary_op (#66194)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66119

Failure on ARM Neoverse N1 before this PR:
```
======================================================================
FAIL: test_bitwise_ops_cpu_int16 (__main__.TestBinaryUfuncsCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 373, in instantiated_test
    result = test(self, **param_kwargs)
  File "test_binary_ufuncs.py", line 315, in test_bitwise_ops
    self.assertEqual(op(a, b), op(a_np, b_np))
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1633, in assertEqual
    self.assertEqual(
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1611, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!Found 176 different element(s) (out of 225), with the greatest difference of 21850 (-21846 vs. 4) occuring at index (0, 2).

======================================================================
FAIL: test_bitwise_ops_cpu_int32 (__main__.TestBinaryUfuncsCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 373, in instantiated_test
    result = test(self, **param_kwargs)
  File "test_binary_ufuncs.py", line 315, in test_bitwise_ops
    self.assertEqual(op(a, b), op(a_np, b_np))
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1633, in assertEqual
    self.assertEqual(
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 1611, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!Found 188 different element(s) (out of 225), with the greatest difference of 1335341061 (-1335341056 vs. 5) occuring at index (14, 8).

----------------------------------------------------------------------
```
which passes now.

CC malfet ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66194

Reviewed By: dagitses, bdhirsh, ngimel

Differential Revision: D31430274

Pulled By: malfet

fbshipit-source-id: bcf1c9d584c02eff328dd5b1f7af064fac5942c9
2021-11-22 16:43:09 -08:00
d176c82bd5 [sparsity] Fix and enable the pruning tests (#66411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66411

The original tests were disabled, and had some bugs. This fixes those unittests.

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31590678

Pulled By: z-a-f

fbshipit-source-id: ddbed34cc01d5f15580cb8f0033416f2f9780068
2021-11-22 15:28:12 -08:00
b46c89d950 Add linalg.solve_triangular (#63568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63568

This PR adds the first solver with structure to `linalg`. This solver
has an API compatible with that of `linalg.solve` preparing these for a
possible future merge of the APIs. The new API:
- Just returns the solution, rather than the solution and a copy of `A`
- Removes the confusing `transpose` argument and replaces it by a
correct handling of conj and strides within the call
- Adds a `left=True` kwarg. This can be achieved via transposes of the
inputs and the result, but it's exposed for convenience (a usage sketch follows below).
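
A usage sketch of the new API (shapes and values are illustrative):

```python
import torch

A = torch.triu(torch.randn(3, 3))
A += 3 * torch.eye(3)              # keep the system well-conditioned
B = torch.randn(3, 2)

# Solves A X = B; left=False would solve X A = B instead.
X = torch.linalg.solve_triangular(A, B, upper=True, left=True)
assert torch.allclose(A @ X, B, atol=1e-5)
```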

This PR also implements a dataflow that minimises the number of copies
needed before calling LAPACK / MAGMA / cuBLAS and takes advantage of the
conjugate and neg bits.

This algorithm is implemented for `solve_triangular` (which, for this, is
the most complex of all the solvers due to the `upper` parameters).
Once more solvers are added, we will factor out this calling algorithm,
so that all of them can take advantage of it.

Given the complexity of this algorithm, we implement some thorough
testing. We also added tests for all the backends, which was not done
before.

We also add forward AD support for `linalg.solve_triangular` and improve the
docs of `linalg.solve_triangular`. We also fix a few issues with those of
`torch.triangular_solve`.

Resolves https://github.com/pytorch/pytorch/issues/54258
Resolves https://github.com/pytorch/pytorch/issues/56327
Resolves https://github.com/pytorch/pytorch/issues/45734

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32588230

Pulled By: mruberry

fbshipit-source-id: 69e484849deb9ad7bb992cc97905df29c8915910
2021-11-22 12:41:06 -08:00
a2e35e167b refactor: update f-string for swa.utils.py (#68718)
Summary:
Update some old-style format strings to f-strings, for overall consistency.
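
A tiny illustration of the kind of change (values are made up):

```python
name, step = "swa", 3
old = "{}_{}".format(name, step)  # old-style formatting
new = f"{name}_{step}"            # f-string equivalent
assert old == new
```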

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68718

Reviewed By: jbschlosser

Differential Revision: D32593746

Pulled By: albanD

fbshipit-source-id: fcc17958f8af6a3260beca883bc1065f019dcf0e
2021-11-22 11:23:18 -08:00
9554ebe44e [Dist CI][BE] c10d gloo tests run in subprocess (#68504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68504

Per title
ghstack-source-id: 143928767

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32485100

fbshipit-source-id: a55687aea4af69e3830aee6f0278550c72f142c2
2021-11-22 09:54:07 -08:00
ddc22ea3b2 [Dist CI][BE] test_c10d_nccl run in subprocess (#68503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68503

Per title
ghstack-source-id: 143928768

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32484990

fbshipit-source-id: 6682f46256af0da5153e5087a91a7044156dd17f
2021-11-22 09:52:58 -08:00
39ec0f321b GHA: add print_tests_stats step to MacOS workflow (#68669)
Summary:
This will allow trunk CI to print test stats and upload stats (test reports, flaky tests, failed tests) to
- Scribe
- S3
- RDS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68669

Reviewed By: dagitses

Differential Revision: D32578169

Pulled By: janeyx99

fbshipit-source-id: c348e2070402754789f462b52cd71411984102e2
2021-11-22 08:26:52 -08:00
a66ff81837 [DataPipe] Optimize Grouper from N^2 to N (#68647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68647

Fixes #68539

When all of the data from the source datapipe is depleted, there is no need to yield the biggest group in the buffer.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32562646

Pulled By: ejguan

fbshipit-source-id: ce91763656bc457e9c7d0af5861a5606c89965d5
2021-11-22 07:49:13 -08:00
148f323856 Revert D32541986: [pytorch][PR] [opinfo] use dtypes instead of dtypesIfCPU
Test Plan: revert-hammer

Differential Revision:
D32541986 (d2a90f91bc)

Original commit changeset: 793d7d22c3ec

fbshipit-source-id: c60c4be3416f6feb658b5da1bdf75f0cbe6bee24
2021-11-22 04:58:01 -08:00
7c6a8a47db [BE] minor improvement to dist quantization (#67401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401

some minor changes to dist quantization, mainly change the namespace and add some notes for future code dedup
ghstack-source-id: 143910067

Test Plan: wait for ci

Reviewed By: mrshenli

Differential Revision: D31979269

fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
2021-11-21 23:31:59 -08:00
fb556c91ce [BE] delete frontend.cpp (#67400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67400

c10d/frontend.cpp was originally proposed to introduce a pure C++ API and use TorchBind to share the Python-level API with TorchScript. This is no longer needed, so delete it to reduce code redundancy.
ghstack-source-id: 143910066

Test Plan: wait for ci

Reviewed By: navahgar

Differential Revision: D31979270

fbshipit-source-id: 6ceb8b53d67ab8f9aef44b34da79346dfbb51225
2021-11-21 23:30:52 -08:00
d2a90f91bc [opinfo] use dtypes instead of dtypesIfCPU (#67619)
Summary:
Replace usage of `dtypesIfCPU` with `dtypes` in the OpInfo class and also make it a mandatory argument.

Also added a DeprecationWarning when `dtypesIfCPU` is used.
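
A minimal sketch of how such a deprecation could be signalled (not the exact OpInfo code):

```python
import warnings

def warn_dtypes_if_cpu():
    warnings.warn(
        "`dtypesIfCPU` is deprecated, use `dtypes` instead",
        DeprecationWarning,
        stacklevel=2,
    )
```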

This raises a question:
For an OpInfo entry, `dtypes` currently works for any external backend, `dtypesIfCPU` for CPU, and `dtypesIfCUDA` and `dtypesIfROCM` for CUDA and ROCm respectively.

If we merge `dtypes` and `dtypesIfCPU`, then cases where an external backend's `dtypes` don't match the CPU `dtypes` will lead to failures.

Currently there are a few issues (5 failures) due to this on XLA (we may add relevant skips for them). If we agree that skips should be added, should they go through the OpInfo decorator mechanism or be handled at the XLA end? The XLA end makes more sense to me, so there is one source of skips.

<details>

<summary>XLA Fail Log</summary>

```
Nov 01 11:48:26 ======================================================================
Nov 01 11:48:26 ERROR [0.016s]: test_reference_eager_histogram_xla_float32 (__main__.TestOpInfoXLA)
Nov 01 11:48:26 ----------------------------------------------------------------------
Nov 01 11:48:26 Traceback (most recent call last):
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 01 11:48:26     result = test(self, **param_kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 01 11:48:26     return test(*args, **kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Nov 01 11:48:26     self.compare_with_eager_reference(op, sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 397, in compare_with_eager_reference
Nov 01 11:48:26     cpu_inp, cpu_args, cpu_kwargs = cpu(sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 393, in cpu
Nov 01 11:48:26     sample.args), to_cpu(sample.kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 386, in to_cpu
Nov 01 11:48:26     return {k: to_cpu(v) for k, v in x.items()}
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 386, in <dictcomp>
Nov 01 11:48:26     return {k: to_cpu(v) for k, v in x.items()}
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 390, in to_cpu
Nov 01 11:48:26     raise ValueError("Unknown type {0}!".format(type(x)))
Nov 01 11:48:26 ValueError: Unknown type <class 'NoneType'>!
Nov 01 11:48:26
Nov 01 11:48:26 ======================================================================
Nov 01 11:48:26 FAIL [0.575s]: test_reference_eager___rmatmul___xla_int64 (__main__.TestOpInfoXLA)
Nov 01 11:48:26 ----------------------------------------------------------------------
Nov 01 11:48:26 Traceback (most recent call last):
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 01 11:48:26     result = test(self, **param_kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 01 11:48:26     return test(*args, **kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Nov 01 11:48:26     self.compare_with_eager_reference(op, sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 402, in compare_with_eager_reference
Nov 01 11:48:26     self.assertEqual(actual, expected, exact_dtype=True, exact_device=False)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 607, in assertEqual
Nov 01 11:48:26     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 01 11:48:26     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 01 11:48:26 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=0.001, found 44 element(s) (out of 50) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 9.187201950435738e+18 (-9.187201950435738e+18 vs. 34.0), which occurred at index (0, 4).
Nov 01 11:48:26
Nov 01 11:48:26 ======================================================================
Nov 01 11:48:26 FAIL [0.137s]: test_reference_eager_linalg_multi_dot_xla_int64 (__main__.TestOpInfoXLA)
Nov 01 11:48:26 ----------------------------------------------------------------------
Nov 01 11:48:26 Traceback (most recent call last):
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 01 11:48:26     result = test(self, **param_kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 01 11:48:26     return test(*args, **kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Nov 01 11:48:26     self.compare_with_eager_reference(op, sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 402, in compare_with_eager_reference
Nov 01 11:48:26     self.assertEqual(actual, expected, exact_dtype=True, exact_device=False)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 607, in assertEqual
Nov 01 11:48:26     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 01 11:48:26     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 01 11:48:26 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=0.001, found 4 element(s) (out of 4) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 140230883884432.0 (0.0 vs. 140230883884432.0), which occurred at index (0, 0).
Nov 01 11:48:26
Nov 01 11:48:26 ======================================================================
Nov 01 11:48:26 FAIL [0.461s]: test_reference_eager_matmul_xla_int64 (__main__.TestOpInfoXLA)
Nov 01 11:48:26 ----------------------------------------------------------------------
Nov 01 11:48:26 Traceback (most recent call last):
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 01 11:48:26     result = test(self, **param_kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 01 11:48:26     return test(*args, **kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Nov 01 11:48:26     self.compare_with_eager_reference(op, sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 402, in compare_with_eager_reference
Nov 01 11:48:26     self.assertEqual(actual, expected, exact_dtype=True, exact_device=False)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 607, in assertEqual
Nov 01 11:48:26     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 01 11:48:26     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 01 11:48:26 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=0.001, found 37 element(s) (out of 50) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 7.661375630332297e+18 (-7.66128151259864e+18 vs. 94117733658072.0), which occurred at index (4, 5).
Nov 01 11:48:26
Nov 01 11:48:26 ======================================================================
Nov 01 11:48:26 FAIL [0.050s]: test_reference_eager_remainder_autodiffed_xla_int64 (__main__.TestOpInfoXLA)
Nov 01 11:48:26 ----------------------------------------------------------------------
Nov 01 11:48:26 Traceback (most recent call last):
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 01 11:48:26     result = test(self, **param_kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 01 11:48:26     return test(*args, **kwargs)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 411, in test_reference_eager
Nov 01 11:48:26     self.compare_with_eager_reference(op, sample_input)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/test_ops.py", line 402, in compare_with_eager_reference
Nov 01 11:48:26     self.assertEqual(actual, expected, exact_dtype=True, exact_device=False)
Nov 01 11:48:26   File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 607, in assertEqual
Nov 01 11:48:26     return DeviceTypeTestBase.assertEqual(self, x, y, *args, **kwargs)
Nov 01 11:48:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 01 11:48:26     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 01 11:48:26 AssertionError: False is not true : Tensors failed to compare as equal!Attempted to compare equality of tensors with different dtypes. Got dtypes torch.int64 and torch.float32.
Nov 01 11:48:26
Nov 01 11:48:26 ----------------------------------------------------------------------
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67619

Reviewed By: ngimel

Differential Revision: D32541986

Pulled By: mruberry

fbshipit-source-id: 793d7d22c3ec9b4778784254ef6f9c980b4b0ce2
2021-11-21 21:52:38 -08:00
2d06c081ca Fix test issue with householder_product for non-contiguous inputs. (#68231)
Summary:
Fixes failing tests for `householder_product` due to non-contiguous inputs as shown here: https://github.com/pytorch/pytorch/issues/67513.

The floating point error was set too high for the complex64 type, so this PR reduces the error threshold for that particular type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68231

Reviewed By: dagitses

Differential Revision: D32562774

Pulled By: mruberry

fbshipit-source-id: edae4447ee257076f53abf79f55c5ffa1a9b3cb2
2021-11-21 21:47:23 -08:00
3b3dc1ade8 Sparse CSR CPU: add triangular_solve_out (#62180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62180

This PR adds CPU dispatch for `triangular_solve` with sparse CSR matrix.
The implementation uses MKL Sparse library. If it's not available then a runtime error is thrown.

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D32581395

Pulled By: cpuhrsch

fbshipit-source-id: 41c7133a0d2754ef60b5a7f1d14aa0bf7680a844
2021-11-21 21:29:20 -08:00
e1c449ff34 dbr quant overhead[9/x]: precalculate when to skip op_convert_after_hook (#68432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68432

Speeds up `op_convert_after_hook` by precalculating when this hook is a no-op
based on information gathered while tracing, and skipping execution when
this flag is true.

```
MobileNetV2, function level profiling, 1x3x224x224

// before
op_convert_before_hook = 3.25%

// after
op_convert_before_hook = 1.35%
```

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463752

Pulled By: vkuzo

fbshipit-source-id: b0c3d37909ddc8c254fe53f90954f625ae874e3b
2021-11-21 07:08:29 -08:00
ba230de118 dbr quant: remove more asserts from hot paths (#68431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68431

asserts have some overhead, removing the asserts used only to make
mypy happy from the path which is hit in every forward.

Test Plan: python test/test_quantization.py TestQuantizeDBR

Reviewed By: jerryzh168

Differential Revision: D32463767

Pulled By: vkuzo

fbshipit-source-id: 5f85f80144f35a725afe481bf027ea61ca6315bf
2021-11-21 07:08:26 -08:00
95c00cf029 speed up quantized relu6 inplace kernel (#68404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68404

The qclamp kernel is as fast as (non-inplace) or faster than (inplace) the
qrelu6 kernel. This removes the qrelu6 kernel and routes qrelu6 to the
qclamp kernel instead.

Test Plan:
```
// correctness
python test/test_quantization.py TestQuantizedOps.test_qrelu6

// benchmarking
import torch
import torch.nn.functional as F
toq = torch.ops.quantized
import time

N_WARMUP = 5
N_ITER = 1000

data = torch.randn(32, 32, 64, 64)
data = torch.quantize_per_tensor(data, 0.05, 0, torch.quint8)

for _ in range(N_WARMUP):
    F.hardtanh(data, 0., 6., inplace=True)
t1 = time.time()
for _ in range(N_ITER):
    F.hardtanh(data, 0., 6., inplace=True)
t2 = time.time()

for _ in range(N_WARMUP):
    toq.relu6(data, inplace=True)
t3 = time.time()
for _ in range(N_ITER):
    toq.relu6(data, inplace=True)
t4 = time.time()

t_hardtanh = t2 - t1
t_qrelu6 = t4 - t3
print(t_hardtanh, t_qrelu6)

// before
0.7156341075897217 1.4007949829101562

// after
0.6825599670410156 0.6571671962738037
```

Reviewed By: jerryzh168

Differential Revision: D32463754

Pulled By: vkuzo

fbshipit-source-id: a87fe5907d7b71d87eb1d5f6588cd509a88f2969
2021-11-21 07:08:23 -08:00
592053f115 dbr quant: simplify relatedness logic (#68374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68374

Cleans up the relatedness logic in DBR quant. For now, this is still
duplicated with NS. A future PR should unify these mappings.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463750

Pulled By: vkuzo

fbshipit-source-id: 90c2f5e79b86b1b595bd52650305bad88212ed49
2021-11-21 07:08:20 -08:00
f1021bcf38 dbr quant overhead[8/x]: small speedup in op_needs_quantization (#68373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68373

Removes redundant logic in `op_needs_quantization`, for a small speedup.

Test Plan:
```
// MobileNetV2, 1x3x224x224 input, % of time spent by function during DBR convert

// before
cur_op_needs_hooks - 0.76%
op_needs_quantization - 0.41%

// after
cur_op_needs_hooks - 0.70%
op_needs_quantization - 0.36%
```

Reviewed By: jerryzh168

Differential Revision: D32463762

Pulled By: vkuzo

fbshipit-source-id: 334591c514dfa5af6fabc1390005088e8c5ca952
2021-11-21 07:08:17 -08:00
74ba1067a6 dbr quant overhead[7/x]: speed up AutoQuantizationState.reset_to_new_call (#68372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68372

Speeds up `AutoQuantizationState.reset_to_new_call` by going around
the getattr and setattr overhead in `torch.nn.Module`.

Test Plan:
```
// MobileNetV2, 1x3x224x224 input, % of time spent by function during DBR convert

// before
reset_to_new_call - 1.09%

// after
reset_to_new_call - 0.18%
```

Reviewed By: jerryzh168

Differential Revision: D32463759

Pulled By: vkuzo

fbshipit-source-id: f3faa464372b0703f7d246680d62acd2782453e3
2021-11-21 07:08:15 -08:00
b7d58745c8 dbr quant overhead[6/x]: remove unneeded isinstance checks in op_convert_before_hook (#68371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68371

`isinstance` has some overhead; this changes the code in `op_convert_before_hook`
to use the information calculated during tracing instead, which is cheaper.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

function level benchmarking
```
// MobileNetV2, 1x3x224x224 input, % of time spent by function during DBR convert

// before
op_convert_before_hook = 3.55%
isinstance = 1.62%

// after
op_convert_before_hook = 2.89%
```

Reviewed By: jerryzh168

Differential Revision: D32463757

Pulled By: vkuzo

fbshipit-source-id: 129efe9c279a41f55b8bfd09132e21c0066298a6
2021-11-21 07:08:12 -08:00
b3a7d696b3 dbr quant overhead[5/x]: remove unnecessary asserts (#68370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68370

Removes duplicate asserts (the same condition is checked when calculating
the hook type, so there is no need to check it again).
For the assert in `validate_is_at_last_seen_idx`, rewrites it to
raise an Error instead to ensure it does not get stripped in
production environments.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463766

Pulled By: vkuzo

fbshipit-source-id: 8a7b7e0bf270bc327f49bd3e5bd156339e846381
2021-11-21 07:08:09 -08:00
16a6e0612d dbr quant: clean up key types in AutoQuantizationState mappings (#68369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68369

`AutoQuantizationState` has various mappings keyed on IDs. Only
`tensor_id_to_observer` actually needs string keys because it is a
`torch.nn.ModuleDict`. This PR changes the other mappings to have
integer keys, for simplicity and performance.
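
A quick illustration of the constraint (illustrative): `ModuleDict` requires string keys, while plain dicts can key on ints directly.

```python
import torch.nn as nn

observers = nn.ModuleDict()
observers[str(3)] = nn.Identity()  # ModuleDict keys must be strings

id_to_info = {3: "seen_op_info"}   # plain dicts can use int keys directly
```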

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463765

Pulled By: vkuzo

fbshipit-source-id: 5a9bf2a1102859097eedf1e536761084cd408856
2021-11-21 07:08:06 -08:00
3fc9bc43c6 dbr quant overhead[4/x]: speed up hook type calculations (#68351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68351

Speeds up `get_module_hook_type` and `get_torch_function_hook_type` by
bypassing the expensive `torch.nn.Module` getters and setters and
fetching `_auto_quant_state` directly.

Test Plan:
Model level benchmarking is noisy.  Individual `cProfile` results:

```
// MobileNetV2, 1x3x224x224 input, % of time spent by function during DBR convert

// before
get_module_hook_type - 5.96%
get_torch_function_hook_type - 2.24%

// after
get_module_hook_type - 2.10%
get_torch_function_hook_type - 0.57%
```

Reviewed By: jerryzh168

Differential Revision: D32463756

Pulled By: vkuzo

fbshipit-source-id: 6eb199052ddf8d78f1c123a427e7437fc7c4fe58
2021-11-21 07:08:03 -08:00
c72ffee497 dbr quant overhead[3/x]: speed up AutoQuantizationState.mark_cur_op_complete (#68350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68350

`torch.nn.Module` has overhead for getting and setting attributes because
it does various type checks on the attribute.

This PR explicitly gets and sets the right thing for this particular
function, avoiding the type checks. Model level benchmarks are too noisy,
but according to function level profiling this reduces the time spent in
this function in a quantized model from 2.60% to 0.53%, on MobileNetV2 with
input size 1x3x224x224.
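
A hedged sketch of the general trick (names are illustrative, not the actual DBR internals): attribute access on a `torch.nn.Module` goes through `Module.__getattr__`/`__setattr__`, which do extra bookkeeping, and reading the internal dicts directly skips that.

```python
import torch

m = torch.nn.Module()
m.child = torch.nn.Linear(2, 2)  # Module.__setattr__ files it under m._modules

slow = m.child                   # Module.__getattr__ searches params/buffers/modules
fast = m._modules['child']       # direct dict read skips that search
assert slow is fast
```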

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: albanD

Differential Revision: D32463751

Pulled By: vkuzo

fbshipit-source-id: a29beed2a2b87ca4df675a30dd591f797c8a1dbe
2021-11-21 07:06:42 -08:00
c7ecf1498d dbr quant overhead[2/x]: precalculate op_convert_info (#68347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68347

Moves `op_convert_info` to be precalculated in the convert step
instead of calculated dynamically.  This should help with framework
overhead.

Test Plan:
Noisy benchmark:

```
// before

fp32: 0.016103 seconds avg
fx_prepared: 0.019841 seconds avg, 0.811601 speedup vs fp32
fx_quantized: 0.011907 seconds avg, 1.352346 speedup vs fp32
dt_prepared: 0.035055 seconds avg, 0.459357 speedup vs fp32
dt_quantized: 0.018891 seconds avg, 0.852417 speedup vs fp32

// after

fp32: 0.020535 seconds avg
fx_prepared: 0.023071 seconds avg, 0.890070 speedup vs fp32
fx_quantized: 0.011693 seconds avg, 1.756206 speedup vs fp32
dt_prepared: 0.038691 seconds avg, 0.530734 speedup vs fp32
dt_quantized: 0.021109 seconds avg, 0.972793 speedup vs fp32
```

The benchmark is too noisy to rely on, but according to `cProfile`
this removes about 5% of overhead.

Reviewed By: jerryzh168

Differential Revision: D32463761

Pulled By: vkuzo

fbshipit-source-id: e2ad0d7eeff7dbadf3aa379604bfe9bec0c228fe
2021-11-20 15:17:12 -08:00
9fba8971a7 dbr quant: move model level utils into own file (#68346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68346

Some utility functions for DBR quant need to be aware
of `AutoQuantizationState`.  This PR moves them into their own file, so they
can use the type directly without circular imports, and removes the mypy
ignores which are no longer necessary after this change.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463763

Pulled By: vkuzo

fbshipit-source-id: e2c367de0d5887c61e6d2c3a73d82f7d76af3de1
2021-11-20 15:17:10 -08:00
629f9a5532 dbr quant: clean up AutoQuantizationState.get_op_convert_info flag (#68345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68345

Removes a flag to unwrap scale and zp which was only needed by
the FX rewriter. Moves the logic to happen in the FX tracer instead.
This resolves a technical debt TODO.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463764

Pulled By: vkuzo

fbshipit-source-id: ba7c976664c95111174fb65488bdac62b4f4984d
2021-11-20 15:17:07 -08:00
52cc9cb0ee dbr quant: refactor AutoQuantizationState._get_packed_param_name (#68344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68344

Makes `AutoQuantizationState._get_packed_param_name` use `seen_op_info`
instead of the current op. This will make future performance improvements
easier.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: albanD

Differential Revision: D32463758

Pulled By: vkuzo

fbshipit-source-id: 0c16fe4bc989cb66180ad674ec55060cd970e32e
2021-11-20 15:17:04 -08:00
2755cf457c dbr quant: refactor AutoQuantizationState._get_input_args_quant_dequant_info (#68343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68343

Refactors `AutoQuantizationState._get_input_args_quant_dequant_info` to
use less internal state, makes the function have no side effects by passing
the state in the arguments, and moves the function to utils file.

This will help with a future refactor to cache this info at runtime.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463760

Pulled By: vkuzo

fbshipit-source-id: bdd50b0772f128755f9b734b5eeb0a9f4bc4970b
2021-11-20 15:17:02 -08:00
57472ec414 dbr quant: refactor get_quantized_op to only use seen_op_info (#68342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68342

Before this PR, `get_quantized_op` required the current callable.

After this PR, `get_quantized_op` only requires `seen_op_info`.
The signature was changed slightly to return `None` if the original
callable does not need replacement for quantization.

This will make it easier to make performance improvements in a
future PR.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463768

Pulled By: vkuzo

fbshipit-source-id: 5db2c4199f6c0529817f4c058f81fd1d32b9fa9f
2021-11-20 15:16:59 -08:00
9cf4779ec9 dbr quant: refactor get_func_output_obs_type to only use seen_op_info (#68341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68341

Before this PR, `get_func_output_obs_type` used information from the
incoming op and its arguments, which makes it hard to cache.

This PR refactors `get_func_output_obs_type` to only use information
collected during tracing. This will make it easier to make performance
improvements in a future PR.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: jerryzh168

Differential Revision: D32463755

Pulled By: vkuzo

fbshipit-source-id: 25a220de652f0285685d43aedf7392082104b26c
2021-11-20 15:16:56 -08:00
f8b084c563 dbr quant overhead[1/x]: remove expensive calls to named_modules (#68309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68309

This is the first of a series of PRs to reduce overhead of DBR quantization
prototype. For now, the measurement of this work is not super scientific as
there are a lot of low hanging fruit.  As we speed up the prototype, we
might need to invest in better benchmarking.

Current benchmarking setup:
* mac OS laptop with OMP_NUM_THREADS=1
* torchvision's mobilenet_v2
* input size 1x3x224x224
* we measure fp32 forward, prepared and quantized forward with FX quant vs DBR quant

Note that due to small input size, this benchmark is pretty noisy.
The goal here is to measure overhead of DBR quant logic (not the kernels),
so small input is good as we want the kernels to take as little % of overall
time as possible.

High level goal is for DBR quant convert forward to approach the FX time.

This first PR removes the expensive named_modules calls and resets the op
counter in the op instead. According to cProfile, this should be a 2 to 3 percent win.

Test Plan:
```
benchmark: https://gist.github.com/vkuzo/1a4f98ca541161704ee3c305d7740d4a

// before

fp32: 0.020101 seconds avg
fx_prepared: 0.020915 seconds avg, 0.961083 speedup vs fp32
fx_quantized: 0.012037 seconds avg, 1.670005 speedup vs fp32
dt_prepared: 0.037506 seconds avg, 0.535953 speedup vs fp32
dt_quantized: 0.022688 seconds avg, 0.885988 speedup vs fp32

// after

fp32: 0.020722 seconds avg
fx_prepared: 0.023417 seconds avg, 0.884893 speedup vs fp32
fx_quantized: 0.014834 seconds avg, 1.396942 speedup vs fp32
dt_prepared: 0.039120 seconds avg, 0.529700 speedup vs fp32
dt_quantized: 0.020063 seconds avg, 1.032831 speedup vs fp32
```

Reviewed By: albanD

Differential Revision: D32463753

Pulled By: vkuzo

fbshipit-source-id: 1d7de7d9c4837e2b0ec815f0f67014c7600bb16c
2021-11-20 15:16:53 -08:00
ed6ef0eec4 dbr quantization: inline scale and zp (#68251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68251

Before this PR, DBR quantization used to recalculate scale and zero_point
in the converted model every time it was needed, which is slow.
This PR creates a pass during the convert function to go through every
observer in the model and cache its scale and zero_point.

Note: only doing this for observers which correspond to int8 operations
is saved for a future PR.

Test Plan:
```
python test/test_quantization.py TestQuantizeDBR
```

Reviewed By: VitalyFedyunin

Differential Revision: D32463769

Pulled By: vkuzo

fbshipit-source-id: d1d2e598e2bccc1958e5023096b451d69dc34e29
2021-11-20 15:16:51 -08:00
ca499567d2 barebones numeric suite for quantization with dynamic tracing (#67776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67776

This adds a barebones `add_loggers` and `extract_logger_info` API
to analyze intermediate activations of models using quantization
with dynamic tracing.  The API generally matches the NS for FX tool,
with some omissions.  For now, this is moving fast to help us
debug real models, and the API will be 100% aligned before this is marketed to users,
in future PRs.

Note: the current approach couples Numeric Suite with the quantization
logic. This is not the best for composability, and may be changed
at a future time.

Test Plan:
```
python test/test_quantization.py TestAutoTracing.test_numeric_suite
```

Differential Revision: D32231332

Reviewed By: jerryzh168

Pulled By: vkuzo

fbshipit-source-id: 8adfb50cd8b7836c391669afe2e2ff6acae6d40a
2021-11-20 15:15:48 -08:00
d0eff8d846 Strided masked softmin. (#68463)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68463

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D32576497

Pulled By: cpuhrsch

fbshipit-source-id: 286edb2e7a5415df76858c69d0312743437b0fd8
2021-11-19 20:51:42 -08:00
75955e4ef8 [clone][sparse] Add torch._C._sparse namespace (#68672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68672

This PR adds `python_module: sparse` to `native_function.yaml`.
These functions would appear in `torch._C._sparse` namespace instead of
just `torch`.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32517813

fbshipit-source-id: 7c3d6df57a24d7c7354d0fefe1b628dc89be9431
2021-11-19 19:47:38 -08:00
95f4cd0ba9 Implement topk with sort for some cases (#68632)
Summary:
A benchmark comparing the original implementation and the sort-based implementation (this code should run on a branch without this patch):
```python
import torch
import timeit

def tune_dtype(f):
    def ret(*args, **kwargs):
        for dtype in [torch.int8, torch.half, torch.float, torch.double]:
            f(*args, **kwargs, dtype=dtype)
    return ret

def tune_slice(f):
    def ret(*args, **kwargs):
        slice = 1
        while slice <= 256:
            f(*args, **kwargs, slice=slice)
            slice *= 2
    return ret

def tune_slice_size(f):
    def ret(*args, **kwargs):
        slice_size = 1
        while slice_size <= 1_000_000:
            f(*args, **kwargs, slice_size=slice_size)
            slice_size *= 10
    return ret

def tune_k(f):
    def ret(*args, slice_size, **kwargs):
        k = 1
        while k <= slice_size:
            f(*args, **kwargs, k=k, slice_size=slice_size)
            k *= 10
    return ret

def topk_with_sort(tensor, k, dim=-1, largest=True):
    values, indices = tensor.sort(dim=dim, descending=largest)
    return values.narrow(dim, 0, k), indices.narrow(dim, 0, k)

def run50sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

def warmup():
    N = 1000000
    for i in range(1, N // 10000):
        torch.randn(i, device='cuda')

def benchmark_one(slice, slice_size, k, dtype):
    input_ = torch.empty((slice, slice_size), dtype=dtype, device="cuda").random_()
    torch.cuda.synchronize()
    time = timeit.timeit(lambda: run50sync(lambda: torch.topk(input_, k, dim=1)), number=1)
    torch.cuda.synchronize()
    time_sort = timeit.timeit(lambda: run50sync(lambda: topk_with_sort(input_, k, dim=1)), number=1)
    method = "orig" if time < time_sort else "sort"
    speedup = time / time_sort
    print(f"(dtype={dtype}, slice={slice}, slice_size={slice_size}, k={k}) -> (method={method}, speedup={speedup})")

if __name__ == "__main__":
    warmup()
    tune_dtype(tune_slice(tune_slice_size(tune_k(benchmark_one))))()

```
Benchmark result see next comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68632

Reviewed By: dagitses

Differential Revision: D32566233

Pulled By: ngimel

fbshipit-source-id: f7a508176ef3685b491048c4a6562121c60b8b2a
2021-11-19 17:18:20 -08:00
e554d8b89c Fix retry on connect failure decorator (#68600)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68541 by checking whether the error message contains the expected string instead of comparing against an exact error
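
A minimal sketch of the idea (names are illustrative, not the actual decorator):

```python
RETRYABLE_SUBSTRINGS = ("Connection refused", "Address already in use")

def is_retryable_connect_failure(exc: Exception) -> bool:
    # Substring matching is robust to extra context in the message,
    # unlike comparing against an exact error string.
    msg = str(exc)
    return any(s in msg for s in RETRYABLE_SUBSTRINGS)
```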

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68600

Reviewed By: dagitses, H-Huang

Differential Revision: D32535592

Pulled By: rohan-varma

fbshipit-source-id: 864c3e3c6831f2351c2949b2348af4f48a308522
2021-11-19 17:13:30 -08:00
8e51381bac Make AOT compiler generic (#68637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68637

Make the AOT compiler compile the BI bytedoc model, while also making the compiler generic enough for other models. The shape propagation pass is replaced with the new JIT tracer, as shape propagation doesn't yet support dynamic shapes.
A change to get and set the input dtype will follow.

Test Plan:
The BI model was changed to return a tuple of tensors instead of a tuple(list[tensor], list[string]). The modified BI model runs well with these changes:
```
jf download GN91Hg9shoWzU1oPAGQ7X9SV8-5nbmQwAAAA --file bi.pt

└─ $ ./compile_model.sh -m pytorch_dev_bytedoc -p bi.pt -v v1 -i "1,115;1"
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL=pytorch_dev_bytedoc
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL_PATH=bi.pt
+ getopts m:p:v:i:h opt
+ case $opt in
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ INPUT_DIMS='1,115;1'
+ getopts m:p:v:i:h opt
+ require_arg m pytorch_dev_bytedoc
+ '[' -n pytorch_dev_bytedoc ']'
+ require_arg p bi.pt
+ '[' -n bi.pt ']'
+ require_arg i '1,115;1'
+ '[' -n '1,115;1' ']'
+ '[' '!' -f bi.pt ']'
+++ dirname ./compile_model.sh
++ cd .
++ pwd -P
+ SRC_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc
+ FBCODE_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../..
+ FBSOURCE_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../../..
+ KERNEL_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../../../xplat/pytorch_models/build/pytorch_dev_bytedoc/v1/nnc
++ readlink -f bi.pt
++ sed 's/.pt.*//'
+ MODEL_PATH_PREFIX=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/bi
+ LLVM_CODE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/bi.compiled.ll
+ ASSEMBLY_CODE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/bi.compiled.s
+ COMPILED_MODEL_FILE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/bi.compiled.pt
+ KERNEL_FUNC_NAME=nnc_pytorch_dev_bytedoc_v1_forward
+ buck run //caffe2/binaries:aot_model_compiler -- --model=bi.pt --model_name=pytorch_dev_bytedoc --model_version=v1 '--input_dims=1,115;1'
Restarting Buck daemon because Buck version has changed...
Buck daemon started.
Parsing buck files... 0.6 sec (0/unknown)
.
.
Parsing buck files: finished in 5.0 sec
Creating action graph: finished in 0.7 sec
Downloaded 3750/4917 artifacts, 16.09 Mbytes, 13.3% cache miss (for updated rules)
Building: finished in 01:22.3 min (100%) 4995/4995 jobs, 4995/4995 updated
  Total time: 01:28.0 min
BUILD SUCCEEDED
Run with 56 threads
Run with 56 threads
Loading model...
Model loaded: /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/bi.compiled.pt
Running forward ...
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1115 11:42:18.170666 1597103 TensorImpl.h:1418] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
(Columns 1 to 10 0.5428  0.1651  0.0158  0.0055  0.0503  0.0749  0.0161  0.0204  0.0237  0.0095

Columns 11 to 12 0.0609  0.0148
[ CPUFloatType{1,12} ], Columns 1 to 10-1.3946 -0.0835 -1.1268  0.3325 -2.1884  4.6175 -0.1206 -1.5058 -1.5277 -2.1214

Columns 11 to 20 1.3726 -0.4573 -1.7583 -2.2275  1.9607 -5.3430 -4.4927 -3.2548 -5.3214  2.9002

Columns 21 to 30-1.3973 -0.8084 -1.8491 -1.6518  4.2531 -0.0321 -0.0282 -1.1180 -0.9800  2.9228

Columns 31 to 32 0.8228  2.2611
[ CPUFloatType{1,32} ])
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Milliseconds per iter: 40.64. Iters per second: 24.6063
Memory usage before main runs: 71581696 bytes
Memory usage after main runs: 94347264 bytes
Peak memory usage after main runs: 94347264 bytes
Average memory increase per iter: 2.22495e+06 bytes
0 value means "not available" in above
```

Reviewed By: ljk53

Differential Revision: D32438852

fbshipit-source-id: 5defdc2593abda5da328f96248459d23b2c5e5c6
2021-11-19 17:08:07 -08:00
c41d8290b3 Rename shard_lengths to shard_sizes to be more inline with Tensor sizes. (#66464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66464

Dimension sizes are generally referred to as `size` in PyTorch, and
hence rename shard_lengths to shard_sizes.

#Closes: https://github.com/pytorch/pytorch/issues/65794
ghstack-source-id: 143866449

Test Plan: waitforbuildbot

Reviewed By: fduwjj, wanchaol

Differential Revision: D31564153

fbshipit-source-id: 6273426c4b0e079358806070d0d9644740adb257
2021-11-19 16:30:00 -08:00
af564e73b8 Strided masked log_softmax. (#68461)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68461

Test Plan: Imported from OSS

Reviewed By: dagitses, zou3519

Differential Revision: D32569961

Pulled By: cpuhrsch

fbshipit-source-id: 5d262adacf239dace4a28de85af4b602e36f17f0
2021-11-19 16:28:35 -08:00
578507cb7b Fix nanmedian result using more CUDA memory than necessary (#68591)
Summary:
CUDA's `at::nanmedian` creates a sorted copy of the array, then indexes into it to create a single-element view. This view necessarily keeps the entire sorted tensor's storage alive, which can be avoided by returning a copy, which is what `at::median` does indirectly via `at::where`.

This also changes the index variable `k` to be a simple `int64_t` instead of the CUDA tensor that was used before. This saves the additional host and device operations from calling `Tensor`'s `operator-`, which helps balance out the cost of the `clone` added here.
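
A small Python sketch of the storage-lifetime issue (illustrative):

```python
import torch

x = torch.randn(1_000_000)
s, _ = x.sort()
view = s[s.numel() // 2]  # 1-element view still pins all of s's storage
copy = view.clone()       # a copy lets s's storage be freed once s is gone
del s, view               # only copy's single-element storage remains
```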

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68591

Reviewed By: dagitses

Differential Revision: D32538538

Pulled By: ngimel

fbshipit-source-id: abe9888f80cf9d24d50a83da756e649af1f6ea3b
2021-11-19 16:16:19 -08:00
6cca14d02f [fx2trt][easy] Replace all network.add_activation() call with helper function (#68676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68676

As the title, the helper functions handles setting layer name. We would want to use those helper functions whenever possible.

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D32571061

fbshipit-source-id: 4a191f0085c0b3965dc02d99bb33de21973d565d
2021-11-19 15:29:39 -08:00
37edb7483a [torchelastic][1/n] Fix caffe2.test.distributed.launcher.api_test flaky tests (#68624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68624

Fix `caffe2.test.distributed.launcher.api_test` flaky tests for opt-tsan mode.
The diff changes the default `mp.Process` invocation to use the spawn context. By default, `mp.Process` uses the `fork` start method, which is not compatible with `*san`.
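
A minimal sketch of the spawn-context pattern (illustrative):

```python
import torch.multiprocessing as mp

def worker(rank):
    print(f"worker {rank} running")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fork is unsafe under *san builds
    p = ctx.Process(target=worker, args=(0,))
    p.start()
    p.join()
```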

Test Plan: CI

Reviewed By: d4l3k

Differential Revision: D32550578

fbshipit-source-id: f4767987e8e10a6a2ece3f86e48278f2dbaebe7c
2021-11-19 15:23:30 -08:00
a545a409f8 [quant][graphmode][fx] Support input_quantized_idxs and output_quantized_idxs in the new convert (#68042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68042

att

Also added test cases from TestQuantizeFx which test all combinations of {fp32, int8} input and output overrides

Test Plan:
```
python test/fx2trt/test_quant_trt.py TestConvertFxDoNotUse
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32271511

fbshipit-source-id: 87ffc00069aaff7d1c455cdd97fac82b11aa4527
2021-11-19 15:12:54 -08:00
993b7a2052 Remove doubly nested anonymous namespace (#68555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68555

The outer namespace is already anonymous, so this is not necessary.

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D32565941

Pulled By: malfet

fbshipit-source-id: 4daf1c46b25ff68e748e6c834c63d759ec6fde4f
2021-11-19 14:40:47 -08:00
5456d8c8f3 Add vectorized Jacobian and Hessian computation with forward AD (#67041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67041

Original PR here: https://github.com/pytorch/pytorch/pull/62246 (The old PR does more things, but now that's split across this stack)

This PR:
- Adds "jacfwd" and "hessian_fwdrev"
- Modifies existing tests to also test the `forward_ad=True` case

Test Plan: Imported from OSS

Reviewed By: gchanan, zou3519

Differential Revision: D32314424

Pulled By: soulitzer

fbshipit-source-id: 785b0e39162b93dc3b3cb9413233447152eddd53
2021-11-19 14:31:09 -08:00
7bb401a4c9 Add forward AD support for miscellanous operators (#67820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67820

Original PR here: https://github.com/pytorch/pytorch/pull/67040

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D32314423

Pulled By: soulitzer

fbshipit-source-id: ecd898dc903692cab084f6922a1d86986f957b1b
2021-11-19 14:31:06 -08:00
e358c49a5b Add OpInfo test and fix a couple cases (#66294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66294

In this PR:
- OpInfo for forward AD now checks batched forward grad when `op.check_batched_grad=True`
- Adds setting to disable the test for individual ops `check_batched_forward_grad` and disable for the ops here: https://github.com/pytorch/pytorch/issues/66357

Fixes some more failures:
- Make Forward AD metadata less strict by allowing stride to differ when size is 1
- Fix sum batching rule when logical tensor is a scalar and dim is unspecified
- Batching rule for `_reshape_alias`
- ~Batching rules now preserve storage offset for view operator that return non-zero storage offset~ (moved to previous PR)

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD

Differential Revision: D31842020

Pulled By: soulitzer

fbshipit-source-id: 3517a8fb9d6291fccb53c0b1631eab5bbb24ebd1
2021-11-19 14:31:03 -08:00
21d203b5ca Add internal assert for tangent layout mismatch for view ops (#66293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66293

This PR:
 - Asserts that if the output is a view, then `is_same_metadata` must return `true`; otherwise, we are performing a copy.
 - unless we are being called from `make_dual` which can allow the tangent and primal to have different layouts, because it is not forward differentiable.
 - To make this possible, we add `is_make_dual` as a parameter. ~The alternative is to make `make_dual` non-composite, and then we can rely on its `view_info` for differentiability information. This also assumes that the only composite function that calls `set_fw_grad` is `make_dual`.~
 - Batching rules now preserve storage offset for view operator that return non-zero storage offset

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD

Differential Revision: D31842021

Pulled By: soulitzer

fbshipit-source-id: ed606f5a7b4770df1e9ebc6eb1d584b27dad5bae
2021-11-19 14:30:59 -08:00
2455cc2adf Address case when layout of tangent is not same as base (#66292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66292

In this PR:
1. Fix the case when tangent has a different layout from the base when `set_fw_grad` by adding a native function and its batching rule.

For (1) we replace the following:
```
Tensor new_with_same_meta(const Variable& base) {
  int64_t nelement_in_storage = base.storage().nbytes() / base.itemsize();
  auto new_tensor = at::zeros({nelement_in_storage}, base.options());
  auto res = new_tensor.as_strided(base.sizes(), base.strides(), base.storage_offset());
  return res;
}
```
with a native function so as to enable a batching rule to alter its behavior.

This new function will be similar to `new_zeros_strided` except we also require the `storage_offset` and `storage_numel` arguments.
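
A minimal Python sketch of those semantics (the function name here is illustrative; the actual native function and its signature are the ones this PR defines):

```
import torch

# illustrative name; sketches the semantics of the new native function
def new_zeros_with_same_meta(tangent, size, stride, storage_offset, storage_numel):
    # allocate a fresh zero buffer covering the whole storage, then view it
    # with the requested (exploded) size/stride/offset
    buf = torch.zeros(storage_numel, dtype=tangent.dtype, device=tangent.device)
    return buf.as_strided(size, stride, storage_offset)
```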

Possible concerns:
 - Why have redundant logic? Why not add new args to `new_zeros_strided`? This is probably a niche use case, so it's better not to complicate the current API.
 - Previously the created tensor inherits the TensorOptions of the primal. Now we inherit from the TensorOptions of the tangent.
   - Probably fine. Likely, no one relies on this because the behavior is only triggered when tangent/base have different layouts.
 - Why pass in exploded size, stride, and offset? It is possible in the non-batched case to pass in a tensor directly, but not possible when we'd like to have a batching rule. The size, stride, and offset we'd be passing won't belong to any live tensor.

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD

Differential Revision: D31842019

Pulled By: soulitzer

fbshipit-source-id: a58433d814fd173bc43a2c550b395377dba40de2
2021-11-19 14:29:46 -08:00
bbe2aae84c Support cuda 11.5: install magma for cuda in conda (#68665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68665

Reviewed By: malfet

Differential Revision: D32570283

Pulled By: atalman

fbshipit-source-id: 4471fe8c4f8cc74c542ed67038322f07e861af73
2021-11-19 13:43:26 -08:00
183dcdf551 [reland] Fix flaky test_nccl_timeout (#68544)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66882

In addition to changes in https://github.com/pytorch/pytorch/pull/68403, add one more error check that can be raised when a collective times out

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68544

Reviewed By: albanD

Differential Revision: D32508706

Pulled By: rohan-varma

fbshipit-source-id: 7d41b91f547d4ad763c44cd11e7b9914b452b617
2021-11-19 13:25:24 -08:00
875ba3dddb [quant][trt] Add support for torch.addmm in TensorRT (#67537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67537

This PR adds support for quantizing torch.addmm to produce a reference quantized pattern,
and also adds support in the backend_config_dict api that allows people to specify the argument index for the input, weight and bias of each pattern:

```
    addmm_config = {
        "pattern": torch.addmm,
        "observation_type": ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT,
        "dtype_configs": [
            weighted_op_qint8_dtype_config,
        ],
        # a map from input type to input index
        "input_type_to_index": {
            "bias": 0,
            "input": 1,
            "weight": 2,
        }
    }
```

This requires some changes in getting weight_dtype and bias_dtype in the type inference stage of prepare, which are added in the previous PR

Test Plan:
```
python test/fx2trt/test_quant_trt.py TestQuantizeFxTRT.test_addmm
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32014998

fbshipit-source-id: 8d96c1e8b7ebb2ab385c08a5b1e43f2d5a2cbcbe
2021-11-19 13:19:28 -08:00
ee4cfaa286 [SR] Add utility class to determine tensor ranges (#68284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284

Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: swolchok

Differential Revision: D32397207

fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
2021-11-19 13:10:55 -08:00
a6d862c50a [quant][graphmode][fx] Add support for weight and bias dtype in backend_config_dict (#68602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68602

This PR adds support for configuring weight/bias dtype in backend_config_dict
and refactor the current code that checks when to insert observers

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32537712

fbshipit-source-id: 28eb7c61a8dcad8c1f3f6622d490a34cff0c59e2
2021-11-19 13:01:50 -08:00
da4a95c79a [ROCm] Use hipCUB/rocPRIM scan algorithms for large index support (#68487)
Summary:
For inclusive_scan and exclusive_scan, use hipCUB/rocPRIM scan algorithms for large index support.
Implemented for ROCm 5.0 and above.
Code reference : ROCmSoftwarePlatform/rocPRIM@5673df4#diff-47f4ef75e5af60dd5fe3906df9cf971f0635602a6b64a706dee6633d6677ef1a

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68487

Reviewed By: ngimel

Differential Revision: D32547541

Pulled By: malfet

fbshipit-source-id: 4dd984e6906aec7634d05e2ceaa55e31cd4d7376
2021-11-19 12:51:30 -08:00
5880a2f1ef Allow fuse unsqueeze cat sum with multiple input (#68650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68650

Allow fusing unsqueeze/cat/sum with more than 2 inputs. The implementation in this diff is naive: it just chains the concatenated items with add. Not sure whether fusing multiple adds into one operation would give more perf gain.

Test Plan: unit test

Reviewed By: jfix71

Differential Revision: D32520135

fbshipit-source-id: 535b1c8c91e415d5f1af714378b9205c1ca02ffd
2021-11-19 12:45:37 -08:00
2cab77f810 Masked normalization infrastructure and strided masked softmax (#68333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68333

Test Plan: Imported from OSS

Reviewed By: dagitses, ZolotukhinM

Differential Revision: D32564435

Pulled By: cpuhrsch

fbshipit-source-id: 4d4662323ceffd12c210b7e931682d0442578157
2021-11-19 12:41:22 -08:00
f99f5ee088 add support for None in assert_close (#67795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67795

Closes #61035.
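
A hedged sketch of the resulting behavior (presumably both sides must be None for the comparison to pass):

```
import torch
from torch.testing import assert_close

assert_close(None, None)  # both inputs None: passes
# assert_close(torch.tensor(1.0), None)  # would raise: only one side is None
```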

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32532207

Pulled By: mruberry

fbshipit-source-id: 6a2b4245e0effce4ddea7d89eca63e3b163951a7
2021-11-19 12:38:25 -08:00
0809553cf0 refactor assert_close to be more modular (#67794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67794

This change is needed to conveniently use the same comparison mechanism for our internal test suite (see #67796). The reworked version is on par with the previous one except for the ability to pass a custom message as a callable. Before, we converted everything to a tensor, so it was fairly easy to provide consistent mismatch diagnostics to the callable. Now, with arbitrary `Pair`s used for comparison, that is no longer viable.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32532206

Pulled By: mruberry

fbshipit-source-id: dc847fba6a795c1766e01bc3e88b680a68287b1e
2021-11-19 12:37:16 -08:00
f74779e403 [android] Lite interpreter naming for android nightly publishing (#68651)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68651

Test Plan: Imported from OSS

Reviewed By: linbinyu

Differential Revision: D32564796

Pulled By: IvanKobzarev

fbshipit-source-id: 57847bfb2778433cfb02ad7a5a79ae30a6b438c1
2021-11-19 10:56:13 -08:00
4bcff4733d Add OpInfos for parcel Elementwise Binary II (#68085)
Summary:
Adds OpInfos for `torch.lcm`, `torch.gcd`, `torch.heaviside`, `torch.bitwise_or`, `torch.bitwise_xor`, `torch.isclose`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68085

Reviewed By: ngimel

Differential Revision: D32533310

Pulled By: saketh-are

fbshipit-source-id: 1616ebec61164cd1b44672f36220787a878b96a4
2021-11-19 10:37:07 -08:00
c2c859bdf2 [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618282

Pulled By: b-koopman

fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3
2021-11-19 06:29:19 -08:00
f82f14de17 [libkineto] Refactor 4/n: Simplify activity logger step 2/3 (#68329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68329

Pull Request resolved: https://github.com/pytorch/kineto/pull/466

1. Generalize ChromeTraceLogger::handleGenericActivity to enable it to handle CUDA runtime activities as well as the Roctracer generic activities.
This primarily involves enabling generic support for CPU -> GPU flows.

2. In the event of out-of-order GPU activities (an issue with CUDA 11.0, likely fixed in later versions), no longer remove them but print warnings. Another diff will add these warnings to the metadata section.

Reviewed By: briancoutinho

Differential Revision: D31624496

fbshipit-source-id: dab04b3e3c0dd6799496ac87f837363de79eea25
2021-11-18 23:09:20 -08:00
18312313c4 [Profiler] Add missing guards (#65812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65812

Multiple threads record events to a shared activity buffer, and the buffer is at some point transferred to libkineto.
The access to and the transfer of the buffer need to be done under a lock.

Reviewed By: leitian, xw285cornell

Differential Revision: D31220061

fbshipit-source-id: f11c879df1b55aa9068187e600730bb0e5e5455f
2021-11-18 22:39:21 -08:00
343723e2ad [PyTorch][JIT][easy] Delete unnecessary overload of MemoryDAG::mayAlias (#66966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66966

T* is convertible to const T*, so we don't need this overload.
ghstack-source-id: 143749559

Test Plan: builds

Reviewed By: hlu1

Differential Revision: D31809824

fbshipit-source-id: 70cca86c4a87dc09cd958953a08a801db3e4d047
2021-11-18 22:36:06 -08:00
ced57eb490 [PyTorch][Static Runtime] Delete incorrect alias analysis code (#67075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67075

Sharing storage if `mayAlias` is incorrect, as the old comment notes; sharing if `mustAlias` would be nice but, as the new comment notes, would not matter.
ghstack-source-id: 143749553

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31851893

fbshipit-source-id: 5bdc8de984d5919332c9010e8b0160211d96bc2f
2021-11-18 22:34:50 -08:00
833dcaf2d6 Sparse CSR: Add torch.sin (#68123)
Summary:
This PR attempts to add support for `torch.sin` for sparse CSR tensors.

This aims to be a revised implementation (in some form) of https://github.com/pytorch/pytorch/pull/68083, and the implementation aims to be similar to that in [`SparseTensorMath.cpp` file](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/SparseTensorMath.cpp)

The tests and `empty_like` support for sparse CSR tensors (with a minor correction) are borrowed from https://github.com/pytorch/pytorch/pull/68083 temporarily to assist CI with testing this PR. :)

cc nikitaved pearu cpuhrsch IvanYashchuk krshrimali

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68123

Reviewed By: jbschlosser

Differential Revision: D32533379

Pulled By: cpuhrsch

fbshipit-source-id: eb834d64d16ee12734c77e74fffa4a47614e3dfb
2021-11-18 21:58:09 -08:00
758d7dea9c torch.monitor - Initial C++ Stats (#68074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68074

This is the first step of many PRs towards implementing the `torch.monitor` RFC https://github.com/pytorch/rfcs/pull/30

This defines the aggregation types, the `Stat` class and provides some simple collection of the stats.

This doesn't match the RFC exactly as it incorporates some of the comments on the RFC as well as a few changes for performance.

Changes:
* added window_size to the stats (see the sketch after this list). If specified, the stat is always computed over `window_size` values; if there aren't enough values within the current window, the previous stats are reported.
* This doesn't include the push metrics yet (will be coming).
  After more discussion it looks like the best way to handle this is to support a hybrid where the metric can set how frequently it'll be logged. For fixed window_size metrics it'll be logged each time it hits the window size. This will allow performant counters as well as lower-frequency push counters (window_size=1).
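
A minimal Python sketch of the fixed-window semantics described above (illustrative only; the actual implementation is the C++ `Stat` class):

```
class WindowedStat:
    def __init__(self, window_size):
        self.window_size = window_size
        self.values = []
        self.last_report = None  # stats from the previous full window

    def add(self, v):
        self.values.append(v)
        if len(self.values) == self.window_size:
            self.last_report = {"mean": sum(self.values) / len(self.values),
                                "count": len(self.values)}
            self.values.clear()

    def get(self):
        # not enough values in the current window: report the previous stats
        return self.last_report
```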

Performance considerations:
* Updating the stats acquires a lock on that Stat object. This should be performant unless very many threads are writing to the same stat. A single thread will typically use a futex, so it should be quite fast.
* Adding/removing/fetching all stats takes a global lock on the stat list -- this shouldn't be an issue since these events happen infrequently.
* Fetching stats accesses one stat at a time instead of taking a global lock. This means the exported values are linearizable but not serializable across multiple stats; I don't expect this to be an issue.

Next steps:
1. Add StatCollector interface for push style metrics
1. Add pybind interfaces to expose to Python
1. Add default metric providers
1. Integrate into Kineto trace view

Test Plan:
buck test //caffe2/test/cpp/monitor:monitor

CI

Reviewed By: kiukchung

Differential Revision: D32266032

fbshipit-source-id: dab8747b4712f5dba5644387817a3a0fda18b66a
2021-11-18 21:46:23 -08:00
68d8ab0cc6 [const_fold] Fix call_module const folding (#68614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68614

We need to copy modules over to the `split` graph during const folding. We were previously only doing so from the non-constant submod, but we need to do this for the constant one as well in case some `call_module` is const folded.

Test Plan: Added unit test

Reviewed By: wushirong, 842974287

Differential Revision: D32543289

fbshipit-source-id: 80d1d0ce2c18a665b00e1343d6c55d939390ab10
2021-11-18 20:56:06 -08:00
39747dc456 [nnc] Lowerings for flatten, xnnpack prepack op (#68470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68470

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32545261

Pulled By: IvanKobzarev

fbshipit-source-id: b2bf5b3260002bcc40a351a9c56d786b16b69287
2021-11-18 20:14:42 -08:00
ca92111758 Add native_dropout (#63937)
Summary:
Adds native_dropout to have a reasonable target for torchscript in autodiff. native_dropout has scale and train as arguments in its signature; this makes native_dropout more consistent with other operators and removes conditionals in the autodiff definition.
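
A hedged sketch of a call, assuming the op is exposed under `torch.native_dropout` and returns the output together with the dropout mask:

```
import torch

x = torch.randn(4, 4)
out, mask = torch.native_dropout(x, 0.5, True)  # (output, mask), train=True
```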

cc gmagogsfm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63937

Reviewed By: mruberry

Differential Revision: D32477657

Pulled By: ngimel

fbshipit-source-id: d37b137a37acafa50990f60c77f5cea2818454e4
2021-11-18 19:41:10 -08:00
a39060c001 textray demo for unity
Summary:
Previously I needed to back out D32220626 and then apply D31841609 to run the textray unity demo, which made it hard for other people to take a look at how this textray demo works.

I copied the textray demo (a single file) from the pytext folder to the unity folder and applied the changes needed. This way, other people can also run this textray demo. It also makes my dev environment cleaner.

Test Plan: buck run mode/opt :textray_demo

Reviewed By: mleshen

Differential Revision: D32537190

fbshipit-source-id: 5df6347c4bec583c225aea9f98fbc9f37b5d3153
2021-11-18 19:04:18 -08:00
ff125a3624 Minor changes in documentation (#68557)
Summary:
Fixed some small typos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68557

Reviewed By: mruberry

Differential Revision: D32538749

Pulled By: ngimel

fbshipit-source-id: 09a9cd4031463b6a40d7307bd8fcb7d364444ac3
2021-11-18 17:57:16 -08:00
9ce3c630ba [Docs] Mention torch.bfloat16 in torch.finfo (#68496)
Summary:
https://pytorch.org/docs/master/type_info.html#torch.torch.finfo seems to miss `torch.bfloat16`.
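
For reference, querying it already works; only the documentation was missing:

```
import torch

bf16 = torch.finfo(torch.bfloat16)
print(bf16.bits, bf16.eps, bf16.max)  # 16, 0.0078125, ~3.39e38
```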

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68496

Reviewed By: mruberry

Differential Revision: D32538806

Pulled By: ngimel

fbshipit-source-id: 1296b3eb34d024cfc7d85cf53efe771ee9f98ea2
2021-11-18 17:52:41 -08:00
913ac27112 Fixes forward AD codegen for multiple formulas (#68535)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67367

- Adds check to make sure forward grad itself does not have forward grad at the same level
- Verify with `python test/test_ops.py -k test_forward_mode_AD_linalg_eigh_cpu_float64` that it fails the check before, but passes after the codegen update

Before:
```
  if (_any_has_forward_grad_eigenvalues) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
      auto eigenvalues_new_fw_grad = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
      if (eigenvalues_new_fw_grad.defined()) {
        // The hardcoded 0 here will need to be updated once we support multiple levels.
        eigenvalues._set_fw_grad(eigenvalues_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  if (_any_has_forward_grad_eigenvectors) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
      auto eigenvectors_new_fw_grad = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
      if (eigenvectors_new_fw_grad.defined()) {
        // The hardcoded 0 here will need to be updated once we support multiple levels.
        eigenvectors._set_fw_grad(eigenvectors_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
```

After:
```
  c10::optional<at::Tensor> eigenvalues_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_eigenvalues) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
      eigenvalues_new_fw_grad_opt = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
  }
  c10::optional<at::Tensor> eigenvectors_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_eigenvectors) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
      eigenvectors_new_fw_grad_opt = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
  }
  if (eigenvalues_new_fw_grad_opt.has_value() && eigenvalues_new_fw_grad_opt.value().defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    eigenvalues._set_fw_grad(eigenvalues_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }

  if (eigenvectors_new_fw_grad_opt.has_value() && eigenvectors_new_fw_grad_opt.value().defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    eigenvectors._set_fw_grad(eigenvectors_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68535

Reviewed By: ngimel

Differential Revision: D32536089

Pulled By: soulitzer

fbshipit-source-id: a3f288540e2d78a4a9ec4bd66d2c0f0e65dd72cd
2021-11-18 17:44:17 -08:00
e7002c62ae [nnc] External functions quantized via Dispatch (#68572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68572

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32522410

Pulled By: IvanKobzarev

fbshipit-source-id: 7bb373819275582bb02e0d2ffd87a78d19f92318
2021-11-18 17:27:03 -08:00
a990a7ac31 [torchelastic] Remove stale test_get_default_executable test (#68609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68609

The test is stale and tests a non-existent method

Test Plan: ci

Reviewed By: kiukchung

Differential Revision: D32540127

fbshipit-source-id: c47b7aed3df6947819efb2f4ad1b7a059c252138
2021-11-18 17:20:36 -08:00
003f6ccec6 [BE] rename some tests in test_c10d_common (#67828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67828

as titled
ghstack-source-id: 143781976

Test Plan: wait for ci

Reviewed By: mrshenli

Differential Revision: D32165576

fbshipit-source-id: 40c04b74f9e3241d3b3d64dee53af01fcfd1018b
2021-11-18 17:14:58 -08:00
3757a16c7a Adding custom testing based on opinfos input for ops with custom rules. (#67500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67500

* #66898

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32497547

Pulled By: Gamrix

fbshipit-source-id: 07761f0e27f4ac289377ff3279ce6470d4b727dd
2021-11-18 16:29:00 -08:00
71a031e954 Adding Custom Rules to Device Propagation (#66973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66973

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32497549

Pulled By: Gamrix

fbshipit-source-id: 5732682c0b39709f76cf218490e5f5136c0d83f8
2021-11-18 16:28:56 -08:00
77db720c65 Moving parts of the Shape Registry into a common file (#66948)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66948

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32497550

Pulled By: Gamrix

fbshipit-source-id: 650feded6bae379af3d73a52edac2721bd7af2f2
2021-11-18 16:27:45 -08:00
244691db98 surface ncclUniqueId store broadcast error (#68597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68597

Users got confused by just 'Socket timeout', so surface a detailed error message instead. https://fb.workplace.com/groups/319878845696681/posts/650320792652483/. As we are using the store more often for desync timeout/slowness detection, we will need a good wrapper to surface error messages for all store APIs.

Test Plan:
```
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got exception: Socket Timeout
Exception raised from recvBytes at caffe2/torch/csrc/distributed/c10d/Utils.hpp:595 (most recent call first):
# 0  c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1  std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_0>::_M_invoke(std::_Any_data const&)
# 2  c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 3  c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
# 4  c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >)
# 5  c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 6  c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 7  c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 8  c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
# 9  c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool)
# 10 c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)
# 11 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::ProcessGroup::Work, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::ProcessGroup::WorkTraceback (most recent call last):
```

Reviewed By: rohan-varma, mingzhe09088

Differential Revision: D32533304

fbshipit-source-id: e471636ee0c5291215cb6cde659b10bee13b7d12
2021-11-18 16:04:39 -08:00
ab1d879b33 [WIP] forbid aliasing between the outputs of a differentiable graph (#67732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67732

Reviewed By: cpuhrsch

Differential Revision: D32522826

Pulled By: Krovatkin

fbshipit-source-id: 9fdf3509dcd1b885f7c7f06d22b340c0f93bbe12
2021-11-18 15:03:35 -08:00
9f4e004abd Revert D32283178: Add linalg.solve_triangular
Test Plan: revert-hammer

Differential Revision:
D32283178 (0706607abc)

Original commit changeset: deb672e6e52f

fbshipit-source-id: d2a3421292147426cc61c2f063b721acf9004755
2021-11-18 14:46:10 -08:00
48771d1c7f [BC-breaking] Change dtype of softmax to support TorchScript and MyPy (#68336)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68336

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32470965

Pulled By: cpuhrsch

fbshipit-source-id: 254b62db155321e6a139bda9600722c948f946d3
2021-11-18 11:26:14 -08:00
748d9d2494 Revert D32187063: [static runtime] dequantize out variant
Test Plan: revert-hammer

Differential Revision:
D32187063 (f120335643)

Original commit changeset: 1fec6b74c7d3

fbshipit-source-id: 9770f8379e9ddba9e537fef0e66cc93c2caaf860
2021-11-18 10:12:31 -08:00
0706607abc Add linalg.solve_triangular (#63568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63568

This PR adds the first solver with structure to `linalg`. This solver
has an API compatible with that of `linalg.solve` preparing these for a
possible future merge of the APIs. The new API:
- Just returns the solution, rather than the solution and a copy of `A`
- Removes the confusing `transpose` argument and replaces it by a
correct handling of conj and strides within the call
- Adds a `left=True` kwarg. This can be achieved via transposes of the inputs and the result, but it's exposed for convenience (see the example after this list).
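
A small example of the new API (a sketch; exact keyword defaults per the docs added in this PR):

```
import torch

A = torch.randn(3, 3).triu()  # upper-triangular coefficient matrix
B = torch.randn(3, 2)
X = torch.linalg.solve_triangular(A, B, upper=True)  # solves A @ X == B
# left=False would instead solve X @ A == B
```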

This PR also implements a dataflow that minimises the number of copies
needed before calling LAPACK / MAGMA / cuBLAS and takes advantage of the
conjugate and neg bits.

This algorithm is implemented for `solve_triangular` (which, for this, is
the most complex of all the solvers due to the `upper` parameters).
Once more solvers are added, we will factor out this calling algorithm,
so that all of them can take advantage of it.

Given the complexity of this algorithm, we implement some thorough
testing. We also added tests for all the backends, which was not done
before.

We also add forward AD support for `linalg.solve_triangular` and improve the
docs of `linalg.solve_triangular`. We also fix a few issues with those of
`torch.triangular_solve`.

Resolves https://github.com/pytorch/pytorch/issues/54258
Resolves https://github.com/pytorch/pytorch/issues/56327
Resolves https://github.com/pytorch/pytorch/issues/45734

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: zou3519, JacobSzwejbka

Differential Revision: D32283178

Pulled By: mruberry

fbshipit-source-id: deb672e6e52f58b76536ab4158073927a35e43a8
2021-11-18 09:45:51 -08:00
f120335643 [static runtime] dequantize out variant (#67873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67873

Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Reviewed By: hlu1

Differential Revision: D32187063

fbshipit-source-id: 1fec6b74c7d3f25d0f445775c4558d30c55dcece
2021-11-18 09:31:27 -08:00
7d38768d84 Rename splitter result (#68303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68303

The result of the splitter is run either on the accelerator or directly on the GPU; rename the GPU part of the graph to run_on_gpu.

Test Plan: buck test mode/opt caffe2/test:trt_tools_test

Reviewed By: 842974287

Differential Revision: D32392492

fbshipit-source-id: b085376c00c1097752e856e22c631d74a0fbc38f
2021-11-18 09:04:30 -08:00
533e72e0a4 Fix DLPack CUDA stream convention (#67618)
Summary:
Apparently for the array API, the CUDA default stream and the per-thread stream should be 1 and 2 instead of 0 and 1:

https://data-apis.org/array-api/latest/API_specification/array_object.html?dlpack-self-stream-none#dlpack-self-stream-none.

This caused a problem in the interop with CuPy https://github.com/cupy/cupy/pull/5970#discussion_r739912926.
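
On a CUDA build, the convention shows up in the `__dlpack__` protocol, roughly like this (a sketch):

```
import torch

x = torch.randn(4, device="cuda")
# array API convention: 1 = CUDA legacy default stream, 2 = per-thread default
capsule = x.__dlpack__(stream=1)
```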

cc rgommers leofang mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67618

Reviewed By: albanD

Differential Revision: D32521805

Pulled By: mruberry

fbshipit-source-id: 95777e4014e5edf1f88ba10adc03c6e34c13248d
2021-11-18 08:36:05 -08:00
d5d2096dab [testing] make @dtypes mandatory when using @dtypesIf (#68186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53647

With this, if a test forgets to add `dtypes` while using `dtypesIf`, the following error is raised
```
AssertionError: dtypes is mandatory when using dtypesIf however 'test_exponential_no_zero' didn't specify it
```
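
A hedged sketch of a compliant test declaration (decorator names as found in `torch.testing._internal.common_device_type`; the test class here is illustrative):

```
import torch
from torch.testing._internal.common_device_type import dtypes, dtypesIf

class TestDistributions:
    @dtypes(torch.float)                         # now mandatory
    @dtypesIf('cuda', torch.float, torch.half)   # device-specific override
    def test_exponential_no_zero(self, device, dtype):
        ...
```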

**Tested Locally**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68186

Reviewed By: VitalyFedyunin

Differential Revision: D32468581

Pulled By: mruberry

fbshipit-source-id: 805e0855f988b77a5d8d4cd52b31426c04c2200b
2021-11-18 08:29:31 -08:00
857fed1f42 torch.linalg.qr: forward AD support (#67268)
Summary:
As per title.
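
A quick sketch of what this enables, using the public forward-mode AD helpers:

```
import torch
import torch.autograd.forward_ad as fwAD

A = torch.randn(3, 3)
with fwAD.dual_level():
    dual_A = fwAD.make_dual(A, torch.randn(3, 3))
    Q, R = torch.linalg.qr(dual_A)    # JVP now propagates through qr
    dR = fwAD.unpack_dual(R).tangent  # tangent of R
```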

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67268

Reviewed By: ngimel

Differential Revision: D31960517

Pulled By: albanD

fbshipit-source-id: bfd1028a8d352f550efb420f9ca609c09f4a7484
2021-11-18 08:11:54 -08:00
a2d187a672 [BE] MapAllocator: report map error on Linux (#68545)
Summary:
Add `, strerror(errno), " (", errno, ")"`  suffix to TORCH_CHECK messages that report failures from POSIX calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68545

Reviewed By: ngimel

Differential Revision: D32509300

Pulled By: malfet

fbshipit-source-id: 1d7792d07e3a1184d2d54d137e6a9105dbab7d4c
2021-11-18 08:04:09 -08:00
b1aa45a8a7 Fix _make_wrapper_subclass's storage_offset handling (#68268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68268

Previously, `_make_wrapper_subclass` ignored the storage offset it was
passed. This PR fixes that by updating TensorMaker::computeStorageSize()
and TensorMaker::make_tensor() to take into account storage_offset.

Test Plan: - added test

Reviewed By: albanD, bdhirsh

Differential Revision: D32396330

Pulled By: zou3519

fbshipit-source-id: 2c85bc4066044fe6cb5ab0fc192de6c9069855fd
2021-11-18 07:07:42 -08:00
f0e2ad5037 Stop warning spamming about vmap in gradcheck (#68586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68586

We updated the vmap warnings to be more descriptive in
https://github.com/pytorch/pytorch/pull/67347 . However, gradcheck does
some warning squashing that matches on the warning message and we didn't
update that. This PR updates the warning squashing in gradcheck.

Test Plan: - check logs

Reviewed By: albanD

Differential Revision: D32530259

Pulled By: zou3519

fbshipit-source-id: 9db380b57c38b3b72cbdb29574f71dbfe71e90d1
2021-11-18 07:00:36 -08:00
f9ef807f4d Replace empty with new_empty in nn.functional.pad (#68565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68565

This makes it so that we can now vmap over nn.functional.pad (circular
variant). Previously we could not because we were effectively doing
`out.copy_(input)` where the out was created with empty.

This also has the added side effect of cleaning up the code.
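
For illustration, the op itself is unchanged from the caller's perspective:

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8)
# output is now allocated with input.new_empty(...) instead of torch.empty(...),
# which keeps circular padding compatible with batching transforms like vmap
y = F.pad(x, (1, 1), mode="circular")
```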

Test Plan:
- I tested this using functorch.vmap and can confirm that vmap now
works.
- Unfortunately this doesn't work with the vmap in core so I cannot add
a test for this here.

Reviewed By: albanD

Differential Revision: D32520188

Pulled By: zou3519

fbshipit-source-id: 780a7e8207d7c45fcba645730a5803733ebfd7be
2021-11-18 06:06:50 -08:00
6c9cf5e6ea [quant][embedding qat] eager mode QAT for Embeddings (#66429)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66429

Test Plan: Imported from OSS

Reviewed By: HDCharles, supriyar

Differential Revision: D31618284

Pulled By: b-koopman

fbshipit-source-id: 0c0e2e86b98da9f29e9b2fc2a35c59424f94cbba
2021-11-18 05:57:11 -08:00
dbbb02474b [GPU host alloc] Fast path for size 0 malloc (#68532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68532

Diff to better handle size 0 pinned memory allocation requests.
----
### Behavior before fix
The very first size 0 malloc comes in. It will create a block with `{key: 0, value: Block(0, 0, true)}`.

Another size 0 malloc comes in.
It will either 1) get a block with size > 0 (which is a waste of pinned memory) or 2) call `cudaHostAlloc()` with size 0 to eventually get *ptr=0.
Note that this block is *not registered* in the block pool because we have a duplicate entry (and that's why we keep wasting size > 0 pinned memory blocks whenever `available.empty() == false`).

----
### Behavior after fix

Let `malloc()` simply return a nullptr (0).
This avoids wasting valid size > 0 blocks as well as save the calls to `cudaHostAlloc()` which is expensive.
This is also safe since `free()` simply returns success for nullptrs.
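
A hedged sketch of the fast path (the real implementation is C++; `allocate_pinned_block` is a hypothetical helper):

```
def pinned_malloc(size):
    if size == 0:
        # return nullptr immediately: no block-pool entry, no cudaHostAlloc,
        # and free(nullptr) already succeeds trivially
        return 0
    return allocate_pinned_block(size)  # hypothetical: normal allocation path
```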

-----

Test Plan: Unit tests.

Reviewed By: yinghai

Differential Revision: D32487522

fbshipit-source-id: 6140cab54ff5a34ace7d046f218fb32805c692c0
2021-11-18 02:39:36 -08:00
4635f5711f [static runtime][dper] multi_env tests for static runtime: selective enable (#67467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67467

Unit tests for static runtime in the dper multi-env tests for CPU and scripted (including fx-traced + scripted) models. For now, only turn it on for the single_operators_tests that are in the inline_cvr local/local_ro/remote_ro models.

A follow-up diff will turn this on by default and explicitly disable it for certain tests.

Test Plan: buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test

Reviewed By: hlu1, houseroad

Differential Revision: D30870488

fbshipit-source-id: 382daec8dbcb95135cdd43e7b84a1d23b445d27c
2021-11-18 01:04:12 -08:00
35712a8eb4 [reland] simplify init_from_local_shards API (#68021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68021

Reland of https://github.com/pytorch/pytorch/pull/64481, as the previous one had some internal failures that weren't captured when it first landed.

This simplifies the `init_from_local_shards` API in sharded tensor to only require the user to pass in a list of `Shard`s and an `overall_size`, instead of a ShardedTensorMetadata. We do an all_gather inside to form a valid ShardedTensorMetadata.

TODO: add more test cases to improve coverage.
ghstack-source-id: 143661119
ghstack-source-id: 143661119

Test Plan: TestShardedTensorFromLocalShards

Reviewed By: pritamdamania87

Differential Revision: D32147888

fbshipit-source-id: 897128b75224f4b9644471a04a64079f51e0d5fe
2021-11-17 23:20:37 -08:00
952ca25daa Sparse CSR: add convert_indices_from_csr_to_coo (#66774)
Summary:
This PR adds conversion from CSR to COO.

Fixes https://github.com/pytorch/pytorch/issues/56959

cc nikitaved pearu cpuhrsch IvanYashchuk gchanan mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66774

Reviewed By: zou3519

Differential Revision: D32288415

Pulled By: cpuhrsch

fbshipit-source-id: 683ba658dc46835fdf3c0e24645c0c2bb243b968
2021-11-17 22:28:30 -08:00
96ba2099d1 Fix c10d TCP store with mutex (#68499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68499

TCP store is actually accessed by multiple threads (the NCCL watchdog thread), but it has no mutex protection, while FileStore and HashStore do. As enabling desync root cause analysis makes store calls more frequent, the race condition in TCP store was reliably triggered when creating another process group such as gloo. Add a mutex to TCP store, matching FileStore and HashStore.

Test Plan:
DDP benchmark with desync debug enabled, no perf regression

https://www.internalfb.com/intern/fblearner/details/309398285?tab=Outputs

W/o this diff

https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs

Reviewed By: mingzhe09088

Differential Revision: D32482254

fbshipit-source-id: e8f466e1c6fdcab6cfa170f44b9be70395935fb8
2021-11-17 20:30:10 -08:00
146a7f68e2 Enable desync root cause analysis for NCCL (#68310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310

Enable desync root cause analysis by recording the last footprint of collective calls. When a timeout occurs, we parse the store trace and figure out the root cause of the desync issue. This feature is built on top of async error handling.

Test Plan:
Standalone test
* Typical desync - P467288969
* Mismatched collectives - P467288916
* Mismatched broadcast size - P467288873

DDP benchmark
* DDP benchmark desync - P467433483, P467520195

No perf regression:
* w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
* w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs

Reviewed By: mingzhe09088

Differential Revision: D32348647

fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a
2021-11-17 20:29:03 -08:00
9807787135 scatter_reduce (#68115)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63780

Basic functionality of a `scatter_reduce` algorithm with `reduce="sum"`:

* `scatter_reduce` is named `scatter_reduce2` due to compilation issues (see the error log below)
* It currently re-uses functionality from `scatter_add` (see the sketch after this list)
* Tests are missing: WIP
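
Since the reduce="sum" path re-uses `scatter_add`, its semantics can be expressed with existing ops (a sketch, not the new API):

```
import torch

x = torch.tensor([1., 2., 3., 4.])
index = torch.tensor([0, 1, 0, 1])
out = torch.zeros(2).scatter_add(0, index, x)  # -> tensor([4., 6.])
```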

The error when the `scatter_reduce` naming is used:
```
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13949:18: error: redefinition of ‘struct at::_ops::scatter_reduce’
13949 | struct TORCH_API scatter_reduce {
      |                  ^~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13817:18: note: previous definition of ‘struct at::_ops::scatter_reduce’
13817 | struct TORCH_API scatter_reduce {
      |                  ^~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13960:18: error: redefinition of ‘struct at::_ops::scatter_reduce_out’
13960 | struct TORCH_API scatter_reduce_out {
      |                  ^~~~~~~~~~~~~~~~~~
aten/src/ATen/Operators.h:13839:18: note: previous definition of ‘struct at::_ops::scatter_reduce_out’
13839 | struct TORCH_API scatter_reduce_out {
      |                  ^~~~~~~~~~~~~~~~~~
In file included from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/core/TensorBody.h: In member function ‘at::Tensor at::Tensor::scatter_reduce(int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>) const’:
aten/src/ATen/core/TensorBody.h:3976:83: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 3976 |     return at::_ops::scatter_reduce::call(const_cast<Tensor&>(*this), dim, index, reduce, output_size);
      |                                                                                   ^~~~~~
      |                                                                                   |
      |                                                                                   c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13824:109: note:   initializing argument 4 of ‘static at::Tensor at::_ops::scatter_reduce::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view)’
13824 |   static at::Tensor call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce);
      |                                                                                          ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor at::scatter_reduce(const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>)’:
aten/src/ATen/Functions.h:7119:61: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7119 |     return at::_ops::scatter_reduce::call(self, dim, index, reduce, output_size);
      |                                                             ^~~~~~
      |                                                             |
      |                                                             c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13824:109: note:   initializing argument 4 of ‘static at::Tensor at::_ops::scatter_reduce::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view)’
13824 |   static at::Tensor call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce);
      |                                                                                          ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor& at::scatter_reduce_out(at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>)’:
aten/src/ATen/Functions.h:7124:65: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7124 |     return at::_ops::scatter_reduce_out::call(self, dim, index, reduce, output_size, out);
      |                                                                 ^~~~~~
      |                                                                 |
      |                                                                 c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13846:111: note:   initializing argument 4 of ‘static at::Tensor& at::_ops::scatter_reduce_out::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view, at::Tensor&)’
13846 |   static at::Tensor & call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce, at::Tensor & out);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~^~~
In file included from ../aten/src/ATen/ATen.h:15,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Functions.h: In function ‘at::Tensor& at::scatter_reduce_outf(const at::Tensor&, int64_t, const at::Tensor&, c10::string_view, c10::optional<long int>, at::Tensor&)’:
aten/src/ATen/Functions.h:7129:65: error: cannot convert ‘c10::string_view’ {aka ‘c10::basic_string_view<char>’} to ‘const at::Tensor&’
 7129 |     return at::_ops::scatter_reduce_out::call(self, dim, index, reduce, output_size, out);
      |                                                                 ^~~~~~
      |                                                                 |
      |                                                                 c10::string_view {aka c10::basic_string_view<char>}
In file included from aten/src/ATen/core/TensorBody.h:3,
                 from ../aten/src/ATen/core/Tensor.h:3,
                 from ../aten/src/ATen/DeviceGuard.h:4,
                 from ../aten/src/ATen/ATen.h:11,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/Operators.h:13846:111: note:   initializing argument 4 of ‘static at::Tensor& at::_ops::scatter_reduce_out::call(const at::Tensor&, int64_t, const at::Tensor&, const at::Tensor&, c10::string_view, at::Tensor&)’
13846 |   static at::Tensor & call(const at::Tensor & self, int64_t dim, const at::Tensor & index, const at::Tensor & src, c10::string_view reduce, at::Tensor & out);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~^~~
In file included from aten/src/ATen/NativeFunctions.h:6,
                 from ../aten/src/ATen/TensorIndexing.h:12,
                 from ../aten/src/ATen/ATen.h:20,
                 from aten/src/ATen/native/cpu/CopyKernel.cpp.DEFAULT.cpp:1:
aten/src/ATen/NativeMetaFunctions.h: At global scope:
aten/src/ATen/NativeMetaFunctions.h:496:18: error: redefinition of ‘struct at::meta::structured_scatter_reduce’
  496 | struct TORCH_API structured_scatter_reduce : public at::impl::MetaBase {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/NativeMetaFunctions.h:481:18: note: previous definition of ‘struct at::meta::structured_scatter_reduce’
  481 | struct TORCH_API structured_scatter_reduce : public at::impl::MetaBase {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68115

Reviewed By: albanD

Differential Revision: D32488450

Pulled By: cpuhrsch

fbshipit-source-id: 65e79c6d0555c0d5715535bb52aade8d5fcd9722
2021-11-17 19:53:12 -08:00
e72b9db48e [fx2trt] add converter for acc_ops.hardtanh (#68550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68550

Missing ops in https://fburl.com/gsheet/q06f1vrc

Test Plan: unit tests

Reviewed By: wushirong

Differential Revision: D32500303

fbshipit-source-id: 9266210ae229263f6bb2a60486c279ceb766ffdf
2021-11-17 17:59:37 -08:00
9d9ca88f5c [predictor][trt] Expose more CUDA/CuDNN info to at::Context and BC stage 1 (#68146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68146

Expose more CUDA/CuDNN info to at::Context

Test Plan: CI; lint;

Reviewed By: houseroad

Differential Revision: D32264935

fbshipit-source-id: ad43d5d245dba4a054e09346240414159832585e
2021-11-17 17:16:19 -08:00
d71092f668 [android][fbjni] Update fbjni to 0.2.2 (#68400)
Summary:
ghstack-source-id: caeb8df3a18a6fa48d591af126ac59d8e41494b5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68400

CI-all check:
https://github.com/pytorch/pytorch/pull/68497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68495

Reviewed By: linbinyu

Differential Revision: D32481451

Pulled By: IvanKobzarev

fbshipit-source-id: b19ce05ff9d63b3f701d718eefbf1e9d66e11639
2021-11-17 16:54:22 -08:00
53bfb00ee1 [bugfix] TensorList args in functionalization pass (#68395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68395

At the time that I wrote the pass, I thought that `c10::TensorList` and `c10::List<Tensor>` were the same thing. But it looks like a `TensorList` is actually an `ArrayRef<Tensor>`. This led to a nasty bug when I tried to add conditional functionalization to `block_diag`, where in the boxed kernel, I would:

(1) unwrap the first `IValue` by calling `.toTensorList()` (this actually returns a `List<Tensor>`, not a `TensorList`).
(2) call `TensorList to_functional_tensor(List<Tensor>)` to get out a `TensorList` with the functionalized tensors
(3) wrap that back into an `IValue` and put in on the stack.

Somewhere in that sequence of operations, something bad happens and we segfault. Fixing up the signature of `to_functional_tensor` to be `List<Tensor> to_functional_tensor(List<Tensor>)` fixes the bug. I have a feeling that there's a latent TensorList-related bug in the boxing/unboxing logic that made this worse, but I'm okay to stick with my narrow fix for now.

Additionally tested by running `pytest test/test_ops.py test/test_vmap.py -v -k block_diag` on top of this PR: https://github.com/pytorch/functorch/pull/235

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32448258

Pulled By: bdhirsh

fbshipit-source-id: 3b2b6c7cd5e4c29533d0502f24272d826bfe03c1
2021-11-17 15:50:30 -08:00
b0bdf588ea [ONNX] Release values cached in global object (#68210)
Summary:
To release constants computed and stored by `ConstantValueMap::SetValue(...)` during ONNX exporting, `ConstantValueMap::Clear()` needs to be called explicitly. Otherwise, it's a memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68210

Reviewed By: jansel

Differential Revision: D32465670

Pulled By: msaroufim

fbshipit-source-id: 521e474071b94c5d2cd4f353ee062cee78be1bd4
2021-11-17 12:47:59 -08:00
4eb772fde6 Refactor saving jit::Module to mobile .pt in 2 steps: (#66494)
Summary:
1. Convert Function -> mobile::Function
2. Serialize mobile::Function

This also opens up the opportunity to create a mobile::Module without saving/reloading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66494

Reviewed By: zhxchen17

Differential Revision: D32293022

Pulled By: qihqi

fbshipit-source-id: 29b43d47ff86071d5e2f9d6ca4dba4445711ce3d
2021-11-17 12:02:20 -08:00
e2aeb4a7af Improve native layer norm backward perf (#68238)
Summary:
Benchmarks
At this PR
```
[------------------------------------------------------ ln ------------------------------------------------------]
                  |  fwd, torch.float32  |  fwdbwd, torch.float32  |  fwd, torch.float16  |  fwdbwd, torch.float16
1 threads: -------------------------------------------------------------------------------------------------------
      200, 256    |         17.5         |          106.6          |         18.1         |           94.7
      1000, 256   |         18.7         |          116.6          |         18.7         |          110.7
      6000, 256   |         28.1         |          111.8          |         19.4         |           92.3
      6272, 256   |         29.3         |          108.5          |         20.1         |           92.7
      200, 512    |         19.3         |           83.8          |         19.1         |          116.3
      1000, 512   |         17.9         |           88.0          |         17.9         |           93.0
      6000, 512   |         36.9         |          141.2          |         27.4         |          103.3
      6272, 512   |         38.2         |          146.5          |         28.1         |          107.9
      200, 1024   |         18.1         |           89.5          |         21.1         |          102.7
      1000, 1024  |         17.9         |           88.7          |         18.5         |           92.5
      6000, 1024  |         77.6         |          277.5          |         40.3         |          148.5
      6272, 1024  |         80.7         |          288.1          |         42.0         |          154.0
      200, 1536   |         17.9         |          117.3          |         18.1         |           88.1
      1000, 1536  |         22.9         |           92.0          |         19.4         |           89.0
      6000, 1536  |        123.4         |          436.3          |         61.7         |          228.5
      6272, 1536  |        129.1         |          457.3          |         64.3         |          238.5
      200, 2048   |         18.0         |           90.5          |         19.1         |          101.6
      1000, 2048  |         31.1         |          109.8          |         25.3         |          107.9
      6000, 2048  |        174.5         |          589.8          |         87.1         |          310.5
      6272, 2048  |        182.2         |          617.0          |         91.2         |          316.7
      200, 3072   |         19.8         |           96.4          |         19.4         |           89.3
      1000, 3072  |         48.1         |          168.7          |         23.5         |          100.9
      6000, 3072  |        267.1         |          930.0          |        134.8         |          519.2
      6272, 3072  |        278.2         |          971.2          |        140.7         |          540.2
```
Pre-https://github.com/pytorch/pytorch/issues/67977
```
[------------------------------------------------------- ln -------------------------------------------------------]
                    |  fwd, torch.float32  |  fwdbwd, torch.float32  |  fwd, torch.float16  |  fwdbwd, torch.float16
1 threads: ---------------------------------------------------------------------------------------------------------
        200,   256  |         20.9         |            92.6         |         21.3         |          110.1
       1000,   256  |         20.3         |            91.8         |         28.1         |          115.6
       6000,   256  |         93.0         |           310.7         |         86.3         |          299.8
       6272,   256  |         97.3         |           323.5         |         90.0         |          314.1
        200,   512  |         20.9         |           110.2         |         21.1         |           95.0
       1000,   512  |         24.0         |           102.8         |         22.2         |           95.9
       6000,   512  |        121.7         |           367.2         |        105.6         |          337.4
       6272,   512  |        127.0         |           382.3         |        111.3         |          352.0
        200,  1024  |         21.0         |           131.8         |         20.4         |           93.3
       1000,  1024  |         35.5         |           108.7         |         27.7         |           99.4
       6000,  1024  |        170.4         |           495.5         |        137.7         |          411.4
       6272,  1024  |        177.5         |           517.6         |        143.6         |          428.6
        200,  1536  |         21.9         |            97.6         |         20.8         |           92.7
       1000,  1536  |         44.3         |           129.7         |         33.9         |          100.1
       6000,  1536  |        215.8         |           619.2         |        167.2         |          480.9
       6272,  1536  |        225.0         |           646.9         |        174.8         |          505.9
        200,  2048  |         21.8         |           100.8         |         20.7         |           96.7
       1000,  2048  |         53.7         |           152.4         |         41.4         |          118.3
       6000,  2048  |        267.0         |           753.6         |        220.4         |          571.5
       6272,  2048  |        278.6         |           785.8         |        211.4         |          589.2
        200,  3072  |         20.9         |           103.7         |         21.9         |          104.6
       1000,  3072  |         71.4         |           201.1         |         53.1         |          148.3
       6000,  3072  |        365.7         |          1040.3         |        262.0         |          731.5
       6272,  3072  |        382.0         |          1084.4         |        273.3         |          766.3
```
Benchmarking script
```
import torch
from torch.utils.benchmark import Timer, Compare

results = []
for dtype in (torch.float, torch.half):
    for fs in (256, 512, 1024, 1536, 2048, 3072):
        for bs in (200, 1000, 6000, 196*32):
            ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype)
            X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True)
            gO = torch.rand_like(X)
            stmtfwd = "ln(X)"
            stmtfwdbwd = "X.grad=None; ln.zero_grad(set_to_none=True); out = ln(X); out.backward(gO)"
            tfwd = Timer(stmt=stmtfwd, label="ln", sub_label=f"{bs:5}, {fs:5}", description=f"fwd, {dtype}", globals=globals())
            tfwdbwd = Timer(stmt=stmtfwdbwd, label="ln", sub_label=f"{bs:5}, {fs:5}", description=f"fwdbwd, {dtype}", globals=globals())
            for t in (tfwd, tfwdbwd):
                results.append(t.blocked_autorange())
        print(fs, end='\r')
c = Compare(results)
c.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68238

Reviewed By: mruberry

Differential Revision: D32469450

Pulled By: ngimel

fbshipit-source-id: 08fe755c156d3d5c366c966cb808bf0f3e74c050
2021-11-17 12:00:07 -08:00
f3e2fefe09 Actually enable PYTORCH_RETRY_TEST_CASES for linux tests (#68486)
Summary:
After noticing that CUDA mem leak checks were not rerun, I realized I had forgotten to pass the env var as a Docker variable.

What a noob mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68486

Reviewed By: seemethere

Differential Revision: D32501718

Pulled By: janeyx99

fbshipit-source-id: 9918d626e90bea1562a3094c6eb12cb7d86dbf6a
2021-11-17 11:50:48 -08:00
2f37a39a5c [quant][graphmode][fx] Refactor node_name_to_target_dtype to make it more clear (#68317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68317

We use node_name_to_target_dtype to store the target dtype of the output activation for each node, computed from the node's qconfig.
There are two problems with node_name_to_target_dtype that make it hard to work with:
1. we mutate node_name_to_target_dtype when we insert observers; this makes the data structure confusing because it's typically unexpected
to change a data structure that stores the "target" dtype
2. currently it only stores the target dtype of output activations, while we also need target dtypes for input activations, weights, and biases

This PR fixes both problems by removing mutation from node_name_to_target_dtype and expanding each node's target dtype info to include
the missing target dtypes for input activation, weight, and bias. We will have another refactor to simplify the observation of weight and bias dtypes
in the future.

Please see comments for the updated structure of node_name_to_target_dtype

TODO: we may want to rename node_name_to_target_dtype to node_name_to_target_dtype_info in a separate PR.
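
For illustration, a hypothetical sketch of what the expanded per-node structure could look like (the field names here are illustrative, not necessarily the exact keys used in the PR):
```
import torch

# Hypothetical shape of the expanded structure: each node maps to the
# target dtypes of its input activation, weight, bias, and output activation.
node_name_to_target_dtype_info = {
    "linear1": {
        "input_activation_dtype": torch.quint8,
        "weight_dtype": torch.qint8,
        "bias_dtype": torch.float,
        "output_activation_dtype": torch.quint8,
    },
}
```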

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D32411858

fbshipit-source-id: 3d76dd65056920ff8642899517bc1b95d43fc1de
2021-11-17 11:21:25 -08:00
3b4f072383 Remove TH/THC Storage data and copy functions (#68127)
Summary:
Part of https://github.com/pytorch/pytorch/issues/67852

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68127

Reviewed By: mrshenli

Differential Revision: D32441885

Pulled By: ngimel

fbshipit-source-id: 1bbe7c8bed30bfe1737511a4f347fd9a8024dd99
2021-11-17 11:19:54 -08:00
4e21d77dbb Use TORCH_CHECK in MapAllocator (#68424)
Summary:
When porting `THAllocator` to ATen I changed `AT_ERROR` to `TORCH_INTERNAL_ASSERT` but the direct translation should have been `TORCH_CHECK`.

33e9a0b5f6/c10/util/Exception.h (L619-L623)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68424

Reviewed By: VitalyFedyunin

Differential Revision: D32465548

Pulled By: ngimel

fbshipit-source-id: 7fa9c1fe27e4849b76248badb681d7b6877ce9e8
2021-11-17 10:33:22 -08:00
693fe2fd9b docs: Added Union to supported types in documentation (#68435)
Summary:
This PR updates the documentation as a follow-up to https://github.com/pytorch/pytorch/pull/64234, adding `Union` as a supported type.
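
A minimal usage sketch (TorchScript refines the `Union` through `isinstance` checks):
```
import torch
from typing import Union

@torch.jit.script
def describe(x: Union[int, str]) -> str:
    # TorchScript narrows the Union type inside the isinstance branch
    if isinstance(x, int):
        return str(x + 1)
    return x

print(describe(1))     # "2"
print(describe("hi"))  # "hi"
```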

Any feedback is welcome!

cc ansley albanD gmagogsfm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68435

Reviewed By: davidberard98

Differential Revision: D32494271

Pulled By: ansley

fbshipit-source-id: c3e4806d8632e1513257f0295568a20f92dea297
2021-11-17 10:26:31 -08:00
61206ba4db [SR] Add StorageGroup abstraction (#68279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68279

While reworking the liveness analysis, I noticed that using `std::pair<size_t, std::vector<Tensor*>>` to represent storage groups made things quite unreadable.

Add a simple class to wrap a `std::vector<at::Tensor*>` and store a `size` attribute

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

Also ran inline_cvr benchmarks, did not see any errors

Reviewed By: swolchok

Differential Revision: D32369447

fbshipit-source-id: e0b562aa7eefd738b1a34f1f37eb7bc95d71a257
2021-11-17 09:29:08 -08:00
cac3cd1433 add torch.diff support for n greater than 1 (#67260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67260

Addressing #54853
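
For example, `n > 1` composes repeated first-order differences:
```
import torch

x = torch.tensor([1., 3., 6., 10.])
print(torch.diff(x))       # tensor([2., 3., 4.])
print(torch.diff(x, n=2))  # tensor([1., 1.]) -- diff applied twice
```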

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31930294

Pulled By: mikaylagawarecki

fbshipit-source-id: 97c7a27e9200c6688242680ff96b73dfff828479
2021-11-17 09:16:33 -08:00
3da2e09c9b Added antialias flag to interpolate (CPU only, bilinear) (#65142)
Summary:
Description:
- Added antialias flag to interpolate (CPU only)
  - forward and backward for bilinear mode
  - added tests
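
A usage sketch of the new flag:
```
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 906, 438)
# antialias=True gives PIL-style filtering when downsampling (CPU, bilinear)
y = F.interpolate(x, size=(320, 196), mode="bilinear",
                  align_corners=False, antialias=True)
```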

### Benchmarks

<details>
<summary>
Forward pass, CPU. PTH interpolation vs PIL
</summary>

Cases:
- PTH RGB 3 Channels, float32 vs PIL RGB uint8 (apples vs pears)
- PTH 1 Channel, float32 vs PIL 1 Channel Float

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
# OMP_NUM_THREADS=1 python bench_interp_aa_vs_pillow.py

Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_75,code=sm_75
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

Num threads: 1
[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (320, 196) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                2.9                |          3.1
      channels_last non-contiguous torch.float32  |                2.6                |          3.6

Times are in milliseconds (ms).

[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (460, 220) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.4                |          4.0
      channels_last non-contiguous torch.float32  |                3.4                |          4.8

Times are in milliseconds (ms).

[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (120, 96) -------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                1.6                |          1.8
      channels_last non-contiguous torch.float32  |                1.6                |          1.9

Times are in milliseconds (ms).

[----------------------- Downsampling: torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                9.0                |          11.3
      channels_last non-contiguous torch.float32  |                8.9                |          12.5

Times are in milliseconds (ms).

[----------------------- Downsampling: torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                2.1                |          1.8
      channels_last non-contiguous torch.float32  |                2.1                |          3.4

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (320, 196) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.2               |          1.0

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (460, 220) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.4               |          1.3

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (120, 96) ---------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |              719.9              |         599.9

Times are in microseconds (us).

[-------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (1200, 196) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               3.7               |          3.5

Times are in milliseconds (ms).

[-------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (120, 1200) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |              834.4              |         605.7

Times are in microseconds (us).

```

</details>

Code is moved from torchvision: https://github.com/pytorch/vision/pull/4208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65142

Reviewed By: mrshenli

Differential Revision: D32432405

Pulled By: jbschlosser

fbshipit-source-id: b66c548347f257c522c36105868532e8bc1d4c6d
2021-11-17 09:10:15 -08:00
143491e0ad [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32484422

fbshipit-source-id: 5c836dc7d06f12e64cc4bb1e85d8fa4b62a29b85
2021-11-17 07:27:04 -08:00
3e3bf40b0a Revert D32452012: [pytorch][PR] Fix flaky test_nccl_timeout
Test Plan: revert-hammer

Differential Revision:
D32452012 (faa1e8b7cf)

Original commit changeset: c959b25957f2

fbshipit-source-id: a2786744b12ceed350eec0ca2834f5176a4e21ee
2021-11-17 06:08:53 -08:00
54ac64f035 Revert D32477989: [pytorch][PR] Actually enable PYTORCH_RETRY_TEST_CASES for linux tests
Test Plan: revert-hammer

Differential Revision:
D32477989 (173c0f8a98)

Original commit changeset: e28d095773f5

fbshipit-source-id: 2de5fac08f7f322a3aeb92a67b5fdfa0a6518bf1
2021-11-17 06:04:14 -08:00
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; we can now support dimension collapsing for non-coherent input tensors with different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
01b30922dd [static runtime] fuse gather+to+lengths_to_offsets (#64075)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64075

Test Plan:
Before:
`I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987`

After:
`I0826 17:13:07.464485 1040300 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.46362. Iters per second: 154.712`

Profile after: P453143683

Accuracy tested comparing with jit interpreter for no differences under 1e-3 (nnc ops turned on) https://www.internalfb.com/intern/diff/view-version/136824794/

======

With 100-request recordio inputs (211 inputs)

Before:
`I1101 12:43:13.558375 742187 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.7882. Iters per second: 84.8309`
After:
`I1101 13:50:41.087644 1126186 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.6763. Iters per second: 85.6438`

Profile after: P465977010
Constituent ops before (total is 0.5646):
```
       0.187392 ms.    1.61737%. fb::clip_ranges_gather (309 nodes, out variant)
       0.174101 ms.    1.50266%. fb::lengths_to_offsets (464 nodes, out variant)
       0.203126 ms.    1.75317%. static_runtime::to_copy (805 nodes, out variant)
```
Constituent ops after (total is 0.4985):
```
       0.376559 ms.    3.25614%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
      0.0614349 ms.   0.531235%. fb::lengths_to_offsets (159 nodes, out variant)
      0.0573315 ms.   0.495751%. static_runtime::to_copy (195 nodes, out variant)
     0.00325543 ms.  0.0281501%. fb::gather_ranges (4 nodes, out variant)
```

Compare with jit interpreter inside benchmark:
`I1101 13:55:53.013602 1149446 PtVsBlackBoxPredictorBenchLib.cpp:175] Finished comparing PT static runtime and jit interpreter results`

======

Casting on the fly:

a. Static runtime off
```
Static runtime ms per iter: 11.4658. Iters per second: 87.2159
0.220367 ms.    1.94726%. static_runtime::to_copy (805 nodes, out variant)
0.172585 ms.    1.52504%. fb::clip_ranges_gather (309 nodes, out variant)
0.157836 ms.    1.39471%. fb::lengths_to_offsets (464 nodes, out variant)
```

b. Casting on the fly, using explicit allocation+to_copy (which has the fast pass for certain cases, but we'll always call empty):
```
I1115 09:08:35.711972 1925508 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 11.6732. Iters per second: 85.6662

0.599439 ms.    5.25098%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0552475 ms.   0.483958%. fb::lengths_to_offsets (159 nodes, out variant)
0.0576032 ms.   0.504593%. static_runtime::to_copy (195 nodes, out variant)
0.00299026 ms.  0.0261941%. fb::gather_ranges (4 nodes, out variant)
```

c. Casting on the fly with native::to (no explicit allocation, but no fast pass):
```
Static runtime ms per iter: 11.5627. Iters per second: 86.4849
0.454356 ms.     3.9652%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.06315 ms.   0.551115%. static_runtime::to_copy (195 nodes, out variant)
0.0590741 ms.   0.515544%. fb::lengths_to_offsets (159 nodes, out variant)
0.00359182 ms.   0.031346%. fb::clip_ranges_gather (4 nodes, out variant)
```

d. Removal of the to() call in question from the fusion pattern:
```
Static runtime ms per iter: 11.3658. Iters per second: 87.9836
 0.29591 ms.     2.6479%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
 0.154612 ms.    1.38352%. static_runtime::to_copy (500 nodes, out variant)
0.0567151 ms.   0.507505%. fb::lengths_to_offsets (159 nodes, out variant)
0.0051115 ms.  0.0457394%. fb::clip_ranges_gather (4 nodes, out variant)
```

Reviewed By: hlu1

Differential Revision: D30515441

fbshipit-source-id: 53acee10619ac2be7dc8982e929e3210c4bb6d21
2021-11-17 00:49:31 -08:00
faa1e8b7cf Fix flaky test_nccl_timeout (#68403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66882

- Remove time.sleep call
- Use gloo barrier to enforce rank synchronization
- Reduce timeouts for allreduce
- Pass in timeout and call wait() in _check_for_nccl_abort()

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68403

Reviewed By: H-Huang

Differential Revision: D32452012

Pulled By: rohan-varma

fbshipit-source-id: c959b25957f2eb8d59c506075da6023d25bbcfd9
2021-11-16 23:43:23 -08:00
6186b90c53 [Contrib][Fakelowp] Change Lut Size for Tanh (#68334)
Summary:
The reference code's LUT size increased, and the minimum
now starts from 0 instead of 7000.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68334

Reviewed By: jiecaoyu

Differential Revision: D32467332

Pulled By: hl475

fbshipit-source-id: 3e4510e09374519aebe657a31f0b1ccde117e761
2021-11-16 23:39:02 -08:00
f6696c5a85 export CPUOffload in _fsdp package (#68308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68308

Export CPUOffload in the _fsdp package, as the cpu_offload config in the FSDP API needs to import this class
ghstack-source-id: 143560608
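
A minimal sketch of the resulting usage, assuming a distributed process group is already initialized (the `_fsdp` package is private, so treat the exact import path as an assumption):
```
import torch.nn as nn
from torch.distributed._fsdp import CPUOffload
from torch.distributed._fsdp import FullyShardedDataParallel as FSDP

# assumes torch.distributed process group is already initialized;
# offload_params=True keeps sharded parameters on CPU between uses
model = FSDP(nn.Linear(8, 8), cpu_offload=CPUOffload(offload_params=True))
```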

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D32408719

fbshipit-source-id: ee5c40ec91a423fbd58872fbdeb5f2dda8a3d89e
2021-11-16 22:56:12 -08:00
9c15523793 Attach unused parameter info to static graph error message (#68413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68413

attach unused parameter info to static graph error message
ghstack-source-id: 143560766

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D32457112

fbshipit-source-id: 31de859bf5289aa6044279014f0e76be9bcb9e54
2021-11-16 22:55:08 -08:00
9de730ebba q_avgpool: Loop over batch dimension inside operators (#66819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66819

This has a number of different advantages:
- For channels last tensors, DispatchStub overhead is only incurred once.
- For contiguous tensors, parallelization now happens over batch and
  channels, enabling better load balancing between threads.
- `q_scale()` and `q_zero_point()` are no longer called inside of a
  parallel region, which is not allowed (see gh-56794)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32445352

Pulled By: ngimel

fbshipit-source-id: cd938e886cd5696855eb56a649eaf3bccce35e54
2021-11-16 22:29:42 -08:00
1cade067e3 [Vulkan] Vulkan backend is now thread-safe (#67733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67733

Vulkan backend is now thread-safe:
* `ThreadContext` class holds onto all per-thread Vulkan states such as Command, Descriptor and Resource objects.
* `ThreadContext::SingletonThreadLocalObject<T>` is a very light version of `folly::SingletonThreadLocal` (https://github.com/facebook/folly/blob/main/folly/SingletonThreadLocal.h). It holds a static object with the `thread_local` modifier. It is tied to a `GPU` object, which allows us to expand multi-threaded GPU backend and multi-GPU capability in the future. The lifetime of a `SingletonThreadLocalObject<T>` object runs from the first call (instantiation) to the termination of the thread.
* The `MAKE_VULKAN_THREADSAFE` preprocessor flag is used for BUCK and the implementation of the thread-safe Vulkan backend. We can quickly exclude it from BUCK if any unexpected issue gets uncovered in the future. Once we are confident it's stable, we can remove the preprocessor flag from the code.
* A new perf test is added with shape `{3,40,221,193}` and 3 threads.
* `vkQueueSubmit` is not thread-safe, only one thread can push the commands at a time (See https://vkguide.dev/docs/chapter-1/vulkan_command_flow/#vulkan-command-execution). The number of available queues depends on GPU. It could be 1 and we cannot assume we can create multiple queues. Thus, we need to avoid calling `vkQueueSubmit` from multiple threads at the same time. When running Vulkan backend in different threads without any locking mechanism, `vkQueueSubmit` will get the `VK_ERROR_INITIALIZATION_FAILED(-3)` error.
* In the `Context::~Context()`, we should not call `flush()` since all per-thread objects will be destroyed as each thread exits. From the following logs, you can verify all per-thread objects are getting destroyed as their threads are terminated. The logs captured all ctor/dtor calls when running Vulkan backend with 3 different threads:
```
ThreadContext::ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28]
Context::Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1]
Resource::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00]
Command::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003]
Resource::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00]
Command::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008]
Resource::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00]
Command::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d]
Descriptor::Pool::Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d]
Descriptor::Pool::Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e]
Descriptor::Pool::Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f]
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> enter
Descriptor::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965310] device_[0x7f94998cf218] descriptor_pool_[0x4b7df1000000002f] -> leave
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> enter
Command::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965068] device_[0x7f94998cf218] command_pool_[0xfa21a40000000003] -> leave
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> enter
Descriptor::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d510] device_[0x7f94998cf218] descriptor_pool_[0x980b0000000002e] -> leave
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> enter
Command::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d268] device_[0x7f94998cf218] command_pool_[0xead9370000000008] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> enter
Resource::Pool::~Pool() -> thread[0x7000095ab000] this[0x7f9489965258] device_[0x7f94998cf218] allocator_[0x7f947980ee00] -> leave
Resource::Pool::~Pool() -> thread[0x70000962e000] this[0x7f947980d458] device_[0x7f94998cf218] allocator_[0x7f949b119c00] -> leave
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> enter
Descriptor::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee910] device_[0x7f94998cf218] descriptor_pool_[0xa43473000000002d] -> leave
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> enter
Command::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee668] device_[0x7f94998cf218] command_pool_[0xcad092000000000d] -> leave
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> enter
Resource::Pool::~Pool() -> thread[0x1207d5e00] this[0x7f949a0ee858] device_[0x7f94998cf218] allocator_[0x7f9499901c00] -> leave
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> enter
Context::~Context() -> thread[0x1207d5e00] this[0x7f9489981800] device_[1] -> leave
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28] -> enter
ThreadContext::~ThreadContext() -> thread[0x1207d5e00] this[0x0x7f9489981e28] -> leave
```
Some notes on unexpected behaviors by `VkQueue`:
* We need to make sure only one thread accesses the `VkQueue` at a time when multi-threaded, or we need a locking mechanism to protect the `VkQueue` from multiple threads. The locking approach is used for this change.
* To avoid lock overhead, we tried a per-thread `VkQueue` (a separate object per thread), but that didn't fix the `VK_ERROR_INITIALIZATION_FAILED` error from the `vkQueueSubmit` call. This was not expected. Interestingly, macOS doesn't crash with the per-thread approach, but that is no surprise since its behavior has not been that reliable. We are not sure whether this is an Android Vulkan driver issue.
* Making the entire `Context` `thread_local` without any lock actually fixes the same error.

Test Plan:
**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
**Test build on MacOS**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**
```
//xplat/caffe2:pt_vulkan_perf_test_binAndroid#android-arm64 buck-out/gen/fe3a39b8/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64
buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid#android-arm64: 1 file pushed, 0 skipped. 145.4 MB/s (826929592 bytes in 5.426s)
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
=============================================================================================================
Thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       55.8 ms         15.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       25.6 ms         4.08 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       60.6 ms         14.3 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        4.52 ms        0.757 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        7.16 ms        0.770 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3       35.9 ms         38.8 ms         3000
=============================================================================================================
Non thread-safe Vulkan backend on Google Pixel 5
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       55.0 ms         14.5 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       25.8 ms         4.30 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       60.6 ms         14.5 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        4.52 ms        0.761 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        7.15 ms        0.765 ms         5000
```
For the single-thread scenario, the difference between the thread-safe and non-thread-safe versions is less than 2%, which is acceptable. In other words, there is no considerable performance degradation in the thread-safe Vulkan backend from using:
* singleton thread local objects for `Command`, `Descriptor` and `Resource` pools
* mutex lock for `VkQueueCommit` call

**Test result on MacOS**
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 11.96, 7.17, 5.45
***WARNING*** Library was built as DEBUG. Timings may be affected.
=============================================================================================================
Thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       58.4 ms         42.8 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       12.3 ms         5.43 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       56.0 ms         41.2 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        3.00 ms         1.52 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        2.56 ms         1.34 ms         5000
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:3       42.8 ms         42.8 ms         3000
=============================================================================================================
Non thread-safe Vulkan backend on MacOS
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       58.6 ms         42.6 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       11.3 ms         4.67 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       57.6 ms         42.4 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        2.89 ms         1.45 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        2.47 ms         1.27 ms         5000
```
The non-thread-safe version is slightly faster than the thread-safe one. This test result is only for reference, since we cannot fully trust macOS, which has an extra layer, [MoltenVK](https://github.com/KhronosGroup/MoltenVK), on top of `Metal`.

Reviewed By: SS-JIA

Differential Revision: D32093974

fbshipit-source-id: 9eab7f0db976eff717540a5b32f94ed17a00b662
2021-11-16 22:09:32 -08:00
2317e28e9e Enable complex autograd for col2im / im2col (#68199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68199

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32467043

Pulled By: mruberry

fbshipit-source-id: 9094aff036f75b280422e210f7089140ea61fc71
2021-11-16 21:11:44 -08:00
fea2bb64c8 OpInfos for stft, istft, fftshift, ifftshift (#68198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68198

This unearths some bugs in istft backward, so I've disabled
backward tests but it's fixed in the next PR in the stack.

cc mruberry peterbell10

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32467044

Pulled By: mruberry

fbshipit-source-id: 5cf49560cbeb0263a66aafb48ed1bcc8884b75f1
2021-11-16 21:09:54 -08:00
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
4c346bd073 Added forward derivatives for neg, diag, inverse, linalg_eig (#67837)
Summary:
Recreated due to CI failures as per comment https://github.com/pytorch/pytorch/pull/67339#issuecomment-959893293

===

See also discussion in https://github.com/pytorch/pytorch/issues/10223, starting from [this](https://github.com/pytorch/pytorch/issues/10223#issuecomment-949499666) comment

The formulas for the derivatives are taken from https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf.

As indicated, the method linalg_eig_jvp should be used instead of linalg_eig_jvp_eigenvalues and linalg_eig_jvp_eigenvectors in the future. Due to a codegen limitation, this is not yet possible.
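
A sketch of checking one of these rules through the public forward-mode AD API, assuming a build where these forward derivatives are available; for `inverse`, the tangent of A^-1 is -A^-1 dA A^-1:
```
import torch
import torch.autograd.forward_ad as fwAD

A = torch.randn(3, 3) + 3 * torch.eye(3)  # keep A comfortably invertible
dA = torch.randn(3, 3)                    # tangent (perturbation direction)
with fwAD.dual_level():
    dual_out = fwAD.unpack_dual(torch.inverse(fwAD.make_dual(A, dA)))
    expected = -torch.inverse(A) @ dA @ torch.inverse(A)
    print(torch.allclose(dual_out.tangent, expected, atol=1e-5))  # True
```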

CC albanD Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67837

Reviewed By: mrshenli

Differential Revision: D32403662

Pulled By: soulitzer

fbshipit-source-id: 529cb93f865ce4cc2e24fa6f672d4234e7abe2b1
2021-11-16 20:32:47 -08:00
aa9ee8d02a [Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368

Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object, wrapped in a `ProcessedNode::Function` object, in order to invoke the actual operation implementation.

However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementations, so this duplication is avoidable.

This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction`s instead of copies of the whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance.

This comes with a potential cost in execution time: since a `ProcessedNode` instance now keeps a pointer to the function object, executing a node involves an extra pointer dereference. However, this cost proved negligible in local performance tests.

Thanks to hlu1 for proposing this non-intrusive improvement idea :D

Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local, and by 8% for CMF/local_ro (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance). No noticeable latency regression was observed.

==AFTER

* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087

* CMF/local_ro
memory turnover: 387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101

==BEFORE

* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18

* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453

==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:

==AFTER

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

==BEFORE

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

Reviewed By: swolchok

Differential Revision: D32337548

fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
2021-11-16 20:28:48 -08:00
fd85d925b0 Fix some sign issues (#68361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68361

Fixes
```
caffe2/aten/src/ATen/FunctionalizeFallbackKernel.cpp:36:31: error: comparison of integers of different signs: 'int64_t' (aka 'long') and 'const unsigned long' [-Werror,-Wsign-compare]
    for (int64_t idx = 0; idx < num_returns; ++idx) {
                          ~~~ ^ ~~~~~~~~~~~
caffe2/aten/src/ATen/native/cuda/Sorting.cpp:87:16: error: comparison of integers of different signs: 'int64_t' (aka 'long') and 'std::vector::size_type' (aka 'unsigned long') [-Werror,-Wsign-compare]
    assert(dim < out_shape.size());
           ~~~ ^ ~~~~~~~~~~~~~~~~
```

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32433063

fbshipit-source-id: b896dbab81861f3f074e00db73d20d9523037dd1
2021-11-16 20:18:58 -08:00
173c0f8a98 Actually enable PYTORCH_RETRY_TEST_CASES for linux tests (#68486)
Summary:
After noticing that CUDA mem leak checks were not rerun, I realized I had forgotten to pass the env var as a Docker variable.

What a noob mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68486

Reviewed By: malfet, seemethere

Differential Revision: D32477989

Pulled By: janeyx99

fbshipit-source-id: e28d095773f50864ab49229e434187a9ecb004cc
2021-11-16 19:02:03 -08:00
affa3f846c Sparse CSR CPU: add torch.addmm (#65606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65606

This PR adds a `torch.addmm(c, a, b, alpha=1.0, beta=0.0, out=out)` variant with `a, b, c, out` all being sparse CSR tensors on CPU.
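
A minimal sketch with illustrative values (diagonal 2x2 CSR matrices):
```
import torch

crow = torch.tensor([0, 1, 2])
col = torch.tensor([0, 1])
val = torch.tensor([1.0, 2.0])
a = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))
b = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))
c = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))

# computes beta * c + alpha * (a @ b) with all operands sparse CSR on CPU
out = torch.addmm(c, a, b, beta=1.0, alpha=1.0)
```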

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32366236

Pulled By: cpuhrsch

fbshipit-source-id: e910bcc96eee99d624b80ee881df3887ab3ba5ac
2021-11-16 17:22:46 -08:00
5cfca5524c [JIT] clear GraphFunction.optimized_graphs_ after freezing a module (#68316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68316

Consider the following:
```
class Mod(nn.Module):
    def __init__(self, val):
        super().__init__()
        self.param = nn.Parameter(val)

    def forward(self, x):
        # this method will change during freezing
        return x + self.param

    @torch.jit.export
    def make_prediction(self, x):
        y = x + x
        return self.forward(y)

param = torch.rand([2, 2])

unscripted_mod = Mod(param)
mod = torch.jit.script(unscripted_mod)
mod.eval()
mod = torch.jit.freeze(mod, preserved_attrs=["make_prediction"])
```

During freezing the following will occur:
1. do some pre-freezing, including inlining; in particular, forward will be inlined into make_prediction. During inlining, forward.optimized_graph() is called, and the result is cached
2. freeze some methods. While freezing forward, the graph associated with the function will get updated. The cached optimized_graphs_ are not updated.

Previously, a call to `mod.forward(x)` would return an executor that would run on the old cached optimized_graph(). This would mean that the freezing optimizations would not apply, and potentially that the execution would fail because of parameters removed from the module.

This change clears the optimized_graphs_ cache after running freezing to prevent executing an old version of the graph.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32410862

Pulled By: davidberard98

fbshipit-source-id: dd8bfe86ec2898b7c72813ab32c08f25c38e4cea
2021-11-16 17:15:29 -08:00
75ccb07b26 [SR] LOG->VLOG (#68477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68477

We're printing a lot of unnecessary logs in prod. Change these from LOG(INFO) to VLOG(1) so you can easily flip them back for testing.

Test Plan: CI

Reviewed By: ajyu, d1jang

Differential Revision: D32439776

fbshipit-source-id: 40fa57f4eeb6ca0b610008062cc94aed62fb6981
2021-11-16 17:09:52 -08:00
515d9fb2a9 Add OpInfo for torch.histc (#67452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67452
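
For reference, the op gaining coverage (this is the example from the `torch.histc` documentation):
```
import torch

# 4 equal-width bins over [0, 3]; elements outside [min, max] are ignored
print(torch.histc(torch.tensor([1., 2., 1.]), bins=4, min=0, max=3))
# tensor([0., 2., 1., 0.])
```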

Reviewed By: davidberard98

Differential Revision: D32453690

Pulled By: saketh-are

fbshipit-source-id: 6311519dc1b2e92a200d0455d32a9c7301a45d51
2021-11-16 13:55:30 -08:00
a8bcfc90f5 fix fsdp overlap flaky test (#68415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68415

remove e4["cpu_iter"] from short list as cpu may take some time to queue both compute and all-gather.
close #68391
ghstack-source-id: 143478769

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D32457334

fbshipit-source-id: baeedfb628ce4554a1ef365c3a2de27b8884f6d4
2021-11-16 13:52:13 -08:00
27eca2c6fd Revert D32467139: [pytorch][PR] [android][fbjni] Update fbjni to 0.2.2
Test Plan: revert-hammer

Differential Revision:
D32467139 (04056df475)

Original commit changeset: 49e155989d2d

fbshipit-source-id: ce03be3c6f209a6e9969660bd823d5343a7f0615
2021-11-16 13:50:50 -08:00
284758b585 correct NLLLoss parameters default value (#68426)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/17577

Previous
`size_average by default:  True`
`reduce by default: True`
Present
`size_average by default:  None`
`reduce by default: None`
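
Both arguments are deprecated in favor of `reduction`; a minimal sketch:
```
import torch
import torch.nn as nn

# size_average/reduce now document their real default (None, i.e. deprecated);
# prefer the reduction= argument instead
loss_fn = nn.NLLLoss(reduction="mean")
log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
target = torch.tensor([1, 0, 4])
print(loss_fn(log_probs, target))
```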

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68426

Reviewed By: VitalyFedyunin

Differential Revision: D32463324

Pulled By: jbschlosser

fbshipit-source-id: 7ba9cd03c9fb6b2f19301e7e39c3c490de17202b
2021-11-16 13:45:52 -08:00
76e9dbb0f4 [torch.fx] add code-gen customizability and support for setting breakpoint in code-gen'd forward() call (#67139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67139

This diff enables setting a breakpoint in the graph module's generated Python code. See the test plan for usage.

In order to support this functionality, and other similar ways of customizing the generated code, a code transformer hook is added to `fx.Graph`. This allows flexible customization of `fx.Graph`'s code-gen behavior in composable and functional ways. See the test plan for its usage.

Test Plan:
### Use of `fx.experimental.debug.set_trace`

```
In [2]: from torch.fx.experimental.debug import set_trace

In [3]: set_trace(ttop)
Out[3]:
top(
  (a): Sub()
)

In [4]: ttop(1)
> /data/users/kefeilu/fbsource33/fbcode/buck-out/dev/gen/caffe2/torch/fb/fx2trt/<eval_with_key>.10(6)forward()
(Pdb) l
  1
  2
  3
  4     def forward(self, x):
  5         import pdb; pdb.set_trace()
  6  ->     a = self.a(x);  x = None
  7         getitem = a[0]
  8         getitem_1 = a[0];  a = None
  9         add = getitem + getitem_1;  getitem = getitem_1 = None
 10         return add
 11
(Pdb)
```

### Use of `on_generate_code`

```
In [1]: def insert_pdb(body):
   ...:     return ['import pdb; pdb.set_trace()\n', *body]
   ...:

In [8]: type(ttop)
Out[8]: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl

In [10]: with ttop.graph.on_generate_code(lambda _: insert_pdb):
    ...:     ttop.recompile()
    ...:     print(f"== _on_generate_code should not be None: { ttop.graph._on_generate_code }")
    ...:     print(ttop.code)
    ...:

== _on_generate_code should not be None: <function insert_pdb at 0x7fc9895ddd30>

def forward(self, x):
    import pdb; pdb.set_trace()
    a = self.a(x);  x = None
    getitem = a[0]
    getitem_1 = a[0];  a = None
    add = getitem + getitem_1;  getitem = getitem_1 = None
    return add

In [11]: ttop.graph._on_generate_code  # restored to None

In [12]: ttop(1) # this should drop into pdb
> /data/users/kefeilu/fbsource33/fbcode/buck-out/dev/gen/caffe2/torch/fb/fx2trt/<eval_with_key>.6(6)forward()
(Pdb) l
  1
  2
  3
  4     def forward(self, x):
  5         import pdb; pdb.set_trace()
  6  ->     a = self.a(x);  x = None
  7         getitem = a[0]
  8         getitem_1 = a[0];  a = None
  9         add = getitem + getitem_1;  getitem = getitem_1 = None
 10         return add
 11
```

Reviewed By: jamesr66a

Differential Revision: D30736160

fbshipit-source-id: 9646867aae0461b5131dfd4ba9ee77a8c2ea9c93
2021-11-16 13:28:11 -08:00
8954c92529 [PyTorch][Static Runtime] Borrow outputs in static_runtime::VarTupleUnpack (#68161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68161

Continuing rollout of borrowing outputs for native ops.
ghstack-source-id: 143424920

Test Plan:
Compare CMF local_ro perf again.

Previous diff:
```
I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```

This diff:
```
I1110 22:18:48.453075 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.931188. Iters per second: 1073.9
I1110 22:18:48.967614 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.933196. Iters per second: 1071.59
I1110 22:18:49.483338 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.932087. Iters per second: 1072.86
I1110 22:18:49.997144 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.930877. Iters per second: 1074.26
I1110 22:18:50.529383 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.936981. Iters per second: 1067.26
I1110 22:18:51.085038 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.953214. Iters per second: 1049.08
I1110 22:18:51.607192 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.940719. Iters per second: 1063.02
I1110 22:18:52.126169 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.942638. Iters per second: 1060.85
I1110 22:18:52.644445 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.937574. Iters per second: 1066.58
I1110 22:18:53.163486 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.941636. Iters per second: 1061.98
I1110 22:18:53.163537 191647 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 0.938011, standard deviation: 0.00691196
```

0.099 (9.5%!) usec/iter improvement over previous diff

Reviewed By: hlu1

Differential Revision: D32347900

fbshipit-source-id: 8169ebcadf1248e555a18bbffa99eef6cac1ba85
2021-11-16 12:32:15 -08:00
755be54c77 [PyTorch][Static Runtime] Borrow outputs in static_runtime::dict_unpack (#68160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68160

This generalizes the mechanism D32318674 added for letting native ops borrow their outputs and uses it in dict_unpack.
ghstack-source-id: 143424919

Test Plan:
4.5% in CMF local_ro compared to D32318674 (previous two diffs were necessary steps but didn't get the full win yet):

```
FastAliasingInSelectTensor, local_ro
========================================
I1110 22:06:37.549811 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08488. Iters per second: 921.76
I1110 22:06:38.147949 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08675. Iters per second: 920.171
I1110 22:06:38.766340 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08626. Iters per second: 920.592
I1110 22:06:39.366608 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08376. Iters per second: 922.717
I1110 22:06:39.964979 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08362. Iters per second: 922.833
I1110 22:06:40.565248 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08423. Iters per second: 922.312
I1110 22:06:41.167326 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0945. Iters per second: 913.659
I1110 22:06:41.766187 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08373. Iters per second: 922.742
I1110 22:06:42.367816 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08995. Iters per second: 917.475
I1110 22:06:42.968391 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08854. Iters per second: 918.665
I1110 22:06:42.968446 119627 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.08662, standard deviation: 0.00351662

BorrowDictUnpackOutputs, local_ro
========================================

I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```

0.04936 (4.5%) usec/iter improvement

Reviewed By: hlu1

Differential Revision: D32347390

fbshipit-source-id: e636ddafacf30ed2a2d84a6e15fff97481342fdb
2021-11-16 12:31:03 -08:00
bbc24222d2 [PyTorch][Static Runtime] Refcount bump pass in native_ops (#68159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68159

These all look like they'll cause unnecessary refcount bumps to me.
ghstack-source-id: 143424917

Test Plan:
CI

TODO profile local_ro?

Reviewed By: hlu1

Differential Revision: D32347392

fbshipit-source-id: d8ed91b5855b86765db00c61ad3650273302c7b6
2021-11-16 12:27:12 -08:00
86399d8e0c Add histogramdd to torch.rst (#68273)
Summary:
The `torch.histogramdd` operator is documented in `torch/functional.py` but does not appear in the generated docs because it is missing from `docs/source/torch.rst`.
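
A minimal usage sketch of the operator whose docs this surfaces (shapes here are illustrative):
```
import torch

points = torch.rand(100, 3)                       # 100 samples in 3-D
hist, bin_edges = torch.histogramdd(points, bins=(4, 4, 4))
print(hist.shape)                                 # torch.Size([4, 4, 4])
print([e.numel() for e in bin_edges])             # [5, 5, 5] edges per dimension
```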

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68273

Reviewed By: cpuhrsch

Differential Revision: D32470522

Pulled By: saketh-are

fbshipit-source-id: a23e73ba336415457a30bae568bda80afa4ae3ed
2021-11-16 11:55:40 -08:00
ed00a763a2 [PyTorch] Don't force refcount bump when accessing DictEntryRef key/value (#68158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68158

to() sometimes returns a reference; let's forward that through.
ghstack-source-id: 143424916

Test Plan: Combined with following diff, seeing a huge drop in dict_unpack self time in ctr_mobile_feed local_ro net. Following diff by itself didn't work.

Reviewed By: suo

Differential Revision: D32347391

fbshipit-source-id: da96295bf83ea30867a2e3fceedc9b4e0a33ffa3
2021-11-16 11:44:08 -08:00
04056df475 [android][fbjni] Update fbjni to 0.2.2 (#68400)
Summary:
ghstack-source-id: caeb8df3a18a6fa48d591af126ac59d8e41494b5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68400

Fixes #{issue number}

Updates fbjni version to 0.2.2

ci-all PR: https://github.com/pytorch/pytorch/pull/68401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68402

Reviewed By: linbinyu

Differential Revision: D32467139

Pulled By: IvanKobzarev

fbshipit-source-id: 49e155989d2dbafedd5b2df77e089e25e8b4f8f8
2021-11-16 11:34:46 -08:00
df129fa8d6 [PyTorch] Support MaybeOwned<IValue> (#68157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68157

Does what it says on the tin. I don't have a use for `MaybeOwned<IValue>` itself right now, but following diffs will use `MaybeOwnedTraits<IValue>::{create,destroy}Borrow` and I thought it best to just provide the full thing.
ghstack-source-id: 143424915

Test Plan: Extended MaybeOwned tests to cover this.

Reviewed By: hlu1

Differential Revision: D32347393

fbshipit-source-id: 219658cb69b951d36dee80c2ae51387328224866
2021-11-16 11:24:32 -08:00
030ee34216 Add OpInfo for torch.nonzero (#67459)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67459

Reviewed By: davidberard98

Differential Revision: D32453687

Pulled By: saketh-are

fbshipit-source-id: e7ed5601686d88407bf67bca0f75304b30fa7ac5
2021-11-16 11:10:43 -08:00
10e9d80ad1 [PyTorch][Static Runtime] Don't track scalar ivalues (#67702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67702

This isn't a particularly large optimization and it does
nothing before select_tensor is introduced (I'm surprised that no
operators have optimizable outputs!), but it seems like we should probably get the savings.
ghstack-source-id: 143424918

Test Plan: CI; checked `--do_profile=1` output with the following diff, and we save tracking hundreds of values, as expected.

Reviewed By: hlu1

Differential Revision: D32112522

fbshipit-source-id: 1804b77992a73670bfc1e36af608b852b8261bd2
2021-11-16 11:05:42 -08:00
391be39575 Use reduced precision switch in test_addmm_baddbmm_overflow (#68399)
Summary:
https://github.com/pytorch/pytorch/issues/68125
Checking to see if actually using the switch fixes the test...

CC mruberry ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68399

Reviewed By: VitalyFedyunin

Differential Revision: D32466974

Pulled By: ngimel

fbshipit-source-id: aa8643ed913b344977fd103974625c527d20dbb8
2021-11-16 10:50:17 -08:00
5c3529a86d [lint] small pass to make lint clean (#68367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367

- bmm_test.py was using syntax not allowed in 3.6
- Some suppressions were not placed on the correct line.

With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.

Test Plan: Imported from OSS

Reviewed By: janeyx99, mrshenli

Differential Revision: D32436644

Pulled By: suo

fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
2021-11-16 10:27:00 -08:00
639258499f [PyTorch][Static Runtime] Add & use "small array" for ProcessedNodeInputs (#67935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935

Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
ghstack-source-id: 143429112

Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.

Previous diff, local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

This diff, local:

```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)

Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.

Reviewed By: hlu1

Differential Revision: D32216573

fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
2021-11-16 10:21:12 -08:00
6acde23bec [PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934

This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
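
As a toy Python illustration of this layout (the names and structure here are assumptions; the real implementation is C++ inside ProcessedNode):
```
import array

outputs = array.array("H")            # one shared pool of 2-byte value indices
nodes = []                            # per-node (offset, num_outputs), 2 bytes each
for node_outputs in ([3, 7], [1], [4, 5, 6]):
    offset = len(outputs)
    assert offset + len(node_outputs) <= 2**16 - 1   # the repr's capacity limit
    outputs.extend(node_outputs)
    nodes.append((offset, len(node_outputs)))

off, n = nodes[2]
print(list(outputs[off:off + n]))     # [4, 5, 6]
```
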
ghstack-source-id: 143429113

Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.

Previous diff, CMF local:

```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```

This diff, CMF local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

72912 bytes (17%) savings

Perf looks neutral; see next diff (D32216573) test plan for details.

Reviewed By: hlu1

Differential Revision: D32190751

fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
2021-11-16 10:19:50 -08:00
8678472ec8 [PyTorch][Static Runtime] Save 2 pointers in ProcessedNode (#67860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67860

We don't need 8-byte sizes for inputs and outputs, and we only need op names if profiling isn't disabled.
ghstack-source-id: 143429111

Test Plan:
Ran CMF local & local_ro with recordio inputs. I'm calling
the result inconclusive/neutral because I saw some noise (as you'll
see below), but that's fine with me since this is a clear memory win.

```
Nov4Stable, local_ro
========================================
I1104 09:53:08.875444 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19925. Iters per second: 833.851
I1104 09:53:10.200443 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.1996. Iters per second: 833.608
I1104 09:53:11.524045 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19746. Iters per second: 835.103
I1104 09:53:12.851861 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20479. Iters per second: 830.019
I1104 09:53:14.183387 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20487. Iters per second: 829.964
I1104 09:53:14.183427 505783 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.2012, standard deviation: 0.00341762

re-ran stable in light of the baffling regression (see next entry), and sure enough we still have some significant run-to-run variation:

I1104 09:56:15.244969 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24956. Iters per second: 800.28
I1104 09:56:16.621292 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24776. Iters per second: 801.437
I1104 09:56:18.018808 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25247. Iters per second: 798.42
I1104 09:56:19.399660 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25054. Iters per second: 799.656
I1104 09:56:20.781828 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25052. Iters per second: 799.664
I1104 09:56:20.781878 524012 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.25017, standard deviation: 0.00171396

Nov4SaveTwoWordsInProcessedNode, local_ro
========================================
I1104 09:53:42.070139 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.2411. Iters per second: 805.736
I1104 09:53:43.438390 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24102. Iters per second: 805.788
I1104 09:53:44.773303 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20682. Iters per second: 828.621
I1104 09:53:46.110538 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21216. Iters per second: 824.973
I1104 09:53:47.448279 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21265. Iters per second: 824.639
I1104 09:53:47.448334 508309 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22275, standard deviation: 0.0168698

early runs look like a glitch, rerunning

I1104 09:54:20.999117 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24558. Iters per second: 802.841
I1104 09:54:22.376780 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24436. Iters per second: 803.623
I1104 09:54:23.738584 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23176. Iters per second: 811.845
I1104 09:54:25.113063 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24938. Iters per second: 800.395
I1104 09:54:26.476349 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23552. Iters per second: 809.377
I1104 09:54:26.476395 511022 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.24132, standard deviation: 0.00737197

Nov4Stable, local
========================================

I1104 09:57:56.854537 533814 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177885632
I1104 09:58:02.829813 533814 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 09:58:03.010681 533814 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590507056
I1104 09:58:03.010710 533814 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 09:58:58.839010 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0567. Iters per second: 49.8586
I1104 09:59:54.797755 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.1007. Iters per second: 49.7494
I1104 10:00:50.696525 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0657. Iters per second: 49.8363
I1104 10:01:46.514736 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0696. Iters per second: 49.8265
I1104 10:02:42.378270 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0641. Iters per second: 49.8402
I1104 10:02:42.378316 533814 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 20.0714, standard deviation: 0.0170605
I1104 10:02:42.378325 533814 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591882400

Nov4SaveTwoWordsInProcessedNode, local
========================================
I1104 10:38:15.543320 733514 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177721792
I1104 10:38:21.224673 733514 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 10:38:21.382973 733514 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590343216
I1104 10:38:21.382992 733514 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 10:39:17.005359 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9498. Iters per second: 50.1257
I1104 10:40:12.545269 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9279. Iters per second: 50.1808
I1104 10:41:08.138119 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.999. Iters per second: 50.0026
I1104 10:42:03.686841 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9115. Iters per second: 50.2222
I1104 10:42:55.137498 733539 Proxy2Connection.cpp:343] Received NotRegisteredException from Configerator Proxy2.
I1104 10:42:55.138715 733539 ReadOnlyConnectionIf.h:91] Mark connection as healthy.
I1104 10:42:55.384534 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6297. Iters per second: 50.9433
I1104 10:42:55.384579 733514 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.8836, standard deviation: 0.14571
I1104 10:42:55.384588 733514 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591711760
```

Reviewed By: d1jang

Differential Revision: D32177531

fbshipit-source-id: 267e38a151d2dbab34fd648135d173cfbee1c22e
2021-11-16 10:12:53 -08:00
45b2f41c3e [package] fix torchscript classes in package (#68028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68028

Today, we demangle a typename before passing it to the TorchScript
compiler. This breaks compilation of torch classes in cases where we are
attempting to script the same class name from inside a package and outside it,
since we will return the same qualified name for both.

Differential Revision: D32261907

Test Plan: Imported from OSS

Reviewed By: saketh-are

Pulled By: suo

fbshipit-source-id: 921bc03ad385d94b9279fbc6f3b7dcd0ddbe5bc7
2021-11-16 10:01:40 -08:00
ba16b1eca7 [numpy] Alias arctan2 to atan2 (#67010)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65906

Adds an alias `arctan2` to improve numpy compatibility
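
A minimal sketch of the alias (assuming a build that includes this change):
```
import torch

y = torch.tensor([1.0, -1.0])
x = torch.tensor([-1.0, 1.0])
assert torch.equal(torch.arctan2(y, x), torch.atan2(y, x))  # pure alias of atan2
```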

cc mruberry rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67010

Reviewed By: anjali411

Differential Revision: D32378998

Pulled By: mruberry

fbshipit-source-id: 424c5c10c12b49c20ee83ccd109325c480b5b6cf
2021-11-16 09:41:09 -08:00
6226a3cf74 [Vulkan] Implement permute operator (#68274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68274

Implemented `permute` operator on the Vulkan backend:
* Supports only <= 4D tensors.
* Builds up shader operations from the output texture point of view to avoid the nondeterministic order of GPU shader operations between texels. See [incoherent memory access](https://www.khronos.org/opengl/wiki/Memory_Model#Incoherent_memory_access)
* Generalized input tensors to 4D ones to simplify input/output texture handling. For example, {2, 3} is treated as {1,1,2,3} internally.
* 1D to 4D inputs with all possible permutations are used for test cases.
* Reference on CPU implementation of `permute` operator: [TensorShape.cpp](cbf596bf8e/aten/src/ATen/native/TensorShape.cpp (L936))
* When shuffling dims, the new depth of the output texture needs to be determined by `ceil((batch * channel) / 4)`. This logic needs to be handled in a separate change.
    * The depth of a texture cannot exceed a device-dependent limit. It is typically 2048 on most Android devices but always less than or equal to 16,384 (see [Value distribution for maxImageDimension3D on Android](https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxImageDimension3D&platform=android)); e.g., it is 2048 on macOS and the Google Pixel 5.
    * Due to this limitation, the `permute` op needs to throw an exception if the depth of the output texture is greater than or equal to `VkImageFormatProperties.maxExtent.depth`.
    * Otherwise, the following error will occur: `-[MTLTextureDescriptorInternal validateWithDevice:]:1325: failed assertion "Texture Descriptor Validation MTLTextureDescriptor has depth (10664) greater than the maximum allowed size of 2048."`
* Vulkan `permute` operator tensor conversion:
{F679505029}
{F679505223}
* Vulkan `permute` operator shader equation:
{F679504799}
* Error/edge cases:
```
X = torch.randint(0, 23, (2, 3, 2, 2))
O = torch.permute(X, (2, 2, 1, 0))
# RuntimeError: repeated dim in permute

O = torch.permute(X, (2, 1, 0))
# RuntimeError: number of dims don't match in permute

O = torch.permute(X, (4, 3, 2, 1, 0))
# RuntimeError: number of dims don't match in permute

O = torch.permute(X, (3, 2, -1, 0))
# RuntimeError: repeated dim in permute

data2 = [0,1,2]
X2 = torch.tensor(data2)
O2 = torch.permute(X2, (0))
# permute(): argument 'dims' (position 2) must be tuple of ints, not int
# TypeError: permute(): argument 'dims' (position 2) must be tuple of ints, not int

O = torch.permute(X, (0, 1, 2, 3))
# does nothing since the dims don't change
```
* Shader debug traces with a 4D tensor size {2,3,2,2} with permute by {3,2,1,0}:
```
output tensor:
(1,1,.,.) =
  0.4395  0.5652
  0.1309  0.9768
  0.0490  0.1127
(2,1,.,.) =
  0.7058  0.2238
  0.6542  0.4064
  0.4813  0.0500
(1,2,.,.) =
  0.1716  0.4951
  0.2225  0.3255
  0.0758  0.7150
(2,2,.,.) =
  0.3762  0.0228
  0.6367  0.4411
  0.7682  0.7599
[ CPUFloatType{2,2,3,2} ]

shader debug traces:
src_index:0, b c h w: 0 0 0 0, posIn: (0 0 0) i:0 -> b c h w: 0 0 0 0, dst_index: 0, posOut: (0 0 0) j:0 -> inval[0.439453] outval[0.439453] -> inval[0.439453 0.130859 0.049011 0.564941] outval[0.439453 0.000000 0.000000 0.000000]
src_index:3, b c h w: 1 0 0 0, posIn: (0 0 0) i:3 -> b c h w: 0 0 0 1, dst_index: 0, posOut: (1 0 0) j:0 -> inval[0.564941] outval[0.564941] -> inval[0.439453 0.130859 0.049011 0.564941] outval[0.564941 0.000000 0.000000 0.000000]
src_index:1, b c h w: 0 1 0 0, posIn: (0 0 0) i:1 -> b c h w: 0 0 1 0, dst_index: 0, posOut: (0 1 0) j:0 -> inval[0.130859] outval[0.130859] -> inval[0.439453 0.130859 0.049011 0.564941] outval[0.130859 0.000000 0.000000 0.000000]
src_index:4, b c h w: 1 1 0 0, posIn: (0 0 1) i:0 -> b c h w: 0 0 1 1, dst_index: 0, posOut: (1 1 0) j:0 -> inval[0.976562] outval[0.976562] -> inval[0.976562 0.112671 -65504.000000 -65504.000000] outval[0.976562 0.000000 0.000000 0.000000]
src_index:2, b c h w: 0 2 0 0, posIn: (0 0 0) i:2 -> b c h w: 0 0 2 0, dst_index: 0, posOut: (0 2 0) j:0 -> inval[0.049011] outval[0.049011] -> inval[0.439453 0.130859 0.049011 0.564941] outval[0.049011 0.000000 0.000000 0.000000]
src_index:5, b c h w: 1 2 0 0, posIn: (0 0 1) i:1 -> b c h w: 0 0 2 1, dst_index: 0, posOut: (1 2 0) j:0 -> inval[0.112671] outval[0.112671] -> inval[0.976562 0.112671 -65504.000000 -65504.000000] outval[0.112671 0.000000 0.000000 0.000000]
src_index:0, b c h w: 0 0 1 0, posIn: (0 1 0) i:0 -> b c h w: 0 1 0 0, dst_index: 1, posOut: (0 0 0) j:1 -> inval[0.171509] outval[0.171509] -> inval[0.171509 0.222412 0.075745 0.494873] outval[0.439453 0.171509 0.000000 0.000000]
src_index:3, b c h w: 1 0 1 0, posIn: (0 1 0) i:3 -> b c h w: 0 1 0 1, dst_index: 1, posOut: (1 0 0) j:1 -> inval[0.494873] outval[0.494873] -> inval[0.171509 0.222412 0.075745 0.494873] outval[0.564941 0.494873 0.000000 0.000000]
src_index:1, b c h w: 0 1 1 0, posIn: (0 1 0) i:1 -> b c h w: 0 1 1 0, dst_index: 1, posOut: (0 1 0) j:1 -> inval[0.222412] outval[0.222412] -> inval[0.171509 0.222412 0.075745 0.494873] outval[0.130859 0.222412 0.000000 0.000000]
src_index:4, b c h w: 1 1 1 0, posIn: (0 1 1) i:0 -> b c h w: 0 1 1 1, dst_index: 1, posOut: (1 1 0) j:1 -> inval[0.325439] outval[0.325439] -> inval[0.325439 0.714844 -65504.000000 -65504.000000] outval[0.976562 0.325439 0.000000 0.000000]
src_index:2, b c h w: 0 2 1 0, posIn: (0 1 0) i:2 -> b c h w: 0 1 2 0, dst_index: 1, posOut: (0 2 0) j:1 -> inval[0.075745] outval[0.075745] -> inval[0.171509 0.222412 0.075745 0.494873] outval[0.049011 0.075745 0.000000 0.000000]
src_index:5, b c h w: 1 2 1 0, posIn: (0 1 1) i:1 -> b c h w: 0 1 2 1, dst_index: 1, posOut: (1 2 0) j:1 -> inval[0.714844] outval[0.714844] -> inval[0.325439 0.714844 -65504.000000 -65504.000000] outval[0.112671 0.714844 0.000000 0.000000]
src_index:0, b c h w: 0 0 0 1, posIn: (1 0 0) i:0 -> b c h w: 1 0 0 0, dst_index: 2, posOut: (0 0 0) j:2 -> inval[0.705566] outval[0.705566] -> inval[0.705566 0.653809 0.481201 0.223755] outval[0.439453 0.171509 0.705566 0.000000]
src_index:3, b c h w: 1 0 0 1, posIn: (1 0 0) i:3 -> b c h w: 1 0 0 1, dst_index: 2, posOut: (1 0 0) j:2 -> inval[0.223755] outval[0.223755] -> inval[0.705566 0.653809 0.481201 0.223755] outval[0.564941 0.494873 0.223755 0.000000]
src_index:1, b c h w: 0 1 0 1, posIn: (1 0 0) i:1 -> b c h w: 1 0 1 0, dst_index: 2, posOut: (0 1 0) j:2 -> inval[0.653809] outval[0.653809] -> inval[0.705566 0.653809 0.481201 0.223755] outval[0.130859 0.222412 0.653809 0.000000]
src_index:4, b c h w: 1 1 0 1, posIn: (1 0 1) i:0 -> b c h w: 1 0 1 1, dst_index: 2, posOut: (1 1 0) j:2 -> inval[0.406250] outval[0.406250] -> inval[0.406250 0.049957 -65504.000000 -65504.000000] outval[0.976562 0.325439 0.406250 0.000000]
src_index:2, b c h w: 0 2 0 1, posIn: (1 0 0) i:2 -> b c h w: 1 0 2 0, dst_index: 2, posOut: (0 2 0) j:2 -> inval[0.481201] outval[0.481201] -> inval[0.705566 0.653809 0.481201 0.223755] outval[0.049011 0.075745 0.481201 0.000000]
src_index:5, b c h w: 1 2 0 1, posIn: (1 0 1) i:1 -> b c h w: 1 0 2 1, dst_index: 2, posOut: (1 2 0) j:2 -> inval[0.049957] outval[0.049957] -> inval[0.406250 0.049957 -65504.000000 -65504.000000] outval[0.112671 0.714844 0.049957 0.000000]
src_index:0, b c h w: 0 0 1 1, posIn: (1 1 0) i:0 -> b c h w: 1 1 0 0, dst_index: 3, posOut: (0 0 0) j:3 -> inval[0.376221] outval[0.376221] -> inval[0.376221 0.636719 0.768066 0.022751] outval[0.439453 0.171509 0.705566 0.376221] outval_after[0.439453 0.171509 0.705566 0.376221]
src_index:3, b c h w: 1 0 1 1, posIn: (1 1 0) i:3 -> b c h w: 1 1 0 1, dst_index: 3, posOut: (1 0 0) j:3 -> inval[0.022751] outval[0.022751] -> inval[0.376221 0.636719 0.768066 0.022751] outval[0.564941 0.494873 0.223755 0.022751] outval_after[0.564941 0.494873 0.223755 0.022751]
src_index:1, b c h w: 0 1 1 1, posIn: (1 1 0) i:1 -> b c h w: 1 1 1 0, dst_index: 3, posOut: (0 1 0) j:3 -> inval[0.636719] outval[0.636719] -> inval[0.376221 0.636719 0.768066 0.022751] outval[0.130859 0.222412 0.653809 0.636719] outval_after[0.130859 0.222412 0.653809 0.636719]
src_index:4, b c h w: 1 1 1 1, posIn: (1 1 1) i:0 -> b c h w: 1 1 1 1, dst_index: 3, posOut: (1 1 0) j:3 -> inval[0.440918] outval[0.440918] -> inval[0.440918 0.759766 -65504.000000 -65504.000000] outval[0.976562 0.325439 0.406250 0.440918] outval_after[0.976562 0.325439 0.406250 0.440918]
src_index:2, b c h w: 0 2 1 1, posIn: (1 1 0) i:2 -> b c h w: 1 1 2 0, dst_index: 3, posOut: (0 2 0) j:3 -> inval[0.768066] outval[0.768066] -> inval[0.376221 0.636719 0.768066 0.022751] outval[0.049011 0.075745 0.481201 0.768066] outval_after[0.049011 0.075745 0.481201 0.768066]
src_index:5, b c h w: 1 2 1 1, posIn: (1 1 1) i:1 -> b c h w: 1 1 2 1, dst_index: 3, posOut: (1 2 0) j:3 -> inval[0.759766] outval[0.759766] -> inval[0.440918 0.759766 -65504.000000 -65504.000000] outval[0.112671 0.714844 0.049957 0.759766] outval_after[0.112671 0.714844 0.049957 0.759766]
```

Test Plan:
Build & test on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Build & test on MacOS:
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```
Test result on Android (Google Pixel 5):
```
[ RUN      ] VulkanAPITest.permute_2d_success
[       OK ] VulkanAPITest.permute_2d_success (26 ms)
[ RUN      ] VulkanAPITest.permute_3d_success
[       OK ] VulkanAPITest.permute_3d_success (6 ms)
[ RUN      ] VulkanAPITest.permute_4d_success
[       OK ] VulkanAPITest.permute_4d_success (10 ms)
[ RUN      ] VulkanAPITest.permute_4dmclaren_success
[       OK ] VulkanAPITest.permute_4dmclaren_success (1 ms)
[ RUN      ] VulkanAPITest.permute_4dbig_success
[       OK ] VulkanAPITest.permute_4dbig_success (234 ms)
[ RUN      ] VulkanAPITest.permute_negativedims_success
[       OK ] VulkanAPITest.permute_negativedims_success (0 ms)
[ RUN      ] VulkanAPITest.permute_1d_nochange
[       OK ] VulkanAPITest.permute_1d_nochange (0 ms)
[ RUN      ] VulkanAPITest.permute_sameDims_nochange
[       OK ] VulkanAPITest.permute_sameDims_nochange (1 ms)
[ RUN      ] VulkanAPITest.permute_invalidinputs_exceptions
[       OK ] VulkanAPITest.permute_invalidinputs_exceptions (1 ms)
```
Test result on MacOS:
```
[ RUN      ] VulkanAPITest.permute_2d_success
[       OK ] VulkanAPITest.permute_2d_success (154 ms)
[ RUN      ] VulkanAPITest.permute_3d_success
[       OK ] VulkanAPITest.permute_3d_success (13 ms)
[ RUN      ] VulkanAPITest.permute_4d_success
[       OK ] VulkanAPITest.permute_4d_success (33 ms)
[ RUN      ] VulkanAPITest.permute_4dmclaren_success
[       OK ] VulkanAPITest.permute_4dmclaren_success (2 ms)
[ RUN      ] VulkanAPITest.permute_4dbig_success
[       OK ] VulkanAPITest.permute_4dbig_success (251 ms)
[ RUN      ] VulkanAPITest.permute_negativedims_success
[       OK ] VulkanAPITest.permute_negativedims_success (2 ms)
[ RUN      ] VulkanAPITest.permute_1d_nochange
[       OK ] VulkanAPITest.permute_1d_nochange (1 ms)
[ RUN      ] VulkanAPITest.permute_sameDims_nochange
[       OK ] VulkanAPITest.permute_sameDims_nochange (0 ms)
[ RUN      ] VulkanAPITest.permute_invalidinputs_exceptions
[       OK ] VulkanAPITest.permute_invalidinputs_exceptions (2 ms)
```

Reviewed By: SS-JIA

Differential Revision: D32292554

fbshipit-source-id: dbeaee6ff98633022cf34d6da90662d81eac6b0e
2021-11-16 09:27:51 -08:00
bc3d380ed1 Throw error when saving storages that view same data with different type (#66949)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58970
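
A hypothetical repro of the situation this now guards against (the exact error type and message are assumptions):
```
import torch

t = torch.randn(4)
u = t.view(torch.int32)            # same storage viewed with a different dtype
# With this change, serializing both together raises instead of silently
# writing ambiguous storage data:
# torch.save([t, u], "both.pt")    # -> error after this PR
```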

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66949

Reviewed By: albanD

Differential Revision: D31926323

Pulled By: anjali411

fbshipit-source-id: f6e7acc0c1968b70a94f9b0b69a32780e8e21a62
2021-11-16 08:44:44 -08:00
bf60c6e71b [JIT] remove prim::SetAttr from list of ops with side effects (#68311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68311

prim::SetAttr is listed as an op with side effects, but in AliasDb, `analyzeSetAttr` already accounts for its behavior. By removing it from the list of ops with side effects, dead code elimination will work in a few other scenarios.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32409510

fbshipit-source-id: 52ed9e19f92afb95c669ad3c2440f72f9515ba4c
2021-11-16 08:39:24 -08:00
add79722dd Correct householder_product docs. (#68335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68335

When discussing https://github.com/pytorch/pytorch/pull/63880, we
realised that the docs of `householder_product` were not correct. This
PR fixes this.

The new docs are slightly more difficult, but hopefully correct. Note
that this is a LAPACK function in disguise, so the specification is
expected to be more involved than usual.
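
As a quick sanity check of what the function computes (a sketch; tolerances are arbitrary):
```
import torch

A = torch.randn(4, 3)
refl, tau = torch.geqrf(A)                       # Householder reflectors and taus
Q = torch.linalg.householder_product(refl, tau)  # rebuild Q from the reflectors
torch.testing.assert_close(Q.t() @ Q, torch.eye(3), atol=1e-5, rtol=1e-5)
```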

cc brianjo mruberry jianyuh nikitaved pearu walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32429755

Pulled By: mruberry

fbshipit-source-id: 3ac866d30984adcd9f3b83d7fa9ae7b7ae5d4b53
2021-11-16 07:54:24 -08:00
01a8862582 OpInfo tests for nn.functional.max_pool{n}d. (#68075)
Summary:
As per title.

It is planned to use these tests for fixing issues with the max_unpools' backward methods reported in https://github.com/pytorch/pytorch/issues/67658 and https://github.com/pytorch/pytorch/issues/67657.
The max_unpool backward methods are currently untested and are implemented with custom kernels. We can replace these kernels with advanced indexing operations (i.e. `gather`), which are efficient and well tested.
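
A sketch of the kind of call these OpInfos exercise (shapes are illustrative):
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4, requires_grad=True)
out, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)
rec = F.max_unpool2d(out, idx, kernel_size=2)   # the op whose backward is at issue
rec.sum().backward()                            # exercises the backward path
print(x.grad.shape)                             # torch.Size([1, 1, 4, 4])
```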

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68075

Reviewed By: malfet

Differential Revision: D32308317

Pulled By: mruberry

fbshipit-source-id: 9f91c6e6a9d78c19230e93fc0a3164f4eb7b8ec5
2021-11-16 07:28:32 -08:00
33e9a0b5f6 [Reland] Python tracer. (#68325)
Summary:
There were two issues with the original PR:
1) My assumption that bound C functions could be trusted to stay alive was not valid. I'm still not entirely sure what was dying, but I've just added a cache so that the first time I see a function I collect the repr just like I was already doing with Python functions.

2) `std::regex` is known to be badly broken and prone to segfaults. Because I'm just doing a very simple prefix prune, it's fine to do it manually; see `trimPrefix`. Long term we should move all of PyTorch to `re2` as the internal lint suggests, but CMake is hard and I couldn't get it to work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68325

Reviewed By: chaekit

Differential Revision: D32432596

Pulled By: robieta

fbshipit-source-id: 06fb4bcdc6933a3e76f6021ca69dc77a467e4b2e
2021-11-15 23:32:49 -08:00
438ca7603f Fix sign comparison issue in Histogram.cpp (#68294)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68294

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D32403821

fbshipit-source-id: cdbf1d83ab02b1e996559e4cfbbe699b7165483a
2021-11-15 23:14:04 -08:00
ec742c65d5 Fix a sign comparison issue in BatchLinearAlgebraLib.cpp (#68293)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68293

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D32403788

fbshipit-source-id: 1afc5e62e7157f144ec36b029ee3bcc6c23d65a1
2021-11-15 23:12:56 -08:00
d541aa8cbe [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32454757

fbshipit-source-id: ffb46701843245ac040905423eb950902b51951d
2021-11-15 21:54:23 -08:00
27cc11226d make broadcast fastpath the default for currently rolled-out ops (#68365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68365

As titled: broadcast fastpath has been running fine for the enabled ops for a while now, so make it the default for these ops.

Test Plan: diff is a no-op, so sandcastle

Differential Revision: D32107847

fbshipit-source-id: b239b127b219985bf7df6a0eea2d879b8e9c79a4
2021-11-15 21:41:57 -08:00
7ee84ad321 Refactoring quantized op tests to combine test classes (#68282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68282

Combined 3 Dynamic quantized op test classes into 1

Test Plan:
python test/test_quantization.py TestDynamicQuantizedOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D32402163

fbshipit-source-id: 696b7ef5d823632941dc7afc95161501445d0e18
2021-11-15 20:47:02 -08:00
065018d812 [pytorch][xros] Ensure all pytorch mobile operators build ok in XROS mode (#68266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68266
* Use `if...endif` to adjust PyTorch internals towards XROS

Test Plan: CI

Reviewed By: kkosik20

Differential Revision: D32190771

fbshipit-source-id: cce073dea53c2b5681d913321101cd83c6472019
2021-11-15 19:52:45 -08:00
86c1368611 [fx][const fold] Add test/example for skipping quant/dequant pattern (#68378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68378

Add test/example for skipping quant/dequant pattern

Reviewed By: jfix71

Differential Revision: D32410544

fbshipit-source-id: e63419a01a097e4c570c3861d79d573cabc0b294
2021-11-15 18:49:23 -08:00
722af775c3 [ONNX] ConstantMap setters to update existing value instead of emplace (#67630) (#67812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67812

`UpdateShape` uses `.emplace(tensorName, shapeValue)`. This will not update `shapeValue` for `tensorName` if such a name already exists in the map. Hence our code cannot correct a shape inference error, even if we infer the shape correctly later.
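
A Python analogue of the pitfall; `dict.setdefault` plays the role of `.emplace` here:
```
shapes = {"t": (1, 3)}           # stale inferred shape already in the map
shapes.setdefault("t", (2, 3))   # emplace-like: existing key wins, no update
shapes["t"] = (2, 3)             # assignment-like update: the fix
print(shapes["t"])               # (2, 3)
```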

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181300

Pulled By: malfet

fbshipit-source-id: 05c58ad3fdac683ad957996acde8f0ed6341781d

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-11-15 17:20:07 -08:00
d32efe8bc2 [ONNX] Remove the argument use_external_data_format of export() method entirely. (#67080) (#67811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67811

* remove the argument use_external_data_format of export() method entirely

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181302

Pulled By: malfet

fbshipit-source-id: 4bc1448b7487bb9dfdad4e36008ff5b227fd64a3

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-11-15 17:20:04 -08:00
9d25554d45 [ONNX] Allow registration of custom symbolics for aten namespace (#66481) (#67810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67810

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181303

Pulled By: malfet

fbshipit-source-id: af2a715dc554b958fa3f5a7a8ae96cb3f7d112bb
2021-11-15 17:18:39 -08:00
09615cd0b0 Adding Dynamic Conv and ConvT ops/modules (#68176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68176

It should be noted that for the modules, reduce_range is set to
true by default, in a similar fashion to linear_dynamic.

Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule
python test/test_quantization.py TestDynamicQuantizedConv
python test/test_quantization.py TestQuantizedConv

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D32374003

fbshipit-source-id: 011562bd0f4d817387d53bb113df2600aa60a7a3
2021-11-15 16:42:25 -08:00
529ebae0ac Bugfix for TorchScript RNN RELU and TANH (#61274)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28418
Related https://github.com/pytorch/pytorch/issues/32976 but has already been fixed before.

TorchScript handling of GRU and LSTM has been working, but not of RNN (Tanh and ReLU). The reason is that ```Union[Tensor, PackedSequence]``` is not supported by TorchScript. Using ```torch._jit_internal._overload_method``` in ```RNNBase::forward``` does not work, as TorchScript does not correctly apply the overloads if the method is inherited by ```RNN```. My solution is to move ```RNNBase::forward``` to ```RNN``` and annotate it using ```torch._jit_internal._overload_method```. LSTM and GRU use their own ```forward``` methods anyway, so there seems to be no problem related to this fix.
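
A minimal sketch of the fixed behavior (shapes are illustrative):
```
import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=5, nonlinearity="relu")
scripted = torch.jit.script(rnn)   # previously failed for RNN (tanh/relu)
x = torch.randn(2, 3, 4)           # (seq_len, batch, input_size)
out, h = scripted(x)
print(out.shape, h.shape)          # torch.Size([2, 3, 5]) torch.Size([1, 3, 5])
```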

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61274

Reviewed By: anjali411

Differential Revision: D32374452

Pulled By: malfet

fbshipit-source-id: 77bab2469c01c5dfa5eaab229429724a4172445d

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-11-15 16:20:58 -08:00
2fd468e5f8 [jit] Set the graph input types before interpreting the graph during tracing (#68242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68242

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D32382958

Pulled By: navahgar

fbshipit-source-id: 4e82a604a9ea2046af2755de23944147e618a65f
2021-11-15 15:44:32 -08:00
9ed49449b3 [SR] Add net level record functions (#68091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68091

Add record functions for recording perf stats on the entire network.

Note that this is backed by the same pre-sampling mechanism as the op record functions, so net-level stats get logged relatively infrequently. (If this is not acceptable, we can drop pre-sampling at the cost of a little perf: every inference would then require an RNG call.)

Reviewed By: hlu1

Differential Revision: D32296756

fbshipit-source-id: 09ff16c942f3bfc8f4435d6cca2be4a6b8dc6091
2021-11-15 15:39:08 -08:00
0823d18fcd make TSComputation ctor explicit (#68286)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68286

Test Plan: check it compiles

Reviewed By: alanwaketan

Differential Revision: D32402016

fbshipit-source-id: b623afa8831cd906336d7fcafbcbad32f79254b0
2021-11-15 14:58:33 -08:00
7b958fbec4 ci: Build periodic jobs with DEBUG=1 (#67192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67192

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD, janeyx99

Differential Revision: D31902447

Pulled By: seemethere

fbshipit-source-id: 1d1cca8b5ac84b1c23ab73e2d973bfb7bffa8982
2021-11-15 14:51:06 -08:00
ea0a558487 GHA CI: make the default config use only one GPU (#68382)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68382

Reviewed By: mrshenli

Differential Revision: D32441585

Pulled By: janeyx99

fbshipit-source-id: d92407c9bb9e4f740435840b4022e75749d7f0ba
2021-11-15 14:35:49 -08:00
6adbe044e3 Added nearest-exact interpolation mode (#64501)
Summary:
Added "nearest-exact" interpolation mode to fix the issues: https://github.com/pytorch/pytorch/issues/34808 and https://github.com/pytorch/pytorch/issues/62237.

Description:

As we cannot fix the "nearest" mode without a large impact on already-trained models, [it was suggested](https://github.com/pytorch/pytorch/pull/64501#pullrequestreview-749771815) to introduce a new mode instead of fixing the existing "nearest" mode.

- New mode "nearest-exact" performs index computation for nearest interpolation to match scikit-image, pillow, TF2 and while "nearest" mode still match opencv INTER_NEAREST, which appears to be buggy, see https://ppwwyyxx.com/blog/2021/Where-are-Pixels/#Libraries.

"nearest":
```
input_index_f32 = output_index * scale
input_index = floor(input_index_f32)
```

"nearest-exact"
```
input_index_f32 = (output_index + 0.5) * scale - 0.5
input_index = round(input_index_f32)
```
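
A minimal 1-D sketch of the difference, assuming a build that includes this PR; downsampling 4 -> 3 elements makes the two index formulas diverge:
```
import torch
import torch.nn.functional as F

x = torch.arange(4, dtype=torch.float32).reshape(1, 1, 4)
print(F.interpolate(x, size=3, mode="nearest"))        # picks indices 0, 1, 2
print(F.interpolate(x, size=3, mode="nearest-exact"))  # picks indices 0, 2, 3
```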

Comparisons with other libs: https://gist.github.com/vfdev-5/a5bd5b1477b1c82a87a0f9e25c727664

PyTorch version | 1.9.0 "nearest" | this PR "nearest" | this PR "nearest-exact"
---|---|---|---
Resize option: | | |
OpenCV INTER_NEAREST result mismatches | 0 | 0 | 10
OpenCV INTER_NEAREST_EXACT result mismatches | 9 | 9 | 9
Scikit-Image result mismatches | 10 | 10 | 0
Pillow result mismatches | 10 | 10 | 7
TensorFlow result mismatches | 10 | 10 | 0
Rescale option: | | |
size mismatches (https://github.com/pytorch/pytorch/issues/62396) | 10 | 10 | 10
OpenCV INTER_NEAREST result mismatches | 3 | 3| 5
OpenCV INTER_NEAREST_EXACT result mismatches | 3 | 3| 4
Scikit-Image result mismatches | 4 | 4 | 0
Scipy result mismatches | 4 | 4 | 0
TensorFlow: no such option | - | - | -

Versions:
```
skimage: 0.19.0.dev0
opencv: 4.5.4-dev
scipy: 1.7.2
Pillow: 8.4.0
TensorFlow: 2.7.0
```

Implementations in other libs:

- Pillow:
  - ee079ae67e/src/libImaging/Geometry.c (L889-L899)
  - ee079ae67e/src/libImaging/Geometry.c (L11)
  - `a[2] == 0`

- Scikit-Image :
  - dev v0.19.0 uses scipy ndi.zoom:
    - 38fae50c3f/skimage/transform/_warps.py (L180-L188)
    - 47bb6febaa/scipy/ndimage/src/ni_interpolation.c (L775-L779)
    - 47bb6febaa/scipy/ndimage/src/ni_interpolation.c (L479)

Additionally:
- Updated upsampling tests

cc ezyang gchanan albanD mruberry jbschlosser walterddr fmassa heitorschueroff ppwwyyxx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64501

Reviewed By: anjali411

Differential Revision: D32361901

Pulled By: jbschlosser

fbshipit-source-id: df906f4d25a2b2180e1942ffbab2cc14600aeed2
2021-11-15 14:28:19 -08:00
e3bcf64ff8 [qnnpack] Remove redundant fp16 dependency (#68011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68011

`qnnpack/operator.h` introduces a dependency on an external library fp16 via `qnnpack/requantization.h`.
Including `qnnpack/operator.h` in `pytorch_qnnpack.h` makes objects that really don't require fp16 depend on it indirectly, because they include `pytorch_qnnpack.h`.
This was causing some test and bench targets to fail to build for local and android/arm64 (the only two tried) using cmake.

This diff moves `qnnpack/operator.h` from `pytorch_qnnpack.h` to `qnnpack_func.h`, and explicitly add `qnnpack/operator.h` in `src/conv-prepack.cc`.

Test Plan: Ran all the tests for local on my devserver, and arm64 on Pixel3a.

Reviewed By: salilsdesai

Differential Revision: D32250984

fbshipit-source-id: 21468d8ef79c90e9876dc00da95383180a1031b5
2021-11-15 12:38:44 -08:00
0cf46fb0de [fx2trt] fix a bug in conversion from negative dim to positive dim (#68360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68360

Added a helper function to do this. We only use `mod` to convert a negative dim to a positive one, and do nothing when it is already positive.

Previously in `getitem`, if we were slicing to the very end, we would get the dimension wrong.
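
A minimal sketch of such a helper (the name and signature are assumptions, not necessarily the ones in this diff):
```
def get_positive_dim(dim, dim_size):
    # only apply mod to negative dims; leave positive dims untouched
    return dim % dim_size if dim < 0 else dim

assert get_positive_dim(-1, 4) == 3   # slicing "to the very end" now maps right
assert get_positive_dim(2, 4) == 2    # already-positive dims pass through
```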

Test Plan: Add a unit test

Reviewed By: yinghai, wushirong

Differential Revision: D32432893

fbshipit-source-id: 3c5d6a578d92a15207a5e52802750f9ea7f272a9
2021-11-15 12:30:50 -08:00
549e014963 [docs] fix torch.histc's min/max arg types (#64191)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31475. `torch.histc` accepts Scalar min/max. The docs erroneously specified their types as int.
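
For instance, a sketch of the now-correctly-documented behavior:
```
import torch

x = torch.tensor([0.2, 0.8, 1.5, 2.9])
# min/max accept Scalars, so float bounds are valid
print(torch.histc(x, bins=3, min=0.0, max=3.0))   # tensor([2., 1., 1.])
```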

cc brianjo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64191

Reviewed By: mrshenli

Differential Revision: D32437279

Pulled By: saketh-are

fbshipit-source-id: e6017e9236d815abd818dcd44e27819611666823
2021-11-15 12:29:25 -08:00
ccd9675569 [lint] Disable modernize-use-nodiscard (#68354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68354

Lint rule: https://clang.llvm.org/extra/clang-tidy/checks/modernize-use-nodiscard.html

This check adds a ton of noise to our diffs. `[[nodiscard]]` is typically only useful when ignoring the return value of a function is a critical error, e.g. for `operator new`.

Test Plan: Verified that the lint does not get triggered

Reviewed By: hlu1

Differential Revision: D32429731

fbshipit-source-id: ca3d90686ec8d419d3f96167140dc406df6f4a53
2021-11-15 12:11:08 -08:00
c697eeba72 [JIT] Combine concat nodes where possible (#67000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67000

See the [related issue](https://github.com/pytorch/pytorch/issues/66654) for context.

This new JIT optimization transforms patterns like this:
```
%inputs.1 : Tensor[] = prim::ListConstruct(%a, %b, %c)
%concat.1 : Tensor = aten::cat(%inputs, %dim)
%inputs.2 : Tensor[] = prim::ListConstruct(%x, %concat.1, %y)
%concat.2 : Tensor = aten::cat(%inputs.2, %dim)
```
into this:
```
%inputs.2 : Tensor[] = prim::ListConstruct(%x, %a, %b, %c, %y)
%concat.2 : Tensor = aten::cat(%inputs.2, %dim)
```
(it can do this for chains of `aten::cat` longer than 2 as well)

A few conditions have to hold:
1.  The `dim`s have to match.
2. `inputs.1` and `inputs.2` cannot be mutated
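
When those conditions hold, the rewrite is behavior-preserving; a minimal eager-mode sketch of the equivalence:
```
import torch

a, b, c, x, y = (torch.randn(2, 3) for _ in range(5))
nested = torch.cat([x, torch.cat([a, b, c], dim=0), y], dim=0)
flat = torch.cat([x, a, b, c, y], dim=0)   # what the pass rewrites to
assert torch.equal(nested, flat)           # same result, one cat fewer
```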

Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOpt`

Reviewed By: d1jang

Differential Revision: D31819491

fbshipit-source-id: 9f1a501d52099eb1a630b5dd906df4c38c3817ba
2021-11-15 12:02:45 -08:00
30cda0b28c [bugfix] functionalization pass for view ops without a 'self' first argument (#68339)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68339

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32429570

Pulled By: bdhirsh

fbshipit-source-id: e6df243c508c2ba2ca1df7a53fa68f32db454f32
2021-11-15 11:58:21 -08:00
5b05983497 [bugfix] fix two edge cases in functionalization (#68269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68269

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32396357

Pulled By: bdhirsh

fbshipit-source-id: 1d374b693f3f526d027104cbdc08b8bbe9d38307
2021-11-15 11:58:18 -08:00
12026124cc Avoid the view for mkldnn case in 1D convolution (#68166)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68166

Reviewed By: mrshenli

Differential Revision: D32432444

Pulled By: jbschlosser

fbshipit-source-id: fc4e626d497d9e4597628a18eb89b94518bb3b33
2021-11-15 11:56:45 -08:00
56024e91c9 GHA: Enable flaky test reporting by setting PYTORCH_RETRY_TEST_CASES=1 (#68300)
Summary:
Enables https://github.com/pytorch/pytorch/issues/68150 in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68300

Reviewed By: seemethere

Differential Revision: D32435332

Pulled By: janeyx99

fbshipit-source-id: 155018afaf73d5a24d13d358879361468ec7b18e
2021-11-15 11:23:55 -08:00
24b60b2cbf [lint] lintrunner fixes/improvements (#68292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68292

- noqa was typo-d to be the same as type: ignore
- generalize clang-tidy initialization and use it for clang_format as well
- Add a script that lets you update the binaries in s3 relatively easily

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32403934

Pulled By: suo

fbshipit-source-id: 4e21b22605216f013d87d636a205707ca8e0af36
2021-11-15 11:08:26 -08:00
43874d79e7 Fix failing test due to a bug in NumPy when using OpenBLAS implementations (#67679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67679

Fixes https://github.com/pytorch/pytorch/issues/67675

cc mruberry

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32368698

Pulled By: mruberry

fbshipit-source-id: 3ea6ebc43c061af2f376cdf5da06884859bbbf53
2021-11-15 08:25:12 -08:00
d1c529bd0b replace platform specific CI environment variables with generic ones (#68133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68133

Reviewed By: saketh-are

Differential Revision: D32401080

Pulled By: atalman

fbshipit-source-id: 057a34a56f8a2d324f4d1ea07da3a09772177897
2021-11-15 07:02:44 -08:00
1c0d6ff835 [fx][const fold] Allow to set up a function to modify const_nodes for split_const_subgraphs (#67784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67784

FX models generate quant/dequant layers for INT8 explicit-mode support. However, if the inputs of a quant/dequant layer are constant, the layer will be put into the constant subgraph and optimized out, and TensorRT then fails to parse the leftover graph. It is better to set up an optional function (skip_folding_node_fn) to skip folding such nodes in split_const_subgraphs.

Reviewed By: jfix71

Differential Revision: D32076970

fbshipit-source-id: 7dcbb4f02386f8c831d09a2f0e40bcdba904471c
2021-11-15 06:51:19 -08:00
4c87aa77d1 [DataPipe] Traverse DataPipe graph excluding primitive and callable (#67783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67783

Add `getstate_hook` to exclude primitive objects and callables from serialization when `exclude_primitive` is enabled for `traverse`.
For graph traversal, we don't have to handle lambdas and similar objects.
This is used by `OnDiskCacheHolder` to trace the DataPipe graph.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32146697

Pulled By: ejguan

fbshipit-source-id: 03b2ce981bb21066e807f57c167b77b2d0e0ce61
2021-11-15 06:46:31 -08:00
1adeeabdc0 Fix trt tuple(Dims) throwing issue (#68318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68318

Adding an `__iter__` binding so that `tuple(Dims)` can construct the right iterator and know where to stop, instead of relying on trial and error with exception catching. We should upstream this to https://github.com/NVIDIA/TensorRT. cc: wushirong

I did try a very similar `__iter__` fix previously, but I'm not sure why it wasn't effective...
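
A minimal sketch of the fixed behavior, assuming a TensorRT Python build that includes this patch:
```
import tensorrt as trt

d = trt.Dims([1, 2, 3])
print(tuple(d))   # (1, 2, 3) -- __iter__ now knows where to stop
```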

Reviewed By: kflu, wushirong

Differential Revision: D32412430

fbshipit-source-id: 6390a1275dc34ef498acf933bb96f636c15baf41
2021-11-13 19:48:46 -08:00
be281fc597 Check for None in torch.jit.Graph.create (#68253)
Summary:
...because we don't like segfaults from Python (see test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68253

Reviewed By: suo

Differential Revision: D32396747

Pulled By: gmagogsfm

fbshipit-source-id: a0925e8479702766e88176280985a63bc79e4f6a
2021-11-13 11:30:33 -08:00
6fb8ebcd92 [tensorexp] Add strides to Buf (#68018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68018

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32262381

Pulled By: IvanKobzarev

fbshipit-source-id: dba79add0bf703bc2378d64e726d4c47ec30e3be
2021-11-13 08:33:01 -08:00
f7366ca51b implemented quantize_per_tensor_dynamic and added a corresponding test script (#68004)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68004

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D32301792

Pulled By: dzdang

fbshipit-source-id: f680557ba4736d095efc33e8c92111265f25aee0
2021-11-13 06:34:36 -08:00
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for NCCL backend, because we'd only check if backend is NCCL and then move tensors to CUDA.

Instead, check if it is a wrapped PG, and then check the wrapped pg to see if it's NCCL.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
ec94bb787a [TensorExpr] Add a way to define target triple/cpu/attrs for llvm codegen and turn on the AOT workflow. (#66527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66527

Differential Revision: D31593869

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: e7534c11fbcf0dab5f49d01d6053caf77b833ef0
2021-11-13 00:52:20 -08:00
52e93fca2c [TensorExpr] Fix some TE python bindings. (#68232)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68232

Differential Revision: D32380676

Test Plan: Imported from OSS

Reviewed By: saketh-are

Pulled By: ZolotukhinM

fbshipit-source-id: 9287a2c765a53b45ac04d625cc010f5384a8bddf
2021-11-13 00:52:18 -08:00
e511a7a5b4 [TensorExpr] Remove non-determinism in iterating over unordered_set of intermediate buffers. (#68277)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68277

Differential Revision: D32400553

Test Plan: Imported from OSS

Reviewed By: saketh-are, priyaramani

Pulled By: ZolotukhinM

fbshipit-source-id: a8fe820bbddaa19f95db432efaa6d3e36095a05e
2021-11-13 00:50:57 -08:00
80339e85c5 Fix disabling bot with subprocessing (#68290)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68270

Tested locally + tests get disabled properly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68290

Reviewed By: mrshenli

Differential Revision: D32403956

Pulled By: janeyx99

fbshipit-source-id: 86629daa86f83f6777f2279524ef973af51046b9
2021-11-12 19:56:17 -08:00
282221c5d6 Fuse unsqueeze, cat, sum for inline_cvr (#68289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68289

Fuse the unsqueeze+cat+sum op pattern into an add op

Reviewed By: jfix71

Differential Revision: D31769197

fbshipit-source-id: 184b3c8217f2ad9fab9ac8d3c91cd33cf7e7de30
2021-11-12 18:20:11 -08:00
48c8de45b0 [ONNX] Remove the argument example_outputs of export() method entirely. (#67082) (#67809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67809

* remove the argument example_outputs of export() method entirely

[ONNX] Follow-up: Remove the argument example_outputs of export() method entirely. (#67629)

* Resolve CI failure

* remove test after removing example_outputs

[ONNX] Follow-up: Follow-up: Remove the argument example_outputs of export() method entirely (#67719)

Removing unused import, resolving flake error.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181305

Pulled By: malfet

fbshipit-source-id: ba00547b7cb455ace86606b1bda643c02bdcfa1b

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-11-12 17:06:26 -08:00
a8b93cb3ec More aggressively market functorch.vmap when torch.vmap gets called (#67347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67347

This PR:
- changes the warning when torch.vmap gets called to suggest using
functorch.vmap
- changes the warning when a batching rule isn't implemented to suggest
using functorch.vmap
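
A minimal sketch of the suggested replacement (assumes the separate `functorch` package is installed):
```
import torch
from functorch import vmap

x = torch.randn(8, 3)
out = vmap(torch.dot)(x, x)   # per-row dot product, vectorized over dim 0
print(out.shape)              # torch.Size([8])
```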

Test Plan: - test/test_vmap.py

Reviewed By: H-Huang

Differential Revision: D31966603

Pulled By: zou3519

fbshipit-source-id: b01dc1c2e298ce899b4a3a5fb333222a8d5bfb56
2021-11-12 16:10:16 -08:00
da5ffe752a Add reporting for flaky tests in CI (#68150)
Summary:
This PR does NOT change how signal is displayed in CI but rather just reports stats of flaky tests to RDS. **None of the below will be enabled after landing this PR--it will be done in a separate PR with environment variables.**

We report flaky test stats when a test fails on the first attempt and then, after rerunning it up to MAX_NUM_RETRIES times, we get at least one success.
Tests that fail all of the reruns are assumed to be real test failures.
Tests that succeed the first time are not rerun, even if they were previously flagged as flaky.
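
A toy sketch of this classification rule (the names and retry count are assumptions, not the real tooling):
```
from typing import List

MAX_NUM_RETRIES = 3   # illustrative value

def classify(failed_first: bool, rerun_passed: List[bool]) -> str:
    if not failed_first:
        return "green"    # passed on the first try: never rerun
    if any(rerun_passed):
        return "flaky"    # failed, then at least one rerun succeeded
    return "red"          # failed every rerun: treated as a real failure
```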

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68150

Test Plan:
First, I modified:
test_async_python to always fail (will be our "failing test")
test_async_future_type_python to fail 40% of the time
test_async_script_capture to fail 60% of the time

Then, running `python test/test_jit.py -v -k test_async` while setting IN_CI to 1:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/test_jit.py -v -k test_async
...

Running tests...
----------------------------------------------------------------------
  test_async_future_type_python (jit.test_async.TestAsync) ... ok (0.004s)
  test_async_grad_guard_no_grad (jit.test_async.TestAsync) ... ok (0.020s)
  test_async_grad_guard_with_grad (jit.test_async.TestAsync) ... ok (0.008s)
  test_async_kwargs (jit.test_async.TestAsync) ... ok (0.045s)
  test_async_parsing (jit.test_async.TestAsync) ... ok (0.010s)
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 3
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 2
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 1
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 0
  test_async_script (jit.test_async.TestAsync) ... ok (0.008s)
  test_async_script_capture (jit.test_async.TestAsync) ... FAIL (0.010s)
    test_async_script_capture failed - num_retries_left: 3
  test_async_script_capture (jit.test_async.TestAsync) ... FAIL (0.010s)
    test_async_script_capture failed - num_retries_left: 2
  test_async_script_capture (jit.test_async.TestAsync) ... ok (0.011s)
    test_async_script_capture succeeded - num_retries_left: 1
  test_async_script_capture (jit.test_async.TestAsync) ... FAIL (0.010s)
    test_async_script_capture failed - num_retries_left: 0
  test_async_script_error (jit.test_async.TestAsync) ... ok (0.040s)
  test_async_script_multi_forks (jit.test_async.TestAsync) ... ok (0.025s)
  test_async_script_multi_waits (jit.test_async.TestAsync) ... ok (0.009s)
...

======================================================================
FAIL [0.003s]: test_async_python (jit.test_async.TestAsync)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/janeyx/pytorch/test/jit/test_async.py", line 30, in test_async_python
    self.assertTrue(False)
AssertionError: False is not true

======================================================================
FAIL [0.010s]: test_async_script_capture (jit.test_async.TestAsync)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/janeyx/pytorch/test/jit/test_async.py", line 123, in test_async_script_capture
    self.assertTrue(False)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 28 tests in 0.399s

FAILED (failures=2, expected failures=5, unexpected successes=1)
```
Yielding this as the test report (I changed the extension from xml to txt so it uploads here):
[TEST-jit.test_async.TestAsync-20211110222055.txt](https://github.com/pytorch/pytorch/files/7517532/TEST-jit.test_async.TestAsync-20211110222055.txt)

And then running print_test_stats correctly excludes the always-failing test `test_async_python` and calculates red and green appropriately:
```
(pytorch) janeyx@janeyx-mbp pytorch % python tools/stats/print_test_stats.py test-reports/python-unittest/test.test_jit
[scribe] Not invoking RDS lambda outside GitHub Actions:
[{'create_table': {'table_name': 'flaky_tests', 'fields': {'name': 'string', 'suite': 'string', 'file': 'string', 'num_green': 'int', 'num_red': 'int', 'pr': 'string', 'ref': 'string', 'branch': 'string', 'workflow_id': 'string', 'build_environment': 'string'}}}]
[scribe] Writing for None
[scribe] Wrote stats for flaky_tests
[scribe] Not invoking RDS lambda outside GitHub Actions:
[{'write': {'table_name': 'flaky_tests', 'values': {'name': 'test_async_script_capture', 'suite': 'jit.test_async.TestAsync', 'file': 'test/test_jit', 'num_green': 1, 'num_red': 3, 'pr': None, 'ref': None, 'branch': None, 'workflow_id': None, 'build_environment': 'linux-xenial-gcc5.4-py3'}}}]
(pytorch) janeyx@janeyx-mbp pytorch %
```

-------------------
If you're curious, I also included the code for when we would like to override the report_only feature and also hide flaky signal in CI. The results for the same test command correctly still fail the test suite, but mark the flaky test_async_future_type_python as passed:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/test_jit.py -v -k test_async
...

Running tests...
----------------------------------------------------------------------
  test_async_future_type_python (jit.test_async.TestAsync) ... FAIL (0.004s)
    test_async_future_type_python failed - num_retries_left: 3
  test_async_future_type_python (jit.test_async.TestAsync) ... ok (0.001s)
  test_async_grad_guard_no_grad (jit.test_async.TestAsync) ... ok (0.017s)
  test_async_grad_guard_with_grad (jit.test_async.TestAsync) ... ok (0.008s)
  test_async_kwargs (jit.test_async.TestAsync) ... ok (0.091s)
  test_async_parsing (jit.test_async.TestAsync) ... ok (0.010s)
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 3
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 2
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.004s)
    test_async_python failed - num_retries_left: 1
  test_async_python (jit.test_async.TestAsync) ... FAIL (0.003s)
    test_async_python failed - num_retries_left: 0
  test_async_script (jit.test_async.TestAsync) ... ok (0.008s)
  test_async_script_capture (jit.test_async.TestAsync) ... ok (0.011s)
  test_async_script_error (jit.test_async.TestAsync) ... ok (0.039s)
...

======================================================================
FAIL [0.003s]: test_async_python (jit.test_async.TestAsync)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/janeyx/pytorch/test/jit/test_async.py", line 30, in test_async_python
    self.assertTrue(False)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 26 tests in 0.390s

FAILED (failures=1, expected failures=4)
```
With test reports:
[TEST-jit.test_async.TestAsync-20211110224810.txt](https://github.com/pytorch/pytorch/files/7517663/TEST-jit.test_async.TestAsync-20211110224810.txt)
And running print_test_stats:
```
(pytorch) janeyx@janeyx-mbp pytorch % python tools/stats/print_test_stats.py test-reports/python-unittest/test.test_jit
[scribe] Not invoking RDS lambda outside GitHub Actions:
[{'create_table': {'table_name': 'flaky_tests', 'fields': {'name': 'string', 'suite': 'string', 'file': 'string', 'num_green': 'int', 'num_red': 'int', 'pr': 'string', 'ref': 'string', 'branch': 'string', 'workflow_id': 'string', 'build_environment': 'string'}}}]
[scribe] Writing for None
[scribe] Wrote stats for flaky_tests
[scribe] Not invoking RDS lambda outside GitHub Actions:
[{'write': {'table_name': 'flaky_tests', 'values': {'name': 'test_async_future_type_python', 'suite': 'jit.test_async.TestAsync', 'file': 'test/test_jit', 'num_green': 1, 'num_red': 1, 'pr': None, 'ref': None, 'branch': None, 'workflow_id': None, 'build_environment': 'linux-xenial-gcc5.4-py3'}}}]
```

Reviewed By: saketh-are

Differential Revision: D32393907

Pulled By: janeyx99

fbshipit-source-id: 37df890481ab84c62809c022dc6338b50972899c
2021-11-12 15:03:14 -08:00
8bf150f21b Revert D32178667: [pytorch][PR] Python tracer for profiler
Test Plan: revert-hammer

Differential Revision:
D32178667 (33353fb828)

Original commit changeset: 118547104a7d

fbshipit-source-id: 47510607589fc39c730ba913f47c01a7d107b7b0
2021-11-12 14:53:52 -08:00
a82e51a7ae Move some cub templates out of the header file (#67650)
Summary:
Cub routines are both expensive to compile and used in multiple
different operators throughout the cuda folder. So, it makes sense to
compile them in one centralized place where possible (i.e. when
custom operators aren't involved).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67650

Reviewed By: mruberry

Differential Revision: D32259660

Pulled By: ngimel

fbshipit-source-id: 5f7dbdb134297e1ffdc1c7fc5aefee70a2fa5422
2021-11-12 13:51:11 -08:00
6ddaf3bd37 [LT] Upstream TsNode, TsNodeLowering, TsLoweringContext (#68154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68154

Test Plan: added a basic test; cover more by using lazy_tensor_staging tests

Reviewed By: Krovatkin, alanwaketan

Differential Revision: D32224303

fbshipit-source-id: ac3e1161229b8ae60fdb15ffa72e17072b595914
2021-11-12 12:57:20 -08:00
f6e45102d2 [quant][embedding qat] Support non-partial functions in qconfig comparison (#68067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68067

Embedding QAT uses a NoopObserver class for activation,
and a FakeQuant for weight, make sure that qconfig comparison
functions properly for a mix of partial function and class in
qconfig.

Test Plan:
`pytest test/quantization/eager/test_quantize_eager_qat.py  -v -k "test_embedding_qat_qconfig_equal"`

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D32318434

fbshipit-source-id: c036eef9cbabe7c247745930501328e9c75a8cb0
2021-11-12 12:48:00 -08:00
66b52d5b49 [TensorExpr] Convert linear_clamp_run to using schema in NNC lowerings. (#66523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66523

Differential Revision: D31590857

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Pulled By: ZolotukhinM

fbshipit-source-id: da8a7d68c8a4cf74c3f622b8a3af54d00ffb14a6
2021-11-12 12:26:06 -08:00
06e8cb9e04 Manually Disabling two TestDistBackendWithSpawn tests on ROCm, test_ddp_profiling_torch_profiler and test_ddp_sync_bn_training_vs_eval (#68255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68255

Manually disabling these two tests because they can't be disabled via Probot.

See the issues #68222 and #68173 for details.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Test Plan: Imported from OSS

Reviewed By: malfet, saketh-are

Differential Revision: D32390899

Pulled By: NivekT

fbshipit-source-id: bd4996d73014337a9175b20ae67a3880ee994699
2021-11-12 12:04:21 -08:00
33353fb828 Python tracer for profiler (#67407)
Summary:
This PR instruments the CPython interpreter and integrates the resulting trace into the PyTorch profiler.

The python tracing logic works by enabling `PyEval_SetProfile`, and then logging the minimal information to track every time python calls or returns from a function. A great deal of care has gone into keeping this process very lightweight; the `RawEvent` struct is only two words and doesn't do anything fancy. When a python function is called, we have to do extra work. If the call is to `nn.Module.__call__`, we simply incref to extend the life of the module. Otherwise we check if we have seen the function before, and if not go through the (somewhat expensive) task of saving the strings which we then cache.

To actually get a useful timeline, we have to replay the events to determine the state of the python stack at any given point. A second round of stack replay is needed to figure out what the last python function was for each torch op so we can reconstruct the correct python stack. All of this is done during post processing, so while we want to be reasonably performant it is no longer imperative to shave every last bit.

I still need to do a bit of refinement (particularly where the tracer interfaces with the profiler), but this should give a good sense of the general structure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67407

Test Plan:
```
import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        return self.relu(x)

def call_module():
    m = MyModule()
    for _ in range(4):
        m(torch.ones((2, 2)))

def top_level_fn():
    with torch.profiler.profile(with_stack=True) as p:
        call_module()

    p.export_chrome_trace("test_trace.json")

top_level_fn()
```
<img width="1043" alt="Screen Shot 2021-10-27 at 6 43 18 PM" src="https://user-images.githubusercontent.com/13089297/139171803-f95e70f3-24aa-45e6-9d4b-6d437a3f108d.png">

PS: I've tried to comment liberally, particularly around some of the more magical parts. However I do plan on doing another linting and commenting pass. Hopefully it's not too bad right now.

Reviewed By: gdankel, chaekit

Differential Revision: D32178667

Pulled By: robieta

fbshipit-source-id: 118547104a7d887e830f17b94d3a29ee4f8c482f
2021-11-12 11:58:12 -08:00
96d116fec2 [JIT] Add additional debug output when op cannot be found in AliasDb (#68099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68099

When an op in the graph cannot be matched to any known ops, alias_analysis.cpp throws an error.

Before:
```
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":612, please report a bug to PyTorch. We don't have an op for aten::add but it isn't a special case. Argument types: Tensor, float, Tensor,
```

After:
```
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":612, please report a bug to PyTorch. We don't have an op for aten::add but it isn't a special case.  Argument types: Tensor, float, Tensor,

Candidates:
        aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> (Tensor)
        aten::add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> (Tensor)
        aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> (Tensor(a!))
        aten::add.t(t[] a, t[] b) -> (t[])
        aten::add.str(str a, str b) -> (str)
        aten::add.int(int a, int b) -> (int)
        aten::add.complex(complex a, complex b) -> (complex)
        aten::add.float(float a, float b) -> (float)
        aten::add.int_complex(int a, complex b) -> (complex)
        aten::add.complex_int(complex a, int b) -> (complex)
        aten::add.float_complex(float a, complex b) -> (complex)
        aten::add.complex_float(complex a, float b) -> (complex)
        aten::add.int_float(int a, float b) -> (float)
        aten::add.float_int(float a, int b) -> (float)
        aten::add(Scalar a, Scalar b) -> (Scalar)
```

Test Plan:
Run
```
import torch

if __name__ == '__main__':
    ir = """
graph(%x : Tensor,
      %y : Tensor):
  %2 : float = prim::Constant[value=1.2]()
  %result : Tensor= aten::add(%x, %2, %y)
  return (%result)
"""
    x = torch.tensor([[1., 2.], [3., 4.]])
    y = torch.tensor([[2., 1.], [2., 1.]])
    graph = torch._C.parse_ir(ir)
    print(graph)
    graph.alias_db().analyze()
    # print(script(x, y))
```

to get the results above

Imported from OSS

Reviewed By: anjali411

Differential Revision: D32339639

fbshipit-source-id: a79a3c2f157154b5fb1e3f33a23e43b7884e8e38
2021-11-12 08:39:41 -08:00
98bab78e11 Revert D32039318: [pytorch][PR] Bump dlpack.h to latest version
Test Plan: revert-hammer

Differential Revision:
D32039318 (d049772538)

Original commit changeset: 7dfc653e1e77

fbshipit-source-id: 0d4b1af7381a2638ca9f3c3af26c2ff0b7bd7469
2021-11-12 08:20:21 -08:00
5c3a9f3fdc adding opinfo for torch.nn.bilinear and torch.nn.glu (#67478)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67478

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32027807

Pulled By: mikaylagawarecki

fbshipit-source-id: 501057cc9aced19fca26c4294fe81dcbb4b83a26
2021-11-12 08:13:15 -08:00
dc24503a89 Fix Hash(c10::Scalar), account for garbage data in union (#68201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68201

Hash(c10::Scalar) made a bad assumption that it was valid to just hash over all the bytes of data of the c10::Scalar struct.

Because c10::Scalar stores a union of different (float/int/complex) types with different sizes, not all bytes are valid in all cases. Hash() should only read the bytes corresponding to the currently active type.
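A minimal Python sketch of the bug class (illustrative only; the real fix is in the C++ Hash() over c10::Scalar): a tagged 8-byte slot stands in for the union, and a bool stored there leaves 7 stale bytes that a whole-slot hash would wrongly read.

```python
import struct

def hash_over_all_bytes(payload8: bytes) -> int:
    return hash(payload8)  # wrong: mixes in the uninitialized bytes

def hash_active_value(tag: str, payload8: bytes) -> int:
    if tag == "bool":
        return hash(payload8[0] != 0)                  # only 1 byte is valid
    if tag == "int":
        return hash(struct.unpack("<q", payload8)[0])  # int64: all 8 bytes
    if tag == "double":
        return hash(struct.unpack("<d", payload8)[0])  # double: all 8 bytes
    raise ValueError(tag)

slot = bytes([1]) + b"\x99" * 7  # bool True plus stale garbage bytes
assert hash_active_value("bool", slot) == hash(True)
```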

Test Plan: Added new unit tests.  Verified HashTest.Scalar failed with the original Hash() impl and then fixed.

Reviewed By: alanwaketan

Differential Revision: D32367564

fbshipit-source-id: ac30dd4f6dd0513954986d3d23c0c11ba802c37b
2021-11-12 07:20:08 -08:00
0bd0a67c4f [lint][fbcode/caffe2] CLANGFORMAT
Test Plan:
Proof of coverage:

```
$ hg files fbcode/caffe2 |
  arc linttool debugfilterpaths --take CLANGFORMAT --path-match-only > ~/before.txt

$ hg up this_diff

$ hg files fbcode/caffe2 |
  arc linttool debugfilterpaths --take CLANGFORMAT --path-match-only > ~/after.txt

$ comm -3 ~/before.txt ~/after.txt | pastry
P467377980: https://www.internalfb.com/intern/paste/P467377980/
```

These files lost coverage:

- `fbcode/caffe2/torch/abi-check.cpp`
- `fbcode/caffe2/torch/custom_class.h`
- `fbcode/caffe2/torch/custom_class_detail.h`
- `fbcode/caffe2/torch/deploy.h`
- `fbcode/caffe2/torch/extension.h`
- `fbcode/caffe2/torch/library.h`
- `fbcode/caffe2/torch/script.h`

Everything else in P467377980 gained coverage.

Reviewed By: suo

Differential Revision: D32364856

fbshipit-source-id: 9b3ba3350ecdb50038412a24af5e0da0fe4d69b8
2021-11-12 05:12:39 -08:00
e795315c63 Changes and fixes to prepare for dynamic conv (#68175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68175

This slightly alters the way from_float works so that it will work
with placeholder observers. It also fixes a bug with ConvTranspose3d and
ConvTranspose1d where parameters like kernel_size, stride, etc.
weren't set properly. New tests were added to check for this type of
issue as well.

Test Plan:
python test/test_quantization.py TestQuantizedOps
python test/test_quantization.py TestStaticQuantizedModule

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D32374004

fbshipit-source-id: caaa548d12d433d9c1fa0abc8597a7d31bb4e8af
2021-11-11 23:55:04 -08:00
1181628d85 BE: Use TORCH_CHECK instead of explicit c10::Error (#68187)
Summary:
`if (cond) { throw c10::Error("", msg); }` is identical to `TORCH_CHECK(!cond, msg);`, but the latter provides better attribution

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68187

Reviewed By: xuzhao9

Differential Revision: D32360956

Pulled By: malfet

fbshipit-source-id: e554b99926d7ad0c79a1cd54d35f47339fa2429d
2021-11-11 22:01:41 -08:00
799ebce3aa Add algo recorder/replayer to lower.py (#68194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68194

Add algorithm recorder/replayer to lower.py

Reviewed By: yinghai

Differential Revision: D31909575

fbshipit-source-id: 552f2ba4fbd6ea646316f6412d55416a76e1f69a
2021-11-11 21:22:22 -08:00
613c1aca6d Adds support for automated error and warning testing (#67354)
Summary:
Adds a new class `ErrorOrWarningInput` that is a `SampleInput` with some additional metadata for validating that `SampleInput` throws the desired warning or error. The architecture to support these new tests is modeled after the existing reference tests and sample input functions.

Existing invalid input tests for neg and kthvalue are ported to the new scheme to validate it.

There may be a simpler/clearer naming scheme we can use here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67354

Reviewed By: jbschlosser

Differential Revision: D31989888

Pulled By: mruberry

fbshipit-source-id: 4fa816e1e8d0eef21b81c2f80813d42b2c26714e
2021-11-11 19:28:47 -08:00
89d556f648 add VS extension in doc (#63944)
Summary:
add the VS extension in https://pytorch.org/cppdocs/installing.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63944

Reviewed By: malfet

Differential Revision: D30546156

Pulled By: seemethere

fbshipit-source-id: a65448d8702f9fd400c9dd2ef2d9f961f30c4983
2021-11-11 18:02:08 -08:00
9cb65df79f [Static Runtime] Fallback to disabling manage_output_tensors instead of crashing when wrong API is used (#67939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939

With `manage_output_tensor` enabled, a client of `StaticRuntime` must call it via `PyTorchPredictor::predict_managed_result`. If the client instead uses `PyTorchPredictor::operator()`, it will crash (intended behavior, so as not to leak the memory of managed output tensors). Such a mistake could cause a catastrophic failure in production (e.g., via gatekeeper or config changes).

Considering the complexity of how `PyTorchPredictor` is used in different settings, the chance that this bug hits production is non-zero.

This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature when a client mistakenly uses `PyTorchPredictor::operator()`, instead of crashing. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can safely return non-managed output tensors to the client.

A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness gain outweighs the risk of a catastrophic failure from crashing at a high rate.

Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.

Reviewed By: swolchok

Differential Revision: D32219731

fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
2021-11-11 17:31:07 -08:00
3dc0754c53 [pytorch][mobile] deprecate the LLVM-based static analyzer (#68180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180

Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377

Test Plan: CIs

Reviewed By: seemethere

Differential Revision: D32358467

fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
2021-11-11 16:37:08 -08:00
301369a774 [PyTorch][Fix] Pass the arguments of embedding as named arguments (#67574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67574

While adding the optional params for the sharded embedding op, we found that we cannot see these params from a `__torch_function__` override: they are not passed as keyword arguments, so they never reach the override's kwargs. This change passes them as named arguments (see the sketch below).
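A minimal sketch of the behavior (hypothetical `Watcher` subclass, not the sharded embedding code): an argument is visible in the override's `kwargs` only if the caller passed it by keyword.

```python
import torch

class Watcher(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Only arguments passed by keyword show up in `kwargs` here.
        print(getattr(func, "__name__", func), "kwargs:", kwargs)
        return super().__torch_function__(func, types, args, kwargs or {})

w = torch.ones(2, 3).as_subclass(Watcher)
torch.sum(w, 0)      # dim travels positionally: kwargs is empty
torch.sum(w, dim=0)  # dim is visible to the override via kwargs
```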
ghstack-source-id: 143029375

Test Plan: CI

Reviewed By: albanD

Differential Revision: D32039152

fbshipit-source-id: c7e598e49eddbabff6e11e3f8cb0818f57c839f6
2021-11-11 15:22:10 -08:00
9571eb599c [lint] fix up clangtidy lintrunner integration (#68192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68192

- Run on exactly the same stuff as the existing linter checks.
- Exclude deploy interpreter headers from being reported.

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D32364023

Pulled By: suo

fbshipit-source-id: c27eca4a802534875d609d004fa9f6fca59ae6a5
2021-11-11 14:53:28 -08:00
6afb414c21 Nan in linalg eig (#67544)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61251. As per the comment here (https://github.com/pytorch/pytorch/issues/61251#issuecomment-954676082), a consensus has been reached to raise an error if there is a NaN value in the input when calling `eig()`. This PR implements that feature.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67544

Reviewed By: malfet

Differential Revision: D32310919

Pulled By: mruberry

fbshipit-source-id: fc74a1ae2d929157c2d4c9051e3e9a4bf03dd5be
2021-11-11 14:33:49 -08:00
d049772538 Bump dlpack.h to latest version (#65047)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65047

Reviewed By: ngimel

Differential Revision: D32039318

Pulled By: mruberry

fbshipit-source-id: 7dfc653e1e77799d1f26a95fa9bbae3c7ffc887c
2021-11-11 14:02:16 -08:00
0420545639 Enable all dtype combinations in torch.Tensor.view(dtype) (#66493)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29013

Note: This PR does not enable autograd. This can be done in a future PR.
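A brief usage sketch of the expanded behavior, assuming a build that includes this change:

```python
import torch

x = torch.tensor([1.0, 2.0], dtype=torch.float32)  # 8 bytes of storage
same_size = x.view(torch.int32)  # reinterpret bits; shape stays [2]
new_size = x.view(torch.uint8)   # differently sized dtype now works too
print(same_size.shape, new_size.shape)  # torch.Size([2]) torch.Size([8])
```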

cc mruberry rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66493

Reviewed By: gchanan

Differential Revision: D32314680

Pulled By: mruberry

fbshipit-source-id: 69d325573b2331f32b83c05c91ffbe80571e7ae2
2021-11-11 13:55:21 -08:00
f9ea41f257 Fixes spelling error writeable to writable, improves warning, and documentation (#67664)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46741
pytorchbot

contributors: nickleus27, yanivsagy, and khanhthien123

SmrutiSikha this is mostly your work.  We just did very minor clean up.

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67664

Reviewed By: gchanan

Differential Revision: D32311838

Pulled By: mruberry

fbshipit-source-id: 0e5d4d888caeccb0fd7c80e6ff11b1b1fa8e00d6
2021-11-11 13:05:00 -08:00
1e8f836c44 Remove OpInfo non-contig inputs (#67677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67677

This follows
https://github.com/pytorch/pytorch/issues/63341#issuecomment-899690614

Fixes https://github.com/pytorch/pytorch/issues/67012

Note. I wrote the OpInfo for `index_fill`, so removing those inputs in
there is right. kshitij12345 mentioned that the same thing is true for
the inputs for tile / repeat.
https://github.com/pytorch/pytorch/issues/67012#issuecomment-948537446

There are more uses of `transpose` within the OpInfos, but most of them
are for testing `mm` and `baddmm`. I did not touch those, as those
operations are so important that it won't hurt to test those more
thoroughly.

cc mruberry

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32311729

Pulled By: mruberry

fbshipit-source-id: ac0804ca6f893118046b3e1bd97b5a2e6b900b59
2021-11-11 13:03:16 -08:00
4fe3965b3a Fix dtype arg typing for Tensor.type doc string (#67019)
Summary:
Fix typing error in PyCharm when using torch.Tensor.type(dtype=torch.int64)

<img width="386" alt="Screenshot 2021-10-21 at 15 30 50" src="https://user-images.githubusercontent.com/59562934/138288062-cc2ba45e-ece0-4fca-9369-55d020404c28.png">

Thanks for your great work! :)

cc brianjo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67019

Reviewed By: malfet

Differential Revision: D32311313

Pulled By: mruberry

fbshipit-source-id: 90fc453bc4129a301d567d4b39137b93c5dac01e
2021-11-11 12:58:46 -08:00
b07a11929d Array API: Add torch.linalg.cross (#63285)
Summary:
### Create `linalg.cross`

Fixes https://github.com/pytorch/pytorch/issues/62810

As discussed in the corresponding issue, this PR adds `cross` to the `linalg` namespace (**Note**: There is no method variant) which is slightly different in behaviour compared to `torch.cross`.

**Note**: this is NOT an alias as suggested in mruberry's [https://github.com/pytorch/pytorch/issues/62810 comment](https://github.com/pytorch/pytorch/issues/62810#issuecomment-897504372) below
> linalg.cross being consistent with the Python Array API (over NumPy) makes sense because NumPy has no linalg.cross. I also think we can implement linalg.cross without immediately deprecating torch.cross, although we should definitely refer users to linalg.cross. Deprecating torch.cross will require additional review. While it's not used often it is used, and it's unclear if users are relying on its unique behavior or not.

The current default implementation of `torch.cross` is extremely weird and confusing. This has also been reported multiple times previously. (See https://github.com/pytorch/pytorch/issues/17229, https://github.com/pytorch/pytorch/issues/39310, https://github.com/pytorch/pytorch/issues/41850, https://github.com/pytorch/pytorch/issues/50273)

- [x] Add `torch.linalg.cross` with default `dim=-1`
- [x] Add OpInfo and other tests for `torch.linalg.cross`
- [x] Add broadcasting support to `torch.cross` and `torch.linalg.cross`
- [x] Remove out skip from `torch.cross` OpInfo
- [x] Add docs for `torch.linalg.cross`. Improve docs for `torch.cross`, mentioning `linalg.cross` and the difference between the two. Also adds a warning to `torch.cross` that it may change in the future (we might want to deprecate it later). A short example of the default-`dim` difference follows this list.
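A small hedged example of the default-`dim` difference (exact numerics depend on the random inputs):

```python
import torch

a, b = torch.randn(3, 3), torch.randn(3, 3)
r_linalg = torch.linalg.cross(a, b)  # default dim=-1: the last dimension
r_torch = torch.cross(a, b)          # default dim=None: first dim of size 3
print(torch.equal(r_linalg, r_torch))  # False in general: different dims used
```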

 ---

### Additional Fixes to `torch.cross`
- [x] Fix Doc for Tensor.cross
- [x] Fix torch.cross in `torch/overrides.py`

While working on `linalg.cross` I noticed these small issues with `torch.cross` itself.

[Tensor.cross docs](https://pytorch.org/docs/stable/generated/torch.Tensor.cross.html) still mentions `dim=-1` default which is actually wrong. It should be `dim=None` after the behaviour was updated in PR https://github.com/pytorch/pytorch/issues/17582 but the documentation for the `method` or `function` variant wasn’t updated. Later PR https://github.com/pytorch/pytorch/issues/41850 updated the documentation for the `function` variant i.e `torch.cross` and also added the following warning about the weird behaviour.
> If `dim` is not given, it defaults to the first dimension found with the size 3. Note that this might be unexpected.

But still, the `Tensor.cross` docs were missed and remained outdated. I’m finally fixing that here. Also fixing `torch/overrides.py` for `torch.cross` as well now, with `dim=None`.

To verify according to the docs the default behaviour of `dim=-1` should raise, you can try the following.

```python
a = torch.randn(3, 4)
b = torch.randn(3, 4)
b.cross(a)  # this works because the implementation finds 3 in the first dimension and the default behaviour as shown in documentation is actually not true.
>>> tensor([[ 0.7171, -1.1059,  0.4162,  1.3026],
        [ 0.4320, -2.1591, -1.1423,  1.2314],
        [-0.6034, -1.6592, -0.8016,  1.6467]])

b.cross(a, dim=-1)  # this raises as expected since the last dimension doesn't have a 3
>>> RuntimeError: dimension -1 does not have size 3
```

Please take a closer look (particularly the autograd part, this is the first time I'm dealing with `derivatives.yaml`). If there is something missing, wrong or needs more explanation, please let me know. Looking forward to the feedback.

cc mruberry Lezcano IvanYashchuk rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63285

Reviewed By: gchanan

Differential Revision: D32313346

Pulled By: mruberry

fbshipit-source-id: e68c2687c57367274e8ddb7ef28ee92dcd4c9f2c
2021-11-11 12:49:41 -08:00
40bedf6206 Fix test_triangular_solve testcase enumeration (#67635)
Summary:
use product instead of zip to cover all cases

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67635

Reviewed By: malfet

Differential Revision: D32310956

Pulled By: mruberry

fbshipit-source-id: 806c3313e2db26d77199d3145b2d5283b6ca3617
2021-11-11 12:49:38 -08:00
db014b8529 Add set_deterministic_debug_mode and get_deterministic_debug_mode (#67778)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67386
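A brief usage sketch, assuming the string/integer mode values exposed by this API ("default", "warn", "error", i.e. 0/1/2):

```python
import torch

prev = torch.get_deterministic_debug_mode()
torch.set_deterministic_debug_mode("warn")  # warn instead of erroring
# ... run code whose nondeterministic ops should only warn ...
torch.set_deterministic_debug_mode(prev)    # restore the previous mode
```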

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67778

Reviewed By: ngimel

Differential Revision: D32310661

Pulled By: mruberry

fbshipit-source-id: 300129e96ca51c22fa711182ce6a9f4d4d2ce57f
2021-11-11 12:48:29 -08:00
cd4e31ff21 [LTC] Add some comments to BackendDevice() (#68156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68156

[skip ci]

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32346302

Pulled By: alanwaketan

fbshipit-source-id: 06de6afbe2f937511abce485b24cec0a85bfbe97
2021-11-11 12:43:56 -08:00
7b376bf844 Remove ProcessGroup from TensorPipeAgent initialization (#68128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128

Reland of D31762735 (0cbfd466d2).

This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler.

I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls.

Test Plan:
rpc_pickler_test file:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx

rpc_pickler stress test:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results

Reviewed By: mrshenli

Differential Revision: D32316077

fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4
2021-11-11 12:28:55 -08:00
b473ca999b [lint] add cmakelint to lintrunner (#68191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68191

+ fix filename of exec_linter

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32364022

Pulled By: suo

fbshipit-source-id: 740892d9580edc348c3e818664fd37f145669fda
2021-11-11 12:19:01 -08:00
6cade3362b [fx-acc] add optimize_noop graph opt (#68131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68131

Ports EliminateNoop to FX

Adds optimization for a few more ops and cases than the glow version
* `acc_ops.dequantize`
* `acc_ops.flatten`
* `acc_ops.(max|min)_full_reduce`
* `acc_ops.permute`
* `acc_ops.reshape`
* `acc_ops.squeeze`
* `acc_ops.to_dtype`

Already covered by either constant fold or custom mapper
* acc_ops.slice_tensor
* acc_ops.getitem

Bug fix
* If `-1` is used in reshape's `shape` argument, we convert this inferred value to its actual positive value, but we needed to use integer division; otherwise we get a float in the shape tuple (see the sketch below). Existing unit tests didn't cover this because `unittest.TestCase.assertEqual(1, 1.0)` doesn't check types, so the comparison passes.
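A minimal sketch of the inference rule with the fix (illustrative helper, not the actual acc_ops code):

```python
def infer_shape(numel: int, shape: tuple) -> tuple:
    known = 1
    for s in shape:
        if s != -1:
            known *= s
    # Integer division: using `/` here would put a float (e.g. 4.0) in the shape.
    return tuple(s if s != -1 else numel // known for s in shape)

assert infer_shape(12, (3, -1)) == (3, 4)
```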

Test Plan:
# Graph Opt
`buck test mode/opt glow/fb/fx/graph_opts:test_fx_graph_opts -- TestEliminateNoOp`
```
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 95c17eb9-cd4d-463a-96c8-358ca3679d56
Trace available for this run at /tmp/tpx-20211105-144929.801413/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5629499609900775
    ✓ ListingSuccess: glow/fb/fx/graph_opts:test_fx_graph_opts - main (4.873)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_01_noop_dequantize (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.032)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_02_flatten (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.048)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_12_tile (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.081)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_15_to_dtype (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.022)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_20_cat (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.126)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_18_max_pool2d (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.183)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_08_reshape (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.034)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_16_avg_pool2d (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.183)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_10_squeeze (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.048)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_06_min_full_reduce (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.038)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_09_noop_reshape (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.055)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_00_identity (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.025)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_04_max_full_reduce (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.037)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_21_noop_cat (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.037)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_03_noop_flatten (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.040)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_19_noop_max_pool2d (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.135)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_11_noop_squeeze (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.036)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_14_to_dtype (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.024)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_17_noop_avg_pool2d (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.114)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_13_noop_tile (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.031)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_05_noop_max_full_reduce (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.026)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_eliminate_noop_07_noop_min_full_reduce (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestEliminateNoOp) (0.030)
Summary
  Pass: 22
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5629499609900775
```

# Shape Inference
`buck test mode/opt //glow/fb/fx/acc_tracer:test_acc_shape_inference`
```
Summary
  Pass: 99
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4503599703156114
```

Reviewed By: jfix71

Differential Revision: D32081046

fbshipit-source-id: 22403f2bb72a2605f1adcbb733e8150795c7984b
2021-11-11 12:08:24 -08:00
fe90313d02 Avoid index_put_ overhead in histogram kernel's inner loop (#67815)
Summary:
**TLDR**: Makes torch.histc run 400x faster on large inputs. Should fix [a broken test on internal CI](https://www.internalfb.com/intern/test/281475013640093/).

HistogramKernel presently calls torch.Tensor.index_put_ once for each element of its input tensor. Obtaining a data pointer and manipulating it directly avoids the considerable dispatch overhead from calling index_put_. Behavior is unchanged because the tensor being operated on is known to be contiguous and in CPU memory.
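A hedged Python illustration of the overhead being removed; the landed fix writes through a raw data pointer inside the C++ kernel, but the per-call dispatch cost is visible from Python too:

```python
import torch

idx = torch.randint(0, 100, (10_000,))
ones = torch.ones_like(idx)

# Per-element: pays full dispatch overhead on every iteration (slow).
hist_slow = torch.zeros(100, dtype=torch.long)
for i in idx:
    hist_slow.index_put_((i.unsqueeze(0),), ones[:1], accumulate=True)

# One batched call: a single dispatch does the same accumulation.
hist_fast = torch.zeros(100, dtype=torch.long)
hist_fast.index_put_((idx,), ones, accumulate=True)
assert torch.equal(hist_slow, hist_fast)
```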

Fixes performance regression introduced in https://github.com/pytorch/pytorch/pull/65318.

Benchmark: time taken to compute histc on a tensor with 10,000,000 elements

1. Before https://github.com/pytorch/pytorch/pull/65318: **0.003s**
2. After https://github.com/pytorch/pytorch/pull/65318: **2.154s**
3. After this change: **0.005s**

Benchmark code:
```
import torch as t
from timeit import default_timer as timer

x = t.randperm(10000000, dtype=t.float32)

start = timer()
t.histc(x)
end = timer()
print(end - start)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67815

Reviewed By: anjali411

Differential Revision: D32357663

Pulled By: saketh-are

fbshipit-source-id: f8fa59173ea4772c8ad1332548ef4d9ea8f01178
2021-11-11 11:16:45 -08:00
61a94495d9 [DataPipe] adding ZipperMapDataPipe (#68032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68032

Part of #57031

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32263058

Pulled By: NivekT

fbshipit-source-id: 13a30ee9d9779284a9fd9bb7222fc41253c6fe3b
2021-11-11 10:36:05 -08:00
bd5f33f91e demo backend decoupled from operators (#66100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66100

A backend should not directly depend on ATen operators. The demo backend is changed accordingly for testing purposes.

Test Plan: Imported from OSS

Reviewed By: pavithranrao

Differential Revision: D31384614

Pulled By: iseeyuan

fbshipit-source-id: c97f0c4aa12feb1d124f1d7a852e9955a7a2ce42
2021-11-11 10:26:17 -08:00
97a386805e [Pytorch Edge] Add selective macros to metal ops (#68134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68134

Add the macros in preparation of making these selective. Should be a no-op in this diff.

ghstack-source-id: 143023844

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D32326833

fbshipit-source-id: 7abc93102bff0aa0bc5e3383bdf3e95fb84ce5ba
2021-11-11 10:15:31 -08:00
c2642b6465 Sparse CSR CPU: add torch.add with all inputs sparse (#64391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64391

This PR adds `torch.add(a, b, alpha=None, out=out)` variant with `a, b, out` all being sparse CSR tensors on CPU.

Fixes #59060
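A brief usage sketch, assuming a build where this variant has landed:

```python
import torch

crow = torch.tensor([0, 1, 2])
col = torch.tensor([1, 0])
vals = torch.tensor([1.0, 2.0])
a = torch.sparse_csr_tensor(crow, col, vals, (2, 2))
b = torch.sparse_csr_tensor(crow, col, vals, (2, 2))
c = torch.add(a, b)    # all of a, b, c are sparse CSR on CPU
print(c.to_dense())    # tensor([[0., 2.], [4., 0.]])
```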

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32316562

Pulled By: cpuhrsch

fbshipit-source-id: 384462369007854b5e2e6cb9ae7b320302627c71
2021-11-11 10:02:12 -08:00
84d3df8027 Fast cuda layer norm (#67977)
Summary:
This adds an apex-inspired fast layer norm forward kernel to PyTorch (it is a significant rewrite, though).
It's much faster than the current implementation: for a typical transformer size (32*196, 1024), time goes down from ~180 us to ~49 us on Volta. Compared to apex, it also produces bitwise-accurate results between float inputs representable in fp16 and fp16 inputs. It produces slightly different results than the current implementation, though, because Welford summation is implemented differently.
It is slower than LightSeq (~37 us), but LightSeq uses an inaccurate variance approximation and doesn't guarantee float/fp16 bitwise accuracy.
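For reference, a minimal sketch of Welford's single-pass mean/variance update, the summation scheme mentioned above (plain Python; the actual kernel is CUDA):

```python
def welford(xs):
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the *updated* mean
    return mean, m2 / n           # population variance, as layer norm uses

m, v = welford([1.0, 2.0, 3.0, 4.0])
assert abs(m - 2.5) < 1e-12 and abs(v - 1.25) < 1e-12
```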

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67977

Reviewed By: mruberry

Differential Revision: D32285331

Pulled By: ngimel

fbshipit-source-id: a8b876a9cf3133daacfe0ce3a37e3ad566f4b6a8
2021-11-11 09:32:40 -08:00
a1ace029e2 Add host-side memory requirement for test_softmax_64bit_indexing (#67922)
Summary:
https://github.com/pytorch/pytorch/issues/67910
The original `largeTensorTest` decorator didn't account for the additional host-side memory requirements.
Thanks crcrpar for raising the issue, CC ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67922

Reviewed By: malfet

Differential Revision: D32308602

Pulled By: mruberry

fbshipit-source-id: 97b7d2c39fe63c1a8269402f72186026a89f6b4c
2021-11-11 09:24:15 -08:00
9e7b314318 OpInfo for nn.functional.conv1d (#67747)
Summary:
This PR adds OpInfo for `nn.functional.conv1d`. There is a minor typo fix in the documentation as well.

Issue tracker: https://github.com/pytorch/pytorch/issues/54261

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67747

Reviewed By: malfet

Differential Revision: D32309258

Pulled By: mruberry

fbshipit-source-id: add21911b8ae44413e033e19398f398210737c6c
2021-11-11 09:23:04 -08:00
35f1617001 Implement Entropy methods for Binomial and Multinomial distributions (#67609)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60866.

Because https://github.com/pytorch/pytorch/pull/61719 has shown no activity for a long time, I made this PR.
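A brief usage sketch, assuming a build that includes this change:

```python
import torch

d = torch.distributions.Binomial(total_count=10, probs=torch.tensor(0.3))
print(d.entropy())  # analytic entropy, newly implemented by this PR
```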

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67609

Reviewed By: malfet

Differential Revision: D32310866

Pulled By: mruberry

fbshipit-source-id: b3a8dde452f448e5981f5405f5f925f860b0d84f
2021-11-11 09:16:28 -08:00
864c6b3794 [nnc] aotCompiler outputSpec support quantized outputs (#67711)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67711

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32115833

Pulled By: IvanKobzarev

fbshipit-source-id: e96eb72a290ffb88011b86b3c65c0eff864b63dc
2021-11-11 09:01:46 -08:00
362c6069b9 [nnc] Lazy lowerings registration; custom classes network params (#67623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67623

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32065076

Pulled By: IvanKobzarev

fbshipit-source-id: 4945ac6483938d428c539ed1ce4fcd6988b34250
2021-11-11 09:00:23 -08:00
f89572f417 Add feature: zeros_like() from a dense tensor to a sparse tensor (#68108)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67904.
 - Create a sparse tensor when the sparse layout is given even if the input tensor is not sparse.

cc nikitaved pearu cpuhrsch IvanYashchuk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68108

Reviewed By: anjali411

Differential Revision: D32316269

Pulled By: cpuhrsch

fbshipit-source-id: 923dbd4dc7c74f51f7cdbafb2375a30271a6a886
2021-11-11 08:54:15 -08:00
5efe5e243a Ease constraint for fuse path in trt lower (#68148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68148

A question was raised about whether we should fuse the path a->b->c when node a has consumers other than node b. This diff relaxes the constraint on fuse paths so that, in the case:
```
   a
|     |
b     d
|
c
```
we still allow fusing the path (a->b->c); after fusion, node b is eliminated by dead_node_eliminator while node a remains in the graph.

Reviewed By: yinghai, 842974287

Differential Revision: D32296266

fbshipit-source-id: 44ded07a97b5b708bdf37193a022fae21410b4bd
2021-11-11 08:48:34 -08:00
d4ae789655 OpInfos for new_blah functions and some _like functions (#67357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67357

This PR adds OpInfos for:
- new_ones, new_zeros, new_full, new_empty
- rand_like, randint_like

I forgot to add the _like functions in a previous PR, so here they are.

Test Plan: - wait for tests

Reviewed By: mruberry

Differential Revision: D31969533

Pulled By: zou3519

fbshipit-source-id: 236d70d66e82f1d6f8e5254b55ca2a37b54c9494
2021-11-11 07:21:23 -08:00
4466ba8f30 Working POC of define-by-run quantization (#64676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64676

We implement a working eager mode quantization flow which uses
tracing and `__torch_function__` and `torch.nn.Module.__call__` overrides to automate the model modifications needed for quantization.  Partial program capture (instead of full program capture) is used, allowing this scheme to target a wide variety of user programs.  Control flow over quantizeable ops is not supported, but general control flow is supported.

In particular:
* `auto_trace.py` contains the machinery to override `__torch_function__` and `torch.nn.Module.__call__` and call hooks before and after each quantizeable module or function (a rough sketch of the interception idea follows this list)
* `quantization_state.py` contains the state needed to use the hooks to implement quantization logic such as adding quants/dequants, observers, etc.
* please see `README.md` for more details
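A rough sketch of the `Module.__call__` interception idea (illustrative only; the real machinery in auto_trace.py also overrides `__torch_function__` and tracks quantization state):

```python
import torch

_orig_call = torch.nn.Module.__call__

def hooked_call(self, *args, **kwargs):
    # A real implementation would insert observers / quant-dequant here.
    print("enter", type(self).__name__)
    out = _orig_call(self, *args, **kwargs)
    print("exit ", type(self).__name__)
    return out

torch.nn.Module.__call__ = hooked_call
try:
    torch.nn.Linear(2, 2)(torch.randn(1, 2))
finally:
    torch.nn.Module.__call__ = _orig_call  # always restore the original
```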

Test Plan:
```
python test/test_quantization.py TestAutoTracing
python test/test_quantization.py TestAutoTracingModels
```

Differential Revision: D31992281

Reviewed By: HDCharles

Pulled By: vkuzo

fbshipit-source-id: 6d40e855f3c96b9a4b637a0e677388a7b92f7967
2021-11-11 06:25:24 -08:00
f02efc749a [Dist CI][BE] Run each test in its own process for test_distributed_spawn (#67901)
Summary:
Context: https://github.com/pytorch/pytorch/issues/67061

Use `run_test.py`'s provided flag `"--subprocess"`, passed in like `extra_unittest_args=["--subprocess"]` when running test_distributed_spawn. This will ensure that each test is run separately in its own process. The goal is to more closely simulate how a developer would run a single test when reproducing a CI failure and make reproducibility easier in general.

Also, when a test fails, we print out the exact command that was issued so the developer knows how to reproduce it.

For example, when a test fails, it will print something like the following to the logs:

```
Test exited with non-zero exitcode 1. Command to reproduce: BACKEND=gloo WORLD_SIZE=3 /fsx/users/rvarm1/conda/envs/pytorch/bin/python distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_Backend_enum_class
```

Running test_distributed_spawn still uses the same command as before:

`
python test/run_test.py --verbose -i distributed/test_distributed_spawn
`

as seen in [distributed contributing](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) guide.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67901

Reviewed By: cbalioglu, mruberry

Differential Revision: D32225172

Pulled By: rohan-varma

fbshipit-source-id: 7e8d4c7a41858044bd2a4e0d1f0bf8f1ac671d67
2021-11-11 06:11:00 -08:00
aea4e61ec3 skip test_jit_legacy (#68129)
Summary:
disables failing tests in [https://github.com/pytorch/pytorch/issues/66429](https://github.com/pytorch/pytorch/issues/67646)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68129

Reviewed By: suo, janeyx99

Differential Revision: D32326118

Pulled By: Krovatkin

fbshipit-source-id: ca00d2214503f418be45dc756057b990fb6e6370
2021-11-10 23:08:32 -08:00
a6a2616558 Automated submodule update: kineto (#67445)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: f60ad2cb0f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67445

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: robieta

Differential Revision: D31993939

fbshipit-source-id: 3d4aa2f900434d4bbe5134db8453deb227ef5685
2021-11-10 22:33:03 -08:00
a229c3e51a Add complete type name in error message when fail to export model (#67750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67750

Add more information about why exporting a model fails.

Before: error message:
```
E1102 22:57:42.984015 3220949 ExceptionTracer.cpp:221] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  __torch__ types other than torchbind (__torch__.torch.classes)are not supported in lite interpreter. Workaround: instead of using arbitrary class type (class Foo()), define a pytorch class (class Foo(torch.nn.Module)).
Exception raised from getFunctionTuple at caffe2/torch/csrc/jit/serialization/export_module.cpp:246 (most recent call first):
```

After:
```
E1102 22:57:42.984015 3220949 ExceptionTracer.cpp:221] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  __torch__ types other than torchbind (__torch__.torch.classes)are not supported in lite interpreter. Workaround: instead of using arbitrary class type (class Foo()), define a pytorch class (class Foo(torch.nn.Module)). The problematic type is: __torch__.dper3.core.schema_utils.IdListFeature
Exception raised from getFunctionTuple at caffe2/torch/csrc/jit/serialization/export_module.cpp:246 (most recent call first):
```
ghstack-source-id: 143009294

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D32129397

fbshipit-source-id: 0594a98a59f727dc284acd1c9bebcd7589ee7cbb
2021-11-10 21:04:05 -08:00
1f07efd0f2 [SR] Fix aten::split schema (#68135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68135

Update the schema to reflect the changes in  D31935573 (6b44e75f6b).

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Confirmed native implementation is used.

Reviewed By: hlu1

Differential Revision: D32326865

fbshipit-source-id: 7f607f57ceb6690a2782d94d9ee736ba64e7d242
2021-11-10 20:03:30 -08:00
47bc47f2b9 [SR] Add runtime check to correct bad schema alias info (#67825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825

The comment explains how it works.

Test Plan:
A small regression to local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619

# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424

# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765

# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658

# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```

Reviewed By: d1jang

Differential Revision: D32124812

fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
2021-11-10 19:35:11 -08:00
ca7d0062ad [PyTorch Edge] Better error message when training attribute is not found (#68103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68103

The error message `'training' attribute not found.` in itself isn't particularly actionable. Anyone running into this tends to be clueless regarding why they are getting this message.

For example, see [this post](https://fb.workplace.com/groups/pytorch.edge.users/posts/965868874283406/) asking for help when seeing this specific error message.

The most common reason for this error is that users call `.eval()` on the model instance before saving it. This change tries to draw attention to that oversight and allows them to proactively investigate and correct it if necessary.

This saves valuable time for our users and support effort from the team. Overall, I believe this is a Developer Experience win.

ghstack-source-id: 143021300

Test Plan: Build/CI

Reviewed By: JacobSzwejbka

Differential Revision: D32304477

fbshipit-source-id: 474abe717a862347f16ad981834ddab6819cb4d3
2021-11-10 19:31:10 -08:00
0e366b8e5f Make torch.fx.experimental.fx2trt.passes a package (#68139)
Summary:
Only packages and tools (which are explicitly specified) are included in the wheel/conda files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68139

Test Plan:
Run `python3 -c "from setuptools import find_packages; print([x for x in find_packages(exclude=('tools','tools.*')) if 'torch.fx' in x])"` before and after the change
Fixes https://github.com/pytorch/pytorch/issues/68059

Reviewed By: nrsatish, seemethere

Differential Revision: D32330483

Pulled By: malfet

fbshipit-source-id: a55443730999a83c615b3f943c327353c011bf7b
2021-11-10 15:57:29 -08:00
f171c78c04 add unpack_sequence and unpad_sequence functions (#66550)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66550

Reviewed By: malfet

Differential Revision: D32299193

Pulled By: jbschlosser

fbshipit-source-id: 96c92d73d3d40b7424778b2365e0c8bb1ae56cfb
2021-11-10 15:15:08 -08:00
a510f4139b Fix lambda function broke torch.save
Summary: Torch.save uses pickle, which cannot handle lambda functions or local functions directly without modifying serialization.py. This diff fixes the issue by extracting the lambda into a normal function.
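
A minimal sketch of the underlying limitation (illustrative only; the actual fix lives in the affected fx2trt module):

```python
import io
import pickle
import torch

def scale(x):  # module-level function: pickled by reference, so torch.save handles it
    return x * 2

torch.save(scale, io.BytesIO())  # fine

try:
    torch.save(lambda x: x * 2, io.BytesIO())  # lambdas have no importable name
except (pickle.PicklingError, AttributeError) as e:
    print("cannot save lambda:", e)
```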

Test Plan: buck test mode/dev-nosan //caffe2/test/fx2trt/core:test_trt_module

Reviewed By: 842974287

Differential Revision: D32320536

fbshipit-source-id: 497d2e64f94526f92e6d1a9909b6ad629dbca850
2021-11-10 14:21:06 -08:00
22e73f616c Update unpack_dual to return named tuple (#68062)
Summary:
Also updates the doc
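
A short sketch of the resulting API (named fields `primal` and `tangent`, as exposed by `torch.autograd.forward_ad`):

```python
import torch
import torch.autograd.forward_ad as fwAD

with fwAD.dual_level():
    dual = fwAD.make_dual(torch.randn(3), torch.ones(3))
    out = fwAD.unpack_dual(dual)
    primal, tangent = out          # tuple unpacking still works
    assert out.primal is primal    # named access is what this change adds
    assert out.tangent is tangent
```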

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68062

Reviewed By: gchanan

Differential Revision: D32315089

Pulled By: soulitzer

fbshipit-source-id: 567c812da093daeb6549b0dc7ecbffd58eb8ccc2
2021-11-10 14:14:55 -08:00
d6e6064efc [LT] Upstream backend interfaces (#67927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67927

BackendData - represents 'tensor data' in opaque backend storage
LoweringContext - interface for performing backend-specific IR lowering
BackendImplInterface - interface for lazy tensors backends to implement

Reorgs backend-related files into lazy/backend subdir

includes a few small fixes, which were made on lazy_tensor_staging but need to be back-ported to master.

Test Plan: used by lazy_tensor_staging branch

Reviewed By: desertfire

Differential Revision: D32142032

fbshipit-source-id: 828c717bcd0d511876e64ad209b50f7bfb10cec5
2021-11-10 12:55:31 -08:00
c075f0f633 Update rpc testing to include USE_TENSORPIPE directive (#68080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68080

Fixes #68002

After FaultyProcessGroupAgent was replaced with FaultyTensorpipeAgent, there is now a dependency on Tensorpipe for rpc testing. However, if a user does not have USE_TENSORPIPE enabled, they will hit an issue such as `undeclared identifier 'FaultyTensorPipeRpcBackendOptions'`. This is for testing the faulty agent method, so it should not block compilation. Update to wrap the Tensorpipe-specific code in a directive.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32292861

Pulled By: H-Huang

fbshipit-source-id: 4ffb879860ced897674728200a1831f18fea0a4a
2021-11-10 12:12:18 -08:00
a3bb95c1b5 don't include label in ci: sev issue (#68093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68093

We don't want regular users without write access to be able to file an
actual issue with the `ci: sev` label since that issue will
automatically show up on hud.pytorch.org

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D32299553

Pulled By: seemethere

fbshipit-source-id: d46a96f16ae29120fff94288d3e0c06b103edf7f
2021-11-10 12:03:18 -08:00
ecd5b1a8d4 [SR] Native implementation for aten::split (#67476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31994040

fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
2021-11-10 10:23:03 -08:00
746a31b290 Logger integration format (#67962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67962

Logger integration format for chunks at [dims] -> input_val.shape[dim]

NOTE: Unused typing imports removed

Test Plan:
buck run -c python.package_style=inplace mode/dev-nosan caffe2/torch/fb/fx2trt:test_chunk

out:
RuntimeWarning: Asked for 2000 chunks along dimention 2 on tensor with size (3, 10, 20), chunks will default to 20

Reviewed By: 842974287

Differential Revision: D32233039

fbshipit-source-id: 1fde12c9f743bb80cdb309e0b7be287173d45147
2021-11-10 10:12:06 -08:00
8dfbc620d4 don't hardcode mask type in mha (#68077)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68077

Reviewed By: zou3519

Differential Revision: D32292410

Pulled By: ngimel

fbshipit-source-id: 67213cf5474dc3f83e90e28cf5a823abb683a6f9
2021-11-10 09:41:21 -08:00
ae5864498d torch.allclose opinfo (#68023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68023

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32295811

Pulled By: george-qi

fbshipit-source-id: 3253104a5a9655d8ba7bbba6620038ed6d6669f1
2021-11-10 09:16:39 -08:00
9a2db6f091 Factor backend routing logic out of convolution forward (#67790)
Summary:
This PR introduces a new function `_select_conv_backend` that returns a `ConvBackend` enum representing the selected backend for a given set of convolution inputs and params.

The function and enum are exposed to python for testing purposes through `torch/csrc/Module.cpp` (please let me know if there's a better place to do this).

A new set of tests validates that the correct backend is selected for several sets of inputs + params. Some backends aren't tested yet:
* nnpack (for mobile)
* xnnpack (for mobile)
* winograd 3x3 (for mobile)

Some flowcharts for reference:
![conv_routing_graph md](https://user-images.githubusercontent.com/75754324/140828957-1135b400-38c0-4c9f-87ef-4f33ceebeeae.png)
![conv_nogroup_routing_graph md](https://user-images.githubusercontent.com/75754324/140828977-ed223a4e-aa86-49f1-9925-c0f6b9ab36af.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67790

Reviewed By: zou3519

Differential Revision: D32280878

Pulled By: jbschlosser

fbshipit-source-id: 0ce55174f470f65c9b5345b9980cf12251f3abbb
2021-11-10 07:53:55 -08:00
147de8243b Fixed deprecation warnings with .data<T>() in SpectralOps.cpp (#67993)
Summary:
Description:
- Fixed deprecation warnings `.data<T>()` -> `.data_ptr<T>()` in SpectralOps.cpp shown while building pytorch from source

```c++
../aten/src/ATen/native/mkl/SpectralOps.cpp:213:10: warning: ‘T* at::Tensor::data() const [with T = c10::complex<double>]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.
data_ptr<T>() instead. [-Wdeprecated-declarations]
  213 |   return reinterpret_cast<std::complex<T>*>(t.data<c10::complex<T>>());
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67993

Reviewed By: H-Huang

Differential Revision: D32246945

Pulled By: mruberry

fbshipit-source-id: 5cd6b0ac6ddff0afc56e99641971e1e3b6434af6
2021-11-10 07:33:15 -08:00
6011c35a79 [LTC] Upstream class BackendDevice (#68027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68027

This commit upstreams class BackendDevice to the master, which is a backend-specific
representation of the actual hardware, for instance CPU, GPU, or TPU.

This concept is important for backends like XLA, which need to tell the actual
hardware type apart from the c10::DeviceType::Lazy virtual device during
both IR construction and lowering.

Test Plan: ./build/bin/test_lazy --gtest_filter=BackendDeviceTest.*

Reviewed By: wconstab

Differential Revision: D32261838

Pulled By: alanwaketan

fbshipit-source-id: 579c3fc5f9da7847c887a383c6047e8ecb9cc5bc
2021-11-10 07:05:43 -08:00
a6c0edff1a fix gradcheck to generate valid input for forward AD complex (#68001)
Summary:
This fixed a few of the linalg checks that we disabled before!

This also seems to break sgn, abs, and angle (sending to CI here to see if there are more). These functions used to only ever get pure imaginary or pure real tangents.
It is very likely that something is wrong with their formulas.
But they are implemented as element-wise ops, so it is not clear where the error can come from. I tried to look at it, but nothing obvious seems wrong there (especially because the formulas are correct in backward mode).
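
For reference, a minimal sketch of the kind of check this affects (using gradcheck's `check_forward_ad` flag; the actual ops exercised in CI differ):

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.cdouble, requires_grad=True)
# gradcheck now generates fully complex tangents for forward AD,
# instead of only pure real or pure imaginary ones
assert gradcheck(lambda t: t.sum(), (x,), check_forward_ad=True)
```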

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68001

Reviewed By: soulitzer

Differential Revision: D32280475

Pulled By: albanD

fbshipit-source-id: e68b1ce0e2e97f8917c3d393141d649a7669aa9d
2021-11-10 03:07:48 -08:00
94b6fa6f8b Adds an optimizer instance variable to ChainedScheduler (#68010)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67601.

As simple a fix as I could make it. I even managed to delete some testing code!

I checked calling `super()` and, as I had feared, it doesn't work out of the box, so perhaps that ought to be revisited later.

As it stands, https://github.com/pytorch/pytorch/issues/20124 still applies to the chained scheduler, but I think this change is still an improvement.
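
A minimal sketch of what the fix enables (illustrative values; any schedulers sharing one optimizer will do):

```python
import torch
from torch.optim.lr_scheduler import ChainedScheduler, ConstantLR, ExponentialLR

opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
chained = ChainedScheduler([
    ConstantLR(opt, factor=0.5, total_iters=4),
    ExponentialLR(opt, gamma=0.9),
])
assert chained.optimizer is opt  # the instance variable this PR adds
```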

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68010

Reviewed By: zou3519

Differential Revision: D32278139

Pulled By: albanD

fbshipit-source-id: 4c6f9f1b2822affdf63a6d22ddfdbcb1c6afd579
2021-11-10 01:31:47 -08:00
cb2a41e508 [PyTorch Edge] Don't use LeftRight in mobile (#66064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66064

The only place this is used seems to be in the dispatcher for `operatorLookupTable_`. Disarming `LeftRight` disarms it for this one use case.

This should make .so loading faster, and also reduce memory consumption since `LeftRight<T>` does 2 writes for every write. I'd like to get a thorough review from reviewers for this diff since I want to make sure that initialization of stuff that writes into the dispatcher isn't going to happen on multiple threads for on-device use.

Created a new class named `LeftRightNoOpWrapper<T>` for use in mobile builds.

### Why is LeftRight<T> slow?

It maintains 2 copies of each data structure `T` to be able to keep reads quick. Every write goes to both data structures, which means that writes cost 2x and the memory overhead is also 2x.

### Why is this safe for mobile builds?

1. .so loading never happens concurrently with model execution
2. Custom ops are loaded during .so load - initializers are all run serially
3. I don't see any threads being spawned from the global schema and kernel initializers

After discussing with dreiss, it seems like there could be rare cases in OSS apps or internal Android/iOS apps where a `.so` or `dylib` is loaded after the PT runtime is loaded, and this load happens concurrently with an in-progress inference run, which is looking up the operator table in the dispatcher.

To avoid crashes there, it seems reasonable to use the RW lock, since I don't expect any contention 99.9% of the time.

When registering operators, everything is serial so only one thread will ever hold the lock. The next time it needs the lock, it will have already released it.
During inference runs, only one thread will ask for the shared lock unless multiple concurrent inferences are in progress. Even in that case, they will all be able to simultaneously get the Read lock.

Test Plan: Build and generate a local build of the iOS app to test.

Reviewed By: swolchok

Differential Revision: D31352346

fbshipit-source-id: c3f12454de3dbd7b421a6057d561e9373ef5bf98
2021-11-09 21:49:45 -08:00
b0817e19e0 [PyTorch] Avoid reading file from stream for 0 byte Tensor storage (#67787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67787

First noticed in https://fb.workplace.com/groups/pytorch.edge.team/posts/952737705280969/ - basically one of the speech models has ~400 0 byte tensor files, so we're basically paying the cost of looking it up in the archive and reading nothing from it.

Turns out that there's a fairly simple fix to avoid reading a 0 byte tensor. Once we notice that it's 0 bytes, just use the default `DataPtr` instead to initializing it with 0 bytes read in from the input file stream.

ghstack-source-id: 142025211

Test Plan: CI and manually ran a couple production mobile models with bundled inputs. CI Will run all prod. mobile mobiles with bundled inputs.

Reviewed By: swolchok

Differential Revision: D32054983

fbshipit-source-id: 919b0cdbc44bccb8f6cfe0da10ff5474af37fd99
2021-11-09 21:45:05 -08:00
bf31d4b2b5 [PyTorch] Replace copy_ with data_ptr<float>() since input Tensor's dtype is guaranteed to be float (#67788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67788

Based on comments from supriyar in D31657430 (20aa417e38).
ghstack-source-id: 142924000

Test Plan: CI

Reviewed By: supriyar

Differential Revision: D32055028

fbshipit-source-id: 756d526585f8ded755ea42b52dbbf5c1687acde2
2021-11-09 21:40:23 -08:00
6b44e75f6b aliasing fixes (#66977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66977

Fix for https://github.com/pytorch/pytorch/issues/47218

More context is in original PR here: https://github.com/pytorch/pytorch/pull/20556

Test Plan: Imported from OSS

Reviewed By: malfet, albanD

Differential Revision: D31935573

Pulled By: eellison

fbshipit-source-id: 3658d5711116396c35f1d5016773b0096ed347a5
2021-11-09 18:33:37 -08:00
3f1a3f7b18 Fix ads dense arch regression (#68071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68071

Reviewed By: yinghai

Differential Revision: D32261611

fbshipit-source-id: 3224464bbf30fecbdb69e6ae88e42485ef67f800
2021-11-09 18:22:01 -08:00
91af74c934 remove Generate* macro files (#67940)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67940

Reviewed By: mruberry

Differential Revision: D32250987

Pulled By: ngimel

fbshipit-source-id: 3feb0bc876bc26d0a42784e5c6001670ed71e971
2021-11-09 17:31:56 -08:00
790763b0fe Add an option to disable reduced precision reductions for FP16 GEMM (#67946)
Summary:
https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`
rather than making the change the default behavior.

CC ngimel ptrblck
stas00 Note that the behavior after the previous PR can be replicated with
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`
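
A short sketch of the toggle in use (requires a CUDA device):

```python
import torch

# replicate the always-full-precision behavior of the previous PR:
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
out = a @ a  # fp16 GEMM now accumulates without reduced-precision reductions
```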

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946

Reviewed By: zou3519

Differential Revision: D32289896

Pulled By: ngimel

fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
2021-11-09 17:27:20 -08:00
078c655985 [nnc][mobile] temporarily disable quantization external functions (#68029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68029

Temporarily disable quantization external functions with a new macro DISABLE_NNC_QUANTIZATION.

The ATen CPU library consists of two parts:
A. Common operator functions, e.g. "at::empty()", the list of sources can be found at "aten_cpu_source_list" in "tools/build_variables.bzl";
B. Implementations of these operators, e.g. "at::native::empty()", the list of sources is defined at "aten_native_source_list" in "tools/build_variables.bzl";

Note that A does not directly depend on B. A calls B via the dispatch table. The dependency is injected into the dispatch table by B during its static initialization.

For internal mobile builds, B is built on a per-app basis. A is the public library for other libraries to depend on. Because these external functions call quantization functions that are not part of A, the NNC kernel library cannot resolve the missing symbols.

Use this PR to unblock the internal experiment until we figure out a better solution (e.g. move quantization API to A).
ghstack-source-id: 142868370

Test Plan: Make sure it can build with the stacked diff.

Reviewed By: IvanKobzarev

Differential Revision: D32239783

fbshipit-source-id: 3797b14104b0f54fb527bc3fc5be7f09cc93d9e4
2021-11-09 17:10:16 -08:00
b1a42298a4 Simplify example for nn.Flatten (#67472)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67415
Using the docstring example provided by jbschlosser in response to the issue submitted by qzylalala.
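
A sketch along the lines of the simplified example (the exact text lives in the updated docstring):

```python
import torch
import torch.nn as nn

input = torch.randn(32, 1, 5, 5)
m = nn.Sequential(
    nn.Conv2d(1, 32, 5, 1, 1),
    nn.Flatten(),  # flattens all dims except the batch dim by default
)
output = m(input)
print(output.size())  # torch.Size([32, 288])
```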

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67472

Reviewed By: soulitzer

Differential Revision: D32210995

Pulled By: jbschlosser

fbshipit-source-id: f22bcd729699993942b6e676b479618ac613022c
2021-11-09 17:03:06 -08:00
d8f0087e08 .github: Fix sccache for macOS workflows on push (#68094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68094

Turns out sccache was not getting activated properly on master pushes so
this should help resolve that

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D32299636

Pulled By: seemethere

fbshipit-source-id: 5f1be98dffdb202d3c11b6ceb2b49af235e1f91b
2021-11-09 16:40:56 -08:00
1b2a366932 [SR] Enforce checks for resizing of the internal buffer in MemoryPlanner in unit tests (#67941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941

I just found out that, due to the rounding up of Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). Now they should all be fixed. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests.

From now on, by default, if `args2` is passed to `test_static_runtime`, at the end of the second iteration it checks that the managed buffer's size is bigger than the previous size and enforces that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in.

Test Plan:
Facebook
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```

Reviewed By: swolchok

Differential Revision: D32196204

fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71
2021-11-09 16:07:40 -08:00
8d025bbc2d .github: Migrate macOS workflows to GHA (#67717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67717

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32287733

Pulled By: seemethere

fbshipit-source-id: 8df6b20aada818ad39895ef87dc280098e09707b
2021-11-09 15:46:05 -08:00
55e3b23abe [Pytorch Edge] Generic Build Features for Selective Build (#67817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67817

Implementation of build features as a useable feature. Includes tracing support and selectivity support. Follow up of Dhruv's prototype in D30076214.

The general idea is to allow selectivity of arbitrary sections of the codebase through the 2 apis,
BUILD_FEATURE_REQUIRED(NAME), and
BUILD_FEATURE_AVAILABLE(NAME)

References
PyTorch Edge Team Workplace group post link: https://fb.workplace.com/groups/pytorch.edge.team/posts/905584476662959/
Quip talking about some early ideas related to build features: https://fb.quip.com/iur3ApU9q29v
Google Doc about most recent discussion and details: https://docs.google.com/document/d/1533zuN_9pwpQBa4RhtstUjT5B7guowblqJz35QYWPE0/edit

Will remove the copy kernel example after. Its just here as an example.
ghstack-source-id: 142850218

Test Plan: CI, dummy traced a model, and played around with its unit test if i removed the traced value from the yaml

Reviewed By: dhruvbird

Differential Revision: D32151856

fbshipit-source-id: 33764c1f6902a025e53807b784792a83c8385984
2021-11-09 15:37:21 -08:00
43ef6816f2 OpInfo for nn.functional.cross_entropy (#63547)
Summary:
Reference: https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261

TODOs:

* [ ] Investigate autograd failures.
* [ ] Clean up `test_nn.py` for `cross_entropy`.

cc: mruberry zou3519

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63547

Reviewed By: mruberry

Differential Revision: D32062955

Pulled By: zou3519

fbshipit-source-id: 2a62a4c28af51fb71159df2e262d05039d549b7e
2021-11-09 15:07:12 -08:00
eaf0457eef [distributed][docs] Delete distributed optimimzer section from RPC and add reference to namespace docs page (#68068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68068

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D32286554

Pulled By: jamesr66a

fbshipit-source-id: a43fe1f0cfa74721f467b128f2e878bd02f32546
2021-11-09 15:01:54 -08:00
7c90bd77ec Test functionalization pass in python (#66101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66101

Updated description:

This PR tests the functionalization pass in python in two ways. For each of the test programs that I have in `test_functionalization.py`, it:
- runs the program with and without functionalization, and asserts the outputs and (potentially mutated) inputs are equal in both cases
- runs the program with `LoggingTensor`, and uses expecttests on the resulting graph. I manually confirm that the graphs look reasonable and only contain functional ops.

Mechanically, the changes include:
- factoring out `LoggingTensor` into a testing util so it can be re-used in multiple tests
- adding some private python api's in the `torch` namespace as hooks that I can use during testing

In the original version of this PR, I also added some fixes to the `_make_subclass()` function in python: allowing you to pass in strides and storage_offset. I kept them in mainly because the changes were already there.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31942095

Pulled By: bdhirsh

fbshipit-source-id: 90ff4c88d461089704922e779571eee09c21d707
2021-11-09 14:34:05 -08:00
fe46d6c68f functionalization: map copy_() -> to().expand_as() (#67878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67878

The functionalization pass doesn't work with `copy_()`, which is a problem for functorch. Originally we were going to make a functional `copy()` operator to fix this problem, but zou3519 pointed out that we can get (most of) the existing functionality by mapping `self.copy_(src)` to `src.to(self).expand_as(self)`. This makes the codegen a bit uglier, but has the benefit of avoiding a totally unnecessary tensor allocation in functorch.
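
A rough illustration of the equivalence being exploited, in plain eager mode (the actual rewrite happens in the generated functionalization code):

```python
import torch

self_t = torch.empty(2, 3, dtype=torch.float64)
src = torch.tensor(1.5)  # different dtype and shape, as copy_ allows

ref = self_t.clone()
ref.copy_(src)  # the in-place op functionalization needs to remove

out = src.to(self_t).expand_as(self_t)  # the functional replacement
assert torch.equal(ref, out)
```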

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32280588

Pulled By: bdhirsh

fbshipit-source-id: 2c6ee65f0929e0846566987183ba2498c88496c2
2021-11-09 14:34:02 -08:00
be4150139a bugfix for conditional functionalization (#67715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67715

I had originally made the `vector<ViewMeta>` and `Tensor`s stored on the `Update` struct references, but Will pointed out a bug in the conditional-functionalization PR due to a use-after-free error. This happens because the queued-up updates might not be synced until later, and can outlive the original tensor that was used to create them.

It was kind of strange that this doesn't show up in the existing `test/test_functionalization.py` tests that I have in this stack, which technically also should have this bug (they call sync_() after the mutated tensors have gone out of scope). I looked at it with gdb, and I'm wondering if it's just because the stored values in the free'd `ViewMeta`/`Tensor` just happen to not get clobbered by the time the sync is called in the test.

Either way, copying the Tensor + vector<ViewMeta> is probably not ideal for performance, but I couldn't think of an easy work-around for now.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32136007

Pulled By: bdhirsh

fbshipit-source-id: 707c6392a31b967e8965b9b77f297fd10a0a095a
2021-11-09 14:32:17 -08:00
4100a5cc48 Revert D32286934: [pytorch][PR] replace platform specific CI environment variables with generic ones
Test Plan: revert-hammer

Differential Revision:
D32286934 (7d931fb082)

Original commit changeset: 1008938088da

fbshipit-source-id: dd2dd07742670a34deec10995b95b98c9fd62724
2021-11-09 14:06:18 -08:00
273f7ae9b3 fx: Update fx.rst (#68043)
Summary:
When I ran this part of the code from the document with PyTorch version 1.10.0, I found some differences between the actual output and the document, as follows:

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

# Create an instance of `M`
m = M()

traced = fx.symbolic_trace(m)
print(traced)
print(traced.graph)
traced.graph.print_tabular()
```

I get the result:

```shell
def forward(self, x, y):
    add = x + y;  x = y = None
    return add

graph():
    %x : [#users=1] = placeholder[target=x]
    %y : [#users=1] = placeholder[target=y]
    %add : [#users=1] = call_function[target=operator.add](args = (%x, %y), kwargs = {})
    return add
opcode         name    target                   args    kwargs
-------------  ------  -----------------------  ------  --------
placeholder    x       x                        ()      {}
placeholder    y       y                        ()      {}
call_function  add     <built-in function add>  (x, y)  {}
output         output  output                   (add,)  {}
```

This PR updates the document accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68043

Reviewed By: driazati

Differential Revision: D32287178

Pulled By: jamesr66a

fbshipit-source-id: 48ebd0e6c09940be9950cd57ba0c03274a849be5
2021-11-09 14:00:45 -08:00
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not complete enough to enable bfloat16 for allreduce in end-to-end training.

This patch does the following:
* fix minimum NCCL version from 2.9.7 to 2.10, NCCL adds bf16 support in
  v2.10.3-1 (commit 7e51592)
* update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like all reduce can use it
* enable unit tests for bfloat16 datatype if possible
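
With this patch, a bf16 all-reduce looks like any other collective; a minimal sketch (assumes a NCCL >= 2.10 build and at least two GPUs, launched with e.g. `torchrun --nproc_per_node=2 script.py`):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.full((4,), float(rank + 1), dtype=torch.bfloat16, device="cuda")
dist.all_reduce(t)  # previously rejected; now dispatches to ncclBfloat16
print(rank, t)
dist.destroy_process_group()
```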

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
45ac6f2b65 [quant] Fix comparison against reference for test_qat_functional_linear (#68061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68061

Test had a typo that didn't compare test value against reference value, fixed typo.

Test Plan:
`pytest test/quantization/fx/test_quantize_fx.py  -v -k "test_qat_functional_linear"`

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D32280803

fbshipit-source-id: d57a25a0dcdd88df887a39b5117abafaf15125b2
2021-11-09 13:33:13 -08:00
a9c2f11d2a Update Freezing Logic and add new passes (#68024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68024

Pull Request resolved: #67949

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32260614

Pulled By: eellison

fbshipit-source-id: 41d7a9b45e33297a17560a22eba8973e2fc48b43
2021-11-09 13:21:52 -08:00
d2438a8901 [qnnpack] Lock before weightpacking in qlinear (#68012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68012

A previous attempt to make qlinear thread-safe placed the lock after the weight pointer was already accessed via packB. A race condition occurs when thread1 acquires the lock and packs the weights, but thread2 still uses the old nullptr after acquiring the lock. This causes a null pointer dereference later.
ghstack-source-id: 142714894

Test Plan: Tested on repro diff

Reviewed By: kimishpatel

Differential Revision: D32252563

fbshipit-source-id: 429fcd3f76193f1c4c8081608b6f725b19562230
2021-11-09 13:03:02 -08:00
e86058559a Op info for activation functions 2 (softsign, tanh, tanhshrink, threshold, celu, sigmoid, mish, hardsigmoid) (#67492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67492

Reviewed By: zou3519

Differential Revision: D32282580

Pulled By: samdow

fbshipit-source-id: 115afe790328577357a90117bede3b6502590441
2021-11-09 12:57:38 -08:00
726e2ed715 [lint] add more lints to lintrunner (#68069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68069

- executable bit
- cub include
- raw CUDA API usage

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D32286559

Pulled By: suo

fbshipit-source-id: 21d58e259c951424f9c6cbf1dac6d79fe7236aa4
2021-11-09 12:48:56 -08:00
cbf596bf8e Sparse CSR CPU: add addmv_out (#61536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61536

This PR adds CPU dispatch for `addmv_out` with Sparse CSR matrix.
The implementation uses the MKL Sparse library. If it's not available, then a runtime error is thrown.
Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated.

MKL descriptor of sparse matrices is implemented in `at::mkl::sparse::MklSparseCsrDescriptor`.
MKL Sparse doesn't allow switching the index type at runtime; it's
predetermined at build time. Only the 32-bit version of MKL was tested
locally, but I expect the 64-bit version to work correctly as well.

When the index type of the PyTorch CSR tensor doesn't match MKL's,
the indices tensor is converted to an MKL-compatible type (`int` vs `int64_t`).

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D32141787

Pulled By: malfet

fbshipit-source-id: b818a0b186aa227982221c3862a594266a58a2a6
2021-11-09 12:34:21 -08:00
7d931fb082 replace platform specific CI environment variables with generic ones (#68022)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68022

Reviewed By: seemethere

Differential Revision: D32286934

Pulled By: atalman

fbshipit-source-id: 1008938088da56807e85fb5d776abf79f28ef77b
2021-11-09 12:06:44 -08:00
a027551358 [LT] Merge cache.h (#67929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67929

1. Write a node-hash based unit test for Cache
2. Replace CHECK with TORCH_CHECK in IrUtil

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32246134

Pulled By: desertfire

fbshipit-source-id: c464bc300126d47e9ad4af3b3e8484a389757dc0
2021-11-09 12:02:02 -08:00
a473417076 [LT] Merge permutation_util into master (#67766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67766

Test Plan: `build/bin/test_lazy`

Reviewed By: wconstab

Differential Revision: D32147676

Pulled By: desertfire

fbshipit-source-id: 528b48c9cf789abc171235091c7146b2ab7a9c76
2021-11-09 12:00:39 -08:00
442d7d72de fixed type checking errors in options.py (#68056)
Summary:
Fixes [issue#64](https://github.com/MLH-Fellowship/pyre-check/issues/64)
This PR fixes the type checking errors in torch/distributed/rpc/options.py.
The variables at 84:8 and 85:8 were declared to have type `List` but were sometimes assigned a value of `None`, which caused an incompatible variable type error. Therefore, I changed the type from `List` to `Optional[List]`, which fixes the error.
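
A minimal sketch of the pattern (illustrative names, not the actual fields in options.py):

```python
from typing import List, Optional

def before(devices: List[int] = None):   # flagged: None is not a List[int]
    ...

def after(devices: Optional[List[int]] = None):  # accepted by the type checker
    ...
```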

Signed-off-by: Onyemowo  Agbo
onionymous
0xedward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68056

Reviewed By: zou3519

Differential Revision: D32282289

Pulled By: mrshenli

fbshipit-source-id: ee410165e623834b4f5f3da8d44bd5a29306daae
2021-11-09 11:42:34 -08:00
acb035f513 Revert D31609714: Fix Dispatching not considering List[Optional[Tensor]] for dispatch
Test Plan: revert-hammer

Differential Revision:
D31609714 (c581f56c74)

Original commit changeset: bb91cafd32fb

fbshipit-source-id: a04055e7af4bf8491b44bbc3e3bddc7831ab205e
2021-11-09 10:41:53 -08:00
6e53d6df83 [SR] Introduce StaticMethod (#67981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67981

To save on memory, various internal classes need to release all references to their `torch::jit::Module` after constructing their `StaticModule`. Unfortunately, many of these classes currently instantiate a `torch::jit::Method` attribute, which holds a reference to the `ivalue` backing its owning module.

To avoid this, I've introduced a new subclass of `IMethod` to represent scripted functions backed by static runtime.

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D32232039

fbshipit-source-id: 434b3a1a4b893b2c4e6cacbee60fa48bd33b5722
2021-11-09 10:37:29 -08:00
5e19fb61fd [SR] Release reference to JIT module if possible (#67911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67911

If we can remove `self` from the graph inputs, there is no need for `StaticModule` to hold onto its `Module` reference anymore.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32190755

fbshipit-source-id: 9c4649a63b6e68c7d2e47395a23572985d2babb1
2021-11-09 10:35:31 -08:00
9ae3f3945b Add remote_module logging to the __new__ method. (#68035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68035

RemoteModule is sometimes created using object.__new__ (e.g., in
init_from_module_rref); in this case, the logging in the __init__ method would
not pick this up.

As a result, adding a `__new__` method to RemoteModule to log all usages
appropriately.
ghstack-source-id: 142762019

Test Plan: waitforbuildbot

Reviewed By: vipannalla

Differential Revision: D32263978

fbshipit-source-id: a95ab0bb5d0836da8fe6333c41593af164b008d9
2021-11-09 09:32:34 -08:00
96b4f2296e CppSignature: Compare types by their mangled names (#67987)
Summary:
`.name()` has to call `__cxa_demangle` and allocate a new string, both of which can be avoided by just comparing the mangled names directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67987

Reviewed By: mruberry

Differential Revision: D32264560

Pulled By: H-Huang

fbshipit-source-id: 9dd4388ba4e2648c92e4062dafe6d8dc3ea6484e
2021-11-09 08:52:42 -08:00
114ef8c5ea Add SiLU backward Aten symbol (#67665)
Summary:
This is related to https://github.com/pytorch/xla/issues/3192. bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67665

Reviewed By: desertfire

Differential Revision: D32245736

Pulled By: bdhirsh

fbshipit-source-id: c5a2b24214fa37a181246cbbfcbee131473cf807
2021-11-09 08:14:02 -08:00
c581f56c74 Fix Dispatching not considering List[Optional[Tensor]] for dispatch (#66506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66506

Followup to https://github.com/pytorch/pytorch/pull/60787

It turns out that the original PR was wrong for unboxed kernels. We
recently ran into this in
https://github.com/facebookresearch/functorch/issues/124

For unboxed kernels, the correct type for a Tensor?[] argument is
actually `List<optional<Tensor>>`, not `ArrayRef<optional<Tensor>>`

Test Plan:
- assert that https://github.com/facebookresearch/functorch/issues/124
actually works

Reviewed By: bdhirsh

Differential Revision: D31609714

Pulled By: zou3519

fbshipit-source-id: bb91cafd32fb3c1b7d1e4f966b46b5d973b50df2
2021-11-09 08:00:09 -08:00
803e88d418 [DataPipe] Fixing pickling issues with fork and demux (#67930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67930

Fixes #67848

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32222184

Pulled By: NivekT

fbshipit-source-id: 48871c45a855d92cd599e21f3b53827dd32c91ef
2021-11-09 07:54:02 -08:00
577a4d34a7 making import_module private and deprecating public method (#67990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67990

Duplicate of the following PR which was merged by mistake without ghimport
https://github.com/pytorch/pytorch/pull/67914

cc albanD NicolasHug

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32247560

Pulled By: jdsgomes

fbshipit-source-id: 8ba5ba7d17fc3d0d2c377da467ea805822e21ec1
2021-11-09 07:27:57 -08:00
0a9cd6d461 Removes unnecessary no_pretrained_model from test_quantize_fx.py (#67836)
Summary:
TorchVision accidentally included model builders for quantized models without weights; this was an old bug. These builders were largely unusable and caused issues for users. They were commonly filtered out to work around this.

We've recently fixed that (https://github.com/pytorch/vision/pull/4854) by either removing those unnecessary builders or by providing quantized weights. This PR removes the no-longer necessary filtering of the methods.

**It should be merged after TorchVision is synced on FBCode.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67836

Reviewed By: jerryzh168

Differential Revision: D32230658

Pulled By: datumbox

fbshipit-source-id: 01cd425b1bda3b4591a25840593b3b5dde3a0f12
2021-11-09 05:49:27 -08:00
f9422e1c6b Fix deadlock for multi-output forward AD (#67995)
Summary:
Will hide some of the issues from https://github.com/pytorch/pytorch/issues/67367
This will at least allow us to run gradcheck for now until the above issue is fixed.

For more context, the deadlock happens when we (wrongfully) set a forward grad that also has a forward grad at the same level.
In particular, when exiting the level from 191b48b12f/torch/csrc/autograd/forward_grad.cpp (L23)
We take the `all_forward_levels_mutex_` lock and proceed to delete the level at 191b48b12f/torch/csrc/autograd/forward_grad.cpp (L29) (nothing else usually references this object, so it gets deleted as soon as it gets removed from the vector). Note that, at this point, we still hold the lock!

In the level destructor in 191b48b12f/torch/csrc/autograd/forward_grad.cpp (L55) we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor also has a forward grad for this level, the autograd meta clears the fw grads: 191b48b12f/torch/csrc/autograd/forward_grad.h (L124)
While clearing, we access the level (to de-register this forward grad) via 191b48b12f/torch/csrc/autograd/forward_grad.h (L139)
But this tries to access the level again in 191b48b12f/torch/csrc/autograd/forward_grad.cpp (L39) and deadlocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67995

Reviewed By: soulitzer

Differential Revision: D32250996

Pulled By: albanD

fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c
2021-11-09 01:32:43 -08:00
f8297d40fc Adds a maximize flag to SGD. (#67847)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46480 -- for SGD.

## Notes:
- I have modified the existing tests to take a new `constructor_accepts_maximize` flag. When this is set to true, the `_test_basic_cases_template` function will test both maximizing and minimizing the sample function.
- This was the clearest way I could think of testing the changes -- I would appreciate feedback on this strategy.

## Work to be done:
- [ ] I need to update the docs.
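
A minimal sketch of the new flag in use (see the tests for the full matrix of cases):

```python
import torch

w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, maximize=True)

for _ in range(100):
    opt.zero_grad()
    objective = -(w - 2.0).pow(2).sum()  # concave, maximized at w = 2
    objective.backward()
    opt.step()  # ascends the gradient instead of descending

print(w)  # approaches 2.0
```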

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67847

Reviewed By: H-Huang

Differential Revision: D32252631

Pulled By: albanD

fbshipit-source-id: 27915a3cc2d18b7e4d17bfc2d666fe7d2cfdf9a4
2021-11-09 00:43:07 -08:00
c5e5264be2 Disable TF32 in pinv_jvp and pinv_backward (#67948)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67947

cc ptrblck xwang233 zasdfgbnm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67948

Reviewed By: H-Huang

Differential Revision: D32251934

Pulled By: ngimel

fbshipit-source-id: a2b1a118337b38db61350c9e49f1ba19030d70ec
2021-11-08 22:33:29 -08:00
417dc7f86c Revert D32007691: [pytorch][PR] Op info for activation functions 2 (softsign, tanh, tanhshrink, threshold, celu, sigmoid, mish, hardsigmoid)
Test Plan: revert-hammer

Differential Revision:
D32007691 (ea60e7d559)

Original commit changeset: 6cb14dc56e29

fbshipit-source-id: 9ef599ef07302fb521b1f413b989786adfa3576c
2021-11-08 21:16:53 -08:00
36d9a74bc6 Enforce that test cases extend from correct TestCase (#67819)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/66903

Main code is in torch/testing/_internal/common_utils.py; everything else is fixing the lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67819

Reviewed By: H-Huang

Differential Revision: D32259978

Pulled By: janeyx99

fbshipit-source-id: 39c5ffbaa510e1e533d6bdacf5c6158a3dd9885d
2021-11-08 18:28:36 -08:00
25cd81876d Fix typo grid_sampler_3d_cuda (#67752)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67752

Reviewed By: NivekT, mruberry

Differential Revision: D32256561

Pulled By: H-Huang

fbshipit-source-id: b4d56cadf15bc00181e899ea4be4b1bcfe63f692
2021-11-08 18:16:01 -08:00
4b1d044498 [WIP][resubmit] Don't #define NUM_THREADS (#68008)
Summary:
This reverts commit 9e8016d8c48e9c99addad93112f99d3375a0fbc7.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68008

Reviewed By: H-Huang

Differential Revision: D32254779

Pulled By: ngimel

fbshipit-source-id: 38ec415199f62a1e58000abe3e34ac91898a94ae
2021-11-08 18:03:45 -08:00
a2ab06514b Fixes CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
Summary:
Description:
- Follow up PR to https://github.com/pytorch/pytorch/issues/66790 to fix the tests on functorch, https://github.com/pytorch/functorch/issues/195

In functorch, a null tensor is added to the list of indices for the batch dimension in C++, but I cannot find an equivalent of that in Python without using `torch.jit.script`. If any better solutions can be suggested, I'd be happy to replace the current way of testing.

cc ngimel zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67189

Reviewed By: suo

Differential Revision: D31966686

Pulled By: ngimel

fbshipit-source-id: a14b9e5d77d9f43cd728d474e2976d84a87a6ff4
2021-11-08 17:56:43 -08:00
3f048c637f [distributed] Render torch.distributed.optim members (#67885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67885

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32191952

Pulled By: jamesr66a

fbshipit-source-id: a9ed52da8e89b3491eab2e691f5571338f83e8e3
2021-11-08 16:20:55 -08:00
fd198a2fea [fx2trt] fix import in oss tests (#68016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68016

We would want to use oss test utils.

Also refactor both test utils so that the internal one is an enhancement over the oss test utils.

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D32250266

fbshipit-source-id: 968b8f215ca2d294f7d0bd13cf9563be567954dd
2021-11-08 16:11:00 -08:00
0d8a8a2e41 [fx2trt]organize converter utils (#68015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68015

Put all converter utils into a single file `converter_utils.py`.

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D32250243

fbshipit-source-id: 93fb34bc9ca23f4c3cef3125e04871083dbd413d
2021-11-08 16:09:42 -08:00
5b036d5f2b [Doc] [ONNX]Fix a broken url for ONNXRuntime custom op (#67944)
Summary:
**Description**
Update the broken url by a valid link https://onnxruntime.ai/docs/reference/operators/add-custom-op.html.

**Motivation**
Closes https://github.com/pytorch/pytorch/issues/67849. The url is broken.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67944

Reviewed By: NivekT

Differential Revision: D32252880

Pulled By: H-Huang

fbshipit-source-id: 400b0efa3d6f63e60b016c482fbbed1293c29806
2021-11-08 15:51:02 -08:00
82398e38ab Upgrade and fix boto3 version to 1.19.12 (#68025)
Summary:
The new boto3 version could be causing the macos test reporting to fail. Pinning to version 1.19.12

example fail: https://app.circleci.com/pipelines/github/pytorch/pytorch/406385/workflows/f15ca6ba-e8af-45a3-b1b0-c0298ea3fe9d/jobs/16687920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68025

Reviewed By: malfet, seemethere

Differential Revision: D32261971

Pulled By: janeyx99

fbshipit-source-id: 1a2cd636a2f0b206921749c3f0c9e4707c9a1222
2021-11-08 15:43:35 -08:00
9094947b0a use better secrets for upload labels workflow (#68013)
Summary:
Should prevent https://github.com/pytorch/pytorch/runs/4134946329?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68013

Reviewed By: seemethere

Differential Revision: D32254046

Pulled By: janeyx99

fbshipit-source-id: 55a7a1b8f8434f6608fe9d423982406c1e187c59
2021-11-08 15:14:28 -08:00
db9b4f1a37 [ROCm] Bump magma source to pickup memory leak fix (#67225)
Summary:
Magma's magma_queue was double allocating storage when creating
ptrArray for gemm operations.  A fix has been upstreamed and the build
needs to pick this up going forward.

Fixes #{issue number}

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67225

Reviewed By: janeyx99

Differential Revision: D32252609

Pulled By: seemethere

fbshipit-source-id: e27ba1a54dc060fd1bfb4afad9079bf9b4705c8a
2021-11-08 15:08:09 -08:00
0b09d62cf3 [hackathon][DataPipe] adding .pyi file generation for torch.utils.data.datapipes (#67374)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* __->__ https://github.com/pytorch/pytorch/issues/67374

This is a work in progress.

Related TorchData issue: https://github.com/pytorch/data/issues/80

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67374

Reviewed By: H-Huang

Differential Revision: D32153211

Pulled By: NivekT

fbshipit-source-id: b4c61f191f20fd98ca44bb9e4f972c6d812994a0
2021-11-08 14:43:24 -08:00
2e523ed229 [JIT] additional support for CallMethod with autocasting (#67925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67925

Previously, the following would always fail, because autocasting would not be enabled in the called method:

```
@torch.jit.script
def fn(x, y):
    with autocast():
        # CallMethod() to some method

fn(x, y)
```

This allows the above, if autocasting is globally enabled, e.g.

```
@torch.jit.script
def fn(x, y):
    with autocast():
        # CallMethod() to some method

with autocast():
    fn(x, y)  # now works, since autocast is globally enabled
```
ghstack-source-id: 142667351

Test Plan: added test in test_jit_autocast.py

Reviewed By: navahgar

Differential Revision: D32214439

fbshipit-source-id: bb7db054e25e18f5e3d2fdb449c35b5942ab303e
2021-11-08 14:37:09 -08:00
f57c63032e [ONNX] Fix reciprocal when input is not floating point (#67471) (#67808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67808

torch.reciprocal implicitly casts the inputs to float, and ONNX
Reciprocal requires floating point inputs.

Also separate the reciprocal test from other tests, and test different
input types.
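
A quick sketch of the eager-mode behavior the exporter now has to match:

```python
import torch

x = torch.tensor([2, 4])        # integer input
y = torch.reciprocal(x)         # implicitly computed in floating point
print(y, y.dtype)               # tensor([0.5000, 0.2500]) torch.float32
```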

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181307

Pulled By: malfet

fbshipit-source-id: 3e1109b3c85a49c51dc713656a900b4ee78c8340
2021-11-08 14:37:07 -08:00
eb22d06e5e [ONNX] Use human readable enum for dtype scalars (#66822) (#67807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67807

Also make quoting of string literals consistent.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181309

Pulled By: malfet

fbshipit-source-id: e1053701e3589f0310d8b5ef920359c03c6713f0
2021-11-08 14:37:05 -08:00
958d517643 [ONNX] Fix new_full and full_like for Python 3.9 (#67124) (#67806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67806

Previously new_full would fail with errors like:
`TypeError: only integer tensors of a single element can be converted to an index`

And full_like would trigger warnings like:
`DeprecationWarning: an integer is required (got type float).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.`

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181301

Pulled By: malfet

fbshipit-source-id: 2cf262cfef36c18e7b2423efe1e1d4fa3438f0ba

Co-authored-by: Bowen Bao <bowbao@microsoft.com>
2021-11-08 14:37:03 -08:00
37688148ae [ONNX] Support opset 15 (#67121) (#67805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67805

Also fix Reduce ops on binary_cross_entropy_with_logits

The graph says the output is a scalar, but with `keepdims=1`
(the default) the output would be a tensor of rank 1. We set
`keepdims=0` to make it clear that we want a scalar output.

This previously went unnoticed because ONNX Runtime does not strictly
enforce shape inference mismatches if the model is not using the latest
opset version.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181304

Pulled By: malfet

fbshipit-source-id: 1462d8a313daae782013097ebf6341a4d1632e2c

Co-authored-by: Bowen Bao <bowbao@microsoft.com>
2021-11-08 14:37:00 -08:00
ead59b5ff3 [ONNX] Suppress ort warnings in onnx related test (#67054) (#67804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67804

Improve readability of test logs by suppressing ORT warning logging for ONNX-related tests.

Reducing ONNX CI test log binary size:
linux-xenial-py3.6-clang7-onnx-test1: 12443 KB -> 6958 KB
linux-xenial-py3.6-clang7-onnx-test2: 16884 KB -> 5778 KB

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181308

Pulled By: malfet

fbshipit-source-id: 11cf165dc212d061606590e96c08c6e021135f74

Co-authored-by: BowenBao<bowbao@microsoft.com>
2021-11-08 14:35:20 -08:00
ea60e7d559 Op info for activation functions 2 (softsign, tanh, tanhshrink, threshold, celu, sigmoid, mish, hardsigmoid) (#67492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67492

Reviewed By: mruberry

Differential Revision: D32007691

Pulled By: samdow

fbshipit-source-id: 6cb14dc56e296154e2f48249049c4d2fe4f4d10d
2021-11-08 14:30:50 -08:00
a1d733ae8c Avoid convert trt.Dims to tuple in hot path (#67960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67960

For some reason, we throw py::index_error when converting a trt.Dims to a tuple. Having this in the hot path of trt inference is not good, especially when we register a bunch of pybind11 exception translators that repeatedly rethrow the exception. Since the shape is static information, we save it once to avoid the repeated conversion.

Reviewed By: jianyuh, wushirong, 842974287

Differential Revision: D32232065

fbshipit-source-id: 11e49da9758ead0ff3aa647bbd3fce7735bf4a07
2021-11-08 13:36:15 -08:00
4a8f27445d [Quant] Add dynamic QAT Linear module (#67325)
Summary:
**Summary:** This commit adds the `torch.nn.qat.dynamic.modules.Linear`
module, the dynamic counterpart to `torch.nn.qat.modules.Linear`.
Functionally these are very similar, except the dynamic version
expects a memoryless observer and is converted into a dynamically
quantized module before inference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67325

Test Plan:
`python3 test/test_quantization.py TestQuantizationAwareTraining.test_dynamic_qat_linear`

**Reviewers:** Charles David Hernandez, Jerry Zhang

**Subscribers:** Charles David Hernandez, Supriya Rao, Yining Lu

**Tasks:** 99696812

**Tags:** pytorch

Reviewed By: malfet, jerryzh168

Differential Revision: D32178739

Pulled By: andrewor14

fbshipit-source-id: 5051bdd7e06071a011e4e7d9cc7769db8d38fd73
2021-11-08 10:24:25 -08:00
db456d16ee torch.lobpcg.backward: do not save non-Variable types with ctx.save_for_backward. (#67994)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67827
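
The general rule being applied, sketched on a toy custom Function (unrelated to lobpcg itself):

```python
import torch

class Pow(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, exponent):
        ctx.save_for_backward(x)   # Tensors (Variables) go through save_for_backward
        ctx.exponent = exponent    # non-Variable state is stashed on ctx directly
        return x ** exponent

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * ctx.exponent * x ** (ctx.exponent - 1), None

x = torch.tensor(2.0, requires_grad=True)
Pow.apply(x, 3).backward()
print(x.grad)  # tensor(12.)
```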

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67994

Reviewed By: H-Huang

Differential Revision: D32244818

Pulled By: albanD

fbshipit-source-id: 702a3a1d1f4c160bef7ec1f764a2ab5d01ca7901
2021-11-08 10:02:09 -08:00
8e2528132b [lint] small updates to .lintrunner.toml (#67942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67942

- Change "name" to "code" for consistency with linttool and LintMessage
format.
- Change "args" and "init_args" to "command" and "init_command" for
consistency with internal representation.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32250606

Pulled By: suo

fbshipit-source-id: 557fef731bab9adca7ab1e7cc41b996956076b05
2021-11-08 09:45:26 -08:00
d201102d36 [lint] Add the rest of the grep linters (#67932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67932

Also various improvements to grep_linter.py, including the ability to
specify a replacement pattern.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32250603

Pulled By: suo

fbshipit-source-id: e07eb182e9473a268e2b805a68a859b91228bfbb
2021-11-08 09:45:20 -08:00
53f118c800 [lint] improve mypy lintrunner config (#67936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67936

- Add the strict config
- Make the patterns exactly match the current CI
- Add init_args

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32250605

Pulled By: suo

fbshipit-source-id: a71d434bf6024db4462260a460a1bc2d9ac66a32
2021-11-08 09:45:14 -08:00
419c58ea9c [lint] add newlines linter to lintrunner (#67894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67894

As title. Confirmed that the code base passes by running:

```
lintrunner --paths-cmd='git grep -Il ""' --take NEWLINE
```

and seeing that it passes

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32250604

Pulled By: suo

fbshipit-source-id: de9bcba635d21f8832bb25147b19b7b2e8802247
2021-11-08 09:45:07 -08:00
4b021280ad [lint] add nativefunctions to lintrunner (#67890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67890

Adding another linter. I also added a generic initializer that installs
the right pip packages (you can invoke it by running `lintrunner init`).

Differential Revision: D32197366

Test Plan: Imported from OSS

Reviewed By: driazati

Pulled By: suo

fbshipit-source-id: 82844e78f1ee3047220d8444874eab41d7cc0e9e
2021-11-08 09:44:59 -08:00
5bb5bfccf7 [lint] add lintrunner support for circleci_linter (#67872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67872

As title. This demonstrates some of the nice features of lintrunner:
- Uniform error reporting means you get a nice diff of the changes for
free
- Can run with -a to just accept the changes (no need to tell people to run
a special regenerate command, since the linter adapter already knows how).

Differential Revision: D32187386

Test Plan: Imported from OSS

Reviewed By: driazati

Pulled By: suo

fbshipit-source-id: 71de6b042730be80ff6794652039e9bc655a72b1
2021-11-08 09:43:25 -08:00
b3770766c4 Fixes deprecation warnings in test_optim.py (#67954)
Summary:
Catches deprecation warnings when we call `scheduler.step(epoch)`
in tests.

Removes duplicate parameters to optimizers unless we are specifically
testing for that

Fixes https://github.com/pytorch/pytorch/issues/67696

There is one warning remaining when I run this locally -- however that is due to the implementation of the `SequentialLR` Scheduler. I will open a new issue relating to that.
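
A minimal sketch of the catching pattern (assuming the deprecated `epoch` argument surfaces as an ordinary warning):

```python
import warnings
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1)

opt.step()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sched.step(1)  # passing an epoch is deprecated
assert any("deprecated" in str(w.message).lower() for w in caught)
```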

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67954

Reviewed By: H-Huang

Differential Revision: D32244056

Pulled By: albanD

fbshipit-source-id: 2ab3086a58e10c8d29809ccbaab80606a1ec61d8
2021-11-08 09:36:08 -08:00
b546cdf401 [SR] Out variant for prim::NumToTensor (#67856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856

Returns a tensor constructed from scalar input

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.

Reviewed By: mikeiovine

Differential Revision: D32014194

fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
2021-11-08 09:02:58 -08:00
0dc99dcf59 Update __init__.py (#67900)
Summary:
Fixes a syntax error in pytorch/torch/cuda/__init__.py.
Fixes https://github.com/pytorch/pytorch/issues/67896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67900

Reviewed By: mruberry

Differential Revision: D32211978

Pulled By: soulitzer

fbshipit-source-id: a313a5e23b4d79e5b7bb909eaf82c9ee6cab10c9
2021-11-08 08:56:38 -08:00
5bc89275dd [SR] Eliminate no-ops (#67437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437

Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.

Test Plan: `buck test caffe2/test:jit -- test_freezing`

Reviewed By: hlu1

Differential Revision: D31980843

fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
2021-11-08 08:42:33 -08:00
191b48b12f [torch.fx] Fix replace pattern mechanism (#66442)
Summary:
The following code would not replace the pattern correctly:

```python
        def f(x):
            x = torch.sigmoid(x)
            x = torch.sigmoid(x)
            return torch.sigmoid(x)

        def pattern(x):
            return torch.sigmoid(x)

        def replacement(x):
            return torch.exp(x)

        def comparison(x):
            x = torch.exp(x)
            x = torch.exp(x)
            return torch.exp(x)

        traced = symbolic_trace(f)
        comparison_fn = symbolic_trace(comparison)

        subgraph_rewriter.replace_pattern(traced, pattern, replacement) # Only one sigmoid gets converted.
```

This PR fixes the replacement mechanism and adds a test for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66442

Reviewed By: ZolotukhinM

Differential Revision: D32238424

Pulled By: ansley

fbshipit-source-id: 386e777174c639baafc166d5ffbc0658a96b1ee9
2021-11-07 13:23:02 -08:00
9fb3ba9d7b Revert D31762735 (#67924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924

This diff reverts the changes made in D31762735 (0cbfd466d2)

Test Plan: Wait for CI

Reviewed By: derekmod-fb

Differential Revision: D32214744

fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374
2021-11-06 17:34:13 -07:00
9cacf2b718 Add custom zipper script to zip python modules for torch.deploy (#67006)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67006

Test Plan: nervouslaugh_

Reviewed By: shunting314

Differential Revision: D31822429

fbshipit-source-id: c2efeab1446fbeb70b98d4ee766fbc670cf091b0
2021-11-06 11:49:02 -07:00
ae501a9727 [PyTorch Edge] Update bytecode version compatibility check (#67417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67417

The bytecode version is valid when it is at most kMaxSupported and at least kMinSupported.
ghstack-source-id: 142609392

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.isCompatibleFail'
```

Reviewed By: JacobSzwejbka, iseeyuan

Differential Revision: D31984839

fbshipit-source-id: 2011e77455c931c0a8a58267494d44bcf167b877
2021-11-05 19:34:01 -07:00
80178d6152 [DDP] Fix some issues with code example in DDP docstring (#67883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67883

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D32190946

Pulled By: jamesr66a

fbshipit-source-id: a376324b95cbe833ffa606ecdfc6156432880f70
2021-11-05 17:32:45 -07:00
22afe82ce3 [rpc] Switch RPC agent check to TORCH_CHECK and add more descriptive error (#67882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67882

I ran into a hard-to-interpret error message when trying to run the following script, which was missing an `init_rpc` call:

```
# $ torchrun --standalone --nnodes=1 --nproc_per_node=1 script.py
import os
rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

import torch.distributed
# !!!!!! Uncomment the following and the script succeeds
# torch.distributed.rpc.init_rpc('worker', rank=rank, world_size=world_size)

import torch.distributed as dist
dist.init_process_group(backend='gloo')

import torchvision.models as models
import torch

rn50 = models.resnet50()
rn50.train()
rn50 = torch.nn.parallel.DistributedDataParallel(rn50)

from torch.distributed.rpc import RRef
from torch.distributed.optim import DistributedOptimizer

params = []
for param in rn50.parameters():
    params.append(RRef(param))

dist_optim = DistributedOptimizer(
        torch.optim.SGD,
        params,
        lr=0.05)

loss_func = torch.nn.CrossEntropyLoss()

with torch.distributed.autograd.context() as context_id:
    pred = rn50(torch.randn(50, 3, 224, 224))
    target = torch.randn(50, 1000).softmax(dim=1)
    loss = loss_func(pred, target)
    dist.autograd.backward(context_id, [loss])
    dist_optim.step(context_id)
```

Error:

```
Traceback (most recent call last):
  File "/xxx/torchrun_exp/script.py", line 23, in <module>
    params.append(RRef(param))
RuntimeError: agentINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rpc_agent.cpp":237, please report a bug to PyTorch. Current RPC agent is not set!
```

Since this is a user-facing error, I've changed `TORCH_INTERNAL_ASSERT` to `TORCH_CHECK` and added a hint about how to resolve the issue. On the other hand, the fact that this was originally `TORCH_INTERNAL_ASSERT` may suggest that the author thought that this should be an internal-only error condition. If there is some other place that should be throwing an exception in this case that is failing, let me know and I can adapt the fix to change that location.

Question for reviewers:
* Is there a good test file where I can add a test for this error condition?

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32190947

Pulled By: jamesr66a

fbshipit-source-id: 3621d755329fd524db68675c55b1daf20e716d43
2021-11-05 17:31:11 -07:00
efdb17b984 Add meta support to tensor range factories (#67032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67032

This PR adds meta backend support to the `range`, `arange`, `linspace`, and `logspace` operators.

Note that the original PR (#66630) was reverted due to two failing unit tests in the Bionic CI. This revision includes a fix for those tests; otherwise its content is identical to the previous PR.
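
With this support, the factories can run on the meta device, which tracks only shape and dtype without allocating data:

```python
import torch

t = torch.arange(0, 10, device="meta")  # no storage is allocated
print(t.shape, t.dtype, t.device)       # torch.Size([10]) torch.int64 meta
```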

Original commit changeset: 2f9d8d1acbb0
ghstack-source-id: 142487306

Test Plan: Extended the existing tensor creation tests to assert meta backend support.

Reviewed By: zhaojuanmao

Differential Revision: D31834403

fbshipit-source-id: a489858a2a8a38a03234b14408e14d2b208a8d34
2021-11-05 15:36:29 -07:00
9e8016d8c4 Revert D31932215: [pytorch][PR] Don't #define NUM_THREADS
Test Plan: revert-hammer

Differential Revision:
D31932215 (f70e8064f4)

Original commit changeset: ccdf11e249fb

fbshipit-source-id: 4c330aebe9cfb483f02ceb1fdaf5c3b0f8fa6fa1
2021-11-05 15:14:32 -07:00
10411e3561 [quant][fusion] Fix the additional_fuser_method argument for fuse_fx (#67876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67876

Previously we missed this argument when calling obj.convert, so it did not affect the fusion.
This PR fixes that and adds a test for it.

Test Plan:
python test/test_quantization.py TestFuseFx

Imported from OSS

Reviewed By: malfet

Differential Revision: D32191364

fbshipit-source-id: 566bd39461010d70a21de71f611bb929976fe01d
2021-11-05 14:51:15 -07:00
f70e8064f4 Don't #define NUM_THREADS (#67258)
Summary:
PyTorch doesn't compile with the latest `main` branch of cub again. The root cause is that PyTorch defines a macro `NUM_THREADS`, and cub added some code like
```C++
template<...., int NUM_THREADS, ...>
```
and the two definitions conflict with each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67258

Reviewed By: albanD

Differential Revision: D31932215

Pulled By: ngimel

fbshipit-source-id: ccdf11e249fbc0b6f654535067a0294037ee7b96
2021-11-05 13:56:11 -07:00
b1ecfc6d45 Add timeouts for GHA jobs for pytorch/pytorch (#67912)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67912

Reviewed By: seemethere

Differential Revision: D32215323

Pulled By: atalman

fbshipit-source-id: 45da7c4bb13c877c9b38bea8615adf75c4a9702d
2021-11-05 12:50:19 -07:00
f6402c469e (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67749

Fixes: https://github.com/pytorch/pytorch/issues/67742

Test Plan:
Added unittests.

Validated manually:

```
# start agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# start agent 1
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# kill agent 0
CTRL+C (SIGINT) or kill -15 (SIGTERM)

# restart it
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```

Reviewed By: cbalioglu

Differential Revision: D32129005

fbshipit-source-id: db292268250ef6f1e06f5b4c5bd67124d8dfd325
2021-11-05 12:18:46 -07:00
240e8d5cc5 Updated searchsorted functionality (#66818)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60492

Updates the searchsorted API to be more consistent with numpy and adds an OpInfo for searchsorted.
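
For example, tie-breaking with the numpy-style `side` argument (assuming that is among the additions here):

```python
import torch

seq = torch.tensor([1, 3, 3, 5])
vals = torch.tensor([3])
torch.searchsorted(seq, vals)                # tensor([1]): leftmost slot
torch.searchsorted(seq, vals, side="right")  # tensor([3]): past the ties
```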

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66818

Reviewed By: mruberry

Differential Revision: D31745142

Pulled By: samdow

fbshipit-source-id: 0b9600afc3cb0720afb5811212404ee96d2a7d93
2021-11-05 12:13:47 -07:00
f6a4c80a5a Refactor cuDNN Convolution memory format and Conv-Bias-Relu code (#65594)
Summary:
This PR makes several changes:

- Changed function `bool cudnn_conv_use_channels_last(...)` to `at::MemoryFormat cudnn_conv_suggest_memory_format(...)`
- Removed `resize_` in cudnn convolution code. Added a new overloading method `TensorDescriptor::set` that also passes the desired memory format of the tensor.
- Disabled the usage of double + channels_last on cuDNN Conv-Relu and Conv-Bias-Relu. Call `.contiguous(memory_format)` before passing data to cuDNN functions.
- Disabled the usage of cuDNN fused Conv-Bias-Relu in cuDNN < 8.0 version due to a CUDNN_STATUS_NOT_SUPPORTED error. Instead, use the native fallback path.
- Let Conv-Bias-Relu code respect the global `allow_tf32` flag.

According to the cuDNN documentation, double + NHWC is generally not supported.
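
For reference, requesting channels_last from Python looks like this (a generic illustration, not the PR's internals):

```python
import torch

x = torch.randn(2, 3, 8, 8)
# Rearrange the data into NHWC layout before handing it to code paths
# that prefer it.
x_cl = x.contiguous(memory_format=torch.channels_last)
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
```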

Close https://github.com/pytorch/pytorch/pull/66968

Fix https://github.com/pytorch/pytorch/issues/55301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65594

Reviewed By: jbschlosser, malfet

Differential Revision: D32175766

Pulled By: ngimel

fbshipit-source-id: 7ba079c9f7c46fc56f8bfef05bad0854acf380d7
2021-11-05 11:50:55 -07:00
cdd5d16489 [Foreach] Implement L1&L2 norm (#62646)
Summary:
Implement the L1 & L2 norms in the fast path, referencing [nvidia/apex](https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_l2norm_kernel.cu).
When `ord` is neither 1 nor 2, the slow path is chosen.
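
A hypothetical call into the private foreach API (the exact signature is an assumption):

```python
import torch

tensors = [torch.randn(3), torch.randn(4, 4)]
l2_norms = torch._foreach_norm(tensors, 2)  # fast multi-tensor path
l1_norms = torch._foreach_norm(tensors, 1)
```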

Related: https://github.com/pytorch/pytorch/issues/58833

cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62646

Reviewed By: malfet

Differential Revision: D32173421

Pulled By: ngimel

fbshipit-source-id: 14b7544601658a979b83509df351e1848ded7675
2021-11-05 11:23:00 -07:00
e7a3bbce89 [nnc] Add support for dynamic shapes in TensorExprKernel (#67861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67861

Previously submitted as https://github.com/pytorch/pytorch/pull/67197.
This got reverted because its failures were hidden by the failures of
another PR.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D32178196

Pulled By: navahgar

fbshipit-source-id: cc8a5c68aed360d06289e69645461cfa773e1300
2021-11-05 11:18:19 -07:00
a4a6d056e6 Add ownership to more edge tests (#67859)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66232

This should be the last immediate task. I anticipate test ownership will change over time, but this is the last big item needed to close it out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67859

Reviewed By: soulitzer

Differential Revision: D32210534

Pulled By: janeyx99

fbshipit-source-id: 7fd835d87d9d35d49ec49de1fcfa29b085133e99
2021-11-05 11:01:16 -07:00
9dafb6434b remove use of THGenerateAllTypes, clean up (#67867)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67867

Reviewed By: mruberry

Differential Revision: D32191053

Pulled By: ngimel

fbshipit-source-id: 84eb6c2989495fca5f7b055c4984efe5de94e812
2021-11-05 10:57:04 -07:00
ee7412dd29 autodiff fix for autocast_to_xxx (#67648)
Summary:
Fixes autocast + autodiff issue where `RuntimeError: grad_inputs.size() == node->inputs().size()INTERNAL ASSERT FAILED at "../torch/csrc/jit/runtime/autodiff.cpp":426, please report a bug to PyTorch.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67648

Reviewed By: cpuhrsch

Differential Revision: D32083227

Pulled By: davidberard98

fbshipit-source-id: edf526cff4ec21874ae35ec730d13c250073e10c
2021-11-05 10:48:39 -07:00
9269080b47 [PyTorchEdge] backport test (#67824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67824

Testing backport of all prod models using model test framework

Ref:
[Create tests at run-time (google test)](https://stackoverflow.com/questions/19160244/create-tests-at-run-time-google-test)

Breaking the list of models into 20 chunks based on a simple hash (the sum of all character values).
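
A minimal sketch of that bucketing (illustrative, not the test's actual code):

```python
def chunk_index(model_name: str, num_chunks: int = 20) -> int:
    # "simple hash": sum of all character values, bucketed into num_chunks
    return sum(ord(c) for c in model_name) % num_chunks
```
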
ghstack-source-id: 142398833

Test Plan:
```
 buck test //xplat/pytorch/mobile/test:test_read_all_mobile_model_configs
Starting new Buck daemon...

Parsing buck files: finished in 7.6 sec
Creating action graph: finished in 0.9 sec
[RE] Metadata: Session ID=[reSessionID-66f5adfe-50d1-4599-9828-3e8115181601]
[RE] Waiting on 0 remote actions. Completed 1008 actions remotely, action cache hit rate: 43.59%.
Downloaded 26/1523 artifacts, 252.60 Kbytes, 96.6% cache miss (for updated rules)
Building: finished in 01:18.6 min (100%) 5532/5532 jobs, 770/5532 updated
  Total time: 01:27.3 min
Testing: finished in 11:21.6 min (41 PASS/0 FAIL)
BUILD SUCCEEDED
RESULTS FOR //xplat/pytorch/mobile/test:test_read_all_mobile_model_configs
PASS    673.8s 41 Passed   0 Skipped   0 Failed   //xplat/pytorch/mobile/test:test_read_all_mobile_model_configs
TESTS PASSED
```

Reviewed By: dhruvbird

Differential Revision: D32068955

fbshipit-source-id: d06c2434a4a69572ab52df31a684e5973b9d551c
2021-11-05 10:41:36 -07:00
02e35ce17b [ONNX] Update onnx function export with comments and clean up (#66817) (#67803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67803

* Addresses comments from #63589

[ONNX] remove torch::onnx::PRODUCER_VERSION (#67107)

Use constants from version.h instead.
This simplifies things since we no longer have to update
PRODUCER_VERSION for each release.

Also add TORCH_VERSION to version.h so that a string is available for
this purpose.

[ONNX] Set `ir_version` based on opset_version. (#67128)

This increases the odds that the exported ONNX model will be usable.
Before this change, we were setting the IR version to a value which may
be higher than what the model consumer supports.
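
A plain export call is therefore enough; the IR version is derived from the requested opset (usage sketch):

```python
import torch

model = torch.nn.Linear(4, 2)
torch.onnx.export(model, torch.randn(1, 4), "model.onnx", opset_version=13)
# ir_version in model.onnx is now chosen to match opset 13
```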

Also some minor clean-up in the test code:
* Fix string replacement.
* Use a temporary file so as to not leave files around in the test
  current working directory.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D32181306

Pulled By: malfet

fbshipit-source-id: 02f136d34ef8f664ade0bc1985a584f0e8c2b663

Co-authored-by: BowenBao <bowbao@microsoft.com>
Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-11-05 10:35:35 -07:00
ace2183195 [FSDP] Address follow up comments for CPU offload (#67813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67813

Address Shen's comments in
https://github.com/pytorch/pytorch/pull/67249/files
ghstack-source-id: 142379312

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32157545

fbshipit-source-id: 3cc2df6d5fa0d3b9383ed3711e7f79729dbb1dda
2021-11-05 10:34:08 -07:00
823ae3a4ff [forward ad] Also check layout of grad matches that of self for inplace over view (#67816)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67800

Currently when the grad is the same layout as base, we try to assign the same tensor to the forward grad of both the base and the view. However, when the layout of the grad is different from the layout of the view, this triggers a copy to be created, and the tangent of the view (after the inplace) will not have a view relationship with the view of the base.

This PR changes it so that we only do the above optimization when the layout of the grad also matches the layout of self.
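
For orientation, the base/view tangent relationship at the Python level (a generic sketch, not the internal optimization being gated):

```python
import torch
import torch.autograd.forward_ad as fwAD

with fwAD.dual_level():
    base = torch.randn(2, 3)
    dual = fwAD.make_dual(base, torch.randn(2, 3))
    view = dual.transpose(0, 1)
    # The tangent of the view should remain a view of the base's tangent.
    primal, tangent = fwAD.unpack_dual(view)
    print(tangent.shape)  # torch.Size([3, 2])
```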

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67816

Reviewed By: malfet

Differential Revision: D32190021

Pulled By: soulitzer

fbshipit-source-id: b1b2c9b332e83f4df5695ee9686ea76447f9305b
2021-11-05 10:26:24 -07:00
13a69d23b1 Add retry logic for test_multitenancy and documentation for find_free_port (#67775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67775

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32142749

Pulled By: H-Huang

fbshipit-source-id: 67ab4ede4f4bff96a1ffd41d55b3be0edc82b1ce
2021-11-05 09:05:12 -07:00
33b7790907 Fix conv_transpose3d backward with non-contiguous grad_out (#67829)
Summary:
Many thanks to Forest Yang (meowmix) from the forum for reporting it with a minimal reproduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67829

Reviewed By: malfet

Differential Revision: D32184786

Pulled By: albanD

fbshipit-source-id: b63dbd3148b5def2109deb2f4612c08f55f59dfb
2021-11-05 08:34:21 -07:00
07a08fb95f Fix typo in LinearLR docs (#67840)
Summary:
The final learning rate should be 0.05, matching the lr passed as the argument to the optimizer, not 0.005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67840

Reviewed By: jbschlosser

Differential Revision: D32187091

Pulled By: albanD

fbshipit-source-id: 8aff691bba3896a847d7b9d9d669a65f67a6f066
2021-11-05 07:16:15 -07:00
53ebccbe78 Fix warnings produced when running test_optim.py (#67756)
Summary:
Fixes part of https://github.com/pytorch/pytorch/issues/67696 by adding calls to `optimizer.step()` in various places.
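
The pattern being added, roughly (stepping the optimizer first silences the order-of-calls warning):

```python
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1)

opt.step()    # step the optimizer first ...
sched.step()  # ... so this no longer warns about the call order
```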

## Notes for reviewers:
- It is not entirely clear which is the right optimizer to step in each case. I have favoured the more explicit approach of creating a set of optimizers and calling step on each of them.
- At the time of writing, the only scheduler without an `optimizer` instance variable is `ChainedScheduler`, which I need to handle specially. I use `hasattr` to do this check. Let me know if this ought to be changed.
- I am opening this PR for review while it only solves part of the issue, as I'd rather get feedback sooner. I think it is fine to fix the issue in several PRs too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67756

Reviewed By: jbschlosser

Differential Revision: D32187864

Pulled By: albanD

fbshipit-source-id: fd0d133bcaa3a24588e5a997ad198fdf5879ff5a
2021-11-05 07:12:13 -07:00
b098264f22 Revert D32063662: [pytorch][PR] TST Adds device transfer into module info tests
Test Plan: revert-hammer

Differential Revision:
D32063662 (da59bd1d13)

Original commit changeset: 0868235a0ae7

fbshipit-source-id: a4f775874faa88be0eb5272dedf3bbc8194ebde6
2021-11-05 07:07:39 -07:00
bb8978f605 Revert D32175963: Converting hardswish to structured kernels with metatensor support
Test Plan: revert-hammer

Differential Revision:
D32175963 (57335a9ee3)

Original commit changeset: f4d749c6aeaf

fbshipit-source-id: 6d68a96cf872c2d7b518c061875b9336bca0043a
2021-11-05 07:04:40 -07:00
4d5338228f Revert D32175960: Moving parts of the Shape Registry into a common file
Test Plan: revert-hammer

Differential Revision:
D32175960 (d04389e6f0)

Original commit changeset: 2e30115ca554

fbshipit-source-id: 27f9889c535e4f7c21c50b2468e1e6650e952d4f
2021-11-05 07:04:37 -07:00
38af37f409 Revert D32175958: Adding Custom Rules to Device Propagation
Test Plan: revert-hammer

Differential Revision:
D32175958 (853298481b)

Original commit changeset: 26a9ef41e10a

fbshipit-source-id: adcc70687b5b454f358b5446bed2c06d04e61435
2021-11-05 07:04:35 -07:00
b1ac7f51a1 Revert D32175957: Adding custom testing based on opinfos input for ops with custom rules.
Test Plan: revert-hammer

Differential Revision:
D32175957 (b8e165e841)

Original commit changeset: 1cb51a7b6cbb

fbshipit-source-id: 29fd0750d9981758436c55eea2de40cdaddfb9be
2021-11-05 07:04:33 -07:00
0c8569bec9 Revert D32175959: Merging the implementations of ClearProfiling
Test Plan: revert-hammer

Differential Revision:
D32175959 (f1754319e3)

Original commit changeset: b335dacce709

fbshipit-source-id: 23d1f75d47f15effc9806bd6e5228007d521b0b3
2021-11-05 07:03:18 -07:00
2f68878a05 [Static Runtime] Add a comment on clients taking ownership of managed output tensors (#67554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67554

This change adds a comment on clients taking ownership of managed output tensors to remind SR developers of how and why that matters.

Test Plan: N/A

Reviewed By: swolchok

Differential Revision: D32013468

fbshipit-source-id: bcc13055c329c61677bdcc76411fe8db44bb2cee
2021-11-04 22:20:49 -07:00
ba9d9d488e Implement padding with slice layer (#67888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67888

Implement padding with a slice layer. The steps are:
1. Reverse-slice and pad with zeros:
   [1, 2] => [2, 1, 0 ... 0]
2. Transpose to restore the original order, completing the pre-pad:
   [2, 1, 0 ... 0] => [0 ... 0, 1, 2]
3. Continue with the post-pad:
   [0 ... 0, 1, 2] => [0 ... 0, 1, 2, 0 ... 0]

Test Plan: buck test mode/dev-nosan caffe2/test/fx2trt/converters:test_pad

Reviewed By: 842974287

Differential Revision: D32160739

fbshipit-source-id: dbbc04d916e23551e3ce9be480283377e9a38b34
2021-11-04 21:25:01 -07:00
daaad47d9c Allow torch::deploy unity embed xar file of any size (#67814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67814

There was previously a limitation on the size of the xar file we can embed into the binary. The payload (the xar file here) is added to the .data section by default by the 'ld -b binary -r' command (which section the payload goes into is hardcoded in ld, by the way; see the code pointer [here](https://github.com/bminor/binutils-gdb/blob/binutils-2_32/bfd/binary.c#L80)). When we link the object file containing the payload with the rest of the executable, we get relocation-out-of-range errors if the overall size of the .text, .data, .bss, etc. sections exceeds 2G. Some relocation entries use 32-bit signed integers, hence the 2G limit.

To solve the issue and mitigate the risk, we designed a mechanism that puts the payload in a customized payload section (.torch_deploy_payload.unity here). The payload section does not participate in relocation or symbol resolution, so in theory it can be as large as the disk allows. Since we don't relocate the payload section, the start/end/size symbols are no longer available or valid, so we have to parse the ELF file ourselves to figure them out.

The mechanism can be used to embed interpreter.so as well. interpreter.so is currently 0.5G, which would limit the other .text/.data/.bss sections of the executable to at most 1.5G. Using this mechanism in this diff keeps interpreter.so from consuming any of that budget. We could also use this mechanism to ship python scripts with our binary rather than freeze them beforehand. These use cases are not handled in this diff.

This diff also improves the experience for simple use cases that do not depend on extra shared libraries in the XAR file (other than the shared libraries for the python extensions themselves). This is mainly to fix the stress test right now, but it also makes other simple cases easier.
ghstack-source-id: 142483327

Test Plan:
# Verify the relocation out of range issue is fixed
Add //caffe2:torch as a dependency to the macro build_unity(name="example", …) in torch/csrc/deploy/unity/TARGETS and run 'buck run mode/opt :unity_demo'; without this diff, you are expected to get relocation errors like:
```
ld.lld: error:
caffe2/c10/util/intrusive_ptr.h:325:(.text._ZN11ska_ordered8detailv317sherwood_v3_tableISt4pairIN3c106IValueES4_ES4_NS3_6detail11DictKeyHashENS0_16KeyOrValueHasherIS4_S5_S7_EENS6_14DictKeyEqualToENS0_18KeyOrValueEqualityIS4_S5_SA_EESaIS5_ESaINS0_17sherwood_v3_entryIS5_EEEE15emplace_new_keyIS5_JEEES2_INSH_18templated_iteratorIS5_EEbEaPSF_OT_DpOT0_+0x4E9): relocation R_X86_64_32S out of range: 2345984168 is not in [-2147483648, 2147483647]; references c10::UndefinedTensorImpl::_singleton
>>> defined in /data/sandcastle/boxes/fbsource/fbcode/buck-out/opt/gen/caffe2/c10/c10#platform009-clang,static/libc10.a(../c10#compile-UndefinedTensorImpl.cpp.o44c44c4c,platform009-clang/core/UndefinedTensorImpl.cpp.o)
```

With the diff, the error above is resolved.

# Pass Stress Test

Also pass existing unit tests for unity.

buck test mode/opt //caffe2/torch/csrc/deploy/unity/tests:test_unity_sum -- --exact 'caffe2/torch/csrc/deploy/unity/tests:test_unity_sum - UnityTest.TestUnitySum' --run-disabled --jobs 18 --stress-runs 10 --record-results

buck test mode/opt //caffe2/torch/csrc/deploy/unity/tests:test_unity_simple_model -- --exact 'caffe2/torch/csrc/deploy/unity/tests:test_unity_simple_model - UnityTest.TestUnitySimpleModel' --run-disabled --jobs 18 --stress-runs 10 --record-results

# Verify debug sections are not messed up

Verified that debug sections are not messed up and GDB still works:
`gdb ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/unity/unity_demo`

```
b main
run
l
c
```

Reviewed By: suo

Differential Revision: D32159644

fbshipit-source-id: a133513261b73551a71acc257f4019f7b5af34a8
2021-11-04 20:52:57 -07:00
5a48868d39 [qnnpack] fix benchmarks after an API update (#67768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67768

We don't need to pass so many padding args after removing support for asymm padding from qnnpack

Test Plan: it builds

Reviewed By: jshen

Differential Revision: D32082204

fbshipit-source-id: 2bfe4c135ad613f0cc267e7e3ab6357731f29bc2
2021-11-04 20:17:05 -07:00
f1754319e3 Merging the implementations of ClearProfiling (#67575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67575

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175959

Pulled By: Gamrix

fbshipit-source-id: b335dacce709a64e3d5779f9c6e9569f86e10748
2021-11-04 19:02:08 -07:00
b8e165e841 Adding custom testing based on opinfos input for ops with custom rules. (#67500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67500

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175957

Pulled By: Gamrix

fbshipit-source-id: 1cb51a7b6cbb75bf3841e3c4caedf88aa94168fe
2021-11-04 19:02:06 -07:00
853298481b Adding Custom Rules to Device Propagation (#66973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66973

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D32175958

Pulled By: Gamrix

fbshipit-source-id: 26a9ef41e10a171be6a8779a4e6014e2e7e3f2c1
2021-11-04 19:02:04 -07:00
d04389e6f0 Moving parts of the Shape Registry into a common file (#66948)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66948

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175960

Pulled By: Gamrix

fbshipit-source-id: 2e30115ca554816166fedddbcdeffbe189eb19a6
2021-11-04 19:02:02 -07:00
57335a9ee3 Converting hardswish to structured kernels with metatensor support (#66899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66899

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175963

Pulled By: Gamrix

fbshipit-source-id: f4d749c6aeaf064084be72361607ea4f3f6bc91d
2021-11-04 19:02:00 -07:00
ec8a71f9ac Dtype Analysis for Unary and Binary ops with Metatensors (#66898)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66898

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175961

Pulled By: Gamrix

fbshipit-source-id: 72721259b900e5a311b6bcb5c350366ba420b734
2021-11-04 19:00:50 -07:00
4b084bc832 Benchmarks for various fusers (#67622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67622

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32171063

Pulled By: bertmaher

fbshipit-source-id: 40d3a7adcc52aba3b051e382ec5ec4ee7e43d81b
2021-11-04 18:57:17 -07:00
31fc9d6539 Introduce version control for tensorrt converter decorator (#67886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67886

Similar to what we have in the torch2trt tensorrt_converter, introduce version-based enablement for fx2trt converters. Upgrading to TRT 8.2 will introduce new op converters as well as deprecate old ones.
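
A hypothetical sketch of version-gated converter registration; the names (`CONVERTERS`, `tensorrt_converter`, `TRT_VERSION`) are illustrative only:

```python
CONVERTERS = {}
TRT_VERSION = (8, 2)  # stand-in for the installed TensorRT version

def tensorrt_converter(op, min_version=(0, 0)):
    def register(fn):
        if TRT_VERSION >= min_version:
            CONVERTERS[op] = fn  # enabled only on new-enough TensorRT
        return fn
    return register

@tensorrt_converter("acc_ops.relu", min_version=(8, 2))
def relu_converter(network, target, args, kwargs, name):
    ...
```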

Test Plan: pass existing unit test

Reviewed By: 842974287

Differential Revision: D32183581

fbshipit-source-id: 6419acada296d24e882efa9fca25eca6349153e4
2021-11-04 17:39:15 -07:00
f5daa9f76b [iOS] Enable ARC for CMake build (#67884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67884

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D32191532

Pulled By: xta0

fbshipit-source-id: a295004f8e7f1b0f5a4ab12ffd9b37c36b80226b
2021-11-04 16:50:46 -07:00
c2ceba8ada [PyTorchEdge] Move all serialize/deserialize files to a separate target (#66805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66805

{F672465642}

DGW:
```
buck query 'allpaths(//xplat/caffe2:torch_mobile_core, //xplat/caffe2:torch_mobile_interpreter)' --output-format dot_compact | pastry
bunnylol dgw paste_id

```

Test Plan:
buck builds to pass

```
buck build fbsource//fbandroid/mode/opt @//fbandroid/mode/messenger //fbandroid/apps/messenger:messenger_staticdi_dextr_splitarsc_dlstr_xzs_for_perftest_redex_optimizedtestableresources_postprocessed_resign //fbandroid/apps/messenger:messenger_staticdi_dextr_splitarsc_dlstr_xzs_for_perftest#unstripped_native_libraries

buck build //xplat/caffe2:torch_mobile_coreAndroid#android-armv7,shared

buck build //xplat/caffe2:torch_commonAndroid#android-armv7,shared

```

DGW:
```
buck query 'allpaths(//xplat/caffe2/fb/runtime:only_flatbuffer_test, //xplat/caffe2:miniz)' --output-format dot_compact | pastry
P464671429: https://www.internalfb.com/intern/paste/P464671429/

bunnylol dgw P464671429
```

loader is decoupled from miniz

```
buck query 'allpaths(//xplat/caffe2/fb/runtime:flatbuffer_loader, //xplat/caffe2:miniz)' --output-format dot_compactdigraph result_graph {
}
```

Reviewed By: iseeyuan

Differential Revision: D31532862

fbshipit-source-id: 51e6880e78e1cafe20c8d90e98037edc3c1b6b11
2021-11-04 15:55:52 -07:00
b0c05297f9 [Static Runtime] Arena allocate StorageImpls for managed tensors (#66130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66130

We're reusing backing storage for these tensors, which is only safe because they have non-overlapping lifetimes. Accordingly, it seems that they can also share their StorageImpl.

ghstack-source-id: 142427752

Test Plan:
benchmarked ctr_mobile_feed local and local_ro:

Using recordio inputs for model 302008423_0

```
swolchok@devbig032 ~/f/fbcode> env MKL_NUM_THREADS=1 OMP_NUM_THREADS=1  > environment^C
swolchok@devbig032 ~/f/fbcode> sudo ~/fbsource2/fbcode/scripts/bertrand/noise/denoise-env.sh \
                                 /tmp/ptvsc2_predictor_benchNov1ArenaAllocateStorageImpls \
                               --scripted_model=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.predictor.disagg.local \
                               --method_name=local.forward --pt_cleanup_activations=1 \
                               --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=2 --warmup_iters=2 \
                                      --num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 --repetitions=5 --recordio_use_ivalue_format=1 --recordio_inputs=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.local.inputs.recordio

Stable
========================================
I1101 14:19:16.473964 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0131. Iters per second: 49.9673
I1101 14:20:12.193130 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0155. Iters per second: 49.9612
I1101 14:21:07.761898 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9751. Iters per second: 50.0624
I1101 14:22:03.218066 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9104. Iters per second: 50.2249
I1101 14:22:58.723256 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.956. Iters per second: 50.1102
I1101 14:22:58.723306 2748837 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.974, standard deviation: 0.043643

ArenaAllocateStorageImpls
========================================
I1101 14:08:57.070914 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9771. Iters per second: 50.0572
I1101 14:09:52.605121 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.924. Iters per second: 50.1907
I1101 14:10:48.098287 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9353. Iters per second: 50.1624
I1101 14:11:43.645395 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9723. Iters per second: 50.0694
I1101 14:12:39.171636 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9673. Iters per second: 50.0819
I1101 14:12:39.171685 2695478 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.9552, standard deviation: 0.0239318

difference: 0.0188 (0.09%), which is less than 1 standard deviation

Stable, local_ro
========================================
I1101 14:26:10.796161 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25991. Iters per second: 793.708
I1101 14:26:12.194727 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26862. Iters per second: 788.26
I1101 14:26:13.591312 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26549. Iters per second: 790.207
I1101 14:26:14.982439 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25943. Iters per second: 794.01
I1101 14:26:16.377033 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25995. Iters per second: 793.68
I1101 14:26:16.377094 2787930 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.26268, standard deviation: 0.00414788

ArenaAllocateStorageImpls, local_ro
========================================
I1101 14:26:45.875073 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20987. Iters per second: 826.536
I1101 14:26:47.207271 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20827. Iters per second: 827.633
I1101 14:26:48.533766 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20023. Iters per second: 833.174
I1101 14:26:49.850610 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19206. Iters per second: 838.884
I1101 14:26:51.172356 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19958. Iters per second: 833.622
I1101 14:26:51.172411 2790009 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.202, standard deviation: 0.00722754

Difference: 0.06 usec/iter (4.8%), which is much more than 1 standard deviation

```

we can see that this is a large relative improvement on local_ro, but no effect on local.

Reviewed By: hlu1

Differential Revision: D31357486

fbshipit-source-id: 229c003677da76e89c659d0e0639002accced76e
2021-11-04 15:43:39 -07:00
01809731bc [Static Runtime] Cache managed tensor Storages (#66638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66638

See comments in code explaining what we're doing here.
ghstack-source-id: 142427750

Test Plan:
Ran ptvsc2_predictor_bench on ctr_mobile_feed local and local_ro net before/after this change on a devserver with turbo off.

Results:

```
stable, local_ro:
========================================
I1014 16:13:52.713300 151733 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.68012. Iters per second: 373.118
I1014 16:14:00.961875 151733 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.66156. Iters per second: 375.719
I1014 16:14:09.163097 151733 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.6449. Iters per second: 378.086
I1014 16:14:17.425621 151733 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.66661. Iters per second: 375.008
I1014 16:14:25.711349 151733 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.67375. Iters per second: 374.006
I1014 16:14:25.711390 151733 PyTorchPredictorBenchLib.cpp:269] Mean milliseconds per iter: 2.66539, standard deviation: 0.0134423

stable, local:
========================================
I1014 15:08:28.547081 3979345 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.42772. Iters per second: 155.576
I1014 15:08:48.276582 3979345 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.3643. Iters per second: 157.127
I1014 15:09:07.978683 3979345 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.3566. Iters per second: 157.317
I1014 15:09:27.875543 3979345 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.42044. Iters per second: 155.752
I1014 15:09:47.558079 3979345 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.34902. Iters per second: 157.505
I1014 15:09:47.558120 3979345 PyTorchPredictorBenchLib.cpp:269] Mean milliseconds per iter: 6.38361, standard deviation: 0.037421

cache storages, local_ro:
========================================
I1014 16:15:42.292997 160496 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.66604. Iters per second: 375.088
I1014 16:15:50.622402 160496 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.68683. Iters per second: 372.186
I1014 16:15:58.901475 160496 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.67028. Iters per second: 374.493
I1014 16:16:07.156373 160496 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.66317. Iters per second: 375.492
I1014 16:16:15.474292 160496 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 2.68394. Iters per second: 372.587
I1014 16:16:15.474334 160496 PyTorchPredictorBenchLib.cpp:269] Mean milliseconds per iter: 2.67405, standard deviation: 0.0106982

cache storages, local:
========================================
I1014 20:53:43.113400 1657168 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.3811. Iters per second: 156.713
I1014 20:54:02.829102 1657168 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.36039. Iters per second: 157.223
I1014 20:54:22.885171 1657168 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.47333. Iters per second: 154.48
I1014 20:54:42.768963 1657168 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.41404. Iters per second: 155.908
I1014 20:55:02.624423 1657168 PyTorchPredictorBenchLib.cpp:252] PyTorch run finished. Milliseconds per iter: 6.4042. Iters per second: 156.147
I1014 20:55:02.624460 1657168 PyTorchPredictorBenchLib.cpp:269] Mean milliseconds per iter: 6.40661, standard deviation: 0.0427168
```

Looks like this diff is neutral or a slight regression, but it is a stepping stone on the way to the following diff.

Reviewed By: hlu1

Differential Revision: D31326711

fbshipit-source-id: a6e0185f24a6264b1af2a51b69243c310d0d48d5
2021-11-04 15:42:22 -07:00
56dda833ff Small updates to RELEASE.md (#65489)
Summary:
- Combine the `xla` and `builder` branch pinning steps and link them to a PR that does it correctly
- Update the example PR for the version bump, as a few files have changed
- Delete the FaceHub step, as it is no longer necessary after a recent update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65489

Reviewed By: zhouzhuojie, seemethere

Differential Revision: D31120498

Pulled By: malfet

fbshipit-source-id: e1a9db2b03243c8d28eeed9888c3653e4460ad07
2021-11-04 15:39:40 -07:00
d5d342b237 Sparse CSR CUDA: Support mixed memory format input for triangular_solve (#66401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66401

This PR fixes the case where the result and input tensors have different strides.
cuSPARSE from CUDA 11.3.1 has a bug: it doesn't use the correct strides when
writing the result. This is "fixed" on the PyTorch side by copying the input
tensor to a tensor with the same strides as the result tensor.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: davidberard98

Differential Revision: D32177966

Pulled By: cpuhrsch

fbshipit-source-id: 118437409df147f04dce02763aff9bfd33f87c63
2021-11-04 15:34:42 -07:00
a20a64af4e Increased tolerance for test_zero_model_parallel tests (#67765)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67764

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67765

Reviewed By: malfet

Differential Revision: D32171621

Pulled By: mrshenli

fbshipit-source-id: 8c34f4714289cb41824f3a18822a28ed670fa0a6
2021-11-04 15:17:45 -07:00
c541c69e89 Fix minor typo in contributing.md (#67855)
Summary:
No issue number; this is a minor change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67855

Reviewed By: malfet

Differential Revision: D32186689

Pulled By: driazati

fbshipit-source-id: 7cda19f66ff1312296d8310922bb0d221df81e46
2021-11-04 14:38:48 -07:00
8bed46ef38 [WIP][LTC] Upstream class Shape (#67672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67672

This commit upstreams class Shape from the lazy_tensor_staging branch.

Test Plan: WIP.

Reviewed By: malfet

Differential Revision: D32095478

Pulled By: alanwaketan

fbshipit-source-id: 61611b12fc079b195833b5b22a6cf73c0935b8b9
2021-11-04 14:12:03 -07:00
e8ac8c005d [NOOP][clangformat][codemod] Enable CLANGFORMAT (#67854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67854

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D32173077

fbshipit-source-id: 10ab4b0afa18c7be4fab3e3564d9b479a7a48cb5
2021-11-04 14:07:57 -07:00
938bab0bfd [PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67865

- Add int version of vectorized PrefixSum
- Use unaligned load/store instructions
- Add exclusive scan version. "exclusive" means that the i-th input element is not included in the i-th sum. For details see https://en.cppreference.com/w/cpp/algorithm/exclusive_scan
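
To make the inclusive/exclusive distinction concrete (plain Python, not the vectorized kernel):

```python
from itertools import accumulate

xs = [3, 1, 4, 1, 5]
inclusive = list(accumulate(xs))  # [3, 4, 8, 9, 14]
exclusive = [0] + inclusive[:-1]  # [0, 3, 4, 8, 9]: i-th sum excludes xs[i]
```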

Test Plan:
```
buck build mode/opt-clang //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 numactl -m 0 -C 5 \
./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
```

For full benchmark results, see P465274613

```
PrefixSumBench/LocalInt/64                            57 ns         56 ns   12414048 GB/s=9.06239G/s
PrefixSumBench/LocalInt/256                          221 ns        221 ns    3160853 GB/s=9.28635G/s
PrefixSumBench/LocalInt/1024                         818 ns        817 ns     857922 GB/s=10.0235G/s
PrefixSumBench/LocalInt/4096                        3211 ns       3210 ns     217614 GB/s=10.2093G/s
PrefixSumBench/LocalInt/16384                      12806 ns      12804 ns      54805 GB/s=10.2364G/s
PrefixSumBench/LocalInt/65536                      51115 ns      51079 ns      13741 GB/s=10.2643G/s
PrefixSumBench/LocalInt/262144                    205974 ns     205912 ns       3401 GB/s=10.1847G/s
PrefixSumBench/LocalInt/1048576                   829523 ns     828859 ns        845 GB/s=10.1207G/s
PrefixSumBench/LocalIntAVX2/64                        45 ns         45 ns   15568113 GB/s=11.3549G/s
PrefixSumBench/LocalIntAVX2/256                      208 ns        208 ns    3371174 GB/s=9.86913G/s
PrefixSumBench/LocalIntAVX2/1024                     893 ns        892 ns     783154 GB/s=9.18629G/s
PrefixSumBench/LocalIntAVX2/4096                    3618 ns       3613 ns     193834 GB/s=9.06838G/s
PrefixSumBench/LocalIntAVX2/16384                  14416 ns      14411 ns      48564 GB/s=9.09543G/s
PrefixSumBench/LocalIntAVX2/65536                  57650 ns      57617 ns      12156 GB/s=9.09952G/s
PrefixSumBench/LocalIntAVX2/262144                230855 ns     230612 ns       3035 GB/s=9.09386G/s
PrefixSumBench/LocalIntAVX2/1048576               924265 ns     923777 ns        758 GB/s=9.08077G/s
PrefixSumBench/LocalIntAVX512/64                      23 ns         23 ns   24876551 GB/s=22.0697G/s
PrefixSumBench/LocalIntAVX512/256                     95 ns         95 ns    7387386 GB/s=21.556G/s
PrefixSumBench/LocalIntAVX512/1024                   435 ns        435 ns    1609682 GB/s=18.8425G/s
PrefixSumBench/LocalIntAVX512/4096                  1815 ns       1815 ns     385462 GB/s=18.0561G/s
PrefixSumBench/LocalIntAVX512/16384                 7479 ns       7476 ns      93660 GB/s=17.5335G/s
PrefixSumBench/LocalIntAVX512/65536                30171 ns      29879 ns      23430 GB/s=17.5468G/s
PrefixSumBench/LocalIntAVX512/262144              125805 ns     125631 ns       5570 GB/s=16.6929G/s
PrefixSumBench/LocalIntAVX512/1048576             504216 ns     503983 ns       1384 GB/s=16.6446G/s
PrefixSumBench/ExclusiveScanIntAVX512/64              23 ns         23 ns   30058295
PrefixSumBench/ExclusiveScanIntAVX512/256            101 ns        101 ns    7398498
PrefixSumBench/ExclusiveScanIntAVX512/1024           435 ns        434 ns    1403877
PrefixSumBench/ExclusiveScanIntAVX512/4096          1979 ns       1978 ns     354016
PrefixSumBench/ExclusiveScanIntAVX512/16384         7828 ns       7819 ns      89551
PrefixSumBench/ExclusiveScanIntAVX512/65536        31206 ns      31192 ns      22408
PrefixSumBench/ExclusiveScanIntAVX512/262144      130106 ns     130023 ns       5388
PrefixSumBench/ExclusiveScanIntAVX512/1048576     525515 ns     524976 ns       1244
```

Reviewed By: navahgar, swolchok

Differential Revision: D32011740

fbshipit-source-id: 7962de710bd588291dd6bf0c719f579c55f7c063
2021-11-04 14:00:19 -07:00
641ba36a4e fix annotation for Demultiplexer (#65998)
Summary:
cc SsnL VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65998

Reviewed By: bdhirsh

Differential Revision: D32145926

Pulled By: ejguan

fbshipit-source-id: 60be3126fb9e73b8631b5040676264504e926707
2021-11-04 13:44:02 -07:00
da59bd1d13 TST Adds device transfer into module info tests (#65488)
Summary:
Follow up to  https://github.com/pytorch/pytorch/issues/61935

This PR adds device to device transfer test into `ModuleInfo`.

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65488

Reviewed By: mruberry

Differential Revision: D32063662

Pulled By: jbschlosser

fbshipit-source-id: 0868235a0ae7e5b6a3e4057c23fe70753c0946d2
2021-11-04 12:50:33 -07:00
3d4a6ff15d Revert D32154788: Move Concat Linear out of Optimize Numerics
Test Plan: revert-hammer

Differential Revision:
D32154788 (ea94dde573)

Original commit changeset: faa6465c89b3

fbshipit-source-id: 0dcaa65268b68ed01e6a5bc7b73ade1f51163b33
2021-11-04 12:20:02 -07:00
86aea79217 Revert D32154786: Fix Freezing Docs Parameters
Test Plan: revert-hammer

Differential Revision:
D32154786 (db15a7c0b3)

Original commit changeset: d8a2b4f39ff4

fbshipit-source-id: 657e3974a8e0ca71790adc1b031a87b7c497ea25
2021-11-04 12:20:00 -07:00
279af1a668 Revert D32154787: Formatted with Black
Test Plan: revert-hammer

Differential Revision:
D32154787 (08d630b9a6)

Original commit changeset: 6a95691c4ad9

fbshipit-source-id: 2dbcf2395071433731683f685a0351fa8604d620
2021-11-04 12:18:37 -07:00
08d630b9a6 Formatted with Black (#67792)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67792

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32154787

Pulled By: Gamrix

fbshipit-source-id: 6a95691c4ad9d997071bb4ffc00b5eab30f90b81
2021-11-04 11:32:26 -07:00
db15a7c0b3 Fix Freezing Docs Parameters (#67201)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67201

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32154786

Pulled By: Gamrix

fbshipit-source-id: d8a2b4f39ff477f5131c02fe8c0b1a25339ce158
2021-11-04 11:32:24 -07:00
ea94dde573 Move Concat Linear out of Optimize Numerics (#67196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67196

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32154788

Pulled By: Gamrix

fbshipit-source-id: faa6465c89b3676d6b1ff7c20a677738a7fbdf88
2021-11-04 11:30:39 -07:00
6f0a1f2b8d Only set sccache_epilogue to run on build job exits (#67798)
Summary:
Fixes:
* https://github.com/pytorch/pytorch/issues/65431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67798

Reviewed By: malfet

Differential Revision: D32174810

Pulled By: boyuantan

fbshipit-source-id: 072fdc042b56e541a877074120d41645c98e41f5
2021-11-04 11:11:02 -07:00
618bab593c .github: Output expected vs. actual (#67703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67703

This script failed on me in CI without actually telling me what was wrong,
so I'm adding some more output here to show the actual vs. the expected
result.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D32112898

Pulled By: seemethere

fbshipit-source-id: dfc9a82c709d52e0dde02d1e99a19eecc63c5836
2021-11-04 11:02:43 -07:00
90d311b268 [RPC] Add exception logging to constValue() (#67802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67802

In RPC C++ code, we might sometimes call constValue() when the future actually has an exception, and in unittests we want to assert on the exception. What happens is that we get a message basically saying "!eptr_" which indicates there is some exception but we don't know what it is.

This diff simply adds logging for the exception and notes that `value` should be preferred over `constValue` when the future can hold an exception. The contract of `constValue` (throwing when `eptr_` is set) still holds; it is just enhanced with additional logging.
ghstack-source-id: 142375391

Test Plan: Added UT

Reviewed By: mrshenli

Differential Revision: D32156552

fbshipit-source-id: 4dd5e73b92173209074c104a4b75c2021e20de4b
2021-11-04 10:04:09 -07:00
7c739e1ab9 Resubmit #67161 (#67735)
Summary:
Skip building extensions on Windows, following https://github.com/pytorch/pytorch/pull/67161#issuecomment-958062611

Related issue: https://github.com/pytorch/pytorch/issues/67073

cc ngimel xwang233 ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67735

Reviewed By: bdhirsh

Differential Revision: D32141250

Pulled By: ngimel

fbshipit-source-id: 9bfdb7cf694c99f6fc8cbe9033a12429b6e4b6fe
2021-11-04 09:59:30 -07:00
8b0c2c18eb Fix pretrained=True for test_pt_onnx_trt (#67818)
Summary:
Addresses https://github.com/pytorch/pytorch/pull/66312#issuecomment-960357403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67818

Reviewed By: malfet

Differential Revision: D32161208

Pulled By: janeyx99

fbshipit-source-id: 076e52ddc8718c74eb2941e867d92bfa4fe70f80
2021-11-04 09:49:42 -07:00
af1bd88fc4 Allow scalars for aliased binary ops {multiply, subtract, divide} (#65937)
Summary:
https://github.com/pytorch/pytorch/issues/65868 pointed out that the long-form aliases of some binary ops (`multiply`, `subtract`, `divide`) don't match the behavior of `mul`, `sub`, and `div` when it comes to handling scalar inputs. This PR adds the missing registration in `python_arg_parser.cpp` to resolve this.
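
A sketch of the behavior change as described (the pre-fix failure mode is per the linked issue):

```python
import torch

t = torch.tensor([4.0, 6.0])
t.sub(1)       # the short form always accepted a Python scalar
t.subtract(1)  # the alias previously could reject it; now both are equivalent
```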

CC ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65937

Reviewed By: malfet

Differential Revision: D32156580

Pulled By: ngimel

fbshipit-source-id: b143cf7119a8bb51609e1b8734204edb750f0210
2021-11-04 09:36:45 -07:00
bd8feb33d4 Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801)
Summary:
Running one test in test_distributed_spawn is a bit confusing but possible. Add documentation to the CONTRIBUTING.md for this.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67801

Reviewed By: mrshenli

Differential Revision: D32157700

Pulled By: rohan-varma

fbshipit-source-id: a1d10f2fb5f169b46c6d15149bf949082d9bd200
2021-11-04 08:54:31 -07:00
4262c8913c Remove native_functions.yaml dependency from TensorTopK.cu (#66794)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66794

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D31856104

Pulled By: dagitses

fbshipit-source-id: 2b9c0e1072455c5019c6f681faa3de848b3dae46
2021-11-04 08:32:06 -07:00
927da4d32f Remove native_functions.yaml dependency from Sort.cu (#66793)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66793

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31856100

Pulled By: dagitses

fbshipit-source-id: 1469ce1deb4124f2a9e160a8e3298d56ac3f6561
2021-11-04 08:30:40 -07:00
61ed9285dd Automated submodule update: tensorpipe (#67845)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: d2aa3485e8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67845

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D32170821

fbshipit-source-id: 1958e824a9f02c5178fa5d4a73a171dedefc540c
2021-11-04 08:24:05 -07:00
cfd998c197 Remove ProcessGroup RPC backend placeholder as part of 1.11 (#67363)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67363

The ProcessGroup RPC backend is deprecated. In 1.10 it threw an error to the user to be more user friendly. This PR now removes it completely.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D32138321

Pulled By: H-Huang

fbshipit-source-id: b4f700d8f1b1d46ada7b5062d3f754646571ea90
2021-11-04 07:57:58 -07:00
8e1ead8e4d Fix the kl_div docs (#67443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67443

Fixes https://github.com/pytorch/pytorch/issues/57459

After discussing the linked issue, we resolved that `F.kl_div` computes
the right thing, consistent with the rest of the losses in
PyTorch.

To avoid any confusion, these docs add a note discussing how the PyTorch
implementation differs from the mathematical definition and the reasons
for doing so.

These docs also add an example that may further help understanding the
intended use of this loss.
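
A minimal usage sketch (illustrative, not taken from the docs themselves): the key point of the added note is that `input` is expected to be log-probabilities while `target` is given as probabilities by default.

```
import torch
import torch.nn.functional as F

# F.kl_div computes target * (log(target) - input) pointwise, so `input`
# must already be log-probabilities; `target` stays in probability space
# unless log_target=True is passed.
inp = torch.log_softmax(torch.randn(3, 5), dim=1)
tgt = torch.softmax(torch.randn(3, 5), dim=1)
loss = F.kl_div(inp, tgt, reduction="batchmean")
```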

cc brianjo mruberry

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D32136888

Pulled By: jbschlosser

fbshipit-source-id: 1ad0a606948656b44ff7d2a701d995c75875e671
2021-11-04 07:09:08 -07:00
04fe4382ec Automated submodule update: tensorpipe (#67769)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: caa2ccb394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67769

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D32138256

fbshipit-source-id: dfe4c73ae25c8f362f2917dd7594bdcd418c2a0d
2021-11-04 01:13:19 -07:00
b8d365ca3a ci fix (#67826)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67826

Reviewed By: Chillee

Differential Revision: D32164770

Pulled By: mruberry

fbshipit-source-id: c1de7e6db6d0cb1761388f1ea0178dbff3fe6dc8
2021-11-04 00:16:47 -07:00
1baed45c6b [fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663

Mostly follows the example of quantized::linear (D28428734 (4d7abdbdad)) to enable an out-variant for quantized::linear_dynamic_fp16.

The motivation: during the MP tab CTR PyTorch model migration, we observed that the quantized::linear_dynamic_fp16 operator has the highest cost but does not yet have an out-variant enabled: https://fburl.com/phabricator/b5juus2d

Test Plan:
buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench

  sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable

  MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry

before: P465201159

  0.929067 ms.     31.808%. quantized::linear_dynamic_fp16 (16 nodes)
  0.921679 ms.    31.7324%. quantized::linear_dynamic_fp16 (16 nodes)
  0.919127 ms.    31.7404%. quantized::linear_dynamic_fp16 (16 nodes)

after: P465203015

  0.90898 ms.    31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.9127 ms.      30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.879148 ms.    31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant)

unit test logic refers https://fburl.com/code/vv0rry13

  buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D32001168

fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3
2021-11-03 22:39:04 -07:00
99c7a9f09d fix bfloat16 autocast skip (#67822)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67822

Reviewed By: mruberry

Differential Revision: D32162605

Pulled By: ngimel

fbshipit-source-id: eb5ccf6c441231e572ec93ac8c2638d028abecad
2021-11-03 21:02:37 -07:00
2486061c72 [JIT] make x (+ or -) 0 and x (* or /) 1 peepholes type promotion aware (#67688)
Summary:
Some of the "no-ops" are not actually no-ops because they can change the dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67688

Reviewed By: davidberard98

Differential Revision: D32104601

Pulled By: eellison

fbshipit-source-id: ccb99179a4b30fd20b5a9228374584f2cdc8ec21
2021-11-03 20:11:46 -07:00
88d86de7d8 Add lint to ensure all test files have headers with ownership info (#66826)
Summary:
UPDATE: CI should be green now with the added files.

This should fail for now, but will pass when all action for https://github.com/pytorch/pytorch/issues/66232 is done.

Example failure run: https://github.com/pytorch/pytorch/runs/4052881947?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66826

Reviewed By: seemethere

Differential Revision: D32087209

Pulled By: janeyx99

fbshipit-source-id: ad4b51e46de54f23aebacd592ee67577869f8bb6
2021-11-03 18:21:49 -07:00
2766662ca9 [PyTorch][2/N] Basic implementation of ShardedEmbeddingBag using ShardedTensor. (#67188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67188

This diff/PR is trying to implement the ShardedEmbeddingBag using the ShardedTensor.

We support both row-wise and column-wise sharding of the embedding bag. The detailed logic can be found in the comment.

Several caveats:
1. Only the sharding of one weight is supported now.
2. We support a limited set of input params for the op; support for more params is on the way.
3. We only support chunk sharding for now.
4. We only support a single local shard per rank for now.

Some other changes include:
1. Refactor the ShardedEmbedding code so that the common logic can be reused.
2. Fix tiny typos and a corner case in the API `get_chunked_dim_size`, where it would return -1 if we set dim_size = 5, split_size = 2, idx = 3. (This is a valid case because when chunks = 4 and dim_size = 5, the split_size is 2.)
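
A minimal standalone sketch of the corrected corner case (hypothetical helper written for illustration; the real function lives in the sharding spec utilities):

```
# With dim_size=5 and split_size=2 there are 4 chunks of sizes [2, 2, 1, 0],
# so idx=3 must yield 0 rather than -1.
def get_chunked_dim_size(dim_size, split_size, idx):
    return max(min(dim_size, split_size * (idx + 1)) - split_size * idx, 0)

assert [get_chunked_dim_size(5, 2, i) for i in range(4)] == [2, 2, 1, 0]
```
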
ghstack-source-id: 142325915

Test Plan: Unit test and CI

Reviewed By: pritamdamania87

Differential Revision: D31749458

fbshipit-source-id: ed77e05e4ec94ef1a01b1feda8bbf32dc5d5da1b
2021-11-03 17:39:18 -07:00
fd77fff0b1 [FSDP] customizable backend in test (#67135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67135

Add ability to use env var backend for quicker testing (and gloo2 in
the future)
ghstack-source-id: 142274304

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31878285

fbshipit-source-id: 80ae7107cd631a1a15ebc23262b27d8192cfe4b6
2021-11-03 15:45:52 -07:00
83e8612d11 Clean up test autograd (#67413)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/66066

This PR:
 - cleans up op-specific testing from test_autograd. test_autograd should be reserved for testing generic autograd functionality
 - tests related to an operator are better colocated
 - see the tracker for details

What to think about when moving tests to their correct test suite:
 - naming: make sure it's not too generic
 - parametrization: sometimes we need to add/remove a device/dtype parameter
 - whether the test can be merged with existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67413

Reviewed By: jbschlosser, albanD

Differential Revision: D32031480

Pulled By: soulitzer

fbshipit-source-id: 8e13da1e58a38d5cecbfdfd4fe2b4fe6f816897f
2021-11-03 15:26:09 -07:00
ca445645f9 Revert D31902471: [nnc] Add support for dynamic shapes in TensorExprKernel
Test Plan: revert-hammer

Differential Revision:
D31902471 (15a3c374e2)

Original commit changeset: d2729a38ba1a

fbshipit-source-id: 4c05de82e626bbf744df84fd2b914b66fd165a19
2021-11-03 14:48:12 -07:00
603116a6ae [Core ML][easy] Assign missing properties to the executor (#67737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67737

As title says
ghstack-source-id: 142277212

Test Plan:
- buck test pp-ios
- circleci

Reviewed By: hanton

Differential Revision: D32123661

fbshipit-source-id: eff3068669f8fdc573dc81b04bcc20ef153d8c4a
2021-11-03 14:15:53 -07:00
fddfb81dd0 Add BF16 type to _autocast_to_full_precision (#67707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67707

https://github.com/pytorch/pytorch/pull/63939/files has added FP16 support to torchscript.

This adds the BF16 dtype when doing the full-precision conversion.

Test Plan: Unit test. Also tested BF16 locally on A100 using MLP model.

Reviewed By: idning

Differential Revision: D32027152

fbshipit-source-id: b2a5ff2b22ea1e02306b0399f2b39b8493be4f45
2021-11-03 14:06:50 -07:00
05e17e7ff6 Add API usage logging for several other RPC APIs. (#67722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67722

ghstack-source-id: 142259452

Test Plan: waitforbuildbot

Reviewed By: jaceyca, fduwjj

Differential Revision: D32118872

fbshipit-source-id: 041ab5601221b1846c56ce4bb63364bec9ad28b0
2021-11-03 14:02:00 -07:00
5fd93fb5f8 broaden retries on TestHub (#67779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67779

Not all flaky failures from this test are URLErrors; I think we should
err on the side of being expansive with retries here.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D32145434

Pulled By: suo

fbshipit-source-id: 3c3274b2080681fcafb3ea6132e420605f65c429
2021-11-03 13:48:58 -07:00
89b02fc70b [StaticRuntime][Easy] Correct typos in test_static_runtime (#67739)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67739

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: mikeiovine

Differential Revision: D32125879

fbshipit-source-id: bd989e5088edff87624b858bd9045dfe9da3fbe7
2021-11-03 13:24:46 -07:00
4d601a1c36 codegen: Split up source, header and Declarations.yaml generation (#67497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67497

This allows more of the code-generation to happen in parallel, whereas
previously all codegen was serialized.

Test Plan: Imported from OSS

Reviewed By: dagitses, mruberry

Differential Revision: D32027250

Pulled By: albanD

fbshipit-source-id: 6407c4c3e25ad15d542aa73da6ded6a309c8eb6a
2021-11-03 13:20:54 -07:00
fe91906ad7 Remove Declarations.yaml dependency from gen_autograd (#67496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67496

gen_autograd.py doesn't use `Declarations.yaml` any more, and removing
the dependency allows it to run in parallel with
`tools/codegen/gen.py`.

Test Plan: Imported from OSS

Reviewed By: dagitses, ejguan

Differential Revision: D32027251

Pulled By: albanD

fbshipit-source-id: 2cc0bbe36478e6ec497f77a56ab8d01c76145703
2021-11-03 13:19:24 -07:00
9b1caca185 [SR] Macro to clean up c10::Symbol maps in passes (#67484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67484

Maps from `c10::Symbol -> c10::Symbol` can be hard to parse when `fromQualString` is scattered everywhere. I've been annoyed by this issue many times when rebasing, and have even messed up `FuseListUnpack` a few times.

Introduce a macro to make it easier to see what maps to what.

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D32004451

fbshipit-source-id: 1086254c8403a0880d014512c439edbefa6fa015
2021-11-03 12:57:07 -07:00
0eaa01ead1 [SR] Add EliminateTrivialEquallySplit graph pass (#67166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67166

This optimization is not really the same thing as `FuseListUnpack`, and mixing the logic in that pass is confusing and error-prone. It should really be its own pass.

It's slower since we have to do another pass over the graph, but this is not perf critical code; readability is more important.

Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D31887458

fbshipit-source-id: 289e281d512435861fccfe19f017751ad015688c
2021-11-03 12:57:05 -07:00
6cc6a5fd9d Fix a bug in TorchBench ondemand CI. (#67743)
Summary:
Use the main branch when TorchBench branch is not specified.

RUN_TORCHBENCH: soft_actor_critic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67743

Reviewed By: seemethere

Differential Revision: D32142663

Pulled By: xuzhao9

fbshipit-source-id: 160227835543b8e55c970025073839bf0f03aa81
2021-11-03 12:55:52 -07:00
f455030931 Adding a docstring for memoryless in observer args (#67690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67690

see title [skip ci]

Test Plan:
python setup.py develop

Imported from OSS

Reviewed By: ejguan

Differential Revision: D32107512

fbshipit-source-id: da5668339716d44720672f7b71a991b23530461e
2021-11-03 12:46:44 -07:00
98be5216e2 Revert D32104006: [pytorch][PR] Added forward derivatives for neg, diag, inverse, linalg_eig
Test Plan: revert-hammer

Differential Revision:
D32104006 (88c61b8d06)

Original commit changeset: 1f6ace09ee3e

fbshipit-source-id: f9f950b4177e1fe29b9059f4b5dfb9c8c67f479a
2021-11-03 12:40:00 -07:00
6df0d7d502 [lint] add basic lintrunner compatibility (#67110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67110

Adds support for using lintrunner with:
- clang-format
- clang-tidy
- flake8
- mypy

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D32145555

Pulled By: suo

fbshipit-source-id: 2150348e26fba4ae738cd0b9684b2889ce0f1133
2021-11-03 12:35:28 -07:00
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
a5b57c9433 Avoid prematurely casting GEMM parameters alpha, beta to scalar_t (#67633)
Summary:
stas00 uncovered an issue where certain half-precision GEMMs would produce outputs that looked like the result of strange rounding behavior (e.g., `10008.` in place of `10000.`). ptrblck suspected that this was due to the parameters being downcasted to the input types (which would reproduce the problematic output). Indeed, the GEMM and BGEMM cublas wrappers are currently converting the `alpha` and `beta` parameters to `scalar_t` (which potentially is reduced precision) before converting them back to `float`. This PR changes the "ARGTYPE" wrappers to use `acc_t` instead and adds a corresponding test.
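
A sketch of the symptom on a CUDA device (values chosen so the exact product is representable in fp16; illustrative, not the original reproducer):

```
import torch

# Every entry of the product is 64 * 12.5 * 12.5 = 10000 exactly, and
# 10000 is representable in fp16, so a value like 10008. here would
# indicate the accumulation was forced into reduced precision.
a = torch.full((64, 64), 12.5, dtype=torch.half, device="cuda")
b = torch.full((64, 64), 12.5, dtype=torch.half, device="cuda")
print((a @ b)[0, 0])  # expected: tensor(10000., dtype=torch.float16)
```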

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67633

Reviewed By: mruberry

Differential Revision: D32076474

Pulled By: ngimel

fbshipit-source-id: 2540d9b9d0195c17d07d1161374fb6a5850779d5
2021-11-03 12:01:04 -07:00
3f33ada8d5 .github: Forward fix generating GHA workflows (#67777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67777

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D32143785

Pulled By: seemethere

fbshipit-source-id: fb129244bdd46ffda05ed51b16183395152d7296
2021-11-03 11:36:27 -07:00
15a3c374e2 [nnc] Add support for dynamic shapes in TensorExprKernel (#67197)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67197

Test Plan: Imported from OSS

Reviewed By: eellison, ZolotukhinM

Differential Revision: D31902471

Pulled By: navahgar

fbshipit-source-id: d2729a38ba1ac607ff07f516ed56fbd9085715dc
2021-11-03 11:24:17 -07:00
88c61b8d06 Added forward derivatives for neg, diag, inverse, linalg_eig (#67339)
Summary:
See also discussion in https://github.com/pytorch/pytorch/issues/10223, starting from [this](https://github.com/pytorch/pytorch/issues/10223#issuecomment-949499666) comment

The formulas for the derivatives are taken from https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf.

As indicated, the method linalg_eig_jvp should be used instead of linalg_eig_jvp_eigenvalues and linalg_eig_jvp_eigenvectors in the future. Due to a codegen limitation, this is not yet possible.

CC albanD Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67339

Reviewed By: ejguan

Differential Revision: D32104006

Pulled By: albanD

fbshipit-source-id: 1f6ace09ee3e737b99520543b30550601809ceb5
2021-11-03 11:21:54 -07:00
a23814577b Overload TestCase not vanilla TestCase for some elastic tests (#67700)
Summary:
Addresses a bit of https://github.com/pytorch/pytorch/issues/66903

Fixes it so that https://github.com/pytorch/pytorch/issues/66207 can be properly disabled

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67700

Reviewed By: H-Huang

Differential Revision: D32116908

Pulled By: janeyx99

fbshipit-source-id: 205ff68a7408609cfced2357fd99f41949ef6390
2021-11-03 11:14:52 -07:00
201f7d330a Remove duplicate check in distributions arg validation (#67741)
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/66800. (Duplicate of https://github.com/pytorch/pytorch/issues/67725 against pytorch/pytorch so as to trigger TorchBench)

https://github.com/pytorch/pytorch/issues/61056 added a more verbose error message for distributions failing argument validation. However, it did not replace the earlier error check as was originally intended and was flagged by xuzhao9 as being the potential cause of a perf regression in `test_eval[soft_actor_critic-cuda-eager]`.

xuzhao9: Is there a way for me to check if this resolves the perf issue you mentioned?

cc VitalyFedyunin ngimel

Note that existing tests already check for the error message and should verify that the removed lines are redundant.

RUN_TORCHBENCH: soft_actor_critic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67741

Reviewed By: neerajprad

Differential Revision: D32135675

Pulled By: xuzhao9

fbshipit-source-id: 37dfd3ff53b95017c763371979ab3a2c302a72b9
2021-11-03 10:41:41 -07:00
1ffd43cf0c generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit migrated to GHA (#67695)
Summary:
In scope of https://github.com/pytorch/pytorch/issues/67301. Main changes:
* generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit deleted from CircleCI
* pytorch_android_gradle_custom_build_single removed since it is no longer used
* generated-pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit added to GHA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67695

Reviewed By: malfet, seemethere, ejguan

Differential Revision: D32115620

Pulled By: b0noI

fbshipit-source-id: 113d48303c090303ae13512819bac2f069a2913f
2021-11-03 10:29:37 -07:00
4a106e41e9 [fx2trt] Add torch.nn.function.pad support for fx2trt (#67498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67498

Add acc_ops.pad and a converter for it. We want to try padding the convolution channel dimension to get better int8 performance.

This one only supports padding the last two dimensions, though. Starting from TensorRT 8.2, it's suggested to use the Slice layer to do padding, but this is nice to have for older-version support.
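
In eager terms, the supported pattern looks like this (sketch):

```
import torch
import torch.nn.functional as F

# Padding the last two dimensions of a 4-D NCHW tensor; the pad tuple
# is (w_left, w_right, h_top, h_bottom).
x = torch.randn(1, 3, 8, 8)
y = F.pad(x, (1, 1, 2, 2))
print(y.shape)  # torch.Size([1, 3, 12, 10])
```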

Test Plan: buck test mode/dev-nosan caffe2/test/fx2trt/converters:test_pad

Reviewed By: wushirong

Differential Revision: D32006072

fbshipit-source-id: 96c3aa2aec2d28345d044a88bee2f46aba5cca0e
2021-11-03 10:21:08 -07:00
383c1f51b1 [nnc] Fixed handling of 0-sized tensors in cat (#67734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67734

The implementation of the `aten::cat` op in NNC has to ignore tensors that are 0-sized in any dimension.
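
The eager-mode behavior the lowering must match (sketch):

```
import torch

a = torch.randn(2, 3)
b = torch.randn(0, 3)  # 0-sized along the concatenation dimension
print(torch.cat([a, b], dim=0).shape)  # torch.Size([2, 3]); b contributes nothing
```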

Test Plan: `buck test mode/dev-nosan //caffe2/test/cpp/tensorexpr:tensorexpr -- --exact 'caffe2/test/cpp/tensorexpr:tensorexpr - Kernel.CatWithEmptyInputs'`

Reviewed By: ZolotukhinM

Differential Revision: D32122171

fbshipit-source-id: 90c697813bc504664673cdc262df6e7ce419c655
2021-11-03 10:16:16 -07:00
31cf3d6aad Fix adaptive_max_pool2d for channels-last on CUDA (#67697)
Summary:
Fix https://github.com/pytorch/pytorch/issues/67239

The CUDA kernels for `adaptive_max_pool2d` (forward and backward) were written for contiguous output. If outputs are non-contiguous, first create a contiguous copy and let the kernel write output to the contiguous memory space. Then copy the output from contiguous memory space to the original non-contiguous memory space.
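
A sketch of an input that exercises the non-contiguous path (requires CUDA; illustrative shapes):

```
import torch
import torch.nn.functional as F

# A channels-last input produces a channels-last output, which is
# non-contiguous in the default sense and now goes through a contiguous
# staging copy inside the kernel wrapper.
x = torch.randn(2, 3, 8, 8, device="cuda").to(memory_format=torch.channels_last)
out = F.adaptive_max_pool2d(x, (4, 4))
print(out.is_contiguous(memory_format=torch.channels_last))  # True
```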

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67697

Reviewed By: ejguan

Differential Revision: D32112443

Pulled By: ngimel

fbshipit-source-id: 0e3bf06d042200c651a79d13b75484526fde11fe
2021-11-03 09:47:29 -07:00
ff5c61a74e [TensorExpr] Add lowering for aten::max (reduction). (#66519)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66519

Differential Revision:
D31590853
D31590853

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: a702621621f681d7f5392912e8a77ca124e14170
2021-11-03 09:44:09 -07:00
00afe9ba7b [TensorExpr] Add lowering for aten::embedding. (#66518)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66518

Differential Revision:
D31590855
D31590855

Test Plan: Imported from OSS

Reviewed By: pbelevich

Pulled By: ZolotukhinM

fbshipit-source-id: aace0a87b1649330dae44182f7873aca27160d64
2021-11-03 09:44:07 -07:00
008a58d226 [TensorExpr] Add lowering for aten::conv1d. (#66517)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66517

Differential Revision:
D31590856
D31590856

Test Plan: Imported from OSS

Reviewed By: pbelevich

Pulled By: ZolotukhinM

fbshipit-source-id: c05a37d8741acd0606c2adb8d6cfeb1f57bc8aa0
2021-11-03 09:44:05 -07:00
d58ef2bbff [TensorExpr] Fix lowering for aten::softmax for the case when dtype parameter is None. (#66516)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66516

Differential Revision:
D31590858
D31590858

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 0aeee7a5be64b3b9c8fa00aacb1a94031a7e25d1
2021-11-03 09:42:48 -07:00
ea4d983885 Modify "gemm" code to enable access to "sbgemm_" routine in OpenBLAS (#58831)
Summary:
OpenBLAS recently added support for bfloat16 GEMM, so this change has PyTorch call out to OpenBLAS for that, like it does for single and double precision

Our goal is to try to enable PyTorch to make calls to "sbgemm" in OpenBLAS.

We are prepared (if it is your preference) to add fences to the code to limit this change to the Power architecture,
but our first instinct is that anyone on any architecture whose OpenBLAS library enables access to sbgemm
should be able to use this code. (But again, as we are just starting to modify PyTorch, we respect your guidance!)

(there is no issue number related to this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58831

Reviewed By: albanD

Differential Revision: D29951900

Pulled By: malfet

fbshipit-source-id: 3d0a4a638ac95b2ff2e9f6d08827772e28d397c3
2021-11-03 08:53:27 -07:00
05d1dcc14c Split channels_last test cases for tensor conversion OpInfos (#67368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67368

This PR adds an additional test variant for the tensor conversion
functions (bfloat16, char, long, ...) that tests channels_last. This is
because some backends (mostly just functorch right now) don't have
channels-last handling and may want to test that separately from the
more general case of these operations.
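
The property the new variant checks, stated in eager terms (sketch):

```
import torch

# Tensor conversion ops are expected to preserve a channels_last layout:
x = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
y = x.bfloat16()
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```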

Test Plan: - wait for tests

Reviewed By: mruberry

Differential Revision: D31972959

Pulled By: zou3519

fbshipit-source-id: 68fea46908b2cdfeb0607908898bb8f9ef25b264
2021-11-03 07:39:41 -07:00
92a85ecbab add a quantized hardsigmoid inplace variant (#65740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65740

fp32 hardsigmoid supports inplace. This PR adds the inplace support to the quantized
hardsigmoid function, to make the signatures match.
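
A sketch of the now-matching signatures (the quantized functional path and its inplace keyword are assumptions based on the API of this era):

```
import torch
import torch.nn.quantized.functional as qF

x = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0,
                              dtype=torch.quint8)
y = qF.hardsigmoid(x, inplace=True)  # mirrors the fp32 F.hardsigmoid signature
```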

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qhardsigmoid
```

Reviewed By: supriyar

Differential Revision: D31992282

Pulled By: vkuzo

fbshipit-source-id: f6be65d72954ab8926b36bb74a5e79d422fbac90
2021-11-03 07:35:31 -07:00
e32d7f7525 ATen | Fix potential crash if MTLCreateSystemDefaultDevice return nil (#66859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66859

`MTLCreateSystemDefaultDevice` can return `nil`. If that happens then inside `createDeviceInfo`, we'll crash trying to convert the `nullptr` from `device.name.UTF8String` into a `std::string`.

Let's fix it by returning early in setup if there's no Metal device. But also make `createDeviceInfo` safe if we do pass in `nil`.

Test Plan: * CircleCI

Reviewed By: xta0

Differential Revision: D31759690

fbshipit-source-id: 74e878ab5b8611250c4843260f1d2e4eab22cdaf
2021-11-03 03:03:45 -07:00
510336499b [PyTorch][Static Runtime] Separate overlap checks for easier debugging (#66637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66637

We can give more information when verify_no_memory_overlap would fail by separating the DCHECK.
ghstack-source-id: 142226105

Test Plan: fitsships

Reviewed By: d1jang

Differential Revision: D31517151

fbshipit-source-id: 8cbc324c27f6b4db4489d1bd469d37b1d8ae6ce1
2021-11-02 23:59:04 -07:00
3db536e55e add jit_trace_module python binding (#67425)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67425

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31998564

Pulled By: Krovatkin

fbshipit-source-id: f7e38c8c3f560f2c4e5ed62e1acae2c100efebd4
2021-11-02 23:55:23 -07:00
a8757cdd70 type inputs (#67424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67424

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31998565

Pulled By: Krovatkin

fbshipit-source-id: 8a2b8b3f13a361fe8fce7c7c930bbfd357ef8ac1
2021-11-02 23:55:21 -07:00
d352587210 add a few convenience helpers to removeAllXXX to Block and Node (#67423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67423

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31998566

Pulled By: Krovatkin

fbshipit-source-id: ed435d5c35e44ab2676c47b43d6e2aa8e79d9ab2
2021-11-02 23:54:02 -07:00
7f3326a6d2 [FSDP] CPU offload resubmit (#67249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67249

Implements CPU offload for model parameters in FSDP.

- A CPU offload class with only the offload_params attribute is created.
- If this is specified in the FSDP ctor, model parameters are moved back to CPU after sharding in __init__.
- In the forward pass, during lazy init, p._local_shard gets set to p.data so it is on CPU. We pin_memory here.
- In the forward pass, in _rebuild_full_params, we move p.data back to self.compute_device if necessary. Note that we don't use the device of p._full_param_padded because we don't always have this attr, but when we do it is always the same as compute_device.
- The same logic as above applies to the beginning of the backwards pass.
- At the end of fwd and the end of bwd, `_use_param_local_shard` ensures the parameters are offloaded to CPU again by pointing p.data to p._local_shard, which is always on CPU (see the sketch below).
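
A minimal usage sketch (import path and class names follow the later public API and are an assumption for this point in the stack):

```
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("gloo")  # e.g. launched under torchrun
# Parameters are kept on CPU between passes and moved to the compute
# device only while forward/backward actually needs them.
model = FSDP(nn.Linear(8, 8), cpu_offload=CPUOffload(offload_params=True))
```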

Regarding tests:
- We test 3 different types of init: 1) CUDA the model before wrapping with FSDP, 2) CUDA the model after wrapping with FSDP, 3) never CUDA the model.
- Case 1 is always supported. Case 2 is not supported with CPU offload and throws an error during the fwd pass. Case 3 is only supported with CPU offload at the moment.
- Verifies all params are offloaded to CPU after init.
- Verifies all params are offloaded to CPU after forward and backward.
- Note that there is an issue with verifying exact parity when CPU offloading, but it appears to be related to transferring the model back and forth between CPU and CUDA. More details in https://github.com/pytorch/pytorch/pull/66961
ghstack-source-id: 141851903

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31911085

fbshipit-source-id: 3ddf73c070b55ce383e62251868d609004fc30e7
2021-11-02 23:27:34 -07:00
06d1be2447 [NOOP][clangformat][codemod] Enable CLANGFORMAT for caffe2/caffe2/* (#67624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67624

Test Plan: Visual inspection. Sandcastle.

Reviewed By: malfet

Differential Revision: D31986628

fbshipit-source-id: c872bded7325997a2945dbf5d4d052628dcb3659
2021-11-02 22:14:04 -07:00
e86a5a3a1a [Static Runtime] Add PyTorchPredictor::predict_managed_result to return managed output tensors (#65598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65598

This change adds `PyTorchPredictor::predict_managed_result` to enable Static Runtime to return managed output tensors, allocated and owned by Static Runtime to accelerate inference workloads.

- `PyTorchPredictor::predict_managed_result` does only meaningful work for the overridden `PyTorchStaticRuntimePredictor::predict_managed_result`. For other subclasses, it returns a simple object that just wraps the returned `Ivalue`.

- When `manage_output_tensors` is enabled, a `StaticRuntime` cannot be reentered until its return value gets deallocated by calling `StaticRuntime::deallocateOutputTensors`. Currently an instance of `StaticRuntime` gets immediately pushed back to `static_runtime_pool` to be reentered again, and this cannot be done when `manage_output_tensors` is enabled. `PyTorchStaticRuntimePredictorManagedResult` makes sure to delay pushing a `StaticRuntime` instance back to the pool only after `StaticRuntime::deallocateOutputTensors` is called on the runtime instance.

- When `manage_output_tensors` is enabled, `PyTorchStaticRuntimePredictor::predict_managed_result` returns the prediction result, whose backing memory is managed by an instance of `StaticRuntime`. The lifetime of any value reachable from `PyTorchStaticRuntimePredictorManagedResult.get()` is expected to end before `PyTorchStaticRuntimePredictorManagedResult` gets destructed. As explained above, `PyTorchPredictorManagedResult`'s destruction pushes the runtime instance that returned the result back to `static_runtime_pool` to be reused again.

- The current API design of adding `predict_managed_result` instead of forcing `operator()` to return `PyTorchPredictorManagedResult` was motivated by the fact that `manage_output_tensors` will be selectively enabled just for a few models. In case `manage_output_tensors` becomes a commonly used feature we should revisit this API design to merge them together.

Reviewed By: hlu1

Differential Revision: D31149323

fbshipit-source-id: 5ca026188077232d6a49a46759124a978439d7b2
2021-11-02 22:10:26 -07:00
18955d3564 Raise warning when calling collectives on non-member group objects (#67639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639

Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.
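
A sketch of the pattern that now warns (assumes a 3-process job; launch details elided):

```
import torch
import torch.distributed as dist

dist.init_process_group("gloo")       # e.g. torchrun --nproc_per_node=3
group = dist.new_group(ranks=[0, 1])  # rank 2 is not a member
t = torch.ones(1)
# On rank 2 this used to be silently skipped; it now emits a warning first.
dist.all_reduce(t, group=group)
```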

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32075151

Pulled By: mrshenli

fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
2021-11-02 20:04:07 -07:00
54241a9cfa [quant][fx] Add support for fused modules in _convert_do_not_use (#67245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67245

Add support for fused modules in the new convert path, including linear-relu, conv{1-3}d-relu and their QAT versions;
also tested with TRT (conv2d-relu and linear-relu)

Test Plan:
```
python test/fx2trt/test_quantize_fx.py TestQuantizeFxTRTOps.test_linear_relu_module
python test/fx2trt/test_quantize_fx.py TestQuantizeFxTRTOps.test_conv_relu_module
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31919724

fbshipit-source-id: 7e5c96eba30706f7989da680aa3443159847bdfd
2021-11-02 19:21:54 -07:00
91971dfc2a [BE] [GHA] Use aws ecr get-login-password (#67709)
Summary:
Replacing `aws ecr get-login` with `aws ecr get-login-password`, per https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration.html#cliv2-migration-ecr-get-login

Follow up after the similar change in CircleCI: https://github.com/pytorch/pytorch/pull/58308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67709

Reviewed By: seemethere, janeyx99

Differential Revision: D32119319

Pulled By: malfet

fbshipit-source-id: 0cd0d8f4d81e9981a5f8fbf9b812a9167fd48135
2021-11-02 19:06:50 -07:00
16ee6409ee Changed value constraint of exponential dist (#67184)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67183.
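
A sketch of the effect (that the support constraint now admits zero is an assumption based on the linked issue):

```
import torch
from torch.distributions import Exponential

d = Exponential(rate=torch.tensor(1.0))
# 0.0 lies in the support of the exponential distribution, so this
# should pass argument validation; log p(0) = log(rate) = 0 here.
print(d.log_prob(torch.tensor(0.0)))  # tensor(0.)
```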

cc fritzo neerajprad alicanb nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67184

Reviewed By: ejguan

Differential Revision: D32114661

Pulled By: neerajprad

fbshipit-source-id: ea23e59f38a23a7b0bab4fbbd98ae3feba468b9c
2021-11-02 17:44:56 -07:00
885da61d7d [PG NCCL] Disable NCCL health check (#67668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67668

This adds an env var to enable the NCCL health check; when left unspecified, the check is not run. Unit tests that need to test this functionality have the env variable set. Please see the internal diff for more details.

Test Plan: CI

Reviewed By: yuguo68, mrshenli

Differential Revision: D32089763

fbshipit-source-id: dff5664a5e607f711515cd1042089ca769914fbb
2021-11-02 16:21:59 -07:00
0b2f68eadf Remove special FX OpInfo list (#67520)
Summary:
Most of the failing tests fail because the test doesn't work with Python functions (only builtins like `torch.add`).

I added a check for that and ported the remaining skips into the `skips` field.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67520

Reviewed By: ZolotukhinM

Differential Revision: D32046856

Pulled By: Chillee

fbshipit-source-id: 05fa3e3c40fa6cc4f776e0c24f667629b14afd25
2021-11-02 16:01:46 -07:00
96e3d1a76c Remove native_functions.yaml dependency from Sorting.cu (#66621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66621

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D31856099

Pulled By: dagitses

fbshipit-source-id: d9c2b6b45099e49c7beaae5888140de350d23696
2021-11-02 14:46:29 -07:00
7deb1726ea Remove native_functions.yaml dependency from ScanKernels.cu (#66620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66620

This splits the Tensor-dependent code out into a cpp file.

A slight complicating factor is that `scan_dim` uses `copy_` to handle
non-contiguous out arguments, so I've moved that code into the caller,
which does introduce some duplication, though it's only ~10 extra lines
in total.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D31856106

Pulled By: dagitses

fbshipit-source-id: 91bb4ce5e7c6487e3ea0d5ec4d9f7a625d8ef978
2021-11-02 14:45:17 -07:00
9e97ccbd7a .github: Migrate iOS workflows to GHA (#67645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67645

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32104367

Pulled By: seemethere

fbshipit-source-id: 08ff043ed5d0b434322f1f3f20dce2a4f5fa88c1
2021-11-02 14:38:43 -07:00
a831713786 [PyTorch Edge] Use Integer Subtraction (Instead of Float) in Non-FBGEMM Dequantization (#67115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67115

This matches what FBGEMM does (https://fburl.com/code/vjrdn6tj, https://fburl.com/code/btkdn24l)

Benchmark Mobile Vision Transformer Model Results (as described in D31066997 and config from rebasing onto v4 of D31869106):

This diff (v18):
- NET latency: 109.866
- https://our.intern.facebook.com/intern/aibench/details/536304563225483

This diff before using vsubl (v14 but rebased onto v22 of D31205883, the previous diff in this stack)
- NET latency: 115.887
- https://our.intern.facebook.com/intern/aibench/details/906978557243297

Before this diff (v22 of D31205883):
- NET latency: 116.449
- https://our.intern.facebook.com/intern/aibench/details/870678436773989

ghstack-source-id: 142166375

Test Plan: Phabricator tests pass; running quantized_test on a Pixel 3a and running the mobile vision transformer model (as described in D31066997) both work

Reviewed By: kimishpatel

Differential Revision: D31483135

fbshipit-source-id: fbef00cad6087b49900d21c3dd3b6fd432f64e94
2021-11-02 14:28:03 -07:00
23bd3cf5b2 [PyTorch Edge] Parallelize Quantize and Dequantize Tensor (#65845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65845

Benchmarking of Non-Parallelized and Parallelized quantization/dequantization for various devices and input sizes done in this notebook:
https://www.internalfb.com/intern/anp/view/?id=1204834&scroll_cell=17&checkpoint_id=432447238302644

For example:
- {F671713127}
- {F671713209}
- {F671713238}
- {F671713253}

When run on Partially Quantized Mobile Vision Transformer Model (as described in D31066997:

Before this diff (on D31444248 v7):
- [120.907ms](https://our.intern.facebook.com/intern/aibench/details/945891590820680)

With this diff (v19):
- Threshold = 2^16: [118.086ms](https://our.intern.facebook.com/intern/aibench/details/436376817372377)
- Threshold = 2^20: [118.361ms](https://our.intern.facebook.com/intern/aibench/details/617543354077290)

ghstack-source-id: 142166374

Test Plan:
Same as previous diff (D31066997)

All tests pass

Also, set numel to 2^21 in quantized_test TestArmVectorizedAndParallelQuantizeDequantize (https://www.internalfb.com/diff/D31066997?dst_version_fbid=596325738080019&transaction_fbid=219437170135898) and the tests passed

Reviewed By: kimishpatel

Differential Revision: D31205883

fbshipit-source-id: 9ed0b11a376734feaf228074a24b8eb79d5270a3
2021-11-02 14:28:01 -07:00
92cfda1785 [PyTorch Edge] Clean up Quantize Tensor code (#66220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66220

- Pass pointers rather than tensors to ```quantize_tensor_arm``` to allow for using ```__restrict__``` and to make parallelization easier (as in the next diff on this stack D31205883)
- Replace ```auto``` with actual types
- Replace raw cast with reinterpret_cast<...>
- All of these changes make the code structure similar to that of Dequantize
ghstack-source-id: 142166376

Test Plan: same as D31066997 (all tests pass)

Reviewed By: kimishpatel

Differential Revision: D31444248

fbshipit-source-id: 6a31d090082047263403f415911c199519987595
2021-11-02 14:27:59 -07:00
16c62a6dc9 [PyTorch Edge] Optimize Dequantize Tensor with Intrinsics (#65844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65844

When run on [Partially Quantized Mobile Vision Transformer Model](https://www.internalfb.com/diff/D30648171), with config from rebasing onto v4 of D31869106

Before:
[AIBench Run (128ms)](https://www.internalfb.com/intern/aibench/details/309792316534505)
[Perf Report](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/model_perf_1635881079420.html)

After:
[AIBench Run (117ms)](https://www.internalfb.com/intern/aibench/details/20433505461364)
[Perf Report](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/model_perf_1635881527831.html)

Total events spent on at::native::dequantize_quantized reduced from 1.97 Billion to 0.97 Billion (~50% Reduction)
ghstack-source-id: 142166373

Test Plan:
To run quantized_test
- Clone open source repo
- Set ANDROID_NDK and ANDROID_SDK
- Build with ```BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_LITE_INTERPRETER=0  ANDROID_ABI=arm64-v8a ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON```
- Move ```build_android/bin/quantized_test``` to devserver
- Use one world to connect to android device (ex. ```one_world android device pixel-3a```)
- In another terminal: Make quantized_test executable (```chmod +x quantized_test```), copy it to android device (```adb push quantized_test /data/local/tmp```), and run it (```adb shell /data/local/tmp/quantized_test```)

Results:
{F676102702}

Also ```buck test mode/dev //caffe2/aten:quantized_test``` passes

To test performance on [Partially Quantized Mobile Vision Transformer Model](https://www.internalfb.com/diff/D30648171) with AI Bench:
- Save this config file: P466124028 (for example: D31869106)
- Before or after the changes in this diff, run ```buck run aibench:run_bench -- -b benchmark_mobile_vision_transformer_model_config.json --platform android/arm64 --framework pytorch --remote --devices Pixel-3a-11-30 --force_profile```

Reviewed By: kimishpatel

Differential Revision: D31066997

fbshipit-source-id: 9067e683e0181aa13a2b636b68ac4fe5a4b2e618
2021-11-02 14:26:42 -07:00
9cef2033f3 Modify decorator for acc op converters (#67636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67636

Modify the decorator to denote whether an acc op converter supports explicit/implicit batch dims. This info will be used by trt_splitter when determining whether a node can be split into the acc graph.
This can prevent us from splitting a node into the acc module only to later find that no proper converter exists for the node, which would fail the lowering process.

Test Plan: unit test

Reviewed By: 842974287

Differential Revision: D31998477

fbshipit-source-id: 6789ebef4a76f9a0c1ab3edf8e846a5b6143326b
2021-11-02 13:35:40 -07:00
5ad169b7cc Adding in Wrap functions for FSDP from Fairscale (#67292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67292

as title

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/fsdp:wrap --keep-going

Reviewed By: rohan-varma

Differential Revision: D31936404

fbshipit-source-id: b7ebead9a649766aec83e5630c2ce1386ad33e11
2021-11-02 13:30:41 -07:00
8f63cfda14 [LiteInterpreter] Specify Loader to yaml.load (#67694)
Summary:
The Loader argument became mandatory in PyYAML 6, but it has been accepted since PyYAML 3 (see the sketch below)

Unblock migration to newer runtime
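
A minimal cross-version call (sketch):

```
import yaml

# PyYAML 6 made Loader a required argument of yaml.load; passing it
# explicitly has worked since PyYAML 3, so this form runs on both.
data = yaml.load("a: 1", Loader=yaml.SafeLoader)
print(data)  # {'a': 1}
```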

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67694

Reviewed By: seemethere

Differential Revision: D32106043

Pulled By: malfet

fbshipit-source-id: 35246b97a974b168c066396ea31987b267534c7f
2021-11-02 12:52:57 -07:00
b00206d473 [vulkan] Use 3D textures for everything (#67647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67647

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32102196

Pulled By: SS-JIA

fbshipit-source-id: ded1835386a0640181f69c190a2294d298311e26
2021-11-02 12:29:26 -07:00
0ee8473af7 [SR][easy] Fix FuseListUnpack 0-use corner case (#67165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67165

We previously skipped the optimization if `value_out->uses().size() > 1`. But it's possible that the number of uses is 0. In that case, it's not safe to access `value_out->uses()[0]`.

This is not causing any problems in production right now since we don't have any dead code before running this pass. But we should handle this case correctly to make the pass more robust.

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31887416

fbshipit-source-id: d30a5824e8bd1cda1debdc16524db3fb0da312f9
2021-11-02 12:17:16 -07:00
6b1d8e5bb2 Revert D31861962: [qnnpack] Remove redundant fp16 dependency
Test Plan: revert-hammer

Differential Revision:
D31861962 (4061239fdd)

Original commit changeset: e1425c7dc3e6

fbshipit-source-id: 418f8173c19b9541316443e1ab4ec39062561b5e
2021-11-02 11:55:07 -07:00
3e218dbd27 [PyTorch] Capture function args from schema by reference (#65951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65951

Profiling shows that we do a bunch of heap allocations to copy Argument structs in append_operator. Capturing by reference here should be safe as long as the schema object the args come from outlives the operator function.

IMPORTANT: Reviewers (or automated tests if we're lucky) need to
confirm that the above is true or we're going to have fun
use-after-free bugs.
ghstack-source-id: 142065422

Test Plan:
AIBench run for speech model on MilanBoard

control: https://www.internalfb.com/intern/aibench/details/485570882988661 (mean 906 ms)
test: https://our.intern.facebook.com/intern/aibench/details/620835625995669 (mean 818 ms)

So almost a 10% improvement in the wall time metric?

Reviewed By: iseeyuan

Differential Revision: D31319988

fbshipit-source-id: 7da56357420df500df344f49007e070ebb1bc581
2021-11-02 11:12:04 -07:00
33d62266f2 [PyTorch][easy] Avoid allocating OperatorName strings in append_operator (#66134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66134

No reason to do the comparison the old way when we could do it this way and avoid copying into std::string.
ghstack-source-id: 142065423

Test Plan: AIBench Milan run shows neutral to slight regression, but I think we should probably just make this change anyway.

Reviewed By: dhruvbird

Differential Revision: D31319669

fbshipit-source-id: dde329a4f2c4054f275eb98fb6556f5341e7533a
2021-11-02 11:10:52 -07:00
2644725937 [SR] Migrate gather_ranges_to_dense to new FuseListUnpack (#67164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67164

Migrated both the variadic and non-variadic versions.

This diff is part of the effort to migrate all ops used in `FuseListUnpack` to `FuseListUnpackV2`. The original version of `FuseListUnpack` is problematic for a few reasons:

* You have to complicate the op implementation with an `is_fused` check, resulting in messier code. It is easier to reason about two ops, fused (out variant) and unfused (native).
* The original version of `FuseListUnpack` is buggy. It assumes that the `ListUnpack` node occurs immediately after the fusion candidate, which is not necessarily true.

This diff finishes the migration, so the original version of `FuseListUnpack` is removed

Test Plan:
Unit tests: `buck test caffe2/benchmarks/static_runtime/...`

**Accuracy Test**
Done at the top of this diff stack.

Reviewed By: hlu1

Differential Revision: D31887386

fbshipit-source-id: 9d44c813667a75bce13dce62bf98e6109edea6ba
2021-11-02 11:04:59 -07:00
82f7f8d471 [PyTorch] Adopt IValue::toTupleRef() where obvious (#65505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505

Generated with

`fastmod -m 'toTuple\(\)(\s*)->' 'toTupleRef()${1}.'`

, followed by

`fastmod '(std::move\(.*)toTupleRef\(\).' '${1}toTuple()->'`

to unbreak 2 callsites.
ghstack-source-id: 142065835

Test Plan: CI

Reviewed By: gchanan

Differential Revision: D31131025

fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34
2021-11-02 10:22:18 -07:00
eb1b8a2160 pytorch_android_gradle_custom_build_single migrated from Circle to GHA. (#67577)
Summary:
In scope of https://github.com/pytorch/pytorch/issues/67301. Main changes:
* pytorch_android_gradle_custom_build_single removed from CircleCI (however, the template is still there since it is used by another similar workflow, pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit, which will be migrated next)
* new GHA workflow added: pytorch_android_gradle_custom_build_single

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67577

Reviewed By: malfet, mruberry

Differential Revision: D32087709

Pulled By: b0noI

fbshipit-source-id: f9581558ddc1453b63264bf19fe5a4c245b7c007
2021-11-02 10:21:03 -07:00
d9bac7c316 [PyTorch] Add IValue::toTupleRef() (#65504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65504

We should be able to borrow a Tuple from an IValue without incurring refcount bumps.
ghstack-source-id: 142065833

Test Plan:
Added test coverage.

Profiled static runtime on the local_ro net for ctr_mobile_feed. Inclusive time spent in VarTupleUnpack decreased about 0.3%, which roughly matches the 0.36% of runtime that was previously spent in IValue::toTuple().

Reviewed By: hlu1

Differential Revision: D31130570

fbshipit-source-id: afa14f46445539e449068fd908d547b8da7f402c
2021-11-02 10:16:25 -07:00
7cd62621fb [PyTorch] Adopt faster Tuple::create (#65381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65381

The previous diff adds a way to make Tuples of size 3 or less
more efficiently. This diff makes it easier to hit that path and
updates a bunch of callsites to hit it.
ghstack-source-id: 142065832

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D31069538

fbshipit-source-id: d04da3709594ed68ab1c0a1471f8cffd8d001628
2021-11-02 10:10:31 -07:00
9e71ea292d Fix test_init_pg_and_rpc_with_same_socket by retrying on addr in use error (#67638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67638

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32074698

Pulled By: H-Huang

fbshipit-source-id: 6b980fcdac4b0f1edfe086d0deb99be371a73900
2021-11-02 09:42:47 -07:00
4061239fdd [qnnpack] Remove redundant fp16 dependency (#67281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67281

`qnnpack/operator.h` introduces a dependency on the external library fp16 via `qnnpack/requantization.h`.
Including `qnnpack/operator.h` in `pytorch_qnnpack.h` makes objects that really don't require fp16 depend on it indirectly, because they include `pytorch_qnnpack.h`.
This was causing some test and bench targets to fail to build for local and android/arm64 (the only two tried) using CMake.

This diff moves `qnnpack/operator.h` from `pytorch_qnnpack.h` to `qnnpack_func.h`, and explicitly add `qnnpack/operator.h` in `src/conv-prepack.cc`.

Test Plan: Ran all the tests for local on my devserver, and arm64 on Pixel3a.

Reviewed By: kimishpatel

Differential Revision: D31861962

fbshipit-source-id: e1425c7dc3e6700cbe3e46b64898187792555bb7
2021-11-02 09:29:55 -07:00
cd51d2a3ec Adding OpInfo for logical_or, logical_and, logical_xor (#67178)
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/54261.

This adds OpInfos for binary logical element-wise operators. This is my first OpInfo PR to PyTorch; I'm looking forward to suggestions and any feedback.

cc: mruberry krshrimali

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67178

Reviewed By: jbschlosser

Differential Revision: D32057889

Pulled By: mruberry

fbshipit-source-id: 7e670260af6b478dba9d6e8d77de4df1b6d0b5d1
2021-11-01 20:27:45 -07:00
c65f332da4 torch::deploy unity and its demo (#67134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67134

This diff demos torch::deploy unity, which builds the model, the dependencies, and the runtime as a unity!

The end user only needs to replace the python_binary rule with the build_unity rule to define the Python application. Under the hood, we build the Python application (an xar file), build the torch deploy runtime, and then embed the Python application (the xar file) into the torch deploy runtime.

When starting the torch::deploy runtime, the xar will be written to the filesystem and extracted. We put the extracted path on Python's sys.path so all the model files and all the Python dependencies can be found!

As a demo, the model here is just a simple Python program using numpy and scipy. But theoretically, it can be as complex as we want.

I'll check how bento_kernel works. Maybe we can learn from bento_kernel to simplify things a bit.
ghstack-source-id: 142085742

Test Plan:
```
#build
buck build mode/opt unity:unity

# make sure the path exists before we start torch::deploy runtime
# Otherwise the dynamic loader will just skip this non-existing path
# even though we create it after the runtime starts.
mkdir -p /tmp/torch_deploy_python_app/python_app_root

#run
LD_LIBRARY_PATH=/tmp/torch_deploy_python_app/python_app_root ~/fbcode/buck-out/gen/caffe2/torch/csrc/deploy/unity/unity
```

Reviewed By: suo

Differential Revision: D31816526

fbshipit-source-id: 8eba97952aad10dcf1c86779fb3f7e500773d7ee
2021-11-01 19:32:49 -07:00
ec6b472e0a [vulkan] Add prepacking for conv2d_transpose (#67358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67358

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D31970903

Pulled By: SS-JIA

fbshipit-source-id: 128deb40dc14fb97aa61af9cddab4540b630359e
2021-11-01 17:59:32 -07:00
152f665dee Inserted check for PyObject_IsInstance in THPVariableCheck (#67588)
Summary:
Inserted a check on the return value of PyObject_IsInstance to capture the case in which it raises an exception and returns -1. When this happens, THPVariable_Check now throws a python_error to signal the exception.

Fixes https://github.com/pytorch/pytorch/issues/65084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67588

Reviewed By: mruberry

Differential Revision: D32064776

Pulled By: albanD

fbshipit-source-id: 895c7682e0991ca257e27f9638a7462d83707320
2021-11-01 16:53:54 -07:00
c4bf196334 Strided masked reduction: mean (2nd try) (#67088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67088

Stack from [ghstack](https://github.com/ezyang/ghstack):
* __->__ #67088

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32070264

Pulled By: cpuhrsch

fbshipit-source-id: 08a91550dd24fb0f51abf06591a0e26186c4f9f9
2021-11-01 16:12:07 -07:00
53e6aca8b3 [Pytorch Edge] Make More Classes Selective (#67397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67397

Expand selectivity coverage to classes created outside of TORCH_LIBRARY.

ghstack-source-id: 142076940

Test Plan: Model unit tests, manually run some models on prod apps.

Reviewed By: dhruvbird, bdhirsh

Differential Revision: D31978965

fbshipit-source-id: 708901b47a9838ac54c78788028d0e18c1e378c0
2021-11-01 15:12:30 -07:00
45d5b3248b Fixed C++ BatchNorm pretty_print() with optional momentum (#67335)
Summary:
Summary: Inserted a check for the momentum to print "None" in case it is not defined. See https://github.com/pytorch/pytorch/issues/65143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67335

Test Plan:
The code below now prints `torch::nn::BatchNorm2d(128, eps=1e-05, momentum=None, affine=true, track_running_stats=true)` without generating errors.
```
torch::nn::BatchNorm2d m(torch::nn::BatchNormOptions(128).momentum(c10::nullopt));
std::cerr << *m << "\n";
```
Fixes https://github.com/pytorch/pytorch/issues/65143

Reviewed By: mruberry

Differential Revision: D32067820

Pulled By: ngimel

fbshipit-source-id: f40f9bbe090aa78e00f6c3a57deae393d946b88d
2021-11-01 14:45:33 -07:00
234bd6dc56 [quantized] Add bilinear quantized grid_sample (#66879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66879

This adds a quantized implementation for bilinear grid_sample. Bicubic interpolation cannot be supported as easily, since we rely on the linearity of quantization to operate on the raw values, i.e.

f(q(a), q(b)) = q(f(a, b)) where f is the linear interpolation function.
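
A toy numeric check of this property (a sketch, not the actual kernel). For affine quantization q(x) = round(x/scale) + zp, interpolating the raw integer values agrees with quantizing the interpolation of the dequantized values:
```
import torch

scale, zp, w = 0.1, 128, 0.25
a = torch.quantize_per_tensor(torch.randn(4), scale, zp, torch.quint8)
b = torch.quantize_per_tensor(torch.randn(4), scale, zp, torch.quint8)

# interpolate on raw integer values (what the quantized kernel can do cheaply)
lerp_raw = (1 - w) * a.int_repr().float() + w * b.int_repr().float()
# interpolate on dequantized values, then map back to the quantized domain
lerp_deq = (1 - w) * a.dequantize() + w * b.dequantize()
requant = lerp_deq / scale + zp

print(torch.allclose(lerp_raw, requant))  # True: both paths agree
```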
ghstack-source-id: 141321116

Test Plan: test_quantization

Reviewed By: kimishpatel

Differential Revision: D31656893

fbshipit-source-id: d0bc31da8ce93daf031a142decebf4a155943f0f
2021-11-01 14:44:26 -07:00
0cbfd466d2 Remove ProcessGroup from TensorPipeAgent initialization (#66708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66708

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31762735

Pulled By: H-Huang

fbshipit-source-id: 9f3879fca6b8258f7e6171b14d2c1d6cce21627d
2021-11-01 14:15:27 -07:00
ba369ea053 check to ensure profiler_edge is only added when use_kineto is on (#67494)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67494

Reviewed By: jbschlosser

Differential Revision: D32031142

Pulled By: mcr229

fbshipit-source-id: 8267f0e02c5bed0fbc4956af6935a551bedb27ef
2021-11-01 13:42:14 -07:00
76f57cd442 [CODEOWNERS] Remove @neginraoof (#67631)
Summary:
She no longer works on the ONNX exporter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67631

Reviewed By: malfet

Differential Revision: D32070435

Pulled By: msaroufim

fbshipit-source-id: d741a15bd7a916745aa7f2f3d9bb1dc699553900
2021-11-01 13:26:38 -07:00
e80cb08cc8 [jit][shape_prop] Fix jit registration of unpack_sizes ops for prepacked (#66737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66737

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D31703587

Pulled By: IvanKobzarev

fbshipit-source-id: ccebe5ffc4fa959e3fa63afab1058d94e9df9dd9
2021-11-01 12:43:10 -07:00
251278d385 [skip ci] set more tests with owners for distributed and elastic (#67583)
Summary:
It turns out my lint doesn't work on CI all the time because of shell differences. I'm working on a new more comprehensive lint in https://github.com/pytorch/pytorch/pull/66826 and it'd be nice if these could be cleared first.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67583

Reviewed By: H-Huang, mruberry

Differential Revision: D32045155

Pulled By: janeyx99

fbshipit-source-id: ecfe9f008310c28e3b731e246c2b2ed0106d03b1
2021-11-01 12:26:03 -07:00
4d99bc839b Remove TH/THC Storage functions for unused dtypes (#67480)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67480

Reviewed By: mruberry

Differential Revision: D32023494

Pulled By: ngimel

fbshipit-source-id: 8827e1d6e765fee7219b5ee9888a1a3e3c5fbe89
2021-11-01 11:45:20 -07:00
a122ba776a Fix less_than_lowest warnings (#67422)
Summary:
Fixes useless comparison against zero warnings for Half.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67422

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31951939

fbshipit-source-id: 3e9940adda2d57b4d9b122f3862706c673f9ef4b
2021-11-01 11:19:55 -07:00
da29655797 Disable miopen test for convolution on mobile (#66564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66564

Mobile thinks that we are segfaulting in _convolution, and this
is the most recent substantive change to this function.  I think
it's pretty unlikely to have caused the crash, but if we don't have
any better ideas we should try this.
ghstack-source-id: 141972758

Test Plan: ship it and see if it resolves the error report

Reviewed By: kimishpatel

Differential Revision: D31598633

fbshipit-source-id: c34f4b0b7b8529e21fd019c886ad8d68ffe286b0
2021-11-01 10:22:40 -07:00
885a8e53ba replace onlyOnCPUAndCUDA with onlyNativeDeviceTypes (#65201)
Summary:
Reference https://github.com/pytorch/pytorch/issues/53849

Replace `onlyOnCPUAndCUDA` with `onlyNativeDeviceTypes`, which includes `cpu`, `cuda`, and `meta`.
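
A minimal usage sketch, assuming the standard device-type test harness:
```
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, onlyNativeDeviceTypes)
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    @onlyNativeDeviceTypes  # was @onlyOnCPUAndCUDA; now also runs for meta
    def test_something(self, device):
        self.assertIn(device.split(':')[0], ('cpu', 'cuda', 'meta'))

instantiate_device_type_tests(TestExample, globals())

if __name__ == '__main__':
    run_tests()
```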

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65201

Reviewed By: mrshenli

Differential Revision: D31299718

Pulled By: mruberry

fbshipit-source-id: 2d8356450c035d6a314209ab51b2c237583920fd
2021-11-01 09:22:34 -07:00
39ad7b670e [SR] Native implementation for aten::squeeze (#67441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31992093

fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
2021-11-01 08:22:57 -07:00
00da7b9a3b Set test owner for vmap (#67582)
Summary:
More leftover actions from https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67582

Reviewed By: zou3519

Differential Revision: D32045160

Pulled By: janeyx99

fbshipit-source-id: 92ae9a533285b05b44bd04bb6127061c6fddd689
2021-11-01 07:22:48 -07:00
9cdd1d7e48 Docs module check (#67440)
Summary:
Add check to make sure we do not add new submodules without documenting them in an rst file.
This is especially important because our doc coverage only runs for modules that are properly listed.

temporarily removed "torch" from the list to make sure the failure in CI looks as expected. EDIT: fixed now

This is what a CI failure looks like for the top level torch module as an example:
![image](https://user-images.githubusercontent.com/6359743/139264690-01af48b3-cb2f-4cfc-a50f-975fca0a8140.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67440

Reviewed By: jbschlosser

Differential Revision: D32005310

Pulled By: albanD

fbshipit-source-id: 05cb2abc2472ea4f71f7dc5c55d021db32146928
2021-11-01 06:24:27 -07:00
0d7cf825fc [SR] Drop support for aten::__is__ and aten::__isnot__ (#67550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550

`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.

Some examples:

- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
    y = x.detach()
    return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.

- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
    y = x.to(x.dtype)
    return x is y
```

- And many more, `FuseListUnpack` can break too

Since the ops are not used by 99.99% of users, rejecting them so we don't have to think about this is not a big deal.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D32022584

fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
2021-11-01 04:45:14 -07:00
7fbcf79684 [tensorexpr][nnc] Support quantization (#66676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66676

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31676329

Pulled By: IvanKobzarev

fbshipit-source-id: 288b41ff4ed603dfaacb465f296997f14bb23c22
2021-10-31 22:49:30 -07:00
97f29bda59 Relaxes tolerance on ROCm test_noncontiguous_samples_matmul (#67593)
Summary:
This test is narrowly failing intermittently. See https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.3.1-py3.6-test1/7736//console for an example. Relevant snippet:

```
12:28:43 ======================================================================
12:28:43 FAIL [0.104s]: test_noncontiguous_samples_matmul_cuda_float32 (__main__.TestCommonCUDA)
12:28:43 ----------------------------------------------------------------------
12:28:43 Traceback (most recent call last):
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1422, in wrapper
12:28:43     method(*args, **kwargs)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1422, in wrapper
12:28:43     method(*args, **kwargs)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
12:28:43     result = test(self, **param_kwargs)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
12:28:43     return test(*args, **kwargs)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 920, in only_fn
12:28:43     return fn(self, *args, **kwargs)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
12:28:43     fn(*args, **kwargs)
12:28:43   File "test_ops.py", line 262, in test_noncontiguous_samples
12:28:43     self.assertEqual(actual_grad, expected_grad)
12:28:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
12:28:43     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
12:28:43 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.2278556823730469e-05 (-1.458460807800293 vs. -1.4584730863571167), which occurred at index 7.
```

Setting an absolute tolerance of 1e-4, which is what this PR does, should make the test pass consistently.
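
For illustration, the reported worst-case difference (~1.23e-05) fails the default fp32 tolerances but passes with atol=1e-4 (using the public torch.testing.assert_close as a stand-in for the internal assertEqual):
```
import torch

actual = torch.tensor(-1.458460807800293)
expected = torch.tensor(-1.4584730863571167)

# with rtol=1.3e-06, atol=1e-05 this raises: 1.23e-05 > 1e-05 + rtol * |expected|
torch.testing.assert_close(actual, expected, rtol=1.3e-6, atol=1e-4)  # passes
```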

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67593

Reviewed By: ngimel

Differential Revision: D32050986

Pulled By: mruberry

fbshipit-source-id: f15bc8c4516be0a859afcfa76d52334c0b2c58a5
2021-10-31 04:26:31 -07:00
d0662f2f76 Add adaptive_max_pool OpInfo (#67405)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67405

Reviewed By: mruberry

Differential Revision: D32044712

Pulled By: ngimel

fbshipit-source-id: 4619d134d18359601801c029dd5be3f59b91626d
2021-10-30 21:19:58 -07:00
e01279cc2e Disable reduced precision reductions for fp16 GEMMs (#67578)
Summary:
It appears that most NVIDIA architectures (well, at least there haven't been many reports of this issue) don't do reduced precision reductions (e.g., reducing in fp16 given fp16 inputs), but this change attempts to ensure that a reduced precision reduction is never done. The included test case currently fails on Volta but passes on Pascal and Ampere; setting this flag causes the test to pass on all three.
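
A usage sketch; the description does not name the flag, so the attribute below is an assumption about where such a toggle would live:
```
import torch

# assumption: the new toggle is exposed on torch.backends.cuda.matmul;
# setting it to False forces full-precision (fp32) accumulation
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

a = torch.randn(512, 512, device="cuda", dtype=torch.half)
b = torch.randn(512, 512, device="cuda", dtype=torch.half)
c = a @ b  # the GEMM no longer reduces in fp16 even though inputs are fp16
```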

CC stas00 ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67578

Reviewed By: mruberry

Differential Revision: D32046030

Pulled By: ngimel

fbshipit-source-id: ac9aa8489ad6835f34bd0300c5d6f4ea76f333d1
2021-10-30 21:14:11 -07:00
510e3026a9 [numpy] add torch.argwhere (#64257)
Summary:
Adds `torch.argwhere` as an alias to `torch.nonzero`

Currently, `torch.nonzero` actually provides functionality equivalent to `np.argwhere`.

From NumPy docs,
> np.argwhere(a) is almost the same as np.transpose(np.nonzero(a)), but produces a result of the correct shape for a 0D array.
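
A small example of the equivalence:
```
import torch
import numpy as np

t = torch.tensor([[0, 1], [2, 0]])
print(torch.argwhere(t))       # tensor([[0, 1], [1, 0]])
print(torch.nonzero(t))        # identical output; argwhere aliases nonzero
print(np.argwhere(t.numpy()))  # matches NumPy's behavior
```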

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64257

Reviewed By: qihqi

Differential Revision: D32049884

Pulled By: saketh-are

fbshipit-source-id: 016e49884698daa53b83e384435c3f8f6b5bf6bb
2021-10-30 15:26:11 -07:00
a95c94f075 [fx2trt] fix acc_tracer when run against module that contains ScriptModule submodules (#67567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67567

- Fix an issue to allow it to work against modules that contains ScriptModule submodules.
- Fix a bug where `getattr(base_class, method_name)` could raise KeyError

Test Plan: linter; CI;

Reviewed By: 842974287

Differential Revision: D31956070

fbshipit-source-id: 1114937f380af437fd6d36cd811ef609d7faefe7
2021-10-30 15:13:45 -07:00
b24c34426f Add OpInfo for torch.unique and torch.unique_consecutive (#67529)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67529

Reviewed By: pbelevich

Differential Revision: D32045941

Pulled By: saketh-are

fbshipit-source-id: fefea1ddabcd3c4b40e9374b991410626437cdb4
2021-10-30 08:33:41 -07:00
aa16de517d Revert D31984694: [pytorch][PR] make TORCH_(CUDABLAS|CUSOLVER)_CHECK usable in custom extensions
Test Plan: revert-hammer

Differential Revision:
D31984694 (d4493b27ee)

Original commit changeset: 0035ecd13980

fbshipit-source-id: c85689007719c9e4a930b0a8a32d481a501d3c14
2021-10-30 03:51:18 -07:00
4a2bbc619d move functionalize fallback out of aten/core (#67564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67564

moves the functionalize fallback out of aten/core and into aten, which should fix the issue described at https://fb.workplace.com/groups/163556484490704/permalink/1029416141238063/. I'm still not clear on why this didn't fail anything in CI / sandcastle on the initial diff: D31942093 (0032fa7725)
ghstack-source-id: 141959891

Test Plan: Locally, running `buck build mode/opt //sigrid/feed/prediction_replayer:fully_remote_replayer_main`

Reviewed By: zou3519

Differential Revision: D32027585

fbshipit-source-id: 2d86c4a6b3a73b00ee0ccee2f89a54704ed83e00
2021-10-29 21:40:49 -07:00
c00806beda Add skipXLA and expectedFailureXLA decorator (#66857)
Summary:
Add skipXLA and expectedFailureXLA decorator and relevant test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66857

Reviewed By: ngimel

Differential Revision: D32039856

Pulled By: mruberry

fbshipit-source-id: 3c99d5e06c1c7684d1f798c11c783bd6ebea9899
2021-10-29 19:53:36 -07:00
69adbc8778 Fix splitter_base and add unit test for trt splitter (#67569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67569

splitter_base assumes that the first subgraph after the split must be a CPU subgraph if a CPU node exists. This is wrong; the starting subgraph should be determined by which subgraph contains the node with zero dependencies.
Also adds a unit test for the splitter.

Reviewed By: yinghai

Differential Revision: D32012549

fbshipit-source-id: e2639ccd7774b4295ca05c2ddbefff9726702b3f
2021-10-29 18:51:59 -07:00
d4493b27ee make TORCH_(CUDABLAS|CUSOLVER)_CHECK usable in custom extensions (#67161)
Summary:
Make `TORCH_CUDABLAS_CHECK` and `TORCH_CUSOLVER_CHECK` available in custom extensions by exporting the internal functions called by the both macros.

Rel: https://github.com/pytorch/pytorch/issues/67073

cc xwang233 ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67161

Reviewed By: jbschlosser

Differential Revision: D31984694

Pulled By: ngimel

fbshipit-source-id: 0035ecd1398078cf7d3abc23aaefda57aaa31106
2021-10-29 17:27:07 -07:00
ad89d994c9 [Static Runtime] Support recordio format input for benchmark (#67530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67530

Currently `ptvsc2_predictor_bench` only uses the first input of a given recordio file even when the recordio file contains many inputs.

This change extends `StaticRuntime::benchmark` to accept multiple input entries so that we can benchmark more extensibly and realistically using all the inputs in the recordio file.

Test Plan:
Tested `ptvsc2_predictor_bench` with / without this change executing the following command:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423_0.predictor.disagg.local  --recordio_inputs=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423.local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=0 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Reviewed By: hlu1

Differential Revision: D31947382

fbshipit-source-id: 4188271613aad201f8cad5f566e0dfed26680968
2021-10-29 14:38:14 -07:00
2cac92f470 [SR] Migrate sigrid_transforms_torch_bind to new FuseListUnpack (#67163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67163

Migrated both the variadic and non-variadic versions.

This diff is part of the effort to migrate all ops used in `FuseListUnpack` to `FuseListUnpackV2`. The original version of `FuseListUnpack` is problematic for a few reasons:

* You have to complicate the op implementation with an `is_fused` check, resulting in messier code. It is easier to reason about two ops, fused (out variant) and unfused (native).
* The original version of `FuseListUnpack` is buggy. It assumes that the `ListUnpack` node occurs immediately after the fusion candidate, which is not necessarily true.

Test Plan:
Unit tests: `buck test caffe2/benchmarks/static_runtime/...`

**Accuracy Test**
Done at the top of this diff stack.

**Performance**
Everything seems to be about the same plus or minus some noise.

* Baseline (D31947382 with some errors correct locally, the version of the op here is fused and variadic): P464964343
* This diff, fused_variadic: P464960645
* Variadic transformation disabled, fused (caught and fixed a schema error here): P464961561
* List unpack fusion disabled, variadic: P464962661
* Both variadic and fusion passes disabled: P464963342

The predictions match with the JIT interpreter for all ops.

Reviewed By: hlu1

Differential Revision: D31887300

fbshipit-source-id: 25a7b4e35eed21ca8b2c98297513425cf17f461a
2021-10-29 14:25:10 -07:00
289b0f7b04 Resent the reverted PR: Add register_frozenpython.cpp to the torch::deploy interpreter library in the OSS build (#67303)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67303

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D32016061

Pulled By: shunting314

fbshipit-source-id: 9460c90dd4f630f4c81dbfbbd772446ddffbabd0
2021-10-29 14:10:43 -07:00
ba74b03b0d Back out "[sharded_tensor] simplify init_from_local_shards API"
Summary: Original commit changeset: 6e97d95ffafd

Test Plan: unit test

Reviewed By: wanchaol

Differential Revision: D32023341

fbshipit-source-id: 2a9f7b637c0ff18700bcc3e44466fffcff861698
2021-10-29 14:01:07 -07:00
5c77ccefe0 Resolves #67227 documentation issue (#67379)
Summary:
Changed "Chi2" in the docstring to a more intuitive "Chi-squared"

Fixes https://github.com/pytorch/pytorch/issues/67227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67379

Reviewed By: jbschlosser

Differential Revision: D32023761

Pulled By: ngimel

fbshipit-source-id: b514b49726f616914871a9a831aa10e12e4be90b
2021-10-29 13:47:38 -07:00
66202b7f8d [Pytorch Edge] Expose runtime operators versioning (#67385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67385

As part of the expanded operator versioning effort, we are going to start looking at this variable and what's stored locally in the model file.
ghstack-source-id: 141782717

Test Plan: unit test

Reviewed By: cccclai

Differential Revision: D31976654

fbshipit-source-id: 255a23cff7c4f4039089de23b4da95772be48324
2021-10-29 13:42:59 -07:00
60a80c5bbd [jit] Move ModuleIndex operator to selective build. (#67483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67483

Move ModuleIndex operator to selective build candidates.
ghstack-source-id: 141953898

Test Plan: eyes

Reviewed By: qihqi

Differential Revision: D32003895

fbshipit-source-id: 635c2bc37cd30a98f4a1e182fd6534eb9f1c4a69
2021-10-29 13:31:35 -07:00
12ede84dbb [jit][edge] Enable lite interpreter to correctly handle INTERFACE_CALL instruction. (#65972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65972

ghstack-source-id: 141842336

Test Plan: buck test mode/dev //caffe2/test:mobile -- --exact 'caffe2/test:mobile - test_stacktrace_interface_call (mobile.test_lite_script_module.TestLiteScriptModule)'

Reviewed By: qihqi

Differential Revision: D31326147

fbshipit-source-id: 338ff4ce8ddc9502ffe0add49057b33b52a24955
2021-10-29 13:13:32 -07:00
d6b15bfcbd [jit][edge] Load interface methods to corresponding ClassTypes. (#65971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65971

ghstack-source-id: 141842335

We should be able to load methods into their ClassTypes. Right now the mobile runtime only loads data members into ClassTypes, but not methods. To support interface calls, we inject methods into ClassTypes when the methods are loaded.

Test Plan: existing tests should all pass.

Reviewed By: qihqi

Differential Revision: D31326146

fbshipit-source-id: fb1dbea619910ef1f8fa26146da3ebab348fe902
2021-10-29 12:48:57 -07:00
6259601c8a Set test owners for tests with unknown owners (#67552)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67552

Reviewed By: jbschlosser

Differential Revision: D32028248

Pulled By: janeyx99

fbshipit-source-id: a006f7026288b7126dba58b31cac28e10ce0fed6
2021-10-29 12:42:01 -07:00
c19cda5782 [skip ci] Add test owners for a special hi-pri class of tests (#67553)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

This change does require some context: there were several suggestions regarding what to do about this group of tests: tests that are core and crucial to all of PyTorch and are too broad to be owned by one team.
1. Let's add a "module: core" and put people behind it! This idea sounds appealing unless you are one of the people backing the label. From talking to albanD among others, this idea of putting all these core tests on the shoulder of a few people or one team isn't super fair and I have not yet found anyone willing to take on this job.
2. Taking advantage of the fact that we already have a triaging oncall that takes turns triaging issues, we can leave these tests essentially unlabeled and allow the oncall to triage these tests. Since these tests are crucial to PyTorch, we'll add the "high priority" label to mark them different from other unowned tests (see https://github.com/pytorch/pytorch/issues/67552).
3. I _could_ still create an unbacked label "module: core" and attribute these tests there, but I don't like the idea of creating a facade that the tests are "triaged" to a label when no one is actually taking a look.

Now we could potentially break these tests down into smaller files so that each piece _could_ be owned by a team, but 1. I don't know if this is currently feasible and 2. This approach does not prevent that from happening in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67553

Reviewed By: albanD

Differential Revision: D32025004

Pulled By: janeyx99

fbshipit-source-id: 1fb1aa4c27e305695ab6e80ae3d02f90519939c0
2021-10-29 12:17:21 -07:00
fcba8018c2 Update codeowners for sphinx conf (#67548)
Summary:
Add a codeowner for the conf file to ensure allowlist modification is monitored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67548

Reviewed By: jbschlosser

Differential Revision: D32023929

Pulled By: albanD

fbshipit-source-id: 63f18cdd725cc60993a6c0a9e3529ed95845e0bb
2021-10-29 10:50:15 -07:00
69f86ecd3a Sparse CSR CUDA: add torch.add with all inputs sparse (#63948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63948

This PR adds `torch.add(a, b, alpha=None, out=out)` variant with `a, b,
out` all being sparse CSR tensors.
The underlying cuSPARSE function works only with 32-bit indices, and in
the current implementation, the result tensor has 32-bit indices. Input
tensors can have both 64-bit and 32-bit indices tensors.

Fixes https://github.com/pytorch/pytorch/issues/59060
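
A minimal sketch of the new variant (assumes a CUDA build with cuSPARSE available):
```
import torch

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
a = torch.sparse_csr_tensor(crow, col, torch.tensor([1., 2., 3., 4.]), (2, 2), device="cuda")
b = torch.sparse_csr_tensor(crow, col, torch.tensor([5., 6., 7., 8.]), (2, 2), device="cuda")

out = torch.add(a, b, alpha=2.0)  # computes a + 2*b; a, b, out are all sparse CSR
print(out.to_dense())
```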

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31909731

Pulled By: cpuhrsch

fbshipit-source-id: 656f523e3947fec56b2f93c474fb6fd49f0360ca
2021-10-29 10:43:05 -07:00
285d5a55b9 Add API usage to torch.RPC (#67515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67515

Adding API usage to torch.rpc to better understand usage of this API.
ghstack-source-id: 141877028

Reviewed By: rohan-varma

Differential Revision: D32011465

fbshipit-source-id: 34d006ece307ae4a90fbcc6cb44fc0b7edca611e
2021-10-29 10:38:41 -07:00
ddc9bd335b Adds reference vs. noncontiguous OpInfo test (#67434)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63341.

This PR adds a new test, `test_noncontigous_samples`, that runs ops forward and backward and compares their outputs and grads between "normal" contiguous SampleInputs and noncontiguous SampleInputs. This test should preclude the need for noncontiguous SampleInputs going forward.

The test was added by generalizing the `.numpy()` transform on SampleInputs to support a new `.noncontiguous()` transform and copying forward/backward patterns from other tests in test_ops.py. It also discovered that many SampleInputs were incorrectly reusing tensors, so those have been revised. SampleInputs creating noncontiguous tensors for testing have also been altered to no longer do so.
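
A rough sketch of the idea behind the noncontiguous transform; `noncontiguous_like` is a hypothetical helper, not the PR's API:
```
import torch

def noncontiguous_like(t):
    # allocate twice the elements along the last dim, then stride over
    # every other one: same values as t, but noncontiguous
    expanded = torch.repeat_interleave(t, 2, dim=-1)
    return expanded[..., ::2]

x = torch.randn(3, 4)
nc = noncontiguous_like(x)
assert torch.equal(nc, x) and not nc.is_contiguous()
```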

In addition, this test discovered the following high priority silent correctness issues:

- https://github.com/pytorch/pytorch/issues/67432
- https://github.com/pytorch/pytorch/issues/67517
- https://github.com/pytorch/pytorch/issues/67513
- https://github.com/pytorch/pytorch/issues/67512
- https://github.com/pytorch/pytorch/issues/67470

It also identified the following issues:
- https://github.com/pytorch/pytorch/issues/67539

The pow OpInfo also incorrectly specified that pow supported the bool datatype, and this has been fixed. Its SampleInputs were written in a way that made requests for boolean SampleInputs return type-promoting inputs that never actually tried to compute pow in bool.

This PR suggests we should add the following guidance for writing SampleInputs:

- ensure that all SampleInputs are independent of each other (don't reuse tensors)
- ensure that all SampleInput tensors have no grad or backward functions (no autograd history) -- they should be leaves
- prefer keeping sample inputs simple where possible, a good set of handwritten samples that test interesting cases may be better than an exhaustive but hard to read and maintain programmatic enumeration
- keep code readable by using functools.partial and writing simple inline helpers; break up large statements into a more readable series of smaller statements; especially don't write complicated generator expressions with a `for` at the end!

fyi kshitij12345 krshrimali pmeier anjali411 saketh-are zou3519 dagitses

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67434

Reviewed By: ngimel

Differential Revision: D32014557

Pulled By: mruberry

fbshipit-source-id: b17e19adc1d41e24441f0765af13d381fef5e3c1
2021-10-29 09:55:56 -07:00
16d937b0df Fix strided _conv_double_backward() with 3D input / weight (#67283)
Summary:
Removes the 3D special case logic in `_convolution_double_backward()` that never worked.

The logic was never called previously since `convolution()` expands input / weight from 3D -> 4D before passing them to backends; backend-specific backward calls thus save the 4D version to pass to `_convolution_double_backward()`.

The new general `convolution_backward()` saves the original 3D input / weight, uncovering the bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67283

Reviewed By: anjali411

Differential Revision: D32021100

Pulled By: jbschlosser

fbshipit-source-id: 0916bcaa77ef49545848b344d6385b33bacf473d
2021-10-29 09:48:53 -07:00
bf31995194 Add OpInfo for nn.functional.cosine_embedding_loss (#67465)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67465

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D32001920

Pulled By: ejguan

fbshipit-source-id: 82e547b5f0057b4ecc61e6f3be56bf038db179d1
2021-10-29 09:11:23 -07:00
bcd301a457 Add OpInfo for nn.functional.ctc_loss (#67464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67464

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32001919

Pulled By: ejguan

fbshipit-source-id: f277a8e9c9887ed62e871e8a0c8549e853e34356
2021-10-29 09:11:21 -07:00
e2e20e79fb Add OpInfo for nn.functional.poisson_nll_loss (#67371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67371

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31973173

Pulled By: ejguan

fbshipit-source-id: 3cbb21d292b95039f7c7d1f4caa300f3d619740a
2021-10-29 09:11:18 -07:00
8b8fb4f4e6 Add OpInfo for nn.functional.gaussian_nll_loss (#67376)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67376

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31974040

Pulled By: ejguan

fbshipit-source-id: d6abac78a378d2763ca2fd465e64dea9985840f2
2021-10-29 09:11:16 -07:00
1d900ee22f Add OpInfo for nn.functional.hinge_embedding_loss (#67381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67381

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31976354

Pulled By: ejguan

fbshipit-source-id: 09068bb3d1bba665517254dd8a2dab9abd78b0e2
2021-10-29 09:11:14 -07:00
c6a6c09383 Add OpInfo for torch.nn.functional.gaussian_nll_loss (#67356)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67356

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31970077

Pulled By: ejguan

fbshipit-source-id: 91bd9c5202b49f79ef83795196c2773fbe8a9afd
2021-10-29 09:09:48 -07:00
2e156f649e Sort output of *NativeFunctions.h (#67046)
Summary:
This ensures deterministic output, allowing systems like ccache to be
more effective.

cc ezyang bhosmer bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67046

Reviewed By: VitalyFedyunin

Differential Revision: D31896114

Pulled By: bdhirsh

fbshipit-source-id: d29ef0cf6c7e3408b104c5239b620eaa24327088
2021-10-29 09:03:39 -07:00
f95ed474ac Norms Op Info (#67442)
Summary:
Adds op infos for group_norm, instance_norm, and local_response_norm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67442

Reviewed By: mruberry

Differential Revision: D31992225

Pulled By: samdow

fbshipit-source-id: 5bf3e21cff2a39ca3e47dbe13db7671c617aaad1
2021-10-29 08:36:07 -07:00
d58f209326 add dequantize support for fp16 + cuda (#67234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67234

Extends the dequantize fp16 function to also work on CUDA,
and adds a test.

Test Plan:
```
python test/test_quantization.py TestQuantizedTensor.test_dequantize_fp16_cuda
python test/test_quantization.py TestQuantizedTensor.test_dequantize_fp16_cpu
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31915330

fbshipit-source-id: 622d47464fae26bf02f295ff56df63a3bf80b786
2021-10-29 07:58:38 -07:00
99282126dc pytorch quantization: document the custom module APIs (#67449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67449

Adds a description of what the current custom module API does
and API examples for Eager mode and FX graph mode to the main
PyTorch quantization documentation page.

Test Plan:
```
cd docs
make html
python -m http.server
// check the docs page, it renders correctly
```

Reviewed By: jbschlosser

Differential Revision: D31994641

Pulled By: vkuzo

fbshipit-source-id: d35a62947dd06e71276eb6a0e37950d3cc5abfc1
2021-10-29 05:22:17 -07:00
acdc754918 [quant][graphmode][fx] Add support for ObservationType.OUTPUT_SHARE_OBSERVE_WITH_INPUT in backend_config_dict (#67210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67210

`OUTPUT_SHARE_OBSERVE_WITH_INPUT` is an observation type for operators whose output shares the same observer/fake_quant instance as their input. When quantized, these ops can take a quantized Tensor as input and output a quantized Tensor with the same quantization parameters (scale/zero_point, etc.) as the input.
Using cat as an example in this PR. Other ops can be added later gradually (together with tests).

Test Plan:
python test/fx2trt/test_quantize_fx.py TestQuantizeFxTRTOps.test_cat

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31907243

fbshipit-source-id: 2c7af4a456deb5e6597b0b9cd4e32c5fcdec580b
2021-10-29 02:37:48 -07:00
2bb20c0e48 [quant] Move test file to fx2trt folder (#67129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67129

Since the tests depend on an experimental feature (fx2trt), we'll move them to the fx2trt folder.

Test Plan:
python test/fx2trt/test_quantize_fx.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31877123

fbshipit-source-id: 5a98a257c4806c1911cfc2616d5ad98d715234c4
2021-10-28 23:58:44 -07:00
5e46a4f6bd Fixes to make trt timing_cache really work (#67524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67524

We have some loose ends to tie to make timing cache really work. This diff fixes them.

Reviewed By: wushirong

Differential Revision: D32012021

fbshipit-source-id: 1e93c76d48a3740a02613e1f19222ed92cca9deb
2021-10-28 23:09:24 -07:00
96c868217c [deploy] fix TypedStorage serialization (#67499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67499

Since https://github.com/pytorch/pytorch/pull/62030 was landed, storages produced when loading from a pickle are of type TypedStorage. We weren't catching this in our deploy serialization, causing tensors to actually get pickled instead of the storages being shared across interpreters.

Since this is technically still correct, it wasn't caught by any of our tests until someone tried to pass a really big tensor and started OOMing.
ghstack-source-id: 141869521

Test Plan: added unit test

Reviewed By: shunting314

Differential Revision: D32004075

fbshipit-source-id: ef5a80cd3cb1dff0b6b4c1b6c95923e4faab7d50
2021-10-28 22:33:04 -07:00
4052393af8 Revert D31450501: Wextra caffe2/
Test Plan: revert-hammer

Differential Revision:
D31450501 (7c2d3e6d32)

Original commit changeset: 702728fdb3c5

fbshipit-source-id: 486b8e872c38415706288f7f389d7cb1ea5eb0a9
2021-10-28 20:43:28 -07:00
18807273cb Fix Ads build broken due to comparison type mismatch (#67526)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67526

Build error P465285570 due to D31942093 (0032fa7725)

(Note: this ignores all push blocking failures!)

Test Plan: build passes after fix

Reviewed By: jbschlosser

Differential Revision: D32013247

fbshipit-source-id: b60a65afd7a5a2d3249150fbc2ac52374d62a591
2021-10-28 20:42:13 -07:00
26241994b2 Remove the argument strip_doc_string of export() method entirely. (#66615) (#67278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67278

Remove the argument strip_doc_string of export() method entirely.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962512

Pulled By: malfet

fbshipit-source-id: 168ad3f157a80d1edd7a9053783b3f3deb2ecf43

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-10-28 19:25:07 -07:00
43d51254bf Deprecate the argument _retain_param_name of export() method entirely. (#66617) (#67277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67277

Remove the argument _retain_param_name of export() method entirely.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962514

Pulled By: malfet

fbshipit-source-id: 8ac5e3a4a7821cc580951a7f167fd20069116350

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-10-28 19:25:05 -07:00
40920185ac [ONNX] Remove the argument enable_onnx_checker of export() method entirely. (#66611) (#67276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67276

[ONNX] Remove argument _retain_param_name from torch.onnx.export() function.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962520

Pulled By: malfet

fbshipit-source-id: 86ee15f525261c0da74175e47dd74eeb169ac47f

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-10-28 19:25:03 -07:00
609da98154 [ONNX] Update value name copying logic for onnx (#66170) (#67275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67275

Specifically targets the symbolic functions that directly return the input as output. The old logic overrides the value name with the output value name. But since the input is unmodified and unchanged, it is more logical to keep its original input name, especially for cases where inputs come directly from model parameters.

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962517

Pulled By: malfet

fbshipit-source-id: 9cb4a2bb55aa08dd1ce8fdec24e7cfb11d7ea97c

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-10-28 19:23:55 -07:00
7c2d3e6d32 Wextra caffe2/ (#67319)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67319

Test Plan: Sandcastle

Reviewed By: pbelevich

Differential Revision: D31450501

fbshipit-source-id: 702728fdb3c5b00510ec637ff65bb2c6949fcc4e
2021-10-28 19:02:34 -07:00
d8bde98f36 Workaround the bug of TRT which creates extra outputs (#67327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67327

Under certain conditions, TRT will create extra outputs, which seems more like a bug. If we don't capture those hidden outputs, we won't allocate memory to host them, and TRT will end up writing to illegal memory. This diff addresses the issue by capturing the hidden outputs and allocating proper memory for them.

Reviewed By: jianyuh, wushirong, 842974287

Differential Revision: D31955379

fbshipit-source-id: c9faaf91ed45bec8e0bc4e0bea812a0a3f2abad0
2021-10-28 18:43:59 -07:00
fc82ad186a Add Initial NNC Dynamic Shapes Flow (#66136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66136

FOR REVIEWERS: this is ready to review; the test failures come from somewhere else in the stack.

Takes in a TensorExprGraph of static shapes and generalizes the input shapes
to symbolic dimensions. Dimensions of value 1 will be preserved, otherwise
dimensions with the same value will be bucketed to the same symbolic shape.

E.g. `Tensor(5, 3), Tensor(3, 1) -> Tensor(SS(-1), SS(-2)), Tensor(SS(-2), 1)`

From there, runs symbolic shape inference on the graph, and creates a
versioning if in the graph with prim::TensorExprDynamicGuard checking whether
the inputs at runtime match the generalized symbolic shapes that are inputs
to the TE kernel. The computation to calculate all symbolic dimensions is
inlined into the if block with the TE kernel. All symbolic dimension Value*s
are appended to the end of the TE kernel Graph/Node inputs, and the Node is
augmented with an integer list attr `symbolic_shape_inputs` that gives the
mapping from Value* -> symbolic shape int64_t value. For more lengthy IR
examples and a walkthrough, look at ShapeAnalysisTest.DynamicShapesFusion in
`test_shape_analysis`. Returns True on success, False on failure; it can fail
if shape propagation fails to propagate the number of dims or if complete
shapes on inputs are not set.

Example transformation
```
graph(%x_inp : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y_inp : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z_inp : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %3 : Tensor = prim::TensorExprGroup_0(%x_inp, %y_inp, %z_inp)
  return ()
with prim::TensorExprGroup_0 = graph(%x.1 : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y.1 : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %3 : int = prim::Constant[value=0]()
  %4 : Tensor = aten::tanh(%x.1)
  %5 : Tensor = aten::erf(%4)
  %6 : Tensor = aten::relu(%y.1)
  %7 : Tensor[] = prim::ListConstruct(%5, %6)
  %8 : Tensor = aten::cat(%7, %3)
  %9 : Tensor = aten::hardswish(%8)
  %10 : Tensor = aten::mul(%9, %z)
  return (%9)
```
->

```
  graph(%x_inp : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y_inp : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z_inp : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %4 : bool = prim::TensorExprDynamicGuard[types=[Float(SS(-2), SS(-3), strides=[5, 1], requires_grad=0, device=cpu), Float(SS(-4), SS(-3), strides=[5, 1], requires_grad=0, device=cpu), Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)]](%x_inp, %y_inp, %z_inp)
  %5 : Tensor = prim::If(%4)
    block0():
      %15 : int[] = aten::size(%x_inp)
      %16 : int[] = aten::size(%y_inp)
      %17 : int = prim::Constant[value=1]()
      %18 : int = prim::Constant[value=0]()
      %elem.3 : int = aten::__getitem__(%15, %18) # <string>:40:10
      %elem.5 : int = aten::__getitem__(%15, %17) # <string>:40:10
      %elem.11 : int = aten::__getitem__(%16, %18) # <string>:40:10
      %cat_dim_size.48 : int = aten::add(%elem.3, %elem.11) # <string>:321:29
      %3 : Tensor = prim::TensorExprGroup_0[symbolic_shape_inputs=[-5, -4, -3, -2]](%x_inp, %y_inp, %z_inp, %cat_dim_size.48, %elem.11, %elem.5, %elem.3)
      -> (%3)
    block1():
      %14 : Tensor = prim::FallbackGraph_1(%x_inp, %y_inp, %z_inp)
      -> (%14)
  return ()
  with prim::TensorExprGroup_0 = graph(%x.1 : Float(SS(-2), SS(-3), strides=[5, 1], requires_grad=0, device=cpu),
        %y.1 : Float(SS(-4), SS(-3), strides=[5, 1], requires_grad=0, device=cpu),
        %z : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu),
        %SS_5 : int,
        %SS_4 : int,
        %SS_3 : int,
        %SS_2 : int):
    %3 : int = prim::Constant[value=0]()
    %4 : Tensor(SS(-2), SS(-3)) = aten::tanh(%x.1)
    %5 : Tensor(SS(-2), SS(-3)) = aten::erf(%4)
    %6 : Tensor(SS(-4), SS(-3)) = aten::relu(%y.1)
    %7 : Tensor[] = prim::ListConstruct(%5, %6)
    %8 : Tensor(SS(-5), SS(-3)) = aten::cat(%7, %3)
    %9 : Tensor(SS(-5), SS(-3)) = aten::hardswish(%8)
    %10 : Tensor(SS(-5), SS(-3)) = aten::mul(%9, %z)
    return (%9)
```

Test Plan: Imported from OSS

Reviewed By: navahgar, anjali411

Differential Revision: D31797466

Pulled By: eellison

fbshipit-source-id: b508d2f5baef6e8e4020955ab1d4bc4b9c7bdfdd
2021-10-28 17:09:03 -07:00
2661507488 Adding support for Symbolic Shapes in Inplace Ops #65642 (#65729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65729

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31961857

Pulled By: Gamrix

fbshipit-source-id: bfb1e8a66be254638e8e93ade091ab9df6029e8c
2021-10-28 16:49:10 -07:00
d0bc01fac2 ci: Migrate hardcoded docker builds to GHA (#67455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67455

Migrates docker builds that don't have dependent jobs within the pytorch
repository to our new GHA docker build job

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D31997671

Pulled By: seemethere

fbshipit-source-id: 9d6f58fa8ea8731cf12457fe64dc65e70f3d9f25
2021-10-28 14:50:05 -07:00
6696c59af4 Adding optimizer attribute to SequentialLR (#67406)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67318 :)
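
A minimal sketch of the fix's effect, assuming the standard SequentialLR signature:
```
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
warmup = torch.optim.lr_scheduler.ConstantLR(opt, factor=0.5, total_iters=2)
decay = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
sched = torch.optim.lr_scheduler.SequentialLR(opt, schedulers=[warmup, decay], milestones=[2])

assert sched.optimizer is opt  # the attribute this change adds
```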

cc albanD, datumbox

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67406

Reviewed By: jbschlosser

Differential Revision: D31997873

Pulled By: albanD

fbshipit-source-id: f579fb886d049a545673fd92ef5892fcf501bcc6
2021-10-28 14:43:40 -07:00
354363b57a [SR] Native implementation for aten::size (#67346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31965159

fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
2021-10-28 14:18:17 -07:00
9f01937caf [PyTorch][easy] Deduplicate memory planner creation code (#67265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67265

Avoid repeating this initialization code.
ghstack-source-id: 141585971

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31933368

fbshipit-source-id: 6342ae9bb82c4d152a427bad142470c3d162de69
2021-10-28 14:13:43 -07:00
82c356505f Revert D31894777: [pytorch][PR] Replace issue templates with new issue forms
Test Plan: revert-hammer

Differential Revision:
D31894777 (62feadd76f)

Original commit changeset: fbd39f7ed4ca

fbshipit-source-id: 4698ff5fe8629f9ad0249425a369c6f0bd89c891
2021-10-28 13:52:43 -07:00
afb8434440 [SR] Native implementation for aten::view (#67341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31962589

fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
2021-10-28 13:37:46 -07:00
60472594e1 [jit][edge] Implement torch::jit::Function for mobile function. (#65970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65970

ghstack-source-id: 141842338

mobile::Function should inherit from jit::Function because, for interface call support, we need an abstract jit::Function type stored in the corresponding ClassTypes, so that we can look up methods there. Previously mobile::Function was implemented separately, which prevented this. Since we got rid of all the unneeded virtual methods from jit::Function, we can inherit from torch::jit::Function without too much cost.

NOTE that torch::jit::Function is already in dependency because we need it to support custom class call. We should be able to use Function uniformly without looking into whether it's a builtin function or mobile::Function.

Test Plan: no behavior change.

Reviewed By: iseeyuan, mrshenli

Differential Revision: D31326148

fbshipit-source-id: 36caeaf3c8c5f54c23a1a7c8c9e2fd6e78b19622
2021-10-28 13:33:30 -07:00
5ef62c88a9 [jit] Replace get_executor() with call() in abstract Function interface. (#65969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65969

ghstack-source-id: 141759210

Test Plan: no behavior change.

Reviewed By: anjali411

Differential Revision: D31326151

fbshipit-source-id: 201f6dc4c23fdb2531f6b8c73d26127f9e212de4
2021-10-28 13:11:29 -07:00
8363da3f92 [SR][C2][easy] Benchmarks report # of ops (#67436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67436

This information is useful for comparing static runtime to c2

Reviewed By: d1jang

Differential Revision: D31991571

fbshipit-source-id: eb83bc4564b05d56fb9a550863eea3f6312f3f6c
2021-10-28 13:03:09 -07:00
b8f07689f2 [ROCm] Enable frexp support for ROCm builds (#67226)
Summary:
The frexp function has been enabled in ROCm code.  Updating PyTorch
to enable this functionality.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67226

Reviewed By: jbschlosser

Differential Revision: D31984606

Pulled By: ngimel

fbshipit-source-id: b58eb7f226f6eb3e17d8b1e2517a4ea7297dc1d5
2021-10-28 12:42:09 -07:00
0795735351 [jit] Clean up unneeded virtual methods from Function interface. (#65968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65968

tryToGraphFunction() should cover all cases and is more composable than
ad-hoc virtual methods.
ghstack-source-id: 141759214

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D31326154

fbshipit-source-id: 692a35df424f7d4f777a96489c4cbb24b3ae7807
2021-10-28 12:28:48 -07:00
bd5e6fe5ac Skip complex128 dtype for test_addmm_sizes_all_sparse_csr Windows test (#67453)
Summary:
Windows CUDA 11.1 periodic CI is failing. See https://github.com/pytorch/pytorch/pull/63511#issuecomment-953940183.
I don't understand though why periodic-win-vs2019-cuda11.1-py3 was triggered on the PR, but no test from `test_sparse_csr.py` were run https://github.com/pytorch/pytorch/runs/3975200820?check_suite_focus=true.

cc nikitaved pearu cpuhrsch IvanYashchuk mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67453

Reviewed By: malfet, seemethere, janeyx99

Differential Revision: D31997574

Pulled By: cpuhrsch

fbshipit-source-id: ae8bfb6da865014f39e6ad5675eb17e5a4d39744
2021-10-28 12:24:46 -07:00
5b8b2382d1 Mark mv as CompositeExplicitAutograd (#67373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67373

From the implementation of mv, it's decomposed into addmv. So it should
be a CompositeExplicitAutograd op.
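
Illustrating the decomposition numerically (not the actual dispatcher path):
```
import torch

A = torch.randn(3, 4)
x = torch.randn(4)

# mv decomposes into addmv with a zeroed accumulator: beta=0 leaves only A @ x
out = torch.addmv(torch.zeros(3), A, x, beta=0, alpha=1)
assert torch.allclose(torch.mv(A, x), out)
```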

Test Plan: It shouldn't change any behaviors. So, CI.

Reviewed By: bdhirsh

Differential Revision: D31973265

Pulled By: alanwaketan

fbshipit-source-id: 3b6850f08e6f671b908a177f148cc6194baa8a13
2021-10-28 11:59:00 -07:00
f3aae62942 Port tril and triu to structured kernels (#67055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67055

This PR ports `tril` and `triu` operations to structured kernels.
ghstack-source-id: 141797608

Test Plan: Extended the existing unit tests.

Reviewed By: wanchaol

Differential Revision: D31844638

fbshipit-source-id: 03ea4aa2410b042cafc3c5397e777a9ca5351b39
2021-10-28 11:45:58 -07:00
4a1f73ccb3 [qnnpack] Remove asymmetrical padding parameters in qnnpack (#67102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67102

Getting rid of the top/bottom and left/right distinction, replacing them with height and width. These parameters are widely used in qnnpack and are always passed together but never differ. PyTorch doesn't support asymmetrical paddings either, so I see no potential use for this.
ghstack-source-id: 141334544

Test Plan: qnnpack unit tests

Reviewed By: kimishpatel

Differential Revision: D31863370

fbshipit-source-id: aa57490399e23d6139b2ad7b745139752acd7848
2021-10-28 11:40:19 -07:00
4e873d6799 Formatting changes (#66257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66257

Used `clang-format -i` for these two files.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31762737

Pulled By: H-Huang

fbshipit-source-id: e94e301d0b013dbb8f2cef19ff140bac5811738f
2021-10-28 11:36:00 -07:00
cee4e8f35d Add FlexiBLAS build support per #64752 (#64815)
Summary:
To enable building torch+dependencies, set WITH_BLAS=flexi BLAS=FlexiBLAS

Fixes https://github.com/pytorch/pytorch/issues/64752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64815

Reviewed By: jbschlosser

Differential Revision: D31997745

Pulled By: albanD

fbshipit-source-id: db208d59002f5896608a03132616400f09d972aa
2021-10-28 11:28:00 -07:00
55b7387e45 Timing cache for TensorRT (#67214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67214

This is a draft for creating a timing cache for TensorRT.

Reviewed By: yinghai, 842974287

Differential Revision: D31783757

fbshipit-source-id: 211ab68df0832120fa637304e4a7ece80d26f9b1
2021-10-28 11:21:51 -07:00
0032fa7725 Add a Functionalization pass in core (#64432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64432

Original PR description + feedback here: https://github.com/pytorch/pytorch/pull/63048

I've addressed all of the feedback in the original PR and made some pretty large changes, listed below.

**Table of Contents**
- Starting points
- List of the main changes from the original PR
- Next Steps
- Example codegen output (for a view, mutation, and view+mutation op)

**Starting Points**

A good place to start when looking through the PR:
* Alban mentioned that this is a useful mental model (thanks Ed for originally making this clear to me). Semantically, the pass currently does THREE things, which are all needed by functorch - all fused together into one big pass.
  * (a) alias removal, which replaces {view} calls with {view}_copy calls, and manually tracks aliasing information, so that when one tensor is mutated, we re-apply the same mutation to all of the aliases. This is the bulk of the work - once this is done, the next 2 things are trivial to implement.
  * (b) mutation removal, which is easy to do once we know that there are no aliases. Every mutation `a.add_(b)` becomes `a.replace_(a.add(b))`
  * (c) reapplying views: all of the `{view}_copy` calls are replaced with `{view}` calls again. This is an optimization that we can make specifically for functorch (and strided backends), that only care about mutation removal and not alias removal
  * XLA and Vulkan only want (a), or (a) + (b). Later, we'll want to split this out so that you can actually opt into different versions of this logic.
  * There is currently no {view}_copy replacement, because the pass just <replace views with copies> and <replace copies with views> steps have been combined. Later, we'll want to actually implement {view}_copy variants of each view operator, probably with codegen.
* documentation breadcrumb 1, in `FunctionalTensorWrapper.cpp`: https://github.com/pytorch/pytorch/pull/64432/files#diff-a0bac99bf205dba5b94cb64fc2466d3d55d991887572f9cd6a02e27b3a91dd60R59 (you might have to expand the `FunctionalTensorWrapper.cpp` file, which GitHub closes by default because it's large)
* documentation breadcrumb 2, in `FunctionalTensorWrapper.h`: https://github.com/pytorch/pytorch/pull/64432/files#diff-c945c71a4ccac65871f24a912e8904f9a5088b24a32e636727ea9c8fe920708aR12
* Reading through the codegen output at the bottom of this description.
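
As a toy illustration of steps (a) and (b) above, here is what the pass does to a tiny program, written as pure-Python pseudocode (a sketch of the semantics, not the pass's actual output):
```
import torch

def f(a):
    b = a.view(-1)  # b aliases a
    b.add_(1)       # mutation observed through the alias
    return b

# after (a) alias removal and (b) mutation removal, semantically:
def f_functionalized(a):
    b = a.view(-1).clone()  # {view} -> {view}_copy: b no longer aliases a
    b = b.add(1)            # b.add_(1) -> b = b.add(1)
    # the pass tracked that a and b were aliased, so a's current value is
    # recomputed from the mutated b before a is read again
    a = b.view(a.shape)
    return b
```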

**Main changes from the original PR**

(1)  I use lambdas instead of a giant enum to handle all of the different views.

This results in less boilerplate per view op (and more stuff that can be codegen'd). Every `ViewMeta` object now contains a `forward` and `reverse` lambda, that knows how to replay the view and its inverse. This makes the actual code that executes the replaying logic a lot less boilerplate-y (see `Alias::sync_update_operations` and `FunctionalTensorWrapper::sync_`)

(2) Every tensor during the functionalization pass is always wrapped in a `FunctionalTensorWrapper`.

This is potentially unnecessary for Vulkan/XLA, and will have a mild perf impact, but for now this PR just targets the functorch use case. I previously had a complicated design (a `FunctionalTensorImplBase` class) to avoid needing the wrapper for XLA, but it had some subtleties that are gonna require more thought to fix, so I'm pushing that off for now.

(3) `FunctionalTensorWrapper` objects accurately report stride information.

It's a little annoying to do this though, because the logic that calculates stride info for each view isn't easily separated from the actual view kernels in core, `at::native::{view}`. I do this by adding logic in each `at::functionalization::{view}` kernel to call the reference implementation `at::native::{view}`. I don't do anything with the output aside from taking its size/stride/storage_offset to set the actual output tensor's size/stride/storage_offset correctly. There's another annoying part to this: I'm pretty sure that we want to pass in the actual *wrapper* tensors directly into the native kernels, not their inner unwrapped values. But there are some `at::native::{view}` kernels that call other tensor methods, which re-invokes the dispatcher, calling functionalization/functorch kernels that try to do the unwrapping.

To do this, right now I have an `AutoDispatchDirectlyToNative` guard that ensures that any tensor methods called inside of the `at::native::{view}` op always redispatch straight to the CPU kernel (which will be another `at::native::` kernel). This feels kind of heavy-handed, but I'm not sure of a better way to do it.

(4) `FunctionalTensorWrapper` objects accurately report aliasing information.

There's a new `FunctionalStorageImpl` class (subclass of `StorageImpl`) that allows tensors in the functionalization pass to accurately alias storage. If two tensors `a` and `b` in a functionalized program are views of one another, then `a.storage.is_alias_of(b.storage)` should return true. I added this in a pretty similar way to how meta tensors allocate storage, although I don't pass in an actual allocator (I think this is fine because you should never resize a functional tensor's storage).
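
Concretely, the invariant looks like this (a sketch; it assumes the functionalization pass is active, so both tensors are `FunctionalTensorWrapper`s backed by the same `FunctionalStorageImpl`):

```
#include <ATen/ATen.h>

void check_aliasing() {
  at::Tensor a = at::ones({4});
  at::Tensor b = a.view({2, 2});  // b is a view of a
  // Inside a functionalized program, both wrappers share one FunctionalStorageImpl:
  TORCH_INTERNAL_ASSERT(a.storage().is_alias_of(b.storage()));
}
```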

One thing I'm not sure about: should `FunctionalTensorWrapper` set `storage_access_should_throw_` (a) always, (b) never, or (c) only if its wrapped tensor has it set?

Right now I have it not set, mostly because calling the reference view functions (`at::native::{view}`) requires looking at the storage. But that means that if you try to access storage from python in a functionalized program, you'll get silent garbage instead of an error. Related question: are we planning on exposing meta tensor storage to python in the future (even though it contains garbage)?

(5) better docs :)

**View operator coverage**

(6) The functionalization pass now gets math-composite view ops for free.

I didn't add the `Functionalize` dispatch key to the composite set, because I don't want composite ops like `torch.ones` to get decomposed before hitting the functionalization pass. Instead, I added codegen to manually register the `at::native::` kernels of composite view ops. This is a little hairy, because the names of the `at::native::` kernels aren't easily accessible. They're stored in a `Dict[DispatchKey, BackendIndex]`. I made a best-effort attempt to get each view kernel's name, basically by assuming that every view op has either a composite or cpu implementation.
There's also a hardcoded list of composite view ops in `gen_inplace_or_view_type.py`, but it looks like it's wrong. This is probably worth rationalizing later; for now I created a new list of the "complete" set of composite view ops, and preserved the old set by hardcoding the delta between the two sets.

(7) I've added codegen for ops that are both views AND mutations, like `transpose_()` (why do we even have these 😢).

From some light testing, it looks like they work correctly, with one caveat: I had a hard time ensuring that functorch programs that mutate their inputs using ops like `transpose_()` preserve the input mutations after the program finishes running. For now (in my corresponding functorch branch) I emit a warning when this happens and just don't preserve the mutation.

(8) I added `{view}_inverse` implementations for every view op, in `FunctionalInverses.cpp`.

These are needed to take mutations made to views and replay them back onto the base. To reduce boilerplate, the codegen generates function declarations for each `{view}_inverse` function, so you get a nice compiler error when someone eventually adds a new view op.
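
For intuition, the `{view}_inverse` for transpose might look something like this (a sketch using the signature from the codegen output below; the real implementation lives in `FunctionalInverses.cpp`):

```
#include <ATen/ATen.h>

// Sketch: transpose covers all of base, so its inverse just un-transposes the
// mutated view to recover the new base. Ops that only cover a subset of base
// (slice/select/diagonal/...) additionally need `base` to fill in the rest.
at::Tensor transpose_inverse_sketch(
    const at::Tensor& base,
    const at::Tensor& mutated_view,
    int64_t dim0,
    int64_t dim1) {
  (void)base;  // unused for transpose
  return mutated_view.transpose(dim0, dim1);
}
```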

The only view ops currently not supported are (a) as_strided, and (b) the sparse view ops (values()/indices()).

I can add support for as_strided, but it needs an `as_strided_inverse()` function. That will look really similar to the `as_strided_backward()` function in FunctionsManual.cpp, but it has some noticeable differences: we basically want an `as_strided_embed` for autograd and an `as_strided_scatter` for functionalization. We will also probably need them to be primitives w.r.t. autograd, since the current implementation for autograd uses view().copy_() calls that XLA won't be able to handle. I'm wondering if anyone has any objections; otherwise I can make those changes (which will require writing backward formulas for `as_strided_embed` and `as_strided_scatter`).

I did a bunch of manual testing that all looks pretty good, but it's definitely not fully tested. Ed pointed out that once XLA uses this pass (or at least once there's a POC), we can just run the existing xla view test suite. Hopefully that delay is okay - if it's not, maybe we can think about using OpInfos similar to how functorch uses them for testing.

Note: there's some duplication with autograd's view code. Every `{view}_inverse` implementation is really similar to the implementation for that view listed in `derivatives.yaml`. There are some major differences though:
* the autograd implementations over those backwards functions (like `permute_backwards()`, in `FunctionsManual.cpp`) internally call other view ops. For functionalization, we want them to (eventually) call `{view}_copy` operators.
* For view ops that take a subset of the original storage, like `slice/select/diagonal/as_strided()`, the autograd backward functions fill the "spaces" in the inverse call with zeroes. For functionalization, we want to fill them with the value of `base` at those positions (see the sketch after this list). It looks like this currently applies to 6 total ops (since we can ignore composites):
  * select
  * slice
  * diagonal
  * as_strided
  * split
  * split_with_sizes
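
Here's a sketch of that zero-fill vs. base-fill difference for `slice` (illustrative names only; the autograd version mirrors `slice_backward`, the functionalization version mirrors the `slice_scatter` semantics added earlier in this stack):

```
#include <ATen/ATen.h>

// Autograd-style inverse: positions outside the slice are filled with zeros.
at::Tensor slice_inverse_autograd(
    const at::Tensor& grad, at::IntArrayRef base_sizes,
    int64_t dim, int64_t start, int64_t end, int64_t step) {
  auto result = at::zeros(base_sizes, grad.options());
  result.slice(dim, start, end, step).copy_(grad);
  return result;
}

// Functionalization-style inverse: positions outside the slice keep base's values.
at::Tensor slice_inverse_functionalization(
    const at::Tensor& base, const at::Tensor& mutated_view,
    int64_t dim, int64_t start, int64_t end, int64_t step) {
  auto result = base.clone();
  result.slice(dim, start, end, step).copy_(mutated_view);
  return result;
}
```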
A nice end state would probably be for the autograd + functionalization codegen to both look at the same yaml (either `derivatives.yaml`, or something else), and automatically generate the right thing. I didn't leave that in scope for this PR though.

**Current State + Next Steps**

There are a bunch of followups after this PR eventually lands. Roughly in order:
* Use the current pass to register problematic composite ops in functorch. Also, nested `functionalize()` calls aren't supported yet (I mostly just need to remove some debug asserts and test it).
* Work on freeing up dispatch key space by deduplicating the `{backend}`/`Autograd{backend}`/`Sparse{backend}`/`Quantized{backend}` keys
* Once we have more dispatch keys, split up this pass into 3 pieces - it's currently fused, and doesn't do the right thing for vulkan/XLA. Specifically, all of the `{view}` calls in the current pass's view-replay logic should turn into `{view}_copy` calls that vulkan/XLA know how to implement, and there will be separate passes for (a) removing mutations, and (b) turning `{view}_copy` calls back into `{view}` calls. For Vulkan, we eventually want a pass that ONLY removes aliasing and view calls, and doesn't remove mutations. We can also probably give the 2 new passes user dispatch keys to save dispatch key space, if they'll only be used by functorch anyway.
* Do a deeper dive on perf for the vulkan/xla use cases. There are several areas to improve perf with varying levels of effort required. The simplest one that I'll probably do regardless is to codegen the out-of-place kernels instead of using a boxed fallback. Getting a POC working for xla will also be useful to test the view operator coverage.

**Example Codegen Output**

View Op:
```
::std::vector<at::Tensor> split_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, int64_t split_size, int64_t dim) {

      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      ::std::vector<at::Tensor> out;
      {
        at::AutoDispatchBelowFunctionalize guard;
        auto tmp_output = at::redispatch::split(ks & c10::after_func_keyset, self_, split_size, dim);
        out = at::functionalization::impl::wrapFunctionalTensor(tmp_output);
        // I'm fusing the [alias removal], [mutation removal], [add views back] passes together.
        // Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal).
      }

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [split_size, dim](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.split(split_size, dim)[mutated_view_idx];
        },
        [split_size, dim](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::split_inverse(base, mutated_view, mutated_view_idx, split_size, dim);
        }
      );
      at::functionalization::impl::set_view_meta(out, self, view_meta);

      at::AutoDispatchDirectlyToNative native_guard;
      ::std::vector<at::Tensor> reference_tensor_output = at::native::split(self, split_size, dim);
      at::functionalization::impl::set_strides(out, reference_tensor_output);
      return out;

}
```

Mutation Op:
```
at::Tensor & add__Tensor(c10::DispatchKeySet ks, at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {

      at::functionalization::impl::sync(self);
      at::functionalization::impl::sync(other);
      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      auto other_ = at::functionalization::impl::unwrapFunctionalTensor(other);
      at::Tensor tmp_output;
      {
          at::AutoDispatchBelowFunctionalize guard;
          // The functionalization pass explicitly doesn't pass out= parameters to the redispatch
          tmp_output = at::redispatch::add(
            ks & c10::after_func_keyset, self_, other_, alpha);
      }

      self.replace_(tmp_output);
      at::functionalization::impl::maybe_add_update(self);
      return self;
}
```

View + Mutation Op:
```
at::Tensor & transpose_(c10::DispatchKeySet ks, at::Tensor & self, int64_t dim0, int64_t dim1) {

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [dim0, dim1](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.transpose(dim0, dim1);
        },
        [dim0, dim1](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::transpose_inverse(base, mutated_view, dim0, dim1);
        }
      );
      at::functionalization::impl::mutate_view_meta(self, view_meta);
      // See  Note [Propagating strides in the functionalization pass]
      // Directly update the sizes/strides/storage_offset fields on self using the inplace call.
      // I need the guard because I don't want the at::native kernel to end up calling more functionalization/functorch kernels.
      // Its only job is to directly compute the output size/stride/storage_offset metadata.
      at::AutoDispatchDirectlyToNative native_guard;
      at::native::transpose_(self, dim0, dim1);
      return self;

}
```

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942093

Pulled By: bdhirsh

fbshipit-source-id: b95598dae35dd1842fa8b1d8d1448332f3afaadf
2021-10-28 10:51:17 -07:00
b0a8ca2cb5 add tags for inplace view ops in native_functions.yaml (#65412)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65412

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942094

Pulled By: bdhirsh

fbshipit-source-id: 1f7f6ea7df13e9f91b81ed64088e35e471800aa8
2021-10-28 10:51:15 -07:00
03f3a0331b add slice/select/diagonal_scatter variants as primitive ops (#64430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64430

The functionalization pass needs `{view}_scatter` versions of the slice/select/diagonal ops in order to correctly propagate mutations from a view to its base. On top of that, the implementations need to be primitive w.r.t. autograd, because they look something like `...slice().copy_()`, and the functionalization pass can't use views + mutations inside of its own alias-removal machinery!
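
As a reference point, the semantics of e.g. `diagonal_scatter` can be written like this (a sketch only; the actual op is implemented as a primitive precisely so that this view + `copy_` pattern is never exposed):

```
#include <ATen/ATen.h>

// Reference semantics: return a copy of `base` with its diagonal replaced by `src`.
at::Tensor diagonal_scatter_ref(
    const at::Tensor& base, const at::Tensor& src,
    int64_t offset = 0, int64_t dim1 = 0, int64_t dim2 = 1) {
  auto result = base.clone();
  result.diagonal(offset, dim1, dim2).copy_(src);
  return result;
}
```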

I added some basic tests that I tried to base off of existing tests for views (particularly around testing the derivative formulas), but I'm wondering if I should add something more comprehensive.

Also, as_strided fits into this category - the functionalization pass will need an `as_strided_scatter` op that's primitive w.r.t. autograd. I didn't add it for now, because it'll involve duplicating a bunch of logic from the current `as_strided_backward()` function, and also writing a derivative formula that I wasn't sure how to write :)

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942092

Pulled By: bdhirsh

fbshipit-source-id: c702a57c2748a7c771c14e4bcc3e996b48fcc4c8
2021-10-28 10:51:12 -07:00
665c148e42 move some codegen utilities into utils.py (#63094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63094

This PR:
- Moves `FileManager` and its dependencies (`assert_never` and other imports) to `utils.py`, and updates all of the call-sites with the fresh imports
- Passes the list of NativeFunction objects into `gen_trace_type` directly, instead of requiring the function to regenerate it (we already have it)

The purpose of the reshuffling is to avoid circular dependencies in the next PR, where I add codegen for the functionalization pass, which gets called from `gen.py` (but depends on some stuff from the autograd codegen - in particular, the list of view ops).

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942096

Pulled By: bdhirsh

fbshipit-source-id: 36118facae61f25f8922bb43ad2818c80b53504e
2021-10-28 10:49:17 -07:00
b100a9ea82 Back out "Make fb::sigrid_hash_compute_multipler_shift return a std::tuple<int64_t, int64_t>" (#67456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67456

There are some compatibility issues, we need to back-out before it gets to prod feed models

Test Plan: CI

Reviewed By: pgarbacki

Differential Revision: D31997684

fbshipit-source-id: 8b2584cb5d43e487719c6530d4178988fd03c455
2021-10-28 10:44:41 -07:00
a8f85300da [quant][graphmode][fx][test] Refactor test code for quant-fx2trt unit tests (#67070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67070

Test Plan:
python test/test_quantization.py TestQuantizeFxTRTOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31850124

fbshipit-source-id: a314b8869c091743dad7e5a1468985cf8aff6091
2021-10-28 10:39:58 -07:00
325b15039c Add FSDP tests to verify forward overlap and memory usage (#67117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67117

Add FSDP tests to verify forward overlap and memory usage
ghstack-source-id: 141783871

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D31845629

fbshipit-source-id: b8b747e036925a9bb9164f0a5546000eece8442a
2021-10-28 10:29:27 -07:00
938afa37a3 Remove process group barrier and all_reduce function calls from tensorpipe agent (#65946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65946

Add new function in agent_utils to perform a synchronization of active call counts using store. This is intended to replace the barrier and all_reduce used by the process group in RPC shutdown.

`test_ddp_comparison` and `test_ddp_comparison_uneven_inputs` fail with these changes. It seems like the RPC agents are not accessing the same store, so the total count of processes never reaches the world size to exit the barrier; we still need to investigate why this happens only for these test cases. Setting clean_shutdown to false ignores this code path, which allows the tests to pass.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31762736

Pulled By: H-Huang

fbshipit-source-id: cb5d0efe196f72726c63393c4293e97ec4f18548
2021-10-28 10:15:56 -07:00
0c93c8e39a Disable linux-xenial-cuda10.2 config (#67344)
Summary:
linux-xenial-cuda10.2 and linux-bionic-cuda10.2 are very similar, no
need to run both configs

Moved all auxiliary builds from xenial to bionic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67344

Reviewed By: seemethere, janeyx99

Differential Revision: D31964850

Pulled By: malfet

fbshipit-source-id: d07ce266c843c7fd69b281e678c4247b0bf6da20
2021-10-28 10:10:13 -07:00
6ed68f3f84 Document torch.jit.is_tracing() (#67326)
Summary:
This PR adds `torch.jit.is_tracing()` to the JIT API reference.
This function is widely used but left undocumented: https://github.com/search?q=torch.jit.is_tracing&type=code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67326

Reviewed By: tugsbayasgalan

Differential Revision: D31985251

Pulled By: Krovatkin

fbshipit-source-id: 852b432b08d63df8bd7a7a02c9555e61f5f37978
2021-10-28 09:56:09 -07:00
b27b1ff809 Fix deadlock when forward and backward AD are used at the same time (#67360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67360

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31973040

Pulled By: albanD

fbshipit-source-id: f9c75c6497b622c86e8653027bce45461304eff5
2021-10-28 09:11:36 -07:00
d3f03af496 Fix indentation in forward_grad.h (#67359)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67359

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31973039

Pulled By: albanD

fbshipit-source-id: 80ca7870ea35977560334aa65aa344da6847c039
2021-10-28 09:10:18 -07:00
6900aacf54 [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382

two simple updates:

* fix running the benchmark with --use_jit. Previously it would fail with this error:

  torch.jit.frontend.UnsupportedNodeError: import statements aren't supported:
  File "/proc/self/fd/3/bmm_test.py", line 9
  def __invoke_main():
    import ctypes
    ~~~~~~ <--- HERE
    import ctypes.util
    import errno

* add matmul to bmm benchmark as D31837588

Test Plan:
buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test --  --forward_only=True --mkl_num_threads=1 --omp_num_threads=1
 --use_jit=True

Reviewed By: ShijunK

Differential Revision: D31960528

fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f
2021-10-28 08:48:10 -07:00
eb8b80b76f Add test owners for elastic tests (#67293)
Summary:
Action following discussion with the distributed and r2p teams: the tests under elastic in distributed should be owned by oncall: r2p, not distributed.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293

Reviewed By: jbschlosser

Differential Revision: D31973779

Pulled By: janeyx99

fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55
2021-10-28 08:32:50 -07:00
2366948085 [LT] Add ir_util for ComputePostOrder (#67282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67282

Test Plan: `build/bin/test_lazy`

Reviewed By: wconstab, ngimel

Differential Revision: D31961754

Pulled By: desertfire

fbshipit-source-id: 28466588ece8057640a7202b8c79cc1a4357d373
2021-10-28 08:17:52 -07:00
6293e0ad61 update coverage ignore to not skip whole modules (#67395)
Summary:
This reduces the chance of newly added functions being ignored by mistake.

The only test that this impacts is the coverage test that runs as part of the python doc build. So if that one works, it means that the update to the list here is correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67395

Reviewed By: jbschlosser

Differential Revision: D31991936

Pulled By: albanD

fbshipit-source-id: 5b4ce7764336720827501641311cc36f52d2e516
2021-10-28 08:07:24 -07:00
961fd76a9a [ONNX] Relax check on Prim::PythonOp nodes for ONNX_FALLTHROUGH (#66172) (#67273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67273

* Relax check on Prim::PythonOp nodes for Onnx_fallthrough

* Add tests

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962521

Pulled By: malfet

fbshipit-source-id: 878920196d66c4f1dadaf3ebb9a7bf69b88849b4
2021-10-28 08:02:49 -07:00
02a78bdba7 [ONNX] Support conv-bn fusion in blocks (#66152) (#67272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67272

* Support conv-bn fusion in nested blocks

* avoid running script tests twice

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962513

Pulled By: malfet

fbshipit-source-id: 3ee79426542f9049cf62ac7b0c1be9d60ae6d014
2021-10-28 08:02:46 -07:00
9deb602726 [ONNX] Use Reciprocal operator instead of Div(1, x). (#65382) (#67271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67271

* [ONNX] Use Reciprocal operator instead of Div(1, x).

This is a more readable and perhaps more performant way to export
torch.reciprocal.

* Use Reciprocal in caffe to operator to import onnx

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962519

Pulled By: malfet

fbshipit-source-id: d926e75b1c8312b9a980c9a1207a1a93ba0c71e0

Co-authored-by: take-cheeze <takechi101010@gmail.com>
2021-10-28 08:01:21 -07:00
eea20bfa15 fixed type checking errors in fuse.py (#66799)
Summary:
Fixes [Issue#70](https://github.com/MLH-Fellowship/pyre-check/issues/70)
This PR fixes the type checking error that was found in fuse.py as follows:

torch/quantization/fx/fuse.py:34:13 Incompatible variable type [9]: fuse_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.

Signed-off-by: Onyemowo Agbo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66799

Reviewed By: 0xedward

Differential Revision: D31961462

Pulled By: onionymous

fbshipit-source-id: 7481afc07152ba13f3224e4ad198fd8e2c34c880
2021-10-28 07:45:28 -07:00
7da9c4ed2e [SR] NNC out variant for aten::where (#67255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255

Add an out variant for `aten::where`.

Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.
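
Roughly, the NNC lowering has this shape (a loose sketch; the buffer setup and exact `Compute` overload are assumptions about the tensorexpr API, not code from this diff):

```
#include <torch/csrc/jit/tensorexpr/tensor.h>

using namespace torch::jit::tensorexpr;

// Sketch: elementwise aten::where as a pointwise ifThenElse over 1-D buffers.
Tensor lowerWhere(const BufHandle& cond, const BufHandle& x,
                  const BufHandle& y, const ExprHandle& n) {
  return Compute("aten_where", {n}, [&](const VarHandle& i) {
    return ifThenElse(cond.load(i), x.load(i), y.load(i));
  });
}
```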

Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: navahgar

Differential Revision: D31923886

fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
2021-10-28 06:48:22 -07:00
3aadff651c [quant][embedding qat][bugfix] Fix and test QAT EmbeddingBag from_float error message (#66989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66989

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31961773

Pulled By: b-koopman

fbshipit-source-id: 0d28728c87751ffc696ac221c3e8e75ac923cc57
2021-10-28 06:29:20 -07:00
62feadd76f Replace issue templates with new issue forms (#65917)
Summary:
This PR introduces the new issue forms that replace issue templates.

This is similar to what was done in torchvision https://github.com/pytorch/vision/pull/4299 and torchaudio, you can see the end result here: https://github.com/pytorch/vision/issues/new/choose (click e.g. on the [bug report](https://github.com/pytorch/vision/issues/new?assignees=&labels=&template=bug-report.yml))

The main new thing is that we can enforce some of the fields to be filled, especially for bug reports. It's also a much cleaner GUI for users IMHO, and we can provide better examples and instructions.

There is still a "blank" template available.

I removed the "Questions" form: we say we close these issues anyway. I replaced it with a direct link to https://discuss.pytorch.org. Since we still have a "blank" template, I think this covers all previous use-cases properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65917

Reviewed By: VitalyFedyunin

Differential Revision: D31894777

Pulled By: NicolasHug

fbshipit-source-id: fbd39f7ed4cadab732d106d3166c04c451c31f94
2021-10-28 04:49:47 -07:00
6827d36c1a [Static Runtime][DI] Fuse list unpack and variadic_grouped_accessor_op (#66585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66585

Add a new op `static_runtime::fused_variadic_grouped_accessor_op` that outputs many tensors rather than a single tensor list. Incorporated this new op into `FuseListUnpack`. This eliminates `ListUnpack` overhead and tensor refcount bumps.

Test Plan:
**Accuracy Test**

Model 294738512_40 (manually confirmed that fusion happens)
```
get 2861 prediction values
get 2861 prediction values
max_error:  0  total:  0
```

Accuracy test with model 296213501_65 (has V2 op): passes with 0 errors.

**Performance**

TW replayer test w/ 800 QPS (stacked with D31482816 (72e25c9f4e)) shows 5% CPU decrease for storage tier.
Results:

{F673610679}

Reviewed By: hlu1

Differential Revision: D31620408

fbshipit-source-id: f05c298bcbce61a491b63d575af4aca746881696
2021-10-28 04:34:57 -07:00
90b722c544 specializeGradSumToSize patch - propagate profile_none through profile_ivalue (#63941)
Summary:
Simply propagate the profile_none_ value through profile_ivalue nodes inserted by nvfuser.

Without the propagation, profile_ivalue nodes inserted by other passes would block the optimization of no-op sum_to_size.

cc gmagogsfm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63941

Reviewed By: shunting314, cpuhrsch

Differential Revision: D31972765

Pulled By: Krovatkin

fbshipit-source-id: 4fa571a758e269b486c584f47c2a933de82d463b
2021-10-27 22:54:09 -07:00
fc664ac272 [sharded_tensor] easier initialization for Shard (#66351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66351

This adds the ability for users to just provide shard_offsets and, optionally, rank to construct a local shard, instead of needing a ShardedMetadata. Under the hood, we construct the ShardedMetadata by inferring shard_lengths and device from the local tensor.
ghstack-source-id: 141742410

Test Plan: test_local_shards

Reviewed By: pritamdamania87

Differential Revision: D31519919

fbshipit-source-id: 8f3b4682ffc74b79b41076f3f4b832f4cacda49d
2021-10-27 22:20:37 -07:00
71a67d0ce9 [sharded_tensor] simplify init_from_local_shards API (#64481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64481

This simplifies the `init_from_local_shards` API in sharded tensor, to only require the user to pass in a list of `Shard`s and `overall_size`, instead of a ShardedTensorMetadata. We do an all_gather inside to form a valid ShardedTensorMetadata instead.

TODO: add more test cases to improve coverage.
ghstack-source-id: 141742350

Test Plan: TestShardedTensorFromLocalShards

Reviewed By: pritamdamania87

Differential Revision: D30748504

fbshipit-source-id: 6e97d95ffafde6b5f3970e2c2ba33b76cabd8d8a
2021-10-27 22:19:20 -07:00
0117ada47c [quant][graphmode][fx] Add input_idx_to_dtype and ouptut_idx_to_dtype to backend_config_dict (#67067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67067

We plan to gradually add features to backend_config_dict; this PR adds support
for specifying the dtype for the input and output of a given pattern

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31849074

fbshipit-source-id: ca2fbb873176fe72e08ea79ed1bc659bf27cbd8a
2021-10-27 22:10:12 -07:00
e332d80299 [iOS][CoreML] Remove shape information from TensorSpec (#67412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67412

For inputs, we'll be using the shape from PyTorch tensors. For outputs, we'll be using the shape from MLMultiArray. Thus, we can decouple from the symbolic shapes defined in the compile spec.
ghstack-source-id: 141746346

Test Plan:
- Sandcastle
- buck test pp-ios

Reviewed By: hanton

Differential Revision: D31299408

fbshipit-source-id: 337d5bb9efc2ff51409586c288d607399b937212
2021-10-27 21:55:29 -07:00
04aba42ed7 [Core ML] Assign Core ML computationUnit to executor (#67411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67411

This was overlooked before.
ghstack-source-id: 141746345

Test Plan: buck test pp-ios

Reviewed By: hanton

Differential Revision: D31977097

fbshipit-source-id: f5ce9f7d58c3f35097caaa75f75310a89c918387
2021-10-27 21:55:27 -07:00
7e1a53cd5c [Core ML] Fix error messages (#67410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67410

As title
ghstack-source-id: 141537215

Test Plan: buck-test pp-ios

Reviewed By: hanton

Differential Revision: D31901372

fbshipit-source-id: 80ae1cf8cb67c0e2ca276e21cc80b1ff799437a4
2021-10-27 21:54:14 -07:00
fae1c0a434 [PyTorch] Reduce refcount bumps in ClassType (#66724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66724

Forwarding fix from previous diff through the ClassType getters & moving Types in where possible.

ghstack-source-id: 141594741

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697995

fbshipit-source-id: 05d6af7c23e3b7a94db75b20d06338bc9ade3e20
2021-10-27 19:32:33 -07:00
c8dd90c858 [PyTorch] Fix extra refcount bumps in ClassAttribute (#66723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66723

Missing move in constructor and forced copy in getter.
ghstack-source-id: 141594742

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697702

fbshipit-source-id: c2018531e7ec4a4853cd003ea3753273a5fae7fb
2021-10-27 19:31:22 -07:00
1cfdb6f4c6 [quant][fx] add pass to duplicate dequant nodes with multi use (#67118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67118

Fixes a bug in the reference pattern support for nn.Linear when the same quantized input is shared across multiple Linear nodes.

This PR adds a pass to duplicate the dequant nodes for each use so that for a case like
```
x -> quant -> dequant -> linear1 - quant1
                     |
                   linear2 - quant2
```
We duplicate the dequant nodes
```
x -> quant -> dequant1 -> linear1 - quant1
            |
          dequant2-> linear2 - quant2
```
So that we can match each pattern in the lowering step

We also add a pass to remove the extra/duplicate dequant nodes that may be left over from the above pass if we don't lower them based on a pattern match

Test Plan:
python test/test_quantization.py test_ref_pattern_multi_use

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31873511

fbshipit-source-id: aea0819222f084635157426743a50e065e6503c3
2021-10-27 18:25:35 -07:00
9e175400ac Moving python binding to _C and its decl to the right pyi file (#67365)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67365

Reviewed By: malfet, albanD

Differential Revision: D31972163

Pulled By: Krovatkin

fbshipit-source-id: e5313c2c8cb810b57b7fe16af8ba26edbe486488
2021-10-27 17:33:45 -07:00
882446c1d2 add frozen_numpy to :builtin_registry_cuda target (#67396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67396

frozen_numpy did not work on GPU since we didn't add register_frozennumpy to the :builtin_registry_cuda target.

This was not found earlier since the unit test we added to test_deploy.cpp is only run on CPU. On GPU, we run test_deploy_gpu.cpp, which does not contain the added unit tests for numpy.
In this diff, I just duplicate the unit tests for numpy (and pyyaml) across test_deploy.cpp and test_deploy_gpu.cpp.
I think ideally we should consolidate these 2 files into a single one, so we can add unit tests in a single place while running them on both hardware platforms.
Tracking task: T104399180
ghstack-source-id: 141750276

Test Plan: buck test mode/opt :test_deploy_gpu

Reviewed By: suo

Differential Revision: D31978156

fbshipit-source-id: 2f5cd55ca33107cc7d230b72f1353df81f0a3bda
2021-10-27 17:29:25 -07:00
9ebc6357b3 [SR] Vectorize int version of fmod (#67313)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67313

Reviewed By: swolchok

Differential Revision: D31889868

fbshipit-source-id: a0af399431a0d672fa56cf2f2ba6d548c47bcedd
2021-10-27 17:02:53 -07:00
dea8b27433 [Pytorch Edge] Make some torchbind classes selective (#67340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67340

Currently Torchbind classes aren't selective. This adds a rough-granularity pass that will remove entire classes if they aren't selected. If we need finer granularity in the future we can make individual methods within classes selective, though instrumenting that will be significantly more involved, I think. On a linux build only __torch__.torch.classes._nnapi.Compilation remains unselective. I can't find where it's registered :P (there are a couple of Android-only ones and presumably some Metal-only ones as well)

Many of the classes registered in functions returned a reference to the class that was created. I talked with dreiss about it and we decided that this seemingly didn't serve any purpose, and leaving it like that would make the return value difficult (but possible) to create with selectivity. Since it seems useless anyway, I just changed them to return an int so that they can still be called from a global scope, but not have any issues with the return type (see the sketch below).
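
The pattern described above looks roughly like this (illustrative class and names; a sketch, not the actual diff):

```
#include <torch/custom_class.h>

struct MyThing : torch::CustomClassHolder {
  int64_t value = 0;
};

// Before: registration functions returned a reference to the registered class,
// which is awkward to produce under selective build. After: return an int (the
// value was unused anyway), so calls from global scope still work.
int register_my_thing() {
  static auto cls = torch::class_<MyThing>("my_namespace", "MyThing")
      .def(torch::init<>());
  return 0;
}

static const int kRegistered = register_my_thing();
```
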
ghstack-source-id: 141690776

Test Plan: CI, model unit tests, test models in prod apps

Reviewed By: dhruvbird

Differential Revision: D31092564

fbshipit-source-id: 657f7eb83490292436c15cf134ceca9b72fb9e1a
2021-10-27 16:58:27 -07:00
f20614af21 [jit] Allow custom class functions to be traced in invokeScriptMethodFromPython(). (#67380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67380

Test Plan: eyes

Reviewed By: tugsbayasgalan

Differential Revision: D31975656

fbshipit-source-id: 47c8c9854899e9fed5a635f88470711dc4c95970
2021-10-27 16:38:50 -07:00
2267a984eb [ROCm] Add sparse mappings for CUDA->HIP translation (#67323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67323

Applied patch proposed by Jeff https://github.com/pytorch/pytorch/pull/63948#issuecomment-952166982.
In PyTorch, we map cuBLAS->rocBLAS and cuSPARSE->hipSPARSE. Note the prefix, roc versus hip.
The 'hip' APIs offer a more direct CUDA-friendly mapping, but calling rocBLAS directly has better performance.
Unfortunately, the `roc*` types and `hip*` types differ, i.e., `rocblas_float_complex` versus `hipComplex`.
In the case of SPARSE, we must use the hip types for complex instead of the roc types,
but the pytorch mappings assume roc. Therefore, we create a new SPARSE mapping that has a higher priority.
Its mappings will trigger first, and only when a miss occurs will the lower-priority pytorch mapping take place.
When a file contains "sparse" in the filename, a mapping marked with API_SPARSE is preferred over other choices.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31969246

Pulled By: cpuhrsch

fbshipit-source-id: 4ce1b35eaf9ef0d146a0955ce70c354ddd8f4669
2021-10-27 16:28:37 -07:00
708f7b1209 Update extending doc to cover forward mode AD (#66962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66962

Reviewed By: VitalyFedyunin

Differential Revision: D31897782

Pulled By: albanD

fbshipit-source-id: 64164783a14a7ed4cedc17da28f1181d9807a499
2021-10-27 14:18:38 -07:00
d9a5668983 [ONNX] Add dim argument to all symbolic (#66093) (#67270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67270

* Add dim argument to all symbolic

* All symbolic depends on any symbolic

Test Plan: Imported from OSS

Reviewed By: msaroufim

Differential Revision: D31962518

Pulled By: malfet

fbshipit-source-id: f7ee05cf4eff5880fc508154267e060952b5b42d
2021-10-27 13:46:31 -07:00
cb15df76ad [ONNX] Update onnxruntime to 1.9 for CI (#65029) (#67269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67269

Test Plan: Imported from OSS

Reviewed By: ngimel, msaroufim

Differential Revision: D31962516

Pulled By: malfet

fbshipit-source-id: 39b3c6a4a05d7b769f0ef5ce7ea597209516cde2
2021-10-27 13:45:07 -07:00
9900310133 Fix sign warnings in CUDA kernels (#66753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66753

Fixes these Wextra compilation errors:
```
stderr: caffe2/aten/src/ATen/native/cuda/UnarySignKernels.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/UnarySignKernels.cu:49:72: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   49 |   AT_DISPATCH_ALL_TYPES_AND2 (44fd312604)(kBFloat16, ScalarType::Half, iter.input_dtype(), "signbit_cuda", [&]() {
      |                                                                      ~~^~~
stderr: caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:86: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                      ^
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:97: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                                 ^
stderr: caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:86: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                      ^
```
And also these warnings:
```
caffe2/c10/util/Half.h(461): warning: pointless comparison of unsigned integer with zero
          detected during instantiation of "std::enable_if<<expression>, __nv_bool>::type c10::overflows<To,From>(From) [with To=size_t, From=unsigned long]"
caffe2/aten/src/ATen/native/Resize.h(45): here
caffe2/c10/util/Half.h(459): warning: pointless comparison of unsigned integer with zero
          detected during instantiation of "std::enable_if<<expression>, __nv_bool>::type c10::overflows<To,From>(From) [with To=size_t, From=unsigned long]"
caffe2/aten/src/ATen/native/Resize.h(45): here
```
I thought I'd fixed this previously using `std::is_unsigned` in D25256251 (cff1ff7fb6), but apparently that was insufficient.
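
One standard shape for such a fix (a sketch of the general technique, not the code in this diff) is to short-circuit the signed-only comparison at compile time:

```
#include <type_traits>

// Avoid "comparison is always false" warnings by never emitting `x < 0`
// for unsigned types.
template <typename T>
typename std::enable_if<std::is_unsigned<T>::value, bool>::type
is_negative(T /*x*/) {
  return false;  // unsigned values are never negative
}

template <typename T>
typename std::enable_if<std::is_signed<T>::value, bool>::type
is_negative(T x) {
  return x < 0;
}
```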

Test Plan: Sandcastle

Reviewed By: malfet, ngimel

Differential Revision: D31708173

fbshipit-source-id: 7714f6bbf109d2f2164630d3fc46bad18046c06c
2021-10-27 13:39:27 -07:00
3a1aa31a2f Add dummy bfloat16 VSX implementation (#67331)
Summary:
Just a copy of DEFAULT bfloat16 implementation and revert restriction
introduced by https://github.com/pytorch/pytorch/pull/61630

Fixes https://github.com/pytorch/pytorch/issues/66867 and https://github.com/pytorch/pytorch/issues/62016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67331

Reviewed By: ngimel

Differential Revision: D31959916

Pulled By: malfet

fbshipit-source-id: 8ca5e65ca041fef67ee18ddbb215cff01fd1e004
2021-10-27 13:35:38 -07:00
7484941eaa Wrap TRTInterpreter result with wrapper (#67307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67307

Wrap the TRTInterpreter result so that any future change to output params is less likely to break existing use cases.

Test Plan: Run test with all touched file

Reviewed By: 842974287

Differential Revision: D31945634

fbshipit-source-id: 7cf73a1ef0098bff2013815f2f1692233ef7ec14
2021-10-27 13:24:50 -07:00
fa70d72e95 Set kernel func name from AOT Compiler (#67229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67229

Right now, the assembly code generated for a given method from the model is named wrapper or func by default. The function name is then replaced with a proper kernel_func_name after target-specific assembly is generated.
This PR propagates a desired kernel_func_name right from the aotCompiler API so that the generated function has the needed name that doesn't need to be replaced later.

Note: Most of this change was landed in https://github.com/pytorch/pytorch/pull/66337 which had to be reverted as it was breaking `test_profiler` in `test_jit_fuser_te` as it replaced the name generated for graph with the default kernel_func_name value. This PR fixes that as well.

```
(pytorch)  ~/local/pytorch kname
└─ $ python3 test/test_jit_fuser_te.py
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
........................................<string>:3: UserWarning: torch.cholesky is deprecated in favor of torch.linalg.cholesky and will be removed in a future PyTorch release.
L = torch.cholesky(A)
should be replaced with
L = torch.linalg.cholesky(A)
and
.
.
.
......................<string>:3: UserWarning: torch.symeig is deprecated in favor of torch.linalg.eigh and will be removed in a future PyTorch release.
The default behavior has changed from using the upper triangular portion of the matrix by default to using the lower triangular portion.
L, _ = torch.symeig(A, upper=upper)
should be replaced with
L = torch.linalg.eigvalsh(A, UPLO='U' if upper else 'L')
and
L, V = torch.symeig(A, eigenvectors=True)
should be replaced with
L, V = torch.linalg.eigh(A, UPLO='U' if upper else 'L') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2492.)
......[W pybind_utils.cpp:35] Warning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (function operator())
/data/users/priyaramani/pytorch/torch/testing/_internal/common_utils.py:403: UserWarning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (Triggered internally at  ../torch/csrc/jit/python/pybind_utils.h:691.)
  return callable(*args, **kwargs)
.....................................................................[W Resize.cpp:23] Warning: An output with one or more elements was resized since it had shape [1], which does not match the required output shape [].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function resize_output_check)
[W Resize.cpp:23] Warning: An output with one or more elements was resized since it had shape [1, 5], which does not match the required output shape [5].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function resize_output_check)
........................................................................s.......s...s.s....s......s..sss............................
----------------------------------------------------------------------
Ran 503 tests in 37.536s

OK (skipped=10)
```

Test Plan: Imported from OSS

Reviewed By: navahgar, pbelevich

Differential Revision: D31945713

Pulled By: priyaramani

fbshipit-source-id: f2246946f0fd51afba5cb6186d9743051e3b096b
2021-10-27 13:10:49 -07:00
5347dab851 Set test owners for onnx tests (#66860)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66860

Reviewed By: malfet

Differential Revision: D31964696

Pulled By: janeyx99

fbshipit-source-id: 4e77d1bda92d9107ca0b90a06d24fa4477ceaffa
2021-10-27 12:50:45 -07:00
72e25c9f4e [Static Runtime][DI] Add variadic grouped_accessor_op (#66289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66289

Add a variadic version of `grouped_accessor_op` to eliminate list construction overhead and associated refcount bumps in static runtime.

Test Plan:
Accuracy test with model 294738512_40: passes with 0 errors.
Accuracy test with model 296213501_65 (has V2 op): passes with 0 errors.

**Perf impact**

TW replayer test w/ 800 QPS (stacked with D31620408) shows ~5% CPU decrease for storage tier.
Results:

{F673610665}

Reviewed By: hlu1

Differential Revision: D31482816

fbshipit-source-id: 14393da122cefd094c3e4f423beb897c1d17b32c
2021-10-27 12:29:33 -07:00
1ec732bc46 Add fp16/fp32 autocasting to JIT/TorchScript (#63939)
Summary:
Adds mixed precision autocasting support between fp32/fp16 to torchscript/JIT. A more in-depth description can be found at [torch/csrc/jit/JIT-AUTOCAST.md](https://github.com/pytorch/pytorch/pull/63939/files#diff-1f1772aaa508841c5bb58b74ab98f49a1e577612cd9ea5c386c8714a75db830b)

This PR implements an autocast optimization pass that inserts casting ops per the AMP rules (torch/csrc/jit/passes/autocast.cpp), mimicking the behavior of eager autocast. The pass also takes into consideration the context of `torch.cuda.amp.autocast` and only inserts casting ops within the enabled context manager, giving feature parity with eager amp autocast.

We currently provide JIT AMP autocast as a prototyping feature, so it is default off and could be turned on via `torch._C._jit_set_autocast_mode(True)`

The JIT support for autocast is subject to different constraints compared to the eager mode implementation (mostly related to the fact that TorchScript is statically typed), restriction on the user facing python code is described in doc torch/csrc/jit/JIT-AUTOCAST.md

This is a prototype; there are also implementation limitations that were necessary to keep this PR small and get something functioning quickly upstream, so we can iterate on designs.

A few limitations/challenges that are not properly resolved in this PR:
1. Autocast inserts cast operations, which have an impact on the scalar type of output tensors feeding downstream operations. We are not currently propagating the updated scalar types, which would give issues/wrong results for operations subject to promotion rules.

2. Backward for autodiff in JIT misses the casting of dgrad to the input scalar type, which autograd does in eager mode. This forces us to explicitly mark the casting operation for certain operations (e.g. binary ops); otherwise, we might feed dgrad with a mismatched scalar type to the input. This could potentially break gradient functions consuming dgrad (e.g. gemm backwards, which assumes grad_output is of the same scalar type as the input).

3. The `torch.autocast` API has an optional argument `dtype` which is not currently supported in JIT autocast; we require a static value.

Credit goes mostly to:
tlemo
kevinstephano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63939

Reviewed By: navahgar

Differential Revision: D31093381

Pulled By: eellison

fbshipit-source-id: da6e26c668c38b01e296f304507048d6c1794314
2021-10-27 12:11:36 -07:00
0101b1ea2b [skip-ci] .github: Set linux gpu instances to be non-ephemeral (#67345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67345

We were hitting capacity issues; setting these to non-ephemeral means
keeping the current capacity at the expense of "unclean" nodes

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31965477

Pulled By: seemethere

fbshipit-source-id: 6d45fb34d07d55c5112db065af2aa0a8b1fd8d1f
2021-10-27 11:55:45 -07:00
b55a2500d2 [jit] Remove graph() call from abstract Function interface. (#65967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65967

Graph is an implementation detail. If a user wants access to the
underlying graph, they should be able to explicitly dynamic cast instead.
ghstack-source-id: 141659819

Test Plan: no behavior change.

Reviewed By: gmagogsfm

Differential Revision: D31326153

fbshipit-source-id: a0e984f57c6013494b92a7095bf5bb660035eb84
2021-10-27 11:54:26 -07:00
7c48b9ee25 Sparse CSR CUDA: add triangular_solve_out (#61858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61858

This PR adds `triangular_solve_out_sparse_csr_cuda`. The operation is
used to compute the solution to a linear system where the coefficient
matrix is triangular.
Structured kernels are used, and the meta function needed some changes to
support the sparse CSR layout. With a sparse matrix input, the `cloned_coefficient`
tensor is a 0-sized tensor.
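
A hedged usage sketch (the sparse CSR factory and solver calls below are assumptions about how the new kernel is exercised, not taken from the diff):

```
#include <ATen/ATen.h>

void solve_lower_triangular_csr() {
  // 2x2 lower-triangular matrix A in CSR form: [[2, 0], [1, 3]]
  auto crow = at::tensor({0, 1, 3}, at::kLong);
  auto col  = at::tensor({0, 0, 1}, at::kLong);
  auto vals = at::tensor({2.0f, 1.0f, 3.0f});
  auto A = at::sparse_csr_tensor(crow, col, vals, {2, 2},
                                 at::device(at::kCUDA).dtype(at::kFloat));
  auto b = at::rand({2, 1}, at::device(at::kCUDA));
  // Solve A x = b; returns (solution, cloned_coefficient).
  auto result = at::triangular_solve(b, A, /*upper=*/false);
}
```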

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31948435

Pulled By: cpuhrsch

fbshipit-source-id: 7775fece83ca705a26d75f82aead10b956b14bfd
2021-10-27 11:12:20 -07:00
4b9464f4b9 [fx]Early return if a node tries prepend self (#67068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67068

Prepending a node to itself results in the node getting removed from the graph.

Usually people won't prepend a node to itself. But people could accidentally try to append a node that's already next to the `self` node, which amounts to prepending `self` to `self`.

Test Plan: Added a unit test

Reviewed By: jamesr66a

Differential Revision: D31849030

fbshipit-source-id: b0fdfbb893f785f268595acd823b426d57c15e61
2021-10-27 10:49:45 -07:00
2669e4ed4e Revert D31945507: .github: Switch 8xlarge to 4xlarge instance_type
Test Plan: revert-hammer

Differential Revision:
D31945507 (1541bb823a)

Original commit changeset: fb8587de7f31

fbshipit-source-id: 3760f5610f0c9cd5298a35490c549e56f7396aaf
2021-10-27 10:02:51 -07:00
7d1c0992e1 GHA: add back runner type for distributed tests (#67336)
Summary:
Addresses https://github.com/pytorch/pytorch/pull/67264#issuecomment-953031927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67336

Test Plan:
the 8x is used for the distributed config
![image](https://user-images.githubusercontent.com/31798555/139103861-38d7dc37-ca8b-4448-b3ec-facc24aee342.png)

Reviewed By: malfet

Differential Revision: D31961179

Pulled By: janeyx99

fbshipit-source-id: cd21e2bf2a7c6602c9a42a53759b720959e43b8d
2021-10-27 09:34:18 -07:00
f2f7b02b4c Add support for vmap+fwdAD for basic out-of-place op (#66291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66291

In this PR:
 - Trivial batching rules for `make_dual` and `is_same_size` that enable forward ad + vmap functionality
 - Adds a check in gradcheck that is performed when both `check_batched_grad` and `check_forward_ad` are `True` (an OpInfo using this is added later in the stack).
 - Tests for the gradcheck functionality
 - Tests that basic out-of-place op works

Test Plan: Imported from OSS

Reviewed By: albanD, saketh-are

Differential Revision: D31842018

Pulled By: soulitzer

fbshipit-source-id: 84b18d9a77eeb19897757e37555581f2a9dc43d8
2021-10-27 08:55:06 -07:00
a3aa9df59f Add barrier to ProcessGroup trampoline (#67236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67236

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31916706

Pulled By: mrshenli

fbshipit-source-id: f3d2bcd938a384ec297f4094831c69d4059316bb
2021-10-27 08:18:07 -07:00
e52d0e773b [tensorexpr][ir][quant] Adding qscale and qzero to tensorexpr IR Buf (#66675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66675

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31676328

Pulled By: IvanKobzarev

fbshipit-source-id: c6479415fa7d809e02dd3789ee0bfd6dfe50dc92
2021-10-27 01:32:16 -07:00
632719c214 Enable c10d trampoline tests on MacOS (#67205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67205

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31916705

Pulled By: mrshenli

fbshipit-source-id: 440d319959796d01c637c277706eeab127d9bde7
2021-10-26 20:40:12 -07:00
c88da701e2 [hpc][inference] enable cuda graph in engine holder (#66738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66738

added a field `max_batch_size` to TRTModule, which will later be used to determine how much the engine holder needs to pad the input

Reviewed By: 842974287

Differential Revision: D31286509

fbshipit-source-id: be5c6d4ad9c87ca0842679dc507b187275d4e8dc
2021-10-26 18:48:05 -07:00
28570664d5 [Vulkan] Add vulkan_perf_test with google benchmark (#67230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67230

Added a new test `vulkan_perf_test` for measuring performance with google benchmark.
**Summary:**
* `vulkan_perf_test` can be used to perform a quick benchmark test for Vulkan features, to compare before-and-after performance when applying a new method and/or optimizing an existing implementation on your local machine (see the sketch after this list).
* The **google benchmark** 3rd party library (https://github.com/google/benchmark) is already in the repo (`fbsource/third-party/benchmark`).
* The number of threads is set to 1 since the Vulkan backend is not thread-safe.
* Added a new API `Context::wait()` to allow benchmark tests to wait for all GPU operations to be done before calling `Context::flush()`
* Call `Context::wait()` for each output Vulkan tensor and then `Context::flush()` to avoid out-of-memory issues while running a number of iterations in the benchmark test code
* Use `Time` column (wall clock) as a total execution time for each iteration (instead of `CPU` column = CPU execution time only) from the benchmark result table
* The more iterations, the more reliable the data, but it will take much longer. 100-1,000 iterations for bigger tensors and 5,000-10,000 iterations for smaller ones would be sufficient.
* The benchmark data on MacOS is not reliable since there is an extra layer, [MoltenVK](https://github.com/KhronosGroup/MoltenVK), running on top of `Metal`. Also, running Vulkan models on MacOS instead of Metal ones is generally not a good idea.
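
As a shape for such a benchmark, a registration might look like this (a sketch with assumed names and a CPU `at::cat` body as a stand-in; the real tests construct Vulkan tensors and call `Context::wait()`/`flush()` as described above):

```
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

// Sketch: a parameterized benchmark in the style of the result tables below.
static void cat_op_channel_perf(benchmark::State& state) {
  const auto N = state.range(0), C = state.range(1);
  const auto H = state.range(2), W = state.range(3);
  auto a = at::rand({N, C, H, W});
  auto b = at::rand({N, C, H, W});
  for (auto _ : state) {
    auto out = at::cat({a, b}, /*dim=*/1);
    benchmark::DoNotOptimize(out);
  }
}
BENCHMARK(cat_op_channel_perf)
    ->ArgNames({"N", "C", "H", "W"})
    ->Args({3, 40, 221, 193})
    ->Iterations(1000)
    ->Threads(1);
BENCHMARK_MAIN();
```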

**Next steps:**
* Add more benchmark tests as we optimize more Vulkan operators
* Consider using Vulkan own performance counter such as [uVkCompute](https://github.com/google/uVkCompute) in the near future. Each iteration time can be manually set by `benchmark::State::SetIterationTime()` and `Benchmark::UseManualTime()` APIs (see [UseManualTime API](365670e432/include/benchmark/benchmark.h (L1013)))
* Consider this `vulkan_perf_test` as a performance BAT (Build Acceptance Test) on the CI pipeline. `gtest` and `google benchmark` can be written in the same place ([see](https://stackoverflow.com/questions/8565666/benchmarking-with-googletest)). And [swiftshader](https://github.com/google/swiftshader) can be used for Sandcastle devservers that don't support Vulkan. We may come up with a reasonable performance criterion for each test, and fail the test if there is any significant performance degradation.

Test Plan:
**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
**Test build on MacOS**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**
```
Running /data/local/tmp/vulkan_perf_test
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                            Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       60.4 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       24.1 ms        0.947 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.6 ms         14.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        5.98 ms        0.844 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.02 ms        0.845 ms         5000
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       39.3 ms         13.3 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       16.4 ms         3.49 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.7 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        3.93 ms        0.855 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.14 ms        0.852 ms         5000
```
Note that the smaller tensors receive a significant improvement on the Android builds (`3.93 ms` vs. `6.14 ms` when comparing `{3,4,221,193}` with `{3,3,221,193}`), because the `vkCmdCopyImage` API is used for the bigger tensor `{3,4,221,193}` instead of shader operations.
* `{3,40,221,193}`: 60.4 ms -> 39.3 ms (34.93% faster)
* `{3,20,221,193}`: 24.1 ms -> 16.4 ms (31.95% faster)
* `{3,4,221,193}`: 5.98 ms -> 3.93 ms (34.28% faster)
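
For reference, the percentages above follow the usual relative-reduction formula; a quick sketch:

```
# Relative time reduction, matching the speedup bullets above.
for shape, before, after in [("{3,40,221,193}", 60.4, 39.3),
                             ("{3,20,221,193}", 24.1, 16.4),
                             ("{3,4,221,193}", 5.98, 3.93)]:
    print(f"{shape}: {(before - after) / before:.2%} faster")
```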

{F674052834}

**Test result on MacOS**
```
Running ./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac#macosx-x86_64
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 5.95, 5.02, 5.15
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                            Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       51.2 ms         35.5 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       11.4 ms         4.76 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       51.9 ms         35.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        2.84 ms         1.36 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        2.30 ms         1.13 ms         5000
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       70.1 ms         36.9 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       11.8 ms         5.00 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       69.3 ms         36.8 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        4.60 ms         1.48 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        3.65 ms         1.41 ms         5000
```
Note that `{3,40,221,193}` input tensors don't receive any performance improvement on MacOS when we use the `vkCmdCopyImage` API to directly copy textures for channel counts that are multiples of 4. This may be due to the extra [MoltenVK](https://github.com/KhronosGroup/MoltenVK) layer running on top of `Metal`.

Reviewed By: SS-JIA

Differential Revision: D31906379

fbshipit-source-id: 0addc766502dba1a915b08840b3a4dc786a9cd9d
2021-10-26 17:55:42 -07:00
cdc9b26281 [Vulkan] Optimize cat operator for channel dimension (#67207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67207

Improved performance of the `cat` operator along the channel dimension:
* Improved when the input tensor's channel size is a multiple of 4.
* Added new test cases to cover this scenario.
* Limitation: we can't mix shader operations and `vkCmdCopyImage` in the same call. The way we create the output texture differs between the two, so we can't combine them unless we recreate the output texture every time. We therefore use `vkCmdCopyImage` only if every input tensor's channel count is a multiple of 4.
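
For illustration, a minimal Python sketch of the benchmarked workload (shapes taken from the test plan below; assumes a Vulkan-enabled build of PyTorch):

```
import torch

# Concatenate three NCHW tensors along the channel dimension on the Vulkan
# backend. With C a multiple of 4 (here 40), the vkCmdCopyImage fast path
# described above applies.
a, b, c = (torch.randn(3, 40, 221, 193) for _ in range(3))
out = torch.cat([a.vulkan(), b.vulkan(), c.vulkan()], dim=1)
print(out.cpu().shape)  # torch.Size([3, 120, 221, 193])
```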

{F673815905}

Test Plan:
**Test Conditions**
* 3 input tensors with size `{3, 40, 221, 193}`
* Number of iteration: `1,000`
* Compare the `Time` column (the `CPU` column measures CPU execution time only)
* Flush resources every iteration since the input tensors are big
* Running `vulkan_perf_test` requires a separate diff (D31906379)

**Test build on Android**
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_perf_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_perf_test
adb shell "/data/local/tmp/vulkan_perf_test"
```
**Test build on Mac**
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_perf_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_perf_test_binAppleMac\#macosx-x86_64
```

**Test result on Google Pixel 5**
a) Without using `vkCmdCopyImage` for multiples of 4 in channel dimension
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (Without optimization for 4x channels)                            Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       60.4 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       24.1 ms        0.947 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.6 ms         14.0 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        5.98 ms        0.844 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.02 ms        0.845 ms         5000
```
b) With using `vkCmdCopyImage` for multiples of 4 in channel dimension
```
Run on (8 X 1804.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark (With optimization for 4x channels)                               Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
cat_op_channel_perf/N:3/C:40/H:221/W:193/iterations:1000/threads:1       39.3 ms         13.3 ms         1000
cat_op_channel_perf/N:3/C:20/H:221/W:193/iterations:1000/threads:1       16.4 ms         3.49 ms         1000
cat_op_channel_perf/N:3/C:39/H:221/W:193/iterations:1000/threads:1       59.7 ms         14.1 ms         1000
cat_op_channel_perf/N:3/C:4/H:221/W:193/iterations:5000/threads:1        3.93 ms        0.855 ms         5000
cat_op_channel_perf/N:3/C:3/H:221/W:193/iterations:5000/threads:1        6.14 ms        0.852 ms         5000
```
* `{3,40,221,193}`: 60.4 ms -> 39.3 ms (34.93% faster)
* `{3,20,221,193}`: 24.1 ms -> 16.4 ms (31.95% faster)
* `{3,4,221,193}`: 5.98 ms -> 3.93 ms (34.28% faster)

{F674052795}

Reviewed By: SS-JIA

Differential Revision: D31781390

fbshipit-source-id: 42179d28ae461a9e247053bc9718f6b8c6c819e5
2021-10-26 17:54:19 -07:00
d691bc1207 Revert D31937065: [pytorch][PR] fix binding to the wrong python module
Test Plan: revert-hammer

Differential Revision:
D31937065 (7ac8ed741d)

Original commit changeset: 5c10b2870bcc

fbshipit-source-id: 9b21ffea8054b8a3a0b96e1b78e933f8654e7f2f
2021-10-26 17:40:59 -07:00
dfa7225a38 [Pytorch][Bootcamp] Add fix and testing for non-vectorized Adadelta optimizer to handle complex numbers (#66587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66587

Made some changes in the step function of the non-vectorized Adadelta optimizer to handle complex numbers as two real numbers, as per issue #65711 on GitHub.
ghstack-source-id: 141484731
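
A minimal sketch of the newly supported usage (the parameter shape and loss are illustrative, not from the diff):

```
import torch

# A complex parameter optimized with Adadelta; the step function now treats
# each complex value as two real numbers.
param = torch.randn(4, dtype=torch.complex64, requires_grad=True)
opt = torch.optim.Adadelta([param])
loss = (param.conj() * param).real.sum()  # real-valued loss of a complex param
loss.backward()
opt.step()
```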

Test Plan:
buck test mode/dev caffe2/test:optim -- 'test_adadelta_complex'

https://pxl.cl/1R7kJ

Reviewed By: albanD

Differential Revision: D31630069

fbshipit-source-id: 2741177b837960538ce39772897af36bbce7b7d8
2021-10-26 17:35:01 -07:00
fcefed9517 Revert D31935958: Add register_frozenpython.cpp to the torch::deploy interpreter library in the OSS build
Test Plan: revert-hammer

Differential Revision:
D31935958 (00b0d4eeed)

Original commit changeset: 3e2cc5c8bc18

fbshipit-source-id: 3f22bf88d902891b83d836e3c53be9a214a58f1f
2021-10-26 17:30:22 -07:00
1541bb823a .github: Switch 8xlarge to 4xlarge instance_type (#67299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67299

Switches the linux.8xlarge.nvidia.gpu to the 4xlarge instance type to
help with queueing / capacity issues. This change is only meant to be a
bridge until everyone updates their PRs to use the new
linux.4xlarge.nvidia.gpu node type

NOTE: This node type will be removed so do not depend on it for any new
workflows.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31945507

Pulled By: seemethere

fbshipit-source-id: fb8587de7f31da72e968d46eeecc573d3f5b440f
2021-10-26 17:23:46 -07:00
7ac8ed741d fix binding to the wrong python module (#67246)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67246

Reviewed By: zhxchen17

Differential Revision: D31937065

Pulled By: Krovatkin

fbshipit-source-id: 5c10b2870bccece50ba52dde26127da79bccbba6
2021-10-26 17:19:02 -07:00
0e8bd0c8d6 [Pytorch Delegated Backend] Add macro to define sentinel value of debug handle. (#66584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66584

This will help avoid "-1"s in various places in our codebase and in backend codebases when
the debug handle is not known.

Test Plan: CI

Reviewed By: sxu

Differential Revision: D31614478

fbshipit-source-id: 97fceb04e3e78f52feda7b1ba1da08fa4480dd77
2021-10-26 17:13:44 -07:00
00b0d4eeed Add register_frozenpython.cpp to the torch::deploy interpreter library in the OSS build (#67280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67280

Test Plan: Imported from OSS

Reviewed By: zhxchen17

Differential Revision: D31935958

Pulled By: shunting314

fbshipit-source-id: 3e2cc5c8bc18b5e19bd3804ad542a9ed69e04291
2021-10-26 16:39:40 -07:00
f510193e22 [jit][edge] Export maybe-used interface methods from modules. (#65966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65966

ghstack-source-id: 141594521

Support exporting "interface methods" from submodules to a mobile module. "Interface methods" are methods that might be called dynamically in a module and therefore need to be exported regardless, like virtual functions in C++.

Before this change, the export algorithm was a simple iteration through all top-level methods. Now that we have indirect calls, we need to recursively walk the call graph to find all potentially used methods, which means the order in which we export methods might break on old runtimes. To guarantee forward compatibility, we export top-level methods first and then the extra methods, so that top-level methods are always found first.

NOTE that exporting interface methods is disabled by default in this diff. We need to call `torch._C._enable_mobile_interface_call_export` to actually enable it.
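
A hedged sketch of opting in (the flag name comes from the note above; the module and file name are illustrative):

```
import torch

torch._C._enable_mobile_interface_call_export()  # opt-in flag from the note above

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = torch.jit.script(M())
m._save_for_mobile("m.ptl")  # lite-interpreter export path
```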

Test Plan: buck test mode/dev //caffe2/test:jit -- --exact 'caffe2/test:jit - test_export_opnames_interface (jit.test_misc.TestMisc)'

Reviewed By: qihqi, iseeyuan

Differential Revision: D31326155

fbshipit-source-id: 5be7234cca07691f62648a85133b6db65e427b53
2021-10-26 16:35:15 -07:00
a72a6365c9 disallow requires_grad=True in make_tensor for integral inputs (#67149)
Summary:
As per the title: `requires_grad=True` is no longer accepted for integral dtypes.
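
A minimal sketch of the behavior change (assuming `make_tensor` is used via `torch.testing`):

```
import torch
from torch.testing import make_tensor

make_tensor((2, 3), device="cpu", dtype=torch.float32, requires_grad=True)  # still fine
# Integral dtypes cannot require grad, so this now raises instead of
# silently producing an unusable tensor:
make_tensor((2, 3), device="cpu", dtype=torch.int64, requires_grad=True)
```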

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67149

Reviewed By: albanD

Differential Revision: D31928613

Pulled By: ngimel

fbshipit-source-id: 4491954c4fcd4a4e3121155d4451cc7370c27a0b
2021-10-26 16:19:28 -07:00
81d188101f .github: Use 4xlarge instances for linux gpu (#67264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67264

Downgrades linux gpu instances from 8xlarge -> 4xlarge

We were seeing capacity issues when scaling 8xlarge instances, so we are
downgrading to 4xlarge (which has only a single GPU) to see if that helps
resolve some of the capacity issues we were seeing

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D31933488

Pulled By: seemethere

fbshipit-source-id: b41922ebb675e663cb035cd3795bc9bae94dcac7
2021-10-26 16:17:33 -07:00
fdc74e2373 Port triangular_solve to structured kernel (#61857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61857

A few updates to internal code that allow marking triangular_solve as structured.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31928687

Pulled By: cpuhrsch

fbshipit-source-id: 80a2783c469d5a6194c466ccfa8808fa41c0bdb7
2021-10-26 14:50:00 -07:00
6ce14e7b51 [PyTorch][Static Runtime] Cleanup: add valueVecFromFastSet (#66996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66996

We do this conversion a few times, and further diffs (which I'm trying to keep as small as possible) will do it more.
ghstack-source-id: 141496817

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D31821037

fbshipit-source-id: 1d3b54cadaedd53189aec6a35ed1a126c6fe4824
2021-10-26 14:47:15 -07:00
066a980e7b [PyTorch][Static Runtime][easy] Fix ValueGroup comment (#66965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66965

external aliases aren't defined to be outputs (though output aliases may end up in there as the following sentence clarifies).
ghstack-source-id: 141473794

Test Plan: review

Reviewed By: mikeiovine

Differential Revision: D31809715

fbshipit-source-id: 82d1391b04e22559932f82270669a7ff94a1c90f
2021-10-26 14:45:36 -07:00
1926156752 Prevent TCPServer get deleted too early (#67204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67204

Fixes #66422
Fixes #66423

In the original test, all collectives are dummy local ones. As a
result, rank 0 could exit earlier than other ranks. However, the
`TCPStore` lives on rank 0, and other ranks might need to talk to
that store after rank 0 exits. This commit explicitly makes rank 0
wait for all other ranks to finish.
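
A hedged sketch of the pattern (the test body and init arguments are illustrative):

```
import torch.distributed as dist

def run_test(rank: int, world_size: int) -> None:
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",  # TCPStore lives on rank 0
        rank=rank, world_size=world_size,
    )
    # ... dummy local collectives that don't synchronize the ranks ...
    # Without a final barrier, rank 0 (which hosts the store) could exit while
    # other ranks still need to talk to it.
    dist.barrier()
    dist.destroy_process_group()
```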

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31906802

Pulled By: mrshenli

fbshipit-source-id: 82745f5497d784ea3cea9df6bda537ec71380867
2021-10-26 14:38:11 -07:00
273ab55fc4 Revert D31914868: Strided masked reduction: mean (2nd try)
Test Plan: revert-hammer

Differential Revision:
D31914868 (a33d3d84df)

Original commit changeset: beda9d32ea65

fbshipit-source-id: dc3fa66d7e3c8a211fedac6ae191b11a4a9ab232
2021-10-26 14:18:22 -07:00
2ca552160b [DDP] logging improvements (#67059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67059

While debugging some workflows, the training sometimes does not finish, and I
want to know whether the graph was not static. Also, log 0 for unused
parameter size if no unused params were found.
ghstack-source-id: 141428950

Test Plan: Ci

Reviewed By: mrshenli

Differential Revision: D31846669

fbshipit-source-id: 21763fcdc1b244ba829117da1f15b2271d966983
2021-10-26 13:18:00 -07:00
197dec14b3 .github: Change periodic docker jobs to always_rebuild (#67267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67267

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: xuzhao9

Differential Revision: D31934251

Pulled By: seemethere

fbshipit-source-id: a323d2c754ff6324c69f81bf0e820ae9adbe7853
2021-10-26 13:06:16 -07:00
99b34b320b Make fb::sigrid_hash_compute_multipler_shift return a std::tuple<int64_t, int64_t> (#67123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67123

Makes `sigrid_hash_compute_multipler_shift` return a tuple instead of a tensor and modifies the functions that depend on it.

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```

Benchmarks:
`local`:
```
I1022 13:56:34.529495 2866038 PyTorchPredictorBenchLib.cpp:266] Mean milliseconds per iter: 5.67114, standard deviation: 0.336918

I1022 15:29:45.248790 3292725 PyTorchPredictorBenchLib.cpp:266] Mean milliseconds per iter: 5.66678, standard deviation: 0.403032
```

`local_ro`:
```
I1022 13:59:24.262511 2882599 PyTorchPredictorBenchLib.cpp:266] Mean milliseconds per iter: 1.56012, standard deviation: 0.0537101

I1022 15:34:53.941890 3328358 PyTorchPredictorBenchLib.cpp:266] Mean milliseconds per iter: 1.5525, standard deviation: 0.0280267
```

FB: local - P463676888, local_ro - P463676984, master local - P463686094, master local_ro - P463686470

Reviewed By: mikeiovine

Differential Revision: D31867186

fbshipit-source-id: 0f640487b74d1cd0d5f714f2258e056a2f0c2c07
2021-10-26 12:51:10 -07:00
1ce500f56f [easy][PyTorch] Use at::native::is_nonzero (#67195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67195

Now that `is_nonzero` is part of `at::native` (see https://github.com/pytorch/pytorch/pull/66663), replace `TensorCompare::is_nonzero` with `at::native::is_nonzero`.

ghstack-source-id: 141514416

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D31704041

fbshipit-source-id: 36813e5411d0aa2eb2d0442e2a195bbed417b33d
2021-10-26 12:40:32 -07:00
a33d3d84df Strided masked reduction: mean (2nd try) (#67088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67088

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31914868

Pulled By: cpuhrsch

fbshipit-source-id: beda9d32ea65bcae31c2c0181f95ad23c6631075
2021-10-26 11:54:39 -07:00
6c22b96082 [Pytorch Edge] Extend Tracer to Custom Classes (#67004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67004

New version because the other one was impossible to rebase

Trace custom classes

Test Plan: CI.

Reviewed By: dhruvbird

Differential Revision: D31818978

fbshipit-source-id: daa22ccb153e32685bcca43a303ba9e21042d052
2021-10-26 11:38:06 -07:00
34ee5b11ff .github: Add 4xlarge nvidia gpu to scale-config (#67262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67262

Adds a 4xlarge nvidia gpu variant to our scale-config.yml

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31931941

Pulled By: seemethere

fbshipit-source-id: 120c73ad2c973a416a8426ad6f67457f87302db5
2021-10-26 11:19:16 -07:00
7052c41899 .github: Add workflow to build all docker images (#67215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67215

We were regularly seeing gaps in our docker image builds because specific
workflows were not being run when docker builds occurred on PRs. This should
remove that ambiguity and ensure that all docker images are rebuilt if a
rebuild is deemed necessary

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31910422

Pulled By: seemethere

fbshipit-source-id: f346e64f1857e35a995c49bf30521a3acd8af0b1
2021-10-26 11:14:04 -07:00
d7ac6e977a Fix test_create_store_multi flaky test (#66953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66953

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: kiukchung

Differential Revision: D31802767

Pulled By: H-Huang

fbshipit-source-id: a430e242788aac164496d4e65b85bf326537d019
2021-10-26 11:08:51 -07:00
49bf24fc83 Updated error message for nn.functional.interpolate (#66417)
Summary:
Description:
- Updated error message for nn.functional.interpolate

Fixes https://github.com/pytorch/pytorch/issues/63845

cc vadimkantorov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66417

Reviewed By: albanD

Differential Revision: D31924761

Pulled By: jbschlosser

fbshipit-source-id: ca74c77ac34b4f644aa10440b77b3fcbe4e770ac
2021-10-26 10:33:24 -07:00
d47a9004c8 [skip ci] Set test owner for mobile tests (#66829)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66829

Reviewed By: albanD

Differential Revision: D31928812

Pulled By: janeyx99

fbshipit-source-id: 8116b7f3728df8632278b013007c06ecce583862
2021-10-26 10:20:01 -07:00
204ffd33ee [CUDA][Linalg] Add gesvd as SVD fallback; optimize SVD gesvdj performance (#64533)
Summary:
Fix https://github.com/pytorch/pytorch/issues/64237
Fix https://github.com/pytorch/pytorch/issues/28293
Fix https://github.com/pytorch/pytorch/issues/4689

See also https://github.com/pytorch/pytorch/issues/47953

cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64533

Reviewed By: albanD

Differential Revision: D31915794

Pulled By: ngimel

fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a
2021-10-26 10:13:52 -07:00
828a9dcc04 [nn] MarginRankingLoss : no batch dim (#64975)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64975

Reviewed By: albanD

Differential Revision: D31906528

Pulled By: jbschlosser

fbshipit-source-id: 1127242a859085b1e06a4b71be19ad55049b38ba
2021-10-26 09:03:31 -07:00
129e99fbce __getitem__: Ensure Tensor subclasses are not treated as tuples (#67202)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67027

`torch.Tensor` is considered a Mapping, but not a Sequence in Python
because it uses `tp_as_mapping` instead of defining `__getitem__` in
Python. However, if you try to overwrite `__getitem__` from Python,
it is considered a `Sequence` and so the tensor is treated like a
tuple for indexing purposes.
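
A minimal sketch of the kind of subclass that triggered the bug (class and variable names are illustrative):

```
import torch

class MyTensor(torch.Tensor):
    # Defining __getitem__ in Python made the subclass look like a Sequence.
    def __getitem__(self, item):
        return super().__getitem__(item)

base = torch.arange(6)
idx = torch.tensor([1, 3]).as_subclass(MyTensor)
print(base[idx])  # should index like a tensor -> tensor([1, 3]),
                  # not be unpacked like a tuple
```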

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67202

Reviewed By: VitalyFedyunin

Differential Revision: D31908515

Pulled By: albanD

fbshipit-source-id: 0ca55a36be3421f96428a8eacf5d195646252b38
2021-10-26 08:56:59 -07:00
3c61700cf7 torch.linalg.householder_product: forward AD support (#67043)
Summary:
As per title.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67043

Reviewed By: VitalyFedyunin

Differential Revision: D31897617

Pulled By: albanD

fbshipit-source-id: ef135fe3d9e5b9b2a541c355017f07cdb1309979
2021-10-26 08:34:00 -07:00
5b345e767e QNNPACK: Update to use pytorch/cpuinfo.git repo as a third party dependency (#67106)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67106

Test Plan: Recloned cpuinfo, rebuilt, and ran all the tests locally

Reviewed By: kimishpatel

Differential Revision: D31782317

fbshipit-source-id: 4a71be91f02bb6278db7e0124366d8009e7c7a60
2021-10-26 07:59:19 -07:00
2abffaf050 Consolidate c10d and dist imports in test_c10d_common.py (#67203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67203

This commit uses `dist` for `torch.distributed` and `c10d` for
`torch.distributed.distributed_c10d`. The former is for public APIs
and the latter is for private ones.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31906801

Pulled By: mrshenli

fbshipit-source-id: c3a01f33962b01a03dbd565ed119dcdac594bcf2
2021-10-26 07:50:48 -07:00
71b7182ee2 [skip ci] Set test owner for deploy/package tests (#66830)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66830

Reviewed By: albanD

Differential Revision: D31905820

Pulled By: janeyx99

fbshipit-source-id: 9496acc98339d689fa62e18a8781d7344903a64c
2021-10-26 07:49:33 -07:00
49251d05ec [skip ci] Set test owners for NNC tests (#66833)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66833

Reviewed By: albanD

Differential Revision: D31907812

Pulled By: janeyx99

fbshipit-source-id: 5e5013b4276fd208ac68d61cf787679799695602
2021-10-26 07:46:18 -07:00
a6d702a3ee add support for ubuntu 20.04 to CI docker images (#66942)
Summary:
Some minor changes are needed to the .circleci docker scripts to support ubuntu 20.04.  One edit updates the packages needed for all images (.circleci/docker/common/install_base.sh), while the other edit is specific to ROCm support.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH seemethere malfet pytorch/pytorch-dev-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66942

Reviewed By: albanD

Differential Revision: D31899271

Pulled By: janeyx99

fbshipit-source-id: f7677ddc063a4504da9f39a756dc181ac55f200a
2021-10-26 07:41:46 -07:00
83355f9537 [SR][easy] Alias for c10::Symbol::fromQualString (#67162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67162

It's a bit annoying/ugly to type `c10::Symbol::fromQualString` everywhere, and we can't do `using c10::Symbol::fromQualString` since it's a static class function.

Test Plan: CI

Reviewed By: d1jang

Differential Revision: D31887042

fbshipit-source-id: 073a56c72281c20284a9feef741aed96b58a921d
2021-10-26 06:09:17 -07:00
38cbaeb8a4 Update deprecated import paths. (#67250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67250

Test Plan: Run tests manually

Reviewed By: NicolasHug

Differential Revision: D31921656

fbshipit-source-id: e2cba7bc7d4a8c7f836bc32f1b8b11a37494a4e2
2021-10-26 04:51:07 -07:00
0c1b7545b6 [Static Runtime] Add more debug info to verify_no_memory_overlap() (#67206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67206

The memory overlap check still checks the memory overlap for alias ops. It only skips the check for inplace ops. This needs to be fixed if we want to use the memory overlap check in prod.

This diff only adds more debug info. It doesn't fix the aforementioned problem.

Reviewed By: d1jang

Differential Revision: D31889866

fbshipit-source-id: 05a80ace3d404f66f21a8bbdc9678485ff76c8d3
2021-10-26 01:48:41 -07:00
31bcfa3760 [sharded_tensor] refactor sharded_tensor file structure (#67199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67199

This PR refactors the _sharded_tensor package so that it is split out of api.py, adding separate components to make it more modular. This will also help us resolve circular dependencies caused by the growing code size and better organize the package:

* api.py: sharded tensor APIs
* metadata.py: Metadata definition for ShardedTensors
* shard.py: Shard definition for ShardedTensor
* utils.py: utility functions for validation, etc.
ghstack-source-id: 141533618

Test Plan: test_sharded_tensor.py

Reviewed By: pritamdamania87

Differential Revision: D31904249

fbshipit-source-id: c747d96e131a1d4731991ec4ac090f639dcb369b
2021-10-26 00:36:23 -07:00
b96337cf47 add frozen_pyyaml as a builtin library to torch::deploy (#67127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67127

add frozen_pyyaml as a builtin library to torch::deploy

Test Plan:
unittests pass

> buck test mode/dev-nosan caffe2/torch/csrc/deploy/... -- --regex ".*TestPyYAML.*"

Reviewed By: shunting314

Differential Revision: D31852201

fbshipit-source-id: 889c4493faf09ddd3ec2b9487da9acfea3ab6bcd
2021-10-25 23:16:41 -07:00
0e371e413d [fx-acc] add automated graph opt testing using AccOpProperty (#67228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67228

We added `AccOpProperty` for easy enablement of graph opts for new acc ops based on general properties. This diff adds
1. `AccOpProperty.unary`
2. Automated testing for acc ops with both `AccOpProperty.unary` and `AccOpProperty.pointwise` with `sink_reshape_ops` graph opt. [Adds coverage for 30 more acc_ops]
3. Refactors `graph_opts/TARGETS` to collect all graph optimizations into a common library
4. replaces `def foo(*, input, acc_out_ty=None): assert acc_out_ty is not None` with just `def foo(*, input, acc_out_ty)`. Let me know if there is some hidden purpose to the other implementation.
5. adds `AccOpProperty.*` flags to appropriate ops.

Test Plan:
`buck test mode/dev glow/fb/fx/graph_opts:test_fx_sink`

```
...
Summary
  Pass: 31
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124724581304
```

Also ran
```
`buck test mode/dev glow/fb/fx/acc_tracer:`
```
```
...
Summary
  Pass: 136
  ListingSuccess: 4
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5910974582823618
```

Reviewed By: jfix71

Differential Revision: D31671833

fbshipit-source-id: aa16d1008f18f7c8626058361efff33843de3505
2021-10-25 19:53:05 -07:00
3596e13d45 Add torch.nn.init.normal_ and torch.nn.init.kaiming_uniform_ ops to ShardedTensor (#67057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67057

Extend ShardedTensor with torch.nn.init.[normal_, and kaiming_uniform_] ops
Follow up from https://github.com/pytorch/pytorch/pull/63997

Test Plan:
a) Unit Test
(pytorch) ... $ python test/distributed/_sharded_tensor/ops/test_init.py TestShardedTensorNNInit --v

or b) Manual run: Instruction here: https://docs.google.com/document/d/1_m1Hdo5w51-hhPlZ_F8Y6PIWrN7UgJZqiSpARYvhsaE/edit#
s/uniform_/normal_ or kaiming_uniform_

Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D31845654

fbshipit-source-id: e7aedc0972539da59f7b84bbbf617caf6b206d52
2021-10-25 19:14:30 -07:00
bfcde08612 [trt] Algorithm recorder/replayer (#4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67211

Record the algorithm selection, dump it in JSON format, and replay it. This has the potential to:
1. consistently repro the issue (algo selection could be sensitive to local benchmark timing)
2. allow manually editing the dumped JSON file to control algorithm selection.

Reviewed By: wushirong, 842974287

Differential Revision: D31888836

fbshipit-source-id: 4611fda548f7391776f1ad61572b1f59fa4665b6
2021-10-25 18:50:55 -07:00
ecf7e96969 [Light] Remove ambiguity from compile_spec names, use actual output type (#67209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67198

Fixes a couple of instances where parameters were named method_compile_spec when they were actually compile_specs that could contain multiple method_compile_specs.
Also use the output dtype from the buffer.

Test Plan:
Mobilenetv3 compiles and runs fine
```
(pytorch)  ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ PYTORCH_JIT_LOG_LEVEL="aot_compiler" buck run //caffe2/binaries:aot_model_compiler -- --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="1,3,224,224
"
Downloaded 4501/6195 artifacts, 433.89 Mbytes, 14.3% cache miss (for updated rules)
Building: finished in 06:34.6 min (100%) 20233/20233 jobs, 5467/20233 updated
  Total time: 06:35.0 min
BUILD SUCCEEDED
The compiled llvm assembly code was saved to mobilenetv3.compiled.ll
The compiled model was saved to mobilenetv3.compiled.pt

└─ $ ./compile_model.sh -m pytorch_dev_mobilenetv3 -p /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/mobilenetv3.pt -v v1 -i "1,3,224,224"
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL=pytorch_dev_mobilenetv3
.
.
Columns 961 to 970
1e-11 *
-4.2304 -3.9674  2.4473 -0.8664 -0.7513  1.2140  0.0010  3.8675  1.2714  2.2989

Columns 971 to 980
1e-11 *
-2.7203  1.6772 -0.7460 -0.6936  4.4421 -0.9865 -0.5186 -1.4441  1.3047 -1.6112

Columns 981 to 990
1e-11 *
 0.1275 -1.8815  2.5105 -0.4871 -2.2342  0.8520  0.8658  1.6180  3.8901 -0.2454

Columns 991 to 1000
1e-11 *
-1.4896  4.1337 -2.6640  0.8226  0.2441 -1.4830 -1.7430  1.8758  0.5481  0.5093
[ CPUFloatType{1,1000} ]
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Milliseconds per iter: 276.255. Iters per second: 3.61984
Memory usage before main runs: 104366080 bytes
Memory usage after main runs: 343441408 bytes
Average memory increase per iter: 2.39075e+07 bytes
0 value means "not available" in above
```

Reviewed By: ljk53

Differential Revision: D31698338

fbshipit-source-id: da6c74c1321ec02e0652f3afe6f97bf789d3361b
2021-10-25 17:44:05 -07:00
ad5731cacc [PyTorch] Add flop count for bmm and baddbmm (#66636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66636

Add FLOP count for bmm and baddbmm, which is `2*b*m*n*k`.
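
A quick sanity check of the formula (shapes are illustrative):

```
import torch

# FLOPs for bmm of (b, m, k) @ (b, k, n): one multiply and one add per
# output element per reduction step, i.e. 2*b*m*n*k.
b, m, k, n = 8, 64, 128, 32
x, y = torch.randn(b, m, k), torch.randn(b, k, n)
z = torch.bmm(x, y)       # the operation being counted
print(2 * b * m * n * k)  # 4194304
```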

Reviewed By: ngimel

Differential Revision: D31622061

fbshipit-source-id: f3e1e1e34c45228693117b81647fb4a623c4085b
2021-10-25 17:31:12 -07:00
7acf0c6d4b [PyTorch Edge][type] Add type support for NamedTuple custom class (export) (#62612)
Summary:
Add type support for the namedtuple custom class. The namedtuple type will deserialize to the following string format:
```
"qualified_named[
    NamedTuple, [
        [filed_name_1, field_type_1],
        [filed_name_2, field_type_2]
    ]
]"
```

If it's nested, it will be
```
"__torch__.A[
    NamedTuple, [
        [field_name_a, __torch__.B [
            NamedTuple, [
                [field_name_b, __torch__.C [
                    NamedTuple, [
                      [field_name_c_1, Tensor],
                      [field_name_c_2, Tuple[Tensor, Tensor]],
                    ]
                ]
                ]
            ]
        ]
        ]
    ]
]
"
```
The namedtuple type covers both `collections` and `typing`:
```

from typing import NamedTuple
from collections import namedtuple
```

This will be a forward-incompatible change. However, this type was never supported or exported before, and we don't have a proper way to backport it. The optimal way to ship this change is probably:
1. Land the import change without the export change, so the runtime can read the new format but no new format will be exported.
2. Land the export change, so the runtime can export the new format.

For the following example:
```
class Foo(NamedTuple):
    id: torch.Tensor

class Bar(torch.nn.Module):
    def __init__(self):
        super(Bar, self).__init__()
        self.foo = Foo(torch.tensor(1))

    def forward(self, a: torch.Tensor):
        self.foo = Foo(a)
        return self.foo
```
The new bytecode.pkl will be
```
(6,
 ('__torch__.mobile.test_lite_script_type.MyTestModule.forward',
  (('instructions',
    (('STOREN', 1, 2),
     ('DROPR', 1, 0),
     ('MOVE', 2, 0),
     ('LIST_CONSTRUCT', 0, 1),
     ('NAMED_TUPLE_CONSTRUCT', 1, 1),
     ('RET', 0, 0))),
   ('operators', ()),
   ('constants', ()),
   ('types',
    ('List[Tensor]',
     '__torch__.mobile.test_lite_script_type.myNamedTuple[NamedTuple, [[a, '
     'List[Tensor]]]]')),
   ('register_size', 2)),
  (('arguments',
    ((('name', 'self'),
      ('type', '__torch__.mobile.test_lite_script_type.MyTestModule'),
      ('default_value', None)),
     (('name', 'a'), ('type', 'Tensor'), ('default_value', None)))),
   ('returns',
    ((('name', ''),
      ('type', '__torch__.mobile.test_lite_script_type.myNamedTuple'),
      ('default_value', None)),)))))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62612

ghstack-source-id: 141485500

Test Plan:
fb:
1. Add a simple unittest to test NamedTuple custom class
2. Use following cpp code (D30271153)
```
TEST(LiteTrainerTest, CustomOp) {

  std::string jit_model =
  "/home/chenlai/local/notebooks/ads_dper_fl_model_282250609.pt";

  Module jit_m = load(jit_model);

  jit_m.eval();
  torch::jit::Module module_freeze = freeze(jit_m);
  IValue tuple =
      c10::ivalue::Tuple::create({1 * torch::ones({10, 1034}), 3 * torch::ones({10, 1034})});
  std::vector<IValue> inputs_1{tuple};
  auto jit_output = jit_m.forward(inputs_1);
  jit_output.dump();

  std::stringstream ss;
  jit_m._save_for_mobile(ss);
  jit_m._save_for_mobile("/home/chenlai/local/notebooks/tmp/tmp.ptl");

  torch::jit::mobile::Module mobile_m = _load_for_mobile(ss);
  auto mobile_output = mobile_m.forward(inputs_1);
  std::cout << "mobile output: " << std::endl;
  mobile_output.dump();
  }
```
And output from both mobile and jit are
```
{prediction: ([ CPUFloatType{0} ], [ CPUFloatType{0} ])}
```

3. N1033894 with model inspection, also compare the result between jit and mobile with the dper model.

Reviewed By: iseeyuan

Differential Revision: D30004716

fbshipit-source-id: cfd30955e66a604af8f9633b1b608feddc13d7d7
2021-10-25 17:15:50 -07:00
0d7d446154 Disallow annotations on instance attributes outside __init__ (#67051)
Summary:
**Summary**: This commit solves the first part of https://github.com/pytorch/pytorch/issues/52306, which disallows type annotations on instance attributes inside any method other than the constructor.
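
A minimal sketch of what is now rejected (the module is illustrative):

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.x: int = 0  # annotating an instance attribute here is allowed

    def forward(self, inp):
        self.x: int = 1  # annotation outside __init__: now rejected
        return inp

torch.jit.script(M())  # raises a compilation error after this change
```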

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67051

Test Plan:
Added test to test_types.py.

**Reviewers**: Zhengxu Chen

**Subscribers**: Zhengxu Chen, Yanan Cao, Peng Wu, Yining Lu

**Tasks**: T103941984

**Tags**: pytorch

**Fixes** https://github.com/pytorch/pytorch/issues/52306

Reviewed By: zhxchen17

Differential Revision: D31843527

Pulled By: andrewor14

fbshipit-source-id: 624879ae801621e367c59228be8b0581ecd30ef4
2021-10-25 16:20:47 -07:00
1f55dd83ac [WIP] wrap XLATensors into Python XLA wrapper class (#65841)
Summary:
**Improbably** fixes https://github.com/pytorch/pytorch/issues/65130

ezyang, I'm a super n00b at Python extensions; is this what we want to do?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65841

Reviewed By: navahgar

Differential Revision: D31889790

Pulled By: Krovatkin

fbshipit-source-id: c7f077b89f6f02df1962ab83d9e13fcc348a227d
2021-10-25 16:11:03 -07:00
fa7fb7b4d9 [skip ci] Set test owner for test_profiler.py (#66831)
Summary:
Followup action to https://github.com/pytorch/pytorch/issues/66232

cc ilia-cher robieta chaekit gdankel bitfort ngimel orionr nbcsm guotuofeng guyang3532 gaoteng-git

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66831

Reviewed By: gdankel

Differential Revision: D31909245

Pulled By: janeyx99

fbshipit-source-id: 4156a5cffa215c29022fc4dab6ee5b442a509db4
2021-10-25 15:59:52 -07:00
0acc21b412 [vulkan] Add 2D transposed convolutions (#67104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67104

Add 2D transposed convolutions to Vulkan. Currently, only `dilation={1,1}` is supported. We plan to support dilation at a later time.

Test Plan:
Build and run `vulkan_api_test`:

```
cd ~/pytorch
BUILD_CUSTOM_PROTOBUF=OFF \
  BUILD_TEST=ON \
  USE_EIGEN_FOR_BLAS=OFF \
  USE_FBGEMM=OFF \
  USE_MKLDNN=OFF \
  USE_NNPACK=OFF \
  USE_NUMPY=OFF \
  USE_OBSERVERS=OFF \
  USE_PYTORCH_QNNPACK=OFF \
  USE_QNNPACK=OFF \
  USE_VULKAN=ON \
  USE_VULKAN_API=ON \
  USE_VULKAN_SHADERC_RUNTIME=ON \
  USE_VULKAN_WRAPPER=OFF \
  MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python3 setup.py develop --cmake && ./build/bin/vulkan_api_test
```

Reviewed By: beback4u

Differential Revision: D31731742

fbshipit-source-id: b79c946c8d988bb4d83e9fd3381992a4f2f4be80
2021-10-25 15:55:20 -07:00
059ae96007 [jit] Factor findAllNodes into one place. (#65965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65965

ghstack-source-id: 141504185

Test Plan: no behavior change

Reviewed By: qihqi, ejguan

Differential Revision: D31326152

fbshipit-source-id: 2e0261a96853bfb67a96dd68972c905b6b26d562
2021-10-25 15:42:52 -07:00
239b38268b [fx2trt] Better trt layer name (#67200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67200

We want to put more information in the TensorRT layer name. Mainly, we want to be able to tell which original op a TensorRT layer is mapped from.

The layer format is `[TensorRT Layer Type]-[Original Op Code]-[FX Node Name]`
```
Reformatting CopyNode for Input Tensor 0 to [FULLY_CONNECTED]-[acc_ops.linear]-[linear_1]: 0.0328ms
[FULLY_CONNECTED]-[acc_ops.linear]-[linear_1]: 0.027712ms
PWN([RELU]-[acc_ops.relu]-[relu_1]): 0.008672ms
```

Test Plan:
CI

```
buck run mode/dev-nosan -c python.package_style=inplace caffe2:fx2trt_example
```

Reviewed By: wushirong

Differential Revision: D31627274

fbshipit-source-id: 3dbb576caa63b922274541d2a306b4bd37e707c5
2021-10-25 15:41:38 -07:00
4ac8d06911 [quant] Remove unused print in quantization_patterns.py (#67191)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67191

Test Plan:
sandcastle and ossci

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31899784

fbshipit-source-id: 31ad63c0b2a5328fff80c38dc4e527e0399e802e
2021-10-25 15:07:18 -07:00
12daa4f663 [jit][edge] Enable CALL instruction in lite interpreter. (#65964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65964

ghstack-source-id: 141425519

Test Plan: buck run xplat/caffe2:test_lite_interpreter

Reviewed By: cccclai

Differential Revision: D31326149

fbshipit-source-id: 8a599d92f3fa4e6c125100adb36d89592e71e547
2021-10-25 14:44:33 -07:00
b8dfb45ac2 Refactor cub namespace handling (#66219)
Summary:
This PR is to update PyTorch with the following cub changes:
- Starting cub 1.13.1, cub requires users to define `CUB_NS_QUALIFIER` if `CUB_NS_PREFIX` is also defined. Besides that, a new mechanism `CUB_WRAPPED_NAMESPACE` is added.

And I do the following change to PyTorch:
- Starting CUDA 11.5, define `CUB_WRAPPED_NAMESPACE` globally as an nvcc flag.
- Fix caffe2 failures caused by the above change.
- Add a `aten/src/ATen/cuda/cub_definitions.cuh` that defines helper macros about feature availability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66219

Reviewed By: bdhirsh

Differential Revision: D31626931

Pulled By: ngimel

fbshipit-source-id: 97ebf5ef671ade8bf46d0860edc317f22660f26d
2021-10-25 14:37:09 -07:00
700b39a3df Sparse CSR CUDA: add torch.addmm with all inputs sparse (#63511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63511

This PR adds a `torch.addmm(c, a, b)` variant with `c, a, b` all being CSR tensors.
The underlying cuSPARSE function works only with 32-bit indices, and in
the current implementation the result tensor has 32-bit indices. Input
tensors can have either 64-bit or 32-bit index tensors.
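
A hedged sketch of the new variant (assumes a CUDA build with cuSPARSE and that `torch.sparse_csr_tensor` is available; the helper is illustrative):

```
import torch

def csr_eye(n, device="cuda"):
    # Identity matrix in CSR layout (illustrative helper for this sketch).
    crow = torch.arange(n + 1, dtype=torch.int32, device=device)
    col = torch.arange(n, dtype=torch.int32, device=device)
    val = torch.ones(n, device=device)
    return torch.sparse_csr_tensor(crow, col, val, size=(n, n))

a, b, c = csr_eye(4), csr_eye(4), csr_eye(4)
out = torch.addmm(c, a, b)  # c + a @ b with all inputs CSR; result has 32-bit indices
```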

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D31809838

Pulled By: cpuhrsch

fbshipit-source-id: 97005dba27d8adcae445eb756bcbd7271061e9b5
2021-10-25 14:32:30 -07:00
333717eaf0 Improve assert failure message in test_get_torch_func_signature_exhaustive (#67039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67039

cc mruberry

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31899719

Pulled By: cpuhrsch

fbshipit-source-id: 819d07da5b18b31d462010b9f9382e0b8cd10f9f
2021-10-25 14:20:38 -07:00
a6d0339492 [Pytorch Edge] Extend runtime compatibility to custom classes (#66972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66972

Add api to view how many custom classes we have and what their names are

Test Plan: unit test

Reviewed By: cccclai

Differential Revision: D31811337

fbshipit-source-id: 9f8ca1fc578a0a5360c9cd8f95475acc33f250e4
2021-10-25 13:42:26 -07:00
f4dd88489a Better and more consistent error messages in torch.linalg (#62734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62734

Following https://github.com/pytorch/pytorch/pull/62715#discussion_r682610788
- squareCheckInputs takes a string with the name of the function
- We reuse more functions when checking the inputs

The state of the errors in torch.linalg is far from great though. We
leave a more comprehensive clean-up for the future.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31823230

Pulled By: mruberry

fbshipit-source-id: eccd531f10d590eb5f9d04a957b7cdcb31c72ea4
2021-10-25 13:24:28 -07:00
4dce051cb0 [jit][edge] Add control stack frame to lite interpreter (#65963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65963

ghstack-source-id: 141425517

Test Plan: In next diff.

Reviewed By: qihqi, cccclai

Differential Revision: D31326150

fbshipit-source-id: dbbf65f2bf14846c45d0add71edc7d4dbfc6b92c
2021-10-25 12:15:16 -07:00
ac948f4f35 .github: Migrate linux-xenial-py3.6-gcc7 to GHA (#67072)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66888

cc seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67072

Reviewed By: seemethere

Differential Revision: D31900833

Pulled By: zhaoalex

fbshipit-source-id: 93f8995611169d991f90e07e8c13e08182969577
2021-10-25 11:40:12 -07:00
9de0888891 Move the registration of CPython builtin modules to BuiltinRegistry (#67085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67085

Leverages BuiltinRegistry to register the CPython standard C modules. The standard C modules that were moved are listed in the FOR_EACH macro.

Test Plan:
buck test mode/opt //caffe2/torch/csrc/deploy/interpreter:test_builtin_registry

buck test mode/opt //caffe2/torch/csrc/deploy:test_deploy

Reviewed By: shunting314

Differential Revision: D31848547

fbshipit-source-id: 7eb49d222eaaccb2b8ca5c984b05bf54cc233f25
2021-10-25 11:12:07 -07:00
d68bb50ef3 Disable SVE when cross-compiling for M1 (#67114)
Summary:
Followup after https://github.com/pytorch/pytorch/issues/58653
It does not matter whether one compiles locally or cross-compiles -
attempts to use SVE on M1 result in a compiler crash, as the SVE ABI is not
defined on MacOS

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67114

Reviewed By: VitalyFedyunin

Differential Revision: D31869356

Pulled By: malfet

fbshipit-source-id: 184e26ae40edc7ef7b703200b53ea7a15da74818
2021-10-25 11:03:00 -07:00
5d9ff8f30e [Static Runtime] Add static_runtime::fused_sigrid_transforms (#66659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66659

Original message: We added and registered a new operator, static_runtime::fused_sigrid_transforms, and modified the original sigrid_transforms to handle the non-fused case only.

Note: this diff was commandeered from a bootcamper. Some final touches were needed.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: swolchok

Differential Revision: D31550307

fbshipit-source-id: 287380be0cca20ee6e145bcc7217547bd58cf6d0
2021-10-25 10:44:46 -07:00
8d164a36fb Use at::native::is_nonzero in promoted ops to improve portability (#67097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67097

All delegated models have `is_nonzero` ops by default; making the op native and consumable without dispatch eases the portability of such models.
ghstack-source-id: 141375082

Test Plan:
`buck test caffe2/test/cpp/jit:jit -- BackendTest.TestComposite`

```
~/fbsource/fbcode] cd ~/fbsource/fbcode/ && buck test caffe2/test:jit -- test_trace_arange
Parsing buck files: finished in 0.5 sec
Building: finished in 9.4 sec (100%) 16035/16035 jobs, 0/16035 updated
  Total time: 10.0 sec
More details at https://www.internalfb.com/intern/buck/build/1e55eea5-2adb-41d1-96ae-cbf4b446d6c6
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 46eedba2-ae17-4e88-b205-93bd1332665d
Trace available for this run at /tmp/tpx-20211015-113905.235421/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1970324912349177
    ✓ ListingSuccess: caffe2/test:jit - main (12.372)
    ✓ Pass: caffe2/test:jit - test_trace_arange (jit.test_tracer.TestTracer) (13.748)
    ✓ Pass: caffe2/test:jit - test_trace_arange_with_grad (jit.test_tracer.TestTracer) (13.892)
Summary
  Pass: 2
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1970324912349177
```

Reviewed By: iseeyuan

Differential Revision: D31656842

fbshipit-source-id: c0e6c798478a2783c0e17e6e9100ba5ce044da78
2021-10-25 10:18:31 -07:00
acb340de75 [Pytorch][Bootcamp] Add fixes and vanilla testing for Adagrad non-vectorized and vectorized optimizers to handle complex numbers (#66671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66671

Made changes in the step function of the vectorized and non-vectorized Adagrad optimizers to handle complex numbers as two real numbers, as per issue #65711 on GitHub.
ghstack-source-id: 141442350

Test Plan:
buck test mode/dev caffe2/test:optim -- 'test_adagrad_complex'
https://pxl.cl/1Rd44

Reviewed By: albanD

Differential Revision: D31673503

fbshipit-source-id: 90a0d0c69b556716e2d17c59ce80f09c750fc464
2021-10-25 10:13:21 -07:00
a0495b3cdb [SR] Remove unused operator() overload (#67001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001

The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.

This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

`buck test caffe2/test:static_runtime`

Reviewed By: hlu1

Differential Revision: D31821990

fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
2021-10-25 08:18:58 -07:00
364645cd9d [SR] Factor operator() implementation into separate function (#67125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67125

Using explicit template instantiations in D31659973 (f2582a59d0) was a bad idea. The problem is that the lvalue instantiation was for a `const` vector of `IValue`, meaning that if you tried to pass SR a non-const vector of arguments, the linker would fail to find the symbol.

The reason we didn't catch this in D31659973 (f2582a59d0) was because predictor always passes a `const` reference anyways. But we should fix this to prevent unexpected problems in the future.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D31873406

fbshipit-source-id: 5ab5a03334bed925cec11facadcedf9bec9b90ad
2021-10-25 08:17:40 -07:00
edd4d246c3 Accept 0-dim channel inputs in convolution layer (#66256)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56998 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66256

Reviewed By: mrshenli

Differential Revision: D31859428

Pulled By: jbschlosser

fbshipit-source-id: 034b6c1ce35aac50eabfa09bbcd8b1e3c8b171bd
2021-10-25 08:12:29 -07:00
6c985b57ff OpInfo : nn.functional.embedding (#66997)
Summary:
Adds OpInfo for `nn.functional.embedding`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66997

Reviewed By: mrshenli

Differential Revision: D31859799

Pulled By: zou3519

fbshipit-source-id: bbca860df4fbc243751f5fa81658231866c31d2e
2021-10-25 08:06:32 -07:00
adc21f1966 [quant] Fix docs build (#67169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67169

Looks like the doc error only appears after it's landed

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D31890431

fbshipit-source-id: d40cba082712c4b35704ea15d82fbc4749f85aec
2021-10-25 08:02:26 -07:00
dd81fa9027 [JIT] Freeze allows preservation of submodule attributes (#66102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66102

This change allows the `preserved_attributes` parameter of `torch.jit.freeze` to accept attributes of submodules. Previously, only root-level attributes could be preserved. Example:

```
class SubModule(nn.Module):
    def __init__(self):
        super(SubModule, self).__init__()
        self.a = 1
        self.b = 2

    def forward(self):
        return self.a + self.b

class Module(nn.Module):
    def __init__(self):
        super(Module, self).__init__()
        self.sub = SubModule()

    def forward(self):
        return self.sub()

mod = torch.jit.script(Module())
mod.eval()
frozen_mod = torch.jit.freeze(mod, preserved_attrs = ['sub.a'])

frozen_mod.sub   # OK
frozen_mod.sub.a # OK
frozen_mod.sub.b # Error, not preserved
frozen_mod()     # = 3
frozen_mod.sub.a = 0
frozen_mod()     # = 2
```

Test Plan: `buck test caffe2/test:jit -- TestFreezing`

Reviewed By: eellison

Differential Revision: D31383868

fbshipit-source-id: 34a05ca9528d4e5f04f71ac2a339fd584a8fa305
2021-10-25 07:56:20 -07:00
09c7771e9c Set test owners for jit tests (#66808)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66808

Reviewed By: mrshenli

Differential Revision: D31761414

Pulled By: janeyx99

fbshipit-source-id: baf8c49ff9c4bcda7b0ea0f6aafd26380586e72d
2021-10-25 07:51:10 -07:00
364c4959c3 [quant] Fix docs error in convert_fx (#67152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67152

Test Plan:
```
cd docs
make html
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31884570

fbshipit-source-id: 2b521f617c93f6fa08da3387df2d25497293eee6
2021-10-24 19:26:45 -07:00
a7ebf76a15 jit trace (#59949)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59949

Reviewed By: ZolotukhinM

Differential Revision: D31366787

Pulled By: Krovatkin

fbshipit-source-id: 798cbcd97e8ecfba984f98cd70214954be9309af
2021-10-24 18:04:22 -07:00
f1b5f1898b Automated submodule update: kineto (#67133)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: 879a203d9b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67133

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D31877172

fbshipit-source-id: 224a499607d1f3bf7c00d8d8dd1fdac47cd33a3b
2021-10-24 13:06:19 -07:00
b51731527d [ez] [Docs] Missing import in example for post_local_sgd (#67047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67047

Fix missing import
ghstack-source-id: 141258423

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31841837

fbshipit-source-id: 139e614517dcac7a53259ff7a0360bb5275bb53b
2021-10-24 01:44:06 -07:00
0000c88e10 [FSDP] No need for list() in _get_shard (#66957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66957

chunk appears to return a tuple, which is enough given that we just
index into the right chunk and discard the rest.
ghstack-source-id: 141391149

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31780799

fbshipit-source-id: fdb1b77fffa916328e14a4cd692b5241ae46a514
2021-10-24 01:29:19 -07:00
580efb35a5 [FSDP] Add some comments after reading the code. (#66956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66956

Adds some comments I found helpful while ramping up on FSDP code.
ghstack-source-id: 141391150

Test Plan: n/a

Reviewed By: mrshenli

Differential Revision: D31780798

fbshipit-source-id: e2d38a9801b4548b202a73615774d5f0f7f5e3ed
2021-10-24 01:28:19 -07:00
b6fa998892 Revert D31514095: Use kernel_func_name from aotCompiler
Test Plan: revert-hammer

Differential Revision:
D31514095 (7b55dc8340)

Original commit changeset: b70c8e2c7336

fbshipit-source-id: ad4d828f33506e612b51c276149fa0e12b0565d5
2021-10-23 17:17:53 -07:00
313939c9c6 [quant] Fix lint errors (#67138)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67138

Test Plan:
ossci

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31879558

fbshipit-source-id: 271905d3d254c906aa78bae9f2bd411f9d57e1e8
2021-10-23 11:26:25 -07:00
7b55dc8340 Use kernel_func_name from aotCompiler (#66337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66337

Right now, the assembly code generated for a given method from the model is named wrapper or func by default. The function name is then replaced with a proper kernel_func_name after target-specific assembly is generated.
This PR propagates the desired kernel_func_name from the aotCompiler API so that the generated function has the needed name from the start and doesn't need to be renamed later.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31514095

Pulled By: priyaramani

fbshipit-source-id: b70c8e2c733600a435cd4e8b32092d37b7bf7de5
2021-10-23 02:20:45 -07:00
64c68edaf3 [pt] Add Half precision support for bucketize and searchsorted op (#67077)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67077

Test Plan: CI

Reviewed By: yinghai

Differential Revision: D31852556

fbshipit-source-id: 1e4212146ee67edc6b6568d25db79de525782788
2021-10-22 23:37:37 -07:00
2d81d5ab0a [quant][graphmode][fx] Remove fbgemm_backend_config_dict for now (#67066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67066

We'll add it later when the api is ready

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31849079

fbshipit-source-id: 0c00d08510166b2d897cf1562c7276527319b05c
2021-10-22 21:57:56 -07:00
8460fa5707 [quant][fx] Add an option in convert_fx to accept qconfig_dict to skip quantization (#66878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66878

Currently convert_fx quantizes all layers that have been prepared, depending on the prepare qconfig_dict.
This PR adds support for accepting a variation of qconfig_dict in convert_fx that can be used to skip quantizing certain layers.

This helps with workflows that prepare/observe all operators but quantize only a subset of them (based on quantization error), avoiding the need to prepare multiple times.

The qconfig_dict passed to convert_fx can only have its values set to `None`, with the keys being the same as what is allowed in the prepare qconfig_dict.
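
A minimal sketch of the enabled workflow, assuming the `torch.quantization.quantize_fx` module paths of this era; the convert-time `qconfig_dict` keyword is the argument this PR adds:

```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = torch.nn.Linear(4, 4)
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.fc(self.sub(x))

m = M().eval()
# Prepare/observe every layer once.
prepared = prepare_fx(m, {"": get_default_qconfig("fbgemm")})
prepared(torch.randn(1, 4))  # calibration pass
# Quantize only a subset at convert time: values must be None, meaning
# "skip quantizing this entry" (key formats match the prepare qconfig_dict).
quantized = convert_fx(prepared, qconfig_dict={"module_name": [("sub", None)]})
```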

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_convert_qconfig_dict

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31808247

fbshipit-source-id: a4f5dca1090f0083fc3fea14aff56924033eb24f
2021-10-22 21:18:15 -07:00
d13829e6be [quant][[fx] update observer_fqn to not depend on node.name (#66767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66767

Make observer fqn in prepare step independent of input_node/observed_node name.
This change names the observers as `{input/output}_activation_post_process_{idx}` where idx will be incremented for each new observer instance and is guaranteed to be unique.

Test Plan:
python test/test_quantization.py test_observer_fqn

Imported from OSS

Reviewed By: anjali411

Differential Revision: D31752052

fbshipit-source-id: e0995b1ef33a99d5b012133fe92d303d55a73b7d
2021-10-22 21:16:24 -07:00
83f70db95c Fix common device computation for comparison ops. (#66245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66245

Fixes #66053

This PR splits `declare_static_dtype_and_device` into two new methods for
`TensorIteratorBase`: `declare_static_dtype` and `declare_static_device`.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503849

Pulled By: ngimel

fbshipit-source-id: 4b131b691d29ceb5f3709f5d6503997ea0875c54
2021-10-22 18:43:17 -07:00
3f5adf4f9c [quant][graphmode][fx] Use the new convert function instead of the old one in quant-fx2trt tests (#67065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67065

Switching to use _convert_fx_do_not_use in the tests

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31849077

fbshipit-source-id: 3688fc09ac538b6abc16ce87c600b8ee04acfcd1
2021-10-22 18:29:58 -07:00
af1a2df825 enable better depthwise conv perf on cudnn 8.2+ (#58749)
Summary:
There have been multiple improvements to depthwise convolution speed in cudnn between 7.6 and 8.2, since https://github.com/pytorch/pytorch/pull/22302.
This PR aims to harvest all of the new improvements by enabling more cudnn kernels. The workload checking logic can also be simplified now.
To keep the change simple, I kept behavior before cudnn 8.2 unchanged.

Similar to https://github.com/pytorch/pytorch/pull/22302, I used a script [here](https://gist.github.com/FDecaYed/e8ba98a95cd33697df2ace86fdb44897) to benchmark. Both runs use cudnn 8.2.
One enhancement I made to the script is switching to event-based timing. With warmup kernels to fill the launch queue ahead, this should give us accurate kernel timing even in CPU launch-bound cases; a sketch of the approach follows below.
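
A minimal sketch of event-based timing, assuming an illustrative depthwise layer and shapes (not the benchmarked configurations):

```
import torch

conv = torch.nn.Conv2d(128, 128, kernel_size=5, padding=2, groups=128).cuda()
x = torch.randn(32, 128, 56, 56, device="cuda")

for _ in range(10):  # warmup kernels fill the launch queue ahead of timing
    conv(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    conv(x)
end.record()
torch.cuda.synchronize()  # wait for the queued kernels before reading the events
print(start.elapsed_time(end) / 100, "ms per iteration")
```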

Here is A100 and V100 result sorted by speedup.
[Book1.xlsx](https://github.com/pytorch/pytorch/files/6530371/Book1.xlsx)

Result highlights:
Newly enabled 5x5 cudnn kernels show up to 6x speedup.
Close to half of the tested sizes show >10% speedup.
Fixed some corner cases that previously caused 15-20x slowdowns.
Only a handful of cases (~10 out of >1000) slow down.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58749

Reviewed By: bdhirsh

Differential Revision: D31613199

Pulled By: ngimel

fbshipit-source-id: 883b58facad67ccd51dc9ab539368b4738d40398
2021-10-22 17:47:07 -07:00
cf3a5160f8 [BE] move init_multigpu_helper to common_distributed (#67050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67050

This PR moves init_multigpu_helper to common_distributed so that it can be shared by different distributed tests.
ghstack-source-id: 141370119

Test Plan: wait for ci.

Reviewed By: mrshenli

Differential Revision: D31842644

fbshipit-source-id: c7bad25d6cef9bdce7ad1fb6c60c1cad4b765702
2021-10-22 17:16:11 -07:00
df3f82a1ef Add more FSDP unit tests to cover core logic, freezing weights and flatten parameter wrapper (#66904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66904

Add more FSDP unit tests to cover core logic, freezing weights, and the flatten parameter wrapper. These unit tests are refactored to align with PyTorch's commonly used test classes.
ghstack-source-id: 141335614

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D31779565

fbshipit-source-id: c727110d1d7570c0ec49e42cadfc9e9a5e440073
2021-10-22 16:50:52 -07:00
f6c88fa99d Revert D31627107: [BE] delete frontend.cpp
Test Plan: revert-hammer

Differential Revision:
D31627107

Original commit changeset: 07d30d280c25

fbshipit-source-id: 5e82f2158f5007c67adb8f947f8cc4d995a9a3bc
2021-10-22 16:39:02 -07:00
f50bf16c04 Revert D31663043: [BE] minor improvement to dist quantization
Test Plan: revert-hammer

Differential Revision:
D31663043

Original commit changeset: 2f96b7346e9c

fbshipit-source-id: d38684dfe79ca335fbbe624496ad4c86c29d1570
2021-10-22 16:37:41 -07:00
7b0408684b Fix linter (#67122)
Summary:
Fixes regression introduced by 7e5aa0d35a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67122

Reviewed By: seemethere

Differential Revision: D31872569

Pulled By: malfet

fbshipit-source-id: ada0137db9a46cbec573489c9c37a94f3a7576ae
2021-10-22 16:02:36 -07:00
018e06edca [torchelastic] Skip tests in tsan mode (#67103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67103

Skip tests in tsan mode for now. More info: T104010063

Test Plan: sandcastle + running tests in mode/dev-tsan

Reviewed By: d4l3k

Differential Revision: D31861426

fbshipit-source-id: d50e5d06afbc82ccce6d102e52f72b5b01f6f41a
2021-10-22 15:55:18 -07:00
7e5aa0d35a fixed unique arguments documentation (#66132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66132

Differential Revision: [D31397746](https://our.intern.facebook.com/intern/diff/D31397746/)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31734476

Pulled By: samdow

fbshipit-source-id: 8999443c7f9b24394d7543652b8350261c1f8b3a
2021-10-22 14:50:02 -07:00
a7bbf8814c [quant][graphmode][fx] Move quant-fx2trt unittests to test_quantize_fx.py (#67064)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67064

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31849075

fbshipit-source-id: 9c5e8aad7c88070830d853faf3106491726e77ff
2021-10-22 14:36:36 -07:00
7379d4db20 [BE] minor improvement to dist quantization (#66649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649

Some minor changes to dist quantization: mainly change the namespace and add some notes for future code dedup.
ghstack-source-id: 141336191

Test Plan: wait for ci

Reviewed By: cbalioglu

Differential Revision: D31663043

fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
2021-10-22 13:46:28 -07:00
1da628bdb7 [ONNX] Update slice process shape to support rank only inference (#65782) (#66149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66149

The updated logic is able to infer the rank of the slice output when only the rank is known for the slice input. This enables cases where `ConstantValueMap::HasRank(input)` is `True` while `ConstantValueMap::HasShape(input)` is `False`.

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31423840

Pulled By: malfet

fbshipit-source-id: 17b2b24aa63435d5212ebe6bdf66ae3c348c4e3b

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-10-22 13:46:26 -07:00
0bc9928f31 [ONNX] Symbolic: dynamic input for OneHot, bool for Einsum (#65940) (#66147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66147

Symbolic: dynamic input for OneHot, bool for Einsum

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424094

fbshipit-source-id: 76bea22b29c93d1621c597fe7ab59deb3685087f

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-10-22 13:46:24 -07:00
2c0fe338da [ONNX] Modify softplus symbolic to support beta!=1 (#65001) (#66146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66146

* Modify softplus symbolic to support beta!=1

* Remove parse args

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424096

fbshipit-source-id: 971af54a28141737ccb17510ada03b0651be2a63
2021-10-22 13:46:22 -07:00
6f3f302d9f [ONNX] Deprecate fold_if pass (#65697) (#66145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66145

Deprecate fold_if pass

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424097

fbshipit-source-id: 25b89679c756393a1065ca6aaa24d29db960cbd4

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-10-22 13:46:20 -07:00
a0fc14c20f [ONNX] Add diagonal symbolic (#64454) (#66144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66144

* Add logic and tests

* minor edits

* Eliminate expand ops

* Fix flake and editing

* Modified error message

* Add overrun check

* Add overrun descriptions

* Remove emptyline

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424095

fbshipit-source-id: 5b8ef6ac21c32d43c3dbc8e51e1ef30bffb19c25
2021-10-22 13:46:18 -07:00
b18c298f24 ONNX: Delete or document skipped ORT tests (#64470) (#66143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66143

Delete test_list_remove. There's no point in testing conversion of
this model since TorchScript doesn't support it.

Add a link to an issue tracking test_embedding_bag_dynamic_input.

[ONNX] fix docs (#65379)

Mainly fix the sphinx build by inserting empty lines before
bulleted lists.

Also some minor improvements:
Remove superfluous descriptions of deprecated and ignored args.
The user doesn't need to know anything other than that they are
deprecated and ignored.

Fix custom_opsets description.

Make indentation of Raises section consistent with Args section.

[ONNX] publicize func for discovering unconvertible ops (#65285)

* [ONNX] Provide public function to discover all unconvertible ATen ops

This can be more productive than finding and fixing a single issue at a
time.

* [ONNX] Reorganize test_utility_funs

Move common functionality into a base class that doesn't define any
tests.

Add a new test for opset-independent tests. This lets us avoid running
the tests repeatedly for each opset.

Use simple inheritance rather than the `type()` built-in. It's more
readable.

* [ONNX] Use TestCase assertions rather than `assert`

This provides better error messages.

* [ONNX] Use double quotes consistently.

[ONNX] Fix code block formatting in doc (#65421)

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424093

fbshipit-source-id: 4ced841cc546db8548dede60b54b07df9bb4e36e
2021-10-22 13:46:16 -07:00
7a78f715a6 [ONNX] Add warning for inplace updates on tensor.shape in tracing mode (#63170) (#66142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66142

* Add warning

* Lint and clang fixes

* Remove duplicate comments

* Added pitfalls section

* Modify sections

* Minor modifications

* Add underline to avoid doc build failures

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424092

fbshipit-source-id: c83195f3c66885ad1aecde13b3029c45dd171dbd
2021-10-22 13:46:14 -07:00
136abf5aff [ONNX] Update sum symbolic to handle dtypes (#64289) (#66141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66141

* Update aten::sum symbolic for dtype

* Remove nesting and modify opeartor tests

* Fix expect files

[ONNX] Fix expect files added in #64289 (#65356)

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424091

fbshipit-source-id: d4af21e9f0d7e1c68bf6ef2f3e385db84b4c53f3
2021-10-22 13:46:12 -07:00
53a163a015 [ONNX] Export nn.Module call as ONNX local function (#63589) (#66140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66140

* Add new argument to export api to enable users specifying `nn.Module` classes that they wish to be exported as local function in ONNX model.
* Refactor `torch/csrc/jit/serialization/export.cpp`, and remove redundant `EncoderBase` class.
* ~~Contains changes from #63268~~
* Depends on #63716 to update onnx submodule.

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31424098

fbshipit-source-id: c949d0b01c206c30b4182c2dd1a5b90e32b7a0d3

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-10-22 13:44:56 -07:00
d1986a1cf5 [BE] delete frontend.cpp (#66581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66581

c10d/frontend.cpp was originally proposed to introduce a pure C++ API and use TorchBind to share the python-level API with TorchScript. This is no longer needed, so delete it to reduce code redundancy.
ghstack-source-id: 141336190

Test Plan: wait for ci

Reviewed By: rohan-varma

Differential Revision: D31627107

fbshipit-source-id: 07d30d280c25502a222a74c2c65dfa4069ed8713
2021-10-22 13:33:24 -07:00
e8742f15cf [quant][graphmode][fx] Add observation_type.py (#67063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67063

Adding ObservationType Enum for `backend_config_dict`

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31849078

fbshipit-source-id: e9e7225d564b51fa9454f7f087dd134152c069a0
2021-10-22 12:17:54 -07:00
f2582a59d0 [SR] Add rvalue overload for operator() (#66648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648

Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:

1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```

The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.

Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.

Reviewed By: hlu1

Differential Revision: D31659973

fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
2021-10-22 10:51:47 -07:00
40a8a50913 Add static_runtime::fused_equally_split (#2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66881

Adds the `static_runtime::fused_equally_split` operator and removes the `is_fused` logic from the original operator. Modifies `FuseListUnpackV2` to map `fb::equally_split` to this new operator.

Test Plan:
```
adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s
❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
and sandcastle
strange_what_could_go_wrong

Reviewed By: mikeiovine

Differential Revision: D31742293

fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39
2021-10-22 10:26:49 -07:00
391eb1dbe3 [JIT] UseVariadicOp handles multiple lists (#66288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66288

This change makes it so `UseVariadicOp` can transform ops with many Tensor list inputs.

Input pattern:
```
%output : Type = op(%list_1, %arg_1, %list_2, %list_3)
```
Output pattern:
```
%output : Type = variadic_op(%list_11, ..., %list_1N, %arg_1, %list_21, ..., %list_2M, %list_31, ..., %list_3K, N, M, K)
```
The length of each list is passed at the end of the variadic op so that the op implementation can process the inputs appropriately. This also frees us from needing to update `hasVarArgs` in static runtime each time we add a variadic op.

This diff also makes `UseVariadicOp` more robust. Before, `list_idx` was passed as an argument. Now, `VariadicUpdater` determines `list_idx` from the node's schema.

Test Plan:
Existing variadic ops do not break:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31450811

fbshipit-source-id: 808fcc3ae8940b9e602586f38f8cf9154c9a6462
2021-10-22 10:22:33 -07:00
c7121ae77f fix formatting CIRCLE_TAG when building docs (#67026)
Summary:
Similar to pytorch/text#1416
malfet, brianjo

The previous code failed when tags changed from `v0.9.0` to `v0.10.0`. I tested this offline; it would be nice to somehow actually tag the repo and see that this adds the correct documentation directory to the pytorch/pytorch.github.io repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67026

Reviewed By: saketh-are

Differential Revision: D31843381

Pulled By: malfet

fbshipit-source-id: 21526ad9ed4c1751c2d7f6d621da305f166a7f55
2021-10-22 10:10:52 -07:00
d9c4b3feab Do rowwisemoments computation in float for half LayerNorm (#66920)
Summary:
https://github.com/pytorch/pytorch/issues/66707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66920

Reviewed By: mrshenli

Differential Revision: D31850612

Pulled By: ngimel

fbshipit-source-id: a95a33567285dcf9ee28d33f503cead3268960f9
2021-10-22 09:50:42 -07:00
6e6ede2e70 [JIT] Re-enable alias sensitive peepholes (#65860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65860

Re-enable peepholes like `x + 0 == x`. These were at one point enabled, then disabled because they did not properly account for aliasing, then re-enabled while reconstructing the alias db every time, which is slow: O(n^2). I've added correctness conditions, and I've also made it so that we avoid using stale aliasing properties for either the input or output of nodes we optimize; a sketch of inspecting the result follows below.
Some of the other code that we have written to avoid re-instantiating the alias db involves internally mutating it, but this is tricky to reason about and we would probably have to add some extra invariants...
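
A hedged sketch of checking whether the peephole fired; `graph_for` follows the test-suite idiom for fetching the optimized graph, and whether the rewrite actually triggers depends on executor settings and warm-up:

```
import torch

@torch.jit.script
def f(x):
    return x + 0  # candidate for the x + 0 == x peephole

x = torch.randn(3)
f(x); f(x)  # profiling runs so the optimized graph is materialized
# If the peephole fired, aten::add should be gone from the optimized graph.
print(f.graph_for(x))
```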

cc navahgar relevant to graph opts and d1jang alias analysis relevant here

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31352382

Pulled By: eellison

fbshipit-source-id: 441a27f17dc623d6c24538d1d43cba0412c3c482
2021-10-22 09:45:57 -07:00
051ea5ccbf [Static Runtime] Bundle function & function_kind to carry them together (#66974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66974

`D31591785 (67e003f09b)` started carrying a function object to be executed and `FunctionKind` for the type of the function *separately*, and this caused a bug fixed by D31783028 (79803b199f).

This change bundles them together, as was done before by swolchok, to reduce the chances of such a mistake in the future. They always need to be carried together, since `FunctionKind` identifies the type of the function object.

Note that `struct Function` is a POD type, so accessing its fields (first, second) shouldn't cause extra overhead in `ProcessedNode::run()`.

Test Plan:
Confirmed that the managed memory metrics remain the same before/after this diff on inline_cvr:

```
#AFTER
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
# inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
# inline_cvr/remote_ro
First iter time: 12.0344 ms
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

```
#BEFORE
#  inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

#inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)

#inline_cvr_remote_ro
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

Reviewed By: mikeiovine

Differential Revision: D31798419

fbshipit-source-id: fd4301b6731e402be0820729654735c791511aba
2021-10-22 08:57:49 -07:00
3d7a344c5e Fix ArchiveReader to keep archive path (#67035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67035

Incorporate the same change from https://github.com/pytorch/data/pull/73

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D31837963

Pulled By: ejguan

fbshipit-source-id: 3b0171ba30f392c8773c497702bc60aa4fbe28c6
2021-10-22 06:34:39 -07:00
d1a5612a3e remove accscalar from i0 and i0e (#67048)
Summary:
Removes some of the half math ops to make https://github.com/pytorch/pytorch/issues/64023 possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67048

Reviewed By: mruberry

Differential Revision: D31847249

Pulled By: ngimel

fbshipit-source-id: 8385aacd846bb990e368ff336eb346d847af70b9
2021-10-22 01:34:36 -07:00
5f58764d1d [PyTorch Edge][type] Add type support for NamedTuple custom class (import) (#63130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63130

Extend `type_parser` to handle the `NamedTuple` type. It can be extended to handle other types when needed. The custom type follows this format:
```
"qualified_named[
    NamedTuple, [
        [filed_name_1, field_type_1],
        [filed_name_2, field_type_2]
    ]
]"
```
For example:
```
"__torch__.base_models.sparse_nn.pytorch_preproc_types.PreprocOutputType[
    NamedTuple, [
        [float_features, Tensor],
        [id_list_features, List[Tensor]],
        [label,  Tensor],
        [weight, Tensor],
        ]
    ]"
```

For nested types, the order of type lists from type table should be:
```
std::string type_1 = “__torch__.C [
    NamedTuple, [
        [field_name_c_1, Tensor],
        [field_name_c_2, Tuple[Tensor, Tensor]],
    ]
]”

std::string type_2 = “__torch__.B [
   NamedTuple, [
       [field_name_b, __torch__.C ]
   ]
]”

std::string type_3 = “__torch__.A[
   NamedTuple, [
       [field_name_a, __torch__.B]
   ]
]”
std::vector<std::string> type_strs = {type_1, type_2, type_3};
std::vector<TypePtr> type_ptrs =  c10::parseType(type_strs);
```

namedtuple from both `collections` and `typing` is supported
```

from typing import NamedTuple
from collections import namedtuple
```

This change only adds the parser; the new runtime can now read the above format.
ghstack-source-id: 141293658

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.CompatiblePrimitiveType'
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.CompatibleCustomType'

buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.InCompatiblePrimitiveType'
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.InCompatibleCustomType'
```

Reviewed By: iseeyuan

Differential Revision: D30261547

fbshipit-source-id: 68a9974338464e320b39a5c613dc048f6c5adeb5
2021-10-22 00:40:57 -07:00
d3fc3c4ded Implement forward AD for linalg.matrix_exp (#62716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62716

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31823231

Pulled By: mruberry

fbshipit-source-id: 6d19b8988dce773b5716f0522d06febfe167fead
2021-10-21 23:55:36 -07:00
fe102b9888 diff tool (#66854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66854

diff tool and script to test correctness of flatbuffer format

Test Plan:
`./verify_flatbuffer.sh | pastry`
P463163180

Reviewed By: zhxchen17

Differential Revision: D31752696

fbshipit-source-id: bea00102b21e62c02367853c8bec2742b483fbda
2021-10-21 22:53:51 -07:00
8ea985f240 [quant][fx][graphmode] Rename files and functions for convert and add do_not_use suffix (#66955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66955

The new convert function is not meant to be used by users; it's a temporary function that we use to build up the new convert path. We will bring feature parity with the old path and deprecate the old path after that.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31810488

fbshipit-source-id: 2f65a110506683123350e619c48df090a15570fc
2021-10-21 22:17:28 -07:00
01ced45217 [iOS] Bump up iOS CocoaPods version to 1.10.0 (#67058)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67058

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D31846445

Pulled By: hanton

fbshipit-source-id: 7510a6c15fdeecc996fcce5c48db32e148ba7def
2021-10-21 21:30:24 -07:00
77beccaedb Do not build PyTorch with caffe2 by default (#66658)
Summary:
CAFFE2 has been deprecated for a while, but is still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.

Build even fewer dependencies when compiling mobile builds without Caffe2.
Introduce `TEST_CAFFE2` in torch.common.utils.
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if code is compiled without Caffe2.
Should be landed after https://github.com/pytorch/builder/pull/864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D31669156

Pulled By: malfet

fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
2021-10-21 20:32:47 -07:00
4fe8055b9f made functorch not decompose by default (#66945)
Summary:
Basically reverting this: https://github.com/pytorch/pytorch/pull/63616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66945

Reviewed By: zou3519

Differential Revision: D31802176

Pulled By: Chillee

fbshipit-source-id: b1cabd7af66aef26411801516c87336eaea4fccb
2021-10-21 19:18:00 -07:00
28fac23409 Fixes CUDA vs CPU consistency for index_put_ when accumulating (#66790)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39227
Fixes https://github.com/pytorch/pytorch/issues/66495 (duplicate of 39227)

Description:
- Expands values for the CUDA implementation
- Improved shape checking for CUDA
- Improved error message for CUDA
- Added tests (a small sketch of the now-consistent behavior follows below)
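
A small sketch of the accumulate path this makes consistent; duplicate indices accumulate, and the values tensor is broadcast against the indices (runnable on CPU; CUDA now matches):

```
import torch

x = torch.zeros(3)
idx = torch.tensor([0, 0, 1])
# A 0-dim values tensor is expanded to match the indices.
x.index_put_((idx,), torch.tensor(1.0), accumulate=True)
print(x)  # tensor([2., 1., 0.]), now the same on CPU and CUDA
```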

cc zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66790

Reviewed By: mruberry

Differential Revision: D31843566

Pulled By: ngimel

fbshipit-source-id: c9e5d12a33e1067619c210174ba6e3cd66d5718b
2021-10-21 19:09:57 -07:00
35965869cf Enroll bowangbj@ to PyTorch distributed package (#67062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67062

For cc and potential reviews

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31849050

fbshipit-source-id: d3899c2ca857b8f22bdc88b4e83cdd20bbf0b1d6
2021-10-21 18:45:21 -07:00
20f08d23a0 Revert D31838513: Strided masked reduction: mean.
Test Plan: revert-hammer

Differential Revision:
D31838513

Original commit changeset: 54b99ccf9821

fbshipit-source-id: 5480e8482c8770b41579ee085e158572b659c1f5
2021-10-21 18:32:42 -07:00
2578de4851 [skip ci] Set test owner for test_cuda* tests (#66838)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66838

Reviewed By: saketh-are

Differential Revision: D31841411

Pulled By: janeyx99

fbshipit-source-id: 5cdffdef4a92f9adcef1143ae4598b052c5acc6b
2021-10-21 17:36:25 -07:00
b40a940192 Strided masked reduction: mean. (#66784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66784

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31838513

Pulled By: cpuhrsch

fbshipit-source-id: 54b99ccf9821832c31976406379939b3c95f41de
2021-10-21 16:32:45 -07:00
b696d64ef4 Binaries without AVX512 kernels shouldn't report CPU Capability as AVX512 on machines with AVX512 support (#66703)
Summary:
### BUG
If a PyTorch binary is built with a compiler that doesn't support all the AVX512 intrinsics in the codebase, then it won't have ATen AVX512 kernels, but at runtime, CPU capability would still be incorrectly returned as AVX512 on a machine that supports AVX512. It seems that PyTorch Linux releases are done on CentOS with `gcc 7.3`, so this bug would manifest in the 1.10 release, unless a fix such as this one is added. gcc versions below 9.0 don't support all the AVX512 intrinsics in the codebase, such as `_mm512_set_epi16`.

### FIX
CPU Capability would be returned as AVX512 at runtime only if the binary was built with a compiler that supports all the AVX512 intrinsics in the codebase, and if the hardware the binary is being run on supports all the required AVX512 instruction sets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66703

Reviewed By: gchanan

Differential Revision: D31732625

Pulled By: malfet

fbshipit-source-id: e52d06b87fbe2af9b303a2e9c264189c8512d5ec
2021-10-21 16:17:28 -07:00
33790c4e06 Implement histogramdd on CPU (#65318)
Summary:
Implements `torch.histogramdd` analogous to `numpy.histogramdd`.

Builds on https://github.com/pytorch/pytorch/pull/58780, generalizing the existing `torch.histogram` kernel to handle D-dimensional inputs.
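
A minimal usage sketch mirroring `numpy.histogramdd`:

```
import torch

x = torch.rand(100, 2)  # 100 samples in 2-D
hist, bin_edges = torch.histogramdd(x, bins=[4, 4])
print(hist.shape)       # torch.Size([4, 4])
print(len(bin_edges))   # 2, one edges tensor per dimension
```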

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65318

Reviewed By: soulitzer

Differential Revision: D31654555

Pulled By: saketh-are

fbshipit-source-id: 14b781fac0fd3698b052dbd6f0fda46e50d4c5f1
2021-10-21 16:09:31 -07:00
6a224b3370 Set test owners for quantization tests (#66832)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66832

Reviewed By: saketh-are

Differential Revision: D31842880

Pulled By: janeyx99

fbshipit-source-id: 8aee760e4203045c12e7548a21ed5b71c557e3ee
2021-10-21 16:04:41 -07:00
f29e5220a6 Revert D31474901: [pytorch][PR] [numpy] add torch.argwhere
Test Plan: revert-hammer

Differential Revision:
D31474901

Original commit changeset: 335327a4986f

fbshipit-source-id: 534093e459762ff7a888c58d76e49e362015f2ba
2021-10-21 15:50:54 -07:00
fcfa06586d Wextra fix for NamedTensor.cpp (#66897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66897

Fixes:
```
stderr: caffe2/aten/src/ATen/native/NamedTensor.cpp:226:19: error: comparison of integers of different signs: 'const unsigned long' and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
    if (order_idx >= ellipsis_idx) {
        ~~~~~~~~~ ^  ~~~~~~~~~~~~
stderr: caffe2/aten/src/ATen/native/NamedTensor.cpp:226:19: error: comparison of integers of different signs: 'const unsigned long' and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
    if (order_idx >= ellipsis_idx) {
        ~~~~~~~~~ ^  ~~~~~~~~~~~~
```

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31774623

fbshipit-source-id: b6e5b76695e512084ac5c9cb4215de7e9b763cf8
2021-10-21 14:22:38 -07:00
462f333c01 [numpy] add torch.argwhere (#64257)
Summary:
Adds `torch.argwhere` as an alias to `torch.nonzero`

Currently, `torch.nonzero` actually provides functionality equivalent to `np.argwhere`.

From NumPy docs,
> np.argwhere(a) is almost the same as np.transpose(np.nonzero(a)), but produces a result of the correct shape for a 0D array.
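
A quick illustrative sketch of the alias:

```
import torch

t = torch.tensor([[0, 1], [2, 0]])
print(torch.argwhere(t))  # tensor([[0, 1], [1, 0]]), one row per nonzero element
print(torch.nonzero(t))   # identical output; argwhere simply aliases nonzero
```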

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64257

Reviewed By: dagitses

Differential Revision: D31474901

Pulled By: saketh-are

fbshipit-source-id: 335327a4986fa327da74e1fb8624cc1e56959c70
2021-10-21 14:02:11 -07:00
892ac08a02 Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61926

1. Update the `if` to just use requires_derivative, since that should reflect when the function is not differentiable.
2. If `requires_derivative=True` but no outputs have forward derivatives, we should error as usual.
3. ~In the future we may also want to handle the case~ When `len(fw_derivatives) > 0 and len(fw_derivatives) < num_diff_outputs`, we should add an assert in codegen that this does not happen. (A hedged sketch of the new behavior follows below.)
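
A hedged sketch of the new behavior using the forward-mode AD API; the choice of `torch.argmax` as the non-differentiable op is an illustrative assumption:

```
import torch
import torch.autograd.forward_ad as fwAD

with fwAD.dual_level():
    x = fwAD.make_dual(torch.randn(3), torch.randn(3))  # primal with a tangent
    # argmax is non-differentiable (integer output): rather than raising a
    # not_implemented error, it now returns an output with no tangent attached.
    idx = torch.argmax(x)
    print(fwAD.unpack_dual(idx).tangent)  # None
```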

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66926

Reviewed By: anjali411

Differential Revision: D31810736

Pulled By: soulitzer

fbshipit-source-id: 11a14477cc7554f576cff2ed1711a448a8c6a66a
2021-10-21 13:53:07 -07:00
062ae8df0e Automated submodule update: tensorpipe (#65353)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 183172ba8c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65353

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D31059779

fbshipit-source-id: 7bddff5139f8168750e22e1cc8c0d49931db542e
2021-10-21 13:30:45 -07:00
b07371f19c [skip ci] Set test owners for serialization tests (#66862)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66862

Reviewed By: saketh-are

Differential Revision: D31828615

Pulled By: janeyx99

fbshipit-source-id: 8d28970eead9d6f26e9ea64b823295d9c9e1469d
2021-10-21 13:22:18 -07:00
6f1ba16d6d [skip ci] Set test owners for cpp test (#66836)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc yf225 glaringlee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66836

Reviewed By: saketh-are

Differential Revision: D31828641

Pulled By: janeyx99

fbshipit-source-id: 076d41686746fecebc07452df8212eef15a7824c
2021-10-21 13:17:46 -07:00
00a871c5c9 [skip ci] Set test owner for multiprocessing tests (#66848)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc VitalyFedyunin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66848

Reviewed By: VitalyFedyunin

Differential Revision: D31828908

Pulled By: janeyx99

fbshipit-source-id: 45d6901648f5564c1bf07ad8d01d69ef486ae104
2021-10-21 13:13:53 -07:00
78f970568c Add dummy op to use instead of searchsorted (#66964)
Summary:
Would help unblock https://github.com/pytorch/pytorch/issues/66818 if this actually works

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66964

Reviewed By: mruberry

Differential Revision: D31817942

Pulled By: janeyx99

fbshipit-source-id: 9e9a2bcb0c0479ec7000ab8760a2e64bf0e85e95
2021-10-21 12:56:22 -07:00
94f4e9a995 Enable warning tests for nondeterministic backward functions (#66736)
Summary:
Followup from https://github.com/pytorch/pytorch/issues/66233

Since https://github.com/pytorch/pytorch/issues/50209 was fixed, we can enable these warning tests now

cc mruberry kurtamohler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66736

Reviewed By: zou3519

Differential Revision: D31723385

Pulled By: mruberry

fbshipit-source-id: dc1922a6d0c45cc80020db85710e755a89113861
2021-10-21 12:51:53 -07:00
ce6f4b3a02 Setup c10d extension Backend class attr the same way as builtin ones (#66991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991

Currently, c10d extensions use Backend.NAME to store the creator
function. However, builtin backends use that same field to store the
name. This commit makes c10d extensions comply with the builtin ones,
and uses a dedicated `_plugins` field to store creator functions.

Thanks bryanmr for pointing this out.
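
A hedged sketch of registering a third-party backend; the creator-function signature below is an assumption modeled on how builtin process groups are constructed:

```
import torch.distributed as dist

def create_fake_pg(store, rank, world_size, timeout):
    # A real extension would construct and return a ProcessGroup here.
    raise NotImplementedError

dist.Backend.register_backend("fake", create_fake_pg)
# After this change, Backend.FAKE holds the name (matching builtin backends),
# while the creator function lives in the dedicated Backend._plugins dict.
print(dist.Backend.FAKE)  # "fake"
```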

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31820307

Pulled By: mrshenli

fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
2021-10-21 12:35:07 -07:00
40e5d31a52 Add OpInfo for torch.bincount (#65796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65796

Reviewed By: bdhirsh

Differential Revision: D31386560

Pulled By: saketh-are

fbshipit-source-id: acb6ed3f743ddcccd0ff7ce1ab21f77c2078da37
2021-10-21 12:11:38 -07:00
9d4549295d ONNX export: propagate node metadata across passes (#45256)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45255

Mostly straightforward. The only downside in this PR is the lack of a more scalable way to check for all newly-created nodes in `callPySymbolicFunction`. The other options were:
* Create a scope within the node's scope and loop through all nodes that correspond to the scope. The code would still need to loop through all nodes.
* Add extra state to the graph (no good reason to do so).
* Add extra state to the ONNX exporter, since python calls go back to `g.op(...)` (no good reason to do so, also not very pythonic).

cc BowenBao neginraoof

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45256

Reviewed By: malfet, houseroad

Differential Revision: D31744281

Pulled By: msaroufim

fbshipit-source-id: 1b63f6e7f02ed61b3a9b7ac3d0be0a3a203c8ff6
2021-10-21 11:49:05 -07:00
a33f341cee [ci] try setting MAX_JOBS on windows builds to reduce OOMs (#66986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66986

See: https://github.com/pytorch/pytorch/issues/66674

Test Plan: Imported from OSS

Reviewed By: seemethere, anjali411

Differential Revision: D31822578

Pulled By: suo

fbshipit-source-id: e24bbe9a1ff21ad0653708217cef5d8b2f56c5a2
2021-10-21 11:41:05 -07:00
53cf7e844f [SR] Fix bug in FuseListUnpackV2 (#67021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67021

When applying the equally split optimization, we still need to delete the list unpack node.

I did an accuracy test yesterday but didn't catch this issue because my diffs were not properly synced between devservers (I use hlu1's devbig for testing and it had an old version of "Add FuseListUnpackV2"). But I did another test this morning and realized that there was an issue.

This is not affecting anything in prod right now since D31742293 has not landed.

Reviewed By: hlu1

Differential Revision: D31827278

fbshipit-source-id: c7b05e3d8ec942632adcff4bdfebb8c27c1a7a39
2021-10-21 11:08:04 -07:00
a7ec4b53d2 Splitter: Transformer_encoder (#66952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66952

Added splitter to lower parts of the transformer model
Program now supports arg input

Test Plan:
Performance on non-lowered model:
0.19662559509277344
Performance on semi-lowered model:
0.19131642150878905

Reviewed By: 842974287

Differential Revision: D31541325

fbshipit-source-id: 194aba97afc794dbeada4bbc4777d0a7b02e3635
2021-10-21 10:59:08 -07:00
d73b88b473 Unsqueeze bug fix (#66889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66889

Added support for negative dims and modified unit test.

Test Plan: buck test mode/dev-nosan caffe2/test/fx2trt/converters:test_unsqueeze

Reviewed By: 842974287

Differential Revision: D31769393

fbshipit-source-id: 854335ead2ffad5f466ad66b9be36ba20a0fea67
2021-10-21 10:57:58 -07:00
23321ba7a3 Fix bug [#66780]: wrong input to torch.is_floating_point (#66783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66783

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31802971

Pulled By: cpuhrsch

fbshipit-source-id: 6a7d8b83dad219fd683504f9084b77358800507c
2021-10-21 09:50:58 -07:00
13b8599831 [skip ci] Set test owner for test_dispatch.py (#66840)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66840

Reviewed By: saketh-are

Differential Revision: D31829224

Pulled By: janeyx99

fbshipit-source-id: 66aceacd4f976c36ed48ca5be59616d245ba2a82
2021-10-21 08:48:37 -07:00
8cbdf49dce [qnnpack] Remove conv_utils.h (#66605)
Summary:
This completes the removal of conv_utils and redistributes its dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66605

ghstack-source-id: 140565820

Test Plan: ci tests

Reviewed By: kimishpatel

Differential Revision: D31637731

fbshipit-source-id: 48d3a423e4ff0eb6ab21bb13bda44da16996423b
2021-10-21 08:23:42 -07:00
960e3216a4 [skip ci] Set test owner for named tensor tests (#66849)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66849

Reviewed By: zou3519

Differential Revision: D31828903

Pulled By: janeyx99

fbshipit-source-id: 30810bcec750ba8e1d5a342c31a5996bf57acd69
2021-10-21 08:22:26 -07:00
f5c5ab2868 [skip ci] Set test owner for cpp-extensions tests (#66837)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc yf225 glaringlee zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66837

Reviewed By: anjali411

Differential Revision: D31828401

Pulled By: janeyx99

fbshipit-source-id: 35ac27f3e1c0eb70ccb38c07c42ba61bd0c848fe
2021-10-21 08:15:38 -07:00
32e790997b [Rocm]Reduce severity of detected possible memory leak from assertion to warning (#65973)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62533.
In very rare cases, the decorator for detecting memory leaks throws an assertion even when the test is passing and the memory is freed with only a tiny delay. The issue does not reproduce in internal testing, but sometimes shows up in the CI environment.

Reducing the severity of this detection to a warning, so as not to fail the CI tests: the actual test is not failing; only the check inside the decorator is.

Limiting the change to ROCM only for now.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65973

Reviewed By: anjali411

Differential Revision: D31776154

Pulled By: malfet

fbshipit-source-id: 432199fca17669648463c4177c62adb553cacefd
2021-10-21 07:10:54 -07:00
70a5113e03 [ROCm] update Magma for 4.3 release (#65203)
Summary:
Upstream magma fixes the cholesky issues.
Refer https://bitbucket.org/icl/magma/issues/48/parameter-4-was-incorrect-on-entry-to

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes #{issue number}

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65203

Reviewed By: anjali411

Differential Revision: D31766608

Pulled By: malfet

fbshipit-source-id: 3829b89314d25d8aa14be57ead879a811ab3f098
2021-10-21 07:06:01 -07:00
b6df043f1f Add torch.nn.init.uniform_ operator to ShardedTensor. (#63997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63997

Use torch_function to extend torch.nn.init.uniform_.
The init is done in SPMD fashion. Note that ideally we want to aggregate sharded tensors into a global tensor, init it, and reshard. It's fine to run it SPMD since uniform init is i.i.d. (independent and identically distributed).
Also enable the unit test in test_linear.py for the OSS test.

Test Plan:
a) Unit Test
(pytorch) ... $ python test/distributed/_sharded_tensor/ops/test_init.py TestShardedTensorNNInit --v
(pytorch) ... $ python test/distributed/_sharded_tensor/ops/test_linear.py --v (before runs this command is no-op)

or b) Manual run: Instruction here: https://docs.google.com/document/d/1_m1Hdo5w51-hhPlZ_F8Y6PIWrN7UgJZqiSpARYvhsaE/edit#

Imported from OSS

Reviewed By: pritamdamania87, anjali411

Differential Revision: D30563017

fbshipit-source-id: d1859f7682235bcb44515efc69ca92bc5e34fce1
2021-10-21 00:17:13 -07:00
bdb889aca1 [nnc] Use a descriptive name for fused kernels when profiling (#66990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66990

NNC fusion groups currently show up as "TensorExpr" in the profiler,
which is true but not super useful since it obscures what's actually happening
in the fusion group.  This change will log them as `fused_XXX` where XXX is a
(length-limited) series of ops describing the subgraph, for instance
`fused_mul_add` to represent a group containing `aten::mul`, `aten::add`.
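
A hedged sketch of how the new names surface in the profiler; whether this particular function actually forms a fusion group depends on the device and fuser settings:

```
import torch

@torch.jit.script
def f(x, y):
    return x * y + y

x, y = torch.randn(100), torch.randn(100)
f(x, y); f(x, y)  # warm up so the profiling executor can form a fusion group

with torch.autograd.profiler.profile() as prof:
    f(x, y)
# The fusion group should show up as e.g. "fused_mul_add" instead of "TensorExpr".
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```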

Test Plan: New unit test to check the output of autograd profiler.

Reviewed By: dzhulgakov

Differential Revision: D31762087

fbshipit-source-id: 3fadbdc67b054faa01aa42e5b6ea2c4a6bc3481f
2021-10-21 00:06:23 -07:00
8beabffac3 [PyTorchEdge] Make aten function common to aten and torch_common (#66663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66663

fb: TensorCompare.cpp is in per-app, a target higher than torch_mobile.

Please read this doc to learn about [Per-app ATen/native and Template Selective Build](https://docs.google.com/document/d/1O5--mOAi_gGh2GkE-REo3qJRRQ_Lks69IfgszcB8ThI/edit)

Create a file called "prim_native_functions.cpp" in ATen, add it to aten_cpu, and cut-paste native::is_nonzero() into prim_native_functions.cpp.
By doing this we move the function to a lower layer, which is more visible to all targets depending on it.

Instruction count comparison new vs old
https://www.internalfb.com/phabricator/paste/view/P463272302?view=diff

Test Plan:
fb:
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] buck build //xplat/caffe2:aten_cpu
Building: finished in 0.4 sec (100%) 1/202 jobs, 0/202 updated
  Total time: 0.4 sec
More details at https://www.internalfb.com/intern/buck/build/ea35300b-55be-4b9f-bc74-80cdd869c16a
BUILD SUCCEEDED
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] buck build //xplat/caffe2:aten_native_cpu
Building: finished in 0.7 sec (100%) 1/1 jobs, 0/1 updated
  Total time: 0.8 sec
More details at https://www.internalfb.com/intern/buck/build/ccd97d43-c59d-4f29-9418-485cd24575e2
BUILD SUCCEEDED
```

Reviewed By: iseeyuan

Differential Revision: D31669536

fbshipit-source-id: d35f069f975db6dce0b678c5b5ddd74bd690f599
2021-10-20 20:41:41 -07:00
f8f04d5424 [quant][graphmode][fx] Add support for single linear and conv2d (#66950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66950

Just to show that it works for weighted operations as well; qat/fused ops are not supported yet.
We can start developing the backend_config_dict and work towards making the support more complete afterwards.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31801782

fbshipit-source-id: 8491bab7939a7a1c23ffa87c351844b82e390027
2021-10-20 19:13:27 -07:00
a89851a0d9 [quant][fx][graphmode] Adding a new convert function that produces reference pattern by default (#66925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66925

The current convert_fx implementation uses "The Interpreter Pattern" from https://pytorch.org/docs/stable/fx.html
Two things have changed that make the approach in this PR possible and needed:
1) the original convert implementation was developed during the initial prototype, when fx did not allow mutations; fx now supports mutations
2) the original convert needs to handle a lot of fbgemm/qnnpack-specific logic, which is not needed for reference patterns

Therefore it makes sense for us to write a new convert function just for reference patterns; the implementation is significantly easier to understand than the original convert implementation.

Current support:
* we should be able to support all non-weighted ops like relu, add etc.

Missing:
* linear and conv
* some advanced features like standalone modules, input_quantized_idxs etc.

will add linear and conv support and start defining the backend_config_dict based on this version of convert

Test Plan:
python test/test_quantization.py TestQuantizeFxOpsNew

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31786241

fbshipit-source-id: 2a32156eb6d3c5271cb44906cd863055785fb5d4
2021-10-20 18:54:30 -07:00
db4165892b [SmartCompose][OnDevice]fix function name bug in mobile export & Script to convert mobile model (#66915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66915

Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/3

fix function name bug in mobile export

Test Plan: buck run pytext/fb/assistant/smart_compose:mobile_converter -- --model_input=pytext_training/tree/teams/assistant/smart_compose/300555761/model.ts --model_output=pytext_training/tree/teams/assistant/smart_compose/300555761/mobile_model_test.ts

Reviewed By: JacobSzwejbka

Differential Revision: D31782983

fbshipit-source-id: 7288bb65adc7346d218980a535d68a12d8ef2033
2021-10-20 18:14:51 -07:00
ab1e4eac42 [Static Runtime] Add FuseListUnpackV2 (#66509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66509

Like `FuseListUnpack`, but instead of adding arguments to the fused node's outputs, inserts a new fused op.

By using a new fused op, we can avoid runtime `is_fused` checks. This will make the op implementations significantly cleaner. Eventually, we will migrate all ops to `V2` and delete the old pass.

`FuseListUnpackV2` also fixes the bug described in T103159043.

Test Plan: I've made some changes to D31550307 locally and verified that everything works.

Reviewed By: hlu1

Differential Revision: D31492017

fbshipit-source-id: 4f90fcbc17e4c70a3d65985bee836fabf868a22c
2021-10-20 16:39:32 -07:00
17889ad26e Add support for cat in output stitching (#66098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66098

`cat` is somewhat special-cased right now because currently we only have lists of Tensor inputs where the list is constructed in the JIT IR graph. While that is generally true for fusion (e.g., why we have ConstantChunk), it may not be true for shape analysis in general, so I'm waiting a bit before generalizing.

Test Plan: Imported from OSS

Reviewed By: navahgar, anjali411

Differential Revision: D31797467

Pulled By: eellison

fbshipit-source-id: ca761e214dfd7f3bba8d189f3b3f42ffec064f63
2021-10-20 16:13:09 -07:00
2dd23ebfdb Add support for multi output nodes in partial eval graph stitching (#66097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66097

Adding logic to generate runtime shapes for nodes with multiple outputs. It generalizes the existing flow (look at a node, get its shape graph, inline it, and add a mapping from the output to the new value in the stitched shape compute graph) to loop over multiple outputs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797468

Pulled By: eellison

fbshipit-source-id: 2c182b71a46b36d33f23ad35b89790a4a5d4471c
2021-10-20 16:13:07 -07:00
0196b984f3 Add Handling of Cat in Shape Analysis (#65575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65575

This is needed for lowering an NNC model to mobile. It is also the last class of unhandled ops which NNC fuses, and we need to integrate this for computing output symbolic shapes.

The graph with two dynamic shape inputs produces:
```
graph(%x.1 : Tensor(SS(-2), 2, 3),
      %y.1 : Tensor(SS(-3), 2, 3)):
  %5 : int = prim::Constant[value=0]()
  %4 : Tensor[] = prim::ListConstruct(%x.1, %y.1)
  %6 : Tensor(SS(-4), 2, 3) = aten::cat(%4, %5) # /private/home/eellison/pytorch/test/jit/test_symbolic_shape_analysis.py:290:19
  return (%6)
```
With a partial eval graph of
```
Done with partial evaluation
graph(%129 : int[],
      %130 : int[],
      %dim.14 : int):
  %738 : int = prim::Constant[value=3]()
  %737 : int = prim::Constant[value=2]()
  %132 : int = prim::Constant[value=0]()
  %392 : int = aten::__getitem__(%129, %132) # <string>:339:44
  %417 : int = aten::__getitem__(%130, %132) # <string>:339:44
  %cat_dim_size.48 : int = aten::add(%392, %417) # <string>:339:29
  %result_size.5 : int[] = prim::ListConstruct(%cat_dim_size.48, %737, %738)
  return (%result_size.5)
```

To handle cat, I essentially make the cat shape op variadic,
replacing
```
torch.cat([x, y]
...
def cat_shape_op(tensors: List[List[int]], dim: int):
    ...
    op(tensors)
```
with
```
def cat_shape_op(x: List[int], y: List[int], dim: int):
    tensors = [x, y]
    op(tensors)
```
This reuses the existing input Tensor properties partial evaluation path and avoids having to add special handling to optimize out `len(tensors)` calls in the IR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797471

Pulled By: eellison

fbshipit-source-id: 62c794533d5fabfd3fad056d7e5fe3e8781b22c5
2021-10-20 16:13:05 -07:00
eaba976d49 Add x + 0 optimization (#65574)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65574

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797470

Pulled By: eellison

fbshipit-source-id: bf9309fb43f164665335fed0d09697b0e2f67261
2021-10-20 16:13:03 -07:00
b059f035be Fix bug preventing optimization from firing (#65573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65573

When we remove mutation on
```
x = [0, 1, 3, 4]
x[-2] = 4
```
we have a safety check that the new index will be in bounds. In practice this should always be the case; otherwise you would have a runtime error. Within that check (not within the actual adjustment) we were using the wrong length of inputs, preventing the optimization from firing.
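
For illustration, a minimal sketch of the in-bounds check on a normalized negative index; the helper is hypothetical, not the actual pass code:
```python
# Hypothetical sketch of the safety check: a negative write index is
# normalized against the list length before the mutation is removed.
def normalized_index_in_bounds(idx: int, list_len: int) -> bool:
    norm = idx + list_len if idx < 0 else idx
    return 0 <= norm < list_len

x = [0, 1, 3, 4]
assert normalized_index_in_bounds(-2, len(x))       # x[-2] = 4 is rewritable
assert not normalized_index_in_bounds(-5, len(x))   # would be a runtime error
```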

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797469

Pulled By: eellison

fbshipit-source-id: 02a1686b9f6016eb5aeb87ed342c043c203dcd0e
2021-10-20 16:13:01 -07:00
63b41e1f4d [JIT] Add partial evaluation graph stitching logic (#65377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65377

When we run symbolic shape analysis on
```
conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
max_pool = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
mod = nn.Sequential(conv, max_pool)
...
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_0.Sequential,
      %input.1 : Tensor):
  %18 : bool = prim::Constant[value=0]()
  %30 : int[] = prim::Constant[value=[1, 1]]()
  %29 : int[] = prim::Constant[value=[3, 3]]()
  %28 : int[] = prim::Constant[value=[2, 2]]()
  %6 : int = prim::Constant[value=1]()
  %self.0.bias : NoneType = prim::Constant()
  %self.0.weight : Double(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %input.5 : Tensor(SS(-2), 64, SS(-3), SS(-4)) = aten::conv2d(%input.1, %self.0.weight, %self.0.bias, %28, %29, %30, %6)
  %input.9 : Tensor(SS(-2), 64, SS(-5), SS(-6)) = aten::max_pool2d(%input.5, %29, %28, %30, %30, %18)
  return (%input.9)
```
we partially evaluate the shape compute graph of `conv2d`, whose output gets passed in and used to partially evaluate the shape compute graph of `max_pool2d`.

The conv2d remaining partially eval'd graph is [here](https://gist.github.com/eellison/0598bd224a422211efa1a45d2b7560b7), and the maxpool2d eval'd graph is [here](https://gist.github.com/eellison/625540b84f650ddbefd3ae5511ab8814). We can take the partially eval'd graphs of a series of operators and stitch them together, which allows us to
a) recover symbolic equivalences by CSE'ing & other optimizations
b) calculate shapes for a whole block of operators just from the input, such as for fusing the whole model to NNC with dynamic shapes and then passing along the computed symbolic shapes. The calculation also includes error handling.
c) (future-looking) generate inputs on demand for straight-line networks that are composed just of aten operators

The combined graph of the two gives us compute for the unknown symbolic dimensions - `SS(-2), SS(-3), SS(-4), SS(-5), and SS(-6)`.
```
graph(%input.1 : int[]):
  %42 : bool = prim::Constant[value=0]() # <string>:152:17
  %15 : int = prim::Constant[value=3]()
  %input_batch_size_dim.1 : int = prim::Constant[value=0]() # <string>:417:41
  %13 : int = prim::Constant[value=1]() # <string>:426:61
  %12 : int = prim::Constant[value=4]() # <string>:437:32
  %11 : str = prim::Constant[value="AssertionError: "]()
  %9 : int = prim::Constant[value=2]()
  %8 : int = prim::Constant[value=6]()
  %7 : int = prim::Constant[value=7]()
  %16 : int = aten::len(%input.1) # <string>:438:17
  %17 : bool = aten::eq(%16, %12) # <string>:438:17
   = prim::If(%17) # <string>:438:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:438:10
      -> ()
  %18 : int = aten::__getitem__(%input.1, %13) # <string>:407:17
  %19 : bool = aten::eq(%18, %15) # <string>:407:17
   = prim::If(%19) # <string>:407:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:407:10
      -> ()
  %20 : int = aten::__getitem__(%input.1, %9) # <string>:411:20
  %21 : int = aten::add(%20, %8) # <string>:411:20
  %22 : bool = aten::ge(%21, %7) # <string>:411:20
   = prim::If(%22) # <string>:411:12
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:411:12
      -> ()
  %23 : int = aten::__getitem__(%input.1, %15) # <string>:411:20
  %24 : int = aten::add(%23, %8) # <string>:411:20
  %25 : bool = aten::ge(%24, %7) # <string>:411:20
   = prim::If(%25) # <string>:411:12
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:411:12
      -> ()
  %26 : int = aten::__getitem__(%input.1, %input_batch_size_dim.1) # <string>:422:29
  %27 : int = aten::sub(%20, %13) # <string>:428:32
  %28 : int = aten::floordiv(%27, %9) # <string>:428:32
  %29 : int = aten::add(%28, %13) # <string>:428:32
  %30 : int = aten::sub(%23, %13) # <string>:428:32
  %31 : int = aten::floordiv(%30, %9) # <string>:428:32
  %32 : int = aten::add(%31, %13) # <string>:428:32
  %48 : int = aten::floordiv(%28, %9) # <string>:133:17
  %outputSize.2 : int = aten::add(%48, %13) # <string>:136:23
  %51 : int = aten::floordiv(%31, %9) # <string>:133:17
  %outputSize.1 : int = aten::add(%51, %13) # <string>:136:23
  %53 : bool = aten::ne(%29, %input_batch_size_dim.1) # <string>:156:41
  %54 : bool = prim::If(%53) # <string>:157:64
    block0():
      %55 : bool = aten::ne(%32, %input_batch_size_dim.1) # <string>:157:93
      -> (%55)
    block1():
      -> (%42)
   = prim::If(%54) # <string>:157:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:157:10
      -> ()
  %56 : bool = aten::ge(%outputSize.1, %13) # <string>:160:17
  %57 : bool = prim::If(%56) # <string>:160:17
    block0():
      %58 : bool = aten::ge(%outputSize.2, %13) # <string>:160:38
      -> (%58)
    block1():
      -> (%42)
   = prim::If(%57) # <string>:160:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:160:10
      -> ()
  return (%26, %29, %32, %outputSize.2, %outputSize.1)
  ```

This PR runs shape analysis, retains the partially evaluated graphs, and then stitches them together, keeping track of which inputs in the partial eval graph correspond to which inputs in the encompassing graph IR and which outputs correspond to which symbolic shape. Adding NNC people as reviewers because this is relevant to dynamic shape fusion.

Question for reviewers: should I make this a separate file?

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797472

Pulled By: eellison

fbshipit-source-id: a41ed31fad085d3563e71c815f49af0cd18aaeed
2021-10-20 16:12:58 -07:00
4ad6c144f6 [JIT][Easy] Shape cleanups (#65148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65148

No functional changes, factoring out optimizations and renaming the `graph` in symbolic shape analysis to `shape_compute_graph` as ZolotukhinM suggested

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31797447

Pulled By: eellison

fbshipit-source-id: 60d322da040245dd7b47ee7c8996239572fd11c2
2021-10-20 16:11:24 -07:00
e046386be8 Avoid inlining error reporting in checked_convert (#66721)
Summary:
**Summary:** Move the error reporting part to the cpp file to avoid callers inlining it, which inflates the generated code size. See https://github.com/pytorch/pytorch/issues/65830.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66721

Test Plan:
Compiling the simple program below now generates ~150 lines of assembly, compared to 700+ lines before.

```
#include <c10/core/Scalar.h>

void g(float) {}

void f(const c10::Scalar& scalar) {
    auto x = scalar.to<float>();
    g(x);
}
```

**Reviewers:** Brian Hirsh

**Subscribers:** Brian Hirsh, Edward Yang, Yining Lu

**Tasks:** T103384490

**Tags:** pytorch

Fixes https://github.com/pytorch/pytorch/issues/65830

Reviewed By: zou3519, bdhirsh

Differential Revision: D31737607

Pulled By: andrewor14

fbshipit-source-id: 3d493c4d8e51d8f8a19d00f59b8ea28176c8a9e3
2021-10-20 16:04:09 -07:00
18bbc4c2b7 [Static Runtime] Fix a bug in aten::index (#66940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940

`aten::index`'s schema is as follows:

```
"aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```

The current implementation assumes `indices`' elements are all tensors by doing `elem.toTensor`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor.
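
For context, a small usage sketch showing how `None` entries arise in the `indices` list:
```python
import torch

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 2])

# x[:, idx] lowers to aten::index(x, indices) with indices = [None, idx]:
# a None entry means "take the whole dimension", so the Tensor?[] list can
# legitimately contain non-tensor elements.
print(x[:, idx])
```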

Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.

Reviewed By: hlu1

Differential Revision: D31712145

fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
2021-10-20 15:51:21 -07:00
08cb31a03e [PyTorch][1/N] Basic implementation of ShardedEmbedding using ShardedTensor. (#66604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66604

This diff/PR implements ShardedEmbedding using ShardedTensor.

Several caveats:
1. We support limited input params for the op; support for more params is on the way.
2. We only support chunk sharding for now.
3. We only support a single local shard per rank for now.

ghstack-source-id: 141056130

Test Plan: Unit test and CI

Reviewed By: pritamdamania87

Differential Revision: D31544556

fbshipit-source-id: cc867dcba8c11e6f4c7c3722488908f5108cc67f
2021-10-20 15:16:49 -07:00
257239972c Fix attr_to_scope's key in torch/utils/tensorboard/_pytorch_graph.py (#65692)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65692

Reviewed By: Reubend

Differential Revision: D31678606

Pulled By: edward-io

fbshipit-source-id: 7c0bf740ee4f8c21bd01ced3ae70df23c9efadfb
2021-10-20 14:35:29 -07:00
450221c534 Sparse CSR: Add tensor.resize_ and tensor.copy_ (#63510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63510

Sparse CSR matrix resizing behavior:
- If we _increase the number of rows_, the number of specified elements in the matrix remains the same -> the sizes of col_indices and values don't change, and the size of crow_indices becomes `rows+1`.
- If we _decrease the number of rows_, the number of specified elements becomes `min(nnz, rows*cols)` -> resize `crow_indices` to `rows+1` and set its last element to `min(nnz, rows*cols)`; shrink col_indices and values to `min(nnz, rows*cols)`.
- If we _increase the number of columns_, the number of specified elements and the number of rows remain the same -> nothing needs resizing, just set the new sizes.
- We _cannot decrease the number of columns_ because that would require recomputing `crow_indices`.
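
A minimal sketch of the row-growing case, assuming the CSR constructor and in-place resize behave as described above:
```python
import torch

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
values = torch.tensor([1., 2., 3., 4.])
a = torch.sparse_csr_tensor(crow, col, values, size=(2, 2))

a.resize_(4, 2)  # growing rows: nnz is unchanged, crow_indices grows to rows+1
print(a.crow_indices().numel())  # 5
```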

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31796680

Pulled By: cpuhrsch

fbshipit-source-id: 7d8a9701ce06d30a1841f94bba0a057cacea9401
2021-10-20 14:19:04 -07:00
f56a1a59a3 Add simple backwards compatibility check for torch.package (#66739)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65154, tests for backwards compatibility of torch.package by checking if packages that were created before can still be loaded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66739

Reviewed By: suo

Differential Revision: D31771526

Pulled By: PaliC

fbshipit-source-id: ba8c652c647b94114a058e4c7d7f1c7ce6033d84
2021-10-20 12:46:17 -07:00
6e67150f57 [skip ci] Set test owner for test_mkldnn.py (#66845)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc gujinghui PenghuiCheng XiaobingSuper jianyuh VitalyFedyunin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66845

Reviewed By: anjali411

Differential Revision: D31803377

Pulled By: janeyx99

fbshipit-source-id: 4fcf77d3e4bf976449a0b1ab4d750619db3493a1
2021-10-20 12:38:56 -07:00
5569d5824c Fix documentation of arguments for torch.nn.functional.Linear (#66884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66884

Addressing the docs fix mentioned in issue 64978 on GitHub
ghstack-source-id: 141093449

Test Plan: https://pxl.cl/1Rxkz

Reviewed By: anjali411

Differential Revision: D31767303

fbshipit-source-id: f1ca10fed5bb768749bce3ddc240bbce1dfb3f84
2021-10-20 12:02:58 -07:00
e86d8323cb [JIT] Add special cases for batch_norm, instance_norm in alias_analysis (#66554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66554

In native_functions.yaml, the schemas for batch_norm and instance_norm
are incorrect: the inputs `running_mean` and `running_var` are mutated,
but are not marked as such in the function schema. Since `(a!)?`
annotations are currently not working (see #65760), this instead adds a
special case to `alias_anaysis.cpp`. If the value of `training` or
`use_input_stats` is known to be `false`, then `alias_analysis` will
mark the input as _not_ being written to.
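
A small example of the mutation that the schema fails to declare:
```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 3)
running_mean = torch.zeros(3)
running_var = torch.ones(3)

# With training=True the running stats are updated in place, even though the
# schema in native_functions.yaml does not mark them as mutated.
F.batch_norm(x, running_mean, running_var, training=True)
print(running_mean)  # no longer all zeros (with overwhelming probability)
```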

Test Plan:
Removed the `skip` annotation on the following test, and added a special
exception in `check_alias_annotations`:
```
python test/test_ops.py -k test_variant_consistency_jit_nn_functional_batch_norm
```

Also:
```
./build/bin/test_jit --gtest_filter="*BatchAndInstanceNormFixture*"
```

Imported from OSS

Reviewed By: eellison

Differential Revision: D31612339

fbshipit-source-id: 12ca61b782b9e41e06883ba080a276209dc435bb
2021-10-20 10:22:10 -07:00
cf77bd4cf4 Fix python version in test tools CI job (#66947)
Summary:
On the HUD, the test tools job is failing as the runners now install Python 3.10, which is not compatible with numpy 1.20

See https://github.com/pytorch/pytorch/runs/3952169950?check_suite_focus=true Install dependencies step:
```
 ERROR: Command errored out with exit status 1:
   command: /opt/hostedtoolcache/Python/3.10.0/x64/bin/python /opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmptq8aay7m
       cwd: /tmp/pip-install-dk_6t98q/numpy_e9431bf106b746148c0e7c36e46551b4
  Complete output (1169 lines):
  setup.py:66: RuntimeWarning: NumPy 1.20.0 may not yet support Python 3.10.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66947

Reviewed By: suo, malfet

Differential Revision: D31799205

Pulled By: janeyx99

fbshipit-source-id: 64bf10c37c0aa4f5837c48e92d56e81d920722bd
2021-10-20 10:12:16 -07:00
793f366e34 [skip ci] Set test owners for sparse tests (#66863)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc nikitaved pearu cpuhrsch IvanYashchuk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66863

Reviewed By: anjali411

Differential Revision: D31771126

Pulled By: janeyx99

fbshipit-source-id: 6cb5ca0557e8555f6a09b3e607ff8888e505486e
2021-10-20 10:12:13 -07:00
a015964cf8 Strided masked reduction: prod. (#66386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66386

cc nikitaved pearu cpuhrsch

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31779598

Pulled By: cpuhrsch

fbshipit-source-id: 304a3d6abc794a49de5b044aade6cfd727758495
2021-10-20 10:10:54 -07:00
822277f302 [skip ci] Set test owners for test_type_promotion.py (#66866)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc nairbv mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66866

Reviewed By: anjali411

Differential Revision: D31771149

Pulled By: janeyx99

fbshipit-source-id: 87c04ed4a75ada06a553a11064d44ac65fc4c6ea
2021-10-20 09:42:37 -07:00
409364e597 [skip ci] Set test owners for test_typing.py (#66869)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc ezyang malfet rgommers xuzhao9 gramster

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66869

Reviewed By: anjali411

Differential Revision: D31766850

Pulled By: janeyx99

fbshipit-source-id: e9772f5378be07162d4f4d06925165e396d7d6c6
2021-10-20 09:41:13 -07:00
452b359c3f [skip ci] Set test owners for tensor creation tests (#66864)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc gchanan mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66864

Reviewed By: anjali411

Differential Revision: D31771139

Pulled By: janeyx99

fbshipit-source-id: 74adeae7de355fa6c63de22290fa324911230368
2021-10-20 09:38:21 -07:00
8a65047acc [skip ci] Set test owners for everything considered with module: tests (#66865)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66865

Reviewed By: anjali411

Differential Revision: D31771147

Pulled By: janeyx99

fbshipit-source-id: 8bebe5ac2098364ef1ee93b590abb5f4455b0f89
2021-10-20 09:37:03 -07:00
94f4b22df9 Revert D31761594: [pytorch][PR] opinfo : nn.functional.embedding
Test Plan: revert-hammer

Differential Revision:
D31761594 (ed5633d0c5)

Original commit changeset: d24f44728d04

fbshipit-source-id: 72574918300a7982430a0ceb772c9a24de525050
2021-10-20 09:17:16 -07:00
f95fef7897 Add prim::TensorExprDynamicGuard to bc allowlist (#66939)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66939

Reviewed By: ejguan

Differential Revision: D31797160

Pulled By: soulitzer

fbshipit-source-id: 630b7a0ab99671192397f927391361622f7e9c2e
2021-10-20 08:53:19 -07:00
3fe2ff800c Module docs update (#66909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37824

{F671745341}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66909

Reviewed By: anjali411

Differential Revision: D31782046

Pulled By: mikaylagawarecki

fbshipit-source-id: 009d2ea3c8a51a89786ef55bb9e88dc53aa8360f
2021-10-20 08:14:36 -07:00
62ca5a81c0 Exposed recompute_scale_factor into nn.Upsample (#66419)
Summary:
Description:
- Exposed recompute_scale_factor in nn.Upsample so that the recompute_scale_factor=True option can be used (see the usage sketch below)

Context: https://github.com/pytorch/pytorch/pull/64501#discussion_r710205190
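
A minimal usage sketch of the newly exposed option:
```python
import torch
import torch.nn as nn

# recompute_scale_factor=True computes the output size from scale_factor and
# then recomputes the scale from that size for the interpolation itself.
up = nn.Upsample(scale_factor=2.5, mode="bilinear", align_corners=False,
                 recompute_scale_factor=True)
x = torch.randn(1, 3, 10, 10)
print(up(x).shape)  # torch.Size([1, 3, 25, 25])
```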

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66419

Reviewed By: gchanan

Differential Revision: D31731276

Pulled By: jbschlosser

fbshipit-source-id: 2118489e6f5bc1142f2a64323f4cfd095a9f3c42
2021-10-20 07:59:25 -07:00
867ccc9987 Strided masked reduction: amin. (#66385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66385

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31779530

Pulled By: cpuhrsch

fbshipit-source-id: de753c2d191f7980a48831b892d3a1e8a7a547cd
2021-10-20 07:45:40 -07:00
c69e33bb11 Fix doc string for torch.acosh (#66814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66814
Shift the equation above the note, as per issue 65905 on GitHub

Test Plan:
Imported from OSS

In preview docs built from PR

https://docs-preview.pytorch.org/66814/generated/torch.acosh.html#torch.acosh equation is now above note

{F671441651}

Reviewed By: gchanan

Differential Revision: D31742677

Pulled By: mikaylagawarecki

fbshipit-source-id: 9fa5390ad2a01ca001418c0bd624f2145f861bf4
2021-10-20 07:01:42 -07:00
ed5633d0c5 opinfo : nn.functional.embedding (#66622)
Summary:
Adds opinfo for `nn.functional.embedding`

A few cases where the `numerical` gradient doesn't match (gradcheck fails):

```python
import torch

try:
    t = torch.randn(2, 1, dtype=torch.float64, requires_grad=True)
    idx = torch.tensor([0, 1])
    torch.autograd.gradcheck(lambda idx, t : torch.nn.functional.embedding(idx, t, padding_idx=1), (idx, t, ))
except Exception as e:
    print("PADDING IDX:", e)

try:
    t = torch.ones(2, 1, dtype=torch.float64, requires_grad=True)
    idx = torch.tensor([0, 1])
    torch.autograd.gradcheck(lambda idx, t : torch.nn.functional.embedding(idx, t, max_norm=1.), (idx, t, ))
except Exception as e:
    print("MAX NORM:", e)

try:
    t = torch.randn(2, 1, dtype=torch.float64, requires_grad=True)
    idx = torch.tensor([0, 1, 1])
    torch.autograd.gradcheck(lambda idx, t : torch.nn.functional.embedding(idx, t, scale_grad_by_freq=True), (idx, t, ))
except Exception as e:
    print("SCALE GRAD BY FREQUENCY:", e)

try:
    t = torch.randn(2, 1, dtype=torch.float64, requires_grad=True)
    idx = torch.tensor([0, 1])
    torch.autograd.gradcheck(lambda idx, t : torch.nn.functional.embedding(idx, t, sparse=True), (idx, t, ))
except Exception as e:
    print("SPARSE", e)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66622

Reviewed By: gchanan

Differential Revision: D31761594

Pulled By: zou3519

fbshipit-source-id: d24f44728d049e6276d6c3165aa1fba458214959
2021-10-20 06:33:55 -07:00
79803b199f [Static Runtime] Make sure ProcessedNode::function_kind_ is copied over (#66917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66917

The ratio of 'out' variant nodes to the total number of nodes now reports as 100% for all the models, which obviously isn't true.

Reviewed By: swolchok, mikeiovine

Differential Revision: D31783028

fbshipit-source-id: e0bc2c6614aa3c3a235283c9125de1b339f42585
2021-10-20 00:21:35 -07:00
14ee608791 [PyTorch] Make rearragement in sharded linear work as expected. (#66603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66603

Found the issue here: https://github.com/pytorch/pytorch/issues/66281 by making the test cases more complicated.

After closely reading the code again, it turns out my original understanding was also wrong. Let's use the example mentioned in the issue to explain:

If the placement is like:
```
"rank:3/cuda:3",
"rank:0/cuda:0",
"rank:1/cuda:1",
"rank:2/cuda:2",
```

First, we split the column or row by the order of [3, 0, 1, 2].

In the case of column-wise sharding:
We need to rearrange the results from ranks 0-3.
Step 1: we split the output based on the original sharding strategy, i.e., rank 3 gets the 1st shard, rank 0 gets the 2nd shard, etc.
Step 2: we rearrange the results from ranks 0-3 by ordering them following [3, 0, 1, 2], i.e., the result from rank 3 is put in front, and so forth.

In the case of row-wise sharding:
We need to rearrange the input being sent to ranks 0-3.
Step 1: we reorder the input following the map [3, 0, 1, 2]. For example, the first shard goes to rank 3, so we put it in the 3rd part; the second shard goes to rank 0, so we put it in the 2nd part; and so on.
Step 2: the size of the shard for each rank is decided by the original placement [3, 0, 1, 2], i.e., rank 3 gets the first shard and its size, etc.
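
To make the reordering concrete, a small Python sketch of the column-wise reassembly; the data is hypothetical, not the actual implementation:
```python
# Hypothetical sketch: reassemble column-wise results when shard i is placed
# on rank placement[i], i.e. placement order [3, 0, 1, 2].
placement = [3, 0, 1, 2]
results_by_rank = {r: f"result_from_rank_{r}" for r in range(4)}

# Step 2 above: order the gathered results by the placement, so rank 3's
# result comes first, then rank 0's, and so on.
reassembled = [results_by_rank[r] for r in placement]
print(reassembled)
# ['result_from_rank_3', 'result_from_rank_0', 'result_from_rank_1', 'result_from_rank_2']
```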

Update the unit test to reflect this change.

Also, correct some format and comments in the sharded linear.
ghstack-source-id: 141055689

Test Plan: unit test and wait for CI.

Reviewed By: pritamdamania87, bowangbj

Differential Revision: D31634590

fbshipit-source-id: 677a9c2b42da1e2c63220523ed2c004565bbecc7
2021-10-19 23:16:38 -07:00
ef15691a1e Revert D31732421: [JIT][Easy] Shape cleanups
Test Plan: revert-hammer

Differential Revision:
D31732421 (16d0896b69)

Original commit changeset: e934507d1795

fbshipit-source-id: 6b34815c556de64ee5c7ef8d41e4cb434ccd7098
2021-10-19 20:07:06 -07:00
70c9eb130d Revert D31732419: [JIT] Add partial evaluation graph stitching logic
Test Plan: revert-hammer

Differential Revision:
D31732419 (5db7db667f)

Original commit changeset: 883a55cbeef0

fbshipit-source-id: f5faba69dfb6b54aeb29d1beaeec8c5b0373830f
2021-10-19 20:07:04 -07:00
90b42452e2 Revert D31732417: Fix bug preventing optimization from firing
Test Plan: revert-hammer

Differential Revision:
D31732417 (853fc25fb0)

Original commit changeset: dd734254c021

fbshipit-source-id: 3da0663dac5b5d2117b3d7abdbcd45d96f98de33
2021-10-19 20:07:02 -07:00
b8d58129bb Revert D31732420: Add x + 0 optimization
Test Plan: revert-hammer

Differential Revision:
D31732420 (66543f88de)

Original commit changeset: 0271e0dc0dda

fbshipit-source-id: c2beea1661e10c2f1a982b5d4a34b1041dcb1204
2021-10-19 20:07:00 -07:00
e730752610 Revert D31732416: Add Handling of Cat in Shape Analysis
Test Plan: revert-hammer

Differential Revision:
D31732416 (cc7de1df3b)

Original commit changeset: 6d93ddf62c34

fbshipit-source-id: e2c9713177a7f783897e99dd71e631fb275c37da
2021-10-19 20:06:57 -07:00
57fcea9e88 Revert D31732418: Add support for multi output nodes in partial eval graph stitching
Test Plan: revert-hammer

Differential Revision:
D31732418 (0fdc9b77a3)

Original commit changeset: 767698d031b1

fbshipit-source-id: f899eb155dcec67d57f53a658a71169d37b63b42
2021-10-19 20:06:55 -07:00
4187d870df Revert D31732415: Add support for cat in output stitching
Test Plan: revert-hammer

Differential Revision:
D31732415 (b4db5174fe)

Original commit changeset: 7f513cea355f

fbshipit-source-id: a0d8f1512b13d51f6e50b5da58084effbaf0a0dc
2021-10-19 20:06:53 -07:00
1bf0e1acb4 Revert D31732414: Add Initial NNC Dynamic Shapes Flow
Test Plan: revert-hammer

Differential Revision:
D31732414 (de4fe7a38c)

Original commit changeset: 290a94a667c2

fbshipit-source-id: 3021a1d7a8661967e37d4f9cfc86ed47cc4a7f3d
2021-10-19 20:05:29 -07:00
9c4d7d96db Address feedback from #66673 (#66905)
Summary:
Specify both `build_generates_artifacts` and `exclude_tests` properties as suggested in https://github.com/pytorch/pytorch/pull/66673#pullrequestreview-783667960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66905

Reviewed By: seemethere

Differential Revision: D31779742

Pulled By: malfet

fbshipit-source-id: 21f5543f3b767f38132be8c7e163455f39ff893f
2021-10-19 18:27:45 -07:00
deb6989880 [fx-acc] add optimize_quantization to FX graph opts (#65929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65929

This adds a set of quantize/dequantize graph optimizations.

Test Plan:
```
buck test mode/opt glow/fb/fx/graph_opts:test_fx_graph_opts
```
```
Parsing buck files: finished in 0.8 sec
Building: finished in 3.0 sec (100%) 8475/80926 jobs, 0/80926 updated
  Total time: 3.9 sec
More details at https://www.internalfb.com/intern/buck/build/9dd6193b-d99c-4d2a-8ef8-4d71380916e7
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: b5a83d2a-8870-400e-b21e-3286967d1f4a
Trace available for this run at /tmp/tpx-20211018-165956.836274/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4222124724048882
    ✓ ListingSuccess: glow/fb/fx/graph_opts:test_fx_graph_opts - main (3.152)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_transpose_to_reshape_1_optimizable (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestTransposeToReshape) (0.100)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_transpose_to_reshape_0_identity (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestTransposeToReshape) (0.017)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_0 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.154)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_1 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.140)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantization_2_QuantizePerChannel_Dequantize_X_RescaleQuantized_X_ (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantization) (0.422)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_3 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.296)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_dequantize_clamp_remove_one_3 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.288)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_dequantize_clamp_remove_one_1 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.433)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_clamp_tensor (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.346)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantization_1_Quantize_Dequantize_X_RescaleQuantized_X_ (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantization) (0.403)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_transpose_to_reshape_2_unoptimizable (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestTransposeToReshape) (0.117)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_remove_one_1 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.415)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_remove_one_3 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.280)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantization_3_Dequantize_Quantize_Dequantize_X_Dequantize_rescale_X_Dequantize_X_ (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantization) (0.150)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_6 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.133)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_dequantize_clamp_remove_one_2 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.523)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_dequantize_clamp_remove_one_0 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.569)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantization_4_Rescale_QuantizeNode_QuantizeNode_ (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantization) (0.815)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_5 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.295)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_4 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.308)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_2 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.213)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_remove_one_2 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.230)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantization_0_Dequantize_Quantize_X_X (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantization) (0.336)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_remove_one_0 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.486)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_graph_opts - test_optimize_quantize_clamp_ignore_one_7 (glow.fb.fx.graph_opts.tests.test_fx_graph_opts.TestOptimizeQuantizeClamp) (0.306)
Summary
  Pass: 25
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124724048882
```

# Before
```
Model before opt.
graph():
    %x : [#users=1] = placeholder[target=x]
    %quantize_per_tensor_2 : [#users=1] = call_function[target=torch.fx.experimental.fx_acc.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %x, acc_out_ty: ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {scale: 1.000001e-05, zero_point: 0, qscheme: torch.per_tensor_affine})})
    %dequantize_1 : [#users=1] = call_function[target=torch.fx.experimental.fx_acc.acc_ops.dequantize](args = (), kwargs = {input: %quantize_per_tensor_2})
    %quantize_per_tensor_3 : [#users=1] = call_function[target=torch.fx.experimental.fx_acc.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %dequantize_1, acc_out_ty: ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {scale: 1e-05, zero_point: 0, qscheme: torch.per_tensor_affine})})
    return quantize_per_tensor_3
opcode         name                   target                                            args                      kwargs
-------------  ---------------------  ------------------------------------------------  ------------------------  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
placeholder    x                      x                                                 ()                        {}
call_function  quantize_per_tensor_2  <function quantize_per_tensor at 0x7f66030a34c0>  ()                        {'input': x, 'acc_out_ty': ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {'scale': 1.000001e-05, 'zero_point': 0, 'qscheme': torch.per_tensor_affine})}
call_function  dequantize_1           <function dequantize at 0x7f66030a35e0>           ()                        {'input': quantize_per_tensor_2}
call_function  quantize_per_tensor_3  <function quantize_per_tensor at 0x7f66030a34c0>  ()                        {'input': dequantize_1, 'acc_out_ty': ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {'scale': 1e-05, 'zero_point': 0, 'qscheme': torch.per_tensor_affine})}
output         output                 output                                            (quantize_per_tensor_3,)  {}
```

# After
```
Model after opt.
graph():
    %x : [#users=1] = placeholder[target=x]
    %quantize_per_tensor_2 : [#users=1] = call_function[target=torch.fx.experimental.fx_acc.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %x, acc_out_ty: ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {scale: 1e-05, zero_point: 0, qscheme: torch.per_tensor_affine})})
    return quantize_per_tensor_2
opcode         name                   target                                            args                      kwargs
-------------  ---------------------  ------------------------------------------------  ------------------------  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
placeholder    x                      x                                                 ()                        {}
call_function  quantize_per_tensor_2  <function quantize_per_tensor at 0x7f66030a34c0>  ()                        {'input': x, 'acc_out_ty': ((8, 4, 2), torch.qint32, False, (8, 2, 1), torch.contiguous_format, True, {'scale': 1e-05, 'zero_point': 0, 'qscheme': torch.per_tensor_affine})}
output         output                 output                                            (quantize_per_tensor_2,)  {}
```

Reviewed By: jfix71

Differential Revision: D30945732

fbshipit-source-id: 427cd4215b546e1d6c5362734bb7de93d0c0b1b9
2021-10-19 17:06:32 -07:00
32e3003726 Have test classes extend from common_utils.TestCase, not unittest.TestCase (#66900)
Summary:
This causes some functionality not to work, such as the test-disabling-via-issues feature, e.g., https://github.com/pytorch/pytorch/issues/66641
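
A minimal sketch of the intended pattern:
```python
# Minimal sketch of the change: inherit from the PyTorch test base class so
# repo tooling (e.g. disabling flaky tests via issues) can hook into tests.
from torch.testing._internal.common_utils import TestCase, run_tests

class MyDistributedTest(TestCase):  # previously: unittest.TestCase
    def test_something(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    run_tests()
```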

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66900

Reviewed By: seemethere

Differential Revision: D31778293

Pulled By: janeyx99

fbshipit-source-id: df3023ddaf7969ffb60117d1e1d7e36d87bc6139
2021-10-19 16:54:05 -07:00
de4fe7a38c Add Initial NNC Dynamic Shapes Flow (#66136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66136

FOR REVIEWERS: this is ready to review; the test failures come from somewhere else in the stack.

Takes in a TensorExprGraph of static shapes and generalizes the input shapes
to symbolic dimensions. Dimensions of value 1 are preserved; otherwise,
dimensions with the same value are bucketed to the same symbolic shape.

E.g. `Tensor(5, 3), Tensor(3, 1) -> Tensor(SS(-1), SS(-2)), Tensor(SS(-2), 1)`
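
A minimal Python sketch of that bucketing rule; the helper is hypothetical, not the actual implementation:
```python
# Hypothetical sketch of the generalization rule: size-1 dims are preserved,
# and equal concrete sizes are bucketed to one symbolic dimension.
def generalize(shapes):
    symbols, next_id, out = {}, -1, []
    for shape in shapes:
        dims = []
        for d in shape:
            if d == 1:
                dims.append("1")
            else:
                if d not in symbols:
                    symbols[d] = next_id
                    next_id -= 1
                dims.append(f"SS({symbols[d]})")
        out.append("Tensor(" + ", ".join(dims) + ")")
    return out

print(generalize([(5, 3), (3, 1)]))
# ['Tensor(SS(-1), SS(-2))', 'Tensor(SS(-2), 1)']
```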

From there, it runs symbolic shape inference on the graph and creates a
versioning if in the graph, with prim::TensorExprDynamicGuard checking
whether the runtime inputs match the generalized symbolic shapes that are
inputs to the TE kernel. The computation for all symbolic dimensions is
inlined into the if block with the TE kernel. All symbolic-dimension Value*s
are appended to the end of the TE kernel Graph/Node inputs, and the Node is
augmented with an integer list attr `symbolic_shape_inputs` that gives the
mapping from Value* -> symbolic shape int64_t value. For lengthier IR
examples and a walkthrough, look at ShapeAnalysisTest.DynamicShapesFusion in
`test_shape_analysis`. Returns True on success and False on failure; it can
fail if shape propagation fails to propagate the number of dims or if
complete shapes on the inputs are not set.

Example transformation
```
graph(%x_inp : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y_inp : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z_inp : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %3 : Tensor = prim::TensorExprGroup_0(%x_inp, %y_inp, %z_inp)
  return ()
with prim::TensorExprGroup_0 = graph(%x.1 : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y.1 : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %3 : int = prim::Constant[value=0]()
  %4 : Tensor = aten::tanh(%x.1)
  %5 : Tensor = aten::erf(%4)
  %6 : Tensor = aten::relu(%y.1)
  %7 : Tensor[] = prim::ListConstruct(%5, %6)
  %8 : Tensor = aten::cat(%7, %3)
  %9 : Tensor = aten::hardswish(%8)
  %10 : Tensor = aten::mul(%9, %z)
  return (%9)
```
->

```
  graph(%x_inp : Float(10, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %y_inp : Float(4, 5, strides=[5, 1], requires_grad=0, device=cpu),
      %z_inp : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)):
  %4 : bool = prim::TensorExprDynamicGuard[types=[Float(SS(-2), SS(-3), strides=[5, 1], requires_grad=0, device=cpu), Float(SS(-4), SS(-3), strides=[5, 1], requires_grad=0, device=cpu), Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu)]](%x_inp, %y_inp, %z_inp)
  %5 : Tensor = prim::If(%4)
    block0():
      %15 : int[] = aten::size(%x_inp)
      %16 : int[] = aten::size(%y_inp)
      %17 : int = prim::Constant[value=1]()
      %18 : int = prim::Constant[value=0]()
      %elem.3 : int = aten::__getitem__(%15, %18) # <string>:40:10
      %elem.5 : int = aten::__getitem__(%15, %17) # <string>:40:10
      %elem.11 : int = aten::__getitem__(%16, %18) # <string>:40:10
      %cat_dim_size.48 : int = aten::add(%elem.3, %elem.11) # <string>:321:29
      %3 : Tensor = prim::TensorExprGroup_0[symbolic_shape_inputs=[-5, -4, -3, -2]](%x_inp, %y_inp, %z_inp, %cat_dim_size.48, %elem.11, %elem.5, %elem.3)
      -> (%3)
    block1():
      %14 : Tensor = prim::FallbackGraph_1(%x_inp, %y_inp, %z_inp)
      -> (%14)
  return ()
  with prim::TensorExprGroup_0 = graph(%x.1 : Float(SS(-2), SS(-3), strides=[5, 1], requires_grad=0, device=cpu),
        %y.1 : Float(SS(-4), SS(-3), strides=[5, 1], requires_grad=0, device=cpu),
        %z : Float(1, 1, strides=[1, 1], requires_grad=0, device=cpu),
        %SS_5 : int,
        %SS_4 : int,
        %SS_3 : int,
        %SS_2 : int):
    %3 : int = prim::Constant[value=0]()
    %4 : Tensor(SS(-2), SS(-3)) = aten::tanh(%x.1)
    %5 : Tensor(SS(-2), SS(-3)) = aten::erf(%4)
    %6 : Tensor(SS(-4), SS(-3)) = aten::relu(%y.1)
    %7 : Tensor[] = prim::ListConstruct(%5, %6)
    %8 : Tensor(SS(-5), SS(-3)) = aten::cat(%7, %3)
    %9 : Tensor(SS(-5), SS(-3)) = aten::hardswish(%8)
    %10 : Tensor(SS(-5), SS(-3)) = aten::mul(%9, %z)
    return (%9)
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732414

Pulled By: eellison

fbshipit-source-id: 290a94a667c20467717202a43c60e4f9ca4c00e2
2021-10-19 16:41:49 -07:00
b4db5174fe Add support for cat in output stitching (#66098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66098

`cat` is somewhat special-cased right now because currently we only have lists of Tensor inputs where the list is constructed in the JIT IR graph. While that is generally true for fusion (e.g., it is why we have ConstantChunk), it may not be true for shape analysis generally, so I'm waiting a bit before generalizing.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732415

Pulled By: eellison

fbshipit-source-id: 7f513cea355f1e4c1d2ca7c32c06690a9bdcb050
2021-10-19 16:41:44 -07:00
0fdc9b77a3 Add support for multi output nodes in partial eval graph stitching (#66097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66097

Adding logic to generate runtime shapes for nodes with multiple outputs. It generalizes the existing flow (looking at a node, getting its shape graph, inlining it, and adding a mapping from the output to the new value in the stitched shape compute graph) to loop over multiple outputs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732418

Pulled By: eellison

fbshipit-source-id: 767698d031b1daf002678a025b270e0ede429061
2021-10-19 16:41:39 -07:00
cc7de1df3b Add Handling of Cat in Shape Analysis (#65575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65575

This is needed for lowering an NNC model to mobile. It is also the last class of unhandled ops which NNC fuses, and we need to integrate this for computing output symbolic shapes.

The graph with two dynamic shape inputs produces:
```
graph(%x.1 : Tensor(SS(-2), 2, 3),
      %y.1 : Tensor(SS(-3), 2, 3)):
  %5 : int = prim::Constant[value=0]()
  %4 : Tensor[] = prim::ListConstruct(%x.1, %y.1)
  %6 : Tensor(SS(-4), 2, 3) = aten::cat(%4, %5) # /private/home/eellison/pytorch/test/jit/test_symbolic_shape_analysis.py:290:19
  return (%6)
```
With a partial eval graph of
```
Done with partial evaluation
graph(%129 : int[],
      %130 : int[],
      %dim.14 : int):
  %738 : int = prim::Constant[value=3]()
  %737 : int = prim::Constant[value=2]()
  %132 : int = prim::Constant[value=0]()
  %392 : int = aten::__getitem__(%129, %132) # <string>:339:44
  %417 : int = aten::__getitem__(%130, %132) # <string>:339:44
  %cat_dim_size.48 : int = aten::add(%392, %417) # <string>:339:29
  %result_size.5 : int[] = prim::ListConstruct(%cat_dim_size.48, %737, %738)
  return (%result_size.5)
```

To handle cat, I essentially make the cat shape op variadic,
replacing
```
torch.cat([x, y])
...
def cat_shape_op(tensors: List[List[int]], dim: int):
    ...
    op(tensors)
```
with
```
def cat_shape_op(x: List[int], y: List[int], dim: int):
    tensors = [x, y]
    op(tensors)
```
This reuses the existing input Tensor properties partial evaluation path and avoids having to add special handling to optimize out `len(tensors)` calls in the IR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732416

Pulled By: eellison

fbshipit-source-id: 6d93ddf62c34846ec238159f75229632515530b7
2021-10-19 16:41:34 -07:00
66543f88de Add x + 0 optimization (#65574)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65574

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732420

Pulled By: eellison

fbshipit-source-id: 0271e0dc0ddab06220048ed5bf4236fc85f3318c
2021-10-19 16:41:29 -07:00
853fc25fb0 Fix bug preventing optimization from firing (#65573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65573

When we remove mutation on
```
x = [0, 1, 3, 4]
x[-2] = 4
```
we have a safety check that the new index will be in bounds. In practice this should always be the case; otherwise you would have a runtime error. Within that check (not within the actual adjustment) we were using the wrong length of inputs, preventing the optimization from firing.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732417

Pulled By: eellison

fbshipit-source-id: dd734254c0212ca459c1c135da262974de5299be
2021-10-19 16:41:24 -07:00
5db7db667f [JIT] Add partial evaluation graph stitching logic (#65377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65377

When we run symbolic shape analysis on
```
conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
max_pool = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
mod = nn.Sequential(conv, max_pool)
...
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_0.Sequential,
      %input.1 : Tensor):
  %18 : bool = prim::Constant[value=0]()
  %30 : int[] = prim::Constant[value=[1, 1]]()
  %29 : int[] = prim::Constant[value=[3, 3]]()
  %28 : int[] = prim::Constant[value=[2, 2]]()
  %6 : int = prim::Constant[value=1]()
  %self.0.bias : NoneType = prim::Constant()
  %self.0.weight : Double(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %input.5 : Tensor(SS(-2), 64, SS(-3), SS(-4)) = aten::conv2d(%input.1, %self.0.weight, %self.0.bias, %28, %29, %30, %6)
  %input.9 : Tensor(SS(-2), 64, SS(-5), SS(-6)) = aten::max_pool2d(%input.5, %29, %28, %30, %30, %18)
  return (%input.9)
```
we partially evaluate the shape compute graph of `conv2d`, whose output gets passed in and used to partially evaluate the shape compute graph of `max_pool2d`.

The conv2d remaining partially eval'd graph is [here](https://gist.github.com/eellison/0598bd224a422211efa1a45d2b7560b7), and the maxpool2d eval'd graph is [here](https://gist.github.com/eellison/625540b84f650ddbefd3ae5511ab8814). We can take the partially eval'd graphs of a series of operators and stitch them together, which allows us to
a) recover symbolic equivalences by CSE'ing & other optimizations
b) calculate shapes for a whole block of operators just from the input, such as for fusing the whole model to NNC with dynamic shapes and then passing along the computed symbolic shapes. The calculation also includes error handling.
c) (future-looking) generate inputs on demand for straight-line networks that are composed just of aten operators

The combined graph of the two gives us compute for the unknown symbolic dimensions - `SS(-2), SS(-3), SS(-4), SS(-5), and SS(-6)`.
```
graph(%input.1 : int[]):
  %42 : bool = prim::Constant[value=0]() # <string>:152:17
  %15 : int = prim::Constant[value=3]()
  %input_batch_size_dim.1 : int = prim::Constant[value=0]() # <string>:417:41
  %13 : int = prim::Constant[value=1]() # <string>:426:61
  %12 : int = prim::Constant[value=4]() # <string>:437:32
  %11 : str = prim::Constant[value="AssertionError: "]()
  %9 : int = prim::Constant[value=2]()
  %8 : int = prim::Constant[value=6]()
  %7 : int = prim::Constant[value=7]()
  %16 : int = aten::len(%input.1) # <string>:438:17
  %17 : bool = aten::eq(%16, %12) # <string>:438:17
   = prim::If(%17) # <string>:438:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:438:10
      -> ()
  %18 : int = aten::__getitem__(%input.1, %13) # <string>:407:17
  %19 : bool = aten::eq(%18, %15) # <string>:407:17
   = prim::If(%19) # <string>:407:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:407:10
      -> ()
  %20 : int = aten::__getitem__(%input.1, %9) # <string>:411:20
  %21 : int = aten::add(%20, %8) # <string>:411:20
  %22 : bool = aten::ge(%21, %7) # <string>:411:20
   = prim::If(%22) # <string>:411:12
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:411:12
      -> ()
  %23 : int = aten::__getitem__(%input.1, %15) # <string>:411:20
  %24 : int = aten::add(%23, %8) # <string>:411:20
  %25 : bool = aten::ge(%24, %7) # <string>:411:20
   = prim::If(%25) # <string>:411:12
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:411:12
      -> ()
  %26 : int = aten::__getitem__(%input.1, %input_batch_size_dim.1) # <string>:422:29
  %27 : int = aten::sub(%20, %13) # <string>:428:32
  %28 : int = aten::floordiv(%27, %9) # <string>:428:32
  %29 : int = aten::add(%28, %13) # <string>:428:32
  %30 : int = aten::sub(%23, %13) # <string>:428:32
  %31 : int = aten::floordiv(%30, %9) # <string>:428:32
  %32 : int = aten::add(%31, %13) # <string>:428:32
  %48 : int = aten::floordiv(%28, %9) # <string>:133:17
  %outputSize.2 : int = aten::add(%48, %13) # <string>:136:23
  %51 : int = aten::floordiv(%31, %9) # <string>:133:17
  %outputSize.1 : int = aten::add(%51, %13) # <string>:136:23
  %53 : bool = aten::ne(%29, %input_batch_size_dim.1) # <string>:156:41
  %54 : bool = prim::If(%53) # <string>:157:64
    block0():
      %55 : bool = aten::ne(%32, %input_batch_size_dim.1) # <string>:157:93
      -> (%55)
    block1():
      -> (%42)
   = prim::If(%54) # <string>:157:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:157:10
      -> ()
  %56 : bool = aten::ge(%outputSize.1, %13) # <string>:160:17
  %57 : bool = prim::If(%56) # <string>:160:17
    block0():
      %58 : bool = aten::ge(%outputSize.2, %13) # <string>:160:38
      -> (%58)
    block1():
      -> (%42)
   = prim::If(%57) # <string>:160:10
    block0():
      -> ()
    block1():
       = prim::RaiseException(%11) # <string>:160:10
      -> ()
  return (%26, %29, %32, %outputSize.2, %outputSize.1)
  ```

This PR runs shape analysis, retains the partially evaluated graphs, and then stitches them together, keeping track of which inputs in the partial eval graph correspond to which inputs in the encompassing graph IR and which outputs correspond to which symbolic shape. Adding NNC people as reviewers because this is relevant to dynamic shape fusion.

Question for reviewers: should I make this a separate file?

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732419

Pulled By: eellison

fbshipit-source-id: 883a55cbeef0fd5a6068a779ffa89b6f537245b3
2021-10-19 16:41:19 -07:00
16d0896b69 [JIT][Easy] Shape cleanups (#65148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65148

No functional changes, factoring out optimizations and renaming the `graph` in symbolic shape analysis to `shape_compute_graph` as ZolotukhinM suggested

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31732421

Pulled By: eellison

fbshipit-source-id: e934507d1795e0bc4d98a3bfe6cb792e2f08b119
2021-10-19 16:39:32 -07:00
b3bb234e16 Remove THCGeneral.cpp (#66766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66766

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31721647

Pulled By: ngimel

fbshipit-source-id: 5033a2800871c8745a1a92e379c9f97c98af212e
2021-10-19 16:09:19 -07:00
bd4d5cb14c Sparse CSR: Add torch.empty (#63509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63509

The primary use of `torch.empty` is to reserve memory for a tensor and set its type, device, and size information. The same is done here for SparseCSR.
`crow_indices` is initialized as an empty tensor of size `num_rows + 1`. `col_indices` and `values` are initialized as empty tensors of size 0.
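
A minimal sketch of the described initialization, assuming the empty-CSR path added here:
```python
import torch

# Sketch: an empty CSR tensor reserves crow_indices of size num_rows + 1,
# while col_indices and values start empty.
t = torch.empty(3, 4, layout=torch.sparse_csr)
print(t.crow_indices().numel())  # 4  (num_rows + 1)
print(t.col_indices().numel())   # 0
print(t.values().numel())        # 0
```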

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31770359

Pulled By: cpuhrsch

fbshipit-source-id: c83f2a2e0d7514ba24780add1086e1bccf541dd9
2021-10-19 15:59:07 -07:00
b1a6129e09 Add repr to StreamWrapper (#66880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66880

Helps to print out `fileobj`.

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D31764431

Pulled By: ejguan

fbshipit-source-id: 668a8fbe0078196d4d584be3dfb413c8ad5e72b1
2021-10-19 15:28:25 -07:00
e70b5d64f4 Change README getting started link to explicit instructions (#66828)
Summary:
This changes the link for installing binaries to point to the pytorch.org page that consists entirely of the download command selector (the selector is no longer visible on a normal-aspect-ratio screen when the main website page first loads).

This also includes some other random fixes:
* Update HUD link
* Clean ups

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66828

Reviewed By: malfet

Differential Revision: D31750654

Pulled By: driazati

fbshipit-source-id: aef9ceba71418f6f7648eab9a8c8a78d6c60518b
2021-10-19 14:59:48 -07:00
cbd7bac914 Migrate clang5-mobile build to GHA (#66673)
Summary:
`linux-xenial-py3-clang5-mobile-build`, `linux-xenial-py3-clang5-mobile-custom-build-dynamic` and `linux-xenial-py3-clang5-mobile-code-analysis` are just flavors of the regular Linux build job with no tests.
`linux-xenial-py3-clang5-mobile-code-analysis` is a master-only job.

The `code-analysis` job is dispatched to `.jenkins/pytorch/build-mobile-code-analysis.sh` in
583217fe37/.jenkins/pytorch/build.sh (L23-L25)
and all `mobile-build` jobs are dispatched to `.jenkins/pytorch/build-mobile.sh` in
583217fe37/.jenkins/pytorch/build.sh (L19-L21)

Rename the `is_libtorch` `CIWorkflow` property to `build_generates_artifacts` and change the default from False to True.
Neither libtorch nor mobile build jobs generate build artifacts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66673

Reviewed By: janeyx99

Differential Revision: D31674434

Pulled By: malfet

fbshipit-source-id: 24d05d55366202cd4d9c25ecab429cb8f670ded0
2021-10-19 14:13:29 -07:00
15f21eef5e [fx2trt]fix softmax test (#66885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66885

Test Plan: CI

Reviewed By: hl475

Differential Revision: D31767433

fbshipit-source-id: 1ee79ac027c612b5397be9da9665fff21b2c321f
2021-10-19 13:55:49 -07:00
a1afb692f3 Fix metal issues with irange (#66877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66877

Fixes (hopefully):
```
program_source:516:27: error: use of undeclared identifier 'c10'
    for (const auto idx : c10::irange(4)) {
                          ^
program_source:590:27: error: use of undeclared identifier 'c10'
    for (const auto idx : c10::irange(4)) {
                          ^
program_source:810:26: error: use of undeclared identifier 'c10'
    for (const auto iy : c10::irange(roi_bin_grid_h)) {
                         ^
program_source:811:30: error: use of undeclared identifier 'c10'
        for (const auto ix : c10::irange(roi_bin_grid_w)) {
                             ^

DeviceName: AMD Radeon Pro 5500M, LanguageVersion: 131075
Exception raised from -[MetalContext available] at xplat/caffe2/aten/src/ATen/native/metal/MetalContext.mm:66 (most recent call first):
(no backtrace available)
```

Test Plan: Sandcastle

Reviewed By: benb, xta0

Differential Revision: D31763270

fbshipit-source-id: cfe4364b14c5fe6dbd39893788919769c9a9eb00
2021-10-19 13:49:24 -07:00
66f241230d [PyTorch] Take const Type& in {tryS,s}calarTypeFromJitType (#66717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66717

No need to require a refcount bump for this function.
ghstack-source-id: 140921170

Test Plan: CI

Reviewed By: suo

Differential Revision: D31696898

fbshipit-source-id: a3732a04ccbddc32207ce90836030f3020154a77
2021-10-19 13:08:42 -07:00
9a00910bf3 [skip ci] Set test owner for test_linalg.py (#66844)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66844

Reviewed By: gchanan

Differential Revision: D31761714

Pulled By: janeyx99

fbshipit-source-id: a4c7b239d855707ee6ec1194f57f8a66812b4e99
2021-10-19 13:01:05 -07:00
57c596eb9e add interactive_embedded_interpreter.cpp to the OSS build (#66352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66352

Add cmake rules for interactive_embedded_interpreter.cpp .

The builtin_registry.cpp has already been handled in https://github.com/pytorch/pytorch/pull/66347 . I'll remove the change in this PR once that one is merged.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D31521249

Pulled By: shunting314

fbshipit-source-id: bb9d340e5a6aad7d76078ca03a82b5ae7494a124
2021-10-19 12:32:49 -07:00
3488a85a76 Sparse CSR CUDA: fix input checks for addmm and mm (#66485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66485

The errors for incorrectly sized inputs should match those of the dense
variants of these functions.
Moved addmm_out_sparse_csr_dense_cuda from SparseCsrTensorMath.cu and
removed an unnecessary device check.

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31764036

Pulled By: cpuhrsch

fbshipit-source-id: 76900fe9e4a49474695a01f34bad41cb3422321c
2021-10-19 12:01:11 -07:00
690c2a7076 masked_scatter: fuse mask count check into one kernel (#66871)
Summary:
This saves 1 kernel launch, 7 dispatcher calls, 3 `TensorImpl` allocations and 1 CUDA memory allocation.
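
A usage-level illustration of the op whose validation is being fused (a minimal sketch; the fused mask-count check itself is internal to the CUDA kernel and not visible from Python):

```python
import torch

x = torch.zeros(4)
mask = torch.tensor([True, False, True, False])
src = torch.tensor([1., 2.])
# Before scattering, the kernel validates that src has at least as many
# elements as mask has True entries; that check is now fused.
x.masked_scatter_(mask, src)  # x is now [1., 0., 2., 0.]
```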

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66871

Reviewed By: gchanan

Differential Revision: D31763713

Pulled By: ngimel

fbshipit-source-id: b0d2f9415b7fd013fb4e7d68ade6e38a58f5b153
2021-10-19 11:52:38 -07:00
552af8bdef [PyTorch] Fix missing move in OptionalType::createWithContained (#66697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66697

We own this vector, so we can move from it.
ghstack-source-id: 140742640

Test Plan: CI

Reviewed By: suo

Differential Revision: D31693230

fbshipit-source-id: 3f33ca6e47e29b0e3d6c8fad59c234c55e1e159f
2021-10-19 11:47:35 -07:00
7e81a89e13 [PyTorch] Fix performance-no-automatic-move clang tidy warnings in matchTypeVariables (#66720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66720

See the documentation for the warning. https://clang.llvm.org/extra/clang-tidy/checks/performance-no-automatic-move.html
ghstack-source-id: 140922952

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697506

fbshipit-source-id: 26ce6c47d0f3b0c4e48ecc882f6792f1b5a45bac
2021-10-19 11:30:46 -07:00
50f5689d60 Set test owner for distributions tests (#66842)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc fritzo neerajprad alicanb nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66842

Reviewed By: neerajprad

Differential Revision: D31761720

Pulled By: janeyx99

fbshipit-source-id: 9d9e88d93e2efb90c971f165b4040880e9d90c56
2021-10-19 11:00:29 -07:00
c37f413e75 [skip ci] Change pretrained to false for quantization tests (#66795)
Summary:
Helps resolve a bit of https://github.com/pytorch/pytorch/issues/65439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66795

Reviewed By: suo, jerryzh168

Differential Revision: D31732043

Pulled By: janeyx99

fbshipit-source-id: 10b71865fc937f9d72f2b1c04cbf3ea9a68c8818
2021-10-19 10:56:29 -07:00
c9d9244166 [skip ci] Set test owner for test_spectral_ops.py (#66843)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc mruberry peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66843

Reviewed By: gchanan

Differential Revision: D31761715

Pulled By: janeyx99

fbshipit-source-id: 1173a200478b87568768fafcfee117c09c1cffbd
2021-10-19 10:56:27 -07:00
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
94afbd158c [skip ci] Set test owner for test_numpy_interop.py (#66851)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc mruberry rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66851

Reviewed By: gchanan

Differential Revision: D31761703

Pulled By: janeyx99

fbshipit-source-id: 4dec507dff0ce25d2780b6020f0d9790ab1cb499
2021-10-19 10:50:54 -07:00
17f07c310b Fix type checking errors in torch/ao/quantization/quantize_fx.py (#66804)
Summary:
- [x] Fix the Pyre type checking errors in `torch/ao/quantization/quantize_fx.py`
```
torch/quantization/quantize_fx.py:41:8 Incompatible variable type [9]: fuse_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:143:16 Incompatible variable type [9]: prepare_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:144:16 Incompatible variable type [9]: equalization_qconfig_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:206:8 Incompatible variable type [9]: prepare_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:230:12 Incompatible variable type [9]: fuse_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:268:8 Incompatible variable type [9]: prepare_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:269:8 Incompatible variable type [9]: equalization_qconfig_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:427:8 Incompatible variable type [9]: prepare_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:464:8 Incompatible variable type [9]: convert_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:486:8 Incompatible variable type [9]: convert_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/quantize_fx.py:547:8 Incompatible variable type [9]: convert_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
```
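
The standard fix for this class of Pyre error is to declare the parameter `Optional` and normalize `None` in the body; a minimal sketch (the function name and body are illustrative, not the actual quantize_fx code):

```python
from typing import Any, Dict, Optional

def prepare(prepare_custom_config_dict: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Declaring Optional[...] (instead of Dict[...] = None) satisfies Pyre;
    # None is then normalized to an empty dict before use.
    if prepare_custom_config_dict is None:
        prepare_custom_config_dict = {}
    return prepare_custom_config_dict
```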
Fixes the issue: [MLH-Fellowship/pyre-check/issues/76](https://github.com/MLH-Fellowship/pyre-check/issues/76)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66804

Reviewed By: onionymous

Differential Revision: D31738171

Pulled By: 0xedward

fbshipit-source-id: 00d4c5749c469aff39a1531365461ced747e52fc
2021-10-19 09:45:18 -07:00
a2e94b80fa Create linalg.matrix_exp (#62715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62715

Fixes https://github.com/pytorch/pytorch/issues/61648

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31641698

Pulled By: mruberry

fbshipit-source-id: 2e2965d14807b6b4fada4b809d539066dd0ba277
2021-10-19 09:07:15 -07:00
fd608cd313 [skip ci] Set test owners for optim tests (#66861)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc vincentqb jbschlosser albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66861

Reviewed By: albanD

Differential Revision: D31761369

Pulled By: janeyx99

fbshipit-source-id: 57829e1f1509fc2af321530a4b55c9d33b7fb150
2021-10-19 08:39:35 -07:00
c806bb1022 [skip ci] Set test owner for test_complex.py (#66835)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66835

Reviewed By: anjali411

Differential Revision: D31761723

Pulled By: janeyx99

fbshipit-source-id: ca672f5a1be9dc27284fade725a8238cbfd877a3
2021-10-19 08:36:27 -07:00
299a6a65b2 [skip ci] Set test owners for autograd tests (#66834)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66834

Reviewed By: albanD

Differential Revision: D31761778

Pulled By: janeyx99

fbshipit-source-id: 355edfb1b940154e84fbba6f7b096605e75ae459
2021-10-19 08:35:02 -07:00
39215ddf84 [skip ci] Set test owners for dataloader tests (#66839)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc SsnL VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66839

Reviewed By: ejguan

Differential Revision: D31761722

Pulled By: janeyx99

fbshipit-source-id: 8315ac03352c11b3215d89856b3cfda6cd78fa0c
2021-10-19 08:31:16 -07:00
9eab6da887 [skip ci] Set test owner for nn tests (#66850)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66850

Reviewed By: albanD

Differential Revision: D31761712

Pulled By: janeyx99

fbshipit-source-id: 7272154cac77e2ce38370775a9e8d41252e13166
2021-10-19 08:26:50 -07:00
05b6dc9d75 Fix BatchMatMul test and shape inference (#66733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66733

Fix the test for BatchMatMul to compare glow/caffe2 outputs, and fix its shape inference function, which made simplifying assumptions for broadcasting and failed on some of the shapes in the test. The previous inference failed whenever the first n - 2 output dimensions of A x B were not simply those of whichever of A or B had higher rank (e.g., for A: [2, 2, 2, 3, 4] and B: [3, 1, 2, 2, 4, 5] we expect output dimensions [3, 2, 2, 2, 3, 5] rather than [3, 1, 2, 2, 3, 5]).
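
The batch dimensions follow standard NumPy-style broadcasting. A minimal sketch of the corrected output-shape rule (illustrative only, not the actual inference code):

```python
def batch_matmul_output_shape(a, b):
    # Matrix dims come from A's rows and B's cols; the leading batch dims
    # broadcast elementwise after right-aligning the two shapes.
    batch_a, batch_b = tuple(a[:-2]), tuple(b[:-2])
    n = max(len(batch_a), len(batch_b))
    batch_a = (1,) * (n - len(batch_a)) + batch_a
    batch_b = (1,) * (n - len(batch_b)) + batch_b
    batch = tuple(max(x, y) for x, y in zip(batch_a, batch_b))
    return batch + (a[-2], b[-1])

# The example from above:
assert batch_matmul_output_shape([2, 2, 2, 3, 4], [3, 1, 2, 2, 4, 5]) == (3, 2, 2, 2, 3, 5)
```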

Test Plan:
```
buck test glow/fb/test/numerics:test_operator_onnxifinnpi -- -r .*test_batch_matmul_manydims.* --env USE_INF_API=1
```

Reviewed By: khabinov

Differential Revision: D31701184

fbshipit-source-id: 31d0fb17409a399b90fb8042385e000ed81c3581
2021-10-19 07:53:13 -07:00
9f782f8b35 add OpInfo for torch.nn.pixel_unshuffle (#65468)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65468

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31111699

Pulled By: zou3519

fbshipit-source-id: a92c2f1f4986a54abab82360e97ea2ce22fb9397
2021-10-19 07:36:35 -07:00
1164118fc2 add OpInfo for torch.nn.pixel_shuffle (#65467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65467

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31111697

Pulled By: zou3519

fbshipit-source-id: 618e6b2cc927814f85500374a2838d98c9c45d6e
2021-10-19 07:36:33 -07:00
8f09292c5e add OpInfo for torch.nn.functional.pairwise_distance (#65460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65460

cc albanD mruberry jbschlosser walterddr

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31111701

Pulled By: zou3519

fbshipit-source-id: a4034418cf8d14f584134a16d822181703858f99
2021-10-19 07:35:10 -07:00
0036e41143 [quant][embedding qat] Add eager QAT test for EmbeddingBag+Linear model (#66334)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66334

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618283

Pulled By: b-koopman

fbshipit-source-id: bb824a341f1aa9d7e83f8e66d320a9dfd348a1d7
2021-10-19 07:03:36 -07:00
0a07488ed2 use irange for loops 1 (#66741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66741

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31705360

fbshipit-source-id: 7115f76e381ad2d98584eb534961c3cbb957ebaa
2021-10-19 03:28:51 -07:00
72803dbcfd [caffe2] Fix invalid vector accesses and polar() call (#66757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66757

`InterpreterStateImpl::run()` gets the number of outputs from the current frame, but by the time the continuation completes, the frame is gone, so we're calling `front()` on an empty vector. This works out in practice (data is still there) but it is technically undefined behavior and could break in the future.

Also, `std::polar()` expects its magnitude argument to be non-negative, but `c10::polar()` does not require this, so implement it explicitly (the implementation is the same as libstdc++'s).
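
In Python terms, the explicit formula is the standard polar-to-rectangular conversion, which is well defined for negative magnitudes too (a minimal sketch of the formula, not the actual c10 code):

```python
import math

def polar(abs_val, angle):
    # Unlike std::polar, this explicit form does not require abs_val >= 0;
    # a negative magnitude simply flips the resulting vector.
    return complex(abs_val * math.cos(angle), abs_val * math.sin(angle))

assert polar(-1.0, 0.0) == -1.0
```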

Test Plan: JIT tests pass.

Reviewed By: zhxchen17

Differential Revision: D31715587

fbshipit-source-id: 98abcc10c2742887af866d8e70169a0187c41d33
2021-10-19 00:29:54 -07:00
147f7559b1 Add SourceView which doesn't own source text as base class of Source (#65309)
Summary:
This saves the cost of copying text from the stack to the heap in some cases (such as
parsing function schemas during the loading phase of libtorch.so).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65309

Reviewed By: swolchok

Differential Revision: D31060315

Pulled By: gmagogsfm

fbshipit-source-id: 0caf7a688b40df52bb4388c5191d1a42351d6f1a
2021-10-18 23:17:22 -07:00
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target for perf
optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
e0643fa3fc use irange for loops 5 (#66744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66744

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31705358

fbshipit-source-id: d6ea350cbaa8f452fc78f238160e5374be637a48
2021-10-18 21:59:50 -07:00
bceb1db885 use irange for loops 3 (#66747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66747

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31705365

fbshipit-source-id: 5c3af2184766b063eed2f4e8feb69f1fedd3503e
2021-10-18 21:50:32 -07:00
061baf02bf Skip failing tests when LAPACK and MAGMA are not available (#64930)
Summary:
Skip failing tests when LAPACK and MAGMA are not available for `test_linalg.py` and `test_ops.py`.
Note that there's no CI without LAPACK or MAGMA. I verified locally that this now works as expected, but we have no guard against these tests regressing in this situation in the future.
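
A minimal sketch of one way such skips are typically expressed with the existing device-type test decorators (illustrative; the actual tests may use different guards):

```python
from torch.testing._internal.common_utils import TestCase
from torch.testing._internal.common_device_type import (
    skipCPUIfNoLapack, skipCUDAIfNoMagma)

class TestLinalgExample(TestCase):
    @skipCPUIfNoLapack
    @skipCUDAIfNoMagma
    def test_svd_like_op(self, device):
        ...  # body elided; the decorators skip when LAPACK/MAGMA are absent
```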

<details>
  <summary> test_ops.py failures that are fixed</summary>

 ```
 FAILED test/test_ops.py::TestCommonCPU::test_out_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_reference_testing_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_reference_testing_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_triangular_solve_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_triangular_solve_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_conj_view_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_conj_view_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_neg_view_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_neg_view_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
 ```

</details>

<details>
  <summary> test_linalg.py failures that are fixed</summary>
```
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_dtype_cpu - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_matrix_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_matrix_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_eigh_hermitian_grad_meta_complex128 - RuntimeError: Calling torch.linalg.eigh or eigvalsh on a CPU tensor requires compiling PyTorch with LAPACK. Please use PyTorch built with LAPACK support.
FAILED test/test_linalg.py::TestLinalgMETA::test_eigh_hermitian_grad_meta_float64 - RuntimeError: Calling torch.linalg.eigh or eigvalsh on a CPU tensor requires compiling PyTorch with LAPACK. Please use PyTorch built with LAPACK support.
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_lowrank_cuda_float64 - RuntimeError: Calling torch.lu on a CUDA tensor requires compiling PyTorch with MAGMA. Please rebuild with MAGMA.
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
```
</details>

Fixes https://github.com/pytorch/pytorch/issues/59662

cc mruberry jianyuh nikitaved pearu walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64930

Reviewed By: zou3519

Differential Revision: D31739416

Pulled By: mruberry

fbshipit-source-id: 153c40d8eeeb094b06816882a7cbb28c681509a9
2021-10-18 21:30:01 -07:00
08a464a9f3 [PyTorch] Pass c10::optional<bool> to Stride ctor by value (#66698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66698

This type should fit in a register, so there is no need to pass it by reference.
ghstack-source-id: 140742830

Test Plan: CI

Reviewed By: suo

Differential Revision: D31693291

fbshipit-source-id: 299fb3d1830a059b59268487c22e030446c3496e
2021-10-18 21:28:56 -07:00
c9c52b760b test addr type promotion in a single test (#66812)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66802
Test time goes from 150s to 15s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66812

Reviewed By: mruberry

Differential Revision: D31739299

Pulled By: ngimel

fbshipit-source-id: cb6d92ff335f46ee06b2480bdd9143f85865bccf
2021-10-18 21:21:11 -07:00
d05c1ec007 Add lazy Node base and associated infra (#66601)
Summary:
- Adds Node base class and unit tests
- Also adds metadata utils to enable source code annotation and scope tracking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66601

Test Plan: Add new unit tests

Reviewed By: desertfire

Differential Revision: D31634044

fbshipit-source-id: a042d54f06fbc480acfc63c18d43cb6fceb6fea5
2021-10-18 19:09:42 -07:00
a17a4e93ce [PyTorch][easy] Fix missing move in UnionType::createWithContained (#66691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66691

Does what it says on the tin.
ghstack-source-id: 140736047

Test Plan: CI

Reviewed By: suo

Differential Revision: D31691627

fbshipit-source-id: 21a5d0248bf3412f5af36260597a5f663ab34361
2021-10-18 18:04:22 -07:00
c9c447f4be [PyTorch] Fix missing moves in ListType (#66701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66701

We own the argument vector.
ghstack-source-id: 140760983

Test Plan: CI

Reviewed By: suo

Differential Revision: D31693645

fbshipit-source-id: 02829bc3c728f6d1d07be08b0d977eee1efee38f
2021-10-18 18:00:18 -07:00
d0a63c978b [PyTorch][easy] Don't copy string in TensorType::repr_str unnecessarily (#66699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66699

std::string::operator+ will copy the string an extra time even if the argument is `""`. See https://godbolt.org/z/3sM5h1qTo
ghstack-source-id: 140743822

Test Plan: CI

Reviewed By: suo

Differential Revision: D31693522

fbshipit-source-id: 6a8033c90366904b9aff44214b600cfb255a0809
2021-10-18 17:55:21 -07:00
f65b4b7a4c [PyTorch] Avoid refcount bump in UnionType::canHoldType (#66693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66693

Passing a `TypePtr` by value causes an unnecessary refcount
bump. We don't need to take ownership, so `const Type&` is all we
need.

I considered providing a compatibility shim that takes `const
TypePtr&`, but doing so is dangerous because a
copy is required to convert from a more specific pointer like
`NoneTypePtr`.
ghstack-source-id: 140737081

Test Plan: CI

Reviewed By: suo

Differential Revision: D31691869

fbshipit-source-id: f766ce3234a28771c2a9ca4c284eb3f96993a3d0
2021-10-18 17:39:59 -07:00
1db50505d5 [nn] MultiLabelSoftMarginLoss : no batch dim support (#65690)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65690

Reviewed By: zou3519

Differential Revision: D31731162

Pulled By: jbschlosser

fbshipit-source-id: d26f27555f78afdadd49126e0548a8bfda50cc5a
2021-10-18 15:30:01 -07:00
8173d4df69 move get_cycles_per_ms() to common_utils (#66798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66798

get_cycles_per_ms is copied and used in a few places; move it to common_utils so that it can be used as a shared util function
ghstack-source-id: 140790599

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D31706870

fbshipit-source-id: e8dccecb13862646a19aaadd7bad7c8f414fd4ab
2021-10-18 14:04:09 -07:00
d024f1134d ci: Move bazel download from github -> s3 (#66815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66815

We were seeing 403s when attempting to wget from GitHub; re-hosting the
binary on S3 means we shouldn't see those issues anymore

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31740656

Pulled By: seemethere

fbshipit-source-id: 4462678d51a52b63020f8da18d7cdc80fb8dbc5d
2021-10-18 13:34:40 -07:00
06e49ea088 [not4land][quant][fx][graphmode] lower reference linear module example (#65723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65723

Example lowering reference linear module to fbgemm/qnnpack quantized linear module

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31567461

fbshipit-source-id: 0b8fffaf8e742ec15cb07bf6a4672cf3e856db2d
2021-10-18 13:14:39 -07:00
c994a7fc2d Update documentation of torch.nn.Upsample (#66756)
Summary:
The documentation of torch.nn.Upsample stated that `align_corners` only affects `linear`, `bilinear` and `trilinear`.

This PR updates the documentation for the Python `Upsample` module and the C++ `UpsampleOptions` struct to reflect that `bicubic` is also affected by `align_corners`.
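
For reference, a minimal example of the affected mode:

```python
import torch

x = torch.arange(16.).reshape(1, 1, 4, 4)
# bicubic, like linear/bilinear/trilinear, changes behavior with align_corners
up = torch.nn.Upsample(scale_factor=2, mode='bicubic', align_corners=True)
y = up(x)  # shape: (1, 1, 8, 8)
```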

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66756

Reviewed By: zou3519

Differential Revision: D31731148

Pulled By: jbschlosser

fbshipit-source-id: 3ec277fc3fbdf8414d0de327d8c57ba07342a5b9
2021-10-18 13:07:17 -07:00
0974215c4d Prefer mT and mH over transpose(-2, -1) and transpose(-2, -1).conj() (#64181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181

This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.

It also simplifies two pieces of code and fixes one bug where a pair
of parentheses was missing in the function `make_symmetric_matrices`.
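
A minimal sketch of the equivalences being substituted:

```python
import torch

A = torch.randn(2, 3, 4, dtype=torch.complex64)
assert torch.allclose(A.mT, A.transpose(-2, -1))         # matrix transpose
assert torch.allclose(A.mH, A.transpose(-2, -1).conj())  # conjugate (Hermitian) transpose
```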

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31692896

Pulled By: anjali411

fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
2021-10-18 13:02:25 -07:00
44fd312604 [PyTorch] Use intrusive_ptr to save space in KernelFunction (#65618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65618

This saves 8 bytes per KernelFunction, which should help in resource-constrained environments.
ghstack-source-id: 140731069

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25405736

fbshipit-source-id: 757c0f1387da9147e46ac69af2aa9fffd2998e35
2021-10-18 12:53:45 -07:00
622e19b859 [PyTorch] Take const Type& in TensorType::fromNumberType (#66716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66716

No need to require a refcount bump for this function.
ghstack-source-id: 140754065

Test Plan: CI

Reviewed By: suo

Differential Revision: D31696639

fbshipit-source-id: bf8aa3f542d52e82e0f6a444b8898330f3d16a31
2021-10-18 12:49:40 -07:00
6a7296be9c [PyTorch] Use castRaw in InterfaceType (#66728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66728

Two extra refcount bumps.
ghstack-source-id: 140760872

Test Plan: CI

Reviewed By: suo

Differential Revision: D31698577

fbshipit-source-id: 1f50195a99f98f857abc9b03b4254519c316fefe
2021-10-18 12:44:24 -07:00
9ea3424747 Set test owner for fx (#66807)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66807

Reviewed By: jamesr66a

Differential Revision: D31736722

Pulled By: janeyx99

fbshipit-source-id: 5ffcb02a858137211bff1eabf158001dcb0359a6
2021-10-18 12:25:38 -07:00
8637556d23 Migrate THCState to ATen (#66765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66765

This guts `THCState` to simply be an empty struct, as well as:
- moving `THCState_getPeerToPeerAccess` and its cache into `ATen`.
- cleaning up dead code in `THCGeneral.cpp`
- moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31721648

Pulled By: ngimel

fbshipit-source-id: 772b24787656a95f9e3fcb287d912b1c3400f32d
2021-10-18 12:14:43 -07:00
1fcbd8fa15 [PyTorch] Fix extra refcount bumps in tryEvalTypeVariables (#66722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66722

Missing move, s/cast/castRaw/, and take TypePtr arg by const ref because we only sometimes need to take ownership.
ghstack-source-id: 140757141

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697631

fbshipit-source-id: 04afe13688c6e2aaf79157400c0a44021cb8179d
2021-10-18 12:06:37 -07:00
393299b124 [PyTorch] Fix unnecessary shared_ptr copies in RRefType (#66706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66706

Missing moves in the construction path.
ghstack-source-id: 140746585

Test Plan: CI

Reviewed By: suo

Differential Revision: D31694356

fbshipit-source-id: 8e2bf2dd41f3f65fc06e30ffd5fddd487d01aaa8
2021-10-18 12:04:43 -07:00
d5a25faf7a [PyTorch] Fix unnecessary shared_ptr copies in EnumType (#66714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66714

Forced copy in getValueType and unnecessary use of cast over castRaw.
ghstack-source-id: 140752791

Test Plan: CI

Reviewed By: suo

Differential Revision: D31696164

fbshipit-source-id: fc2316617a61ca32f1fb952fb0af18b8784a606b
2021-10-18 12:04:41 -07:00
9b729ebc88 [jit] shape propagation for quantization (#66343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66343

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D31515839

Pulled By: IvanKobzarev

fbshipit-source-id: 1b2b953b93210a1cade64c30302478907fc639f3
2021-10-18 12:03:20 -07:00
1cf317b85f [ONNX] Support exporting with Apex O2 (#65374) (#66700)
Summary:
Apex O2 hooks state_dict to return fp16 weights as fp32, so the exporter cannot identify them as the same tensors.
Since this hook is only used by the optimizer, it is safe to remove it while exporting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66700

Reviewed By: zou3519

Differential Revision: D31695132

Pulled By: malfet

fbshipit-source-id: 977bdf57240002498f3ad0f1a8046c352e9860e6
2021-10-18 11:54:09 -07:00
624ce95201 Run sparse tests only for TensorPipe agent. (#66661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66661

Similar to https://github.com/pytorch/pytorch/pull/66600, runs
rpc_test.py sparse tests only for TP agent.
ghstack-source-id: 140666322

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D31669850

fbshipit-source-id: 41a66c8d1843130964aede5c77d391484607214f
2021-10-18 11:53:07 -07:00
7fad47e522 torch.linalg.lstsq: forward/backward AD support (#65054)
Summary:
As per title.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65054

Reviewed By: zou3519

Differential Revision: D31729468

Pulled By: albanD

fbshipit-source-id: ab7df824bc80128e7f64f6444c7a4baa4786c161
2021-10-18 11:28:44 -07:00
6bde474066 [PyTorch] Fix extra refcount bumps in matchTypeVariables (#66719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66719

Some casts could be castRaw, and parameters did not need to force a refcount bump.
ghstack-source-id: 140756356

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697455

fbshipit-source-id: 87a8cba221a7ae53f2a485acafd31622e9328ff0
2021-10-18 11:15:07 -07:00
c373e188d8 [PyTorch] Fix extra refcount bumps in unifyTypes (#66718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66718

Some missing moves and use of cast instead of castRaw (due to a previous automated fixup only being a partial fix).
ghstack-source-id: 140755229

Test Plan: CI

Reviewed By: suo

Differential Revision: D31697115

fbshipit-source-id: 86743f8982951a58638ba244b3a92d3737dde58b
2021-10-18 11:13:45 -07:00
472a6f2787 Strided masked reductions: sum, amax. Testing of masked reductions. (#65990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65990

cc nikitaved pearu cpuhrsch IvanYashchuk

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31729532

Pulled By: albanD

fbshipit-source-id: 855a6bb2a7c6e75c780a64ce23c0f29321f0e511
2021-10-18 11:10:32 -07:00
d777e490a5 [bc-breaking][quant][graphmode][fx] Produce reference patterns for GeneralShapeOps (#66647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66647

Missed in the last round: this adds reference patterns for general shape ops like view when is_reference is True.

bc-breaking:
this basically disables getitem from supporting quantized ops here; we may support it later in fbgemm

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels

Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31680379

fbshipit-source-id: 6a3a7128514baf6d92b1607308c40339469d0066
2021-10-18 11:09:17 -07:00
eb1eefc399 [PyTorch] Fix unnecessary shared_ptr copies in DictType (#66702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66702

Missing moves in the construction path and forced copies of the key & value type on access.
ghstack-source-id: 140744707

Test Plan: CI

Reviewed By: suo

Differential Revision: D31693818

fbshipit-source-id: 4c5d2359f58148744621abe81429e56e7889f754
2021-10-18 11:05:25 -07:00
09c4e73c95 [PyTorch] Fix unnecessary shared_ptr copies in FutureType (#66704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66704

Missing moves in the construction path.
ghstack-source-id: 140746391

Test Plan: CI

Reviewed By: suo

Differential Revision: D31694296

fbshipit-source-id: 3bed477c811069248611efdb57ad27c6ca233442
2021-10-18 11:01:00 -07:00
62e89f692f [doc] typo (#66754)
Summary:
This PR fixes a typo in the `torch/autograd/function.py` doc

-----------------------

Additionally, the example at https://pytorch.org/docs/master/autograd.html#torch.autograd.Function doesn't quite compile:
```
'builtin_function_or_method' object has no attribute 'exp'
```
even though `i.exp()` is a valid function if `i` is a tensor.

I changed it to:
```
result = torch.exp(i)
```
but python doesn't like it either:
```
TypeError: exp(): argument 'input' (position 1) must be Tensor, not builtin_function_or_method
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66754

Reviewed By: albanD

Differential Revision: D31729400

Pulled By: soulitzer

fbshipit-source-id: eef783bcdc8d4693a8b7f1ab581e948abc0f9b94
2021-10-18 10:33:56 -07:00
f4a7273b5c Set test owners for module: ci (#66796)
Summary:
Action based on RFC https://github.com/pytorch/pytorch/issues/66232

cc seemethere malfet pytorch/pytorch-dev-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66796

Reviewed By: seemethere

Differential Revision: D31732391

Pulled By: janeyx99

fbshipit-source-id: b894eab8a4a8737165d1ba7b536e1232f6c07a8f
2021-10-18 10:29:50 -07:00
8532061bce [sharded_tensor] support gloo/mpi backend in tests (#65855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65855

This adjusts our test base to support non-NCCL backends like gloo/mpi, so that we can test sharding on CPU with the gloo/mpi backends.
ghstack-source-id: 140840866

Test Plan: wait for the CI for existing tests, also adding tests in the stacked diff above.

Reviewed By: pritamdamania87, bowangbj

Differential Revision: D31287162

fbshipit-source-id: d48dfc8ef886a4d34b1de42f3ce6b600b5c9a617
2021-10-18 10:17:59 -07:00
d549c8de78 fx quant: enable linear-bn1d fusion for PTQ (#66484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66484

https://github.com/pytorch/pytorch/pull/50748 added linear - bn1d fusion
in Eager mode, for PTQ only. This PR also enables this in FX graph mode.

We reuse the existing conv-bn-relu fusion handler, renaming `conv` to
`conv_or_linear` for readability.

The QAT version is saved for a future PR, for both eager and FX graph.
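
A minimal sketch of the newly fused pattern (using the `torch.quantization.quantize_fx` namespace of this era):

```python
import torch
from torch.quantization.quantize_fx import fuse_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.bn = torch.nn.BatchNorm1d(4)

    def forward(self, x):
        return self.bn(self.linear(x))

m = fuse_fx(M().eval())  # Linear + BatchNorm1d folded into a single Linear
```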

Test Plan:
```
python test/test_quantization.py TestFuseFx.test_fuse_linear_bn_eval
```

Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D31575392

fbshipit-source-id: f69d80ef37c98cbc070099170e335e250bcdf913
2021-10-18 10:14:28 -07:00
9d287d0b63 [fx2trt]Add support for negative dim in softmax (#66760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66760

Previously we didn't convert negative dim to positive dim.

Test Plan: WIP

Reviewed By: wushirong

Differential Revision: D31703127

fbshipit-source-id: 6d5ccecab45b46f867a05ee70c76a5980e41011d
2021-10-18 09:03:56 -07:00
aa7da7b09c [quant][embedding qat] Enable quint4 in EmbeddingBag QAT workflow (#66348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66348

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31691300

Pulled By: b-koopman

fbshipit-source-id: 11bd75b608b972394fe9f7c9b7bf034af42f28b5
2021-10-18 08:51:39 -07:00
909694fd88 Fix nn.functional.max_poolNd dispatch (for arg: return_indices) (#62544)
Summary:
Please see https://github.com/pytorch/pytorch/issues/62545 for context.

The order of `return_indices, ceil_mode` differs between the `nn.functional.max_poolNd` functions and `torch.nn.MaxPoolNd` (the module form). While this should be resolved in the future, it was decided to first raise a warning that the behavior will change in the future (please see https://github.com/pytorch/pytorch/pull/62544#issuecomment-893770955 for more context).

This PR thus raises appropriate warnings and updates the documentation to show the full signature (along with a note) for `torch.nn.functional.max_poolNd` functions.
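
Concretely, the two signatures disagree on the position of the last two flags, so these should be passed by keyword (a minimal sketch, as shown below):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
# functional: max_pool2d(input, kernel_size, stride, padding, dilation,
#                        ceil_mode, return_indices)
# module:     MaxPool2d(kernel_size, stride, padding, dilation,
#                       return_indices, ceil_mode)  <- last two flags swapped
out, idx = F.max_pool2d(x, 2, return_indices=True, ceil_mode=False)
```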

**Quick links:**

(_upstream_)

* Documentation of [`nn.functional.max_pool1d`](https://pytorch.org/docs/1.9.0/generated/torch.nn.functional.max_pool1d.html), [`nn.functional.max_pool2d`](https://pytorch.org/docs/stable/generated/torch.nn.functional.max_pool2d.html), and [`nn.functional.max_pool3d`](https://pytorch.org/docs/stable/generated/torch.nn.functional.max_pool3d.html).

(_this branch_)

* Documentation of [`nn.functional.max_pool1d`](https://docs-preview.pytorch.org/62544/generated/torch.nn.functional.max_pool1d.html?highlight=max_pool1d), [`nn.functional.max_pool2d`](https://docs-preview.pytorch.org/62544/generated/torch.nn.functional.max_pool2d.html?highlight=max_pool2d#torch.nn.functional.max_pool2d), and [`nn.functional.max_pool3d`](https://docs-preview.pytorch.org/62544/generated/torch.nn.functional.max_pool3d.html?highlight=max_pool3d#torch.nn.functional.max_pool3d).

cc mruberry jbschlosser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62544

Reviewed By: gchanan

Differential Revision: D31179038

Pulled By: jbschlosser

fbshipit-source-id: 0a2c7215df9e132ce9ec51448c5b3c90bbc69030
2021-10-18 08:34:38 -07:00
e4a9ee8d42 Deduplicate codegenOutputQuery to query maximum CUDA compute capabilities (#55901)
Summary:
There were two versions of the same code which were slightly different, although functionally equivalent.
When adding support for another CUDA / device version, both would need to be changed and kept in sync, so it is better to have a single version as the unique source of truth.

I chose the implementation which looks cleaner and easier to read and added some minor enhancements and comments to further increase readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55901

Reviewed By: H-Huang

Differential Revision: D31636917

Pulled By: bertmaher

fbshipit-source-id: 622e1fabc39de4f3f1b1aa9a1544cfbd35a5cfd9
2021-10-18 07:42:15 -07:00
811f5a2b94 Adding StreamWrapper to ensure file object will be closed (#66715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66715

Adding StreamWrapper to streams produced by DataPipes within PyTorch Core and TorchData

Test Plan: OSS CI and Internal Tests

Reviewed By: ejguan

Differential Revision: D31695248

fbshipit-source-id: c26fa1bc1688d5597851ad265f667fafdcd64c59
2021-10-18 07:31:32 -07:00
0d203a16fe Add relative and absolute tolerances for matrix_rank, pinv (#63102)
Summary:
This pull request introduces new keyword arguments for `torch.linalg.matrix_rank` and `torch.linalg.pinv`: `atol` and `rtol`.

Currently, only the tensor overload has default values for either `atol` or `rtol`; the float overload requires both arguments to be specified.

FC compatibility: https://github.com/pytorch/pytorch/pull/63102#discussion_r710930509
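
A minimal sketch of the new keyword arguments:

```python
import torch

A = torch.randn(5, 5)
A[-1] = A[0]  # make A rank-deficient
r = torch.linalg.matrix_rank(A, atol=1e-7, rtol=1e-5)  # typically 4 here
P = torch.linalg.pinv(A, rtol=1e-5)
```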

Fixes https://github.com/pytorch/pytorch/issues/54151. Fixes https://github.com/pytorch/pytorch/issues/66618.

cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63102

Reviewed By: H-Huang

Differential Revision: D31641456

Pulled By: mruberry

fbshipit-source-id: 4c765508ab1657730703e42975fc8c0d0a60eb7c
2021-10-17 22:15:42 -07:00
53aac4b6f3 [PyTorch] Allow override for macro HAS_DEMANGLE (#66540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66540

Currently the macro `HAS_DEMANGLE` is determined by compiler predefined macros. Here I'm adding an option to allow `HAS_DEMANGLE` to be defined in build files.

Test Plan: Rely on CI

Reviewed By: poweic

Differential Revision: D31600007

fbshipit-source-id: 76cf088b0f5ee940e977d3b213f1446ea64be036
2021-10-17 16:10:45 -07:00
3b4cb9ddca Revert D31577488: Migrate THCState to ATen
Test Plan: revert-hammer

Differential Revision:
D31577488 (65adf1dfa2)

Original commit changeset: 90604f30854f

fbshipit-source-id: 3d7e35b3d6ea94f2c999bcf821b33a9cf1db01ee
2021-10-16 21:51:36 -07:00
719d43a2a2 Revert D31547709: Remove THCGeneral.cpp
Test Plan: revert-hammer

Differential Revision:
D31547709 (aa0c31876b)

Original commit changeset: 059c47621863

fbshipit-source-id: e8c3597f2badbc5ecf356b381edea06a07331f24
2021-10-16 21:50:19 -07:00
8854817f44 Implement Python Array API asarray function. (#60627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60627

In this PR, the core of `frombuffer` and `fromDLPack` is refactored into _tensor_new.cpp_. `asarray`
uses these refactored functions for interpreting the object as a tensor. We follow the
Python Array API standard found at:

https://data-apis.org/array-api/latest/API_specification/creation_functions.html?highlight=asarray
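
A minimal sketch of the resulting API:

```python
import numpy as np
import torch

t = torch.asarray(np.arange(4))                  # shares memory with the ndarray when possible
u = torch.asarray([1.0, 2.0], dtype=torch.float64)
v = torch.asarray(t, copy=True)                  # force a copy
```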

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31640510

Pulled By: mruberry

fbshipit-source-id: d0869e0d73cb50023d5866b001dac5d34ca30dfd
2021-10-16 21:11:31 -07:00
9e3a2babfa Make aotCompile support multiple input sizes (#66727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66727

Make aotCompile support multiple input sizes

Test Plan:
Able to compile and run a model with multiple inputs
```
(pytorch)  ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ PYTORCH_JIT_LOG_LEVEL=aot_compiler buck run //caffe2/binaries:aot_model_compiler -- --model aot_test_model.pt --model_name=aot_test_model --model_version=v1 --input_dims="2,2,2;2,2,2"
Building: finished in 3.2 sec (100%) 7461/7461 jobs, 0/7461 updated
  Total time: 3.4 sec
BUILD SUCCEEDED
[DUMP aot_compiler.cpp:097] graph before shape propagation
[DUMP aot_compiler.cpp:097] graph(%x.1 : Tensor,
[DUMP aot_compiler.cpp:097]       %y.1 : Tensor):
[DUMP aot_compiler.cpp:097]   %3 : int = prim::Constant[value=1]() # :0:0
[DUMP aot_compiler.cpp:097]   %4 : Tensor = aten::add(%x.1, %y.1, %3) # /data/users/priyaramani/fbsource/fbcode/caffe2/test/mobile/nnc/aot_test_model.py:10:15
[DUMP aot_compiler.cpp:097]   return (%4)
(1,.,.) =
  0.3357  0.6137
  0.8472  0.0858

(2,.,.) =
  0.8406  0.2959
  0.6012  0.7184
[ CPUFloatType{2,2,2} ]
(1,.,.) =
  0.7086  0.6398
  0.0579  0.1913

(2,.,.) =
  0.8598  0.3641
  0.5925  0.0200
[ CPUFloatType{2,2,2} ]
here
2
2
graph 0x6130001ee2d0
[DUMP aot_compiler.cpp:118] graph after shape propagation
[DUMP aot_compiler.cpp:118] graph(%x.1 : Float(2, 2, 2, strides=[4, 2, 1], requires_grad=0, device=cpu),
[DUMP aot_compiler.cpp:118]       %y.1 : Float(2, 2, 2, strides=[4, 2, 1], requires_grad=0, device=cpu)):
[DUMP aot_compiler.cpp:118]   %3 : int = prim::Constant[value=1]() # :0:0
[DUMP aot_compiler.cpp:118]   %4 : Tensor(2, 2, 2) = aten::add(%x.1, %y.1, %3) # /data/users/priyaramani/fbsource/fbcode/caffe2/test/mobile/nnc/aot_test_model.py:10:15
[DUMP aot_compiler.cpp:118]   return (%4)
The compiled llvm assembly code was saved to aot_test_model.compiled.ll
The compiled model was saved to aot_test_model.compiled.pt

└─ $ ./compile_model.sh -m aot_test_model -p /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt -v v1 -i "2,2,2;2,2,2"
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL=aot_test_model
+ getopts m:p:v:i:h opt
+ case $opt in
+ MODEL_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt
+ getopts m:p:v:i:h opt
+ case $opt in
+ VERSION=v1
+ getopts m:p:v:i:h opt
+ case $opt in
+ INPUT_DIMS='2,2,2;2,2,2'
+ getopts m:p:v:i:h opt
+ require_arg m aot_test_model
+ '[' -n aot_test_model ']'
+ require_arg p /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt
+ '[' -n /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt ']'
+ require_arg i '2,2,2;2,2,2'
+ '[' -n '2,2,2;2,2,2' ']'
+ '[' '!' -f /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt ']'
+++ dirname ./compile_model.sh
++ cd .
++ pwd -P
+ SRC_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc
+ FBCODE_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../..
+ FBSOURCE_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../../..
+ KERNEL_DIR=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../../../xplat/pytorch_models/build/aot_test_model/v1/nnc
++ echo /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.pt
++ sed 's/.pt.*//'
+ MODEL_PATH_PREFIX=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model
+ LLVM_CODE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.compiled.ll
+ ASSEMBLY_CODE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.compiled.s
+ COMPILED_MODEL_FILE_PATH=/data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.compiled.pt
+ KERNEL_FUNC_NAME=nnc_aot_test_model_v1_forward
+ cd /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/../../../..
+ buck run //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc -- --model /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.compiled.pt --print_output true --input_dims '2,2,2;2,2,2' --input_type 'float;float' --input_memory_format 'contiguous_format;contiguous_format'
clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 1/4 artifacts, 2.11 Kbytes, 50.0% cache miss (for updated rules)
Building: finished in 12.2 sec (100%) 4572/4572 jobs, 3/4572 updated
  Total time: 12.2 sec
BUILD SUCCEEDED
Run with 56 threads
Run with 56 threads
Loading model...
Model loaded: /data/users/priyaramani/fbsource/fbcode/caffe2/fb/nnc/aot_test_model.compiled.pt
Running forward ...
(1,.,.) =
 -0.7451 -0.7451
 -0.7451 -0.7451

(2,.,.) =
 -0.7451 -0.7451
 -0.7451 -0.7451
[ CPUFloatType{2,2,2} ]
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Milliseconds per iter: 0.0887. Iters per second: 11274
Memory usage before main runs: 71262208 bytes
Memory usage after main runs: 71573504 bytes
Average memory increase per iter: 31129.6 bytes
0 value means "not available" in above
```

Reviewed By: ljk53

Differential Revision: D31631975

fbshipit-source-id: 7956787b3e121f9c14f4733398a64c2f7ae84373
2021-10-16 20:04:52 -07:00
962c6476da Refactor: move method to func compilation work to compileMethod, add option to specify method name (#66726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66726

Move method to func compilation work to compileMethod

Test Plan:
Mobilenetv3 compiles and runs successfully
```
(pytorch)  ~/fbsource/fbcode/caffe2/fb/nnc
└─ $ buck run //caffe2/binaries:aot_model_compiler -- --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="1,3,224,224"
Downloaded 0/4 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 13.2 sec (100%) 18719/18719 jobs, 2/18719 updated
  Total time: 13.5 sec
BUILD SUCCEEDED
The compiled llvm assembly code was saved to mobilenetv3.compiled.ll
The compiled model was saved to mobilenetv3.compiled.pt
```

Reviewed By: ljk53, IvanKobzarev

Differential Revision: D31624342

fbshipit-source-id: 233a6e94ea05ba8d6fc166d2414034c9e58cb076
2021-10-16 20:03:24 -07:00
aa0c31876b Remove THCGeneral.cpp (#66391)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66391

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31547709

Pulled By: ngimel

fbshipit-source-id: 059c47621863738fb560f4257e7765afa9b952aa
2021-10-16 14:53:52 -07:00
8c5928bd78 add frozen_numpy as a builtin library to torch::deploy (#66297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66297

Linking register_numpy.cpp with the embedded interpreter registers numpy as a builtin library.

Test Plan: Add a unit test exercising basic numpy functionality in torch::deploy, such as creating random matrices and matrix multiplication.

Reviewed By: suo

Differential Revision: D31490434

fbshipit-source-id: b052ce01fc64fb0efee846feb0acc1f107ba13e0
2021-10-15 21:48:24 -07:00
42f138469a [TS] Return early if device doesn't match (#66694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66694

`lhs.equal(rhs)` would throw if the devices don't match. To avoid that, we return early when the devices differ.

Test Plan: CI

Reviewed By: houseroad

Differential Revision: D31691608

fbshipit-source-id: 513c3e0743a65d9778c7ef9b79ececfeaccc0017
2021-10-15 18:13:46 -07:00
32ac001e4d Suppress deprecated copy in vec256_qint.h (#66646)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66646

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31660387

fbshipit-source-id: a1ea9702a8b33f78a7201a1d9214065c2fb930b1
2021-10-15 17:14:15 -07:00
65adf1dfa2 Migrate THCState to ATen (#66480)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66480

This guts `THCState` to simply be an empty struct, as well as:
- moving `THCState_getPeerToPeerAccess` and its cache into `ATen`.
- cleaning up dead code in `THCGeneral.cpp`
- moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31577488

Pulled By: ngimel

fbshipit-source-id: 90604f30854fe766675baa3863707ac09995bc9e
2021-10-15 17:05:04 -07:00
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
1e2b2ee5ff sort_out_cuda: Use custom kernels to fill index tensors (#66668)
Summary:
These stable sorts currently use a combination of `at::arange`, view ops and `tensor.copy_` to fill in the initial values for the indices before calling into `CUB` to do the actual sort. This is somewhat inefficient because it requires 2 to 4 kernel launches, and the copies all use strided kernels instead of the more efficient contiguous kernels. Instead, a fairly straight-forward custom kernel is more efficient in terms of both CUDA and CPU runtime.
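
Roughly, the values being materialized correspond to this composite-op formulation (an illustration of the old approach, not the new kernel):

```python
import torch

a = torch.rand(4, 6)
# Index tensor for a stable sort along dim=1: 0..n-1 in every row.
idx = torch.arange(a.size(1)).expand_as(a).contiguous()
print(idx[0])  # tensor([0, 1, 2, 3, 4, 5])
```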

In a simple benchmark I profiled `a.sort(stable=True, dim=1)` for different shapes and singled out the kernel invocations for initializing the index tensors (i.e. the non-`cub` kernels). Note that when the batch dim is `<128` we call `segmented_sort_pairs_by_full_sort` instead of `segmented_sort_pairs`:

| shape        | Master (us) | This PR (us) |
|--------------|:-----------:|:------------:|
| (100, 1000)  |    5.000    |     2.300    |
| (1000, 100)  |    2.070    |     1.090    |
| (100, 10000) |    87.34    |     26.47    |
| (1000, 1000) |    28.63    |     20.27    |

Of course, for sufficiently large inputs the overall runtime is dominated by the actual sort. But I have another motive: removing the operator calls from the middle of this kernel launch code. This change makes it easier to split the kernel code that needs to be compiled with `nvcc` into its own file that doesn't include `Tensor.h`, similar to what I'm doing in https://github.com/pytorch/pytorch/issues/66620.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66668

Reviewed By: H-Huang

Differential Revision: D31693722

Pulled By: ngimel

fbshipit-source-id: 5765926e4dbbc7a20d2940c098ed093b3de2204e
2021-10-15 15:13:02 -07:00
9ba39d2008 Clean up test running scripts (#65508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65508

This has some misc cleanups for the code that happens before `run_test.py`:

* remove hardcoding of 2 shards
* add `set -eux` in some places

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D31296509

Pulled By: driazati

fbshipit-source-id: 2df1463432846d8a4d8a579812a4e9c3b7c2b957
2021-10-15 14:36:32 -07:00
2c761caaaa [Vulkan] cat operator for channel dimension (#66669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66669

Implemented `cat` operator for channel dimension

**Facts:**
* texture coordinate: x(width), y(height), z(depth)
* input x, y, z -> no change
* out x, y -> no change
* out z and index i, j only matter

**Equations:**
```
batch_size = bt0 (or bt1 or bt2 or ...) = # of batches for tensor i
ch_size = ch0 (or ch1 or ch2 or ...) = # of channels for tensor i
ch_interval = ch0 + ch1 + ch2 + ... = total # of channels for all tensors
ch_size_allprior = ch0 (or ch0+ch1 or ch0+ch1+ch2 or ...) = # of channels for tensors 0 to i-1 where pos.z = d (input)
i = index of input texel = vec4[i] of texel at posIn(x,y,z) on input texture
j = index of output texel = vec4[j] of texel at posOut(x',y',z') on output texture

posIn[i] = {x,y,z} at ith index of vec4
src_index = posIn.z * 4 + i
dst_index = int(src_index / ch_size) * ch_interval + (src_index % ch_size) + ch_size_allprior
d = posOut.z = int(dst_index / 4)
j = (dst_index % 4)
posOut[j] = {posIn.x, posIn.y, d} at jth index of vec4
```

**Shader pseudo code:**
```
posOut = posIn;
for (i = 0; i < 4; ++i) {
  src_index = posIn.z * 4 + i;
  if (src_index >= ch_size * batch_size) break; // out of range
  dst_index = int(src_index / ch_size) * ch_interval + (src_index % ch_size) + ch_size_allprior;
  posOut.z = int(dst_index / 4);
  j = (dst_index % 4);
  uOutput[j] = uInput[i]
}
```

Test Plan:
Test build on Android:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Test result:
```
[ RUN      ] VulkanAPITest.cat_dim1_samefeature_success
[       OK ] VulkanAPITest.cat_dim1_samefeature_success (101 ms)
[ RUN      ] VulkanAPITest.cat_dim1_difffeature_success
[       OK ] VulkanAPITest.cat_dim1_difffeature_success (81 ms)
[ RUN      ] VulkanAPITest.cat_dim1_texture2d_success
[       OK ] VulkanAPITest.cat_dim1_texture2d_success (2 ms)
[ RUN      ] VulkanAPITest.cat_dim1_singledepth_success
[       OK ] VulkanAPITest.cat_dim1_singledepth_success (6 ms)
[ RUN      ] VulkanAPITest.cat_dim1_singletensor_success
[       OK ] VulkanAPITest.cat_dim1_singletensor_success (21 ms)
[ RUN      ] VulkanAPITest.cat_dim1_twotensors_success
[       OK ] VulkanAPITest.cat_dim1_twotensors_success (53 ms)
[ RUN      ] VulkanAPITest.cat_dim1_bat1_ch4multiple_success
[       OK ] VulkanAPITest.cat_dim1_bat1_ch4multiple_success (17 ms)
[ RUN      ] VulkanAPITest.cat_dim2_sameheight_success
[       OK ] VulkanAPITest.cat_dim2_sameheight_success (83 ms)
[ RUN      ] VulkanAPITest.cat_dim2_diffheight_success
[       OK ] VulkanAPITest.cat_dim2_diffheight_success (86 ms)
[ RUN      ] VulkanAPITest.cat_dim2_singledepth_success
[       OK ] VulkanAPITest.cat_dim2_singledepth_success (5 ms)
[ RUN      ] VulkanAPITest.cat_dim2_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_dim2_invalidinputs_exceptions (82 ms)
```

Reviewed By: SS-JIA

Differential Revision: D31593623

fbshipit-source-id: e52dc57985e3f0bb9b20313d4fcc7248a436e863
2021-10-15 14:25:19 -07:00
06cfdfae0e Promote integral inputs to floating for torch.logsumexp (#63393)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56132: integral inputs of `torch.logsumexp` are now promoted to the floating point type.
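
A quick sketch of the resulting behavior (assuming the default float dtype):

```python
import torch

x = torch.tensor([1, 2, 3])        # integral (int64) input
out = torch.logsumexp(x, dim=0)    # now returns a floating-point result
print(out.dtype)                   # torch.float32 with the default dtype
```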

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63393

Reviewed By: ezyang

Differential Revision: D30512180

Pulled By: mruberry

fbshipit-source-id: fbde3605c15b930411d0d1eb3a132b0088187097
2021-10-15 14:20:50 -07:00
67e003f09b [Static Runtime] Determine function for ProcessedNode::run() statically (#66692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66692

Currently `ProcessedNode::run()` performs 2 dynamic dispatches to decide which function implementation to execute, depending on whether the function is an out variant, a native function, or an interpreter fallback. Note that this happens every time an operation is executed by Static Runtime.

This change makes *that* same decision once, at module loading time, so that we can remove 1 dynamic dispatch cost at runtime.
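
In spirit, the change replaces per-run branching with a callable chosen once at load time; a Python sketch with made-up names (the real code stores a `std::function` inside `ProcessedNode`):

```python
class ProcessedNodeSketch:
    def __init__(self, kind, out_variant_fn, native_fn, fallback_fn):
        # Decided once, when the module is loaded.
        if kind == "out_variant":
            self._fn = out_variant_fn
        elif kind == "native":
            self._fn = native_fn
        else:
            self._fn = fallback_fn

    def run(self, *args):
        # No branching on `kind` in the hot path.
        return self._fn(*args)
```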

**size reduction**

Saving 4 bytes per `ProcessedNode`.

- Before: sizeof(c10::variant<OutVariant, NativeFunction, Operation>):40

- After: sizeof(std::function<void(ProcessedNode*)>): 32 + sizeof(FunctionKind):4 = 36

**latency optimization**

Expected to remove 2 memory loads & 1 conditional jump per `ProcessedNode::run()` execution (needs to be confirmed from compiled binary code).

Ran `ptvsc2_predictor_bench` with `inline_cvr` with 1000 iterations:
- local : 7.56026 -> 7.24794
- local_ro: 1.5799 -> 1.55504
- remote_ro: 10.6464 -> 10.3017

Test Plan: Ran existing unittests

Reviewed By: swolchok

Differential Revision: D31591785

fbshipit-source-id: 5de83ca386af509381e08ecedf071ee4e9f0f0b0
2021-10-15 14:07:24 -07:00
d1b6121935 Revert D31656999: Add meta support to tensor range factories
Test Plan: revert-hammer

Differential Revision:
D31656999 (7400f34b8e)

Original commit changeset: 06e7f3655b94

fbshipit-source-id: 2f9d8d1acbb01c5105ece73472e5c1f5f90886ee
2021-10-15 14:03:04 -07:00
a25648953c Add warn_only kwarg to use_deterministic_algorithms (#66233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64883

Adds a `warn_only` kwarg to `use_deterministic_algorithms`. When enabled, calling an operation that does not have a deterministic implementation will emit a warning rather than raise an error.
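
For example:

```python
import torch

# Previous behavior: nondeterministic ops raise a RuntimeError.
torch.use_deterministic_algorithms(True)

# New opt-in behavior: nondeterministic ops emit a warning instead.
torch.use_deterministic_algorithms(True, warn_only=True)
```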

`torch.testing._internal.common_device_type.expectedAlertNondeterministic` is also refactored and documented in this PR to make it easier to use and understand.

cc mruberry kurtamohler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66233

Reviewed By: bdhirsh

Differential Revision: D31616481

Pulled By: mruberry

fbshipit-source-id: 059634a82d54407492b1d8df08f059c758d0a420
2021-10-15 13:54:59 -07:00
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
b5b7d6a3a6 EmbeddingBackward exclusive_scan thrust->cub (#66566)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66566

Reviewed By: H-Huang

Differential Revision: D31637660

Pulled By: ngimel

fbshipit-source-id: 8093432bb9a9b902bb6bab7da221f0bcd7e9fb34
2021-10-15 13:46:30 -07:00
bd25f92e81 Fix Wextra issues in Half.h (#66643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66643

Fixes:
```
caffe2/c10/util/Half.h:456:14: error: comparison of integers of different signs: 'long' and 'unsigned long' [-Werror,-Wsign-compare]
    return f > limit::max() ||
           ~ ^ ~~~~~~~~~~~~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31656816

fbshipit-source-id: 7623d20e166a9e95a949ebd8b23793f24960cf07
2021-10-15 13:38:10 -07:00
abc022f9c8 Fix torch.cholesky deprecation warning (#66645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66645

Fixes:
```
test_cholesky_solve_batched_broadcasting_cpu_complex128 (__main__.TestLinalgCPU) ... test_linalg.py:3099: UserWarning: torch.cholesky is deprecated in favor of torch.linalg.cholesky and will be removed in a future PyTorch release.
```
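
The replacement is mechanical; for example (illustrative only):

```python
import torch

a = torch.eye(3)
L_old = torch.cholesky(a)         # deprecated, emits the warning above
L_new = torch.linalg.cholesky(a)  # preferred replacement
```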

Test Plan: Sandcastle

Reviewed By: mruberry

Differential Revision: D31635851

fbshipit-source-id: c377eb88d753fb573b3947f0c6ff5df055cb13d8
2021-10-15 13:24:58 -07:00
0b8dc0f04a add BFloat16 operators on CPU: logaddexp, logaddexp2, remainder (#63621)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63621

Reviewed By: H-Huang

Differential Revision: D31640811

Pulled By: mruberry

fbshipit-source-id: 1fd061b65c196398738018eefc52bf459e424b1c
2021-10-15 13:11:45 -07:00
a58852fd44 Fix fx2trt broken unit test (#66696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66696

D31511082 (9918fd8305) moved a unit test but didn't add the proper target in the build file; this diff fixes that.

Test Plan: buck test mode/opt caffe2/test/fx2trt/converters/...

Reviewed By: 842974287

Differential Revision: D31667697

fbshipit-source-id: 49e04afa323b27a1408c9bc2b5061b6529ced985
2021-10-15 12:56:12 -07:00
e48a4cbf64 Make several methods of SharedParserData private (#66670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66670

Reviewed By: zhxchen17

Differential Revision: D31674377

Pulled By: gmagogsfm

fbshipit-source-id: 5c73b78f842c5c4305047ca98f40bf99bd3d2d60
2021-10-15 12:43:45 -07:00
e88d1c4f10 [PyTorch] Add tuple inline storage (#64066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066

I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
ghstack-source-id: 140695395

Test Plan:
Added automated tests for TupleElements.

Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30592622

fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
2021-10-15 12:16:51 -07:00
f8f9a47b02 PR3: add a workaround for reference path (#66535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66535

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31676400

Pulled By: rahxephon89

fbshipit-source-id: fd4c8e9bbc82930cc1255fb8bf8d8ac7f0934c3f
2021-10-15 11:56:11 -07:00
7400f34b8e Add meta support to tensor range factories (#66630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66630

This PR adds meta backend support to the `range`, `arange`, `linspace`, and `logspace` operators.
ghstack-source-id: 140618055

Test Plan: Extended the existing tensor creation tests to assert meta backend support.

Reviewed By: ezyang

Differential Revision: D31656999

fbshipit-source-id: 06e7f3655b94c0d85a28bcd0ca61d9f9ce707f1d
2021-10-15 11:17:08 -07:00
6436bd3d5d Clarify topk doc (#65938)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50331
<img width="855" alt="Screen Shot 2021-10-01 at 11 23 23 AM" src="https://user-images.githubusercontent.com/17888388/136036611-f2bd9c77-61b4-4ab8-85eb-44f50c1e03d7.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65938

Reviewed By: bdhirsh

Differential Revision: D31314875

Pulled By: samdow

fbshipit-source-id: bdd9425fd748710f8a64ed1989e1938dd358780f
2021-10-15 11:15:48 -07:00
2506baf9c2 [ONNX] move CheckerError from torch.onnx.utils to torch.onnx (#66644)
Summary:
This moves it to where the user would expect it to be based on the
documentation and all the other public classes in the torch.onnx module.

Also rename it from ONNXCheckerError, since the qualified name
torch.onnx.ONNXCheckerError is otherwise redundant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66644

Reviewed By: malfet

Differential Revision: D31662559

Pulled By: msaroufim

fbshipit-source-id: bc8a57b99c2980490ede3974279d1124228a7406
2021-10-15 10:38:56 -07:00
3a9259f6cf [TensorExpr] Add missing schema for aten::where and aten::pow lowerings. (#66688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66688

Differential Revision: D31689431

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 6b3abb4471170ff5418f72bb700325711e7bd28f
2021-10-15 10:14:43 -07:00
06c37876b8 torch.linalg.householder_product faster backward (#63880)
Summary:
This PR implements a much more efficient algorithm, which achieves massive speed-ups, especially for batched and/or larger double-precision inputs.
Here are some benchmarks:

<details>

<summary>Testing script</summary>

```python
from IPython import get_ipython
import torch
import itertools

torch.manual_seed(13)
#torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def generate_input(shape, dtype=torch.double, device=cpu):
    eigvals = torch.rand(*shape[:-1], dtype=dtype, device=device)
    eigvecs = torch.rand(*shape, dtype=dtype, device=device)
    input = (eigvecs * eigvals.unsqueeze(-2)) @ eigvecs.inverse()
    input.requires_grad_(True)
    tau = torch.rand(*shape[:-1], dtype=dtype, device=device)
    tau.requires_grad_(True)
    return input, tau

def run_test(shape, device, dtype):
    print(f"shape: {shape}, device: {device}, dtype: {dtype}")
    a, tau = generate_input(shape, dtype=dtype, device=device)
    prod = torch.linalg.householder_product(a, tau)
    ones_prod = torch.ones_like(prod)

    command = "torch.autograd.backward((prod,), (ones_prod), retain_graph=True)"
    if device == cuda:
        command = command + "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

dtypes = [torch.float, torch.double]
devices = [cpu, cuda]
#devices = [cuda]
sizes = [
    (10, 10),
    (1000, 10, 10),
    (100, 100),
    (1000, 100, 100),
    (1000, 1000),
    (10, 1000, 1000),
]

for device, dtype, size in itertools.product(devices, dtypes, sizes):
    run_test(size, device, dtype)

```

</details>

<details>

<summary>This PR, cuda float32</summary>

```
shape: (10, 10), device: cuda, dtype: torch.float32
1.33 ms ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float32
1.52 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (100, 100), device: cuda, dtype: torch.float32
10.8 ms ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float32
127 ms ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 1000), device: cuda, dtype: torch.float32
151 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float32
981 ms ± 91.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

```

</details>

<details>

<summary>Master, cuda float32</summary>

```
shape: (10, 10), device: cuda, dtype: torch.float32
1.64 ms ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float32
298 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (100, 100), device: cuda, dtype: torch.float32
15.4 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float32
5.36 s ± 711 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cuda, dtype: torch.float32
1.64 s ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float32
15.7 s ± 43.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```

</details>

<details>

<summary>This PR, cuda float64</summary>

```
shape: (10, 10), device: cuda, dtype: torch.float64
1.14 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float64
2.22 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cuda, dtype: torch.float64
10.6 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float64
287 ms ± 84.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cuda, dtype: torch.float64
236 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float64
1.88 s ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

<details>

<summary>Master, cuda float64</summary>

```
shape: (10, 10), device: cuda, dtype: torch.float64
1.58 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float64
308 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (100, 100), device: cuda, dtype: torch.float64
79 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float64
54.2 s ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cuda, dtype: torch.float64
31.5 s ± 698 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float64
4min 45s ± 2.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

<details>

<summary>This PR, cpu float32</summary>

```
shape: (10, 10), device: cpu, dtype: torch.float32
476 µs ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float32
5.1 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cpu, dtype: torch.float32
4.38 ms ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float32
1.55 s ± 6.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cpu, dtype: torch.float32
745 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float32
5.44 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

<details>

<summary>Master, cpu float32</summary>

```
shape: (10, 10), device: cpu, dtype: torch.float32
387 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float32
12.3 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cpu, dtype: torch.float32
39.4 ms ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float32
29.1 s ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cpu, dtype: torch.float32
9.42 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float32
1min 50s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

<details>

<summary>This PR, cpu float64</summary>

```
shape: (10, 10), device: cpu, dtype: torch.float64
381 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float64
6.19 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cpu, dtype: torch.float64
4.6 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float64
2.59 s ± 8.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cpu, dtype: torch.float64
1.07 s ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float64
14.4 s ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

<details>

<summary>Master, cpu float64</summary>

```
shape: (10, 10), device: cpu, dtype: torch.float64
395 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float64
14.6 ms ± 9.76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cpu, dtype: torch.float64
45.5 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float64
33.1 s ± 69.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (1000, 1000), device: cpu, dtype: torch.float64
19.3 s ± 80.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float64
3min 30s ± 1.29 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63880

Reviewed By: soulitzer

Differential Revision: D30639435

Pulled By: anjali411

fbshipit-source-id: 127789943ae56e2f1dd03e0fe76ef7b6db86bcf0
2021-10-15 09:54:30 -07:00
65e25256c3 [ROCm] enable test_distributed() in test.sh (#66657)
Summary:
Restores tests for ROCm CI that used to run prior to https://github.com/pytorch/pytorch/issues/63147.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66657

Reviewed By: soulitzer

Differential Revision: D31668379

Pulled By: malfet

fbshipit-source-id: 91a6f6c63d6c957cc5821edbd33d4c16eecc8c0a
2021-10-15 09:45:11 -07:00
8a01bbd64a add flatten parameter module (#66578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66578

Flatten parameters for performance optimization, and handle the cases where the gradient-ready order differs across ranks or there are unused parameters among ranks. When there is no param to be sharded in the FSDP instance (usually the root), the flatten wrapper module's flat_param is None.
ghstack-source-id: 140696745

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D31625194

fbshipit-source-id: c40e84f9154f5703e5bacb02c37c59d6c4e055c7
2021-10-15 09:37:26 -07:00
a3d12bcdf9 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31681115

fbshipit-source-id: e2146e59a57ff27759de18b00fb644e9dc3c5672
2021-10-15 03:07:57 -07:00
76efbccc3b [PyTorch Edge][tracing-based] Unify tracer between internal and external (#64152)
Summary:
As title: introduce the file `TracerRunner`, shared by the internal and external tracers. Its main function is
```
TracerResult trace_run(const std::string& input_module_path);
```
which takes the path to the model file and generates the trace result. The main differences between the external and internal tracers are:
1. the dependency on `<yaml-cpp/yaml.h>`;
2. the output yaml file from the internal tracer includes `model_version` and `model_asset`, which are only needed internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64152

ghstack-source-id: 140692467

Test Plan:
```
./build/bin/model_tracer --model_input_path "/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_with_bundled_input.ptl" --build_yaml_path  "/Users/chenlai/Documents/pytorch/tracing/tmp.yaml"
```
```
./fbcode/caffe2/fb/model_tracer/run_model_with_bundled_inputs.sh ~/local/notebooks/prod_models/deeplabv3_scripted_with_bundled_input.ptl
```
have the same operator output

selected_operators.yaml (P460296279)
selected_mobile_ops.h (P460296258)

Reviewed By: dhruvbird

Differential Revision: D30632224

fbshipit-source-id: eb0321dbc0f1fcf6d2e05384695eebb59ac04f8c
2021-10-15 02:19:45 -07:00
1e47181c47 [DDP Logging] Add iteration in error reporting (#65772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65772

Looking at some workloads and it would be useful to have this info.
ghstack-source-id: 140555200

Test Plan: CI

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31224417

fbshipit-source-id: 14eeb053aced87c7ca43b6879f81f54bd0a42b76
2021-10-14 22:29:36 -07:00
3740a06712 [MonitoredBarrier] Fix some logging (#65771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65771

Fixes some logging around monitored_barrier to make it cleaner.
ghstack-source-id: 140555204

Test Plan: CI

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31222881

fbshipit-source-id: 77d6f072ce98a9b31192e0d48ea0f8cbd8f216fe
2021-10-14 22:28:16 -07:00
06fa6c15c0 Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" (#66393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of its 1s timeout; bump up the timeout to resolve the flakiness. In general, though, we should not have been relying on time.sleep for this test; filed https://github.com/pytorch/pytorch/issues/66354 to track that.
- ciflow/all did not actually run the multigpu tests due to a bug. This has since been fixed.
ghstack-source-id: 140560113

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31534735

fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
2021-10-14 22:23:22 -07:00
59b28063b4 [NNC] Adding more python bindings for missing operators (#66612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66612

For the op authoring project, we want to expose the python bindings
to create Expr objects. These are the missing bindings.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31667852

fbshipit-source-id: 6d3ff83a7676cfea391ab3ea60dde6874a64047a
2021-10-14 22:09:01 -07:00
8dcf84069e [PyTorch] Implement improved version of gather_ranges_to_dense (#66677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66677

Reviewed By: wfanzju

Differential Revision: D31676536

fbshipit-source-id: a2eb1b1f9e5a0b78f89c3aad19f97acb7c05e1f8
2021-10-14 21:22:15 -07:00
70fc60b9d1 Revert D31325860: [PyTorch] Implement improved version of gather_ranges_to_dense
Test Plan: revert-hammer

Differential Revision:
D31325860 (23710e2d80)

Original commit changeset: 8e154f929ff7

fbshipit-source-id: 6d36d50d6bd4ec4fe07a6e2d1d0110504b9c8b53
2021-10-14 19:43:38 -07:00
b60050e96a [qat]Make sure the bn statistics are the same in the unit test. (#66244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66244

Make sure the bn statistics are the same in the unit test.
* The fused model in the existing code will have different bn statistics compared to the model without fusion. The two will produce the same result when the model is in training mode, but a different result in eval mode.

Test Plan: buck run mode/dev-nosan //caffe2/test:quantization -- -r quantization.eager.test_fusion.TestFusion

Reviewed By: jerryzh168

Differential Revision: D29504500

fbshipit-source-id: 41e3bfd7c652c27619baa7cbbe98d8d06a485781
2021-10-14 19:23:05 -07:00
23710e2d80 [PyTorch] Implement improved version of gather_ranges_to_dense (#66664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66664

Reviewed By: hlu1

Differential Revision: D31325860

fbshipit-source-id: 8e154f929ff7c597ff6e41f18278b24c552d1719
2021-10-14 18:37:35 -07:00
583217fe37 changes for pytorch issue 55577 (#66571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66571

changes for pytorch issue 55577

Test Plan:
Ran test:
python test/test_jit.py TestDict

Reviewed By: tugsbayasgalan

Differential Revision: D31622633

fbshipit-source-id: 171c68a65b1d0bf769b3d95f103daba375e95335
2021-10-14 18:19:11 -07:00
a1084401b0 Clean up DictLiteral and DictComprehension emission logic (#64953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64953

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30914687

Pulled By: ansley

fbshipit-source-id: ab9b9192a29f05b90c113c678e7c795bc087dc99
2021-10-14 17:35:39 -07:00
a7b79033ea Clean up ListLiteral and ListComprehension emission logic (#64952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64952

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30914690

Pulled By: ansley

fbshipit-source-id: 83ac9bc6445f89b3f47c5404435bc6058c6f3bd7
2021-10-14 17:34:17 -07:00
22ec625028 fx2trt example: run all submodules (#66590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66590

Updated fx2trt example to run all submodules

Added an assertion to make sure the outputs from the lowered and regular models match

Test Plan: buck run mode/dev-nosan caffe2:fx2trt_example

Reviewed By: 842974287

Differential Revision: D31592985

fbshipit-source-id: 45ce0b33e957f16b3729d3ecde706331c29d7214
2021-10-14 17:09:29 -07:00
20aa417e38 [PyTorch] [Quantization] Speed up PackedEmbeddingBagWeight::prepack() (#66632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66632

Calling `.item<float>()` for each element in a tensor is expensive. Instead, convert the entire Tensor in one call to `Tensor::copy_(input_tensor)`. See [this post](https://fb.workplace.com/groups/1144215345733672/posts/2080756188746245/) for more details.
ghstack-source-id: 140639868
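
The same idea expressed in Python terms (an analogy, not the actual C++ prepack code):

```python
import torch

src = torch.rand(1000)
dst = torch.empty(1000)

# Slow: one accessor call per element.
# for i in range(src.numel()):
#     dst[i] = src[i].item()

# Fast: a single bulk copy.
dst.copy_(src)
```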

Test Plan:
Build and run with bundled inputs.

### AI Bench

Before: [AI Bench](https://www.internalfb.com/intern/aibench/details/877359346171823), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v6_perf_1634185889953.html): 500ms

After: [AI Bench](https://www.internalfb.com/intern/aibench/details/60828780633319), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v6_perf_1634231176980.html): 444ms

We went from 500ms to 444ms, which is a reduction of ~11%.

Reviewed By: supriyar

Differential Revision: D31657430

fbshipit-source-id: 199ec9de3dab84bb5727d81c7804bb83bebf7b48
2021-10-14 16:30:39 -07:00
871a31b9c4 [TensorExpr] Add missing schemas for lshift/rshift lowerings. (#66653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66653

Test Plan: Imported from OSS

Reviewed By: navahgar, anijain2305

Differential Revision: D31664748

Pulled By: ZolotukhinM

fbshipit-source-id: 13a3154292f12b7bee43b9a5254fb43be032e7c1
2021-10-14 14:19:29 -07:00
f8348ce9c8 graceful failure for draw_graph() in acc_utils.py (#66631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66631

Writing to the current directory is causing issues in CI. We might also consider writing the ".dot" files to some temporary location.

Test Plan: CI

Reviewed By: 842974287

Differential Revision: D31657078

fbshipit-source-id: 9876327c7f172cd354f1b8e8076597c6a26e2850
2021-10-14 14:04:48 -07:00
1d90f29f14 [DOC] Improve Transformer documentation (#66574)
Summary:
Includes adding some typing annotations to TransformerEncoderLayer and TransformerDecoderLayer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66574

Reviewed By: soulitzer

Differential Revision: D31654024

Pulled By: jbschlosser

fbshipit-source-id: 9026bd36541699b7205e893decf5abc4a3f0ab5e
2021-10-14 13:26:12 -07:00
3097755e7a [DOC] Fix typo in KLDivLoss (#66583)
Summary:
Fix simple typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66583

Reviewed By: soulitzer

Differential Revision: D31653998

Pulled By: jbschlosser

fbshipit-source-id: e4fc91be297cc9a85099d7883b42436b5e3392d3
2021-10-14 13:21:37 -07:00
914796a69c Fix for prim::BroadcastMKLDNNTensors issue (#66628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66628

Ensure prim::BroadcastMKLDNNTensors does not break the stack invariant by pushing more than 2 tensors onto the stack.

Reviewed By: eellison

Differential Revision: D31638565

fbshipit-source-id: 4526c0cf7ba8d87dc8a9c213c66c711e83adfc66
2021-10-14 11:53:42 -07:00
833ede33ed Fix ubsan in concat_split_op.h (#66283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66283

Fixes
```
UndefinedBehaviorSanitizer: nullptr-with-nonzero-offset caffe2/caffe2/operators/concat_split_op.h:185:52
```

Test Plan: Sandcastle

Reviewed By: swolchok

Differential Revision: D31486274

fbshipit-source-id: 20128056f19cf814fdc3e6e144cf9208a4080d6a
2021-10-14 11:42:30 -07:00
76f3b07caf quantization docs: remove erroneous rebase artifact (#66577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66577

A rebase artifact was erroneously landed in the quantization docs;
this PR removes it.

Test Plan:
CI

Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31651350

fbshipit-source-id: bc254cbb20724e49e1a0ec6eb6d89b28491f9f78
2021-10-14 11:30:47 -07:00
016362e2d7 Run sparse tests only for TensorPipe agent. (#66600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66600

Sparse RPC functionality added in
https://github.com/pytorch/pytorch/pull/62794 works only for TensorPipe and is
broken for other agent types.

Moving these tests to a TensorPipe-only class.
ghstack-source-id: 140553147

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D31633305

fbshipit-source-id: 37d94cb9ed5565a72a6d512c2a9db75a497d5b95
2021-10-14 11:08:15 -07:00
543b7fb942 [JIT] Fix type annotations of pooling modules (#65847)
Summary:
All of the pooling modules except MaxUnpool and LPPool return either a
Tensor or [Tensor, Tensor]. The current type annotations are inaccurate
and prevent scripting the module if return_indices is set to True on the
module.
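
Concretely, a single `Tensor` return annotation cannot describe both call patterns:

```python
import torch

pool = torch.nn.MaxPool2d(2, return_indices=True)
out, indices = pool(torch.rand(1, 1, 4, 4))   # returns (Tensor, Tensor)

pool2 = torch.nn.MaxPool2d(2)
out2 = pool2(torch.rand(1, 1, 4, 4))          # returns a single Tensor
```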

There's not a great way to make this agree with mypy because the
overload is dependent on the value of return_indices, an attribute.

I tried changing the annotations from `Tensor` to
`Union[Tensor, Tuple[Tensor, Tensor]]`, but that breaks a bunch of uses
that have return_indices=False.
For example, this breaks:
4e94e84f65/torch/nn/modules/container.py (L139)

Also clean up how test names were being constructed in test_jit, since
otherwise we were getting name collisions when there were two tests on
the same nn.Module.

Fixes https://github.com/pytorch/pytorch/issues/45904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65847

Reviewed By: ZolotukhinM

Differential Revision: D31462517

Pulled By: eellison

fbshipit-source-id: 6f9e8df1be6c75e5e1e9bae07cf3ad3603ba59bd
2021-10-14 10:59:19 -07:00
51b67f2bca [qat]Removed outdated context manager in unit test. (#66274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66274

Removed outdated context manager in unit test.
* The linked issue (https://github.com/pytorch/pytorch/issues/23825) seems to have been fixed in 2020.

Test Plan: buck run mode/dev-nosan //caffe2/test:quantization -- -r quantization.eager.test_quantize_eager_qat

Reviewed By: vkuzo

Differential Revision: D29507087

fbshipit-source-id: e8fa04c9527023a5adaf1a012b2c393ce0c5cd97
2021-10-14 10:23:55 -07:00
49a1d7bfcb [opinfo] elemwise parcel : isfinite, isinf, isposinf, isneginf, isnan, isreal (#66400)
Summary:
Adds OpInfo for `isfinite, isinf, isposinf, isneginf, isnan, isreal`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66400

Reviewed By: bdhirsh

Differential Revision: D31602998

Pulled By: mruberry

fbshipit-source-id: 235cc414f373f014f4822a72deb1a04a58ad4a7c
2021-10-14 10:11:57 -07:00
d810e738b9 OpInfo for *_like functions (#65941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65941

OpInfos for: empty_like, zeros_like, ones_like, full_like, randn_like

Test Plan: - run tests

Reviewed By: dagitses

Differential Revision: D31452625

Pulled By: zou3519

fbshipit-source-id: 5e6c45918694853f9252488d62bb7f4ccfa1f1e4
2021-10-14 09:14:51 -07:00
5d4452937d OpInfos for some Tensor dtype conversion methods (#64282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64282

OpInfos for:
- Tensor.bfloat16, Tensor.bool, Tensor.bypte, Tensor.char
- Tensor.double, Tensor.float, Tensor.half, Tensor.int
- Tensor.short, Tensor.long

None of these are supported by TorchScript. Also, the OpInfo autograd
test runner assumes that the operation is not allowed to change the
dtype of the argument, so only Tensor.double has
`supports_autograd=True` (in theory Tensor.bfloat16, Tensor.float,
Tensor.half should be differentiable).

Test Plan: - run tests

Reviewed By: dagitses

Differential Revision: D31452627

Pulled By: zou3519

fbshipit-source-id: b7f272e558558412c47aefe947af7f060dfb45c5
2021-10-14 09:13:30 -07:00
77f98ea5e0 assert no duplicate yaml keys in codegen (#66238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66238

The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value.
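
A small demonstration of the silent-overwrite behavior with PyYAML:

```python
import yaml

doc = "conv1d: foo\nconv1d: bar\n"
print(yaml.safe_load(doc))  # {'conv1d': 'bar'} -- the later entry silently wins
```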

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).

Test Plan: Imported from OSS

Reviewed By: dagitses, albanD, sean-ngo

Differential Revision: D31464585

Pulled By: bdhirsh

fbshipit-source-id: 5b35157ffa9a933bf4b344c4b9fe2878698370a3
2021-10-14 08:28:20 -07:00
fe41df3601 Deprecate x.T on tensors of dimension other than 0 or 2 (#64180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64180

**BC-breaking note:**

This PR deprecates the use of `Tensor.T` on tensors that are not matrices. An upgrade
guide is added to the documentation for `Tensor.T`.

This PR DOES NOT make this attribute throw an error when called on a tensor of `dim != 2`,
but this will be its behavior in a future PyTorch release.
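
An illustration of what is and is not deprecated (a sketch, not from the PR):

```python
import torch

m = torch.rand(3, 4)
m.T                 # OK: 2-D transpose, unaffected

v = torch.rand(2, 3, 4)
v.T                 # deprecated: reverses all dims and now warns
v.permute(2, 1, 0)  # explicit equivalent that remains supported
```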

cc mruberry rgommers pmeier asmeurer leofang AnirudhDagar asi1024 emcastillo kmaehashi heitorschueroff

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D31610611

Pulled By: anjali411

fbshipit-source-id: af8ff7e862790dda9f06921de005b3f6fd0803c3
2021-10-14 08:17:32 -07:00
d802877dfa speed up quantized interpolate for channels last (#66525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525

This should solve https://github.com/pytorch/pytorch/issues/60015

There were two `q_zero_point()` accesses inside a for loop, which was
expensive. Moving them out of the loop sped things up ~10x in a
microbenchmark.
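
The shape of the fix as a toy Python sketch (`FakeQuantized` is a hypothetical stand-in; the real change is in the C++ kernel):

```python
class FakeQuantized:
    """Stand-in for a quantized tensor whose accessor is costly."""
    def __init__(self, data, zero_point):
        self.data, self._zp = data, zero_point

    def q_zero_point(self):
        return self._zp  # imagine this lookup is expensive

q = FakeQuantized(list(range(1000)), zero_point=8)

# Before: the accessor ran on every iteration of the hot loop.
# out = [v - q.q_zero_point() for v in q.data]

# After: the loop-invariant call is hoisted out of the loop.
zp = q.q_zero_point()
out = [v - zp for v in q.data]
```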

Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test

// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31592422

fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
2021-10-14 08:11:26 -07:00
a40812de53 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31646229

fbshipit-source-id: 26a89b8eb88d31259f79c8f9061e016d57a1e462
2021-10-14 04:52:16 -07:00
6310eb30d1 [SR] Clean up GetLivenessMap (#66606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66606

- Remove dead code (see comment for where)
- Add debug prints
- Small reorganization of the code to improve readability

Reviewed By: d1jang

Differential Revision: D31568219

fbshipit-source-id: 50240c325bf4fd012e1947ac931bb67c6f5dfafb
2021-10-13 23:55:40 -07:00
e1348973ac Add common_fx2trt.py (#66579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66579

Didn't commit this file in the PR that open sources fx2trt tests

Test Plan: ci

Reviewed By: 842974287

Differential Revision: D31623354

fbshipit-source-id: 6cedbe0f229da40499b83e6df28e16caca392d9c
2021-10-13 21:24:11 -07:00
74849d9188 [acc_shape_inference] add shape inference for quantize_per_channel (#66562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66562

Adding shape inference for `acc_ops.quantize_per_channel`, and fixing some bugs.

Bugs were related to the fact that `quantize_per_channel` arguments `scales` and `zero_points` take tensors, so when we fetch the values (which needs to be done using `.tolist()` instead of `.item()`) we may get either a list or a scalar value.

Test Plan:
# Test Quantized Resnet
From sandbox with GPU that supports quantized types (tested with V100)
`buck run mode/opt -c python.package_style=inplace caffe2:fx2trt_quantized_resnet_test`
Output
```
...
[TensorRT] INFO: [MemUsageSnapshot] Builder end: CPU 0 MiB, GPU 1548 MiB
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 0 MiB, GPU 1548 MiB
[TensorRT] VERBOSE: Using cublasLt a tactic source
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.1.0
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 0, GPU 1556 (MiB)
[TensorRT] VERBOSE: Using cuDNN as a tactic source
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 0, GPU 1564 (MiB)
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[TensorRT] VERBOSE: Total per-runner device memory is 23405056
[TensorRT] VERBOSE: Total per-runner host memory is 73760
[TensorRT] VERBOSE: Allocated activation device memory of size 154140672
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 0 MiB, GPU 1736 MiB
trt fp16 time (ms/iter) 1.252899169921875
trt int8 time (ms/iter) 1.3774776458740234
trt implicit int8 time (ms/iter) 1.3835883140563965
PyTorch time (CUDA) (ms/iter) 4.34483528137207
PyTorch time (CPU) (ms/iter) 55.687150955200195
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1918 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1866 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 0, GPU 1738 (MiB)
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1012 12:07:23.556475 711816 DynoConfigLoader.cpp:32] Failed to read config: No dyno config client
```

# Test shape inference
`buck test mode/opt glow/fb/fx/acc_tracer:test_acc_shape_inference`
Output
```
...
Summary
  Pass: 95
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1407375092088240
```

Reviewed By: jfix71, jerryzh168

Differential Revision: D31457323

fbshipit-source-id: 8ccc4a9b0ca655fb30838e88575aff2bf3a387a6
2021-10-13 21:03:08 -07:00
7d9bbd3596 Revert D31580382: [pytorch][PR] dropout update in autodiff
Test Plan: revert-hammer

Differential Revision:
D31580382 (eb8138d886)

Original commit changeset: 41d15da99bf4

fbshipit-source-id: 59f751ee59602a5fd09c17f8c7565dca5e2beb50
2021-10-13 19:52:05 -07:00
c1c985a282 Rename tensorexpr::Value so that it can coexist with torch::jit::Value (#66467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66467

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D31619973

Pulled By: bertmaher

fbshipit-source-id: eebea821fbbd0ae6f0a7144809c87c7da7f88699
2021-10-13 19:41:07 -07:00
6634570aef [SR] Fix bug in ValueGroup (#66470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470

Reviewed By: d1jang

Differential Revision: D31566348

fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3
2021-10-13 19:26:38 -07:00
d30397d42a [PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429

The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722

Test Plan:
CI

I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.

Reviewed By: hlu1

Differential Revision: D31096042

fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
2021-10-13 19:13:20 -07:00
c6f0dde3ca Cumsum Converter (#66376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66376

Added converter for cumsum and unit test

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_cumsum

Reviewed By: wushirong, 842974287

Differential Revision: D31423701

fbshipit-source-id: ee3aa625d6875ba8e6bad27044d22638e99b5c03
2021-10-13 19:04:37 -07:00
160946e3f3 Use torch.empty() instead of torch.tensor() in torch.nn.Parameter (#66486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66486

The newly-introduced Python dispatcher mode (`__torch_dispatch__`) does not have support for `torch.tensor()` (see #64360) and this causes friction in the user experience if some `nn.Modules` use `torch.tensor()` either implicitly or explicitly.

This PR replaces calls to `torch.tensor()` in `Parameter`, `UninitializedParameter`, and `UninitializedBuffer` with an equivalent call to `torch.empty()` which serves the same purpose and is syntactically more readable.
ghstack-source-id: 140520931

Test Plan: Since no behavioral change, run the existing unit and integration tests.

Reviewed By: pbelevich

Differential Revision: D31575587

fbshipit-source-id: bd7bdeea54370f3e53dc13bd182b97d0f67146f5
2021-10-13 18:56:36 -07:00
30d9fd9cf3 Migrate USE_MAGMA config macro to ATen (#66390)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66390

Test Plan: Imported from OSS

Reviewed By: malfet, bdhirsh

Differential Revision: D31547712

Pulled By: ngimel

fbshipit-source-id: 1b2ebc0d5b5d2199029274eabdd014f343cfbdd3
2021-10-13 17:50:10 -07:00
e75de4f307 remove a few unused THCTensor/Storage methods (#66555)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66555

Reviewed By: mruberry

Differential Revision: D31620969

Pulled By: ngimel

fbshipit-source-id: 1922ef523df473e8673a35c4a155b7b0cf000953
2021-10-13 17:18:11 -07:00
4e1c075542 log_sigmoid: Use log1p for improved precision (#66441)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/20972

log_sigmoid calculates something like `log(1 + x)` where x is always a
positive number less than one. This wastes floating point precision
because the exponent always becomes zero. Instead, using
`log1p(x)` gives the full mantissa precision around `x=0`.

This also fixes infinity propagation: the old code computes `exp(in - in)`
when `in` is negative, which for an infinite input results in a NaN
instead of 0.
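
The precision difference is easy to reproduce in plain Python (illustrative only):

```python
import math

x = 1e-12
print(math.log(1 + x))   # ~1.0000889e-12: precision lost when computing 1 + x
print(math.log1p(x))     # ~1e-12: full mantissa precision preserved
```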

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66441

Reviewed By: bdhirsh

Differential Revision: D31619630

Pulled By: albanD

fbshipit-source-id: e7867f3459a91e944b92f8ca42b6e0697b13f89b
2021-10-13 16:36:13 -07:00
24202f7fb4 Remove native_functions.yaml dependency from Activation.cu (#64499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64499

This moves the native functions into a separate Activation.cpp file,
which calls into `launch_..._kernel` functions defined in `Activation.cu`.
The exception is `rrelu_with_noise`, which is complicated by the
random number generation code, so I've moved it into its own file.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, ezyang

Differential Revision: D30867323

Pulled By: dagitses

fbshipit-source-id: a4cd6f1fb1b1fed4cc356bf8b3778991ae2278ba
2021-10-13 16:28:13 -07:00
eb8138d886 dropout update in autodiff (#66273)
Summary:
1. Unifies dropout op in autodiff
2. Removes dropout inference support in autodiff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66273

Reviewed By: jbschlosser, gmagogsfm

Differential Revision: D31580382

Pulled By: eellison

fbshipit-source-id: 41d15da99bf4ce6c47cc335a4156c4a1c9705a70
2021-10-13 16:23:40 -07:00
5f45927d15 Autograd: Delay warnings until the end of backward execution (#66235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209

This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.
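
A minimal Python sketch of the queue-and-replay pattern (the actual handler is implemented in C++ inside the engine; the names here are illustrative):
```
import queue
import warnings

_pending: "queue.Queue[warnings.WarningMessage]" = queue.Queue()

def warn_from_worker(message: str) -> None:
    # Worker threads enqueue instead of warning directly.
    _pending.put(warnings.WarningMessage(message, UserWarning, __file__, 0))

def replay_on_main_thread() -> None:
    # Called once backward execution finishes, on the main thread, so the
    # warnings flow through the normal Python warning machinery.
    while not _pending.empty():
        w = _pending.get()
        warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)
```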

For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235

Reviewed By: ejguan

Differential Revision: D31505413

Pulled By: albanD

fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
2021-10-13 15:38:04 -07:00
42328090cb [GHA] Hardcode doc build target to master (#66567)
Summary:
According to f48f20e154/.circleci/verbatim-sources/job-specs/job-specs-custom.yml (L46-L48),
the target should always be master (even on release branches) unless it is a
tagged build

Fixes https://github.com/pytorch/pytorch/issues/66466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66567

Reviewed By: seemethere

Differential Revision: D31621530

Pulled By: malfet

fbshipit-source-id: d6de2222d0340820555a82ae90b3de22b4dc7b88
2021-10-13 15:08:46 -07:00
0aab34c26c [jit] Refcounting spot fixes in alias_analysis (#66295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66295

Tidying up the top sources of reference count decrements seen during static runtime startup in alias_analysis.cpp specifically.
ghstack-source-id: 140484160

Test Plan:
CI

perf now shows under 2% time spent in ~__shared_count instead of about 5%.

Reviewed By: suo

Differential Revision: D31490761

fbshipit-source-id: bbdcb7f9065c3aafa7fff7bfea9cea6dbc41f9d9
2021-10-13 14:47:32 -07:00
9767282643 [jit] Add MutableTypePtrHelper::mapTypeToBorrowedAliasTypeSet (#65344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65344

Callsites that know they are using a cache can borrow AliasTypeSets from the cache instead of copying them.
ghstack-source-id: 140484162

Test Plan: Running perf on static runtime startup seems to show less inclusive time spent in AliasDb::getElements

Reviewed By: ejguan

Differential Revision: D31027363

fbshipit-source-id: b7a1473f4f9e9f14566f56f4b3b4e6317076beeb
2021-10-13 14:47:30 -07:00
75d98fa0ae [jit] Implement one-element MemoryDAG::mayContainAlias more efficiently (#65178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65178

There is no need to copy the MemoryLocations in this case.
ghstack-source-id: 140484161

Test Plan:
CI

static runtime startup for ctr_mobile_feed decreased from 7.0s to 6.3s

Reviewed By: suo

Differential Revision: D30984442

fbshipit-source-id: 61bb678c4480cd030aaab2bbc8a04cbd9b7c7f4d
2021-10-13 14:46:16 -07:00
9e8281fd2f [fx2trt][code quality] Add type annotation and docstring to utils functions in acc_ops_converters.py (#66496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66496

As the title. No changes on the code logic.

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D31576303

fbshipit-source-id: f2132309023b3c9e09810e32af91eb42eefd3f32
2021-10-13 14:06:15 -07:00
37db650c9c [Static Runtime] Clone test does not use uninitialized memory (#66557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557

The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.
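
An analogous Python-level illustration of why a stray NaN breaks the comparison unless NaNs are treated as equal (the test's `use_equalnan` plays the role of `equal_nan` here):
```
import torch

a = torch.tensor([float("nan")])
print(torch.allclose(a, a))                  # False: NaN never compares equal
print(torch.allclose(a, a, equal_nan=True))  # True once NaNs are treated equal
```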

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31611961

fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
2021-10-13 14:02:34 -07:00
82986a17a6 fix lint (#66572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66572

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D31624043

Pulled By: suo

fbshipit-source-id: 9db9cee3140d78c2a2f0c937be84755206fee1dd
2021-10-13 13:59:08 -07:00
a82fcd3560 Disable .numpy() and .tolist() for tensor subclasses and fix .tolist() for conjugated and negated tensors (#66082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66082

Fixes https://github.com/pytorch/pytorch/issues/66024 #65779

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved albanD

Test Plan: Imported from OSS

Reviewed By: Gamrix, albanD

Differential Revision: D31615588

Pulled By: anjali411

fbshipit-source-id: c3e65ef0fe301630eb76732ccd7819683c09aa19
2021-10-13 13:57:51 -07:00
675ba6cd53 [qnnpack] Remove usage of conv_param_t in deconv-run.cc (#66465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66465

conv_param_t is being removed as it stores redundant information. This removes the last usage of it in qnnpack so we can begin removing the dependency.
ghstack-source-id: 140475374

Test Plan: github tests

Reviewed By: kimishpatel

Differential Revision: D31564679

fbshipit-source-id: 049a28fac0235b2e739fb2e048484d7e8e7189fa
2021-10-13 13:51:15 -07:00
86cf22cb1c Add OpInfo for torch.bucketize (#65821)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65821

Reviewed By: malfet, mruberry

Differential Revision: D31386048

Pulled By: saketh-are

fbshipit-source-id: fae7ec7b6b57436d87d38d421c5f3f52be4cdadd
2021-10-13 13:46:35 -07:00
035310c574 Handle shared memory cases in MathBithFallback (#63602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63602

This PR fixes the case when a read and a write are performed on memory shared between mutable and/or non-mutable arguments. Example:
```
a=torch.tensor([1+1j])
b=a.conj()
b.add_(a) # should return tensor([2]) but returns tensor([2-2j])
```

The issue here is that the conjugate fallback resolves the conjugation in-place for mutable arguments, which, as shown above, is a problem when other input arguments share memory with the mutable argument(s).
This PR fixes this issue by:
1. first scanning through the operator input arguments and creating a vector of mutable arguments that have the conj bit set to `True` (and accordingly setting the flag `check_for_alias_with_mut_arg` to `True` or `False`).
2. Iterating through all the arguments. At this time we only look at the non-mutable arguments. If `check_for_alias_with_mut_arg` is set to `True`, then we iterate through `mutable_inputs` to check whether the current arg tensor aliases any of the entries in `mutable_inputs`. If it does, we clone the non-mutable tensor arg; otherwise we resolve the conjugation as before. (A Python sketch after this list illustrates the aliasing.)
3. Now we look through the mutable_inputs vector (which contains only mutable input tensors with conj bit set to `True`). We in-place conjugate each of the entries in the vector.
4. Do the computation.
5. Re-conjugate the mutable argument tensors.
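
A quick Python illustration of the lazy conjugation and aliasing that make the check in step 2 necessary (this uses the public conj-view API; the fallback itself lives in C++):
```
import torch

a = torch.tensor([1 + 2j])
b = a.conj()                          # lazy view: only the conj bit is set
print(b.is_conj())                    # True
print(b.data_ptr() == a.data_ptr())   # True -- b aliases a's memory
c = b.resolve_conj()                  # materialize into fresh memory
print(c.is_conj())                    # False
print(c.data_ptr() == a.data_ptr())   # False -- aliasing broken, which is
                                      # what cloning achieves in step 2
```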

NOTE: `TensorLists` are not fully handled in ConjugateFallback. Please see the in-line comment for more details.

Fixes https://github.com/pytorch/pytorch/issues/59943

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D30466905

Pulled By: anjali411

fbshipit-source-id: 58058e5e6481da04a12d03f743c1491942a6cc9b
2021-10-13 13:39:31 -07:00
c04bcde245 Make empty* and *_like factory functions respect tensor subclasses (#65677)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65243

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65677

Reviewed By: dagitses

Differential Revision: D31432032

Pulled By: albanD

fbshipit-source-id: 77f464974c7656c1206085aba9300471d7e0ef57
2021-10-13 13:34:53 -07:00
b792a77895 Skip interactive_embedded_interpreter.cpp for clang-tidy (#66569)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66569

Reviewed By: suo

Differential Revision: D31622885

Pulled By: malfet

fbshipit-source-id: 61bad5ff3011f992cdd149724c935c098996d6a2
2021-10-13 13:27:56 -07:00
09b90612c4 .github: Enable onnx tests (#66513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66513

These were missed in the migration of onnx to github actions.

Adds ort tests with 2 shards for the onnx workflow

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31599433

Pulled By: seemethere

fbshipit-source-id: 73dce0d3017c4280e64f0c8578e2be7ef6a168d6
2021-10-13 13:14:02 -07:00
f48f20e154 Make ContainerHash compatible with const& types (#66497)
Summary:
- this change should not impact existing use cases, but allows for
  additional use cases where the container holds const types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66497

Reviewed By: alanwaketan

Differential Revision: D31582242

Pulled By: wconstab

fbshipit-source-id: 3a0e18b4afaf3c7ff93a0e3d09067ed066402b44
2021-10-13 12:45:17 -07:00
fdd9f49cf5 add a note on numerical accuracy (#65947)
Summary:
Per title
Fixes https://github.com/pytorch/pytorch/issues/54437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65947

Reviewed By: albanD

Differential Revision: D31612445

Pulled By: ngimel

fbshipit-source-id: 5c155891a088aef3b9813f253d0dc1ee4d51ae1c
2021-10-13 12:43:55 -07:00
a453ebc8ac Use interactive_embedded_interpreter to dynamically load various third-party libraries (#66512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66512

TLDR, we are able to use the interactive_embedded_interpreter (basically just the torch::deploy interpreter with an interactive shell) to dynamically load various third-party libraries. We use the popular libraries numpy, scipy, regex, and pandas for illustration purposes.

A couple of changes needed to be made to the interactive_embedded_interpreter:
1. We need to link with :embedded_interpreter_all rather than :embedded_interpreter so we can enable DEEPBIND and use our custom loader.
2. We provide a pylibRoot path to construct the InterpreterManager. The path will be added to the embedded interpreter's sys.path. Typically we can pass in the Python library root path of a conda environment so the torch::deploy interpreter can find all installed packages.
3. We allow interactive_embedded_interpreter to execute a script, to ease recording the exploration of various Python libraries.
ghstack-source-id: 140453213

Test Plan:
Install numpy, scipy, regex, pandas in the conda environment or on the machine directly. Suppose /home/shunting/.local/lib/python3.8/site-packages/ is the root path for the installed libraries.

- buck run mode/opt :interactive_embedded_interpreter -- --pylib_root=/home/shunting/.local/lib/python3.8/site-packages/ --pyscript=~/p7/iei_examples/try_regex.py
content of try_regex.py:
```
import regex

print(regex)
pat = r'(.+)\1'
print(regex.match(pat, "abcabc"))
print(regex.match(pat, "abcba"))

print("bye")
```

- buck run mode/opt :interactive_embedded_interpreter -- --pylib_root=/home/shunting/.local/lib/python3.8/site-packages/ --pyscript=~/p7/iei_examples/try_numpy.py
content of try_numpy.py:
```
import numpy as np
print(f"numpy at {np}")
a = np.random.rand(2, 3)
b = np.random.rand(3, 2)
print(np.matmul(a, b))
```

- buck run mode/opt :interactive_embedded_interpreter -- --pylib_root=/home/shunting/.local/lib/python3.8/site-packages/ --pyscript=~/p7/iei_examples/try_scipy.py
content of try_scipy.py:
```
import numpy as np
from scipy import linalg

mat_a = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [1, 2, 1, 0], [1, 3, 3, 1]])
mat_b = linalg.inv(mat_a)
print(mat_b)
```

- buck run mode/opt :interactive_embedded_interpreter -- --pylib_root=/home/shunting/.local/lib/python3.8/site-packages/ --pyscript=~/p7/iei_examples/try_pandas.py
content of try_pandas.py:
```
import pandas as pd
print(f"pandas at {pd}")
df = pd.DataFrame({
  "col1": [1, 2, 3, 4],
  "col2": [2, 4, 8, 16],
})
print(df)
```

Reviewed By: suo

Differential Revision: D31587278

fbshipit-source-id: c0b031c1fa71a77cdfeba1d04514f83127f79012
2021-10-13 12:39:13 -07:00
a8815d557a [vulkan] Remove the persistent resource pool (#66478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66478

A persistent resource pool was needed to store prepacked tensors since the main resource pool tied to the global Vulkan context would be flushed at the end of each inference run. However, prepacked tensors needed to stay alive between inference runs, so an additional persistent resource pool was introduced that would only be flushed when the Vulkan context was destroyed.

However, with [this change](https://github.com/pytorch/pytorch/pull/66477) the resource pool no longer indiscriminately flushes allocated resources at the end of an inference run. Tensors will have to call `release_resources()` before they become eligible to be destroyed. Since prepacked tensors are tied to an `OpContext` object they will stay alive between inference runs.

Therefore, the persistent resource pool is no longer needed.

Test Plan: Build and run `vulkan_api_test`.

Reviewed By: beback4u

Differential Revision: D31490076

fbshipit-source-id: 3741a2333c834796d589774e819eaaf52bb9f0fe
2021-10-13 12:01:08 -07:00
cebaf21c5a [vulkan] Release GPU resources when vTensor::View is destroyed (#66477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66477

Currently, Vulkan tensor memory is allocated and deallocated through the following mechanism:

1. During inference, ops will request buffer and/or texture memory for tensors from the [Resource Pool](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.h#L324-L327)
2. The resource pool allocates the memory and [adds it to a vector](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.cpp#L609-L622) containing all the memory allocations it has made this inference, then returns the most recently allocated block of memory
3. At the end of inference, results are transferred back to the CPU and the [context is flushed](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/ops/Copy.cpp#L150)
4. As part of the context flush the [resource pool is purged](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Context.cpp#L143) which [deallocates all buffer and texture memory](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/Resource.cpp#L683-L684) allocated by the resource pool

This pattern makes it impossible to have models with multiple outputs. When the first output tensor is transferred back to the CPU, the memory of the other output tensors will be deallocated when the context is flushed.

Instead, an alternative is to tie resource destruction to the destructor of the [vTensor::View](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/ops/Tensor.h#L243) class, which holds the actual implementation and storage of Vulkan tensors. This will ensure that memory associated with a tensor will be cleaned up whenever it is no longer used.

The new deallocation mechanism proposed is:

1. During inference, `vTensor` objects will request GPU memory from the resource pool, same as before.
2. The resource pool allocates buffer or texture memory and returns it directly to the `vTensor`
3. Throughout inference, intermediate tensors' reference counts will go to 0 and the destructor of the `View` class will be called
4. The destructor will add any texture and buffer memory it is holding to the resource pool's list of GPU memory allocations to be cleaned up
5. At the end of inference `purge()` will be called which will destroy all allocations in the list of allocations to be cleaned
6. GPU memory for output tensors will not be destroyed, since their reference counts will be greater than 0, thus they have not yet been added to the list of allocations to be destroyed

Note that it is not correct to have the destructor directly deallocate GPU memory. This is due to the fact that Vulkan ops simply submit work to the GPU but do not guarantee that the work has completed when the op returns. Therefore we must keep all allocated GPU memory until the end of inference, when we wait for the GPU to complete its work.

Test Plan:
build and run `vulkan_api_test` to make sure existing functionality is not impacted.

Also test in a later diff that checks that output tensors stay alive after inference completes.

Reviewed By: dreiss

Differential Revision: D31510899

fbshipit-source-id: 99250c2800a68f07b1b91dbf5d3b293184da5bd2
2021-10-13 11:59:40 -07:00
5e34ac6c43 [FX] Fix cases when we should not fuse due to more than one users of intermediate node (#66472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66472

A follow up of https://github.com/pytorch/pytorch/pull/66362. Same fix.

Test Plan:
```
buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_fuse_permute_matmul_trt
buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_fuse_permute_linear_trt

```

Reviewed By: wushirong, 842974287

Differential Revision: D31567662

fbshipit-source-id: 2c9e6a138fc31996d790fd4d79e0bf931507fc99
2021-10-13 11:53:42 -07:00
9d13ae450a [oss/ci] skip all dataloader tests with asan (#66561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66561

See https://github.com/pytorch/pytorch/issues/66223 for context.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31617142

Pulled By: suo

fbshipit-source-id: 16b280fc47a7c40fa19c5c72192d342dd33680bf
2021-10-13 11:39:41 -07:00
713e025c9f Add no-input-grad-needed cases to test_grid_sample (#66071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66071

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31431801

Pulled By: albanD

fbshipit-source-id: 57a94ed9e97e402aa8193d69355e57b6309c64f7
2021-10-13 10:56:47 -07:00
8a40bb62f9 Compute input gradient only if required (CUDA) (#66070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66070

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31431805

Pulled By: albanD

fbshipit-source-id: 8c3de6632aaee168ec6fd7eb79a5af26973af9c5
2021-10-13 10:56:45 -07:00
f8d98b5a6d Compute input gradient only if required (CPU) (#66069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66069

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31431803

Pulled By: albanD

fbshipit-source-id: d4caba5fa092e4ee7411502021836370082670b2
2021-10-13 10:56:43 -07:00
84385c40e4 Add output_mask (#66068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66068

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31431802

Pulled By: albanD

fbshipit-source-id: 322aae5614dacb06fd45e513465b7a5cc11f4dbb
2021-10-13 10:55:27 -07:00
6401658b08 fix type error in hipify_python.py (#66164)
Summary:
- [x] Fixed the Pyre type checking errors in `torch/utils/hipify/hipify_python.py`:
```
torch/utils/hipify/hipify_python.py:196:8 Incompatible variable type [9]: clean_ctx is declared to have type `GeneratedFileCleaner` but is used as type `None`.
torch/utils/hipify/hipify_python.py:944:4 Incompatible variable type [9]: clean_ctx is declared to have type `GeneratedFileCleaner` but is used as type `None`.
```
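
A plausible shape of the fix, sketched below (illustrative only; the actual patch may differ, and `GeneratedFileCleaner` is stubbed here):
```
from typing import Optional

class GeneratedFileCleaner:
    """Stub standing in for the real class in hipify_python.py."""

def hipify(clean_ctx: Optional[GeneratedFileCleaner] = None) -> None:
    # Declaring the default-None parameter as Optional satisfies Pyre;
    # `clean_ctx: GeneratedFileCleaner = None` does not, since None is
    # not a GeneratedFileCleaner.
    if clean_ctx is None:
        clean_ctx = GeneratedFileCleaner()
```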

Fixing the issue: https://github.com/MLH-Fellowship/pyre-check/issues/78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66164

Reviewed By: onionymous

Differential Revision: D31411443

Pulled By: 0xedward

fbshipit-source-id: c69f8fb839ad1d5ba5e4a223e1322ae7207e1574
2021-10-13 10:33:49 -07:00
d85948896c Add softplus support to autodiff (#63942)
Summary:
Add softplus definition to autodiff.

cc gmagogsfm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63942

Reviewed By: ngimel

Differential Revision: D31397158

Pulled By: eellison

fbshipit-source-id: f7db547370f82e5e282505c3c8415fb4fbd86d54
2021-10-13 08:08:09 -07:00
82a216c45b Add tensor.{adjoint(),H,mT,mH} methods and properties (#64179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64179

This PR follows the discussion in https://github.com/pytorch/pytorch/issues/45063#issuecomment-904431478

Fixes https://github.com/pytorch/pytorch/issues/45063
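
A quick demonstration of the new accessors (assuming the semantics agreed on in the linked discussion: `.mT`/`.mH` act on the last two dimensions, and `.H` is the 2-D conjugate transpose):
```
import torch

x = torch.randn(2, 3, dtype=torch.complex64)
assert torch.equal(x.mT, x.transpose(-2, -1))         # matrix transpose
assert torch.equal(x.mH, x.transpose(-2, -1).conj())  # conjugate transpose
assert torch.equal(x.adjoint(), x.mH)                 # method spelling of .mH
assert torch.equal(x.H, x.mH)                         # .H on a 2-D tensor
```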

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved rgommers pmeier asmeurer leofang AnirudhDagar asi1024 emcastillo kmaehashi heitorschueroff

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30730483

Pulled By: anjali411

fbshipit-source-id: 821d25083f5f682450f6812bf852dc96a1cdf9f2
2021-10-13 07:44:43 -07:00
87df043f63 [Bootcamp][Pytorch]Add testing for complex parameters in Adagrad optimizer (#66501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66501

Add testing for the Adagrad optimizer to ensure that it behaves as if complex numbers are two real numbers in R^2 as per issue 65711 on github
ghstack-source-id: 140414042

Test Plan:
buck test mode/dev caffe2/test:optim -- 'test_adagrad_complex'

https://pxl.cl/1R27M

Reviewed By: albanD

Differential Revision: D31584240

fbshipit-source-id: 5c9938084566b8ea49cc8ff002789731f62fe87e
2021-10-13 07:05:20 -07:00
ecb7b38c00 [PyTorch] Support additional arguments in Python record function (#65736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65736

We ran into some limitations when extracting PyTorch operator parameters through hooks or the execution graph. Some of these limitations are not due to the operator not exposing the parameters; rather, the inputs for these operators are already fused/processed in some cases (like embedding tables). We want to be able to attach metadata to user-scope record functions, allowing profilers to later extract this information.

The record function C++ API already supports taking inputs and outputs information. The corresponding Python interface does not support them and only allows a string name as record function parameter.

This diff adds support for users to optionally add additional arguments to the record function, in two ways.
1. to remain backward compatible with `record_function_op`, we have added an optional string arg to the interface: `with record_function(name, arg_str)`.
2. to support the data dependency graph, we also have the new `torch.autograd._record_function_with_args_enter` and `torch.autograd._record_function_with_args_exit` functions to provide an interface where we can give additional tensor arguments. For now we imagine this can be used for debugging or analysis purposes. In this form, we currently support some basic data types as inputs: scalars, strings, lists, and tensors.

Example usage:

```
# record_function operator with a name and optionally, a string for arguments.
with record_function("## TEST 1 ##", "[1, 2, 3]"):
    <actual module or operator>

# more general form of record_function
a = _record_function_with_args_enter("## TEST 2 ##", 1, False, 2.5, [u, u], "hello", u)
<actual module or operator>
_record_function_with_args_exit(a)

```
Corresponding outputs in execution graph:
```
    {
      "name": "## TEST 2 ##", "id": 7, "parent": 3, "fw_parent": 0, "scope": 5, "tid": 1, "fw_tid": 0,
      "inputs": [1,false,2.5,[6,6],"hello",6], "input_shapes": [[],[],[],[[3,4,5],[3,4,5]],[],[3,4,5]], "input_types": ["Int","Bool","Double","GenericList[Tensor(float),Tensor(float)]","String","Tensor(float)"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
    {
      "name": "## TEST 1 ##", "id": 3, "parent": 2, "fw_parent": 0, "scope": 5, "tid": 1, "fw_tid": 0,
      "inputs": ["1, 2, 3"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Test Plan:
```
=> buck build caffe2/test:profiler --show-output
=> buck-out/gen/caffe2/test/profiler#binary.par test_profiler.TestRecordFunction
test_record_function (test_profiler.TestRecordFunction) ... Log file: /tmp/libkineto_activities_1651304.json
Net filter:
Target net for iteration count:
Net Iterations: 3
INFO:2021-09-27 01:10:15 1651304:1651304 Config.cpp:424] Trace start time: 2021-09-27 01:10:30
Trace duration: 500ms
Warmup duration: 5s
Net size threshold: 0
GPU op count threshold: 0
Max GPU buffer size: 128MB
Enabled activities: cpu_op,user_annotation,external_correlation,cuda_runtime,cpu_instant_event
Manifold bucket: gpu_traces
Manifold object: tree/traces/clientAPI/0/1632730215/devvm2060.ftw0/libkineto_activities_1651304.json
Trace compression enabled: 1
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:536] Tracing starting in 14s
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:48] Target net for iterations not specified - picking first encountered that passes net filter
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:57] Tracking net PyTorch Profiler for 3 iterations
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:126] Processing 1 CPU buffers
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:686] Recorded nets:
INFO:2021-09-27 01:10:15 1651304:1651304 ActivityProfiler.cpp:689] PyTorch Profiler: 1 iterations
ok

----------------------------------------------------------------------
Ran 1 test in 0.021s

OK
```

Reviewed By: gdankel

Differential Revision: D31165259

fbshipit-source-id: 15920aaef7138c666e5eca2a71c3bf33073eadc4
2021-10-13 01:49:15 -07:00
9918fd8305 [fx2trt] open source tests for converters (#66361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66361

ossci will be setup later, fbonly ci is ready

Test Plan:
buck run caffe2/test:fx2trt_test_linear

testinprod

Reviewed By: 842974287

Differential Revision: D31511082

fbshipit-source-id: 9e2c50c83fdba822cd2488eb17b5787d8a57f087
2021-10-13 00:09:43 -07:00
80a3619823 Remove THCTensorMathReduce.cuh (#66389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66389

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31547711

Pulled By: ngimel

fbshipit-source-id: c181d14f66536b6873b5b14088312c6c70bf0855
2021-10-12 22:59:19 -07:00
bc6935ddf5 [PyTorch][Distributed][Easy] Make ShardedTensor.size() equivalent to torch.Tensor.size() (#65087) (#66012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66012

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D31345161

Pulled By: fduwjj

fbshipit-source-id: 10d6b65780ab0c6934babcc7c36a181cb66f0b7c
2021-10-12 22:26:22 -07:00
8eb85b5027 Remove THCNumerics (#66388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66388

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D31547710

Pulled By: ngimel

fbshipit-source-id: 20710328f2e5fc2e931a3f8ba9b4243acc310d54
2021-10-12 22:05:03 -07:00
2d3b23190c Revert D31591512: .github: Enable onnx tests
Test Plan: revert-hammer

Differential Revision:
D31591512 (06a156efc7)

Original commit changeset: 4a8bb3f0e62f

fbshipit-source-id: 2d8580c0e507c2a0b30431bcf30eb01cef82f602
2021-10-12 20:17:02 -07:00
08f3823647 Sparse CSR CUDA: add addmv_out (#61407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61407

This PR adds `addmv_out_sparse_csr_cuda`. The operation is used to
compute matrix-vector multiplication. Since structured_delegate is used,
we only need to implement the out variant; the in-place and normal
variants are autogenerated.
Working on this PR revealed that float16 (and probably bfloat16) inputs
do not work correctly in cuSPARSE, so for this case `addmm` is
used with squeezes and unsqueezes.
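
The squeeze/unsqueeze trick in plain PyTorch (dense tensors here for brevity; the kernel applies the same reshaping around `addmm` for the sparse CSR float16 path):
```
import torch

A = torch.randn(3, 4)
x = torch.randn(4)
y = torch.randn(3)

mv = torch.addmv(y, A, x)  # y + A @ x
mm = torch.addmm(y.unsqueeze(-1), A, x.unsqueeze(-1)).squeeze(-1)
assert torch.allclose(mv, mm)
```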

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31584499

Pulled By: ngimel

fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431
2021-10-12 20:06:56 -07:00
8492e6bc6a .github: scheduled -> schedule, fix periodic (#66531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66531

The github.event_name should be schedule not scheduled

Reference, https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31598136

Pulled By: seemethere

fbshipit-source-id: 4d67f7731b21e05dabc8f54b4ebf9a5d2d3a4e1e
2021-10-12 19:46:01 -07:00
06a156efc7 .github: Enable onnx tests (#66513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66513

These were missed in the migration of onnx to github actions.

Adds ort tests with 2 shards for the onnx workflow

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31591512

Pulled By: seemethere

fbshipit-source-id: 4a8bb3f0e62ff98ee77d3d8afc905f4e02db6f24
2021-10-12 19:35:09 -07:00
93d326c868 Add InplaceOrView boxed kernel (#63878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63878

See https://github.com/pytorch/pytorch/issues/64407, https://github.com/pytorch/pytorch/issues/62032 for context:

In this PR:
 - Add boxed kernel by replicating `gen_inplace_or_view`'s logic that is ONLY for use with the Autograd not-implemented kernel
   - Unlike `gen_inplace_or_view` we always pass a view_func to as_view in order to ensure that a "derivative is not implemented" error is raised even if an in-place update is performed on the view. Without the `view_func`, the CopySlice + AsStridedBackward nodes would replace the NotImplemented node.
   - This limitation makes it impossible to use this node for general use
   - view relationship must be between first input (must be tensor) and first output (may be tensor or vec of tensor)
   - do not support non-differentiable views (_values, _indices, view.dtype) - view relationship is always fw and bw differentiable
 - Adds the macro `#define REGISTER_AUTOGRAD_NOT_IMPLEMENTED_FALLBACK(ns, op)` to be the interface for this feature:
   - static initialization can be slowed down (though not measured) if there are many registrations, because each line translates to 2 library calls; the workaround is to manually use the two functions `AutogradNotImplementedFallback` and `ADInplaceOrViewFallback` and call `m.impl`.
 - Adds testing:
    - for views: view relationship created
      -  performing in-place operation on the view, raises properly
      - trying to create two view relationships is not allowed,
      - single view relationship but not first input/first output should error
      - view relation created properly for tensor vector output
    - for in-place:
      - version count bump
      - triggers rebase_history
      - multiple mutations is okay and also updates version counter
 - TODO (follow up): Update tutorials for adding  third-party operators (and document the above limitations)
 - TODO (follow up): Look at torch-audio/torch-vision and identify places where this can simplify existing code

EDIT: Made it more clear what is introduced in this PR and moved some more contextual stuff into the issue itself

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30901714

Pulled By: soulitzer

fbshipit-source-id: 48de14c28be023ff4bd31b7ea5e7cba88aeee04c
2021-10-12 18:55:50 -07:00
40794dbb25 add backend_config_dict to checkGraphModeFxOp (#66499)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66499

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31582518

Pulled By: rahxephon89

fbshipit-source-id: b8107bb7140517f2dc32bf692c6b916536ea35c3
2021-10-12 18:35:54 -07:00
d32736e317 Make permission errors more human readable (#66492)
Summary:
`_mkdir_p` feels like a remnant of the Python 2 era; add the `exist_ok` argument and re-raise OSError to make failures more human-readable.

After the change attempt to build PyTorch in a folder that does not have write permissions will result in:
```
% python3.6 setup.py develop
Building wheel torch-1.10.0a0+git9509e8a
-- Building version 1.10.0a0+git9509e8a
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch-worktree/tools/setup_helpers/cmake.py", line 21, in _mkdir_p
    os.makedirs(d, exist_ok=True)
  File "/opt/homebrew/Cellar/python36/3.6.2+_254.20170915/Frameworks/Python.framework/Versions/3.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: 'build'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "setup.py", line 895, in <module>
    build_deps()
  File "setup.py", line 370, in build_deps
    cmake=cmake)
  File "/Users/nshulga/git/pytorch-worktree/tools/build_pytorch_libs.py", line 63, in build_caffe2
    rerun_cmake)
  File "/Users/nshulga/git/pytorch-worktree/tools/setup_helpers/cmake.py", line 225, in generate
    _mkdir_p(self.build_dir)
  File "/Users/nshulga/git/pytorch-worktree/tools/setup_helpers/cmake.py", line 23, in _mkdir_p
    raise RuntimeError(f"Failed to create folder {os.path.abspath(d)}: {e.strerror}") from e
RuntimeError: Failed to create folder /Users/nshulga/git/pytorch-worktree/build: Permission denied
```
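
A minimal sketch of the updated helper, reconstructed from the traceback above:
```
import os

def _mkdir_p(d: str) -> None:
    try:
        os.makedirs(d, exist_ok=True)
    except OSError as e:
        raise RuntimeError(
            f"Failed to create folder {os.path.abspath(d)}: {e.strerror}"
        ) from e
```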

Fixes https://github.com/pytorch/pytorch/issues/65920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66492

Reviewed By: seemethere

Differential Revision: D31578820

Pulled By: malfet

fbshipit-source-id: afe8240983100ac0a26cc540376b9dd71b1b53af
2021-10-12 18:31:24 -07:00
d921891f57 GHA: Stop skipping periodic jobs (#66264)
Summary:
they have been skipped for too long
![image](https://user-images.githubusercontent.com/31798555/136433267-f35c0507-23ab-4348-be43-78d299c3d654.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66264

Reviewed By: dagitses, malfet, seemethere

Differential Revision: D31478705

Pulled By: janeyx99

fbshipit-source-id: 1324b123e3f8646e5cd671af4c1850398a6f6e3b
2021-10-12 14:39:47 -07:00
3ac2c74896 Revert D31082208: Use shared CUPTI by default
Test Plan: revert-hammer

Differential Revision:
D31082208 (8b0eae5aa8)

Original commit changeset: 14f66af92084

fbshipit-source-id: 0faff00832b7f79d476fd1f9f505142a548a76db
2021-10-12 14:37:54 -07:00
9984f4bb8b Remove native_functions.yaml dependency from some reduction operators (#64173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64173

This one also required restructuring the code a bit to move the kernel
code into separate files. So I've mainly focused on CUDA, which is
where the real build-time issues are.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, ezyang

Differential Revision: D30728581

Pulled By: dagitses

fbshipit-source-id: a69eea5b4100d16165a02660dde200c8f648683d
2021-10-12 13:11:24 -07:00
ee38a467ea fix normal with empty std (#66463)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66463

Reviewed By: navahgar

Differential Revision: D31561904

Pulled By: ngimel

fbshipit-source-id: 3b2f44dc0ec075fe4f9685696578a0ff6e58d501
2021-10-12 11:28:11 -07:00
8b0eae5aa8 Use shared CUPTI by default (#65401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65401

Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols.  Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gdankel

Differential Revision: D31082208

Pulled By: ezyang

fbshipit-source-id: 14f66af920847e158436b5801c43f3124b109b34
2021-10-12 11:01:40 -07:00
c6216b2a43 Back out "Revert D30710710: [Pytorch Edge] Support profiling kineto events from external source" (#66421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66421

Original commit changeset: ab6bb8fe4e83

Plus this includes BUILD.bazel changes, the reason for the revert.

Test Plan: See original diff

Reviewed By: gdankel

Differential Revision: D31542513

fbshipit-source-id: ee30aca2d6705638f97e04b77a9ae31fe5cc4ebb
2021-10-12 10:55:29 -07:00
d7916e3734 [jit] Eliminate malloc & recursive refcount bumps in HashType::operator() (#65348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65348

Previously, this took several percent of model loading time. Now it is well under 1%.

We get these savings by avoiding allocating a vector and by avoiding reference count bumps on contained types within each type.
ghstack-source-id: 140148562

Reviewed By: suo

Differential Revision: D31057278

fbshipit-source-id: 55a02cbfefb8602e41baddc2661d15385fb2da55
2021-10-12 10:51:17 -07:00
47c531b6e8 [jit] Compare object identity first in ClassType::operator== (#65347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65347

This check is much cheaper than anything involving actually inspecting object fields (i.e., the cost is low), and if it succeeds we can skip the expensive function body (which involves locking a weak_ptr and then destroying the resulting shared_ptr). It almost entirely eliminates time spent in this function during model loading according to perf.
ghstack-source-id: 140148561

Test Plan: Specifically I profiled static runtime startup for the ctr_mobile_feed model and saw self time in this function go from 2-3% to 0.36%.

Reviewed By: ejguan

Differential Revision: D31057279

fbshipit-source-id: efb6bdc0957b680112ac282e85dc1b06b1b6c0bd
2021-10-12 10:49:36 -07:00
17e79bc76c remove is_reference from all is_output_quantized (#66456)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66456

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31562633

Pulled By: rahxephon89

fbshipit-source-id: 85c73a23e90ba9c1406f4027d447fbbe4576e39a
2021-10-12 10:43:52 -07:00
702fb1de72 [fx2trt] open source tests for acc tracer (#66302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66302

Just move files, ossci can be setup later

Test Plan:
buck run //caffe2/test:test_fx_acc_tracer

testinprod

Reviewed By: 842974287

Differential Revision: D31495087

fbshipit-source-id: f182c7438e3e80ba98924990682cb45a99b9967c
2021-10-12 10:27:34 -07:00
a6eec0c60f Upgrade onnx submodule to 85546f8c44e627f8ff1181725d03cc49f675e44f (#66427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66427

Update the onnx submodule, so https://github.com/pytorch/pytorch/pull/66140 can land.

Test Plan: ci

Reviewed By: ezyang

Differential Revision: D31544610

fbshipit-source-id: 94831ef531bbd654a6aeb744cd53a38155848079
2021-10-12 09:46:08 -07:00
e6261083f9 [FX] fuse permute021 linear pass for trt lowering (#66362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66362

In general we cannot rely on Permute021Linear being kept as-is by the time we reach the lowering phase, because an earlier transformation could have traced through this module. An acc-based FX pass is more reliable for recovering the perf.

Test Plan:
```
buck run mode/opt -c python.package_style=inplace -c fbcode.nvcc_arch=a100 //hpc/new/models/ads/benchmarks:ads_dense_benchmark -- over-arch --model-version=23x_3tb --batch-size=2048

OverArch, PyTorch, FP16, BS: 2048, TFLOP/s: 53.22, Time per iter: 14.46ms, QPS: 141629.45
OverArch, TensorRT, FP16, BS: 2048, TFLOP/s: 92.20, Time per iter: 8.35ms, QPS: 245354.15
```

Unittest:
```
buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_fuse_permute_linear_trt
```

Reviewed By: jianyuh, wushirong, 842974287

Differential Revision: D31525307

fbshipit-source-id: b472a8c277aa4d156d933d6a5abec091133f22c5
2021-10-12 09:41:32 -07:00
8818dda237 Fix lstsq to work with inputs that require grad (#66426)
Summary:
I updated `sample_inputs_linalg_lstsq`, and `test_nondifferentiable`
now correctly reveals the failure. The internal assert error was thrown
because autograd attempts to mark an integer tensor as differentiable.

Fixes https://github.com/pytorch/pytorch/issues/66420.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66426

Reviewed By: ejguan

Differential Revision: D31550942

Pulled By: albanD

fbshipit-source-id: 4a0ca60e62c5e9bb96af5020541da2d09ea3e405
2021-10-12 08:52:21 -07:00
213ac4e59c Remove native_functions.yaml dependency from PointwiseOps (#64172)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64172

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728584

Pulled By: dagitses

fbshipit-source-id: 2ae9686ac7c312e2d470d26a3cad12afcf7ef47b
2021-10-12 08:12:25 -07:00
8674a3c6e3 Remove native_functions.yaml dependency from PowKernel (#64171)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64171

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728583

Pulled By: dagitses

fbshipit-source-id: ea6891a3598eead93daea620b94e50d3a3b248cf
2021-10-12 08:12:23 -07:00
1841f76cc0 Remove native_functions.yaml dependency from unary ops (#64170)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64170

Test Plan: Imported from OSS

Reviewed By: gchanan, ezyang

Differential Revision: D30728578

Pulled By: dagitses

fbshipit-source-id: 70baa90d0834e68324504c74064a1d1790193483
2021-10-12 08:11:03 -07:00
71e17d9827 [DataPipe] Fix HttpReader IterDataPipe Issue with streaming (#66432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66432

This PR aims to fix the same issue that was addressed in TorchData.

See this [TorchData PR](https://github.com/pytorch/data/pull/51) and the corresponding [issue](https://github.com/pytorch/data/issues/42) for details.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31547565

Pulled By: NivekT

fbshipit-source-id: 1e0cb13be270e6b81a11af54fa08cf6d7e7c5721
2021-10-12 07:37:57 -07:00
5f1518609b [TensorExpr] Fix lowering for aten::t. (#65859)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65859

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D31289347

Pulled By: ZolotukhinM

fbshipit-source-id: b9648416238657fe23366928e43ed63e992a8973
2021-10-12 01:26:36 -07:00
6864146f2b [TensorExpr] Fix lowerings for aten::view and aten::reshape. (#65852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65852

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31286024

Pulled By: ZolotukhinM

fbshipit-source-id: eb5b5f2ed86b6f325f09904e841815b8183b4e1d
2021-10-12 01:26:34 -07:00
60a2a295ce [TensorExpr] Use schema instead of op name in NNC lowerings. (#65843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65843

Fixes #64963.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31282334

Pulled By: ZolotukhinM

fbshipit-source-id: ffd0e1b6433d9360fedd9081c01ef41b21684439
2021-10-12 01:26:32 -07:00
24b9b304d9 [TensorExpr] Nuke TE shape inference. (#65554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65554

We're relying on JIT based shape inference and not using the TE
implementation.

Question to the audience: we set `hasBroadcasts_` in that function, but
this function was almost never invoked. Do we behave correctly in the
presence of rand-calls and broadcasts?

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D31148925

Pulled By: ZolotukhinM

fbshipit-source-id: 2898a57e389ea0950163122089d0fec3d92701c4
2021-10-12 01:25:14 -07:00
18e4688199 [Pytorch Edge] Improve bundled inputs name error handling (#65856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65856

Occasionally functions don't have the `__name__` attribute set and have `name` set instead? Not sure why this happens, but this should catch it.

Test Plan: ci

Reviewed By: iseeyuan

Differential Revision: D31286787

fbshipit-source-id: 8a339541215329b6e9ff43ef77363be41f19c5ca
2021-10-12 00:08:39 -07:00
2d1552824a Revert D31386275: Migrate THCState to ATen
Test Plan: revert-hammer

Differential Revision:
D31386275 (a6774d6e1f)

Original commit changeset: 5c1f1bbe8c3d

fbshipit-source-id: bea4e80fb0bdc57e8bb6a8ee781afd224adf4ed0
2021-10-11 22:30:08 -07:00
d8532e3524 [PyTorch] Split c10 Type.cpp into two files to allow targets to include one of them (#66445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66445

`Type.cpp` implements the `demangle()` function based on the macro `HAS_DEMANGLE`. This diff splits it into two `.cpp` files so that we can add either one into the build target. This change follows the pattern of `flags_use_no_gflags.cpp` and `flags_use_gflags.cpp`.

Test Plan: Rely on CI

Reviewed By: iseeyuan

Differential Revision: D31551432

fbshipit-source-id: f8b11783e513fa812228ec873459ad3043ff9147
2021-10-11 21:52:24 -07:00
07ec250fd7 [deploy] fix oss build (#66347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66347

It turns out that our hard-coded build flavor that we were running
deploy tests on in CI no longer exists lol. This PR fixes the OSS build
and also updates the build flavor.

Differential Revision:
D31517679

Test Plan: Imported from OSS

Reviewed By: malfet, shunting314

Pulled By: suo

fbshipit-source-id: 763f126a3304f82e6dff7cff8c56414d82c54de3
2021-10-11 21:48:26 -07:00
9a85167d22 Fix batch_isend_irecv tests for err case (#63112)
Summary:
- `batch_isend_irecv` returns a list of requests instead of a single request
- remove some unused variables

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63112

Reviewed By: pbelevich, wayi1, fduwjj

Differential Revision: D30921265

fbshipit-source-id: e2075925172805d33974ef0de6fb631bdf33b5ea
2021-10-11 19:39:49 -07:00
3eb9443619 [FX] Fix issue where GraphModule.delete_all_unused_submodules deletes submodules from called leaf modules (#66430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66430

On the whole, I'm not totally satisfied with this approach. I think we should be building a prefix tree data structure during initial iteration over the submodules and querying that when deleting submodules. But I think this approach works and I want to see if we can get it in before 1.10

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D31546137

Pulled By: jamesr66a

fbshipit-source-id: f08b8409a3cf511277017ccccb916097b7c4c4fe
2021-10-11 19:37:51 -07:00
a6774d6e1f Migrate THCState to ATen (#65948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65948

This guts `THCState` to simply be an empty struct, as well as:
- moving `THCState_getPeerToPeerAccess` and its cache into `ATen`.
- cleaning up dead code in `THCGeneral.cpp`
- moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31386275

Pulled By: ngimel

fbshipit-source-id: 5c1f1bbe8c3d2d9f5b99996e0588fb7f07fa6a77
2021-10-11 19:31:43 -07:00
e7b5712c21 Call PyArray_Check only if NumPy is available (#66433)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66433

Reviewed By: seemethere, janeyx99

Differential Revision: D31548290

Pulled By: malfet

fbshipit-source-id: 3b094bc8195d0392338e0bdc6df2f39587b85bb3
2021-10-11 19:25:31 -07:00
565cf47abf Quantization docs: add pages for Numeric Suite (Eager and FX) (#66380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66380

Description:
1. creates doc pages for Eager and FX numeric suites
2. adds a link from main quantization doc to (1)
3. formats docblocks in Eager NS to render well
4. adds example code and docblocks to FX numeric suite

Test Plan:
```
cd docs
make html
python -m http.server
// renders well
```

Reviewed By: jerryzh168

Differential Revision: D31543173

Pulled By: vkuzo

fbshipit-source-id: feb291bcbe92747495f45165f738631fa5cbffbd
2021-10-11 18:47:58 -07:00
8b1258698e Improve quantization API docs (#66379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66379

Description:

Creates a quantization API reference and fixes all the docblock errors.

This is #66122 to #66210 squashed together

Test Plan:
```
cd docs
make html
python -m http.server
// open webpage, inspect it, looks good
```

Reviewed By: ejguan

Differential Revision: D31543172

Pulled By: vkuzo

fbshipit-source-id: 9131363d6528337e9f100759654d3f34f02142a9
2021-10-11 18:46:11 -07:00
88ed93c2ca Fix type checking errors in torch/quantization/fx/qconfig_utils.py (#66428)
Summary:
- [x] Fix the Pyre type checking errors in `torch/quantization/fx/qconfig_utils.py`
```
torch/quantization/fx/qconfig_utils.py:241:46 Incompatible variable type [9]: prepare_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/fx/qconfig_utils.py:267:46 Incompatible variable type [9]: convert_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
torch/quantization/fx/qconfig_utils.py:284:43 Incompatible variable type [9]: fuse_custom_config_dict is declared to have type `Dict[str, typing.Any]` but is used as type `None`.
```
Fixes the issue: [MLH-Fellowship/pyre-check/issues/73](https://github.com/MLH-Fellowship/pyre-check/issues/73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66428

Reviewed By: grievejia

Differential Revision: D31545215

Pulled By: 0xedward

fbshipit-source-id: 767ae7888854c2eec2ecf14855a5b011110b9271
2021-10-11 16:48:11 -07:00
25965619dd Back out "Revert D31495086: open source engine_layer_visualize.py" (#66431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66431

Original commit changeset: 186f3407a642

Test Plan: testinprod

Reviewed By: 842974287

Differential Revision: D31546998

fbshipit-source-id: 4bc131d895cc4a7a84a4ff277df5f99e69ef4346
2021-10-11 16:06:23 -07:00
ae5a9a451f Do not enforce unused vars rule for torch_deploy (#66447)
Summary:
Followup after  https://github.com/pytorch/pytorch/pull/66041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66447

Reviewed By: seemethere

Differential Revision: D31554356

Pulled By: malfet

fbshipit-source-id: 6638324dcf658f4b244da285b4360ff2e2e2c013
2021-10-11 15:24:19 -07:00
7baf4f6b12 Chunk: Converter (#66028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66028

Added converter and unit test for torch.chunk function

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_gelu

Reviewed By: 842974287

Differential Revision: D31345180

fbshipit-source-id: 9425685671b474449e825aa2a8e7e867a329eb6e
2021-10-11 14:33:25 -07:00
cc24e4e5d0 [NNC] Normalize loops in SplitWithTail (#66242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66242

While working on random test generation, I observed that many simple transformations were upsetting vectorization. Digging deeper, I found that it calls SplitWithTail, which incorrectly splits the loop when the loop start is not zero. This patch normalizes the loop before we start splitting it.
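
A worked example of the normalization in plain Python (intuition only, not the NNC IR): split-with-tail assumes iteration from zero, so a nonzero start must first be folded into the index computation:
```
start, stop, factor = 5, 18, 4
trip = stop - start                              # normalized trip count: 13

visited = []
for outer in range(trip // factor):              # 3 full chunks of 4
    for inner in range(factor):
        visited.append(start + outer * factor + inner)
for tail in range(trip - trip % factor, trip):   # 1 leftover iteration
    visited.append(start + tail)

assert visited == list(range(start, stop))       # same iterations, same order
```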

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31506853

Pulled By: anijain2305

fbshipit-source-id: 5c5f2568ce0a239bfaa515458be52541eafd23b1
2021-10-11 13:44:05 -07:00
49f1605392 [RFC] Reduce logging noise from AdagradOptimizer (#66443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66443

For some reason, this logging is adding noise to a lot of flow jobs. I am not sure if this is actually needed.
This is called from `__init__`, so it's logged all the time and logs all key:value pairs in the current local symbol table.

Test Plan: N/A

Reviewed By: chowarfb

Differential Revision: D31534372

fbshipit-source-id: bed032b66fed548c97a6f66b1b9e905fd2738851
2021-10-11 13:25:41 -07:00
c03f851750 [torchelastic] Fix failing tests (#66440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66440

* Set correct name for test worker executable
* Remove `test_get_override_executable` from oss; there is already a test that tests the functionality

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/launcher/fb:launch_test

Reviewed By: d4l3k

Differential Revision: D31544853

fbshipit-source-id: e1e009b4b38830d3a78981f8f93c2314ed851695
2021-10-11 13:06:36 -07:00
1d14fbdad7 [TensorExpr] Adding missing python binding for operators (#66336)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66336

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D31544865

Pulled By: anijain2305

fbshipit-source-id: 04be6cf079efc952d0f0b1e68f7f4da4a19c64fa
2021-10-11 12:47:41 -07:00
08fab7ae13 Wextra fix for Integration.cpp (#66321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66321

Fixes
```
stderr: caffe2/aten/src/ATen/native/Integration.cpp:62:27: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
    if (curr_shape.size() >= target_n_dim)
        ~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
stderr: caffe2/aten/src/ATen/native/Integration.cpp:62:27: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
    if (curr_shape.size() >= target_n_dim)
        ~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31505347

fbshipit-source-id: 100b76215f78c3ce75bf4a993715a6767189747d
2021-10-11 12:30:46 -07:00
8c468ce00b [PyTorch][JIT] Return a reference from caching specializations of getTypePtr (#66342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66342

`decltype(auto)` in D31486117 (fb5a80ffd8) wasn't the right choice in these specializations, because it will *still* deduce a copy.
See https://godbolt.org/z/GjbcPE1c4 for example.
ghstack-source-id: 140144199

Test Plan: CI, added new static_assert to make sure we got it right for std::tuple in particular

Reviewed By: hlu1, JasonHanwen

Differential Revision: D31514960

fbshipit-source-id: cae722aa34345b590c46eae478229cb5f4b0d7dc
2021-10-11 12:17:50 -07:00
998cb98844 [PyTorch][jit] Cache TupleType objects in getTypePtr (#66340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66340

For functions that take `std::vector`s with `std::tuple`s in them, `getTypePtr` can get hit on every call, in which case creating a new `TupleType` object every time is expensive.
ghstack-source-id: 140143104

Test Plan: CI

Reviewed By: hlu1, JasonHanwen

Differential Revision: D31514792

fbshipit-source-id: 23652ca90ba1259afc05e953b99ce1fe1bebcc2b
2021-10-11 12:16:31 -07:00
acb0157a3d Specialization for c10::util:get_type_index<std::string> (#66290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66290

Add full specialization for std::string type index

It slightly speeds up compilation and also resolves the ambiguity in how template instantiations implemented in inline namespaces are rendered during `__PRETTY_FUNCTION__` computation.

Not sure what `#pragma` controls this behaviour, but when code is compiled by clang-12+ using libstdc++, `__PRETTY_FUNCTION__` sometimes resolves `std::string` to `std::basic_string<char>` and sometimes to `std::__cxx11::basic_string<char>`, even though in the object file the symbol is always inside the `std::__cxx11::` namespace, which might break caffe2 serialization code that depends on dynamic hash generation.

Template name resolution was debugged using https://gist.github.com/malfet/c83b9ebd35730ebf8bac7af42682ea37

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: r-barnes

Differential Revision: D31490050

fbshipit-source-id: 127091574cf6b92c7ec3f972821e4e76f5f626a9
2021-10-11 11:11:59 -07:00
901df0cc22 Skip test_nccl_errors_nonblocking (#66394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66394

Skips this test as it currently does not seem to pass after several
internal local runs.
ghstack-source-id: 140210583

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31534806

fbshipit-source-id: 799849a6a715506a85c9697b46f7098d9b71b32e
2021-10-11 10:08:31 -07:00
221c308389 Wextra fix for LossCTC.cpp (#66381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66381

Fixes
```
stderr: caffe2/aten/src/ATen/native/cudnn/LossCTC.cpp:83:37: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const long' [-Werror,-Wsign-compare]
  TORCH_CHECK(input_lengths_.size() == batch_size, "input_lengths needs to have size to match batch_size");
              ~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31510217

fbshipit-source-id: e3585e08650950c08d80d347dfae375aedf2ceaf
2021-10-11 10:02:53 -07:00
736fa09a9a [Static Runtime] Manage output tensors (#65515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515

This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:

- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd+ iterations preallocate a slab of memory for all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies.

This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.

Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.

- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing test cases (e.g., `StaticRuntime.EmbeddingBag`).

Reviewed By: hlu1

Differential Revision: D31049221

fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
2021-10-11 09:50:54 -07:00
3b4b1b2d23 .github: Remove confusing ciflow_config.enabled variable (#66260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66260

Every workflow has ciflow enabled so this is not needed anymore

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: dagitses, janeyx99

Differential Revision: D31493340

Pulled By: seemethere

fbshipit-source-id: 8718fe5d22f4be6e0900962576782a9f23162a39
2021-10-11 09:39:31 -07:00
c66847afbe Add workaround for nvcc header dependencies bug (#62550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62550

I noticed that running the build twice in a row resulted in ~80 CUDA files being
rebuilt. Running `ninja -d explain` shows
```
ninja explain: TH/generic/THStorage.h is dirty
ninja explain: TH/generic/THStorageCopy.h is dirty
ninja explain: THC/generic/THCStorage.h is dirty
ninja explain: THC/generic/THCStorageCopy.h is dirty
ninja explain: TH/generic/THTensor.h is dirty
ninja explain: THC/generic/THCTensor.h is dirty
ninja explain: THC/generic/THCTensorCopy.h is dirty
ninja explain: THC/generic/THCTensorMath.h is dirty
ninja explain: THC/generic/THCTensorMathMagma.h is dirty
ninja explain: THC/generic/THCTensorMathPairwise.h is dirty
ninja explain: THC/generic/THCTensorScatterGather.h is dirty
```

considering `ninja` is working relative to the `build` folder, these files don't
actually exist. I traced this back to the output of `nvcc -MD` containing
paths relative to the include directory, instead of being absolute.

This adds a little script to launch the compiler then resolve any relative paths
in the `.d` file before `ninja` looks at it. To use it, I run the build with
```
export CMAKE_CUDA_COMPILER_LAUNCHER="python;`pwd`/tools/nvcc_fix_deps.py;ccache"
```

There are some possible pitfalls here. The same relative path might work for
two include directories, and the compiler could pick a different one. Or,
the compiler might have additional implicit include directories that are needed
to resolve the path. However, this has worked perfectly in my testing and it's
completely opt-in so should be fine.
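
The core of such a wrapper is small. A rough sketch of the idea — not the actual `tools/nvcc_fix_deps.py`, and assuming the include roots are known up front and the depfile is named via `-MF`:

```python
#!/usr/bin/env python3
# Rough sketch only -- not the actual tools/nvcc_fix_deps.py.
# Runs the real compiler command, then rewrites relative paths in the
# emitted .d depfile to absolute ones so ninja sees paths that exist.
import os
import subprocess
import sys

INCLUDE_DIRS = ["/usr/local/cuda/include"]  # assumption: known include roots

def resolve(token: str) -> str:
    if os.path.isabs(token):
        return token
    for root in INCLUDE_DIRS:
        candidate = os.path.join(root, token)
        if os.path.exists(candidate):
            return os.path.abspath(candidate)
    return token  # leave anything we cannot resolve untouched

if __name__ == "__main__":
    cmd = sys.argv[1:]  # the real compiler invocation, e.g. ["nvcc", ...]
    subprocess.check_call(cmd)
    if "-MF" in cmd:
        depfile = cmd[cmd.index("-MF") + 1]
        with open(depfile) as f:
            tokens = [t for t in f.read().split() if t != "\\"]
        with open(depfile, "w") as f:
            f.write(" ".join(resolve(t) for t in tokens) + "\n")
```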

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503351

Pulled By: malfet

fbshipit-source-id: b184c4526679d976b93829b5715cafcb1c7db2ae
2021-10-11 09:07:12 -07:00
c373387709 Update CMake and use native CUDA language support (#62445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445

PyTorch currently uses the old style of compiling CUDA in CMake which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as
a language just like C++ or C.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503350

fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
2021-10-11 09:05:48 -07:00
d3b29afbb6 Remove old code that is unused in test/ (#66331)
Summary:
.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66331

Reviewed By: gchanan

Differential Revision: D31533549

Pulled By: albanD

fbshipit-source-id: 5addd11edc4199a88f10f0ff236be59ec2289903
2021-10-11 08:45:24 -07:00
4775419850 [BE] Address feedback from #66296 (#66315)
Summary:
Also use range loop instead of regular one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66315

Reviewed By: albanD

Differential Revision: D31503730

Pulled By: malfet

fbshipit-source-id: f5568f7f28e15a9becd27986dd061a6fcae34651
2021-10-11 08:39:29 -07:00
822c0850cb fix pybind issue for get_autocast_cpu_dtype and get_autocast_gpu_dtype (#66396)
Summary:
There is an issue when calling **torch.get_autocast_cpu_dtype** and **torch.get_autocast_gpu_dtype**:
```
>>> torch.get_autocast_gpu_dtype()==torch.half
False
>>> torch.get_autocast_cpu_dtype()==torch.bfloat16
False
```
but the expected results should be:
```
>>> torch.get_autocast_gpu_dtype()==torch.half
True
>>> torch.get_autocast_cpu_dtype()==torch.bfloat16
True
```

This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66396

Reviewed By: ejguan

Differential Revision: D31541727

Pulled By: albanD

fbshipit-source-id: 1a0fe070a82590ef2926a517bf48046c2633d168
2021-10-11 08:34:48 -07:00
1b40daac74 pinv: forward/backward AD which is Frechet-defined in a rank-preserving neighborhood. (#66092)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65911. Also enables complex support/tests for `linalg_pinv` in OpInfo.
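
A quick sanity check of the new derivatives on a generically full-rank (hence rank-preserving) complex input might look like this — a sketch, not the actual OpInfo test, and `check_forward_ad` assumes a sufficiently recent gradcheck:

```python
import torch

# Random complex matrices are full rank almost surely, so pinv is
# differentiable in a rank-preserving neighborhood and gradcheck passes.
A = torch.randn(3, 3, dtype=torch.complex128, requires_grad=True)
torch.autograd.gradcheck(torch.linalg.pinv, (A,))
torch.autograd.gradcheck(torch.linalg.pinv, (A,), check_forward_ad=True)
```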

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66092

Reviewed By: ejguan

Differential Revision: D31503072

Pulled By: albanD

fbshipit-source-id: 52018e826826ae62beaad76becb5edf880be253f
2021-10-11 08:33:28 -07:00
7c2f53b363 [BE] set pretrained=False for onnx tests (#66312)
Summary:
Addresses the network risk mitigation mentioned in https://github.com/pytorch/pytorch/issues/65439#issuecomment-924627239.

I didn't include any mobile app/benchmarking changes because I think the pretrained weights matter there.

I ended up removing the changes in test_utils because those were sensitive to the pretrained variable.

I am saving the quantization test changes for another PR because they are currently disabled.
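
Illustratively, the change amounts to constructing test models without downloading weights, e.g.:

```python
import torchvision

# pretrained=False skips the network fetch of checkpoint weights; the
# model is randomly initialized, which is fine for export/shape tests.
model = torchvision.models.resnet18(pretrained=False)
model.eval()
```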

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66312

Reviewed By: ejguan

Differential Revision: D31542992

Pulled By: janeyx99

fbshipit-source-id: 57b4f70247af25cc96c57abd9e689c34641672ff
2021-10-11 08:29:11 -07:00
1d9a6862cd fx quant: add a BC test for loading old torch.package models (#65538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65538

Adds a test which verifies that `prepare_fx` and `convert_fx` work
on models created by `torch.package` in the past.  In detail:

1. (one time) create a model and save it with torch.package. Also save the input,
expected output, and names of quantization-related get_attrs added by
our passes.
2. (every time) load the model from (1), and verify that the expected output
matches the current output, and that the get_attr targets did not change. (A round-trip sketch follows below.)
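
A minimal sketch of that round-trip using the public torch.package API (file and resource names here are made up; the real test pins an archive produced in the past):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
example_input = torch.randn(2, 4)
expected_output = model(example_input)

# Step (1), run once: package the model plus reference data.
with torch.package.PackageExporter("bc_model.pt") as exp:
    exp.extern(["torch", "torch.**"])  # keep torch itself outside the package
    exp.intern("**")
    exp.save_pickle("model", "model.pkl", model)
    exp.save_pickle("data", "io.pkl", (example_input, expected_output))

# Step (2), run every time: load the old archive and compare outputs.
imp = torch.package.PackageImporter("bc_model.pt")
loaded = imp.load_pickle("model", "model.pkl")
inp, expected = imp.load_pickle("data", "io.pkl")
assert torch.allclose(loaded(inp), expected)
```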

Test Plan:
```
python test/test_quantization.py TestSerialization.test_linear_relu_package_quantization_transforms
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31512939

fbshipit-source-id: 718ad5fb66e09b6b31796ebe0dc698186e9a659f
2021-10-11 08:23:38 -07:00
0348148725 Update link to qnnpack in quantization doc. (#66226)
Summary:
The old repo has been archived.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66226

Reviewed By: vkuzo

Differential Revision: D31534712

Pulled By: ezyang

fbshipit-source-id: 4d7f070c8547aeb25464c72b25ed21f209821bc2
2021-10-11 08:19:19 -07:00
58fefa6516 Add pybind trampoline for ProcessGroup and Work (#66338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66338

This commit exposes the c10d extension API to Python land. Users can
now override c10d communication behaviors in pure Python, and no
longer need to go through the cpp extension steps.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31514351

Pulled By: mrshenli

fbshipit-source-id: a8b94af0af7960c078e1006c29b25f7f3bd86c81
2021-10-11 06:41:06 -07:00
bc06eefebe [reland] Allow external CUDA streams to be set as current (#66324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66324

Fixes https://github.com/pytorch/pytorch/issues/65822.

Reland of https://github.com/pytorch/pytorch/pull/65914.
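
A rough usage sketch (requires a CUDA device; here the raw pointer is borrowed from a PyTorch stream purely to have a valid handle, whereas in practice it would come from a third-party library):

```python
import torch

# Stand-in for a cudaStream_t created elsewhere: borrow the raw pointer
# from a PyTorch stream just to have something valid to wrap.
donor = torch.cuda.Stream()
ext = torch.cuda.ExternalStream(donor.cuda_stream)

with torch.cuda.stream(ext):        # the external stream becomes current
    x = torch.ones(4, device="cuda")
    y = x * 2                       # kernel launches on the external stream
ext.synchronize()
```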
ghstack-source-id: 140105651

Test Plan: Added tests

Reviewed By: ngimel

Differential Revision: D31506134

fbshipit-source-id: ff56203a120befdb282e974309478ac11aa56652
2021-10-11 02:41:43 -07:00
355acfdebc [PyTorch Edge][tracing-based] use operator.yaml to build libtorch library (#66237)
Summary:
https://pxl.cl/1QK3N
Enable using the YAML file from the tracer to build the libtorch library for iOS and Android.

1. Android:
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1  ./scripts/build_pytorch_android.sh x86
```
libtorch_lite.so x86: 3 MB (larger than H1, static is ~3.2 MB)

2. iOS
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1 BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR  ./scripts/build_ios.sh
```
Binary size: 7.6 MB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66237

ghstack-source-id: 140197164

Reviewed By: dhruvbird

Differential Revision: D31463119

fbshipit-source-id: c3f4eb71bdef1969eab6cb60999fec8547641cbd
2021-10-10 14:07:01 -07:00
9971113340 Revert D31447612: Create a documentation page for FX graph mode quantization APIs
Test Plan: revert-hammer

Differential Revision:
D31447612 (a89ac3138e)

Original commit changeset: 07d0a6137f15

fbshipit-source-id: f2cba7d835011500580b4ab9cff72171280ee18b
2021-10-10 01:51:13 -07:00
b85fd4c54f Revert D31447613: Create separate documentation pages for quantization observers and fake_quants
Test Plan: revert-hammer

Differential Revision:
D31447613 (f0fa3d1110)

Original commit changeset: 63b4cf518bad

fbshipit-source-id: 67de592d1e12a5149cdb22b0725caad063f94476
2021-10-10 01:51:11 -07:00
10633460ce Revert D31447614: Create a documentation page for torch.ao.quantization.QConfig
Test Plan: revert-hammer

Differential Revision:
D31447614 (7332ed13ed)

Original commit changeset: 5d9dd2a4e864

fbshipit-source-id: 6ac15a956222ca61f7fbb75ed36bcc58b23f0f36
2021-10-10 01:51:09 -07:00
037ac2330e Revert D31447616: Quantization docs: consilidate all API references on a single page
Test Plan: revert-hammer

Differential Revision:
D31447616 (fe86f0e068)

Original commit changeset: 2f9c4dac2b2f

fbshipit-source-id: 673368e87399f0a25441688bb9356de5a2f3e66e
2021-10-10 01:51:07 -07:00
09c3e6002b Revert D31447615: Quantization docs: rewrite API reference to be more automated
Test Plan: revert-hammer

Differential Revision:
D31447615 (7d2526ab20)

Original commit changeset: 09874ad9629f

fbshipit-source-id: 0963c9f5118e243cd299f8cded2bf7b0848a7105
2021-10-10 01:51:05 -07:00
df1858bea5 Revert D31447611: Quantization documentation: move backend section down
Test Plan: revert-hammer

Differential Revision:
D31447611 (309a8cf46c)

Original commit changeset: 537b146559bc

fbshipit-source-id: c400aef9a2ea5d18f8076879fe6354be7a6732f1
2021-10-10 01:51:03 -07:00
ad0accdecd Revert D31447610: Quantization docs: add pages for Numeric Suite (Eager and FX)
Test Plan: revert-hammer

Differential Revision:
D31447610 (9539e6216b)

Original commit changeset: 441170c4a6c3

fbshipit-source-id: b49bff54405cdb8465397077e38506a36b277921
2021-10-10 01:49:19 -07:00
291d463cf9 Revert D31495086: open source engine_layer_visualize.py
Test Plan: revert-hammer

Differential Revision:
D31495086 (150b7c7410)

Original commit changeset: 1f5505d6baac

fbshipit-source-id: 186f3407a6423f0981f0b7a2e7408ce53013fceb
2021-10-10 01:45:21 -07:00
0e0c98077f [quantized] Implement 3d convolution in qnnpack (#66350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66350

Implements conv3d for QNNPACK by writing another kernel for the indirection buffer in 3 dimensions. Modifies all structs to take depth, with depth = 1 indicating 2d operation. gemm and conv (non-transposed) work; next up are depthwise and transpose.
ghstack-source-id: 140152440

Test Plan: test/quantization

Reviewed By: kimishpatel

Differential Revision: D30858693

fbshipit-source-id: 883cca8ec53b9e15ab4b9473c6cc042e3d049d9c
2021-10-09 12:28:24 -07:00
150b7c7410 open source engine_layer_visualize.py (#66301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66301

Test Plan: testinprod

Reviewed By: 842974287

Differential Revision: D31495086

fbshipit-source-id: 1f5505d6baac66eca11a35ce9532d6c7c7513190
2021-10-09 10:25:03 -07:00
27f193af64 Automated submodule update: kineto (#59674)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: 6f9c0eeff5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59674

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: larryliu0820

Differential Revision: D28977762

fbshipit-source-id: d441d4d46a7044cc05eb8b21e59471deee312e02
2021-10-09 09:34:32 -07:00
84326ef059 Remove native_functions.yaml dependency from binary ops (#64169)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64169

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728586

Pulled By: dagitses

fbshipit-source-id: 17d645b6712815d1967b9ff83eecc4d16833ee6b
2021-10-09 09:25:48 -07:00
9539e6216b Quantization docs: add pages for Numeric Suite (Eager and FX) (#66222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66222

Description:
1. creates doc pages for Eager and FX numeric suites
2. adds a link from main quantization doc to (1)
3. formats docblocks in Eager NS to render well
4. adds example code and docblocks to FX numeric suite

Test Plan:
```
cd docs
make html
python -m http.server
// renders well
```

Reviewed By: jerryzh168

Differential Revision: D31447610

Pulled By: vkuzo

fbshipit-source-id: 441170c4a6c3ddea1e7c7c5cc2f1e1cd5aa65f2f
2021-10-09 06:46:06 -07:00
309a8cf46c Quantization documentation: move backend section down (#66210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66210

Description:

Moves the backend section of the quantization page further down,
to ensure that the API description and reference sections are closer
to the top.

Test Plan:
```
cd docs
make html
python -m http.server
// renders well
```

Reviewed By: jerryzh168

Differential Revision: D31447611

Pulled By: vkuzo

fbshipit-source-id: 537b146559bce484588b3c78e6b0cdb4c274e8dd
2021-10-09 06:46:04 -07:00
7d2526ab20 Quantization docs: rewrite API reference to be more automated (#66201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66201

Description:

This PR switches the quantization API reference to use `autosummary`
for each section. We define the sections and manually write a list
of modules/functions/methods to include, and sphinx does the rest.
The result is a single page where we have every quantization function
and module with a quick autogenerated blurb, and users can click
through to each of them for a full documentation page.

This mimics how the `torch.nn` and `torch.nn.functional` doc
pages are set up.

In detail, for each section, this PR:
* creates a new section using `autosummary`
* adds all modules/functions/methods which were previously in the manual section
* adds any additional modules/functions/methods which are public facing but not previously documented
* deletes the old manual summary and all links to it

Test Plan:
```
cd docs
make html
python -m http.server
// renders well, links work
```

Reviewed By: jerryzh168

Differential Revision: D31447615

Pulled By: vkuzo

fbshipit-source-id: 09874ad9629f9c00eeab79c406579c6abd974901
2021-10-09 06:46:02 -07:00
fe86f0e068 Quantization docs: consolidate all API references on a single page (#66198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66198

Consolidates all API reference material for quantization on a single
page, to reduce duplication of information.

Future PRs will improve the API reference page itself.

Test Plan:
```
cd docs
make html
python -m http.server
// renders well
```

Reviewed By: jerryzh168

Differential Revision: D31447616

Pulled By: vkuzo

fbshipit-source-id: 2f9c4dac2b2fb377568332aef79531d1f784444a
2021-10-09 06:46:00 -07:00
7332ed13ed Create a documentation page for torch.ao.quantization.QConfig (#66129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66129

Adds a documentation page for `torch.ao.quantization.QConfig`. It is useful
for this to have a separate page since it is shared between Eager and FX graph
mode quantization.

Also, ensures that all important functions and module attributes in this
module have docstrings, so users can discover these without reading the
source code.

Test Plan:
```
cd docs
make html
python -m http.server
// open webpage, inspect it, renders correctly
```

Reviewed By: jerryzh168

Differential Revision: D31447614

Pulled By: vkuzo

fbshipit-source-id: 5d9dd2a4e8647fa17b96cefbaae5299adede619c
2021-10-09 06:45:58 -07:00
f0fa3d1110 Create separate documentation pages for quantization observers and fake_quants (#66125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66125

Before this PR, the documentation for observers and fake_quants was inlined in the
Eager mode quantization page.  This was hard to discover, especially
since that page is really long, and we now have FX graph mode quantization reusing
all of this code.

This PR moves observers and fake_quants into their own documentation pages. It also
adds docstrings to all user facing module attributes such as the default observers
and fake_quants, so people can discover them from documentation without having
to inspect the source code.

For now, this enables autoformatting (which means all public classes, functions, and members
with docstrings will get docs). If we need to exclude something in these files from
the docs in the future, we can go back to manual docs.

Test Plan:
```
cd docs
make html
python -m http.server
// inspect docs on localhost, renders correctly
```

Reviewed By: dagitses

Differential Revision: D31447613

Pulled By: vkuzo

fbshipit-source-id: 63b4cf518badfb29ede583a5c2ca823f572c8599
2021-10-09 06:45:56 -07:00
a89ac3138e Create a documentation page for FX graph mode quantization APIs (#66122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66122

Description:

Adds a documentation page for FX graph mode quantization APIs which
reads from the docstrings in `quantize_fx`, and links it from the main
quantization documentation page.

Also, updates the docstrings in `quantize_fx` to render well with reStructuredText.

Test Plan:
```
cd docs
make html
python -m http.server
// open webpage, inspect it, looks good
```

Reviewed By: dagitses

Differential Revision: D31447612

Pulled By: vkuzo

fbshipit-source-id: 07d0a6137f1537af82dce0a729f9617efaa714a0
2021-10-09 06:44:38 -07:00
b96c7aea73 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31527108

fbshipit-source-id: 40360ebf92e67fd95613cedea9988fbe52519de6
2021-10-09 06:03:49 -07:00
109aa135e6 Remove apparently unnecessary std::remove_cv_t (#66254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66254

`std::decay_t` already implies dropping the const

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31465856

fbshipit-source-id: 851cdb9194354fe9a89b3a37a4463a43dbbcd77a
2021-10-09 00:38:44 -07:00
4cb4d11e0b Disable "-Wignored-qualifiers" for vec256_bfloat16.h (#66279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66279

This error appears when compiling with "-Wextra" and cannot be resolved by fixing the code, since the return type of the intrinsic being passed to `map` is fixed.

Fixes:
```
caffe2/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:204:28: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
  Vectorized<BFloat16> map(const __m256 (*const vop)(__m256)) const {
                           ^~~~~~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31480888

fbshipit-source-id: 919c0d48c8ce13ce1106a9df124a077945e36707
2021-10-08 21:47:41 -07:00
3fe5895a00 Back out "Revert D30599136: [Pytorch Edge][tracing-based] build tracer in OSS" (#66267)
Summary:
Previously https://github.com/pytorch/pytorch/pull/64087 broke the test `binary_macos_wheel_3_7_cpu_build`, because the wheel build is not happy with `model_tracer`. Considering it's a prototype and there is no need to ship model_tracer via the wheel at the moment, the tracer is built behind the `TRACING_BASED` option. When tracing-based builds are mature enough, we can ship the tracer binary via the wheel eventually.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66267

Original commit changeset: 8ac3d75a52d0
ghstack-source-id: 140122106

Test Plan:
binary_macos_wheel_3_7_cpu_build passes

{F668643831}

Reviewed By: dhruvbird

Differential Revision: D31478593

fbshipit-source-id: 726cab1b31c4596f6268b7824eecb20e2e59d161
2021-10-08 20:12:12 -07:00
1763c25414 [PyTorch][jit] Fix excess refcounting in TupleType::compare (#66286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66286

No need to take refcount bumps on each comparator call.

Test Plan: CI, review

Reviewed By: hlu1, JasonHanwen

Differential Revision: D31487058

fbshipit-source-id: 98d2447ac27a12695cb0ebe1e279a6b50744ff4f
2021-10-08 20:08:07 -07:00
fb5a80ffd8 [jit] Don't force refcount bumps from getTypePtr (#66282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66282

Now that a bunch of the `FooType::get()` functions return a const reference, we can forward that behavior through `getTypePtr()` using return type deduction.

Test Plan: Inspect assembly for List_test.cpp before/after the rest of the change; reference counting is no longer in the happy path.

Reviewed By: hlu1, JasonHanwen

Differential Revision: D31486117

fbshipit-source-id: 863b677bb6685452a5b325d327bdc2a0a09627bf
2021-10-08 20:06:43 -07:00
85b562dd2b Fix type checking errors in fx/utils.py (#66311)
Summary:
- [x] Fix the Pyre type checking errors in `torch/quantization/fx/utils.py`
```
torch/quantization/fx/utils.py:490:4 Incompatible variable type [9]: target_module_type is declared to have type `Type[nn.modules.module.Module]` but is used as type `None`.
```
Fixes the issue: [MLH-Fellowship/pyre-check/issues/75](https://github.com/MLH-Fellowship/pyre-check/issues/75)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66311

Reviewed By: pradeep90

Differential Revision: D31506399

Pulled By: 0xedward

fbshipit-source-id: 3d866fba6005452378d4a2613b8689fa2d7a8b67
2021-10-08 19:14:22 -07:00
e5f6f356da [hpc infer] fix bench perf number
Reviewed By: yinghai, jianyuh

Differential Revision: D31505288

fbshipit-source-id: e4951a7c5813e0ee38903dec4cef61531f1b4059
2021-10-08 19:11:04 -07:00
904fbadaff Fix merge conflict in bc tests (#66356)
Summary:
BC test currently broken on trunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66356

Reviewed By: malfet

Differential Revision: D31523340

Pulled By: janeyx99

fbshipit-source-id: a8d1ff697f017c710f70a76b5bb6a2f89d7637c7
2021-10-08 18:45:15 -07:00
5a67ffe0ad [PyTorch][Static Runtime] Combine ProcessedNode::{native_,}fn_ (#65414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65414

Saves 24 bytes (`sizeof(std::function) - 8`) per ProcessedNode.
ghstack-source-id: 139999909

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31085561

fbshipit-source-id: 70734b8319e805736ba41aedaaf7fa3d463400c9
2021-10-08 18:11:59 -07:00
566922bbcd clean up mypy nit in torch/jit/_recursive.py (#66253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66253

This was initially broken in #65829 and unbroken in #66003; this PR cleans
it up by removing the mypy ignore line.

Test Plan:
```
mypy torch/jit/_recursive.py --no-incremental
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31475100

fbshipit-source-id: 46ab2ede72c08b926f4f9a6b03b1a1375b884c8a
2021-10-08 18:07:33 -07:00
4a302a3074 Wextra fix for CUDAApplyUtils.cuh (#66323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66323

Fixes
```
/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/caffe2/aten/src/ATen/cuda/CUDAApplyUtils.cuh:310:48: error: comparison of integers of different signs: 'unsigned long' and 'int' [-Werror,-Wsign-compare]
  const IndexType bOffset = sizeof...(Offsets) < n ?
                            ~~~~~~~~~~~~~~~~~~ ^ ~
/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/caffe2/aten/src/ATen/cuda/CUDAApplyUtils.cuh:306:48: error: comparison of integers of different signs: 'unsigned long' and 'int' [-Werror,-Wsign-compare]
  const IndexType aOffset = sizeof...(Offsets) < n ?
                            ~~~~~~~~~~~~~~~~~~ ^ ~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31505428

fbshipit-source-id: 326fa8f41f2b200981eddc5cab035b18536cd24e
2021-10-08 18:02:09 -07:00
0a48f56318 Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"
Test Plan: revert-hammer

Differential Revision:
D31299350 (f1f3bd8c36)

Original commit changeset: 9ad5c8fa17f7

fbshipit-source-id: d63d889922f507a4a0e2e042e451b95b9591c317
2021-10-08 17:55:28 -07:00
c62ed96496 Revert D30710710: [Pytorch Edge] Support profiling kineto events from external source
Test Plan: revert-hammer

Differential Revision:
D30710710 (c1343ff706)

Original commit changeset: 51399f9b0b64

fbshipit-source-id: ab6bb8fe4e83ed1052e621e427259192a4f0f540
2021-10-08 17:46:18 -07:00
c957d9fdf6 Replace _baddbmm_mkl_ with cpublas::gemm_batched (#66165)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66165

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31493952

Pulled By: ngimel

fbshipit-source-id: 87cf79036c2d0f4955edbeeeb78f578b0fd223ab
2021-10-08 17:12:14 -07:00
51835bec07 Wextra fix 1 for caffe2 (#66272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66272

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31475543

fbshipit-source-id: f6e02d299d0b792ddb37534ad85db82af65bb42a
2021-10-08 16:36:13 -07:00
a28b038af4 [ao_migration] torch/nn/intrinsic: torch.quantization -> torch.ao.quantization (#65903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65903

This changes the imports in `caffe2/torch/nn/intrinsic` to use the new import locations.

```
codemod -d torch/nn/intrinsic --extensions py 'torch.quantization' 'torch.ao.quantization'
```

Test Plan: `python test/run_test.py`

Reviewed By: albanD

Differential Revision: D31301195

fbshipit-source-id: a5a9d84cb1ac33df6c90ee03cda3e2f1c5d5ff51
2021-10-08 16:21:23 -07:00
2daae532bd [ao_migration] torch/nn/qat: torch.quantization -> torch.ao.quantization (#65902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65902

This changes the imports in `caffe2/torch/nn/qat` to use the new import locations.

```
codemod -d torch/nn/qat --extensions py 'torch.quantization' 'torch.ao.quantization'
```

Test Plan: `python test/run_test.py`

Reviewed By: jerryzh168

Differential Revision: D31301196

fbshipit-source-id: ff237790d74cd3b3b5be642a997810f4f439a1d8
2021-10-08 16:21:21 -07:00
1a6482ee2a [ao_migration] torch/nn/quantizable: torch.quantization -> torch.ao.quantization (#65901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65901

This changes the imports in `caffe2/torch/nn/quantizable` to use the new import locations.

```
codemod -d torch/nn/quantizable --extensions py 'torch.quantization' 'torch.ao.quantization'
```

Test Plan: `python test/run_test.py`

Reviewed By: jerryzh168

Differential Revision: D31301194

fbshipit-source-id: 8ce8a3015ea61da62d7658846d1ca64fbdabaf7a
2021-10-08 16:21:19 -07:00
b23709df03 [ao_migration] torch/nn/quantized: torch.quantization -> torch.ao.quantization (#65900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65900

This changes the imports in `caffe2/torch/nn/quantized` to use the new import locations.

```
codemod -d torch/nn/quantized --extensions py 'torch.quantization' 'torch.ao.quantization'
```

Test Plan: `python test/run_test.py`

Reviewed By: jerryzh168

Differential Revision: D31301193

fbshipit-source-id: 58efb1ad51a8b441e2a3bd5b91af11eab6b9331f
2021-10-08 16:19:53 -07:00
f1f3bd8c36 Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor" (#65883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65883

Original commit changeset: d8e962b8aab6
ghstack-source-id: 139836954

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D31299350

fbshipit-source-id: 9ad5c8fa17f7038ba579cb1eda6d9271ac07a130
2021-10-08 16:04:20 -07:00
c1343ff706 [Pytorch Edge] Support profiling kineto events from external source (#64397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64397

This diff exposes a way to add events to the kineto profiler from an external
source.
This can be a backend that executes a subgraph and wants to record this
execution in the kineto profiler.
This diff also adds "backend" metadata to identify the backend an event
would have executed on.

Test Plan:
test_lite_interpreter

Imported from OSS

Reviewed By: raziel

Differential Revision: D30710710

fbshipit-source-id: 51399f9b0b647bc2d0076074ad4ea9286d0ef3e2
2021-10-08 15:59:42 -07:00
8a02d3e5d0 Wextra fix for Tensorshape.cpp (#66320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66320

Fixes
```
stderr: caffe2/aten/src/ATen/native/TensorShape.cpp:619:36: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'long' [-Werror,-Wsign-compare]
    for (size_t offset = 0; offset < numel; offset++) {
                            ~~~~~~ ^ ~~~~~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31505374

fbshipit-source-id: 0fc393dacd72a8b29c0d82561f730cc047b38f0c
2021-10-08 15:03:47 -07:00
731cf494f2 Remove cuda/Loops.cuh dependency on native_functions.yaml (#64168)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64168

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728582

Pulled By: dagitses

fbshipit-source-id: 99dcbb9bb790dd0440d498593ac43e2c18e54a0c
2021-10-08 12:58:52 -07:00
92ce188510 Revert D31445799: [nnc] Use given kernel function name while emitting code
Test Plan: revert-hammer

Differential Revision:
D31445799 (c30dc52739)

Original commit changeset: 8d1642098313

fbshipit-source-id: 6b9d8c816437e9fcba8eb19cc683bc0a46a04cf5
2021-10-08 12:39:01 -07:00
2e6fa0261f Revert D31445797: [nnc] Added a cache to use singleton instances of PytorchLLVMJIT for every triple,cpu,attrs combination
Test Plan: revert-hammer

Differential Revision:
D31445797 (7e5ef5e517)

Original commit changeset: 4e1450100928

fbshipit-source-id: fc13b34dbb66c7a22816eb46cf6d98ae9f332d39
2021-10-08 12:38:59 -07:00
097fdcdf0c Revert D31445798: [Static Runtime] Cleanup LLVMCodeGen memory after code gen completes
Test Plan: revert-hammer

Differential Revision:
D31445798 (40dd2711b6)

Original commit changeset: c860d36456b2

fbshipit-source-id: 64d900cad87113e6b871aedd6669e771a7ede5cc
2021-10-08 12:37:48 -07:00
0be36d798b Remove Tensor.h include from TensorIterator.h (#64167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64167

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30728579

Pulled By: dagitses

fbshipit-source-id: 3888da00c9c8030013c8f4b39d300fe671defb05
2021-10-08 12:28:37 -07:00
bc1dec9b81 Migrate THCStorage_resizeBytes to ATen (#65944)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65944

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31386276

Pulled By: ngimel

fbshipit-source-id: a2b28bc09d11a856fdd3796d3df6f96613f13437
2021-10-08 11:50:52 -07:00
3bad54069b Concatting multiple linear layers with same input Tensor (different weight/bias) (#63198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63198

Linear layers using the same input tensor can be concatenated together
as long as their weights and biases are compatible.
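
The underlying identity is easy to verify; a sketch of why concatenating along the output dimension is sound:

```python
import torch

x = torch.randn(5, 10)
a = torch.nn.Linear(10, 3)
b = torch.nn.Linear(10, 4)

# Stacking weights/biases along the output dim fuses the two matmuls
# into one; the result equals the concatenation of the separate outputs.
w = torch.cat([a.weight, b.weight], dim=0)   # (7, 10)
bias = torch.cat([a.bias, b.bias], dim=0)    # (7,)
fused = torch.nn.functional.linear(x, w, bias)

assert torch.allclose(fused, torch.cat([a(x), b(x)], dim=1), atol=1e-6)
```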

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31240642

fbshipit-source-id: 1e78daa6b89822412ba2513d326ee0e072ceff1e
2021-10-08 10:55:46 -07:00
94845fc44e [jit] Implement one-argument AliasDb::mayContainAlias more efficiently (#65177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65177

There is no need to heap-allocate any vectors in this case.
ghstack-source-id: 140052520

Test Plan:
CI

Startup for static runtime on ctr_mobile_feed local net decreased from 7.8s to about 7.0s

Reviewed By: malfet

Differential Revision: D30984194

fbshipit-source-id: 85091e55445f653ec728b27da4b459a2f1873013
2021-10-08 10:29:25 -07:00
c80693f7e6 [jit] Add cache for MemoryDAG::collectAllContainedMemoryLocations (#65122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65122

Failure to cache this seems to contribute to quadratic startup time for the static runtime.

Disclaimer: I am entirely un-versed in the performance considerations for the JIT and have no idea what the other impacts of this change may be. Let the reviewer beware.
ghstack-source-id: 140052522

Reviewed By: suo

Differential Revision: D30983268

fbshipit-source-id: 4329aee6b5781f5c2e2d2334c396fab8528d4b7b
2021-10-08 10:29:23 -07:00
3ef69a4598 [static runtime] Pre-allocate hash tables (#65343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65343

No reason not to save a bit on re-hashing.
ghstack-source-id: 140052518

Test Plan:
CI

Static runtime startup seems to go from 5.9-6.0s to 5.8s-6.0s, perf shows less time spent rehashing

Reviewed By: mikeiovine

Differential Revision: D31027362

fbshipit-source-id: 39dd53ecd462693b518535856ddd92df78a4977b
2021-10-08 10:28:13 -07:00
0020a151c6 slow_conv3d grad_weight: call gemm directly (#65759)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65759

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31257873

Pulled By: ngimel

fbshipit-source-id: 1612c0be10b2aa269c807c7b9f5470172ed68dc1
2021-10-08 09:55:08 -07:00
dfb64b3287 log API usage for fsdp API in PyTorch (#64964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64964

log API usage for fsdp API in PyTorch

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D30915734

fbshipit-source-id: 5e3b335327f4a3ff59b025e8e17a0fa0b7f6597d
2021-10-08 09:32:59 -07:00
201174cb91 Revert D31389480: [pytorch][PR] Allow external CUDA streams to be set as current
Test Plan: revert-hammer

Differential Revision:
D31389480 (61f0bb70c1)

Original commit changeset: 2b2f40e5452c

fbshipit-source-id: c6631e51abcf3819732f981f646cb77b91569c7d
2021-10-08 09:20:24 -07:00
b72a1782d8 [PG Wrapper][BE] Add collective information when monitored barrier error is (#66167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66167

Sometimes due to desync we see PG wrapper monitored barrier fail. In
this case it would be useful to print the info about the collective that was
trying to run along with the actual error.
ghstack-source-id: 140037653

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D31353021

fbshipit-source-id: e2a515326c9314c98119978d5566eb5431cca96c
2021-10-08 09:14:24 -07:00
b5b1d49a66 [PG Wrapper][BE] Make some methods private (#66166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66166

These methods should be private.
ghstack-source-id: 139782587

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D31353020

fbshipit-source-id: 583fb315cc2cacc37df3d29cd5793b42558930b3
2021-10-08 09:13:02 -07:00
0cad2c0615 Move intraop_launch_future from Parallel.h (#64166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64166

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728585

Pulled By: dagitses

fbshipit-source-id: 75a41418ae9218bec9bac27597051295222b6eee
2021-10-08 09:07:35 -07:00
2d885ab73d [jit] Reduce refcounting of Types (#65345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345

FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165

Test Plan:
CI

perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.

Reviewed By: hlu1

Differential Revision: D31027361

fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
2021-10-08 09:03:04 -07:00
1ae468a484 [jit] Refcounting spot fixes (#65346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65346

Tidying up the top sources of reference count decrements seen during static runtime startup.
ghstack-source-id: 140027349

Test Plan:
CI

perf now shows under 2% of time spent in ~__shared_count instead of about 5%.

Reviewed By: suo

Differential Revision: D31057277

fbshipit-source-id: 9a16daf2e655fda80d4ec21290b30f02ba63d8da
2021-10-08 08:39:20 -07:00
8ebe1a924d [DataPipe] moving mux IterDataPipe test to the right location (#66277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277

Previously, it was grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485823

Pulled By: NivekT

fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
2021-10-08 08:32:29 -07:00
ed17851642 [DataPipe] adding test for IterableWrapperIterDataPipe (#66276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66276

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485824

Pulled By: NivekT

fbshipit-source-id: c7b21636e4b17e264bfb5dbea69cd3c477472f0b
2021-10-08 08:32:26 -07:00
e808e3d3d6 [DataPipe] adding SequenceWrapperMapDataPipe (#66275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275

Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
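
A sketch of the intended usage (import path as of this era of the codebase; it may differ in later releases):

```python
from torch.utils.data.datapipes.map import SequenceWrapper

# Wraps any sequence into a MapDataPipe: indexable, with a length.
dp = SequenceWrapper(["a", "b", "c"])
assert len(dp) == 3 and dp[1] == "b"
```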

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485822

Pulled By: NivekT

fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
2021-10-08 08:32:24 -07:00
a7cc07f109 quantized embedding: make error message clearer (#66051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66051

Make the error message clearer when quantized embedding is converted
with an unsupported dtype. This is helpful when debugging quantization
errors on new models.

Test Plan:
```
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1, 1)

m = M().eval()
m.qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(dtype=torch.qint8),
    weight=torch.quantization.MinMaxObserver.with_args(dtype=torch.qint8))
m.embedding.qconfig = m.qconfig
mp = torch.quantization.prepare(m)
mq = torch.quantization.convert(m)
// error message now includes the incorrect dtype
```

Imported from OSS

Reviewed By: dagitses

Differential Revision: D31472848

fbshipit-source-id: 86f6d90bc0ad611aa9d1bdae24497bc6f3d2acaa
2021-10-08 08:32:22 -07:00
c9aba3b128 make error message when trying to quantize non floats more specific (#66050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66050

Adds the dtype to an error message when trying to quantize something
other than a float.  This is useful for debugging quantization tools on
new models.

Test Plan:
```
x = torch.randn(1, 1, 1, 1, dtype=torch.double)
xq = torch.quantize_per_tensor(x, 0.01, 0, torch.quint8)
// error message now includes Double
```

Imported from OSS

Reviewed By: dagitses

Differential Revision: D31472849

fbshipit-source-id: 2331ffacefcbc6f8eca79694757d740de74a0f1d
2021-10-08 08:32:19 -07:00
81660c08f0 quantized add: enable broadcasting (#66049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66049

Enables quantized add with broadcasting. As pointed out by jamesr66a,
this was disabled but TensorIterator already supports it. Added a test
case to verify.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qadd_broadcast
```

Imported from OSS

Reviewed By: dagitses

Differential Revision: D31472850

fbshipit-source-id: a3b16d9000487918db743525d22db6864330762b
2021-10-08 08:31:07 -07:00
ece0221854 Rename int to long, add more C++ types. (#66108)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66108

BC-breaking change: intT is now longT (which aligns it more accurately with how
the types are referred to in C++). The benefit is that we can idiomatically
express all C++ dtypes (with intT now mapping to int32_t). These types are needed
for ufunc codegen in a later patch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31385761

Pulled By: ezyang

fbshipit-source-id: ec6f3a0953794313470dbe14911f23ac116be425
2021-10-08 08:25:06 -07:00
11bc435622 Allow registration of custom symbolics for prim namespace (#64460) (#66139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66139

[ONNX] Add prim::PythonOp check back in export.cpp (#64944)

Add prim::PythonOp check back in export.cpp

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31424102

fbshipit-source-id: 6d2eef767fab846ed79ea509e97b714072bac9f4

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-10-08 07:41:06 -07:00
9b09a5f7ba [ONNX] Enable scripting tests (#64780) (#66138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66138

* Scripting tests

* Fixed scripting tests for lower opsets

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31424099

fbshipit-source-id: 67095b7ac67b9da986961788392aa92c95cf11f2
2021-10-08 07:41:03 -07:00
53fefaa916 [ONNX] Fix duplicated output same name case (#64190) (#66137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66137

* Fix the issue where duplicated output nodes shared the same output name.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31424100

fbshipit-source-id: b1b06a92c51744030788b651f3a597d987a8deda

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-10-08 07:41:01 -07:00
4af47eb3a7 [ONNX] Update slice process shape to support rank only inference (#65782) (#66149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66149

The updated logic can infer the rank of the slice output when only the rank of the slice input is known. This enables cases where `ConstantValueMap::HasRank(input)` is `True` while `ConstantValueMap::HasShape(input)` is `False`.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31423232

Pulled By: ezyang

fbshipit-source-id: 516e3916aa71afda2b10e44620636e42ed837236

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-10-08 07:39:40 -07:00
dc37547c44 Opinfos for avg_pooling (#64214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64214

Added OpInfos for:
- F.adaptive_avg_pool{1, 3}d
- F.avg_pool{1, 3}d

The 2d variants already had OpInfos.
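
For reference, the newly covered ops behave like this (illustrative shapes only):

```python
import torch
import torch.nn.functional as F

x1 = torch.randn(2, 3, 10)        # (N, C, L)
x3 = torch.randn(2, 3, 4, 8, 8)   # (N, C, D, H, W)

assert F.avg_pool1d(x1, kernel_size=2).shape == (2, 3, 5)
assert F.adaptive_avg_pool3d(x3, output_size=(2, 4, 4)).shape == (2, 3, 2, 4, 4)
```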

Test Plan: - run tests

Reviewed By: albanD, mruberry

Differential Revision: D30667797

Pulled By: zou3519

fbshipit-source-id: 53f5cd02070de5b7db4abb017d727376b59288df
2021-10-08 07:26:08 -07:00
8d6d448238 Add HPU for Autograd Fallback (#65605)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65605

Reviewed By: albanD

Differential Revision: D31373899

Pulled By: ezyang

fbshipit-source-id: 894f62dc44b0532f152dc97b839eecfbaed25e8c
2021-10-08 07:21:44 -07:00
4af913a7cf fixed minor issues for index_add in docs (#65806)
Summary:
Hi, I'm looking forward to contributing to PyTorch, so starting with a minor fix in the documentation for `index_add`.

Currently, in the documentation for `index_add_` (please see https://pytorch.org/docs/master/generated/torch.Tensor.index_add_.html#torch.Tensor.index_add_):

1. `tensor` attribute was pointing to the `torch.tensor` class, which IMO (though it may not be a big deal) is unintentional.
2. `dim` attribute is pointing to `torch.Tensor.dim`, which again IMO is unintentional.

This PR suggests a correction for the first point above: rename the `tensor` attribute to `input` so that it doesn't point to the `torch.tensor` class. (I've verified that other ops like `scatter` use `input`, so this should not break consistency in the documentation.) I couldn't find an appropriate fix for the second point above, since renaming `dim` to something else will break consistency (as almost all other ops in PyTorch use `dim` as the attribute name).
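
For context, the call being documented is `Tensor.index_add_(dim, index, source)`; a quick illustration:

```python
import torch

t = torch.zeros(3, 4)
index = torch.tensor([0, 2])
src = torch.ones(2, 4)

# Rows of `src` are accumulated into rows 0 and 2 of `t` along dim 0.
t.index_add_(0, index, src)
assert t[0].sum() == 4 and t[1].sum() == 0 and t[2].sum() == 4
```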

I may be wrong here, so please let me know if there is any feedback or an alternate fix for this.

_Note:_ I plan to fix this behavior for `index_copy_` (https://pytorch.org/docs/master/generated/torch.Tensor.index_copy_.html#torch.Tensor.index_copy_) once and if this PR is approved.

To the reviewers, please help me tag the correct person who could help review this PR.

cc: krshrimali mruberry zou3519

cc brianjo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65806

Reviewed By: dagitses, mruberry

Differential Revision: D31431182

Pulled By: zou3519

fbshipit-source-id: 66ced9677ac3bc71d672d13366f9f567ecea0a2d
2021-10-08 07:17:15 -07:00
61f0bb70c1 Allow external CUDA streams to be set as current (#65914)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65822.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65914

Reviewed By: dagitses

Differential Revision: D31389480

Pulled By: lw

fbshipit-source-id: 2b2f40e5452c5b2a0b9f0f705750d2aa9deb2ead
2021-10-08 06:09:32 -07:00
60fe854f9f [fx2trt] save and load TRTModule for OSS (#65958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65958

zhxchen17 added a `pickle` pybind for the TRT engine, which allows us to save and load an nn.Module with a TRT engine in fbcode. This diff, though, explicitly serializes/deserializes the engine in `__setstate__` and `__getstate__` so that in OSS people can also save and load TRTModule directly.
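
The pattern, sketched generically below (this is not the actual TRTModule code; the dict "engine" stands in for an unpicklable TensorRT engine handle):

```python
import io
import torch

# Generic sketch of the pattern: drop the opaque engine handle in
# __getstate__ and rebuild it from serialized bytes in __setstate__, so
# torch.save/torch.load work on the wrapper module.
class EngineModule(torch.nn.Module):
    def __init__(self, engine_bytes: bytes):
        super().__init__()
        self._blob = engine_bytes
        self.engine = self._build(engine_bytes)  # stand-in for TRT deserialization

    @staticmethod
    def _build(blob: bytes):
        return {"engine": blob}  # placeholder for a real engine object

    def __getstate__(self):
        state = self.__dict__.copy()
        state["engine"] = None  # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.engine = self._build(self._blob)  # rebuild from the blob

m = EngineModule(b"engine-bytes")
buf = io.BytesIO()
torch.save(m, buf)
buf.seek(0)
m2 = torch.load(buf)
assert m2.engine is not None
```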

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_fx2trt

Reviewed By: wushirong

Differential Revision: D31309429

fbshipit-source-id: 9068e2ae6375ed0e1bb55b0e9d582b8d9c049dbf
2021-10-07 22:27:40 -07:00
321345d7c9 Revert "Revert D31227448: [pytorch][PR] fixing sorting in stride indices" (#66176)
Summary:
enabling https://github.com/pytorch/pytorch/issues/63940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
2021-10-07 22:09:07 -07:00
74477ba243 [fx2trt] More controls over output dtypes (#65959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65959

Gives some more control over the output dtype of a TRT engine. Previously it would be fp16 if we turned on fp16_mode. This diff allows the engine to generate fp32 output with fp16_mode=True.

Test Plan: CI

Reviewed By: kflu, wushirong

Differential Revision: D31243929

fbshipit-source-id: 09c752e6f382d6ad169da66878d9a9277c134869
2021-10-07 22:03:51 -07:00
227f91e72d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31495160

fbshipit-source-id: b0a56003a6695989dff0d325cdc118182662ec61
2021-10-07 21:09:22 -07:00
a58ff186e8 [quant][embedding qat] Add basic EmbeddingBag QAT fakeQuant workflow (#65443)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65443

Test Plan: Imported from OSS

Reviewed By: dagitses, supriyar

Differential Revision: D31456445

Pulled By: b-koopman

fbshipit-source-id: 0edda6e272d9005fce65f2ba6a5e6abc831836de
2021-10-07 20:19:29 -07:00
64caee1356 [PyTorch Edge] Leave out field for debug_handle if not being built with eager symbolication support (#66131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66131

Turns out that a model with 72k instructions causes about 0.5MiB of additional memory overhead (if there's an 8-byte memory overhead per instruction). This is not necessary if we're building w/o eager symbolication support. This change eliminates the 8-byte `debug_handle` if the build is w/o eager symbolication support.
ghstack-source-id: 140045478

(Note: this ignores all push blocking failures!)

Test Plan:
```
buck build -c "pt.enable_eager_symbolication"=1 //xplat/caffe2/fb/lite_predictor:lite_predictor
buck build //xplat/caffe2/fb/lite_predictor:lite_predictor
```

Reviewed By: kimishpatel

Differential Revision: D31387784

fbshipit-source-id: af56787ad833b990a46b79ab021e512edaa22143
2021-10-07 20:01:18 -07:00
ebe530a9cd Periodic jobs should not have CIFLOW_DEFAULT label (#66300)
Summary:
Noticed that the `periodic-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-slow-gradcheck` job has a `ciflow/default` label, but does not have a `ciflow/scheduled` label.
Added asserts to enforce that jobs with a non-trivial is_scheduled property do not have the default label and do have the scheduled label.

Rename `periodic-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-slow-gradcheck` to `periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck`

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66300

Reviewed By: seemethere

Differential Revision: D31493323

Pulled By: malfet

fbshipit-source-id: 194c1d7a4e659847d94a547b87a0d7d08e66406d
2021-10-07 19:57:32 -07:00
bd9eee4e65 TBB: Use static partitioner to match OpenMP scheduling (#65327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65327

Should fix https://github.com/pytorch/pytorch/issues/64571

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31474116

Pulled By: malfet

fbshipit-source-id: 8c4264d4778c6caf58261e3f70d72decd134128d
2021-10-07 19:12:36 -07:00
d5033410b1 Parallel: Deduplicate parallel functions in different backends (#65326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65326

parallel_for and parallel_reduce currently share some common code in
all backends, specifically for detecting if it should run in parallel
or not. This moves all the backend-specific code into a single
`internal::invoke_parallel` function and makes the `parallel_`
functions common to all backends.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31124495

fbshipit-source-id: 65c3d2af42a8860cc4d6349566085c9fa8d8c6f0
2021-10-07 19:11:19 -07:00
e1817d895f [BE] Cleanup python_function.cpp (#66296)
Summary:
- Delete unused `var_input_idx`
- Fix `uninitialized variable` clang-tidy warning by setting `PyObject* input` to PyNone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66296

Reviewed By: janeyx99

Differential Revision: D31491016

Pulled By: malfet

fbshipit-source-id: 08267144be0cd049d122580cdf81cf586c3e30a6
2021-10-07 18:41:17 -07:00
ca363d1e22 docker: Ensure libgnutls30 for all docker builds (#66258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66258

Installing libgnutls30 has been shown to help when confronted with the
cert issue related to deb.nodesource.com

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31477789

Pulled By: seemethere

fbshipit-source-id: f87ae4c098771acc505db14e3982d8858cf7326f
2021-10-07 18:36:40 -07:00
38f5144eae Fix https://github.com/pytorch/pytorch/issues/61982 (#66015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015

Fixes https://github.com/pytorch/pytorch/issues/61982 by cloning
tensors in DDPSink. This only applies once for static_graph and generally for unused
params, which already have overhead, so the perf hit should not be an issue. Will
verify with a benchmark.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D31346633

fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
2021-10-07 18:11:18 -07:00
20f2e55d4f Rename cuda/Resize.cu to cuda/Resize.cpp (#65943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65943

These files don't require nvcc to compile.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31386277

Pulled By: ngimel

fbshipit-source-id: 1066ee87fa795e2c7969447fbce1fe2633fb9680
2021-10-07 16:37:51 -07:00
86de09e49a Upgrade to ubuntu:trusty-20190515 (#63468)
Summary:
Security Upgrade to ubuntu:trusty-20190515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63468

Reviewed By: ngimel

Differential Revision: D31393552

Pulled By: malfet

fbshipit-source-id: 4e2399e3cddc1d549c08c82c08015e00569c19bc
2021-10-07 16:28:08 -07:00
416f593080 [Static Runtime] Group graph nodes into input aliases & output aliases (#65517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517

This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:

- input_aliases:  values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are created and consumed internally within the graph.

`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.

Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.

Reviewed By: hlu1

Differential Revision: D30940955

fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
2021-10-07 14:35:12 -07:00
0e2d1b221a [Bootcamp][Pytorch Core] Add testing for complex non-vanilla SGD
Summary: Adding a test to ensure non-vanilla SGD behaves as if complex numbers are two real numbers in R^2, as per issue 65711 on GitHub

Test Plan:
```buck test mode/dev caffe2/test:optim -- 'test_sgd_complex'```

https://pxl.cl/1QLxw

Reviewed By: albanD

Differential Revision: D31477212

fbshipit-source-id: 500678e561a05ac96759223b4c87a37cab26c6a6
2021-10-07 14:07:39 -07:00
5e7d8ec846 Support Registering a Variable Length List of Builtin Modules for torch::deploy Builtin Libraries (#66021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66021

A builtin library consists of a list of frozen modules and a list of builtin modules. For tensorrt, it's quite simple since we only have a single builtin module, tensorrt.tensorrt. But it can be complex for libraries like numpy, which contain multiple builtin modules (np.core._multiarray_umath, np.random.mtrand, etc.), if we want to add them as torch::deploy builtins. We enhance the macro that registers builtin libraries to accept a variable-length list of builtin modules. We can use this macro to register frozentorch, frozenpython, and tensorrt for now, and can also use it to register libraries like numpy later on.

The enhanced macro now looks as follows. Although we don't need to worry about backward compatibility for now, this enhanced version is fully compatible with the previous version. The previous version is just a special case where the library contains no builtin modules.

 ```
REGISTER_TORCH_DEPLOY_BUILTIN(library_name_without_quote, frozen_modules_list,
    builtin_module_name_1, builtin_module_init_function_1, ...,
    builtin_module_name_N, builtin_module_init_function_N)
```
ghstack-source-id: 140007970

Test Plan:
1. Play around with interactive_embedded_interpreter.cpp to import torch._C, tensorrt.tensorrt etc inside the embedded interpreter.
2. Enhance test_builtin_registry.cpp
3. Run test_deploy.cpp and test_deploy_gpu.cpp

Reviewed By: suo

Differential Revision: D31349390

fbshipit-source-id: 70a1fcf660341180fc4d5195aed15ceb07c2bef7
2021-10-07 13:23:46 -07:00
40dd2711b6 [Static Runtime] Cleanup LLVMCodeGen memory after code gen completes (#66218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66218

This stack of diffs reduces the memory used by LLVMCodeGen object.

Here are the numbers on model `294738512`: (this is the number reported as `Memory turnover after freeze_module:` in the output)

```
Before: 123343496
After : 121566008
```

So, there is a reduction of about `1.77MB` with this change of making `PytorchLLVMJIT` a singleton.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, hlu1

Differential Revision: D31445798

Pulled By: navahgar

fbshipit-source-id: c860d36456b2c5d3e21010c1217e2948326f666d
2021-10-07 13:17:13 -07:00
7e5ef5e517 [nnc] Added a cache to use singleton instances of PytorchLLVMJIT for every triple,cpu,attrs combination (#66217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66217

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D31445797

Pulled By: navahgar

fbshipit-source-id: 4e1450100928132ccce4ef3c6c20ad6661cfabed
2021-10-07 13:17:11 -07:00
c30dc52739 [nnc] Use given kernel function name while emitting code (#66216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66216

Test Plan: Imported from OSS

Reviewed By: dagitses, priyaramani

Differential Revision: D31445799

Pulled By: navahgar

fbshipit-source-id: 8d164209831339d364710b14f6a263a16e108281
2021-10-07 13:15:46 -07:00
3cc40253d9 add gather to ShardedTensor (#65671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65671

Tentative implementation that uses dist.gather_object to collect shards from all ranks and then "merge" them. The merge is done on dst_rank through padding the sharded tensors to the size of the full tensor based on their metadata (offsets, lengths) first, and then summing these padded tensors together.
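
A toy, single-process illustration of the padding-based merge just described (not the actual ShardedTensor API; in the real flow the shards and their offsets arrive via dist.gather_object first):

```python
import torch

full_size = (4, 4)
shards = {  # offset -> local shard, as if gathered from two ranks
    (0, 0): torch.ones(2, 4),
    (2, 0): torch.full((2, 4), 2.0),
}

full = torch.zeros(full_size)
for (row, col), shard in shards.items():
    padded = torch.zeros(full_size)  # pad each shard to the full size
    padded[row:row + shard.size(0), col:col + shard.size(1)] = shard
    full += padded                   # dst_rank sums the padded tensors
print(full)
```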

Also considered concatenating sharded tensors without padding to minimize the memory footprint (assuming padding will increase memory). But it may not be flexible enough for arbitrary sharding (e.g. sharding along multiple dimensions).

Another way could be constructing the padded tensor on each rank and reducing to rank0. I feel this is the easiest implementation, but it will incur higher memory usage and comm payload. Please let me know if this alternative is preferred.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan:
Imported from OSS

  python test/distributed/_sharded_tensor/test_sharded_tensor.py -v -k test_gather

did not manage to test on OSS, but tested in fbcode by reserving an on-demand GPU

  arc patch D31197611

modify the test with 2 gpus as on-demand gpu only has 2 cores (D31227986)

   buck test -c fbcode.enable_gpu_sections=true mode/dev-nosan caffe2/test/distributed/_sharded_tensor:sharded_tensor -- test_gather

   buck-out/gen/caffe2/test/distributed/_sharded_tensor/sharded_tensor#binary.par  test_sharded_tensor.TestShardedTensorChunked.test_gather

{F667213605}

Reviewed By: dagitses, pritamdamania87

Differential Revision: D31197611

Pulled By: dracifer

fbshipit-source-id: cf98b4a2d7838b11b9582eb23f826bb0fa38a7f4
2021-10-07 13:01:12 -07:00
f445ed19b2 OpInfo for 2d fft functions (#66128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66128

cc mruberry peterbell10

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31450217

Pulled By: mruberry

fbshipit-source-id: 1952fc60c5d5f454966c43f5710b8b97a9794d0e
2021-10-07 12:50:06 -07:00
2213c463ba C++ API and docs for hfftn (#66127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66127

cc mruberry peterbell10

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31450216

Pulled By: mruberry

fbshipit-source-id: 2878aee294aa7d74482b66d536258bac0541408d
2021-10-07 12:48:36 -07:00
e6a4f746c2 slow_conv3d: Use at::sum for grad_bias accumulation (#65758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65758

The same change has been made in conv2d; the proper algorithm is both
faster and more precise.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31257872

Pulled By: ngimel

fbshipit-source-id: 6ff3a7a00a05b66f83d45cc820bd0c230cb8de6d
2021-10-07 12:20:49 -07:00
2e4e5b0264 Add inplace_variant for resize_ OpInfo (#66135)
Summary:
Enable testing of `torch.Tensor.resize_`.
The negative view test is skipped as the test doesn't work with resize_; see
https://github.com/pytorch/pytorch/issues/65945.

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66135

Reviewed By: dagitses

Differential Revision: D31444263

Pulled By: mruberry

fbshipit-source-id: 00c7fe05df28fba01508b31adb3ed4fdcf4d0326
2021-10-07 12:00:30 -07:00
361b34eb81 Chunk: acc_ops (#66010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66010

Added chunk acc op and unit test.

Removed misleading return statements.

Test Plan: buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer

Reviewed By: 842974287

Differential Revision: D31326490

fbshipit-source-id: 81183ad8773eb7471566bec07cdd3dd6c4cee217
2021-10-07 11:41:00 -07:00
9fb6ba24e7 Update torch.fx.passes.split_module docstring (#65542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65542

Add docstring for torch.fx.passes.split_module that conforms to Google Python Style conventions.

Changed original example to the example from this diff:
https://www.internalfb.com/diff/D24925283 (9734c042b8)

Test Plan:
Ran buck test //caffe2/test:fx. No errors detected
https://pxl.cl/1QCch

Reviewed By: jamesr66a

Differential Revision: D31145694

fbshipit-source-id: 8e54f3b1be3dca1c4d414fdeeab71b9f2b5d9f3e
2021-10-07 10:37:10 -07:00
d5f64afc38 [Static Runtime] Support aten::to.prim_dtype overload (#64928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928

Added support for this overload of `aten::to`:
```
aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b)
```

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to`

Reviewed By: hlu1

Differential Revision: D30901398

fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb
2021-10-07 10:22:44 -07:00
a8c0b362ce [pytorch][PR] Add hash and int128 utils for Lazy Tensor Core" (#66181)
Summary:
These utils are prerequisites for Lazy Node base class.
- set up new torch/csrc/lazy, test/cpp/lazy dirs
- add source files to build_variables.bzl in new lazy_core_sources var
- create new test_lazy binary

Fixes https://github.com/pytorch/pytorch/issues/65636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66181

Original commit changeset: 3d0d5377d71e

Test Plan:
Run PyTorch XLA corresponding PR in XLA CI:
https://github.com/pytorch/xla/pull/3148/files

Reviewed By: suo

Differential Revision: D31416438

fbshipit-source-id: 58a6a49c5bc30134bc6bae2e42778f359b9a8f40
2021-10-07 10:05:26 -07:00
61fca037d6 [Part 1] upstreaming fairscale fsdp to PyTorch -- sharding, core data flow and hooks (#63881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63881
This PR includes the minimal set of features to make FSDP work, like sharding, core data flow and hooks. More tests will be added in follow-up PRs. Tests are refactored to utilize common PyTorch utils. The code is also refactored a little bit. Alternative ways to replace ".data" usage in this PR are still being discussed offline.

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D30521673

fbshipit-source-id: 9a23390dd7c925749604c6860e08fbe39ddc5500
2021-10-07 09:06:44 -07:00
88f8944ef1 Revert D30599136: [Pytorch Edge][tracing-based] build tracer in OSS
Test Plan: revert-hammer

Differential Revision:
D30599136 (eeaf527feb)

Original commit changeset: 102f23fb652c

fbshipit-source-id: 8ac3d75a52d06a5c4196bae2db1c4df2d5c5c666
2021-10-07 08:34:23 -07:00
2f1ab477f1 Speed up DataTypeToTypeMeta (#66113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66113

For a benchmark compiled in opt mode, in which the lookup items were shuffled and then looked up in round-robin fashion 10M times (for a total of 140M lookups), we see:
```
Function           Container            Time (ms) Multiplier
TypeMetaToDataType if-chain                   233         1x
TypeMetaToDataType std::vector                795      3.41x
TypeMetaToDataType std::map                  1566      6.72x
TypeMetaToDataType std::unordered_map        2136      9.17x

DataTypeToTypeMeta switch                     102         1x
DataTypeToTypeMeta std::vector                666      6.53x
DataTypeToTypeMeta std::map                  1212      11.9x
DataTypeToTypeMeta std::unordered_map        1539      15.1x
DataTypeToTypeMeta folly::F14FastMap         1789      17.5x
```
From this, we draw two conclusions:
1. Using a complex container like `std::map` is worse than using a simple vector lookup here (there aren't enough items for the Big-O to assert itself).
2. Using any container at all is a mistake. (Unless we pull in more exotic reasoning like invalidating the code cache or preventing inlining.)

Test Plan: Sandcastle

Reviewed By: dzhulgakov

Differential Revision: D31375117

fbshipit-source-id: 0b310c6c2e94080d125c82fb7c2b43ab869adbcb
2021-10-07 08:06:09 -07:00
1e4bcbdddb [Bootcamp][Pytorch Core] Add test for complex numbers for vanilla SGD (#66230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66230

Adding a test to ensure vanilla SGD behaves as if complex numbers are two real numbers in R^2, as per issue 65711 on GitHub:
https://github.com/pytorch/pytorch/issues/65711
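
A minimal sketch of the property being tested, assuming the intent stated in the issue (this is not the actual test code): plain SGD on a complex parameter should match SGD on the same values stored as (real, imag) pairs.

```python
import torch

z = torch.randn(4, dtype=torch.cfloat)
p_real = torch.view_as_real(z).clone().requires_grad_(True)  # shape (4, 2)
p_cplx = z.clone().requires_grad_(True)

opt_c = torch.optim.SGD([p_cplx], lr=0.1)
opt_r = torch.optim.SGD([p_real], lr=0.1)

for _ in range(3):
    opt_c.zero_grad()
    opt_r.zero_grad()
    (p_cplx.abs() ** 2).sum().backward()                       # real-valued loss
    (torch.view_as_complex(p_real).abs() ** 2).sum().backward()
    opt_c.step()
    opt_r.step()

# The complex parameter evolved exactly like its R^2 counterpart.
assert torch.allclose(torch.view_as_real(p_cplx), p_real)
```
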
ghstack-source-id: 139918862

Test Plan:
```buck test mode/dev caffe2/test:optim -- 'test_sgd_complex'```

https://pxl.cl/1QHvX

Reviewed By: albanD

Differential Revision: D31449289

fbshipit-source-id: da8b00421085796a23b643e73f96b19b5b560a32
2021-10-07 07:14:05 -07:00
057a01556c [Static Runtime] Do not use variadic_sigrid_transforms_torch_bind if out variant is disabled (#66221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66221

JIT doesn't have an implementation for this op, so we can only use it when out variants are enabled.

Reviewed By: hlu1

Differential Revision: D31445887

fbshipit-source-id: 4565ac4df751d8ee4052647574c43efa05ea1452
2021-10-07 06:57:17 -07:00
dcf39f9bb9 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31464823

fbshipit-source-id: 37bd72c8f1c8240d2ae72385a0707003ddb24ce8
2021-10-07 04:17:48 -07:00
df11e2d6f9 (torch/elastic) add fqdn hostname to error printout (#66182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182

closes https://github.com/pytorch/pytorch/issues/63174

Does a few things:

1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header

NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC - hence the extra record annotation).
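
A minimal usage sketch of that annotation (the decorator is the one named above; the failing body is made up for illustration):

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record  # captures the traceback into the error file on failure
def main():
    raise RuntimeError("foobar")

if __name__ == "__main__":
    main()
```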

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
2021-10-07 01:40:02 -07:00
8a974a482c [quant] Add support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65674

Before this PR user had to use the eager mode static quantization APIs to quantize Embedding/EmbeddingBag modules.
With this PR they can use either the static or dynamic quantization APIs for Embedding quantization

The only qconfig supported for embedding quantization is float_qparams_weight_only_qconfig, which is currently enforced in the from_float
method of the quantized Embedding/EmbeddingBag modules.

To combine embedding quantization with Linear dynamic quantization, users can use the qconfig_dict to specify a different qconfig for each module type.
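
A rough sketch of what that could look like with the eager-mode dynamic API (treat the exact spelling as an assumption; the commit only states that a qconfig dict can map module types to qconfigs):

```python
import torch
import torch.nn as nn
from torch.quantization import (
    default_dynamic_qconfig,
    float_qparams_weight_only_qconfig,
    quantize_dynamic,
)

model = nn.Sequential(nn.EmbeddingBag(1000, 16), nn.Linear(16, 4))
qmodel = quantize_dynamic(
    model,
    qconfig_spec={
        nn.EmbeddingBag: float_qparams_weight_only_qconfig,  # weight-only
        nn.Linear: default_dynamic_qconfig,                  # dynamic int8
    },
)
```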

The prepare/convert APIs can still be used to quantize Embeddings, with the caveat that users need to ensure inputs to Embedding ops are FP32.

Addresses Issue #65185
ghstack-source-id: 139935419

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: gchanan

Differential Revision: D31211199

fbshipit-source-id: 8c747881caee5ccbf8b93c6704b08d132049dea4
2021-10-06 23:19:38 -07:00
115526cc88 GELU Converter (#66008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66008

Added GELU converter and updated TARGET file of deeplearning/trt/fx2trt to load the plugins onto the converters

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_gelu

Reviewed By: 842974287

Differential Revision: D31284144

fbshipit-source-id: 0e938a47a99d289aefc3308aec3937c7334e9b8a
2021-10-06 22:25:43 -07:00
ac0dbd6eec Promote missing ops for delegated models (#66052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66052

`aten::__getitem__.Dict_str` and `prim::unchecked_cast` are used in the delegate API.

ghstack-source-id: 139860350

Test Plan: CI

Reviewed By: pavithranrao

Differential Revision: D31364720

fbshipit-source-id: dfca5e3ded4cdd3329c9b9d80a13f0fb1f5f2a51
2021-10-06 21:48:42 -07:00
3f30526ff2 Remove THCAllocator (#65942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65942

This one is a bit weird. The class is called `THCIpcDeleter` but it
actually has nothing IPC-specific. It just converts
`std::shared_ptr` + `void*` into a `c10::DataPtr`. Instead, moving
the `DataPtr` conversion into the actual IPC code allows 2 memory
allocations to be elided by merging 3 separate deletion contexts
into one.

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31386278

Pulled By: ngimel

fbshipit-source-id: 5722beed9dcf680f0eb6bbff30405cff47b21962
2021-10-06 19:04:43 -07:00
eeaf527feb [Pytorch Edge][tracing-based] build tracer in OSS (#64087)
Summary:
1. Introduce
```
MobileModelRunner.h
MobileModelRunner.cpp
TensorUtils.h
TensorUtils.cpp
```
in external. They are pretty much the same as internal, except for the namespace and the dependency on folly. In the next PRs, TensorUtils and MobileModelRunner are unified between external and internal.
2. Introduce
```
tracer.cpp
```
for external. The majority is the same as the internal one, with some cleanup of unnecessary dependencies. It's unified between internal and external in the next change.
3. Add an executable to build the tracer. It will be built for desktop only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64087

ghstack-source-id: 139900300

Test Plan:
Given the model
```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lin = nn.Linear(10, 1)
    def forward(self, x):
        return self.lin(x)

model = Net()
scripted_module = torch.jit.script(model)
example_dict = {'a' : 1, 'b' : 2}
sample_input = {
    scripted_module.forward : [(torch.zeros(1,10),)],
}

bundled_model = torch.utils.bundled_inputs.bundle_inputs(scripted_module, sample_input)
bundled_model._save_for_lite_interpreter("dummy_model_with_bundled_input.ptl")
```
External tracer
```
./build/bin/model_tracer --model_input_path "/Users/chenlai/Documents/pytorch/tracing/dummy_model_with_bundled_input.ptl" --build_yaml_path  "/Users/chenlai/Documents/pytorch/tracing/tmp.yaml"
```
and compare `tmp.yaml` with the operator list generated from
Internal tracer
```
./fbcode/caffe2/fb/model_tracer/run_model_with_bundled_inputs.sh ~/local/notebooks/prod_models/dummy_model_with_bundled_input.ptl
```
QNNPACK only:
Example yaml from internal tracer:  P460742166 [devserver]
Example yaml from external tracer: P460759099 [mac], P460742166 [devserver]

Comparison ops between internal and external on devserver:

{F666923807}

{F666924048}

Note: The operators generated on Mac and devservers are different; the one on the devserver includes two extra ops: `aten::addmm_` and `aten::slow_conv_dilated2d`. Based on the traced list, when calling `aten::_convolution`, one calls `aten::mkldnn_convolution`, and the other calls `aten::_convolution_nogroup`, causing the divergence.

Thanks to Martin for pointing out:
> mkldnn is another backend from Intel

Reviewed By: dhruvbird

Differential Revision: D30599136

fbshipit-source-id: 102f23fb652c728a9ee4379f9acc43ae300d8e8a
2021-10-06 19:01:04 -07:00
0cab25468d [Pytorch Edge][tracing-based] reorganize model tracer dependency (#63421)
Summary:
1. Move 4 files:
```
KernelDTypeTracer.h
KernelDTypeTracer.cpp
OperatorCallTracer.h
OperatorCallTracer.cpp
```
so they're visible in OSS.

2. Update the namespace to `torch::jit::mobile`
3. Add a `fb_xplat_cxx_library` `torch_model_tracer` with the source file list above.
4. update the `fb_xplat_cxx_library`  `model_tracer_lib` dependency on the new `torch_model_tracer` library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63421

ghstack-source-id: 139900299

Reviewed By: dhruvbird

Differential Revision: D30378069

fbshipit-source-id: d56c6140e951bc13113a76d6b63767a93843c842
2021-10-06 18:59:50 -07:00
300613dc60 make FX symbolic tracing reuse buffers if they're the same (#66211)
Summary:
Currently, if the same tensor constant is reused multiple times, we'll store a tensor constant for each time we use it.

For example
```
val = torch.randn(5)
for _ in range(10):
    x = x + val
```
ends up storing 10 tensor constants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66211

Reviewed By: jamesr66a

Differential Revision: D31437089

Pulled By: Chillee

fbshipit-source-id: 401169c8d58ce0afb7025ae11060680ef544419f
2021-10-06 18:35:38 -07:00
67970e8c9b Add CI tests for AOT Compile (#65441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65441

Adding CI test to verify a simple linear model can compile fine.
Successful run from CI logs:

```
+ test_aot_model_compiler
+ echo 'Testing AOT model compiler'
Testing AOT model compiler
+ source test/mobile/nnc/test_aot_compile.sh
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ TORCH_INSTALL_DIR=/opt/conda/lib/python3.6/site-packages/torch
++ TORCH_BIN_DIR=/opt/conda/lib/python3.6/site-packages/torch/bin
+++ dirname test/mobile/nnc/test_aot_compile.sh
++ CURRENT_DIR=test/mobile/nnc
++ MODEL=aot_test_model.pt
++ COMPILED_MODEL=aot_test_model.compiled.pt
++ COMPILED_CODE=aot_test_model.compiled.ll
++ test_aot_model_compiler
++ python test/mobile/nnc/aot_test_model.py
++ exit_code=0
++ [[ 0 != 0 ]]
++ /opt/conda/lib/python3.6/site-packages/torch/bin/test_aot_model_compiler --model aot_test_model.pt --model_name=aot_test_model --model_version=v1 --input_dims=2,2,2
The compiled model was saved to aot_test_model.compiled.pt
++ success=1
++ '[' '!' -f aot_test_model.compiled.pt ']'
++ '[' '!' -f aot_test_model.compiled.ll ']'
++ '[' -f aot_test_model.compiled.ll ']'
++ rm aot_test_model.compiled.ll
++ '[' -f aot_test_model.compiled.pt ']'
++ rm aot_test_model.compiled.pt
++ rm aot_test_model.pt
++ '[' 1 = 0 ']'
+ [[ linux-xenial-py3.6-gcc5.4-default == pytorch-linux-xenial-py3* ]]
+ assert_git_not_dirty
+ [[ linux-xenial-py3.6-gcc5.4-default != *rocm* ]]
+ [[ linux-xenial-py3.6-gcc5.4-default != *xla* ]]
++ git status --porcelain
+ git_status=
+ [[ -n '' ]]
+ test_custom_script_ops
```

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D31348169

Pulled By: priyaramani

fbshipit-source-id: dd5c55859dfa07d150e5decc2dd7e56f43e7f66b
2021-10-06 18:23:19 -07:00
6c54971cd9 Open Registration for torch::deploy Builtins (#65953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65953

Previously, if people wanted to add a torch::deploy builtin, they needed to change torch::deploy internal code (interpreter_impl.cpp) to register the Python part as frozen modules and the C++ part as builtin modules. This is inconvenient and error prone. We want to add open registration support for torch::deploy builtins so that people only need to add one effective line of code in their *library code* to complete the registration.

Here is an example of registering numpy as a torch::deploy builtin:
  REGISTER_TORCH_DEPLOY_BUILTIN(numpy, numpy_frozen_modules, <list of name, PyInit function pairs>)

This diff supports open registration of frozen modules. It's the first step to achieve the plan above.
ghstack-source-id: 139888306

Test Plan: Run tests in test_deploy.cpp and test_builtin_registry.cpp

Reviewed By: suo

Differential Revision: D31321562

fbshipit-source-id: 6445bd8869f1bb7126b4c96cf06c31145f0e9445
2021-10-06 18:04:57 -07:00
213c3f45da [oss/ci] skip TestDataLoaderPersistentWorkers on ASAN (#66236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66236

it's flaky, see https://github.com/pytorch/pytorch/issues/66223

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31462056

Pulled By: suo

fbshipit-source-id: f4362a8020dc05ac8856706c0508d48be026eeb8
2021-10-06 17:56:19 -07:00
4937218611 [torch][launch] Add ability to override sys.executable for torch.distributed.run (#66179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66179

The diff adds a check for the `PYTHON_EXEC` environment variable. If the variable is set, it will override `sys.executable` for `torch.distributed.run`.
This means that if `PYTHON_EXEC` is set, user scripts executed via `torch.distributed.run` will start via the value of `os.environ["PYTHON_EXEC"]`.
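
Roughly the lookup the diff describes, as a sketch (the variable name is an assumption, not the actual launcher code):

```python
import os
import sys

# Fall back to the current interpreter unless PYTHON_EXEC overrides it.
python_exec = os.environ.get("PYTHON_EXEC", sys.executable)
```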

Test Plan: unittest

Reviewed By: kiukchung

Differential Revision: D31329003

fbshipit-source-id: b9d0167d99bbf463a6390f508324883ca4a1e439
2021-10-06 17:33:19 -07:00
e8837d741e [Vulkan] cat operator for height dimension (#66103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66103

Implemented `cat` operator for height dimension

Test Plan:
On Mac
```
cd ~/fbsource
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64

[ RUN      ] VulkanAPITest.cat_dim2_sameheight_success
[       OK ] VulkanAPITest.cat_dim2_sameheight_success (272 ms)
[ RUN      ] VulkanAPITest.cat_dim2_diffheight_success
[       OK ] VulkanAPITest.cat_dim2_diffheight_success (161 ms)
[ RUN      ] VulkanAPITest.cat_dim2_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_dim2_invalidinputs_exceptions (235 ms)
```

On Android
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"

[ RUN      ] VulkanAPITest.cat_dim2_sameheight_success
[       OK ] VulkanAPITest.cat_dim2_sameheight_success (98 ms)
[ RUN      ] VulkanAPITest.cat_dim2_diffheight_success
[       OK ] VulkanAPITest.cat_dim2_diffheight_success (105 ms)
[ RUN      ] VulkanAPITest.cat_dim2_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_dim2_invalidinputs_exceptions (101 ms)
```

Reviewed By: SS-JIA

Differential Revision: D31323141

fbshipit-source-id: 68b187e856758790cc5f7b0c263feb30a2bb467f
2021-10-06 16:12:59 -07:00
1d586e78c6 *_solve methods: implements forward AD (#65546)
Summary:
This PR adds forward AD for `*_solve` methods.
Additionally, `cholesky_solve` gets OpInfo + a bug fix when wrong leading dimensions could be passed to LAPACK,
and `lu_solve` gets forward AD with 2x`lu_solve` instead of 1x`lu_solve` + 2x`triangular_solve`.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65546

Reviewed By: dagitses

Differential Revision: D31431847

Pulled By: albanD

fbshipit-source-id: 0e343e0d9da3c3d2051fca215fad289d77275251
2021-10-06 16:04:22 -07:00
78209b93b3 Don't build shared library for AOT Compiler (#66227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66227

Building a shared library for the AOT Compiler is not necessary as it's included in libtorch. Also, having this built as a shared library was affecting Android builds, and we don't need to build the AOT Compiler for mobile builds.

Before fix:
```
(pytorch)  ~/local/pytorch master
└─ $ ANDROID_NDK=/opt/android_ndk/r20/ BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=armeabi-v7a ./scripts/build_android.sh -DBUILD_BINARY=ON
Build with ANDROID_ABI[armeabi-v7a], ANDROID_NATIVE_API_LEVEL[21]
Bash: GNU bash, version 5.0.11(1)-release (x86_64-redhat-linux-gnu)
Python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
Caffe2 path: /data/users/priyaramani/pytorch
Using Android NDK at /opt/android_ndk/r20/
.
.
FAILED: lib/libaot_compiler.so
: && /opt/android_ndk/r20/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=armv7-none-linux-androideabi21 --gcc-toolchain=/opt/android_ndk/r20/toolchains/llvm/prebuilt/linux-x86_64 --sysroot=/opt/and
roid_ndk/r20/toolchains/llvm/prebuilt/linux-x86_64/sysroot -fPIC -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -fno-addrsig -march=armv7-a -mt
humb -Wa,--noexecstack -Wformat -Werror=format-security -frtti -fexceptions  -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -
DBUILD_LITE_INTERPRETER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bound
s -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -W
no-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-miss
ing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -Wno-unused-but-set-variable -Wno-maybe-uninitialized
-fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -g0 -Oz -DNDEBUG  -Wl,--exclude-libs,libgcc.a -Wl,--exclude-libs,libatomic.a -static-libstdc++ -Wl,--build-id -Wl,--warn-shared-text
rel -Wl,--fatal-warnings -Wl,--exclude-libs,libunwind.a -Wl,--no-undefined -Qunused-arguments -Wl,-z,noexecstack  -rdynamic -shared -Wl,-soname,libaot_compiler.so -o lib/libaot_compiler.so caffe2/torch/CMakeFi
les/aot_compiler.dir/csrc/jit/mobile/nnc/aot_compiler.cpp.o  -latomic -lm && :
caffe2/torch/CMakeFiles/aot_compiler.dir/csrc/jit/mobile/nnc/aot_compiler.cpp.o:aot_compiler.cpp:function at::from_blob(void*, c10::ArrayRef<long long>, c10::TensorOptions const&): error: undefined reference t
o 'at::TensorMaker::make_tensor()'
.
.
caffe2/torch/CMakeFiles/aot_compiler.dir/csrc/jit/mobile/nnc/aot_compiler.cpp.o:aot_compiler.cpp:function torch::jit::mobile::nnc::Function::Function(): error: undefined reference to 'c10::AnyType::get()'
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
```

After fix:
```
(pytorch)  ~/local/pytorch master
└─ $ ANDROID_NDK=/opt/android_ndk/r20/ BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=armeabi-v7a ./scripts/build_android.sh -DBUILD_BINARY=ON
Build with ANDROID_ABI[armeabi-v7a], ANDROID_NATIVE_API_LEVEL[21]
Bash: GNU bash, version 5.0.11(1)-release (x86_64-redhat-linux-gnu)
Python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
Caffe2 path: /data/users/priyaramani/pytorch
Using Android NDK at /opt/android_ndk/r20/
.
.
-- Build files have been written to: /data/users/priyaramani/pytorch/build_android
Will install headers and libs to /data/users/priyaramani/pytorch/build_android/install for further Android project usage.
[2/3] Install the project...
-- Install configuration: "Release"
Installation completed, now you can copy the headers/libs from /data/users/priyaramani/pytorch/build_android/install to your Android project directory.
```

Test Plan: Imported from OSS

Reviewed By: ljk53, axitkhurana

Differential Revision: D31450970

Pulled By: priyaramani

fbshipit-source-id: 87e48033f1db46fef112bae1239a09a2365620d2
2021-10-06 15:57:32 -07:00
4a50b6c490 fix cosine similarity dimensionality check (#66191)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66191

Reviewed By: dagitses, malfet

Differential Revision: D31436997

Pulled By: ngimel

fbshipit-source-id: 363556eea4e1696d928ae08320d298451c286b10
2021-10-06 15:44:51 -07:00
05e1476d49 [jit] Fix list copy in MemoryDAG (#65176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65176

getElements returns a reference.
ghstack-source-id: 139745230

Test Plan:
CI

Static runtime startup for ctr_mobile_feed local net reduced from 8.35s to 7.8s

Reviewed By: malfet

Differential Revision: D30983898

fbshipit-source-id: 884bff40f12322633c0fffd45aed5b8bc7498352
2021-10-06 15:39:33 -07:00
fc4836f400 [Fix] Use full name to look for the promoted prim operator table (#66081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66081

Two fixes:

1. Since the operators are always registered with both name and overload name, the overload name needs to be included when looking for an operator.
2. Don't promote operators with alias, because the new registry does not support schema with alias.

ghstack-source-id: 139732099

Test Plan: CI

Reviewed By: pavithranrao

Differential Revision: D31382262

fbshipit-source-id: 43c6e6e0c13950a9ce8cf3a70debe0421372d053
2021-10-06 15:35:02 -07:00
7cc121dbcd slow_conv3d grad_input: Avoid dispatch in parallel region (#65757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65757

See gh-56794

Avoid dispatch inside of parallel_for by:
- Replacing Tensor slicing with TensorAccessor
- Replaces `bmm` and `mm` with direct calls to gemm.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31257878

Pulled By: ngimel

fbshipit-source-id: e6aad2d5ae7fa432bd27af2b1a8b0dcef1fc6653
2021-10-06 15:08:47 -07:00
480a1a88d6 [DDP] Log iteration in debug mode (#65770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65770

This logging info is printed out in debug mode; make it log the
iteration as well for clarity.
ghstack-source-id: 139838595

Test Plan: CI

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31222132

fbshipit-source-id: 14519aae1ba0b2a35b4b962e7d1a957c9142c8f8
2021-10-06 14:36:07 -07:00
722f1ccfb8 [DDP][Instrumentation] Profiling range for bucket copy (#65769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65769

Seeing some bottlenecks when copying the bucket to grads; this helps make it
clearer here.
ghstack-source-id: 139838597

Test Plan: Ci

Reviewed By: zhaojuanmao, wayi1

Differential Revision: D31217340

fbshipit-source-id: 762a254a3538eb5292b3a53bb5d1211057ecbdbb
2021-10-06 14:34:10 -07:00
84c5970a77 ci: Migrate slow_gradcheck to GHA (#65730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65730

This should close out the door on migrating all scheduled workflows we have for CircleCI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31225188

Pulled By: seemethere

fbshipit-source-id: 4c49e88ec017edc30e07325dbc613ff54dd164d8
2021-10-06 14:29:14 -07:00
e2be087207 [oss][pytorch] Add quint2x4 dtype (#65545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65545

Introduce a 2-bit qtensor. The new dtype added for this is c10::quint2x4.

The underlying storage for this is still uint8_t, so we pack 4 2-bit values into a byte while quantizing.

Kernels that use this dtype should be aware of the packing format (4 2-bit values in one byte).
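
An illustrative sketch of that layout (the kernel-side packing order is an implementation detail; this only shows the "four 2-bit values per byte" idea):

```python
def pack_2bit(vals):
    """Pack four ints in [0, 3] into one byte, value i at bits 2*i..2*i+1."""
    assert len(vals) == 4 and all(0 <= v <= 3 for v in vals)
    byte = 0
    for i, v in enumerate(vals):
        byte |= v << (2 * i)
    return byte

def unpack_2bit(byte):
    return [(byte >> (2 * i)) & 0b11 for i in range(4)]

assert unpack_2bit(pack_2bit([1, 0, 3, 2])) == [1, 0, 3, 2]
```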

Test Plan: `buck test mode/dev-asan caffe2/test/:quantization -- test_qtensor`

Reviewed By: supriyar

Differential Revision: D31148141

fbshipit-source-id: 1dc1de719e097adaf93fee47c6d1b8010a3eae6c
2021-10-06 14:22:00 -07:00
252b6f2cba [PyTorch][easy] Remove dead std::set in parseAliasAnnotation (#65712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65712

No reason for this to be here.
ghstack-source-id: 139743362

Test Plan: fitsships

Reviewed By: dhruvbird

Differential Revision: D31215696

fbshipit-source-id: 238ea6633629831e54847ce82de23571cf476740
2021-10-06 14:20:31 -07:00
90db214d4b support counter-based fused rowwise adagrad (#66177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66177

As titled, with an additional change to enable the counter for SparseAdagrad.

Test Plan:
buck test //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test

Testing with canary packages

baseline: f297789852

counter run: f297789912

Reviewed By: jspark1105

Differential Revision: D30903029

fbshipit-source-id: 3ed89a7da409fd820fd0b44950407c20fa2018a5
2021-10-06 13:50:43 -07:00
6d7fab5929 [Static Runtime][easy] Clone scripts do not use aten::add (#66161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161

`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31406764

fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
2021-10-06 12:37:39 -07:00
9285981de1 Clean up unused model instantiation (#65487)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65487

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31410880

Pulled By: b-koopman

fbshipit-source-id: 09b2d2d899a232e7334c82f00eff0f900e817853
2021-10-06 12:21:56 -07:00
8548928950 Cumsum: acc_ops (#66189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66189

Added acc_ops for cumsum and unit test

Test Plan: buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer

Reviewed By: 842974287

Differential Revision: D31355244

fbshipit-source-id: 41490d300553b0a5d52cbc4e681bdd0cf990eb42
2021-10-06 12:15:36 -07:00
623ac7eabb slow_conv3d: Avoid dispatch in parallel region (#65737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65737

See gh-56794

Avoid dispatch inside of parallel_for by:
- Replacing Tensor slicing with TensorAccessor
- Copy bias into output only once, outside of the parallel region
- Replaces `addmm_` and `baddbmm_` with direct calls to gemm.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31257874

Pulled By: ngimel

fbshipit-source-id: 20b94daa13082fb1e39eaa8144bfa4c611b61bab
2021-10-06 12:10:55 -07:00
9a0b2acd76 [quant] Remove hypothesis from qtopk (#66158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66158

qtopk used hypothesis, which created flaky tests. In addition, the generated tests were not representative and would not catch the cases that we are interested in.

This diff removes hypothesis from qtopk and merges the qtopk and qtopk_nhwc tests. We now use specific test cases.
ghstack-source-id: 139768865

Test Plan: `buck test mode/dev //caffe2/test:quantization -- test_qtopk`

Reviewed By: jerryzh168

Differential Revision: D31401341

fbshipit-source-id: a8fb37a7221fc43c159f34e28aa4a91ed3506944
2021-10-06 11:42:34 -07:00
6d4d636d66 [GHA] Rectify trigger_action_only flag (#66209)
Summary:
No longer needed, as a PR can be opened/reopened with a specific label

Fixes https://github.com/pytorch/pytorch/issues/66110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66209

Reviewed By: seemethere

Differential Revision: D31436292

Pulled By: malfet

fbshipit-source-id: 5b6e0875bec261862017dfe0eb3a5ec57fb8c705
2021-10-06 10:46:10 -07:00
c4ea447eb5 Use src size for memcpy in order to avoid fortify complaints (#65222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65222

When compiling against the Android SDK with `-D_FORTIFY_SOURCE=2`, the compiler will complain that the `dst` size is larger than the `src` size due to the function templating using two differently sized objects. There is a `TORCH_CHECK` to ensure we don't go through with these `memcpy`'s, but in the interest of making the compiler happy, let's switch the `memcpy` to take `sizeof(src)`.

Test Plan: CI

Reviewed By: bertmaher, lanza

Differential Revision: D30992678

fbshipit-source-id: b3e7aa992a3650e1051abad05be800b684e6332b
2021-10-06 09:05:31 -07:00
bfaaac6392 Ignore register_rds errors (#66185)
Summary:
Network communications are flaky by nature; the test should be marked as
skipped if network ops cannot be completed for some reason.

Fixes https://github.com/pytorch/pytorch/issues/66184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66185

Reviewed By: seemethere

Differential Revision: D31423193

Pulled By: malfet

fbshipit-source-id: 96c3a123c65913f44ea78b30a03e8e7eda164afe
2021-10-06 08:42:35 -07:00
b8e1999253 [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183

Add a GPU benchmark for fakeQuant, similar to #65241
ghstack-source-id: 139810414

Test Plan: https://pxl.cl/1QjJM

Reviewed By: b-koopman

Differential Revision: D31288158

fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d
2021-10-06 08:07:42 -07:00
9de9733390 Add 1d to 2d conv transform during mobile optimization (#65850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65850

This step was never added
ghstack-source-id: 139753673

Test Plan: Run optimize_for_mobile on model with conv1d and see that it transforms to conv2d

Reviewed By: kimishpatel

Differential Revision: D31093503

fbshipit-source-id: 11a19f073789c01a9de80f33abbe628005996b66
2021-10-06 07:27:09 -07:00
747a5782e3 [quant][fx] Don't assume bias is a keyword argument (#61647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61647

`prepare_fx` currently assumes that bias is always a positional argument to
convolutions, and only a keyword argument to other functions. This happens to work
today due to a quirk in how `__torch_function__` is handled for python
functions but shouldn't be considered stable.

Instead, we should support `bias` for both positional and keyword forms.
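
Both call forms in question, for concreteness (a minimal illustration; `F.conv2d` accepts bias either way, and prepare_fx should now match both):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(6, 3, 3, 3)
b = torch.randn(6)

y1 = F.conv2d(x, w, b)       # bias passed positionally
y2 = F.conv2d(x, w, bias=b)  # bias passed as a keyword
assert torch.allclose(y1, y2)
```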

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D31401360

Pulled By: albanD

fbshipit-source-id: 1e2f53d80e2176b870f326dc498e251e2386136e
2021-10-06 07:25:47 -07:00
ab25516054 [PyTorch] Remove unused function in import (#65865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65865

`operator_str` is not used in `import.cpp`, and it is also defined in `parse_operators.cpp`, so remove it from `import.cpp`.

Test Plan: CI passing

Reviewed By: iseeyuan

Differential Revision: D31293008

fbshipit-source-id: 1c857cbd63c57b8f79c1a068789fc8605605b642
2021-10-06 06:34:51 -07:00
a5895f85be [PyTorch Edge][type] Add type check in compatibility api (#63129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63129

1. Add an API to get `supported_types` from the runtime, exposed in C++ only.
2. Add an API to get `contained_types` from the model, exposed in both C++ and Python.
3. Add a field `contained_types_` in `type_parser.cpp` to track the contained types when parsing the Python string.
4. Expand the `is_compatible` API to check types. When checking types, it will check the contained type list from the model against the supported type list from the runtime.
5. Expand the unit test for compatibility to cover types.
6. Add a unit test in Python to check the type list.
ghstack-source-id: 139826944

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.GetContainTypes'

buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.isCompatibleSuccess'
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.isCompatibleFail'

buck test //caffe2/test:mobile
```

Reviewed By: iseeyuan

Differential Revision: D30231419

fbshipit-source-id: 8427f423ec28cc5de56411f15fd960d8595d6947
2021-10-06 02:23:44 -07:00
c75210face [PyTorch Edge][type] Move TypeParser class definition to header file (#65976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65976

Move the TypeParser class definition to a header file so it can be called from elsewhere. For example, the getContainedTypes() API in this stack can be moved to other files.
ghstack-source-id: 139826943

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D31294254

fbshipit-source-id: 1c532fd69c7f6b44ad2332055d24c95a0fac1846
2021-10-06 02:22:26 -07:00
931352c68d Make handle_torch_function_no_python_arg_parser public (#66054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66054

I need this function in functorch to support the ability of custom
jitted kernels to invoke torch_function when applicable.

Test Plan: functorch unit tests

Reviewed By: qihqi, ngimel

Differential Revision: D31416599

Pulled By: bertmaher

fbshipit-source-id: 90b57badd6a6b9d505ebfc436869b962b55c66d7
2021-10-06 00:27:10 -07:00
c0b1965f7c Back out "[vulkan] Use push constants instead of SSBOs" (#66169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66169

Original change: D30368834 (57e5ae5306)

Switching to Push Constants from Uniform Buffers caused some unforeseen memory errors when running Mac unit tests.

We'll switch back for now until we can pinpoint and resolve the issue.

Test Plan:
Build and run `vulkan_api_test`

```
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Reviewed By: beback4u

Differential Revision: D31409130

fbshipit-source-id: cab1a3330945b50522235db6738406b6037f9c68
2021-10-05 21:28:59 -07:00
8d435877d5 Fix typos at ONNX docs (#66090)
Summary:
This PR fixes small typos in the ONNX docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66090

Reviewed By: albanD

Differential Revision: D31385765

Pulled By: ezyang

fbshipit-source-id: f4879069a2acf9c8adaa81c26a6a5014634761f5
2021-10-05 21:11:47 -07:00
cbc29acca3 [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D31423202

fbshipit-source-id: 08d249e8546c0bfe6f1145c0571141b90aad03eb
2021-10-05 20:55:56 -07:00
d1058df885 fix clang-tidy error introduced by #64382 (#65977)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65977

Reviewed By: ngimel

Differential Revision: D31423174

Pulled By: malfet

fbshipit-source-id: 0ea560b9a6ddd6431f70bd3ac10ace68e26ab352
2021-10-05 20:13:13 -07:00
6cdea8239e Precomputing Transposes for frozen linear layers (#65631)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65631

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D31314248

Pulled By: Gamrix

fbshipit-source-id: 85611f3ccfe7b91a183d5d12f7fb9aca3c51acb0
2021-10-05 20:08:32 -07:00
43e26d0086 [deploy] Improve error messaging for create_movable (#65955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65955

This diff makes sure to give a clear error message when a user tries to create an object from an object that lives in a different session.

Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy

Reviewed By: suo

Differential Revision: D31323045

fbshipit-source-id: e7bd6f76afeb0285847bc11881185a164f80e3f0
2021-10-05 19:49:51 -07:00
3bd26792c0 Skip test_multiple_groups on windows (#66154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66154

Skips as the test is flaky:
https://github.com/pytorch/pytorch/issues/66059
ghstack-source-id: 139763149

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31403153

fbshipit-source-id: 7f47f17cee148a708346d6d9454c44a194d13a78
2021-10-05 18:33:23 -07:00
eeabab03e7 [DataParallel] Log API Usage for tracking (#66038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66038

Will help track workflows for DP deprecation. Tested via standalone DP
script.

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31356975

fbshipit-source-id: c0a3ac3a1faed794e3362f3f3a19a6fb800587a7
2021-10-05 18:30:23 -07:00
dc26f5eb65 [FX] Specifies a default value when possible for placeholders created from concrete_args (#59569)
Summary:
```python
class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, a=None, b=None):
        res = a
        if b is not None:
            res = res + b
        return res

concrete_args = {'b': torch.tensor(5)}
traced = fx.symbolic_trace(Foo(), concrete_args=concrete_args)
```

Gives the following error:

```
  File "<eval_with_key_9>", line 2
    def forward(self, a = None, b_1):
                ^
SyntaxError: non-default argument follows default argument
```

Since https://github.com/pytorch/pytorch/issues/55888, placeholders are also created for concrete arguments. But these placeholders do not have default values even when one was provided for the argument in question, causing the error above.

To solve this, I add a default value when it is available during placeholder creation for concrete arguments.

I also tried to set the default value to the value specified in concrete_args (since in many cases it will actually use this value anyway), but ran into an error because the default value is never defined:

```
def forward(self, a = None, b_1 = _tensor_constant0):
    _tensor_constant0 = self._tensor_constant0
    _tensor_constant1 = self._tensor_constant1
    add = a + _tensor_constant1;  a = _tensor_constant1 = None

NameError: name '_tensor_constant0' is not defined
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59569

Reviewed By: albanD

Differential Revision: D31385607

Pulled By: Chillee

fbshipit-source-id: 44a8ce28b5eabdb9b4c773e73a68ff0bb9c464cc
2021-10-05 17:45:09 -07:00
83bac89d64 [quant] Add fp32/fp16 zero_point support for GPU fakeQuant (#65836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65836

Add fp32/fp16 zero_point support for GPU fakeQuant, similar to D30975238 (60915eb810)
ghstack-source-id: 139779416

Test Plan:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/281475183488511/

{F667112564}

Reviewed By: b-koopman

Differential Revision: D31091679

fbshipit-source-id: 68fd483e6926c7fd565703c01d8ffb337b75dca5
2021-10-05 17:40:54 -07:00
f062def486 Revert D31260343: [pytorch][PR] Add hash and int128 utils for Lazy Tensor Core
Test Plan: revert-hammer

Differential Revision:
D31260343 (e94fea08d0)

Original commit changeset: 8bb1194188e3

fbshipit-source-id: 3d0d5377d71ed928015bcb2105801be368e38cd8
2021-10-05 17:15:50 -07:00
5e6347ca64 .circleci: Remove migrated distributed configs (#66174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66174

These configs have already been migrated, so we're going ahead and removing
them.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31413579

Pulled By: seemethere

fbshipit-source-id: 8923736d347eb8c8470884be413122c198d1bf20
2021-10-05 16:53:02 -07:00
e94fea08d0 Add hash and int128 utils for Lazy Tensor Core (#65635)
Summary:
These utils are prerequisites for Lazy Node base class.

- set up new torch/csrc/lazy, test/cpp/lazy dirs
- add source files to build_variables.bzl in new lazy_core_sources var
- create new test_lazy binary

Fixes https://github.com/pytorch/pytorch/issues/65636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65635

Reviewed By: alanwaketan

Differential Revision: D31260343

Pulled By: wconstab

fbshipit-source-id: 8bb1194188e3e77fc42e08a14ba37faed37a9c2e
2021-10-05 16:43:55 -07:00
143c957c2d [nnc] Reduced memory usage of LLVMCodeGen object after code generation is complete (#65373)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65373

Test Plan: Imported from OSS

Reviewed By: bertmaher, hlu1

Differential Revision: D31066974

Pulled By: navahgar

fbshipit-source-id: 0dbe0d1746c50adee90fe5a7cc4a66adba3a229e
2021-10-05 16:27:43 -07:00
68555339d7 test_utils.py: Add another retry to test_download_url_to_file (#66159)
Summary:
Fixes one of the flakiness concerns mentioned in https://github.com/pytorch/pytorch/issues/65439#issuecomment-934686485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66159

Reviewed By: ngimel

Differential Revision: D31406485

Pulled By: janeyx99

fbshipit-source-id: cf7834cdab58360ecef1748075d52969de2e0778
2021-10-05 16:26:20 -07:00
d2021e5e68 ci: Migrate vulkan builds to GHA (#66044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66044

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31370889

Pulled By: seemethere

fbshipit-source-id: 399f5f0c184f7856dcddb138c357f1374706e676
2021-10-05 16:11:36 -07:00
7452b65144 Remove unused dump method from VSX vec256 methods (#66085)
Summary:
Follow up after https://github.com/pytorch/pytorch/pull/63533

Probably fixes https://github.com/pytorch/pytorch/issues/65956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66085

Reviewed By: ngimel

Differential Revision: D31382898

Pulled By: malfet

fbshipit-source-id: f3d97b0f2c7f1207827773ae85e2739f1d54b9c7
2021-10-05 16:05:01 -07:00
6e06cb76ff [JIT] Initialize CUDA context before launching fused kernel (#65064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064

The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for the first-time
compilation does CUDA context initialization properly, but later
cached execution may use a different thread which does not have
a proper CUDA context.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31269691

Pulled By: desertfire

fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2
2021-10-05 16:01:59 -07:00
a5e6b2b2e3 [Static Runtime] Add variadic sigrid_transforms_torch_bind (#63960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63960

Reviewed By: hlu1

Differential Revision: D30529880

fbshipit-source-id: 1c4be2f9c0944bbe1e1c146989588c96bfd14eda
2021-10-05 16:00:36 -07:00
e7747795c9 [PyTorch Edge] Reduce dispatch table size further for a trimmed build (#66112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66112

Eliminate Metal and Vulkan Dispatch Keys.

Test Plan: Build + Sandcastle

Differential Revision: D31298307

fbshipit-source-id: 31302fc626382db7997e5058750fa85458c9cbc1
2021-10-05 15:24:07 -07:00
a3bbaf227c Revert D31227448: [pytorch][PR] fixing sorting in stride indices
Test Plan: revert-hammer

Differential Revision:
D31227448 (da0e29edd4)

Original commit changeset: 51e3cd903757

fbshipit-source-id: a752a4df70281aa0eaaeb1afdd88395b08276da8
2021-10-05 14:28:34 -07:00
89b56d630d Create CI sev template (#66163)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66163

Reviewed By: seemethere

Differential Revision: D31407988

Pulled By: suo

fbshipit-source-id: a23b6fc5410ef1f901e2a7aacc2e0c17cb04d083
2021-10-05 13:55:07 -07:00
5883523c1d Remove dtype from torch.Storage and use only torch.ByteStorage (#62030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030

Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible

Fixes https://github.com/pytorch/pytorch/issues/47442

* **THE SERIALIZATION FORMAT IS FULLY FC/BC.** We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today.
* There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate.
* As we no longer know what the dtype of a storage is, we've **removed** the size method from Storage, replacing it with nbytes. This helps catch otherwise-silent errors where the number of elements is confused with the number of bytes.
* `Storage._new_shared` takes an `nbytes` kwarg and will reject previous positional-only calls. `Storage._new_with_file` and `_set_from_file` require explicit element size arguments.
* It's no longer possible to convert storages to different types using the float/double/etc methods. Instead, do the conversion using a tensor.
* It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling.
* The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall.
 To accommodate this case, we introduce a new TypedStorage concept that exists only during unpickling time which is used to temporarily store the dtype so we can construct a tensor. **If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage** or your serialization code will degrade to standard file-based serialization.
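
A minimal sketch of the intended post-change usage (illustrative; exact method availability depends on the release):

```
import torch

t = torch.arange(4, dtype=torch.float32)
s = t.storage()      # an untyped, byte-oriented storage
print(s.nbytes())    # 16 = 4 elements * 4 bytes; replaces the old size()

# dtype conversions now go through tensors rather than typed storages:
s64 = t.double().storage()
```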

Original pull request: https://github.com/pytorch/pytorch/pull/59671

Reviewed By: soulitzer, ngimel

Differential Revision: D29466819

Pulled By: ezyang

fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e
2021-10-05 13:50:34 -07:00
588c1787ba Update link to example pytorch/examples (#66095)
Summary:
`https://github.com/goldsborough/examples/tree/cpp/cpp` -> `https://github.com/pytorch/examples/tree/master/cpp`
As the C++ examples in https://github.com/pytorch/examples are more up to date.

Partially addresses https://github.com/pytorch/pytorch/issues/65388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66095

Reviewed By: janeyx99

Differential Revision: D31382888

Pulled By: malfet

fbshipit-source-id: 8884c7795386249dea07cbe66783fa1dd963e07c
2021-10-05 12:48:12 -07:00
da0e29edd4 fixing sorting in stride indices (#63940)
Summary:
Updating `computeStrideProps` logic to break ties on stride_indices.

For two dimensions with identical strides, the dimension with size 1 should be considered the faster dimension. Otherwise, its stride should be the product of the existing stride and the size of the other dimension.

Note that there's still an inconsistency between eager memory_format and stride_properties in JIT; this is a design issue caused by the ambiguity of size-1 strides. One example showing this failure has been disabled in the added cpp test.
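
For illustration, a hedged sketch of the tie-break rule (not the actual `computeStrideProps` code):

```
import torch

# A size-1 dimension shares its stride with a neighbor:
x = torch.ones(3, 1, 2)
print(x.stride())  # (2, 2, 1) -- dims 0 and 1 tie with stride 2

# On equal strides, the size-1 dim counts as the faster (inner) one:
def faster_dim(sizes, strides, a, b):
    if strides[a] != strides[b]:
        return a if strides[a] < strides[b] else b
    if sizes[a] == 1:
        return a
    if sizes[b] == 1:
        return b
    return a  # otherwise keep the original order
```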

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63940

Reviewed By: albanD

Differential Revision: D31227448

Pulled By: dzhulgakov

fbshipit-source-id: 51e3cd903757bef55d3158c057f9444d0cff7d2a
2021-10-05 12:30:41 -07:00
0d020effab [quant] Fix the parts that were missing after initial migration (#66058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66058

After the initial migration from `torch.quantization` to `torch.ao.quantization`, some of the files did not change.
This happened because the migration was done in parallel, and some of the files were landed while the others were still in the original location.
This is the last fix in the AO migration phase 1, which completely enables the ao.quantization namespace.

Test Plan: `python test/test_quantization.py`

Reviewed By: vkuzo

Differential Revision: D31366066

Pulled By: z-a-f

fbshipit-source-id: bf4a74885be89d098df2d87e685795a2a64026c5
2021-10-05 11:45:37 -07:00
727576e501 [quant] Fixing the hypothesis test for topk (#66057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66057

The current test creates sets that are too slow to generate.
This will cause either "Filtering too much" or "Timeout" errors in future versions of hypothesis.
This PR preemptively fixes the issue.

Test Plan: `python test/test_quantization.py`

Reviewed By: vkuzo

Differential Revision: D31366065

Pulled By: z-a-f

fbshipit-source-id: deaab4da8ee02a5dee8943cabdd30fc53d894a34
2021-10-05 11:43:56 -07:00
92d0b7e99c [deploy] fix typo in registerModuleSource (#66107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66107

lol

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D31385631

Pulled By: suo

fbshipit-source-id: a3307e2862f7951c160776eb8edb18329c937ed1
2021-10-05 11:15:35 -07:00
458a00bacb Back out "[quant] update fused_obs_fake_quant op to accept output_fake_quant argument" (#66063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66063

Original commit changeset: bffe776216d0

Test Plan: CI

Reviewed By: vkuzo

Differential Revision: D31347042

fbshipit-source-id: f56f628dc4690187bf284a8f2fda4c6aae10c1d6
2021-10-05 11:02:54 -07:00
2b39b80971 [quantized] Replace conv_p with convolution_op in qnnpack (#65783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65783

convolution_op makes the conv_param struct redundant, since it contains all the params of conv_param and more. We don't need to pass both structs to qnnpack or hold both in the packed weights; let's just hold convolution_op.

This makes it easier to implement 3D conv since we won't have to template two structs. The conv_param struct is kept since tests rely on it to set up the convolution.
ghstack-source-id: 139479651

(Note: this ignores all push blocking failures!)

Test Plan: ci

Reviewed By: kimishpatel

Differential Revision: D30738727

fbshipit-source-id: e6d39644357b99d3b7491ae8a7066bf107eb8b9e
2021-10-05 11:01:26 -07:00
bda3230b62 slow_conv2d grad_weight: call gemm directly (#65726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65726

This PR isn't strictly necessary since grad_weight doesn't use
parallel_for. However, this does reduce the function overhead and will
make it easier to parallelize in the future.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31257877

Pulled By: ngimel

fbshipit-source-id: d8ea97cc1f43d8d9dfff355ae27c9d982838b57e
2021-10-05 10:53:22 -07:00
1db78c30c9 Fix LLVM-12 concat_split_op.h error (#66060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66060

Fixes
```
testTumHistoryAdditionalLaser (caffe2.caffe2.fb.layers.tests.tum_history_test.TestTumHistory) ... caffe2/caffe2/operators/concat_split_op.h:363:74: runtime error: applying non-zero offset 8 to null pointer
    #0 0x7f8f39d29795 in caffe2::ConcatOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/concat_split_op.h:363
    #1 0x7f8f39c4978d in caffe2::Operator<caffe2::CPUContext>::Run(int) caffe2/caffe2/core/operator.h:987
    #2 0x7f8f381fe9c9 in caffe2::SimpleNet::Run() caffe2/caffe2/core/net_simple.cc:67
    #3 0x7f8f38ee488e in caffe2::Workspace::RunNet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) caffe2/caffe2/core/workspace.cc:289
```

Test Plan: Sandcastle

Reviewed By: dzhulgakov, xush6528

Differential Revision: D31366205

fbshipit-source-id: 566aa519677c9d371189e4b1f81d595732861efc
2021-10-05 10:48:56 -07:00
9c3eb50b7b [PyTorch] Use std::move() in a couple places in function_schema_parser.cpp (#66114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66114

ghstack-source-id: 139712533

Test Plan: Build

Reviewed By: swolchok

Differential Revision: D31387502

fbshipit-source-id: e850cb7df397a7c5b31df995b23ad6e5c004ac86
2021-10-05 10:44:07 -07:00
aa80f05d2d Remove sync in Embedding caused by unique (#66091)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66091

Reviewed By: albanD

Differential Revision: D31385576

Pulled By: ngimel

fbshipit-source-id: e656d4d9c38b705c71853ca295f977d1cddc61a1
2021-10-05 09:39:42 -07:00
1932bc69e9 Move GHA to ONNX (#65975)
Summary:
- Delete CircleCI ONNX config
- Add sharded ONNX job to the list of generated workflows
- Move ONNX runtime installation from `pytorch-job-specs.yml` to `.jenkins/caffe2/test.sh`
- Limit MKLDNN to AVX2 ISA while running  Caffe2 tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65975

Reviewed By: seemethere

Differential Revision: D31327206

Pulled By: malfet

fbshipit-source-id: 15aa53e4481e846c62b4ee2db5c03047d68679a4
2021-10-05 09:31:57 -07:00
df475aa1dc Update Vulkan runner in benchmark binary to handle non-tensor inputs (#66123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66123

Some models may take in a list of tensors as inputs, thus the bundled inputs will contain `IValues` that are of the type `c10::List`. For Vulkan models, every tensor in the `IValue` list has to be converted to a vulkan tensor first, and this case is not currently handled by the Vulkan model wrapper in the benchmark binary.

This diff introduces `IValue` type checking to the input processor of the Vulkan model wrapper, and adds support for Tensor and List types.

Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-output
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models

# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"
```

Reviewed By: beback4u

Differential Revision: D31276862

fbshipit-source-id: 1d9abf958963da6ecad641202f0458402bee5ced
2021-10-05 07:59:56 -07:00
2a5116e159 [quant][fx2trt] Add quantize_per_channel in acc_ops and acc_ops_converter (#65287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65287

Test Plan:
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py

Imported from OSS

Reviewed By: 842974287

Differential Revision: D31038882

fbshipit-source-id: cd20e132ffa85f6fb070e21cd96a9e84dd15fab5
2021-10-05 02:12:00 -07:00
d609957c95 patching graph_for (#55139)
Summary:
Allows an individual DifferentiableGraphOp to display its optimized forward graph. This improves user visibility into graph mutations performed by optimization passes, especially fusion.
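
A minimal usage sketch, assuming the usual `graph_for` debug entry point on a scripted function:

```
import torch

@torch.jit.script
def f(x):
    return x * 2 + x

x = torch.randn(4, requires_grad=True)
f(x); f(x)              # warm up so profiling and optimization kick in
print(f.graph_for(x))   # now shows optimized DifferentiableGraph bodies
```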

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55139

Reviewed By: albanD

Differential Revision: D31330909

Pulled By: dzhulgakov

fbshipit-source-id: c745b482fdc34876dc404cbe3bacd99dcf2ac724
2021-10-04 21:50:22 -07:00
ed50fa2513 [Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849

Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31282901

fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
2021-10-04 20:46:07 -07:00
4c4525fa5c Compile without -Wno-unused-variable (take 2) (#66041)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable warnings in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Do not delete `caffe2::OperatorBase::Output` calls as they have side effects

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041

Reviewed By: ngimel

Differential Revision: D31360142

Pulled By: malfet

fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
2021-10-04 20:39:39 -07:00
6b0aa2958d [FX] Support torch.layout as arg (#66048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66048

Previously, create_arg would fail if it encountered a non-`None` layout argument. Adding it to the `BaseArgumentTypes` list should be enough to fix that.
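
A small illustration (hedged; the traced function here is made up for the sketch):

```
import torch
from torch.fx import symbolic_trace

def f(x):
    # a non-None torch.layout argument now traces without error
    return torch.zeros_like(x, layout=torch.strided)

print(symbolic_trace(f).graph)
```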

Test Plan: Added unittest

Reviewed By: jamesr66a

Differential Revision: D31362662

fbshipit-source-id: 20049971e18c17e9c75e50540500c567266daa55
2021-10-04 19:58:08 -07:00
6ea4902cf4 [ao_migration] torch.quantization --> torch.ao.quantization in caffe2/torch/fx (#66096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66096

codemod -m -d caffe2/torch/fx --extensions py \
    'torch.quantization' \
    'torch.ao.quantization'

Test Plan: test_in_prod

Reviewed By: z-a-f

Differential Revision: D31294195

fbshipit-source-id: 00425844f8160749f68bdbdf0e08cb22c79099c9
2021-10-04 19:57:01 -07:00
de24faec5f Binary building wthout python fix (#66031)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66031

Reviewed By: VitalyFedyunin

Differential Revision: D31356243

Pulled By: malfet

fbshipit-source-id: d1537bc65bbba5d6497ecb8db7160a397eca81fd
2021-10-04 18:34:35 -07:00
6eb3a1c831 Run master clang-tidy on PRs (#66104)
Summary:
Make the PR clang-tidy run a strict superset of the master one.
Should prevent a situation where [clang-tidy on a PR](https://github.com/pytorch/pytorch/runs/3773346094) was clean but regressed on a [trunk commit](https://github.com/pytorch/pytorch/runs/3773406183?check_suite_focus=true)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66104

Reviewed By: seemethere

Differential Revision: D31384608

Pulled By: malfet

fbshipit-source-id: 397319be3480520d58eab11ec001ad7a9a94d41c
2021-10-04 18:27:38 -07:00
7c758759e3 [PyTorch Edge] Avoid string copying in TypeParser (#64278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64278

Use c10::string_view and const char* to copy less.
ghstack-source-id: 139468089

Test Plan:
Pixel 3 before: https://www.internalfb.com/intern/aibench/details/132239033718036
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/132239033718036
went from mean of 293 ms to 281 ms.

Reviewed By: dhruvbird

Differential Revision: D30650712

fbshipit-source-id: abad143f2d5cc99a30e8da376c8e37716373032a
2021-10-04 16:10:38 -07:00
69da4b4381 GHA: make obvious when we are running smoke tests to user (#66011)
Summary:
This PR clarifies what runs on PRs by explicitly stating when smoke tests for Windows CUDA are run, and changes the logic so that user-defined labels override other workflow logic.

1. Move smoke tests to its own config.

2. Make sure that when a user specifies a ciflow label that is not the default, the workflow runs as if it is on trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66011

Test Plan:
The default on PRs would generate this matrix (default replaced by smoke_tests):
![image](https://user-images.githubusercontent.com/31798555/135672182-64454ea3-ff43-4746-b8e4-09b0b28e9d33.png)
But when retriggered with a label, it looks like (note that there's no smoke_tests config):
![image](https://user-images.githubusercontent.com/31798555/135672601-5aa9a268-bc76-40f1-80c6-62b3fac6601d.png)

Reviewed By: VitalyFedyunin, seemethere

Differential Revision: D31355130

Pulled By: janeyx99

fbshipit-source-id: fed58ade4235b58176e1d1a24101aea0bea83aa4
2021-10-04 07:53:17 -07:00
4cdfceddd2 [Reland] Avoid saving self for softmax and log_softmax (#66018)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/65242

The last attempt of the reland automatically rebased onto stable, which did not yet have the revert commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66018

Reviewed By: albanD

Differential Revision: D31348822

Pulled By: soulitzer

fbshipit-source-id: 881d701b404530c1352ac9245bd67264e1652b8a
2021-10-03 21:35:01 -07:00
8f5631b859 Refactor functional api vectorized jacobian to use batched grad parameter (#65566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65566

This doesn't simplify the vectorized jacobian computation, but it consolidates the logic and makes it easier to test.
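
For reference, the consolidated path is what backs `vectorize=True` in the functional API (a usage sketch):

```
import torch
from torch.autograd.functional import jacobian

def f(x):
    return (x ** 2).sum(dim=0)

J = jacobian(f, torch.randn(4, 3), vectorize=True)
print(J.shape)  # torch.Size([3, 4, 3]) -- output shape + input shape
```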

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31236257

Pulled By: soulitzer

fbshipit-source-id: 00ca0aa6519bed5f9ee2c7be4daa8872af5e92cd
2021-10-03 19:55:08 -07:00
73901b099d Add batched_grad parameter to autograd.grad (#65564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65564

- wrap the call into engine with vmap if `batched_grad` is `True`
- improves the comment on the call to engine (somewhat addressing https://github.com/pytorch/pytorch/issues/41659)
- borrows the message from functional.jacobian's vectorized argument concerning usage of the vmap feature
- adds basic test (further testing is done when we replace the usage in vectorized jacobian computation)

TODO:
 - create an issue tracking this
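
A hedged usage sketch (the PR title calls the flag `batched_grad`; current releases spell the equivalent `torch.autograd.grad` argument `is_grads_batched`):

```
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2

v = torch.eye(3)  # leading dim of grad_outputs is the batch dim
(g,) = torch.autograd.grad(y, x, grad_outputs=v, is_grads_batched=True)
print(g.shape)    # torch.Size([3, 3]) -- 3 gradients in one vmapped engine call
```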

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31236259

Pulled By: soulitzer

fbshipit-source-id: b33e6b26ea98fa9f70c44da08458fc54ba4df0f7
2021-10-03 19:55:06 -07:00
b6d5f1ee70 Allow None to pass through for vmap (#65565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65565

Does JAX allow this?

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31236258

Pulled By: soulitzer

fbshipit-source-id: 80460b355fc32ecbba8151e1f3179f076a927f9d
2021-10-03 19:53:49 -07:00
89ed9bdaee [Static Runtime] Fix bug of creating output aliases in aten::embedding_bag (#65516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516

This change fixes a bug that Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors.

Managed output tensors should never alias each other, since writing to one can unintentionally overwrite another's contents. This exact problem was causing the bug at T97393697, making SR return wrong values.

This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but wasn't found earlier since our testing didn't include running the model in debug mode. Fortunately this bug does not affect production, since the aliased outputs are not used there.

This change fixes the root cause from `_embedding_bag_cpu_impl_out`  by replacing alias creation with copying.

Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
 1. profile run
 2. run using the profile to allocate managed tensors
 3. reuse the managed tensors -- newly added

Adding step 3 reveals this bug via a new unit test, `EmbeddingBagWithManagedOutput`.

Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).

- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.

Reviewed By: hlu1

Differential Revision: D31104345

fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
2021-10-03 15:10:58 -07:00
40948a935d Fix LLVM-12 UB in generate_proposals_op.cc (#66009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66009

Fixes
```
test_trace_c10_ops (jit.test_tracer.TestTracer) ... third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374:24: runtime error: applying non-zero offset 4 to null pointer
    #0 0x7f5228f72227 in Eigen::internal::BlockImpl_dense<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, true>::BlockImpl_dense(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374
    #1 0x7f5228f7212c in Eigen::BlockImpl<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, Eigen::Dense>::BlockImpl(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:166
    #2 0x7f5228f720dc in Eigen::Block<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>::Block(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:142
    #3 0x7f5229b0e059 in Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::FixedBlockXpr<internal::get_fixed_value<int>::value, internal::get_fixed_value<long>::value>::Type Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::block<int, long>(long, long, int, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/../plugins/BlockMethods.h:98
    #4 0x7f5229b0c5ca in caffe2::GenerateProposalsOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/generate_proposals_op.cc:348
```
Also cleans up some data type and const issues around the area.

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D31343046

fbshipit-source-id: fd9096c8e47a0aad529c72fd313f64ca98dcb80b
2021-10-03 12:50:21 -07:00
c7748fc172 Added validation of mode parameter in AveragedModel (#65921)
Summary:
Discussion: https://github.com/pytorch/pytorch/pull/65495#issuecomment-930460469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65921

Reviewed By: albanD

Differential Revision: D31310105

Pulled By: prabhat00155

fbshipit-source-id: 417691832a7c793744830c11e0ce53e3972d21a3
2021-10-03 08:42:28 -07:00
0fc6bd2e47 [gpu ne eval] disable adam decay unit test for gpu (#66056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66056

We keep running into this unrelated failure when landing diffs for the GPU inference project. Disable this operator's unit test on GPU, because the operator doesn't exist there:

RuntimeError: [enforce fail at operator.cc:277] op. Cannot create operator of type 'SmartDecaySparseAdam' on the device 'CUDA'. Verify that implementation for the corresponding device exist. It might also happen if the binary is not linked with the operator implementation code. If Python frontend is used it might happen if dyndep.InitOpsLibrary call is missing. Operator def: input: "param" input: "mom1" input: "mom2" input: "last_seen" input: "indices" input: "grad" input: "lr" input: "iter" output: "param" output: "mom1" output: "mom2" output: "last_seen" name: "" type: "SmartDecaySparseAdam" arg { name: "beta1" f: 0 } arg { name: "beta2" f: 0.9 } arg { name: "epsilon" f: 1e-05 } device_option { device_type: 1 }

https://www.internalfb.com/intern/testinfra/diagnostics/5910974579962988.562949996565057.1633122845/

Test Plan: sandcastle

Reviewed By: jianyuh

Differential Revision: D31364731

fbshipit-source-id: 7fbd994cbe7f6ca116f5f34506a1ed7f14759bdf
2021-10-03 07:40:23 -07:00
29c0725e8a Back out "[caffe2] fix LLVM-12 nullptr-with-nonzero-offset UBSAN error" (#66055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66055

Original commit changeset: c31f179f8a7d

Reviewed By: igorsugak

Differential Revision: D31353348

fbshipit-source-id: 73d928e5c938ba604a7f9ea17a6250b57306e88f
2021-10-02 16:46:26 -07:00
7c52963350 [WIP] skip constant folding dequant node (#63991)
Summary:
This PR makes Constant Propagation ignore dequant nodes.

https://github.com/pytorch/pytorch/issues/61092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63991

Reviewed By: pbelevich

Differential Revision: D31363993

Pulled By: Krovatkin

fbshipit-source-id: 99f7c56a4381aff2cbdf1167508414cf240e9f75
2021-10-02 15:30:43 -07:00
8a307640db selective trt import based whether we have gpu or not (#66045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66045

Att.

Reviewed By: kflu

Differential Revision: D31357388

fbshipit-source-id: 601affe067e5e4c1f1516dff4ac84fa9cdd27d5e
2021-10-02 06:12:37 -07:00
8b8012a165 [PyTorch Edge] Skip writing version during backport (#65842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65842

During backport, only parts of the model (like bytecode.pkl) need to be re-written, while the rest of the model stays the same. However, `version` is always re-written when `PyTorchStreamWriter` is destructed.

Change version to optional and add an API to allow skipping the version write when closing the writer.
ghstack-source-id: 139580386

Test Plan: buck run papaya/scripts/repro:save_load

Reviewed By: iseeyuan, tugsbayasgalan

Differential Revision: D31262904

fbshipit-source-id: 3b8a5e1aaa610ffb0fe8a616d9ad9d0987c03f23
2021-10-01 21:18:31 -07:00
7941590a51 [JIT] Selectively enable precise alias analysis for TupleConstruct (#66025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66025

This change adds an option to selectively enable precise alias analysis for `prim::TupleConstruct` (introduced by D30437737 (cd458fe092)) to minimize its exposure only to `StaticRuntime` as of now.

Test Plan: Modified existing unit tests whose behavior depends on D30437737 (cd458fe092).

Reviewed By: eellison

Differential Revision: D31350285

fbshipit-source-id: 3ce777f07f99650d74634481ad0805192dce55c6
2021-10-01 20:42:22 -07:00
e4ee5ca698 Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable
Test Plan: revert-hammer

Differential Revision:
D31326599 (a6280ab653)

Original commit changeset: 924155f1257a

fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf
2021-10-01 20:40:47 -07:00
5ef350d7cc Revert D31359010: [pytorch][PR] Fix clang-tidy regressions caused by #65954
Test Plan: revert-hammer

Differential Revision:
D31359010 (c269f471f4)

Original commit changeset: dce4b91a9891

fbshipit-source-id: 085417432b6748d3672b9b7141460f47d1c17a7f
2021-10-01 20:35:35 -07:00
c269f471f4 Fix clang-tidy regressions caused by #65954 (#66040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66040

Reviewed By: ZolotukhinM

Differential Revision: D31359010

Pulled By: malfet

fbshipit-source-id: dce4b91a98913c8d8c2d8f9ebc49654265239158
2021-10-01 19:50:53 -07:00
ca76e193a3 Fix nll_backward for negative weights (#64572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64572

Fixes https://github.com/pytorch/pytorch/issues/64256
It also fixes an inconsistent treatment of the case `reduction = "mean"`
when the whole target is equal to `ignore_index`. It now returns `NaN`
in this case, consistently with what it returns when computing the mean
over an empty tensor.

We add tests for all these cases.
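
An illustrative repro of the edge case, assuming the default `ignore_index=-100`:

```
import torch
import torch.nn.functional as F

inp = torch.randn(2, 3).log_softmax(dim=1)
tgt = torch.full((2,), -100, dtype=torch.long)  # every target == ignore_index
print(F.nll_loss(inp, tgt, reduction='mean'))   # tensor(nan) -- mean over 0 elements
```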

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31116297

Pulled By: albanD

fbshipit-source-id: cc44e79205f5eeabf1efd7d32fe61e26ba701b52
2021-10-01 19:41:51 -07:00
eb3b9fe719 [XROS][ML] System specific adjustments for UTs to work. (#65245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65245

Building and running c10 and qnnpack tests on XROS.

Notable changes:
- Adding `#if defined(_XROS_)` in a few places not supported by XROS
- Changing Threadpool to an abstract class
ghstack-source-id: 139513579

Test Plan: Run c10 and qnnpack tests on XROS.

Reviewed By: veselinp, iseeyuan

Differential Revision: D30137333

fbshipit-source-id: bb6239b935187fac712834341fe5a8d3377762b1
2021-10-01 18:15:14 -07:00
363ccb257d GELU acc OP (#65957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65957

Added accelerator (acc) ops and a unit test for GELU.

Test Plan: buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer

Reviewed By: 842974287

Differential Revision: D31277083

fbshipit-source-id: f66dd05ef574db58cfa599e3575f95f1ebe82e93
2021-10-01 17:49:53 -07:00
a6280ab653 Compile without -Wno-unused-variable (#65954)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954

Reviewed By: ngimel

Differential Revision: D31326599

Pulled By: malfet

fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
2021-10-01 17:40:47 -07:00
10f6294281 Fix shape inference dim_type for Clip, Mean, Div (#65996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65996

Test Plan:
Facebook
```
buck build caffe2/caffe2/opt:bound_shape_inference_test && ./buck-out/gen/caffe2/caffe2/opt/bound_shape_inference_test --gtest_filter=*Clip*
```
```
buck build caffe2/caffe2/opt:bound_shape_inference_test && ./buck-out/gen/caffe2/caffe2/opt/bound_shape_inference_test --gtest_filter=*Div*
```
```
buck build caffe2/caffe2/opt:bound_shape_inference_test && ./buck-out/gen/caffe2/caffe2/opt/bound_shape_inference_test --gtest_filter=*Mean*
```

Reviewed By: yinghai

Differential Revision: D31121298

fbshipit-source-id: f366d8f4d4d0be159b62bfaafc42ca924c05e022
2021-10-01 17:34:34 -07:00
e1d963e8fc model_dump: Fix memory computation when both constants and data tensors are present (#66006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66006

Previously, this was resulting in a key collision and a crash.
ghstack-source-id: 139342089

Test Plan: Ran webdriver test locally.

Reviewed By: dhruvbird

Differential Revision: D31281092

fbshipit-source-id: f31311726c681d6d7e0504ff8e84c888af9054f0
2021-10-01 16:31:06 -07:00
23caeb3f71 model_dump: Add a helper to produce html with a single call (#66005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66005

ghstack-source-id: 139342091

Test Plan: Unit test, and used in a notebook.

Reviewed By: dhruvbird

Differential Revision: D31281091

fbshipit-source-id: 1e4d0713b9796a3d182de9e676c3b3c3b1610d6e
2021-10-01 16:29:43 -07:00
d9a95e66f0 Upload test failures to RDS (#65873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65873

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D31296520

Pulled By: driazati

fbshipit-source-id: 0bd3fb6b62e49c7177199001fda0e7b124a22ab2
2021-10-01 16:25:51 -07:00
f85d7422bb [fx2trt]add support for torch.tile (#66016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66016

Add acc_ops.tile and converter for it.

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_tile

Reviewed By: wushirong

Differential Revision: D30587939

fbshipit-source-id: 1e2613cfca486fe54fcc0d38e5c7cdeb7d0ed4a0
2021-10-01 16:06:09 -07:00
060e41eafa Forward fix type hint for DataLoader (#66001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66001

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D31340565

Pulled By: ejguan

fbshipit-source-id: d05ae42ebf93f61d781dc5d81ef0222e24f5acb3
2021-10-01 15:48:45 -07:00
ad889d0b5e Revert D30634700: [pytorch][PR] Fix typo in tensor docs
Test Plan: revert-hammer

Differential Revision:
D30634700 (d937473709)

Original commit changeset: e8952be20966

fbshipit-source-id: b18694e332023abcdf17ec1900b81b00d21f1014
2021-10-01 15:23:38 -07:00
7d22007902 [fx-acc] add acc_op optimization flags and decorator (#65928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65928

This diff adds a decorator for adding flags to acc_ops. These flags inform graph optimizations that the op is eligible for optimization by some general criteria (e.g. op acts elementwise, op does quantization).

This makes it simpler to expand acc_ops. The user can add an op and add flags to enable optimization without going through all graph opts and trying to determine whether the new acc_op is eligible for each graph optimization.

Even though our list of graph opts is small now, we already see that for `sink_reshape_ops` we had hardcoded 11 pointwise acc_ops; now there are 24.

Test Plan:
```
buck test mode/opt glow/fb/fx/graph_opts:test_fx_sink
```

```
Parsing buck files: finished in 0.5 sec
Downloaded 0/3 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 37.1 sec (100%) 10279/10279 jobs, 3/10279 updated
  Total time: 37.7 sec
More details at https://www.internalfb.com/intern/buck/build/e13521bb-6142-4960-8cdd-6b5e4780da96
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 16260a2a-d364-4605-9111-6f2a19317036
Trace available for this run at /tmp/tpx-20210922-124332.623880/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4222124720425564
    ✓ ListingSuccess: glow/fb/fx/graph_opts:test_fx_sink - main (6.038)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_sink - test_no_sink_concat_below_quantize (glow.fb.fx.graph_opts.tests.test_fx_sink.TestSink) (0.036)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_sink - test_sink_concat_below_quantize (glow.fb.fx.graph_opts.tests.test_fx_sink.TestSink) (0.048)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_sink - test_sink_reshape_nodes (glow.fb.fx.graph_opts.tests.test_fx_sink.TestSink) (0.058)
    ✓ Pass: glow/fb/fx/graph_opts:test_fx_sink - test_no_sink (glow.fb.fx.graph_opts.tests.test_fx_sink.TestSink) (0.057)
Summary
  Pass: 4
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124720425564
```

Reviewed By: jfix71

Differential Revision: D31121321

fbshipit-source-id: 6f6e3b8e2d57ea30766fa6bee34ca207cec86f0f
2021-10-01 15:19:35 -07:00
d937473709 Fix typo in tensor docs (#64160)
Summary:
Remove extra character from `torch.qfint32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64160

Test Plan: Docs

Reviewed By: jerryzh168

Differential Revision: D30634700

Pulled By: axitkhurana

fbshipit-source-id: e8952be20966b9a3f9d62d9957ae255d5d4889bb
2021-10-01 14:57:55 -07:00
8e8695285f Re-generate workflows (#66027)
Summary:
Fix master breakage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66027

Reviewed By: suo, malfet

Differential Revision: D31353922

Pulled By: driazati

fbshipit-source-id: cdb7f639608999b6ee72f6b1000d7ecbc02efc95
2021-10-01 14:56:51 -07:00
894d296bae Remove usage of GitHub's artifact store in linux jobs (#65875)
Summary:
The docs upload is unnecessary since the docs are hosted in S3 anyway, and the test reports are mirrored in S3, which has better upload/download speed and makes them available as soon as the upload finishes rather than once the workflow completes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65875

Reviewed By: seemethere

Differential Revision: D31296500

Pulled By: driazati

fbshipit-source-id: 8c371230d0c8c0eb785702df9ae495de85f60afa
2021-10-01 13:49:44 -07:00
6e8ffd191e Fix typo in name of LayerNormBackwardCUDAKernel (#66000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66000

Saw this in nvprof and I'm just a little too nitpicky to let it slide!
ghstack-source-id: 139547271

Test Plan: CI

Reviewed By: xiaomengy

Differential Revision: D31340262

fbshipit-source-id: ab48dc99c34a74585e66800b4bbcccc6aabbaff2
2021-10-01 12:28:59 -07:00
ffede499b2 [PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499

When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
ghstack-source-id: 139549027

Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).

Before:

I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971

mean: 7.144172, stddev: 0.1283

After:

I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85

mean: 6.800632, stddev: 0.013227

Looks like about a 5.3% improvement.

Reviewed By: hlu1

Differential Revision: D31125492

fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
2021-10-01 12:13:33 -07:00
7b10a76e05 [PyTorch] Try removing Android strtod implementation (#65713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65713

This may not be needed anymore.
ghstack-source-id: 139114284

Test Plan: see if it builds

Reviewed By: dhruvbird

Differential Revision: D31216245

fbshipit-source-id: 29c9c013f94070c7713e46027881cb693b144d36
2021-10-01 11:43:15 -07:00
176d3c6fb4 [PyTorch] Fix many Tuple::elements() callsites (#64065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64065

It is only safe to mutate Tuple elements if you are the sole owner
of the tuple. The most efficient way to do this, then, is
`std::move(*std::move(tupleIValue).toTuple()).elements()` (the
innermost move allows `IValue::toTuple()` to avoid a refcount bump and
the outermost move allows the element vector to be moved out of the
tuple), but many callsites write simply
`tupleIValue.toTuple().elements()`, which incurs many extra refcount
bumps.

ghstack-source-id: 139468088

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D30592621

fbshipit-source-id: e8312de866de09b9ea2a62e5128cbf403ee16f09
2021-10-01 11:36:05 -07:00
f14e5e636d [fx2trt]fix slice tensor converter (#65960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65960

Fix a bug in the converter and add support for negative dim.

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_narrow

Reviewed By: wushirong

Differential Revision: D31310232

fbshipit-source-id: 62887369d830202cae6d63b41747225b12dcf754
2021-10-01 11:29:42 -07:00
21eebc9fd6 [PyTorch][easy] Use copy-and-move instead of copy-and-swap in IValue::operator= (#65826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65826

Should be marginally more efficient.
ghstack-source-id: 139315050

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D31272489

fbshipit-source-id: 7c309d67a0ec0ada35a5b62497bac374538394a9
2021-10-01 11:16:42 -07:00
592481a5cc [fx][const_fold] Refactor to use base split module to simplify, and correctly handle non-single-Tensor outputs (#65933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65933

We use `split_module` to split the input model that we want to const fold into const and non-const subgraphs. Previously we were taking the non-const graph and trying to hack it back into the same signature as the input model. However, this was complex and buggy.

Instead, refactor to just keep using the base split module that contains both const and non-const graphs. This means we:
- Inline the non-const graph into the split module
- Remove the const graph from the module and replace it with a getattr that will be run to insert that attr when we `run_folding`

Test Plan: Added test coverage to cover newly supported folding, and updated other tests for new strategy.

Reviewed By: yinghai

Differential Revision: D31293307

fbshipit-source-id: 6e283a8c7222cf07b14e30e74dffc8ae5ee8b55f
2021-10-01 10:26:29 -07:00
34682377b9 [iOS][CI] Update dev certs (#66004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66004

Reviewed By: xta0

Differential Revision: D31340893

Pulled By: malfet

fbshipit-source-id: 3bf0be266e9686a73d62e86c5cf0bebeb0416260
2021-10-01 09:38:49 -07:00
ccf8d48f16 Revert D31317680: [pytorch][PR] Avoid saving self for softmax and log_softmax
Test Plan: revert-hammer

Differential Revision:
D31317680 (5f7cadc7aa)

Original commit changeset: b3b921e06775

fbshipit-source-id: 1bca0672383536a2c21243ceb52349c766a94344
2021-10-01 09:31:44 -07:00
21da6ae9ce suppress mypy error (#66003)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66003

Differential Revision:
D31340874

Test Plan: Imported from OSS

Reviewed By: seemethere

Pulled By: suo

fbshipit-source-id: d9ef0f40625fe5ff21f8a5e044d5a75400367dc2
2021-10-01 09:17:42 -07:00
eac218dbc6 Revert "Port sort kernel to structured kernels. (#62391)" (#65876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65876

This reverts commit 93852bb2d41d90b6ac660015d79f7474bcebb774.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31296329

Pulled By: bdhirsh

fbshipit-source-id: 85eae72f2346d69290f440f5393a7da096a96c6e
2021-10-01 07:50:28 -07:00
5f7cadc7aa Avoid saving self for softmax and log_softmax (#65242)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
 - updates double backward formula to compute grad wrt output instead of self
 - ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~
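
For intuition, a sketch of why the output suffices (standard softmax calculus, not quoted from derivatives.yaml): with y = log_softmax(x) and incoming gradient g,

    dL/dx_i = g_i - exp(y_i) * sum_j g_j

and with y = softmax(x),

    dL/dx_i = y_i * (g_i - sum_j g_j * y_j)

Both expressions depend only on the saved output y and on g, never on x itself, so self need not be saved.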

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242

Reviewed By: malfet

Differential Revision: D31317680

Pulled By: soulitzer

fbshipit-source-id: b3b921e06775cfc12e5a97a9ee8d73aec3aac7c3
2021-10-01 07:49:07 -07:00
383c0a3858 Fix internal assert failure for torch.all and torch.any with requires_grad=True (#65714)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/58547.
I added an OpInfo-based test that fails on master and passes with the
proposed changes.
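
An illustrative repro shape, per the linked issue (hedged sketch):

```
import torch

x = torch.ones(2, 3, requires_grad=True)
print(x.all(), x.any())  # used to trip an internal assert; now returns bool tensors
```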

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65714

Reviewed By: saketh-are, mruberry

Differential Revision: D31248307

Pulled By: albanD

fbshipit-source-id: 041eaa9b744c3043f78dd8ae5f457f67c311df4f
2021-10-01 07:32:44 -07:00
53c0d91db9 Make autograd codegen for differentiable outputs safer to use (#65823)
Summary:
This PR raises an error when `len(output_differentiability) != len(outputs)`

Notes in derivatives.yml tell that
> 'output_differentiability' and value a list of the same length as the number of outputs from the forward function.

but it was not enforced in codegen, leading to confusion and unexpected bugs: https://github.com/pytorch/pytorch/issues/65061#issuecomment-930271126.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65823

Reviewed By: mrshenli

Differential Revision: D31307312

Pulled By: albanD

fbshipit-source-id: caeb949e9249310dffd237e77871e6d0d784e298
2021-10-01 07:27:57 -07:00
bff8d8fd28 [nnc] Add BufHandle.store to python API (#65213)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65213

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31328502

Pulled By: bertmaher

fbshipit-source-id: 1f260f68692c3859350587afe021a500672d79f0
2021-10-01 06:59:50 -07:00
8cf047afac [nnc] Add call_with_numel interface for fast CUDA calls (#65213)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65213

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31319012

Pulled By: bertmaher

fbshipit-source-id: 93fee80f956795470f5a2ce3b33c2ea2f132036f
2021-10-01 06:58:37 -07:00
8595b6eeed Avoid UB when indexing into size-0 tensors (#65878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65878

If we attempt to compute an offset into an empty tensor we trigger UB, since
we'd be adding an offset to a nullptr, which is UB
(https://reviews.llvm.org/D67122) even if we never use the pointer.

Since indexing into an empty tensor yields an empty tensor anyway, let's just
return the underlying (null) data ptr in this case.

ghstack-source-id: 139448496

Test Plan:
r-barnes originally pointed this out to me in a failing TE fuser test:
https://www.internalfb.com/intern/testinfra/diagnostics/5910974579561425.281475022329152.1632898053/
```
buck test mode/dev //caffe2/test:jit -- --exact 'caffe2/test:jit - test_unsupported_nn_functional_pad_circular_cpu_float32 (test_jit_fuser_te.TestNNCOpInfoCPU)'
```

But it turns out it's easily triggered by anything that tries to operate on a
slice of a size-0 tensor:
```
def test_pad(self):
    F.pad(torch.ones(0, 3, 3), (1, 2), 'circular')

def test_index(self):
    input = torch.zeros(0, 3, 3)
    out = torch.zeros(0, 3, 6)
    out[..., 1:4] = input[..., 0:3]

def test_add(self):
    torch.ones(0, 2)[:, 1] + torch.ones(0, 1)
```

What's the right place for this sort of operator corner-case test? Should they be, or are they already, part of OpInfo?

Reviewed By: jamesr66a

Differential Revision: D31296914

fbshipit-source-id: 0ef52ad311dceeed985498f8d9390bc6fbaefbfc
2021-10-01 06:55:15 -07:00
fc52f1293e Improve pytorch type hints (Dataloader, trig functions)
Summary:
This is to fix Pyre errors in our applications:
* calling `tensor.cos()` etc.
* creating a data loader with a batch sampler that is `List[List[int]]`.

Test Plan: TODO: rebase the diffs and run Pyre.

Reviewed By: ejguan

Differential Revision: D31309564

fbshipit-source-id: 1c6f3070d7570260de170e2fe2153d277b246745
2021-10-01 06:53:57 -07:00
982ef8837b [Static Runtime] Fuse ListUnpack + gather_ranges_to_dense (#65116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65116

Fuse `fb::gather_ranges_to_dense` with `prim::ListUnpack`.
```
%0 : Tensor[] = fb::gather_ranges_to_dense(...)
%1: Tensor, %2: Tensor, ... = prim::ListUnpack(%0)
```
turns into:
```
%0: Tensor, %1: Tensor, ... = fb::gather_ranges_to_dense(...)
```

Reviewed By: hlu1

Differential Revision: D30973525

fbshipit-source-id: f0349baa1622b697ee2ab652376a24ec0d89e819
2021-10-01 06:49:54 -07:00
227e37dd39 pytorch quantization ao migration phase 2: caffe2/test (#65832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65832

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/test`
folder.

```
find caffe2/test/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
HG: manually revert the files testing this migration
hg revert caffe2/test/quantization/ao_migration/common.py
hg revert caffe2/test/quantization/ao_migration/test_ao_migration.py
```

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275754

fbshipit-source-id: 4ed54a74525634feb0f47a26d071102e19c30049
2021-10-01 06:26:30 -07:00
dac35b3592 pytorch quantization ao migration phase 2: torch/jit (#65829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65829

Renames `torch.quantization` to `torch.ao.quantization` in `torch/jit` folder.

```
find caffe2/torch/jit/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31273365

fbshipit-source-id: 350eb116148d91b967d428b54413caee4fd68438
2021-10-01 06:22:22 -07:00
e3af4be963 pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks`
folder.

```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275963

fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116
2021-10-01 06:17:36 -07:00
c1447f06a8 [special] special alias for softmax (#62251)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345
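
A usage sketch, assuming the alias lands as `torch.special.softmax`:

```
import torch

x = torch.randn(5)
torch.testing.assert_close(torch.special.softmax(x, dim=0),
                           torch.softmax(x, dim=0))
```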

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62251

Reviewed By: H-Huang

Differential Revision: D31141834

Pulled By: mruberry

fbshipit-source-id: aecaf62af248e9034ef589159ce0fb325c729493
2021-10-01 03:55:32 -07:00
c27b427cd9 [sparsity] Add m-out-of-n support in the WeightNormSparsifier (#65295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65295

The m-out-of-n is implemented as follows:

1. Compute the blocks that need to be sparsified using the weight-norm criterion
2. Within each block below the threshold find the smallest absolute value elements
3. Zero out only the smallest values within each block

m-out-of-n describes a sparsification scheme where, in a block with n elements, only m of them are zeroed out.
Block sparsity, with the whole block being all zeros, is a special case of m-out-of-n: if m == n, the whole block is reset.

This echoes the implementation described in https://github.com/pytorch/pytorch/issues/59835,
and meets the requirements of NVIDIA cuSPARSELt.
To support CUDA 2:4 sparsity, one would set the sparsity_level to 1.0.
That means every block of shape 1x4 within a tensor will be sparsified with the 2-out-of-4 scheme.
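
A minimal configuration sketch; the import path follows the `torch.ao.sparsity` module this stack targets, and argument names may differ across releases (later versions expose `torch.ao.pruning`):

```
from torch.ao.sparsity import WeightNormSparsifier

# 2-out-of-4: in every 1x4 block, zero out the 2 smallest-magnitude weights.
sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1, 4),
                                  zeros_per_block=2)
```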

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31186828

Pulled By: z-a-f

fbshipit-source-id: 7bd3e2707915b90f4831859781fc6e25f716c618
2021-10-01 03:19:15 -07:00
8b1aa85388 [sparsity] Change API to take FQNs as configuration (#65296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65296

The original API described in the https://github.com/pytorch/pytorch/issues/59835
assumed that the per-layer configuration would take a module/layer
reference. However, a more useful approach is to refer to the layers
by their fully qualified names (FQN). That allows us to store the
configuration in a file without serializing the models.

We define a layer's FQN as its "path" within a model. For example,
if one can refer to a model using `model.layer0.sublayerX`, the FQN
of the sublayerX is `'layer0.sublayerX'`.
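
An illustrative config under the FQN-based API (the key names and `prepare` call shape are assumptions for this sketch):

```
config = [
    {'fqn': 'layer0.sublayerX', 'sparsity_level': 0.7},
]
# `model` and `sparsifier` assumed constructed elsewhere:
sparsifier.prepare(model, config)  # layers referenced by FQN, not module object
```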

Test Plan:
```
python test/test_ao_sparsity.py -- TestBaseSparsifier
buck test mode/opt //caffe2:test -- TestBaseSparsifier
```

Reviewed By: gchanan

Differential Revision: D31186830

Pulled By: z-a-f

fbshipit-source-id: d8d87f1c054e5c10d470e67837476a11e0a9b1d4
2021-10-01 03:17:31 -07:00
ea0de37d2e [PyTorch] Avoid string construction from const char* and speedup empty string creation if error messages are suppressed (#65939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65939

This change includes 2 separate optimizations.

1. Provide an overload of `debugString(const char*, ...)` in addition to `debugString(std::string, ...)` to avoid `std::string` construction when `STRIP_ERROR_MESSAGES` is defined and the caller passes in a `const char*`.
2. Return `std::string("", 0)` instead of `""`, since the former triggers no call to `std::basic_string`'s constructor whereas the latter does. [Godbolt Link](https://godbolt.org/z/oTExed5h8). However, I'm surprised by this, since the man page for [std::basic_string](https://en.cppreference.com/w/cpp/string/basic_string/basic_string) clearly states that the constexpr overload is only since C++20, and I am building using `-Os -std=c++17`.

Godbolt Screenshot:

{F667311023}

ghstack-source-id: 139507542

Test Plan:
CI and local build via:

```
buck build //xplat/caffe2/fb/lite_predictor:lite_predictor
```

Reviewed By: swolchok

Differential Revision: D31312942

fbshipit-source-id: aa24abbfe1c16419f235d037595321982614c5ea
2021-10-01 00:17:21 -07:00
2828ce53fd Added jit log stream changing function and some refactor (#65768)
Summary:
Description:
- Only `stdout` and `stderr` are added as possible options from the Python
  API for now; file-path support can be added later.
- Put the class `JitLoggingConfig` in the cpp file, as none of its methods were being used outside of this file.

Python API:
`torch._C._jit_set_logging_stream('stdout|stderr')`
C++ API:
`::torch::jit::set_jit_logging_output_stream(ostream);`
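
A quick usage sketch of the stated Python entry point (`PYTORCH_JIT_LOG_LEVEL` is the pre-existing env var that selects which pass logs):

```
import os
import torch

os.environ['PYTORCH_JIT_LOG_LEVEL'] = 'dead_code_elimination'
torch._C._jit_set_logging_stream('stdout')  # route JIT logs to stdout
```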

Testing:
- Tested python API locally.
- Unit test for the C++ API is written

Fixes https://github.com/pytorch/pytorch/issues/54182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65768

Reviewed By: mrshenli

Differential Revision: D31291739

Pulled By: ZolotukhinM

fbshipit-source-id: eee72edc20488efad78a01c5b0ed8a132886a08d
2021-09-30 23:25:11 -07:00
33c03cb61a [deploy][1/n] Make deploy code conform to PyTorch style. (#65861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65861

First in a series. This PR changes the code in deploy.h/cpp and
interpreter_impl.h/cpp to be camel case instead of snake case. Starting
with this as it has the most impact on downstream users.

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D31291183

Pulled By: suo

fbshipit-source-id: ba6f74042947c9a08fb9cb3ad7276d8dbb5b2934
2021-09-30 22:59:47 -07:00
765b6a90f3 [TensorExpr] Move lowerings registration from kernel.cpp to lowerings.cpp. (#65553)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65553

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148921

Pulled By: ZolotukhinM

fbshipit-source-id: 772062155043d4be9e9a25f6259b8e4a6cb762f4
2021-09-30 22:56:22 -07:00
015e0079e3 [TensorExpr] Move 'compute*' functions to operators/... (#65552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65552

This PR is mostly a verbatim move of several functions to different
files. The goal is to have more consistency in what resides where.

With this PR:
* All `compute*` functions defining how a given operator needs to be
lowered to TE IR will reside in `operators/*.{cpp,h}`.
* Auxiliary functions for these functions will reside in
`operators/misc.cpp`. `compute*` functions for ops not belonging
anywhere else can also go to that file.
* `operators/unary.*` is renamed to `operators/pointwise.*` and now
includes functions like `computeTwoOperands`.
* `kernel.*` now contains *only JIT-related* logic and implementations of
`TensorExprKernel` methods.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148923

Pulled By: ZolotukhinM

fbshipit-source-id: e36ad8e779b8d30a33b49ea4ebf6d6a7438989f4
2021-09-30 22:56:20 -07:00
3a0165da49 [TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551

Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.

Why? This helps us with at least two things:
1) With this approach we can easily check in advance whether we know how
to handle a given node - i.e. we can inspect the entire graph and tell
whether it's possible to compile it without actually trying to do so and
dying in the middle. This would allow us to, say, provide user-friendly
error messages in the AOT workflow. A rough sketch of the lookup idea
follows below.
2) We can switch to using the schema instead of the op kind to determine
the correct lowering. Unlike the op schema, the op kind might be
ambiguous (see e.g. #64963), and using it instead of the schema can lead
to bugs.
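
A rough Python sketch of the hash-table lookup idea (hypothetical names; the actual registry is implemented in C++):

```python
from typing import Any, Callable, Dict

LoweringFn = Callable[..., Any]
_lowerings: Dict[str, LoweringFn] = {}  # schema string -> lowering function

def register_lowering(schema: str) -> Callable[[LoweringFn], LoweringFn]:
    def deco(fn: LoweringFn) -> LoweringFn:
        _lowerings[schema] = fn
        return fn
    return deco

@register_lowering("aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")
def lower_add(*args: Any) -> Any:
    ...  # build the TE IR for add

def can_compile(node_schemas: list) -> bool:
    # Point 1 above: inspect the whole graph up front instead of
    # failing in the middle of lowering.
    return all(s in _lowerings for s in node_schemas)
```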

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148926

Pulled By: ZolotukhinM

fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
2021-09-30 22:56:18 -07:00
eee9ad0fdd [TensorExpr] Add a skeleton for a registry of NNC lowerings. (#65550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65550

This PR adds the source files and the class for the registry; subsequent
PRs actually port existing lowerings to this mechanism.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31148922

Pulled By: ZolotukhinM

fbshipit-source-id: 4c087b22ee898d5a5a18a5d2a4bb795aa2ffd655
2021-09-30 22:56:16 -07:00
d84191fcc6 [TensorExpr] Kernel: make prim::ConstantChunk handled like other ops. (#65549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65549

Previously it had special handling; with this change it follows the
same mechanism as other ops.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148924

Pulled By: ZolotukhinM

fbshipit-source-id: 572d8ae5e123e7a0e2a656154d7bd0f73c785a06
2021-09-30 22:55:00 -07:00
a6ad2b41ac [Static Runtime] Make module_ optional in StaticModule (#65882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65882

`torch::jit::Module` is refcounted. There is no need to wrap it in a `shared_ptr`.

Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: mikeiovine

Differential Revision: D31012222

fbshipit-source-id: 74d234bd85423e5ba0e396f24899631354a2c74b
2021-09-30 22:48:49 -07:00
08df4c2b3c slow_conv2d grad_input: avoid dispatch in parallel region (#65725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65725

See gh-56794

Avoid dispatch inside of parallel_for by:
1. Replacing Tensor slicing with TensorAccessor
2. Calling `grad_input.zero_()` only once, outside of the parallel region
3. Replacing `at::mm` with a `gemm` call

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31257876

Pulled By: ngimel

fbshipit-source-id: f2902edeccd161431c1dfb1ab3e165d039ec259d
2021-09-30 22:47:31 -07:00
6502fb89dd Make JIT Aliasing Test Less Brittle (#65493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65493

Added a last resort: use whatever ATen operator in the graph that has Tensor outputs as the operator node for checking the alias annotation.

Test Plan: python test/test_ops.py -k test_variant_consistency_jit

Reviewed By: mrshenli

Differential Revision: D31321221

Pulled By: alanwaketan

fbshipit-source-id: f4a5cbfd36bd0867d8c1bf9de9a65365ee7c35d6
2021-09-30 22:43:03 -07:00
4f5ea5983a [QPL] move metadata logging to markerEnd for model run QPL (#65451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65451

This diff moved metadata logging from marker start to marker end. This should improve perf because we can skip metadata logging when the marker is not sampled (using `isMarkerOn`).

Test Plan:
Verified metadata are logged: https://fburl.com/scuba/qpl_metrics/pytorch_employee/armjgtyw
https://fburl.com/scuba/qpl_metrics/pytorch_employee/zz36zkr1

Reviewed By: xcheng16

Differential Revision: D31105548

fbshipit-source-id: 0eafaaefecb7e230021616e397e548a2fd2b92e9
2021-09-30 22:12:40 -07:00
2481c06496 [caffe2] fix LLVM-12 nullptr-with-nonzero-offset UBSAN error (#65506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65506

Test Plan: run a adfinder canary and verify this error is fixed.

Reviewed By: swolchok

Differential Revision: D31130083

fbshipit-source-id: c31f179f8a7de75ed6f6e7ee68b197f2970ddd3d
2021-09-30 21:47:25 -07:00
f6dfac6974 Migrate THCCachingHostAllocator to ATen (#65746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65746

This also removes the cudaHostAllocator field on THCState, since there
doesn't seem to be an API anywhere for customizing it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31236630

Pulled By: ngimel

fbshipit-source-id: 2a8e756222ae70565e77f8e7139d60ec5be32276
2021-09-30 21:26:38 -07:00
d39790340d [ONNX] Enable export of __xor_ (#64042) (#64581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64581

* Enable xor

* Update test_pytorch_onnx_onnxruntime.py

* Update symbolic_opset9.py

* Update symbolic_opset9.py

* Update test_pytorch_onnx_onnxruntime.py

* Update symbolic_opset9.py

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919598

Pulled By: malfet

fbshipit-source-id: 044e55d0697da0050f26a6ceccd1517493d7e8a6
2021-09-30 21:09:01 -07:00
e598ba2ef3 [ONNX] Fix inplace fill_ dtype export mismatch (#64233) (#64580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64580

Append `type_as` after converting `fill_` to `full_like` when no dtype argument is given.

BowenBao <bowbao@microsoft.com>

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919599

Pulled By: malfet

fbshipit-source-id: f174977ced8f2c991b0615b65ff7c23fecf301c2
2021-09-30 21:08:59 -07:00
89cbe6229d [ONNX] Update doc and error message for indexing export (#64290) (#64579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64579

Added suggested workarounds to the indexing section of the ONNX export documentation.
Updated the indexing export warning message with a link to the documentation.

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919603

Pulled By: malfet

fbshipit-source-id: 7fe65cb5aa7de4f7d93ff05011ba22f5adb27811

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-09-30 21:08:56 -07:00
d4ff344fae [ONNX] Fix remainder export (#64230) (#64578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64578

* Fix remainder export for the edge case where the input is negative. The new export relies on the true_divide export (see the identity below).
* Simplified the true_divide export. Cleaned up redundant code that is handled by the scalar type analysis pass. Removed the dependency on `onnx::Where`, thus supporting opsets 7 & 8.
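
For reference, the standard floor-mod identity the new export computes (matching `torch.remainder` semantics for negative inputs):

$$\operatorname{remainder}(a, b) = a - b \left\lfloor \frac{a}{b} \right\rfloor,$$

so the result takes the sign of $b$; the quotient $a/b$ comes from the `true_divide` export.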

Fixes #60179

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919601

Pulled By: malfet

fbshipit-source-id: 0f78621c0ac3bdb6bf4225e049ba5f470dc8ab12

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-09-30 21:08:54 -07:00
0f0ef4fe64 Add onnx test for batched_nms (#53175) (#64381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64381

* Added new ONNX test for batched_nms

* Update test according to PR in torchvision

* Update test/onnx/test_pytorch_onnx_onnxruntime.py

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919602

Pulled By: malfet

fbshipit-source-id: edfb5b9f75077429f7f242fd6ac06d962968dfba

Co-authored-by: Bowen Bao <imbowenbao@outlook.com>
2021-09-30 21:08:52 -07:00
7e15f2ddaa [ONNX] Fix gather squeeze axis in constant folding (#63588) (#64379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64379

* Fix gather squeeze axis in constant folding

* mypy

* fix indent

* address comments

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919604

Pulled By: malfet

fbshipit-source-id: 90edb054491433a0da2fe82324ac7c12f1ef062b
2021-09-30 21:08:50 -07:00
41bdfe3919 [ONNX] Fix cuda test case (#63597) (#64378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64378

* skip script test for unsupported autocast.
* Fix test case by adding missed `autocast` and `model.cuda()`.

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919600

Pulled By: malfet

fbshipit-source-id: 3231fc672d97de487d6e4460626df0ba25f212ce

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-09-30 21:08:48 -07:00
2d61009f4a [ONNX] Fix input sequence for pad op (#60554) (#64377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64377

* Fix for input primitive sequence

* Test mypy

* Fix for tracing tuples

* Fix for extra inputs

* flake8

* Rebase

* Fix for tracing tuples

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919606

Pulled By: malfet

fbshipit-source-id: a718c4a12cda77b968cb636acd7aa63d7b5ba326
2021-09-30 21:08:45 -07:00
f17ee368b3 Fix empty size constant creation (#63607) (#64376)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64376

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919608

Pulled By: malfet

fbshipit-source-id: 0e789e8470ce0f130148df764ce77f6d4fd0a274
2021-09-30 21:08:43 -07:00
84190dafa8 [ONNX] Update instance_norm implementation and support training (#60538) (#64375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64375

* Update the instance_norm track_running_stats=True implementation and support the training mode
* Reference: 9baf75c86e/aten/src/ATen/native/Normalization.cpp (L532)
* Fix https://github.com/pytorch/pytorch/issues/53887

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D30919605

Pulled By: malfet

fbshipit-source-id: 306eb2a1122bb5d90dcb7c18260a3a2057a21c34

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-09-30 21:07:26 -07:00
3d6d4f4322 [fx2trt][quant] Add lowering support for per channel quantization in fx2trt (#64787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64787

This PR added support for lowering per-channel quantization and dequantization operators
in fx2trt. It also extends TensorMeta with extra arguments corresponding to per-channel quantized Tensors.
Initially I was thinking of adding a qparams field that could capture everything, but we currently still have lowering support
for fbgemm ops (which take scale and zero_point in the operator interface). I think we can move everything to qparams
after we deprecate lowering support for fbgemm ops in the future.

Test Plan:
Test for per channel weight:
```
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py
```

change BC compatibility test expect for TensorMeta
```
python test/test_fx.py TestFXAPIBackwardCompatibility.test_class_member_back_compat --accept
```

Imported from OSS

Reviewed By: jfix71, mrshenli, 842974287

Differential Revision: D30879848

fbshipit-source-id: 76c3804bb1d9343183ae53d9f02c1a3bf6c79e1c
2021-09-30 18:54:14 -07:00
207fefc988 Delete rogue cu102 windows builds (#65961)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65961

Reviewed By: seemethere

Differential Revision: D31325279

Pulled By: malfet

fbshipit-source-id: b8748c0040cdcfb8182eb7c59a3770b7d0681de9
2021-09-30 18:44:02 -07:00
b3da2afebe Clarified difference in behavior of empty_strided and as_strided (#64568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64568

Fix: #64389

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31299999

Pulled By: mruberry

fbshipit-source-id: dd538ffa7cc1267ab6472806f4216b170dd0faad
2021-09-30 17:27:59 -07:00
22f36353dc Revert D31137652: [pytorch][PR] Skip failing tests when LAPACK and MAGMA are not available
Test Plan: revert-hammer

Differential Revision:
D31137652 (dd354117ef)

Original commit changeset: c969f75d7cf1

fbshipit-source-id: bc4cde4eeb5d38ac940ebb471abbd8b9009b3aee
2021-09-30 16:08:57 -07:00
6285348f06 Implement n-dimensional hermitian FFTs (#63890)
Summary:
Closes https://github.com/pytorch/pytorch/issues/59127

cc mruberry peterbell10 walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63890

Reviewed By: ngimel

Differential Revision: D30761909

Pulled By: mruberry

fbshipit-source-id: 06e1e4dc65726f35c99a74f18b9fa36eb7d694a5
2021-09-30 16:02:28 -07:00
70f9f58a71 Add __module__ to torch.dtype.__dict__ (#65182)
Summary:
torch.dtype.__reduce__ returns a string, which causes Pickle to look
up the object by module and name. In order to find the right module,
Pickle looks for __module__ on the object; if it doesn't find that, it
falls back to searching sys.modules.

Previously, torch.dtype instances did not have a `__module__`
attribute, so pickling dtypes would fall back to a search of
sys.modules.

Instances of normal Python objects have a `__module__` attribute
because normal Python classes have a `__module__` key in their
`__dict__`. Imitate that by populating one in `torch.dtype`.

We set the field in `tp_dict` before calling `PyType_Ready` (instead
of afterwards) because of the doc warning against mutating a type's
dictionary once initialized:
https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dict

fixes https://github.com/pytorch/pytorch/issues/65077
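
A quick sanity check of the new behavior (a minimal sketch):

```python
import pickle
import torch

# With __module__ set, pickle resolves the dtype by qualified name
# instead of scanning sys.modules.
assert torch.float32.__module__ == "torch"
assert pickle.loads(pickle.dumps(torch.float32)) is torch.float32
```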

 ---

I didn't add any tests because I didn't see any obvious places with similar tests for pickling or dtype objects. Let me know if I missed the right place, or should start one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65182

Reviewed By: mrshenli

Differential Revision: D31310530

Pulled By: ezyang

fbshipit-source-id: 20cd713ce175a709d6ce47459c3891162ce29d77
2021-09-30 14:58:11 -07:00
38c77539e8 [PyTorch][Edge] Fix inefficiency in objLoaderMobile (#65710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65710

No need to incur extra refcount bumps, and no need to use a stringstream for what are presumably string keys anyway.
ghstack-source-id: 139325445

Test Plan: CI, reviewers to confirm the keys are supposed to be strings

Reviewed By: dhruvbird

Differential Revision: D31215347

fbshipit-source-id: 82be93cb2e57aefe94edf74d149115cb734112be
2021-09-30 14:53:40 -07:00
8f3983254b [MicroBench] Added a micro benchmark for prefix sum (#65790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65790

Here are the results of the benchmark:

* ATen - version that calls `at::cumsum`
* NNC - a simple prefix-sum loop implemented in NNC (not vectorized)
* Local - a C++ implementation of the simple prefix-sum loop
* LocalAVX2 - a vectorized C++ implementation of prefix-sum, only using AVX2
* LocalAVX512 - a vectorized C++ implementation of prefix-sum, using AVX512.

The vectorized implementations are from the paper "Parallel Prefix Sum with SIMD" in ADMS' 20.
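
For reference, a minimal Python analogue of the simple, non-vectorized prefix-sum loop being benchmarked (the measured variants are C++/NNC):

```python
import torch

def prefix_sum_loop(x: torch.Tensor) -> torch.Tensor:
    # Scalar dependency chain out[i] = out[i-1] + x[i]; this is the loop
    # that the AVX2/AVX512 variants from the ADMS'20 paper vectorize.
    out = torch.empty_like(x)
    acc = 0.0
    for i in range(x.numel()):
        acc += float(x[i])
        out[i] = acc
    return out

x = torch.randn(1024)
assert torch.allclose(prefix_sum_loop(x), torch.cumsum(x, dim=0), atol=1e-3)
```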

```
$ OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
Run on (36 X 1601 MHz CPU s)
2021-09-28 23:13:12
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
PrefixSumBench/ATen/64                   1289 ns       1289 ns     543199 GB/s=397.069M/s
PrefixSumBench/ATen/256                  1867 ns       1867 ns     374232 GB/s=1096.8M/s
PrefixSumBench/ATen/1024                 4169 ns       4169 ns     167889 GB/s=1.9649G/s
PrefixSumBench/ATen/4096                14137 ns      14136 ns      49266 GB/s=2.31806G/s
PrefixSumBench/ATen/16384               49887 ns      49883 ns      13988 GB/s=2.6276G/s
PrefixSumBench/ATen/65536              193742 ns     193686 ns       3628 GB/s=2.7069G/s
PrefixSumBench/ATen/262144             764803 ns     764774 ns        917 GB/s=2.74219G/s
PrefixSumBench/ATen/1048576           3040653 ns    3040277 ns        231 GB/s=2.75916G/s
PrefixSumBench/Local/64                   586 ns        586 ns    1197003 GB/s=873.244M/s
PrefixSumBench/Local/256                 1077 ns       1077 ns     646265 GB/s=1.90143G/s
PrefixSumBench/Local/1024                3050 ns       3050 ns     229458 GB/s=2.68579G/s
PrefixSumBench/Local/4096               11910 ns      11910 ns      58953 GB/s=2.75132G/s
PrefixSumBench/Local/16384              43204 ns      43202 ns      16081 GB/s=3.03393G/s
PrefixSumBench/Local/65536             167966 ns     167966 ns       4154 GB/s=3.12139G/s
PrefixSumBench/Local/262144            667631 ns     667613 ns       1048 GB/s=3.14127G/s
PrefixSumBench/Local/1048576          2654785 ns    2654631 ns        264 GB/s=3.15999G/s
PrefixSumBench/NNC/64                     642 ns        642 ns    1095277 GB/s=797.442M/s
PrefixSumBench/NNC/256                   1139 ns       1138 ns     617214 GB/s=1.799G/s
PrefixSumBench/NNC/1024                  3103 ns       3103 ns     225531 GB/s=2.63979G/s
PrefixSumBench/NNC/4096                 12053 ns      12052 ns      58084 GB/s=2.71883G/s
PrefixSumBench/NNC/16384                43227 ns      43225 ns      16192 GB/s=3.03231G/s
PrefixSumBench/NNC/65536               168065 ns     168056 ns       4153 GB/s=3.11972G/s
PrefixSumBench/NNC/262144              668974 ns     668921 ns       1045 GB/s=3.13513G/s
PrefixSumBench/NNC/1048576            2657464 ns    2657341 ns        263 GB/s=3.15677G/s
PrefixSumBench/LocalAVX2/64               523 ns        523 ns    1351308 GB/s=979.537M/s
PrefixSumBench/LocalAVX2/256              755 ns        755 ns     927762 GB/s=2.71159G/s
PrefixSumBench/LocalAVX2/1024            1759 ns       1759 ns     400355 GB/s=4.65609G/s
PrefixSumBench/LocalAVX2/4096            6708 ns       6706 ns     103959 GB/s=4.88649G/s
PrefixSumBench/LocalAVX2/16384          22143 ns      22142 ns      31229 GB/s=5.91951G/s
PrefixSumBench/LocalAVX2/65536          83649 ns      83642 ns       8350 GB/s=6.26828G/s
PrefixSumBench/LocalAVX2/262144        330433 ns     330427 ns       2133 GB/s=6.34679G/s
PrefixSumBench/LocalAVX2/1048576      1302301 ns    1302179 ns        537 GB/s=6.44198G/s
PrefixSumBench/LocalAVX512/64             474 ns        474 ns    1459151 GB/s=1080.8M/s
PrefixSumBench/LocalAVX512/256            576 ns        576 ns    1217442 GB/s=3.55524G/s
PrefixSumBench/LocalAVX512/1024           994 ns        994 ns     703387 GB/s=8.24434G/s
PrefixSumBench/LocalAVX512/4096          3642 ns       3641 ns     190646 GB/s=8.99857G/s
PrefixSumBench/LocalAVX512/16384        10140 ns      10140 ns      68947 GB/s=12.9267G/s
PrefixSumBench/LocalAVX512/65536        35739 ns      35736 ns      19567 GB/s=14.6711G/s
PrefixSumBench/LocalAVX512/262144      156415 ns     156413 ns       4467 GB/s=13.4078G/s
PrefixSumBench/LocalAVX512/1048576     613952 ns     613876 ns       1144 GB/s=13.665G/s
```

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D31253849

Pulled By: navahgar

fbshipit-source-id: f33e7be787c86a09e90babddd66b16e2e0777eb4
2021-09-30 14:44:52 -07:00
24f59fa20b [ci] fix softmax bc check (#65952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65952

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31320441

Pulled By: suo

fbshipit-source-id: ddd2ccca523d7ed31b231d924fbd6206525f16cf
2021-09-30 14:40:43 -07:00
d4d3bb91f9 Refactor OperatorSupport related code and fix TRT not supporting int64 dtype (#65848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65848

This diff includes:

* [fix]: The initialization of `OperatorSupport._support_dict` made it a class variable, so its initialization is moved into the constructor.
* Add an abstract class (more of an interface) `OperatorSupportBase`, since `OperatorSupport`'s purpose is too specific.
* [refactor]: What `TRTOperatorSupport` really does is populate an `OperatorSupport._support_dict`, so there is no reason for subclassing. It is removed and replaced with an `OperatorSupport` instantiated with a properly populated `_support_dict`.
* Add a framework for defining simple, basic op-support logic and composing it into more complex logic (see the sketch after this list):
    1. `create_op_support` wraps a function into an `OperatorSupportBase` instance
    2. `chain` combines several simple `OperatorSupportBase` instances into a more complex one
    3. `OpSupports` provides a set of pre-defined, simple `OperatorSupportBase` instances that can be composed together using `chain`.
        1. Currently the only pre-defined one is `decline_if_input_dtype(..)`, which declares a node unsupported if its args are of a user-specified dtype
* Fix `TRTOperatorSupport` so that it not only looks for registered converters, but also declines a node if its arg is int64
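
A simplified, self-contained re-implementation of the composition idea (hypothetical helper bodies; the real classes live in torch.fx):

```python
import types
import torch
from typing import Any, Callable

IsSupported = Callable[[Any], bool]  # predicate over an FX node

def create_op_support(fn: IsSupported) -> IsSupported:
    # Wrap a plain function as an operator-support predicate.
    return fn

def chain(*preds: IsSupported) -> IsSupported:
    # A node is supported only if every chained predicate accepts it.
    return lambda node: all(p(node) for p in preds)

def decline_if_input_dtype(dtype) -> IsSupported:
    def pred(node) -> bool:
        return all(getattr(a, "dtype", None) is not dtype
                   for a in getattr(node, "args", ()))
    return create_op_support(pred)

trt_support = chain(decline_if_input_dtype(torch.int64))
node = types.SimpleNamespace(args=(torch.zeros(1, dtype=torch.int64),))
assert not trt_support(node)  # int64 args are declined
```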

Test Plan: linter and CI

Reviewed By: 842974287

Differential Revision: D31275525

fbshipit-source-id: bbc02f7ccf4902a7912bb98ba5be2c2fbd53b606
2021-09-30 13:36:26 -07:00
9ae63bd87c Revert D31238123: [pytorch][PR] Avoid saving self for softmax and log_softmax
Test Plan: revert-hammer

Differential Revision:
D31238123 (fb412bdd80)

Original commit changeset: afd319d3676d

fbshipit-source-id: b7980d653a4b8322a225f1dd08c2857ecbe5bc94
2021-09-30 11:34:14 -07:00
541eb1db63 Add cuSPARSE descriptors and update CSR addmm (#60838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60838

Rewrote `addmm_out_sparse_csr_dense_cuda` implementation using new cusparse descriptors.

`addmm` now works without conversions with both 32-bit and 64-bit indices.
The dense tensors can have a row- or column-major layout. If the dense tensors are a contiguous slice of a larger tensor, the storage is used directly without temporary copies.
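
A minimal usage sketch (an illustrative call shape only; it assumes a CUDA build with cuSPARSE, and exact conversion helpers may differ by version):

```python
import torch

# Sparse CSR @ dense via addmm; with this change no index-dtype
# conversion or layout copy is needed on CUDA.
a = torch.tensor([[0., 2.], [3., 0.]], device="cuda").to_sparse_csr()
b = torch.randn(2, 3, device="cuda")
c = torch.zeros(2, 3, device="cuda")
out = torch.addmm(c, a, b)  # computes c + a @ b
```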

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30643191

Pulled By: cpuhrsch

fbshipit-source-id: 5555f5b59b288daa3a3987d322a93dada63b46c8
2021-09-30 11:32:51 -07:00
be00f0207a Update git version for CentOS base dockers (#65703)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65048

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65703

Reviewed By: albanD

Differential Revision: D31245666

Pulled By: janeyx99

fbshipit-source-id: 5431876bf19435eb3fd90a53a3ec94fd66c9210e
2021-09-30 11:26:21 -07:00
8297a16cc0 [ci] try installing libgnutls to fix cert error (#65934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65934

see https://github.com/pytorch/pytorch/issues/65931; this was a
suggested remediation on the linked issue

Test Plan: Imported from OSS

Reviewed By: malfet, zhouzhuojie

Differential Revision: D31313040

Pulled By: suo

fbshipit-source-id: a9e2b82a1e879962af768ed3049c73ab77394738
2021-09-30 11:23:17 -07:00
6a30d83596 Move ASAN to GHA (#65846)
Summary:
- Introduce `ciflow/sanitizers` label
- Modify asan pattern in `.jenkins/pytorch/build.sh`
- Produce wheel in `.jenkins/pytorch/build-asan.sh`
- Increase stack size hard limit to 82Mb in test docker containers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65846

Reviewed By: seemethere

Differential Revision: D31282654

Pulled By: malfet

fbshipit-source-id: f73e692899cc9bbe106ececc26f1fe430dfeae9d
2021-09-30 09:49:52 -07:00
cdbfb2b689 .github: Bump linux and windows gpu max available (#65923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65923

Still noticing that queues are long, particularly for Windows GPU
machines; bumping this to compensate.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31308728

Pulled By: seemethere

fbshipit-source-id: b68c3a76335960def23e1f425ba5b0a219f07e73
2021-09-30 09:38:02 -07:00
928a4bbafb [JIT] Fix compilation unit reference link in constant object upon load (#65784)
Summary:
Follow-up to https://github.com/pytorch/pytorch/pull/65442: make sure objects inserted into the graph on load do not hold an owning reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65784

Reviewed By: suo

Differential Revision: D31251033

Pulled By: eellison

fbshipit-source-id: 59efe19ce6f70744383de4eebf0f89f79f3eb03a
2021-09-30 09:32:28 -07:00
8130157504 [DataPipe] Fixes an issue where TarArchiveReader closes stream when read into a buffer (#65877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65877

Fixes #65808

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31296041

Pulled By: NivekT

fbshipit-source-id: cdcad3a333ae9781d6063678a122a128955b0ff4
2021-09-30 08:46:32 -07:00
7f87ff183d [RFC] [Modular] Include less headers in vararg_functions.cpp (#65672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65672

`ATen/ATen.h` pulls in the full list of headers, but vararg_functions.cpp only uses two of them. Change it to include fewer headers for min_runtime.

ghstack-source-id: 139389772

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D31198293

fbshipit-source-id: 9794a2696a1b124be7fced2836c633ae899aa5c8
2021-09-30 08:35:28 -07:00
ea776fa034 Update CODEOWNERS for optim (#65773)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65773

Reviewed By: mrshenli

Differential Revision: D31269749

Pulled By: albanD

fbshipit-source-id: 1ec35d2396797b8e97a7122e2b3a9021f8fcf0a0
2021-09-30 08:30:42 -07:00
b777d790ea Convert Sampler back to lazily construction (#63646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63646

Fixes #63609

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D30451774

Pulled By: ejguan

fbshipit-source-id: 550d77494326446d1a42b5da0559e0d384c47413
2021-09-30 07:32:06 -07:00
4666e3f192 [quant] update fused_obs_fake_quant op to accept output_fake_quant argument (#65621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65621

Add a new attribute to FusedMovingAvgObsFakeQuantize that controls whether the fake-quant operation should be applied at the output of a particular layer. The motivation is to give users additional control over the numerics of the fake_quant operators during training. It defaults to always fake-quantizing the output (True).

Note: We will still observe the tensors as before (only the fake_quant operation is controlled by this flag).

For example
```
input model
x -> fc1 -> fc2 -> non_quantizable_op -> fc3

After fake_quant
x -> fake_quant(x) -> fc1 -> fake_quant(fc1) -> fc2 -> fake_quant(fc2) -> non_quantizable_op -> fake_quant() -> fc3 -> fake_quantize(fc3)

With output_fake_quant disabled at the output of fc2 and fc3 (since their outputs are non-quantizable)
x -> fake_quant(x) -> fc1 -> fake_quant(fc1) -> fc2 -> non_quantizable_op -> fake_quant() -> fc3
```

Test Plan: ./buck-out/gen/caffe2/test/quantization_fx\#binary.par -r test_disable_output_fake_quant

Reviewed By: jerryzh168

Differential Revision: D31174526

fbshipit-source-id: bffe776216d041fb09133a6fb09bfc2c0bb46b89
2021-09-30 01:08:01 -07:00
6d4b93bd96 [quant] adding memoryless observers for embeddingbag QAT work (#65699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65699

related to: https://github.com/pytorch/pytorch/pull/65443#discussion_r715132425

The QAT and PAT (pruning-aware training) support for embedding bags needs a memoryless observer to work properly. This is necessitated by the weights changing between pruned and non-pruned states during training, which can significantly change the quantization parameters.

This PR adds a memoryless flag to the simpler observer classes (not moving average since those explicitly have memory)
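
A hypothetical sketch of what "memoryless" means here (not the actual observer code):

```python
import torch

class MemorylessMinMax(torch.nn.Module):
    # Statistics are recomputed from scratch on every forward instead of
    # being merged with previous min/max, so ranges from earlier
    # (pre-pruning) weights never linger.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.min_val, self.max_val = x.min(), x.max()
        return x
```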

In addition to the above, I altered the `reset_min_max_vals` function
of MinMaxObserver so that it preserves the device of the existing
`self.min_val` and `self.max_val`; previously the device was not
preserved, unlike at initialization (which uses factory_kwargs).

Test Plan:
python test/test_quantization.py TestObserver

(added test_memoryless_minmaxobserver, test_memoryless_per_channel_minmaxobserver, test_memoryless_histogramobserver)

Imported from OSS

Reviewed By: supriyar

Differential Revision: D31209773

fbshipit-source-id: 44a63298e44880fbd3576f49ac568e781f3fd79a
2021-09-30 00:55:32 -07:00
de80aff72d Revert D31132861: Make JIT Aliasing Test Less Brittle
Test Plan: revert-hammer

Differential Revision:
D31132861 (9f97c66a7a)

Original commit changeset: 26fc2e6bc77b

fbshipit-source-id: 46be9168179d555be6b6a92b54b2bb84b3f834ed
2021-09-29 23:39:40 -07:00
4176afc4a0 [Static Runtime] Disable SigridTransform + ListUnpack fusion when outputs reachable from graph output (#62697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62697

Reviewed By: hlu1

Differential Revision: D29979402

fbshipit-source-id: 913e8396a0530ce3617211112a2b1147ef2e9df9
2021-09-29 22:47:48 -07:00
edab202a30 [DatePipe] add deprecation warnings for DataPipes that will solely exist in TorchData (#65827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65827

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31272794

Pulled By: NivekT

fbshipit-source-id: 8da8266184b4df050422904cbc5fca6d7c3d2e02
2021-09-29 22:42:22 -07:00
cd458fe092 [JIT] Make output of prim::TupleConstruct alias only with its inputs (#64879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64879

This change makes the output of `prim::TupleConstruct` alias only with its inputs *when* the created tuple is directly returned from the graph.

The same treatment could be applied to any tuple newly constructed by `prim::TupleConstruct` that does not let its elements escape. However, this change focuses on the simplest and most frequently used case: tuples constructed only to be returned from a graph.

Test Plan:
Added
- `AliasMoveForTupleConstructWithSingleUseAsGraphOutput`
- `WildcardAliasForTupleConstructWithUses`

to cover the newly added code.

Reviewed By: eellison

Differential Revision: D30437737

fbshipit-source-id: 417fbc6bc348062e60e7acdddd340d4754d090eb
2021-09-29 21:56:31 -07:00
dd354117ef Skip failing tests when LAPACK and MAGMA are not available (#64930)
Summary:
Skip failing tests when LAPACK and MAGMA are not available for `test_linalg.py` and `test_ops.py`.
Note that there is no CI configuration without LAPACK or MAGMA. I verified locally that this now works as expected, but we have no guard against these tests failing again in this situation in the future.

<details>
  <summary> test_ops.py failures that are fixed</summary>

 ```
 FAILED test/test_ops.py::TestCommonCPU::test_out_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_reference_testing_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_reference_testing_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_linalg_tensorinv_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestCommonCPU::test_variant_consistency_eager_triangular_solve_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_linalg_tensorinv_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_triangular_solve_cpu_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestGradientsCPU::test_forward_mode_AD_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestJitCPU::test_variant_consistency_jit_triangular_solve_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_conj_view_linalg_tensorinv_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_conj_view_triangular_solve_cpu_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_neg_view_linalg_tensorinv_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_ops.py::TestMathBitsCPU::test_neg_view_triangular_solve_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
 ```

</details>

<details>
  <summary> test_linalg.py failures that are fixed</summary>

```
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_dtype_cpu - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_matrix_cpu_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_norm_matrix_cpu_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_eigh_hermitian_grad_meta_complex128 - RuntimeError: Calling torch.linalg.eigh or eigvalsh on a CPU tensor requires compiling PyTorch with LAPACK. Please use PyTorch built with LAPACK support.
FAILED test/test_linalg.py::TestLinalgMETA::test_eigh_hermitian_grad_meta_float64 - RuntimeError: Calling torch.linalg.eigh or eigvalsh on a CPU tensor requires compiling PyTorch with LAPACK. Please use PyTorch built with LAPACK support.
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_inverse_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_broadcasting_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_lu_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_broadcasting_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_old_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_batched_non_contiguous_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_solve_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_square_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_all_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_col_maj_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_col_maj_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_meta_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgMETA::test_svd_tall_some_meta_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_inverse_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_lowrank_cuda_float64 - RuntimeError: Calling torch.lu on a CUDA tensor requires compiling PyTorch with MAGMA. Please rebuild with MAGMA.
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_complex128 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_complex64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_square_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_all_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_col_maj_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_col_maj_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_cuda_float32 - RuntimeError: svd: LAPACK library not found in compilation
FAILED test/test_linalg.py::TestLinalgCUDA::test_svd_tall_some_cuda_float64 - RuntimeError: svd: LAPACK library not found in compilation
```
</details>

Fixes https://github.com/pytorch/pytorch/issues/59662

cc mruberry jianyuh nikitaved pearu walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64930

Reviewed By: H-Huang

Differential Revision: D31137652

Pulled By: mruberry

fbshipit-source-id: c969f75d7cf185765211004a0878e7c8a5d3cbf7
2021-09-29 21:31:14 -07:00
2c29ec2a41 Remove "SciPioneer" from PT Distributed code owners (#65862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65862

ghstack-source-id: 139378782

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D31291340

fbshipit-source-id: 65d6a82c57dd50d8a4241e9442d73002590989d9
2021-09-29 20:52:01 -07:00
91f8755b0e Revert D31005792: [NCCL] Init dummy NCCL comms in constructor
Test Plan: revert-hammer

Differential Revision:
D31005792 (2b22a5dde2)

Original commit changeset: c2c582dee25a

fbshipit-source-id: d8e962b8aab6fda8a6c013e8577492dff9568c27
2021-09-29 20:46:38 -07:00
5349ea921b Migrate THCIntegerDivider.cuh to ATen (#65745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65745

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31257937

fbshipit-source-id: 283693525859b7a77a116df0c227653763911a42
2021-09-29 20:37:41 -07:00
3900509b7d (torchelastic) make --max_restarts explicit in the quickstart and runner docs (#65838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65838

closes https://github.com/pytorch/pytorch/pull/65675

The default `--max_restarts` for `torch.distributed.run` was changed to `0` from `3` to make things backwards compatible with `torch.distributed.launch`. Since the default `--max_restarts` used to be greater than `0`, we never documented passing `--max_restarts` explicitly in any of our example code.
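
For example (with `train.py` standing in for the user's script):

```
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --max_restarts=3 train.py
```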

Test Plan: N/A doc change only

Reviewed By: d4l3k

Differential Revision: D31279544

fbshipit-source-id: 98b31e6a158371bc56907552c5c13958446716f9
2021-09-29 19:29:01 -07:00
c7ef620a14 [quant] Add imports to the torch/ao/quantization/__init__.py (#64911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64911

The import statements that involve `quantize.py` were not added to the module-level __init__ file. Those imports are necessary to mimic the behavior of the old import locations. Otherwise, the user would need to change their import statements to `from torch.ao.quantization.quantize import quantize` (instead of `from torch.ao.quantization import quantize`).
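
With these imports in place, both of the old spellings keep working:

```python
# Both forms resolve to the same function, as before the torch.ao move.
from torch.ao.quantization import quantize
from torch.ao.quantization.quantize import quantize
```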

Another change in this diff is that we don't use `__all__` anymore. The all dunder was never used in quantization anyway, and just creates a potential bug when using `from ... import *`.
ghstack-source-id: 139342483

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: vkuzo

Differential Revision: D30897663

fbshipit-source-id: a7b4919a191755e3ba690a79ce3362889f416689
2021-09-29 19:08:45 -07:00
fb412bdd80 Avoid saving self for softmax and log_softmax (#65242)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64000
 - updates the double backward formula to compute the grad w.r.t. the output instead of `self` (see the identity below)
 - ~~In some of the error messages, we still refer to the dtype of the input, even though we are now checking the dtype of the output~~
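
For reference, the standard softmax backward identity that makes this possible (not taken from the diff): with $y = \operatorname{softmax}(x)$ and upstream gradient $g = \partial L / \partial y$,

$$\frac{\partial L}{\partial x_i} = y_i \Big( g_i - \sum_j g_j\, y_j \Big),$$

which depends only on the output $y$, so neither the backward nor the double backward needs to save `self`. (For `log_softmax`, the analogous identity is $\partial L/\partial x_i = g_i - e^{y_i} \sum_j g_j$.)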

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65242

Reviewed By: albanD

Differential Revision: D31238123

Pulled By: soulitzer

fbshipit-source-id: afd319d3676d9ef8d81607e0e8c2a3e6d09f68e4
2021-09-29 18:16:12 -07:00
768cfaa8f8 fix typo in _sharded_tensor (#65511)
Summary:
per title

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65511

Reviewed By: albanD

Differential Revision: D31239269

Pulled By: cbalioglu

fbshipit-source-id: 602c0bf7ef96a930606d68b15a5b3cadda9d9437
2021-09-29 18:00:47 -07:00
9f97c66a7a Make JIT Aliasing Test Less Brittle (#65493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65493

Added a last resort: use whatever ATen operator in the graph that has Tensor outputs as the operator node for checking the alias annotation.

Test Plan:
python test/test_ops.py -k test_variant_consistency_jit_linalg_tensorinv
python test/test_ops.py -k test_variant_consistency_jit_nn_functional_normalize

Reviewed By: eellison

Differential Revision: D31132861

Pulled By: alanwaketan

fbshipit-source-id: 26fc2e6bc77be3a296967cf29a3f6ded231302fa
2021-09-29 17:11:04 -07:00
91611fe1d1 Decouple forward AD checks from backward AD in OpInfo tests and gradcheck (#65040)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64999

- Adds a `check_backward_ad` flag to gradcheck that can be used to disable checking of backward-mode AD (usage sketch after this list)
  - This is a bit bc-breaking in terms of positional args, but I prefer this ordering
- In OpInfo tests for forward ad:
  - set `check_backward_ad` False
- In test_ops treat `supports_autograd` as if it is `supports_backward_ad` (it basically already is)
  - the only modification needed is to no longer skip forward ad tests if `supports_autograd` is false
  - test_dtype, test_variant_consistency, etc behave correctly as-is
  - In a follow-up PR, we can rename it to actually be `supports_backward_ad`
- Testing
  - https://github.com/pytorch/pytorch/pull/65060
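
A minimal sketch of the new flag (both keyword arguments are gradcheck parameters):

```python
import torch
from torch.autograd import gradcheck

def f(x):
    return x.sin()

x = torch.randn(3, dtype=torch.double, requires_grad=True)

# Check forward-mode AD only; backward-mode checking is disabled.
assert gradcheck(f, (x,), check_forward_ad=True, check_backward_ad=False)
```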

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65040

Reviewed By: albanD

Differential Revision: D31238177

Pulled By: soulitzer

fbshipit-source-id: f068d4cbe7ffb094930b16cddb210583b9b7b2c4
2021-09-29 17:01:34 -07:00
5950240bdf Stop Win+CUDA-10.2 builds (#65649)
Summary:
See https://github.com/pytorch/pytorch/issues/65612 and https://github.com/pytorch/pytorch/issues/25393

Fixes https://github.com/pytorch/pytorch/issues/65648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65649

Reviewed By: janeyx99

Differential Revision: D31189692

Pulled By: malfet

fbshipit-source-id: 6ec0548d5833f3428d882071d26c357d89b0a9ba
2021-09-29 15:41:23 -07:00
2b22a5dde2 [NCCL] Init dummy NCCL comms in constructor (#65173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65173

Initializes dummy NCCL communicators in the constructor as a basic
health check that communicators can be created prior to launching the
first collective.

After successful init, we immediately use `ncclCommAbort` to destroy these
communicators to ensure they don't interfere with regular communicator creation
during collectives.

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31005792

fbshipit-source-id: c2c582dee25a098361ead6ef03f541e7833c606b
2021-09-29 15:36:54 -07:00
ad85b582da Remove THCDeviceTensor (#65744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65744

This is just dead code.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31257940

fbshipit-source-id: 6c02264106c2dcbadd332f24b95bc9351a04fd9e
2021-09-29 14:54:46 -07:00
20374c991b slow_conv2d_forward: avoid calling dispatcher in parallel region (#65724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65724

See gh-56794

Avoid dispatch inside of parallel_for by:
1. Replacing Tensor slicing with TensorAccessor
2. Copying the bias into the output only once, outside of the parallel region
3. Replacing `addmm_` with a direct call to gemm.

Technically this also adds a new requirement that the output always be
contiguous, but the out argument version isn't exposed or used
anywhere in the `torch.nn` API. So that should be fine.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31257875

Pulled By: ngimel

fbshipit-source-id: 84d2b39e7f65334bdfcc2c4719f93ee3c514ca32
2021-09-29 14:09:32 -07:00
7191dd2613 Update Module docstring for Python 3 (#65748)
Summary:
In Python 3, we can call `super()` without any arguments.

If I understand correctly, Python 2 is no longer supported by PyTorch, so we can change the documentation to be Python-3 only :)
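
For example, the docstring snippet becomes:

```python
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()  # Python 3: no arguments needed
        self.conv1 = nn.Conv2d(1, 20, 5)
```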

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65748

Reviewed By: saketh-are

Differential Revision: D31246055

Pulled By: albanD

fbshipit-source-id: 3980def1a556d4bdfa391ea61cb2a65efa20df79
2021-09-29 13:40:15 -07:00
8bf0ba546e ns for fx: add basic testing on cuda (#65593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65593

Adds test cases that the three Numeric Suite Core APIs work
when the models are on cuda.  In particular:
1. create models and move them to cuda
2. add loggers (if applicable)
3. run data through (if applicable)
4. extract results

It works without code changes because a `Logger` object is
created without any device-specific objects (they only get
added if data is passed through). It's good to have this tested.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_cuda
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_loggers_cuda
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_cuda
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D31160897

fbshipit-source-id: 8eacf164d0496baf2830491200ea721c0f32ac92
2021-09-29 13:06:30 -07:00
0dd1b74a5b Migrate THCScanUtils to ATen (#65743)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65743

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31257938

fbshipit-source-id: 273b22df41bb7f2a0ab605ec1f6322c2937e7472
2021-09-29 12:39:37 -07:00
a84feeeade [PyTorch Edge] Conditionally trim dispatch key set to save heap memory at runtime (#65732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65732

For certain on-device uses, runtime memory comes at a premium. On-device deployments won't use all the available dispatch keys, so it makes sense to keep only the on-device specific ones around for such uses to reduce runtime heap memory allocated.

This change keeps just 10 dispatch keys (the ones used on-device), guarded under the `C10_MOBILE_TRIM_DISPATCH_KEYS` macro. It tries to keep the other code paths unaffected, using `constexpr` in the `array` declaration and simple inline functions to ensure that the compiler can optimize these for server builds.

Test Plan:
Build and check mobile models end to end.

```
buck build -c "pt.enable_milan_dispatch_keys_trimming"=1 //xplat/caffe2/fb/lite_predictor:lite_predictor
```

Reviewed By: ezyang

Differential Revision: D31185407

fbshipit-source-id: e954765606373dea6ee9466a851dca7684167b0b
2021-09-29 12:20:33 -07:00
7b5d676fa1 .github: Bump linux gpu max limit to 100 (#65831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65831

Was noticing scaling issues last night due to the lack of
linux.8xlarge.nvidia.gpu machines. It seems that even at max
capacity we were still about ~50 queued workflows behind; this should
close that gap.

Also, since these run the longest types of tests, they are the most
likely to overlap with scale messages being processed while available
runners are still maxed out.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31275892

Pulled By: seemethere

fbshipit-source-id: b22ceda115b70d7bdd9c4bc207b55ffab50381ef
2021-09-29 12:06:54 -07:00
c975ca4337 [Static Runtime] Simplify out variant overload implementations (#65384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65384

The following pattern appears frequently in `ops.cpp`:

```
if (!n->matches(schema_1) && !n->matches(schema_2) && ... && !n->matches(schema_n)) {
    LogAndDumpSchema(n);
    return nullptr;
}

return [](ProcessedNode* p_node) {
    if (p_node->Output(0).isNone()) {
        if (p_node->Input(i).isSomeType()) {
            // special logic for schema 1
        } else if (p_node->Input(i).isSomeOtherType()) {
            // special logic for schema 2
        } else if (...) {
            // special logic for schema3
        }
        // and so on
    } else {
        // another complicated type checking chain
    }
};
```

A much cleaner way to implement operator overloads is like this:
```
if (n->matches(schema_1)) {
    return schema_1_impl;
} else if (n->matches(schema_2)) {
    return schema_2_impl;
}
// and so on
```

This has a few advantages:
* Significantly reduces complexity of the out variant implementations, especially for ops with more than 2 overloads. One implementation corresponds to one schema. This makes the implementation more readable/maintainable.
* Adhering to this convention makes it easier to add a new overload. Just add a new `n->matches(...)` case instead of working the schema into existing complicated logic.
* Ops are marginally faster since we don't have to check types at runtime.

Note: there are a few cases where this actually made the code less concise (`aten::div`), so I left those ops untouched.

Thanks, d1jang, for pointing this out in another diff.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31072328

fbshipit-source-id: c40a4f7e6a79881e94c9ec49e9008ed75cfc8688
2021-09-29 12:02:11 -07:00
2f712c452e .github: Remove confusing on_pull_request variable (#65731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65731

It originally had a purpose, but after ciflow was introduced every PR had
on_pull_request set, so it's not really as useful as it once was.

This also removes the equally confusing only_build_on_pull_request
variable.

This change should produce no functional changes in our generated workflows

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D31225398

Pulled By: seemethere

fbshipit-source-id: 7bd8e8175794ab7d09b0632321bf52538435e858
2021-09-29 11:56:13 -07:00
6c2f235d36 common_utils.py: Add ASAN as a platform for which you can disable tests (#65791)
Summary:
Could be useful for the future.

Next steps: document it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65791

Reviewed By: suo

Differential Revision: D31254115

Pulled By: janeyx99

fbshipit-source-id: 715c18b4505f2be6328aa0be25976116d6956b25
2021-09-29 11:00:03 -07:00
911d01c1de type annotate operator_support (#65136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65136

Opportunistically add type annotations for operator_support.py.

Test Plan: run linter, CI

Reviewed By: yinghai

Differential Revision: D30928464

fbshipit-source-id: 615c75152b9938792f03cdceb2a113bda6ab28c7
2021-09-29 10:38:47 -07:00
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so we will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
9b40eaaaab Revert D31193205: [pytorch][PR] CMake: Limit python include directories to only python libraries
Test Plan: revert-hammer

Differential Revision:
D31193205 (971c57f1d0)

Original commit changeset: 5c1b554a59d0

fbshipit-source-id: 5719b7df987ded6e7e212749a438db947656df87
2021-09-29 09:49:33 -07:00
2670cacfc2 LLVM-12 fix for tensor_new.cpp (#65785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65785

Fixes offset to nullptr at fbcode/caffe2/torch/csrc/utils/tensor_new.cpp:206

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31250995

fbshipit-source-id: 56c7761787e732180a2537a8aa4346a39e7399a8
2021-09-29 09:35:18 -07:00
09eb3e661c don't check 0 elements for cat symbolic diff (#65751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65751

Fixes symbolic script grad formula for cat to correctly handle empty tensors
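
A hedged eager-mode sketch of the empty-tensor case the grad formula must handle (the fix itself lives in the TorchScript symbolic script, not in eager autograd):

```python
import torch

a = torch.randn(0, 3, requires_grad=True)  # zero-element tensor
b = torch.randn(2, 3, requires_grad=True)

torch.cat([a, b], dim=0).sum().backward()
print(a.grad.shape, b.grad.shape)  # torch.Size([0, 3]) torch.Size([2, 3])
```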

Test Plan: Existing tests

Reviewed By: eellison

Differential Revision: D31208364

fbshipit-source-id: d676d9abcc033b56076fa946f58f3db50034502d
2021-09-29 09:34:03 -07:00
1d681c1ab2 Migrate THCThrustAllocator to ATen (#65492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65492

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31148180

Pulled By: ngimel

fbshipit-source-id: d5e4902036493517ca97c3442713b5e0e79229f9
2021-09-29 09:27:41 -07:00
971c57f1d0 CMake: Limit python include directories to only python libraries (#65654)
Summary:
`include_directories` is old-style CMake, which adds the include path to every file being compiled. This instead makes python, numpy and pybind11 into targets that only torch_python and caffe2_pybind_state are linked to, so python libraries can't be accidentally included elsewhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65654

Reviewed By: gchanan

Differential Revision: D31193205

Pulled By: malfet

fbshipit-source-id: 5c1b554a59d0e441a701a04ebb62f0032d38b208
2021-09-29 08:09:08 -07:00
5f7ab7be6f [Static Runtime] concat_add_mul_replacenan_clip retains axis arg (#65741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741

This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.

Reviewed By: hlu1

Differential Revision: D31234944

fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
2021-09-29 08:04:20 -07:00
f63150fd1d [PyTorch Edge] Reduce the cost of computing isIncludedInAlias() (#65735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65735

Currently, `isIncludedInAlias()` calls `getRuntimeDispatchKeySet()`, which creates a new `DispatchKeySet` object from an enumerated list of dispatch keys. `isIncludedInAlias()` then checks whether a single dispatch key is part of this set. Instead, just pass in the key one wishes to check. This is marginally faster.

ghstack-source-id: 139281528

Test Plan:
See these 2 AI Bench Runs on the Milan-FFF-11-30 device.

### Before
[AI Bench](https://www.internalfb.com/intern/aibench/details/237302972704466), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v25_perf_1632804218329.html)

### After
[AI Bench](https://www.internalfb.com/intern/aibench/details/606320012968375), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v25_perf_1632807348803.html)

Check the flamegraphs, and focus on any kernel registration code path during library initialization.

Reviewed By: swolchok

Differential Revision: D31228062

fbshipit-source-id: 7a986e3593c30ded7919cd3b564ec579dc97ab5f
2021-09-29 07:40:39 -07:00
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
07d5d7b5cc move kernel launch checks from torch.testing to torch.testing._internal.check_kernel_launches (#60862)
Summary:
The fact that these functions are only used in a single test might be a good enough reason to move them to that module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60862

Reviewed By: H-Huang

Differential Revision: D31141354

Pulled By: mruberry

fbshipit-source-id: 6ce1f721b88620c5f46222ad1b942bc689f0a3e0
2021-09-29 00:39:22 -07:00
0a0564a347 Revert D31206837: [pytorch][PR] *_solve methods: implements forward AD
Test Plan: revert-hammer

Differential Revision:
D31206837 (26e31f76b0)

Original commit changeset: 040beda97442

fbshipit-source-id: f28091327357af9f54f367eda6606240924b93ac
2021-09-28 23:31:16 -07:00
f9c2dc860d make layout check optional in torch.testing.assert_close() (#65419)
Summary:
In case the inputs have a different layout, `assert_close(..., check_layout=False)` converts them to strided before comparison. This is helpful if you just want to compare the values of a sparse COO / CSR tensor against a strided reference.

This keeps BC, since the default `check_layout=True` was the old, hard-coded behavior.
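
A short sketch of the new flag in use (the parameter name comes from this PR):

```python
import torch

dense = torch.eye(3)
sparse = dense.to_sparse()

# With the default check_layout=True this raises, since the layouts
# differ (torch.sparse_coo vs. torch.strided):
# torch.testing.assert_close(sparse, dense)

# With check_layout=False both inputs are converted to strided first:
torch.testing.assert_close(sparse, dense, check_layout=False)
```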

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65419

Reviewed By: H-Huang

Differential Revision: D31133629

Pulled By: mruberry

fbshipit-source-id: ca8918af81fb0e0ba263104836a4c2eeacdfc7e6
2021-09-28 23:23:41 -07:00
8a247fb418 LLVM-12 fix for shm_mutex (#65781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65781

Fixes
```
stderr: In file included from caffe2/caffe2/contrib/shm_mutex/shm_mutex.cc:1:
caffe2/caffe2/contrib/shm_mutex/shm_mutex.h:334:28: error: anonymous non-C-compatible type given name for linkage purposes by alias declaration; add a tag name here [-Werror,-Wnon-c-typedef-for-linkage]
using TicketStruct = struct : ShmBaseHeader {
                           ^
                            TicketStruct
caffe2/caffe2/contrib/shm_mutex/shm_mutex.h:334:31: note: type is not C-compatible due to this base class
using TicketStruct = struct : ShmBaseHeader {
                              ^~~~~~~~~~~~~
caffe2/caffe2/contrib/shm_mutex/shm_mutex.h:334:7: note: type is given name 'TicketStruct' for linkage purposes by this alias declaration
using TicketStruct = struct : ShmBaseHeader {
      ^
1 error generated.
Cannot execute a rule out of process. On RE worker. Thread: Thread[main,5,main]
Command failed with exit code 1.
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31248938

fbshipit-source-id: 47342fecc72ada9397a1b7bd6fcabfccf988dd3e
2021-09-28 22:51:38 -07:00
4a7a0ea42e Skip flaky ASAN tests (#65792)
Summary:
See https://github.com/pytorch/pytorch/issues/65727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65792

Reviewed By: janeyx99

Differential Revision: D31254490

Pulled By: malfet

fbshipit-source-id: 76714db30a5566fbab95179236ccdafab22cf551
2021-09-28 22:33:02 -07:00
d528c7f3c0 .github: Move windows back to default directory (#64962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64962

Moves windows builds / tests back to the default directory. Previously
we had moved them because checkout would sometimes fail due to file
handles still being open on the working directory.

Moving back to the default directory also has the added bonus of sccache
working again, so here's to hoping that this doesn't have any adverse
effects.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc peterjc123 mszhanyi skyline75489 nbcsm ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31250072

Pulled By: seemethere

fbshipit-source-id: a803bf0e00e1b2b0d63f78600588281622ee0652
2021-09-28 19:41:35 -07:00
ed4491be6f Fix error code checking for Windows build scripts (#57331)
Summary:
The variable `%errorlevel%` is expanded before the whole command line runs, so it is useless inside an if-block. Also, let's avoid using `%errorlevel%` because it may be set by users accidentally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57331

Reviewed By: anjali411

Differential Revision: D28140182

Pulled By: malfet

fbshipit-source-id: a3f21d65623bb25f039805c175e9f3b468bcb548
2021-09-28 19:27:07 -07:00
0d7036fdaf don't leak build time path name to runtime for frozen python modules (#65715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65715

Here is how we freeze a python module:
- we call the Python builtin compile method with the module's source code and its path. This method returns a Python code object
- we call marshal.dumps to serialize the code object to bytes.

The code_object.co_filename actually matches the path passed to the compile method. We can simply replace that with a marker
to avoid leaking the build-time path into the runtime.

This works on nested code objects as well:
```
#!/bin/env python3.8
import marshal

code_str = """
print("hello")

class MyCls:
    def __init__(self):
        pass
"""
co = compile(code_str, "<Generated by torch::deploy>", "exec")
cobytes = marshal.dumps(co)
import pdb; pdb.set_trace()
```

Checking `co`:
```
(Pdb) co.co_filename
'<Generated by torch::deploy>'
(Pdb) co.co_consts
('hello', <code object MyCls at 0x7f0e8670bbe0, file "<Generated by torch::deploy>", line 4>, 'MyCls', None)
(Pdb) co.co_consts[1].co_filename
'<Generated by torch::deploy>'
```

Test Plan:
Find the serialized frozen module for the torch.nn.modules.linear module in the generated bytecode_x.c file. Put the content into /tmp/linear.bytecode.

Run the testing script:
```
import marshal
co_bytes = bytes(eval("[{}]".format("".join(open('/tmp/linear.bytecode').readlines()).replace('\n', '').replace('\t', ''))))
co = marshal.loads(co_bytes)
print(co)

```

The output for the paste without the change:
```
<code object <module> at 0x7f39ca7f07c0, file "/data/users/shunting/fbsource/fbcode/buck-out/opt/gen/caffe2/gen_frozen_torchpython_src__srcs/torch/nn/modules/linear.py", line 1>
```

The output for the paste with the change:
```
<code object <module> at 0x7f05a765d710, file "<Generated by torch::deploy>", line 1>
````

Note that the file part is changed as expected.

Reviewed By: suo

Differential Revision: D31214555

fbshipit-source-id: 56958e0a7352f8c30a3377f83209efe7db61f0fb
2021-09-28 19:25:51 -07:00
72b27bde83 [CIFlow] Modify workflow trigger logic (#65733)
Summary:
CIFlow workflows should always run on the push event.
On pull requests, a workflow should run if its label conditions are met; if
no `ciflow/` labels are associated with the PR, the workflow is enabled by
default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65733

Reviewed By: zhouzhuojie

Differential Revision: D31251278

Pulled By: malfet

fbshipit-source-id: 31ce745cb224df7c6fec1682ec4180513e3dadf3
2021-09-28 19:19:49 -07:00
b3c32ad32f .github: Move calculate-docker-image into build (#65789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65789

These common types of jobs can be moved into build since it's typically
a no-op. It could make debugging docker builds annoying in the future, but
dedicating an entire ephemeral node to a no-op seems like a waste to me.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D31253017

Pulled By: seemethere

fbshipit-source-id: c7b5ea35a57fb1576122df219d387c86e420fd1f
2021-09-28 19:15:24 -07:00
609384c056 [sparsity][doc] Docstring for WeightNormSparsifier (#65294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65294

This adds the docstring documentation to the WeightNormSparsifier and adds the type hints for the constructor args.
Note, this does not require testing as only the doc is changed.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31186827

Pulled By: z-a-f

fbshipit-source-id: c5010c9bba25b074c4cc6c88f251474b758f950d
2021-09-28 14:14:51 -07:00
92ee5cc2e2 [sparsity] Fix for accumulation bug in WeightNormSparsifier (#65293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65293

This fixes a bug in the WeightNormSparsifier, where the mask is being multiplied by the newly computed mask.
Because the mask elements are binary 0/1, this accumulates the mask over every iteration, eventually collapsing the mask to zero.
This bug accidentally bled through from old versions.
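
A toy illustration of the accumulation behavior described above (not the sparsifier's actual code):

```python
import torch

mask = torch.ones(8)
for _ in range(5):
    step_mask = (torch.rand(8) > 0.5).float()
    mask = mask * step_mask  # buggy: zeros from earlier steps never come back

# Each step can only clear more entries, so after enough iterations the
# mask collapses to all zeros; the fix computes the new mask from scratch
# instead of multiplying it into the old one.
print(mask)
```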

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31186829

Pulled By: z-a-f

fbshipit-source-id: 3f5b2c833148ab0bd8084e7410ce398f1252e65e
2021-09-28 14:14:49 -07:00
a90912ecc5 [sparsity] Remove the pack_param from the sparsifier state_dict (#65292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65292

That was the original design, which we decided to simplify by removing the packing in the sparsifier.
The state of the sparsifier is saved directly, and the old behavior accidentally bled through to the current version.
This change removes the `_pack_params` method, and changes the state_dict to include the state directly.
We don't have to change the load_state_dict, as it will work with either the old or the new format.

The main reason for this PR is the simplification. The original design didn't achieve anything useful by packing the sparsification parameters.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31186826

Pulled By: z-a-f

fbshipit-source-id: 4ad72a7e669f048d2f2d269269ee11b63fa169db
2021-09-28 14:12:52 -07:00
c829cb6840 Port min kernel to structured kernels. (#61450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61450

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D29741713

Pulled By: bdhirsh

fbshipit-source-id: 2c107752a90fd39cfb55e08aaf3541bd484a5fc3
2021-09-28 14:03:54 -07:00
c2252b3aa6 Port max kernel to structured kernels. (#61449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61449

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D29741714

Pulled By: bdhirsh

fbshipit-source-id: 6c8c17d20f578ab0af8a969d103a19ccd8d51842
2021-09-28 14:02:26 -07:00
51f1569c77 Add checks for structured in-place operations. (#65686)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65686

Fixes: #57827

This PR introduces the `check_inplace` function. It contains common checks for all
structured in-place operators (e.g. dtype, device, and sizes). The `set_output` method calls
`check_inplace` on in-place specializations of structured kernels.
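
A hedged sketch, from the Python side, of the kind of dtype check this centralizes:

```python
import torch

a = torch.zeros(3, dtype=torch.long)
b = torch.randn(3)

a.add(b)   # out-of-place: fine, the result is promoted to float
a.add_(b)  # in-place: raises, a float result can't be cast to the long output
```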

Besides that, it also:
- adds overlap assertions for both in-place and out-of-place overloads
- removes in-place-operator-specific `TORCH_CHECK`s around the code base

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31234063

Pulled By: ezyang

fbshipit-source-id: fa3b45775af7812e07a282e7cae00b68caf0fdb0
2021-09-28 13:21:26 -07:00
93852bb2d4 Port sort kernel to structured kernels. (#62391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62391

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D30903992

Pulled By: bdhirsh

fbshipit-source-id: 52687aa2483c101056825433d39d69c60b829c62
2021-09-28 13:12:35 -07:00
57529d48c4 [quant] Fix applying non-zero offset 1 to null pointer in quantized interpolation (#65570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65570

Although this is not an issue that could pop up in practice, LLVM-12 throws an error about it if not checked.

Test Plan: `buck test mode/dev //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_empty_batch (quantization.core.test_quantized_op.TestQuantizedOps)'`

Reviewed By: r-barnes

Differential Revision: D31151681

fbshipit-source-id: e039c6aa1687a61ef6774f045744dc9d768d5c80
2021-09-28 12:28:59 -07:00
4752453d27 [Structured Kernels] Port for baddbmm and bmm (#64805)
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. Both are in the same PR because a lot of the code is common to the two ops, including the checks and the implementation.

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805

Reviewed By: gchanan

Differential Revision: D31134454

Pulled By: ezyang

fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
2021-09-28 11:07:31 -07:00
278edb5626 .circleci: Only generate docker configs we need (#65728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65728

Changes the docker image generation script to only include image build
jobs for images that we actually use within CircleCI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D31224674

Pulled By: seemethere

fbshipit-source-id: 64b14e1a4ef82d345ec7b898c4c89d9a9419e4de
2021-09-28 10:38:13 -07:00
145202c45b Define timeout in TestIndividualWorkerQueue (#65742)
Summary:
This test occasionally deadlocks while waiting for the child process to report its result.
As the test is small, the entire test should never take more than 1-2 sec, but to be on the safe side the timeout is set to 5 sec.

Somewhat mitigates https://github.com/pytorch/pytorch/issues/65727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65742

Reviewed By: janeyx99, ejguan

Differential Revision: D31235116

Pulled By: malfet

fbshipit-source-id: 0cdd2f7295f6f9fcefee954a14352e18fae20696
2021-09-28 10:01:19 -07:00
50edc2679d onnx/test.sh: Run test/onnx in only shard 1 (#65722)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65722

Reviewed By: albanD

Differential Revision: D31223236

Pulled By: janeyx99

fbshipit-source-id: 3b648cb940a95866f465b27b8bdc74b06d258140
2021-09-28 08:45:25 -07:00
87cd658c27 Add override to virtual destructor in derived class (#65476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65476

As suggested by `-Winconsistent-missing-destructor-override`.

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31115128

fbshipit-source-id: a4e2441c13704c0c46e3e86f7886fca76c40ca39
2021-09-28 08:37:23 -07:00
57e5ae5306 [vulkan] Use push constants instead of SSBOs (#65716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65716

Currently, we send arguments to shaders by creating and filling an SSBO (Shader Storage Buffer Object). However, we can instead use [push constants](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdPushConstants.html) to send a small amount of uniform data to shaders.

Push constants are slightly more performant than using an SSBO, and they also have the added benefit of not needing to allocate and manage memory for a buffer object, since they update the pipeline data directly.

The downside of using push constants is that there is a maximum size for a push constant block, described by `maxPushConstantsSize` in [VkPhysicalDeviceLimits](https://www.khronos.org/registry/vulkan/specs/1.1/html/vkspec.html#VkPhysicalDeviceLimits). The minimum size guaranteed by the spec is 128 bytes, which is enough for 32 `float`/`int` variables, or 8 `vec4` variables. This should be enough for our purposes.

Currently, the Convolution shaders use the largest uniform block, which only takes 22 bytes.

Test Plan:
Run `vulkan_api_test`:

```
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Reviewed By: beback4u

Differential Revision: D30368834

fbshipit-source-id: 65a42b9da1a9084ba2337b41eaab9b612583c408
2021-09-28 08:32:30 -07:00
e155e7520f MaxUnpooling: parallel_for not always backed by OMP (#65655)
Summary:
Use `c10::optional` + thread_fence instead of `#pragma omp critical` inside the max_unpooling kernels.

Using any OpenMP pragma in an `at::parallel_for` body is wrong, as parallel_for can
be implemented using native threading mechanisms such as pthreads.

`c10::optional` is a much better approach than the pair of
`has_error` and `error_index` variables. Use `std::atomic_thread_fence` to ensure the error_index value is synchronized.

It also fixes ICE reported in https://github.com/pytorch/pytorch/issues/65578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65655

Reviewed By: ngimel

Differential Revision: D31206501

Pulled By: malfet

fbshipit-source-id: 93df34530e721777b69509cd6c68f5d713fb2b2a
2021-09-28 08:13:58 -07:00
26e31f76b0 *_solve methods: implements forward AD (#65546)
Summary:
This PR adds forward AD for the `*_solve` methods.
Additionally, `cholesky_solve` gets an OpInfo plus a fix for a bug where wrong leading dimensions could be passed to LAPACK,
and `lu_solve` gets forward AD computed with 2x `lu_solve` instead of 1x `lu_solve` + 2x `triangular_solve`.
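
A hedged sketch of forward AD through `cholesky_solve` (shapes and values are illustrative):

```python
import torch
import torch.autograd.forward_ad as fwAD

A = torch.randn(3, 3)
A = A @ A.transpose(-2, -1) + 3 * torch.eye(3)  # make A positive definite
L = torch.linalg.cholesky(A)
b, tangent = torch.randn(3, 2), torch.randn(3, 2)

with fwAD.dual_level():
    b_dual = fwAD.make_dual(b, tangent)  # attach the JVP direction to b
    x = torch.cholesky_solve(b_dual, L)
    primal, jvp = fwAD.unpack_dual(x)    # the solution and its forward gradient
```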

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65546

Reviewed By: gchanan

Differential Revision: D31206837

Pulled By: albanD

fbshipit-source-id: 040beda97442e7a88a9df9abc7bb18313ce55bc3
2021-09-28 06:51:32 -07:00
2ea724b1fd Added option to update parameters using state_dict in AveragedModel (#65495)
Summary:
While implementing [EMA](https://github.com/pytorch/vision/pull/4381) (which extends AveragedModel) in torchvision, update_parameters() from AveragedModel could not be used as it did not handle the state_dict(), so a custom update_parameters() needed to be defined in the [EMA class](https://github.com/pytorch/vision/pull/4406). This PR aims to handle this scenario, removing the need for that custom update_parameters() implementation.
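
A hedged sketch of the EMA-style use case that motivated this change (the `avg_fn` signature follows the documented `AveragedModel` API; the decay value is illustrative):

```python
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(4, 2)

# Exponential moving average instead of the default arithmetic mean.
ema = AveragedModel(
    model,
    avg_fn=lambda avg, new, num_averaged: 0.999 * avg + 0.001 * new,
)

# Inside the training loop, after each optimizer step:
ema.update_parameters(model)
```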

Discussion: https://github.com/pytorch/vision/pull/4406#pullrequestreview-753734102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65495

Reviewed By: datumbox

Differential Revision: D31176742

Pulled By: prabhat00155

fbshipit-source-id: 326d14876018f21cf602bab5eaba344678dbabe2
2021-09-28 03:34:49 -07:00
3324bae5f1 Remove THCTensor.cu and THCTensorCopy.cu copy (#65491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65491

The only user of any of this code is THCStorage_copy, so I've
migrated that to call `Tensor.copy_` directly.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31148183

Pulled By: ngimel

fbshipit-source-id: 92bab71306c84bc481c47a0615ebb811af2c2875
2021-09-27 23:21:45 -07:00
6a99053515 Added sparse-tensor copy logic to dispatcher (#65304)
Summary:
- Only ported copy for sparse tensors to the dispatcher. Everything else is the same
- Duplicated code for named-tensor handling in sparse tensor copy
	- Might change it later to handle named tensors using the dispatcher

Issue https://github.com/pytorch/pytorch/issues/61122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65304

Reviewed By: gchanan

Differential Revision: D31176720

Pulled By: ezyang

fbshipit-source-id: 56757a3b0fb56c3d05c16dd935428a0cd91ea766
2021-09-27 20:08:27 -07:00
43d47bdcca [tensorexpr] conv2d handle optional bias (#64750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64750

conv2d's bias is optional; it will be ArgNone when processing the graph.
The bias is a prim::Constant of NoneType, so we do not know its shape at the moment of constant binding.

This change adds the bias as a constant zeros Tensor at graph-processing time. To do that, `std::vector<TensorExprKernel::ConstantDescr>& constants` and `std::vector<at::Tensor>& constant_tensors` are passed to `computeOperandValue`, since it is not part of `TensorExprKernel`.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30842101

Pulled By: IvanKobzarev

fbshipit-source-id: 88020f6934e43fe606f8eae928b7e21b7c3f15f6
2021-09-27 20:00:53 -07:00
31ea4358d8 [tensorexpr] Add Op handling for mobilenetv3 large (#64741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64741

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30839110

Pulled By: IvanKobzarev

fbshipit-source-id: d8e89c086c713fbe816dd8c8096cd64c05dc7431
2021-09-27 20:00:51 -07:00
c28e3ffb4b [jit] Shape propagation batch_norm, dropout, quantize, hardswish (#64740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64740

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D30839111

Pulled By: IvanKobzarev

fbshipit-source-id: c8f477ee05769865c0a23127b7f8a8276f46b54e
2021-09-27 19:59:34 -07:00
46b3fc032a Migrate remainder of THCDeviceUtils.cuh to ATen (#65472)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65472

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31148181

Pulled By: ngimel

fbshipit-source-id: f777ba85b1cd8cb98b0ceb1756c558dde5862fc2
2021-09-27 19:37:06 -07:00
12137db5e3 Fix the slowdown of _object_to_tensor since 1.9 (#65721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65721

#Closes: https://github.com/pytorch/pytorch/issues/65696

The bug was introduced in https://github.com/pytorch/pytorch/pull/55861, and it has caused a 100x slowdown since 1.9.
ghstack-source-id: 139128267

Test Plan:
Performance test:
```
import time

from torch.distributed.distributed_c10d import _object_to_tensor

start = time.time()
_object_to_tensor("x" * 50_000_000)
print("Time:", time.time() - start)
```

Reviewed By: rohan-varma

Differential Revision: D31219794

fbshipit-source-id: 1abec38f9d51361c1eab6ad5efd87b589322e208
2021-09-27 19:22:10 -07:00
002ff19836 [acc_utils] Fix off by one for model info getter (#65708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65708

As titled.

Test Plan: added unit test

Reviewed By: khabinov

Differential Revision: D31209992

fbshipit-source-id: c1b4e70bd9705dcfdf3039cb8791149c8646f1d7
2021-09-27 19:01:55 -07:00
63bb7c6dba Refactor AotCompile to return a pair (#65707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65707

Refactors aotCompile to return a pair of the compiled function and the LLVM assembly, instead of updating an incoming string with the assembly code.

Testing: Gives expected results when compiled and run
```
(pytorch)  ~/local/pytorch refactor_aot
└─ $ build/bin/aot_model_compiler --model mobilenetv3.pt --model_name=pytorch_dev_mobilenetv3 --model_version=v1 --input_dims="2,2,2"
The compiled model was saved to mobilenetv3.compiled.pt
```

Test Plan: Imported from OSS

Reviewed By: qihqi

Differential Revision: D31220452

Pulled By: priyaramani

fbshipit-source-id: f957c53ba83f876a2e7dbdd4b4571a760b3b6a9a
2021-09-27 18:56:04 -07:00
e9327ed2ce Add nn.function.hardtanh in acc_tracer (#65639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65639

This op is used by mobilenet v2.

Test Plan:
buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_hardtanh
buck test glow/fb/fx/acc_tracer:test_acc_shape_inference -- hardtanh

Reviewed By: yinghai

Differential Revision: D31184297

fbshipit-source-id: 5a04319f6d16fb930372442616e27211107ecc67
2021-09-27 18:40:18 -07:00
6a6ee92e36 [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31150087

Pulled By: b-koopman

fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727
2021-09-27 16:01:49 -07:00
7c62b6e973 add deepcopy support to subclasses (#65584)
Summary:
Happy to get any feedback on how to make this code cleaner!

This PR:
- Fixes Tensor attribute deepcopy (BC-breaking?)
- Adds a test for Tensor attribute deepcopy
- Fixes subclass deepcopy
- Moves the subclass serialization tests into their own class so as not to interfere with other serialization test logic
- Adds a test for subclass deepcopy
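
A minimal check of the subclass behavior this fixes (`MyTensor` is a stand-in):

```python
import copy
import torch

class MyTensor(torch.Tensor):
    pass

t = torch.randn(2, 2).as_subclass(MyTensor)
t_copy = copy.deepcopy(t)

assert type(t_copy) is MyTensor  # the subclass survives deepcopy
assert torch.equal(t_copy, t) and t_copy is not t
```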

cc ezyang gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65584

Reviewed By: gchanan

Differential Revision: D31206590

Pulled By: albanD

fbshipit-source-id: 74a8f0767f4933b9c941fbea880a8fd1b893ea2f
2021-09-27 14:36:22 -07:00
f5b4e369f6 Sparse SoftMax: Remove unused variables (#65539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65539

This function doesn't directly use thrust so these are simply unused variables.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31193191

Pulled By: malfet

fbshipit-source-id: 231b6a197c9f1bd5a61e46cb858e8eedc85b2818
2021-09-27 13:51:49 -07:00
e1340d4282 [GHA] Small refactors (#65647)
Summary:
Introduce `main` method in generate_ci_workflows
Check that all `ciflow/` labels start with the same prefix
Move `ciflow_should_run` definition to common.yml.j2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65647

Reviewed By: janeyx99

Differential Revision: D31189537

Pulled By: malfet

fbshipit-source-id: 7cc47f63fb334c57f450034b931ff5bae1c0ed8b
2021-09-27 13:14:49 -07:00
fea32be964 Add HPU type for check_base_legacy_new (#65410)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65410

Reviewed By: H-Huang

Differential Revision: D31143754

Pulled By: malfet

fbshipit-source-id: 32abfbae4f7c09924c7dfa16758d64a2215ec636
2021-09-27 13:13:34 -07:00
82e0bf44c0 Apply linter suggestions to #65137 (#65459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65459

Just run linter on the change and apply all suggestions

Test Plan: N/A

Reviewed By: seemethere

Differential Revision: D31102960

fbshipit-source-id: 04e1d07935690f2ddbc64533661b3e55379d13b5
2021-09-27 13:07:40 -07:00
811601e19a Upload sccache stats (#65582)
Summary:
This adds some tracking to metrics.pytorch.org for sccache build stats per environment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65582

Reviewed By: malfet, zhouzhuojie, janeyx99

Differential Revision: D31160761

Pulled By: driazati

fbshipit-source-id: a497918bafbe610a51c92a9139684cd3efe670d3
2021-09-27 12:55:10 -07:00
ea546e20fd [Reland] nn.functional.linear OpInfo (#65498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65498

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D31171149

Pulled By: zou3519

fbshipit-source-id: badb06af08a772397b0280189385723c0175200b
2021-09-27 12:42:46 -07:00
b91375f741 upgrade windows cuda installer: cu11.1.0 to cu11.1.1 (#65669)
Summary:
Fixes pytorch/vision#4483

Please merge it with https://github.com/pytorch/builder/pull/857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65669

Reviewed By: gchanan

Differential Revision: D31205107

Pulled By: janeyx99

fbshipit-source-id: 654f0440ad33d2517db95d64df64e14de1233ad7
2021-09-27 12:27:19 -07:00
cd2656a2e5 [package] add some docs describing how to debug dependencies (#65704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65704

As title.

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D31209866

Pulled By: suo

fbshipit-source-id: 4c8ec1d5418ea75b71c4b9a498b86f0ef5383544
2021-09-27 12:14:23 -07:00
10d0dbc6d9 Avoid storage access for HPU tensors (#65409)
Summary:
Add is_hpu() methods for the ATen tensor and device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65409

Reviewed By: wconstab, H-Huang

Differential Revision: D31134422

Pulled By: malfet

fbshipit-source-id: 181ebb67dce8e05a0723ef3c82f23e39228841ee
2021-09-27 11:54:30 -07:00
aa5d2a8d86 Remove confusing SHARD_NUMBER resetting logic (#65701)
Summary:
The SHARD_NUMBER reset was a way to differentiate whether we had just one shard vs. multiple.

We shouldn't reset SHARD_NUMBER; instead we should just pass and use NUM_TEST_SHARDS, for clarity and for ease of scaling up to more shards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65701

Reviewed By: driazati

Differential Revision: D31209306

Pulled By: janeyx99

fbshipit-source-id: 3a3504bd47e655d62aa0d9ed2f4657ca34c71c0e
2021-09-27 10:55:00 -07:00
facff2ec65 Update ProcessGroup collective C++ APIs to be non-pure virtual functions (#64943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64943

Most ProcessGroup collective APIs are pure virtual. As a result, c10d extensions need to override all of them and throw an error for any APIs they don't need. This is too verbose for users. This commit changes those collective APIs to virtual functions that throw an error by default. Note that ProcessGroup is still an abstract class, as `getBackendName` is a pure virtual function that all subclasses have to override.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D30906866

Pulled By: mrshenli

fbshipit-source-id: c4df8962d60350a44d2df72fd04f9dd6eadb9fa6
2021-09-26 19:19:43 -07:00
cd80bbe5f5 Bug fixes in dataframe_wrapper (#65629)
Summary:
## Description
- Updated functions in `dataframe_wrapper.py` to return values
- Fixed bug in `set_df_wrapper` to update `global default_wrapper`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65629

Reviewed By: ejguan

Differential Revision: D31180110

Pulled By: Nayef211

fbshipit-source-id: a8046e582fd6ce982fcdc89dae4932d0edc83d6b
2021-09-25 21:09:41 -07:00
1c8949c51a [BE] Run Zero test internally (#65519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65519

Adds buck target so we can run this internally.
ghstack-source-id: 139009957

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D31072784

fbshipit-source-id: 7185cc1e6f9df3d79251eb017270471942a9d7dd
2021-09-25 13:26:50 -07:00
f70147b426 [BE] Enable ZeRO test on windows (#65385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65385

Enables the ZeRO tests to run on windows. Closes
https://github.com/pytorch/pytorch/issues/63086.

Backend == NCCL was used as a proxy to see if we were running under CUDA, but Windows GPU tests use Gloo. In this case, use Gloo on GPU.

For some reason these tests don't seem to test Gloo on GPU with ZeRO in general (they pick the NCCL backend when a GPU is available), so that behavior is kept for now.
ghstack-source-id: 139003920

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31071181

fbshipit-source-id: 45a76309ac5e882f5aa6c4b130118a68800754bb
2021-09-25 13:25:40 -07:00
4fe66d962d [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D31192084

fbshipit-source-id: 25d490783b876253ddd1ad0a70832766ebd33f51
2021-09-25 06:42:19 -07:00
146817c9d0 Add all_paths utility function (#65602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65602

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D31163681

Pulled By: tugsbayasgalan

fbshipit-source-id: fa0b28b1d3b73efcc7671698a613e695a01cc103
2021-09-25 01:11:20 -07:00
0256c3be50 [TensorExpr] Delete dtype_ field from Let - it should use its var's dtype. (#65634)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65634

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31182697

Pulled By: ZolotukhinM

fbshipit-source-id: 572ecd74cdf2a671ee98e81f0b3e387f3d9c6202
2021-09-25 00:11:06 -07:00
399214efd6 Revert D31172530: [pytorch][PR] Enable CUPTI for kineto by default on windows
Test Plan: revert-hammer

Differential Revision:
D31172530 (6b60884f12)

Original commit changeset: 2c69ed0282c5

fbshipit-source-id: 649e040a8c44b0f536a8db397b4325309a285934
2021-09-24 19:18:15 -07:00
cda2ee9016 Add nn.function.hardswish in acc_tracer (#65590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65590

hardswish is used by the MobileNetV3 OSS model.
This diff adds hardswish support to acc_tracer.

Test Plan:
buck test glow/fb/fx/acc_tracer:test_acc_shape_inference
buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_hardswish

Reviewed By: 842974287

Differential Revision: D30950061

fbshipit-source-id: cab57b8de5bea3a4d9d2b7d2a410d9afe787d66f
2021-09-24 17:30:39 -07:00
1de8976e85 Add quantized::convtranspose2d (#63914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63914

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D30531889

fbshipit-source-id: a65e389da2722efbc62e3fe1edf503732326350d
2021-09-24 17:07:29 -07:00
ab5eb56983 add qmul (#63913)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63913

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D30531890

fbshipit-source-id: 29d88cc61bd1e328cc7ae7a91a2f8d4819803c8d
2021-09-24 17:06:17 -07:00
ece25c453f [PyTorch] Store Argument::alias_info_ on the heap (#64824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64824

See comment in function_schema.h for explanation. I claim that this is a good tradeoff because the aliasing information seems to be used only in compiler-ish code paths, where performance isn't as critical as actual execution. If performance is important there too, perhaps we should hoist isWrite into the Argument itself since there are several paths that only care about isWrite.
ghstack-source-id: 138958896

Test Plan: CI, profile schema parsing on startup and see much fewer page faults in createArgumentVector.

Reviewed By: suo

Differential Revision: D30860719

fbshipit-source-id: 1d4d2328f2b8e34f5ddf9d82083fd4dd7b7f738f
2021-09-24 17:00:51 -07:00
af7238f214 Rocm4.3.1 nightly (#65624)
Summary:
Depends on pytorch/builder#851.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65624

Reviewed By: zou3519

Differential Revision: D31180780

Pulled By: malfet

fbshipit-source-id: 98a51eb45985ef648108e811d2c02231ec8b3a1f
2021-09-24 16:21:01 -07:00
15724bcc03 [TensorExpr] Re-enable a float16 test. (#65632)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65632

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D31181798

Pulled By: ZolotukhinM

fbshipit-source-id: 1a57d0a878d44f8b73f3c24eef7ba707ce18fb70
2021-09-24 15:15:42 -07:00
0d3bf97fd0 TST Adds test for non-contiguous tensors (#64954)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/61935

This PR:

1. Adds a test for non-contiguous tensors
2. Fixes a bug in `NLLLoss` that was caught by the test.

The reason this was not caught in `common_nn` is that `CriterionTest` overrides `test_cuda` but does not call `test_nonconfig`.
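
A hedged sketch of the kind of non-contiguous input the new test exercises (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(4, 6), dim=1)
nc = log_probs[:, ::2]  # non-contiguous view of shape (4, 3)
assert not nc.is_contiguous()
target = torch.randint(0, 3, (4,))

# The non-contiguous view and a contiguous copy must give the same loss.
torch.testing.assert_close(F.nll_loss(nc, target), F.nll_loss(nc.contiguous(), target))
```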

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64954

Reviewed By: zou3519

Differential Revision: D31174149

Pulled By: jbschlosser

fbshipit-source-id: a16073e59b40ccc01c82ede016b63a8db2e810f5
2021-09-24 15:05:09 -07:00
a839cec0ad .github: GHA retry docker pull (#65103)
Summary:
This should help alleviate workflows failing due to docker pull timing out, which doesn't happen often, but did happen once in the past day.

Was also reported in https://github.com/pytorch/pytorch/issues/65439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65103

Reviewed By: driazati

Differential Revision: D31157772

Pulled By: janeyx99

fbshipit-source-id: 7bf556f849b41eeb6dea69d73e5a8e1a40dec514
2021-09-24 14:31:43 -07:00
68e5935498 Remove fgrad_input from slow_conv2d (#64280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64280

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30830887

Pulled By: jbschlosser

fbshipit-source-id: 5a3a79ad9d9118177672eabf872f9d9a3313ebe4
2021-09-24 14:27:39 -07:00
71d1d16acb Moving the constant parameter check to a more common file (#64251)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64251

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31161850

Pulled By: Gamrix

fbshipit-source-id: 5db3e6d52c99c1f40455c601122bb7680a287ae5
2021-09-24 13:54:27 -07:00
640a615150 [easy] [PyTorch Edge] Remove double pragma once directive in the generated code (#65620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65620

This was bothering me for a while.

ghstack-source-id: 138914860

Test Plan: Sandcastle

Reviewed By: beback4u

Differential Revision: D31162648

fbshipit-source-id: 72c47ea34d40c772bb53da721fcb36365b5dbaf3
2021-09-24 13:14:37 -07:00
57e066e188 TST Adds gradcheck and gradgradcheck to module info (#64444)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/61935

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64444

Reviewed By: pbelevich

Differential Revision: D31174672

Pulled By: jbschlosser

fbshipit-source-id: 86dc3576479974fd0996f06298c09692c07e6b24
2021-09-24 13:10:29 -07:00
6b60884f12 Enable CUPTI for kineto by default on windows (#65608)
Summary:
Retry of https://github.com/pytorch/pytorch/pull/62175

See https://github.com/pytorch/pytorch/pull/62175#issuecomment-926411151 for more information.

malfet gdankel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65608

Reviewed By: zou3519

Differential Revision: D31172530

Pulled By: gdankel

fbshipit-source-id: 2c69ed0282c54fa6cdb6e604096d0370e230fd66
2021-09-24 13:00:49 -07:00
eca4f14b6c [PyTorch] Add C10_ prefix to MPARK_* macros in variant.h (#65589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65589

Without this prefix, the include guards interfere with attempts to indirectly include both c10::variant and the original mpark variant in the same translation unit.
ghstack-source-id: 138901838

Test Plan: Temporarily `#include <c10/util/variant.h>` in ivalue.h and buck build //data_preproc/preproc:preproc_adapter_utils mode/no-gpu -- this delayed D31101962 (01720d6a23) from fixing S244170

Reviewed By: bhosmer

Differential Revision: D31159414

fbshipit-source-id: 234c5ed37ca853702bcdf3263e4f185b95ac1d08
2021-09-24 12:57:26 -07:00
7f25c3e666 Update distributed.rst to show that CUDA send/recv on GPU is supported (#65601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65601

I believe this feature was supported one year ago:
https://github.com/pytorch/pytorch/pull/44921
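
A hedged sketch of GPU point-to-point send/recv with the NCCL backend (assumes one GPU per rank and the usual env:// rendezvous variables):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
t = torch.ones(4, device=f"cuda:{rank}")

if rank == 0:
    dist.send(t, dst=1)  # the CUDA tensor is sent directly over NCCL
elif rank == 1:
    dist.recv(t, src=0)
```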

#Closes: https://github.com/pytorch/pytorch/issues/65525
ghstack-source-id: 138918961

Test Plan: N/A

Reviewed By: pritamdamania87, mingzhe09088

Differential Revision: D31163535

fbshipit-source-id: 9321a0a5137a3e265e2b54bd78730ac28c7acd55
2021-09-24 12:30:10 -07:00
760aefd34d Fix nullptr addition (#65548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65548

Fixes
caffe2/test:jit - test_unsupported_nn_functional_pad_circular_cpu_float32 (test_jit_fuser_te.TestNNCOpInfoCPU)

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31148405

fbshipit-source-id: 4c8c693a45229ab4e59b0b0ec5326d3ac114dbaf
2021-09-24 11:43:22 -07:00
c3b09e977a [fx2trt] Refresh execution context across save/load for TRTModule. (#65592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65592

IExecutionContext might not be safe to serialize; therefore, the simplest way to support save/load of TRTModule is to re-populate the execution context on every load.
ghstack-source-id: 138904770

Test Plan: buck run mode/dev-nosan -c python.package_style=inplace -j 40 deeplearning/trt/fx2trt:acc2trt_test

Reviewed By: zrphercule

Differential Revision: D31070427

fbshipit-source-id: 88c58c6ce50e6dc9383d7f9419b5447cb89a4a3a
2021-09-24 11:36:57 -07:00
1682722152 keep output type after calling SubgraphRewriter (#65453)
Summary:
The JIT **SubgraphRewriter** doesn't keep the output type after overwriting the old graph. For example, in profiling mode the old graph carries the old operator's shapes, but after replacing the old operator with a newer operator via **SubgraphRewriter**, the tensor shape info is eliminated.

The motivation: I want to replace the PyTorch convolution with a customer's convolution. I first register **aten::_convolution** as a profiler node that can record the input and output shapes, and then use graph rewriting to replace it with **aten::conv2d**, whose tensor shape info is eliminated. I want to use the input sizes for some pre-processing before replacing **aten::conv2d** with the customer's convolution.

Before rewrite:
```
graph(%self.1 : __torch__.MyModule,
      %x.1 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu)):
  %7 : int = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %6 : bool = prim::Constant[value=0](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %5 : bool = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %4 : NoneType = prim::Constant()
  %3 : int[] = prim::Constant[value=[1, 1]]()
  %2 : int[] = prim::Constant[value=[0, 0]]()
  %conv : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv"](%self.1)
  %z : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::clone(%x.1, %4) # jit_test.py:22:0
  %weight : Float(3, 3, 1, 1, strides=[3, 1, 1, 1], requires_grad=0, device=cpu) = prim::GetAttr[name="weight"](%conv)
  %x : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::_convolution(%x.1, %weight, %4, %3, %2, %3, %6, %2, %7, %6, %6, %5, %5), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %16 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::add(%x, %z, %7) # jit_test.py:24:0
  return (%16)
```
After the rewrite using **aten::conv2d**:
```
graph(%self.1 : __torch__.MyModule,
      %x.1 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu)):
  %7 : int = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %6 : bool = prim::Constant[value=0](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %5 : bool = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %4 : NoneType = prim::Constant()
  %3 : int[] = prim::Constant[value=[1, 1]]()
  %2 : int[] = prim::Constant[value=[0, 0]]()
  %conv : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv"](%self.1)
  %z : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::clone(%x.1, %4) # jit_test.py:22:0
  %weight : Float(3, 3, 1, 1, strides=[3, 1, 1, 1], requires_grad=0, device=cpu) = prim::GetAttr[name="weight"](%conv)
  %18 : Tensor = aten::conv2d(%x.1, %weight, %4, %3, %2, %3, %7)
  %16 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::add(%18, %z, %7) # jit_test.py:24:0
  return (%16)
```

Expected result after replacing **aten::_convolution** with **aten::conv2d**:

```
graph(%self.1 : __torch__.MyModule,
      %x.1 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu)):
  %7 : int = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %6 : bool = prim::Constant[value=0](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %5 : bool = prim::Constant[value=1](), scope: __module.conv # /home/xiaobinz/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/nn/modules/conv.py:443:0
  %4 : NoneType = prim::Constant()
  %3 : int[] = prim::Constant[value=[1, 1]]()
  %2 : int[] = prim::Constant[value=[0, 0]]()
  %conv : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv"](%self.1)
  %z : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::clone(%x.1, %4) # jit_test.py:22:0
  %weight : Float(3, 3, 1, 1, strides=[3, 1, 1, 1], requires_grad=0, device=cpu) = prim::GetAttr[name="weight"](%conv)
  %18 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::conv2d(%x.1, %weight, %4, %3, %2, %3, %7)
  %16 : Float(2, 3, 20, 20, strides=[1200, 400, 20, 1], requires_grad=0, device=cpu) = aten::add(%18, %z, %7) # jit_test.py:24:0
  return (%16)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65453

Reviewed By: zdevito

Differential Revision: D31162489

Pulled By: ZolotukhinM

fbshipit-source-id: 0d1c1d607cb612df47c64f173d9f4c9e8b1d6c49
2021-09-24 11:07:40 -07:00
f3587f6bfa Remove THC ScalarConvert (#65471)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65471

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31148182

Pulled By: ngimel

fbshipit-source-id: bbf74e36a3d91a7be3e47199981440c68a2f645f
2021-09-24 10:29:51 -07:00
5b2a7eaa03 [codemod][fbcode/caffe2] Apply all buildifier fixes
Test Plan: Visual inspection. Sandcastle.

Reviewed By: zsol

Differential Revision: D31170304

fbshipit-source-id: ee56312b5262247bb5a2e68a66d51f6cb3a0bf82
2021-09-24 09:03:29 -07:00
b858993c97 Fix engine check for case where grad is a subclass (#65568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65568

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31158089

Pulled By: albanD

fbshipit-source-id: 2a77df9b6340107de02a043b57a36cb7ae68df34
2021-09-24 08:41:19 -07:00
e742839f0e Fix autograd engine test in python_dispatch (#65567)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65567

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31158090

Pulled By: albanD

fbshipit-source-id: 651b78016ad978c7419343554ce7ceffd54aef1b
2021-09-24 08:39:52 -07:00
ef9e560796 [Static Runtime] Add aten::remainder out variant (#64967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967

Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder`

Reviewed By: d1jang

Differential Revision: D30915469

fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9
2021-09-24 07:51:39 -07:00
b003b2a9c0 [Static Runtime] Add record functions (#64698)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64698

Reviewed By: hlu1

Differential Revision: D30747191

fbshipit-source-id: 7ded6ea9bd36b5e3343d1efa9f3c92e02ff6d7f8
2021-09-24 07:20:17 -07:00
fd24e1b61f add OpInfo for torch.repeat_interleave (#65455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65455

Addresses facebookresearch/functorch#103.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31111696

Pulled By: zou3519

fbshipit-source-id: 4fa73708fa915cb21adbba9cb8fd2b8f75bcd3e0
2021-09-24 07:16:08 -07:00
d85e12a5bf add OpInfo for torch.argsort (#65454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65454

Addresses facebookresearch/functorch#103.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31111700

Pulled By: zou3519

fbshipit-source-id: ec4babd2fcdcea856ba0ee8db0fd8f42b87269f3
2021-09-24 07:14:41 -07:00
ca66698202 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31166199

fbshipit-source-id: 3fb46d64aba5e7c443b70beda77338f2ee63a99e
2021-09-24 02:57:37 -07:00
cc4db35205 [TensorExpr] Break circular dependency of shared pointers in MemDependencyChecker. (#65600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65600

Previously AccessInfo owned two maps: dependencies_ and dependents_,
which represented edges in the dependency graph. These two maps were
holding shared pointers and thus each edge immediately became a cycle,
which resulted in memory leaks. This PR makes one end of these
edges a weak pointer, thus breaking the loop.

Test Plan: buck test mode/dbgo-asan-ubsan //search/lib/query_expansion/candidate_generator/test:transliteration_expander_test -- --exact 'search/lib/query_expansion/candidate_generator/test:transliteration_expander_test - TransliterationExpander.romanizationByLocaleTest'

Reviewed By: bertmaher

Differential Revision: D31163441

Pulled By: ZolotukhinM

fbshipit-source-id: 9cef921f5c9293f1237144d1ee92e31f3e44c00a
2021-09-23 23:33:36 -07:00
01720d6a23 [JIT] constant object compilation unit ref fix (#65442)
Summary:
// A non owning pointer to a type. When a class get inserted as a constant
// into a graph, if we used a strong pointer we would have a circular reference
// from Object -> CompilationUnit and CompilationUnit -> Graph (which owns the
// Constant Object)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65442

Reviewed By: ezyang

Differential Revision: D31101962

Pulled By: eellison

fbshipit-source-id: f1c1cfbe5a8d16a832cad7ba46e2a57a98670083
2021-09-23 22:43:02 -07:00
f83250fd4e Revert logic in mobile/type_parser.cpp (#65556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65556

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D31149080

Pulled By: ansley

fbshipit-source-id: d5986d019fc2c47fd45cc10f0397499cc1e81329
2021-09-23 22:26:02 -07:00
20143bf07f [ONNX] Deprecate use_external_data_format param from torch.onnx.export() function. (#62257) (#64382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64382

* This `use_external_data_format` parameter is used for large models that cannot be exported because of the 2GB protobuf limit.

* When `use_external_data_format` is set to True, the model is exported in ONNX external data format, in which case some of the model parameters are stored in external binary files and not in the ONNX model file itself.

* This PR marks this parameter as DEPRECATED and checks the model proto size in code instead of relying on the user; if the size is larger than 2GB, then `use_external_data_format = True` is applied automatically.
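
As a hedged sketch of the new behavior (the model and filename are illustrative, not from this PR):

```
import torch

model = torch.nn.Linear(4, 4)
x = torch.randn(1, 4)
# No use_external_data_format argument needed anymore: export decides
# internally, switching to external data files if the serialized proto
# would exceed the 2GB protobuf limit.
torch.onnx.export(model, (x,), "model.onnx")
```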

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905265

Pulled By: malfet

fbshipit-source-id: 82b4e17bfa6a8de2bfd700a5282c12f6835603cb

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-09-23 22:20:48 -07:00
478d4cf883 [ONNX] Deprecated the example_outputs param from torch.onnx.export() function. (#62815) (#64380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64380

* `example_outputs` is used to determine the type and shape of the outputs without tracing the execution of the model, and it had to be provided when exporting a ScriptModule or ScriptFunction via the export() function.

* Since we can work out `example_outputs` internally instead of requiring it from the user, we deprecated this argument in the export() function to improve the experience of calling it.
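
A minimal sketch of the simplified call for a ScriptModule (names are illustrative):

```
import torch

scripted = torch.jit.script(torch.nn.Linear(4, 4))
x = torch.randn(1, 4)
# example_outputs no longer has to be supplied; output types and shapes
# are now worked out internally during export.
torch.onnx.export(scripted, (x,), "scripted.onnx")
```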

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905266

Pulled By: malfet

fbshipit-source-id: d00b00d7d02b365d165028288ad915678caa51f2

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-09-23 22:20:46 -07:00
9323ea2195 [ONNX] minor doc improvements and cleanup (#62514) (#64373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64373

* Fix some bad formatting and clarify things in onnx.rst.
* In `export_to_pretty_string`:
    * Add documentation for previously undocumented args.
    * Document that `f` arg is ignored and mark it deprecated.
    * Update tests to stop setting `f`.
    * Warn if `_retain_param_name` is set.
* Use double quotes for string literals in test_operators.py.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905271

Pulled By: malfet

fbshipit-source-id: 3627eeabf40b9516c4a83cfab424ce537b36e4b3
2021-09-23 22:20:44 -07:00
9965163751 [ONNX] Add supplementary tests and description for custom_opsets param from torch.onnx.export() function. (#62085) (#64372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64372

The custom_opsets arg from torch.onnx.export() does not need to be removed.

This adds some supplementary description and tests for easier understanding.
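
A hedged usage sketch of the kept argument (the domain name and version below are made up for illustration):

```
import torch

model = torch.nn.Linear(4, 4)
x = torch.randn(1, 4)
# custom_opsets maps a custom operator domain to the opset version that
# should be recorded for it in the exported model.
torch.onnx.export(model, (x,), "model.onnx",
                  custom_opsets={"my.custom.domain": 2})
```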

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905269

Pulled By: malfet

fbshipit-source-id: 489fbee0e2c1d6c5405c9bf7dfd85223ed981a44

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-09-23 22:20:42 -07:00
fb71ccf0f1 [ONNX] Remove strip_doc_string param from torch.onnx.export() function. (#61712) (#64371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64371

As of now, the "strip_doc_string" parameter is described as below:

strip_doc_string (bool, default True): do not include the field ``doc_string`` from the exported model. Otherwise the field will mention the source code locations for the model.

This is usually useless to users who just want to convert a PyTorch model to an ONNX one; the source code locations only provide benefits when the user wants to debug the export process.

To make the export() function friendlier by providing fewer parameters, we combined "strip_doc_string" into the "verbose" parameter. If a user sets verbose to True, it means they need some log information for debugging the export process, which is similar to the purpose of the strip_doc_string parameter.

But the usage of these two arguments is opposite: setting verbose to True means we want to print log information to help debugging, which means strip_doc_string should be False. This is how we replace strip_doc_string with the verbose argument in this PR.

This PR still keeps strip_doc_string in the torch.onnx.export() function for backward compatibility, while its usage has been combined with the verbose argument.
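
A minimal sketch of the combined behavior (model and filename are illustrative):

```
import torch

model = torch.nn.Linear(4, 4)
x = torch.randn(1, 4)
# verbose=True now also keeps doc_string (source code locations) in the
# exported model, i.e. the old strip_doc_string=False behavior.
torch.onnx.export(model, (x,), "model.onnx", verbose=True)
```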

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905268

Pulled By: malfet

fbshipit-source-id: 2f06eb805c01fe15ff7a1b4f6595c937ba716d60

Co-authored-by: fatcat-z <zhang-ji@outlook.com>
2021-09-23 22:20:40 -07:00
47d1ed60e1 [ONNX] Remove argument _retain_param_name from torch.onnx.export() function. (#61702) (#64370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64370

As of now, the "_retain_param_name" parameter has no description on the PyTorch docs website. According to the code, this argument determines whether we keep the original parameter names of the PyTorch model in the final ONNX graph. If this is False, those original parameter names are replaced with a series of integers starting from 1.

Since setting numbers as parameter names makes no sense to users, we removed this argument from the torch.onnx.export() function to improve the experience of calling it.

This PR still keeps it in the torch.onnx.export() function for backward compatibility, while all backend logic has been changed to behave as if _retain_param_name were set to True.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905270

Pulled By: malfet

fbshipit-source-id: ca60757ca17daaff937e9f08da42596086795f4a

Co-authored-by: fatcat-z <zhang-ji@outlook.com>
2021-09-23 22:18:52 -07:00
bc02255d5e Revert D30721329: [pytorch][PR] Enable CUPTI for kineto by default on windows.
Test Plan: revert-hammer

Differential Revision:
D30721329 (7dbc21bc2b)

Original commit changeset: aa1af47df8cc

fbshipit-source-id: 565d50841e19a45f8798a490aa3aa6b9f69ca404
2021-09-23 22:14:32 -07:00
8c7caedbb8 avoid re-allocation of view_shape for every tensor in torch.meshgrid (#62908)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62908

Reviewed By: mruberry

Differential Revision: D31064165

Pulled By: dagitses

fbshipit-source-id: 3ddc3088e70fc8ef6dcf56ceb67fd20991169af1
2021-09-23 21:41:51 -07:00
963ae25e41 Migrate THCAtomics to ATen (#65470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65470

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31148184

Pulled By: ngimel

fbshipit-source-id: aaac3dfb5f2c6f88e9bd922b3a56d0a16a861e17
2021-09-23 19:43:34 -07:00
c73f0e457e Tensor and device is_hpu methods (#65408)
Summary:
Add is_hpu() methods for Aten tensor and device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65408

Reviewed By: malfet

Differential Revision: D31144227

Pulled By: wconstab

fbshipit-source-id: 115f4df4b8d54e6913dd51af7b6d4cacf6dd43c5
2021-09-23 18:42:45 -07:00
d78b3909e8 Explicitly destory ProcessGroup in allgather_coalesced_async test (#65513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65513

The error in #65231 means some child threads were destructed before
being joined. I added some tracing and prints and found that, in the
failed tests, all `assertEqual` checks passed, but the `ProcessGroupGloo`
destructor wasn't called in one of the processes. This could be because
the only guarantee Python makes is that garbage collection MAY
happen before the program exits. This commit adds an explicit
`destroy_process_group()` call to alleviate the problem.
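
A minimal sketch of the teardown pattern this commit applies (the single-process gloo setup is shown only to make the snippet self-contained):

```
import torch.distributed as dist

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
# ... run the collectives and assertions under test ...
# Explicitly tear the group down instead of relying on the garbage
# collector to run the ProcessGroup destructor before exit.
dist.destroy_process_group()
```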

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31134174

Pulled By: mrshenli

fbshipit-source-id: 2e42fe93d3f16ce34681b591afc15a6ac0b9fab6
2021-09-23 18:35:08 -07:00
b77c979102 [quant][fx][graphmode] Make FixedQParam ops work for dtypes other than quint8 (#65484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65484

This PR makes sure we only use FixedQParamFakeQuantize for quint8 dtype and allows user
to use other dtypes for ops like sigmoid, this is useful for producing reference pattern for
these ops that can be used in other backends like TensorRT

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D31120377

fbshipit-source-id: 3b529d588e2b6ff0377a89c181f6237f8f0cc2f5
2021-09-23 18:29:56 -07:00
a2e631b874 Windows GHA: Only upload artifacts if prev steps pass (#65561)
Summary:
Fixes a task in https://github.com/pytorch/pytorch/issues/65439

And removes the Upload to GitHub step as it's redundant with the S3 step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65561

Reviewed By: seemethere

Differential Revision: D31157685

Pulled By: janeyx99

fbshipit-source-id: cd23113a981eb4467fea3af14d916f6f2445a02b
2021-09-23 17:38:39 -07:00
7dbc21bc2b Enable CUPTI for kineto by default on windows. (#62175)
Summary:
It fixes nothing.

For tracking this PR, please refers to https://github.com/pytorch/kineto/issues/356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62175

Reviewed By: ezyang

Differential Revision: D30721329

Pulled By: gdankel

fbshipit-source-id: aa1af47df8cc1b6f5ba2194447f62b902a6a9c84
2021-09-23 15:13:47 -07:00
f850d7ef2e [CoreML][OSS] Add Simulator tests (#65076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65076

ghstack-source-id: 138869950

Create a new conda environment: conda create --name coreml python=3.8
conda activate coreml
pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
pip install coremltools==5.0b5
cd pytorch
git fetch
git checkout gh/xta0/131/head
cd ios/TestApp/benchmark
mkdir ../models
python coreml_backend.py
Then test the model_coreml.ptl in the HelloWorld example

Test Plan:
1. CircleCI
2. Pytorch nightly builds

Reviewed By: hanton

Differential Revision: D30912268

fbshipit-source-id: 52b2ed1ad40e5949ee2755bca113119132dfc914
2021-09-23 14:57:01 -07:00
2a0208f4dc fixed comments referring fairscale master branch (#65531)
Summary:
replace comments referring fairscale master branch with main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65531

Test Plan:
buck build

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Reviewed By: H-Huang, anj-s

Differential Revision: D31132552

Pulled By: tmarkstrum

fbshipit-source-id: d3ee8920ab5cccad99f640934c21e8eee022e9b9
2021-09-23 14:37:58 -07:00
c015cbabf9 [codemod][fbcode/caffe2] Apply all buildifier fixes
Test Plan: Visual inspection. Sandcastle.

Reviewed By: zsol

Differential Revision: D31144864

fbshipit-source-id: f8e65fec69f88d03048df9edb98969d648eb6981
2021-09-23 14:03:19 -07:00
d07b2cb4ec [fx2trt] update the oss fx2trt exmaple (#65544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65544

ATT

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D31147750

fbshipit-source-id: eacc1c9157a32d6deebbfe9ff2aaae13c434e72b
2021-09-23 13:45:22 -07:00
71704349aa [DDP] Allow await of custom buffer reduction in backward (#64515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515

For performance reasons, we would like to ensure that we can await
user collectives as part of custom buffer reduction in parallel with other work.
As a result, this adds support for returning futures from custom buffer hooks and awaiting
those futures at the end of the backward pass.

Also added some docs to clarify how to use these APIs.
ghstack-source-id: 138793803

Test Plan: I

Reviewed By: zhaojuanmao

Differential Revision: D30757761

fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44
2021-09-23 13:02:53 -07:00
36485d36b6 Docathon: Add docs for nn.functional.*d_max_pool (#63264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63264

Adding docs to max_pool to resolve docathon issue #60904

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31071491

Pulled By: Gamrix

fbshipit-source-id: f4f6ec319c62ff1dfaeed8bb6bb0464b9514a7e9
2021-09-23 11:59:50 -07:00
1f0f246fe2 Automated submodule update: FBGEMM (#65360)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 0108d4f552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65360

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D31061552

fbshipit-source-id: 8bce5157a281e38cad5d5d0e9dcd123beda39735
2021-09-23 11:47:15 -07:00
65fbd2c12b [ci] do not continue through error on trunk (#65503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65503

There are two reasons for this change:
- I don't think trunk jobs should have different behavior than their PR equivalents.
- Continuing through errors makes it challenging to figure out what is
actually failing, especially given the poor UX of GitHub Actions when it
comes to reading logs

Example: https://github.com/pytorch/pytorch/runs/3680114581. Here, there
is a failure but the rendered test results tell me everything is
successful. I have no idea how to quickly tell what failed; the log is so long
and terms like "error", "failure", etc. are common enough that searching
it is very difficult.

Differential Revision:
D31130478
D31130478

Test Plan: Imported from OSS

Reviewed By: ezyang

Pulled By: suo

fbshipit-source-id: 15a80475ca4c49644c0f7b779f5c6c2ffeb946a6
2021-09-23 11:36:03 -07:00
7e772e7685 Update link to tutorial on defining NN modules (#65534)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65527. Please, see my comment in the issue: https://github.com/pytorch/pytorch/issues/65527#issuecomment-925863193. The file was renamed in ce58d5904c (diff-e5ef486bd89eb38de15752211d9437953681b8caa8f44d7c86bb820d13151df2), but the link in this repository was not updated.

It doesn't change the fact that the old link is still working, but I guess this has to be fixed in [pytorch/tutorials](https://github.com/pytorch/tutorials) instead of here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65534

Reviewed By: soulitzer

Differential Revision: D31144269

Pulled By: H-Huang

fbshipit-source-id: f70744a21113b7dc84510e2992d87f0fed793985
2021-09-23 11:26:50 -07:00
cac7c1a192 [ci] remove auto-label-rocm workflow (#65558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65558

This will temporarily be replaced by an FB-internal workflow that does
the exact same thing, pending a migration of this workflow to probot.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie, driazati

Differential Revision: D31149105

Pulled By: suo

fbshipit-source-id: 2aa122820ae3b5286774501f5ecfe052bc949dea
2021-09-23 11:15:35 -07:00
c731be8066 [BE] Use DispatchKeySet in check_base_legacy_new (#65535)
Summary:
Refactor:
```
TORCH_CHECK ( key == a ||
              key == b ||
              key == c,
              "expected key to be in ", a, " or ", b , " or ", c,
              " but got ", key);
```
into
```
TORCH_CHECK( key_set.has(key),
            "expected key to be in ", key_set,
            " but got ", key );
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65535

Reviewed By: wconstab

Differential Revision: D31144239

Pulled By: malfet

fbshipit-source-id: 68a053041a38f043e688e491889dd7ee258f3db3
2021-09-23 11:01:23 -07:00
da166d4f12 Add a timeout argument to RPC shutdown() (#65425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65425

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan:
Imported from OSS

   python3 test/distributed/rpc/test_tensorpipe_agent.py -v -k test_wait_all_workers_timeout

Reviewed By: mrshenli

Differential Revision: D31092483

Pulled By: dracifer

fbshipit-source-id: 5b5e9f20b1d6602cf8cde3772678f721dddf0d78
2021-09-23 10:42:58 -07:00
97b535dabd [PyTorch] add fastToString for infer_schema (#64823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64823

We seem to spend noticeable time in vfprintf for this, and the number of arguments is almost always small enough to do this in just a few instructions.
ghstack-source-id: 138623354

Test Plan: Profile schema parsing, saw less time in vfprintf

Reviewed By: ezyang, dhruvbird

Differential Revision: D30860716

fbshipit-source-id: 09ef085cd6f93dc1eaa78790dde918ac60e67450
2021-09-23 10:15:40 -07:00
eb949464d6 [PyTorch] Fix missing moves in SchemaParser::parseArgument (#64839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64839

Resulted in some extra shared_ptr refcount bumps.
ghstack-source-id: 138623356

Test Plan: CI

Reviewed By: smessmer

Differential Revision: D30875749

fbshipit-source-id: 531f04c453f7410ed3d4ff054217f21a250be8e9
2021-09-23 10:14:22 -07:00
14307f7a56 [Static Runtime] Added logging to dump the model graphs (#65509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65509

With this change, we can get dumps of the model graphs by setting the env variable `PYTORCH_JIT_LOG_LEVEL=">>impl"` while running the model.
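
A minimal sketch of enabling the dump from Python (the level string is taken from the summary above; it must be set before the graphs run):

```
import os

os.environ["PYTORCH_JIT_LOG_LEVEL"] = ">>impl"
# ... construct and run the static runtime model here; the graph dumps
# are written as the model executes.
```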

Test Plan: buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine

Differential Revision: D31125797

fbshipit-source-id: d8979a4e138047518140e0eaecb46e012891b17c
2021-09-23 10:06:13 -07:00
767a104698 [quant] change observer FQNs generated in prepare step (#65420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65420

Context: In some FB use cases we have a need to map observer stats from a train model checkpoint to an inference model. We observed that some buffer names are different because the intermediate activation tensors
are generated differently across the train and inference models. More details in https://fb.quip.com/PtGcAR0S5CQP

Currently, for each observer (activation_post_process), the FQN of the module inserted is determined based on the FQN of the input tensor it is observing.

In this change, we make the observer FQN include the FQN of the op/module it is observing (along with an “input”/“output” suffix) rather than tensor/intermediate op names.

Before
```
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x);  x = None
    mods1_w = self.mods1.w
    mods1_w_activation_post_process_0 = self.mods1_w_activation_post_process_0(mods1_w);  mods1_w = None
    mods1_b = self.mods1.b
    linear = torch.nn.functional.linear(x_activation_post_process_0, mods1_w_activation_post_process_0, bias = mods1_b);  x_activation_post_process_0 = mods1_w_activation_post_process_0 = mods1_b = None
    linear_activation_post_process_0 = self.linear_activation_post_process_0(linear);  linear = None
    return linear_activation_post_process_0
```

After
```
def forward(self, x):
    mods1_input_activation_post_process_0 = self.mods1_input_activation_post_process_0(x);  x = None
    mods1_w = self.mods1.w
    mods1_w_activation_post_process_0 = self.mods1_w_activation_post_process_0(mods1_w);  mods1_w = None
    mods1_b = self.mods1.b
    linear = torch.nn.functional.linear(mods1_input_activation_post_process_0, mods1_w_activation_post_process_0, bias = mods1_b);  x_activation_post_process_0 = mods1_w_activation_post_process_0 = mods1_b = None
    mods1_output_activation_post_process_0 = self.mods1_output_activation_post_process_0(linear);  linear = None
    return mods1_output_activation_post_process_0
```

Test Plan:
python test/test_quantization.py test_observer_fqn

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31088652

fbshipit-source-id: 2f1526f578a13000b34cfd30d11f16f402fd3447
2021-09-23 09:08:10 -07:00
a012216b96 [nn] Fold : no batch dim (#64909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64907
Reference: https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64909

Reviewed By: cpuhrsch, heitorschueroff

Differential Revision: D30991087

Pulled By: jbschlosser

fbshipit-source-id: 91a37e0b1d51472935ff2308719dfaca931513f3
2021-09-23 08:37:32 -07:00
2a4d5e4c6d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31138547

fbshipit-source-id: ba134ae7f057c918eaefdc6310f7663e187e9749
2021-09-23 07:54:32 -07:00
9668a8a82d [DataPipe] Update Docstrings for Tar and ZipArchiveReader (#65500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65500

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31127241

Pulled By: NivekT

fbshipit-source-id: aed41aa192fe55e10ba67beda460fac70f2703c7
2021-09-23 07:20:08 -07:00
7e7be526c9 Add TORCH_SHOW_CPP_STACKTRACES to Contributing.md (#64052)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64052

Reviewed By: ezyang

Differential Revision: D31107779

Pulled By: Chillee

fbshipit-source-id: 2ad8ad40cd48e54fe711863c3c74df884a2e2de7
2021-09-22 22:53:19 -07:00
14949d2922 Add nn.function.hardsigmoid in acc_tracer (#65422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65422

hardsigmoid is used by mobile net v3 oss model.
This diff added hardsigmoid support in acc_tracer

Test Plan:
buck test glow/fb/fx/acc_tracer:test_acc_shape_inference
buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_hardsigmoid

Reviewed By: jfix71

Differential Revision: D30950304

fbshipit-source-id: 8fe4b4c6df29c06a73850d32f59321a9311f94f5
2021-09-22 20:57:42 -07:00
5525e9a591 Lock unpickling of source ranges
Summary:
The source is shared across all threads running the torchscript
interpreter, so if several threads encounter errors at once, they will all race
to unpickle the source, leading to memory corruption.

Test Plan:
Model 217993215_0 is the problematic model; I wasn't able to repro
the crash with requests stored in Hive, but I could easily by adding my
devserver (SMC tier predictor.bertrand) as a shadow tier to the model's tier
(inference_platform.predictor_model.prod.bi.217993215_latest).  (i.e., set
shadow_tier property to predictor.bertrand=1 to proxy 1% of traffic).

With this diff, the ASAN/TSAN errors go away.

Reviewed By: suo

Differential Revision: D31044009

fbshipit-source-id: 56f9ef3880e7cf09f334db71b4256e362b4de965
2021-09-22 20:41:02 -07:00
228141f939 [pytorch] more informative error msg from fbgemm embedding spmdm call (#65186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65186

FBGEMM JIT'ed EmbeddingSpMDM kernel just returns false when there's an error, delegating detailed error handling to the caller (since each framework like PyTorch and Caffe2 wants to do error handling differently). Much of the PyTorch code was simply reporting that there was "an" error without pinpointing exactly why it happened. This diff introduces more informative error messages, following what Caffe2 was doing.

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D31008300

fbshipit-source-id: b8d069af0692dc86dc642b18a9c68f22deaffea3
2021-09-22 20:13:34 -07:00
0ca1102609 [fx2trt] fuse permute + matmul using a pass instead of hardcoding it as a leaf module (#65482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65482

Currently we hardcoded permute + bmm in a module and tagged it as a leaf module during tracing. This diff introduces a pass to fuse permute + matmul into a single node.

TODO:
Fusion transformations of this kind share a lot of similar code, such as finding the fusion pattern and replacing the original nodes with the fused node. The current fx subgraph rewriter allows us to specify patterns that we want to replace, but we would need to extend it to allow specifying constraints on nodes' kwargs.

Reviewed By: yinghai

Differential Revision: D31022055

fbshipit-source-id: 13d1f18d79b09d371897ecde840f582ccaf5713a
2021-09-22 18:43:09 -07:00
fccaa4a3c8 [fx2trt] fix transpose unittest (#65481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65481

Previously we had `acc_ops.transpose`, but after a recent diff `torch.transpose` is mapped to `acc_ops.permute`. Here we clean up the fx2trt unittest for transpose and add support for negative indices in permute.

Reviewed By: wushirong

Differential Revision: D31115280

fbshipit-source-id: 58e689e6dd14181aea5186f3bb5b8745a07d0e51
2021-09-22 18:08:55 -07:00
2f67579864 [ddp] use named_params and named_buffers explicitly (#65181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65181

This PR changes `state_dict()` during sync to `named_parameters` and `named_buffers` explicitly. The underlying motivation is that `state_dict()` doesn't necessarily equal "params + buffers" in all cases: state_dict is used mainly for checkpointing, while params/buffers are used for training, and we might have cases where params/buffers take a different form than state_dict (i.e. in state_dict we might want to save small pieces of tensors while in training we want to concat the tensors together for performance reasons).
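
A minimal sketch of the distinction, using a stock module:

```
import torch

m = torch.nn.BatchNorm1d(4)
# Training views of the tensors, as now used explicitly during sync:
params = dict(m.named_parameters())  # weight, bias
buffers = dict(m.named_buffers())    # running_mean, running_var, ...
# Checkpoint-oriented view; not guaranteed to equal "params + buffers".
checkpoint = m.state_dict()
```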
ghstack-source-id: 138701159

Test Plan: wait for ci

Reviewed By: divchenko, rohan-varma

Differential Revision: D31007085

fbshipit-source-id: 4e1c4fbc07110163fb9b09b043ef7b4b75150f18
2021-09-22 17:32:54 -07:00
0eaf081018 [JIT] canonicalize aten::rsub (#65014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65014

ghstack-source-id: 138656948

Test Plan:
```
(pytorch) [maxren@devvm3115.atn0 ~/pytorch] python3 test/test_jit.py TestPeephole
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
........s......................
----------------------------------------------------------------------
Ran 31 tests in 0.393s

OK (skipped=1)
(pytorch) [maxren@devvm3115.atn0 ~/pytorch] python3 test/test_jit.py TestPeephole.test_normalized_rsub
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
.
----------------------------------------------------------------------
Ran 1 test in 0.015s

OK
```

Reviewed By: eellison

Differential Revision: D30941389

fbshipit-source-id: 03f0416d99090845c9bfb1e5fcf771d5f1d7a050
2021-09-22 17:20:46 -07:00
32f0387ee8 Bug in CosineAnnealingWarmRestarts in optim/lr_scheduler.py (#64758)
Summary:
## {emoji:1f41b} Bug
'CosineAnnealingWarmRestarts' object has no attribute 'T_cur'.
In the constructor of CosineAnnealingWarmRestarts, we call the constructor of the parent class (_LRScheduler), which in turn calls the step method of CosineAnnealingWarmRestarts.
The called method tries to update the object's attribute 'T_cur', which is not defined yet, so it raises the error.
This only happens when the last_epoch argument passed to CosineAnnealingWarmRestarts at initialization is 0 or greater.

![Bug_in_CosineAnnealingWarmRestarts](https://user-images.githubusercontent.com/77477328/132552212-70abc8b5-0357-4c35-90a9-832648bac607.png)
## To Reproduce

Steps to reproduce the behavior:

1. Pass 0 as the value for the last_epoch argument, OR
2. Pass a positive integer as the value for the last_epoch argument (see the minimal snippet after this list).
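
A minimal snippet reproducing the error, based on the description above:

```
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
# Raises AttributeError: 'CosineAnnealingWarmRestarts' object has no
# attribute 'T_cur', because _LRScheduler.__init__ calls step() before
# self.T_cur is assigned.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt, T_0=10, last_epoch=0)
```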

## Expected behavior

I only expected the 'CosineAnnealingWarmRestarts' object to be initialized.

## Environment

PyTorch version: 1.9.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.2
Libc version: glibc-2.31
Python version: 3.8.10  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.8.0-59-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA

## Additional context
We can solve this bug by moving the line 'self.T_cur = self.last_epoch' above the 'super(CosineAnnealingWarmRestarts, self).__init__()' call, since that initializes 'self.T_cur' on the object before step() runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64758

Reviewed By: ezyang

Differential Revision: D31113694

Pulled By: jbschlosser

fbshipit-source-id: 98c0e292291775895dc3566fda011f2d6696f721
2021-09-22 16:55:14 -07:00
b80bdcc73b Add register_module alias to nn.Module (#65174)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60397. I'm not sure how aliases are supposed to be implemented, but this is the most basic/direct way, IMO. As a side-effect, this implementation results in a "duplicate" doc entry, inheriting the one from `add_module`:

![monkey-patch](https://user-images.githubusercontent.com/7027770/133693137-8408d8e7-1f4f-436b-b176-57dda9bc3a32.png)

An alternative implementation could be:

```python
def register_module(self, name: str, module: Optional['Module']) -> None:
    r"""Alias for :func:`add_module`."""
    self.add_module(name, module)
```

which results in this documentation:

![image](https://user-images.githubusercontent.com/7027770/133693249-d969a71a-be44-489d-9633-4f38b44ab887.png)
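
A minimal usage sketch of the alias:

```
import torch

m = torch.nn.Module()
m.register_module("l1", torch.nn.Linear(1, 1))  # same as m.add_module("l1", ...)
print(m.l1)  # Linear(in_features=1, out_features=1, bias=True)
```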

Questions:
1. Should I replicate the tests? There are two for `add_module`: [test_add_module_raises_error_if_attr_exists](873255c6d9/test/test_nn.py (L1420-L1434)) and [test_add_module](873255c6d9/test/test_nn.py (L1837-L1855)).
2. This PR only adds `register_module` to `nn.Module`. There is an `add_module` in [`_RemoteModule`](https://github.com/pytorch/pytorch/blob/master/torch/distributed/nn/api/remote_module.py#L311-L312), which raises `NotSupported`, and there is another one in [`ConcreteModuleTypeBuilder`](873255c6d9/torch/_C/__init__.pyi.in (L468)), which means something else, I think. Should I do anything about them?

cc ngimel SsnL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65174

Reviewed By: soulitzer

Differential Revision: D31089717

Pulled By: jbschlosser

fbshipit-source-id: abd8d14a434fd8c7efa0bd8c242df56da33491e9
2021-09-22 16:37:28 -07:00
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.
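
For reference, a minimal eager-mode sketch of the kernel being fused, assuming "signed log1p" means sign(x) * log1p(|x|):

```
import torch

def signed_log1p(x):
    # Compress magnitudes logarithmically while preserving the sign.
    return torch.sign(x) * torch.log1p(torch.abs(x))
```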

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This clearly shows >15% improvement in performance of this kernel with NNC fusion.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
1c20b98b4b [iOS][CoreML] Check backend availability at runtime. (#65315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65315

ghstack-source-id: 138703808

Test Plan:
- OSS builds and BUCK builds
- CircleCI

Reviewed By: hanton

Differential Revision: D31048011

fbshipit-source-id: 824a8e32d65de2caf25e41efef2b022ddbb63156
2021-09-22 15:38:53 -07:00
2898ef7549 Minor ScanKernels.cu cleanup (#65350)
Summary:
- Replace THCNumerics with `at::_isnan`
- Replace `contiguous` with `expect_contiguous`
- Don't use `contiguous` on output tensors. Instead skip the copy and
  just create a new empty tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65350

Reviewed By: ezyang

Differential Revision: D31103501

Pulled By: ngimel

fbshipit-source-id: 9030869e28d6c570fad074fd0502076de8e2ab09
2021-09-22 15:34:01 -07:00
5739f77775 [DDP] Refactor and remove sync_params (#64514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64514

sync_params is a misnomer since we don't actually synchronize
parameters. While removing this I realized
`self._check_and_sync_module_buffers` does almost everything we need it to, so
just refactored that and made DDP forward call into it.
ghstack-source-id: 138684982

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30751231

fbshipit-source-id: add7c684f5c6c71dad9e9597c7759849fa74f47a
2021-09-22 14:12:51 -07:00
ce5981e431 [DDP] Custom buffer reduction (#64513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64513

Proposal: https://github.com/pytorch/pytorch/issues/63041
Support custom buffer reduction in DDP via hook
ghstack-source-id: 138655663

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30751152

fbshipit-source-id: 257a9d46bb178d8812d4ea5a4d9c6140b8a1791f
2021-09-22 14:11:35 -07:00
923f06621c Fix Windows ninja builds when MAX_JOBS is specified (#65444)
Summary:
Reported by cloudhan in https://github.com/pytorch/pytorch/pull/64733#issuecomment-924545463

Fixes regression introduced by 047e68235f

cc malfet seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65444

Reviewed By: dagitses, seemethere

Differential Revision: D31103260

Pulled By: malfet

fbshipit-source-id: 9d5454a64cb8a0b96264119cf16582cc5afed284
2021-09-22 14:04:31 -07:00
cbc3db8274 Create test for builtin tensorrt module in torch deploy (#63819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63819

ghstack-source-id: 138521664

Test Plan:
buck test mode/dev-nosan caffe2/torch/csrc/deploy:test_deploy_gpu

buck test mode/opt-split-dwarf caffe2/torch/csrc/deploy:test_deploy_gpu

Reviewed By: wconstab

Differential Revision: D30499301

fbshipit-source-id: 0bc165b4ed5be28ebb0becc65f292cf26368692f
2021-09-22 13:42:35 -07:00
72fc53ff27 .github: Add timeout for test step (#65486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65486

Adding this after observing jobs running for 6+ hours on `pytorch/pytorch-canary`; still trying to debug why they happen there, but this should resolve jobs running forever

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: ezyang, malfet, janeyx99

Differential Revision: D31117497

Pulled By: seemethere

fbshipit-source-id: 126a10e844bdef77c2852cc5c392e5f37f130f7e
2021-09-22 13:23:41 -07:00
f24bd43375 Changing type and name of local_used_maps to reflect that it is only one map (#65380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65380

Fixing bugs that arise when running setup.py develop

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31104844

Pulled By: jaceyca

fbshipit-source-id: acfd4cf316c71177df758ca55b470f51a17f776b
2021-09-22 11:35:33 -07:00
0fe86ac6c6 Fix torch.any documentation (#65310)
Summary:
Currently, the description of torch.any would be parsed like

```
param input
the input tensor.
```

However, it should be

```
Tests if any element in input evaluates to True.
```
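
For context, the documented behavior:

```
>>> torch.any(torch.tensor([False, True]))
tensor(True)
```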

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65310

Reviewed By: ezyang

Differential Revision: D31102918

Pulled By: soulitzer

fbshipit-source-id: 678ade20ba16ad2643639fbd2420c8b36fcd8bd7
2021-09-22 11:24:20 -07:00
a0dea074b2 Remove .data from benchmarks and tensorboard (#65389)
Summary:
Related to https://github.com/pytorch/pytorch/issues/30987 and https://github.com/pytorch/pytorch/issues/33628. Fix the following tasks:

- Remove the use of `.data` in all our internal code:
  - [x] `benchmarks/`
  - [x] `torch/utils/tensorboard/`

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23 albanD gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65389

Reviewed By: soulitzer

Differential Revision: D31093464

Pulled By: albanD

fbshipit-source-id: 3a9c8834fd544a59a1cc2b930ae538fd1d46b232
2021-09-22 11:16:59 -07:00
70a545b21e Add Tensor._make_wrapper_subclass (#65340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65340

I thought about a few possible ways of doing this.  The main hazard is
that if I create a CPU tensor that doesn't have any real storage, the
moment I actually try to access the data on the tensor I will segfault.
So I don't want to use _make_subclass on a "cpu meta tensor" because
the CPU meta tensor (with no subclass) is radioactive: printing it
will immediately cause a segfault.  So instead, I have to create
the CPU meta tensor AND subclass all in one go, and that means I need
another function for it.  One downside to doing it this way is
I need another overload for explicit strides, and in general it is
difficult to get the view relationships to all work out properly;
tracked at https://github.com/pytorch/pytorch/issues/65339
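
A hedged usage sketch of the new function (the subclass name is illustrative, and the keyword arguments shown are assumptions about the overload described above):

```
import torch

class WrapperTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        # Create the storage-less tensor and the subclass in one go, as
        # described above; size/dtype/device are mirrored from `elem`.
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), dtype=elem.dtype, device=elem.device)
```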

Fixes https://github.com/pytorch/pytorch/issues/62972
Fixes https://github.com/pytorch/pytorch/issues/62730

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31057231

Pulled By: ezyang

fbshipit-source-id: 73522769e093ae8a1bf0c7f7e594659bfb827b28
2021-09-22 11:10:47 -07:00
11ca641491 [docs] Add images to some activation functions (#65415)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65368. See discussion in the issue.

cc mruberry SsnL jbschlosser soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65415

Reviewed By: soulitzer

Differential Revision: D31093303

Pulled By: albanD

fbshipit-source-id: 621c74c7a2aceee95e3d3b708c7f1a1d59e59b93
2021-09-22 11:05:29 -07:00
158393e1a1 Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31117318

Pulled By: anjali411

fbshipit-source-id: 825401df98695c48bf9b320be54585f6aff500bd
2021-09-22 11:01:19 -07:00
db4b68b3ac Back out "Eagerly populate python_error::what() when TORCH_SHOW_CPP_STACKTRACES=1"
Summary: Original commit changeset: 9cfda47cafb3

Test Plan: unland

Reviewed By: ezyang

Differential Revision: D31116643

fbshipit-source-id: 631eea446ed48c63ca39281d24163a2eadbe8d12
2021-09-22 10:37:27 -07:00
b3ec88f41f ugh (#65477)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65477

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D31115936

Pulled By: suo

fbshipit-source-id: fb16911a683713fdc2393bfe7150fc29c7d6814f
2021-09-22 10:15:41 -07:00
152f0236c3 Revert D31082693: Fix autograd engine checks and update InputMetadata
Test Plan: revert-hammer

Differential Revision:
D31082693 (9324d682fd)

Original commit changeset: cb551cd438c6

fbshipit-source-id: fc60f86b80fc70058984df6bccbf240d27f5843e
2021-09-22 10:00:08 -07:00
7c9a278895 fix trailing newlines (#65474)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65474

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31114952

Pulled By: suo

fbshipit-source-id: 3b8cde2098635450c3e22571a401f78e4e54e9e0
2021-09-22 09:48:34 -07:00
508845f2b5 [quant] AO migration of the torch/quantization/quantize_fx.py and torch/quantization/fx/* (#65033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65033

1. Move the file:
```
hg mv caffe2/torch/quantization/fx caffe2/torch/ao/quantization/fx
hg mv caffe2/torch/quantization/quantize_fx.py caffe2/torch/ao/quantization/quantize_fx.py
```
2. Create new files
```
touch caffe2/torch/quantization/quantize_fx.py
touch caffe2/torch/quantization/fx/__init__.py
```
3. import things in the new files
4. add tests to test/quantization/ao_migration/test_quantization_fx.py
this is because we have some fx import in quantize_fx and fx/*.py

Test Plan: buck test mode/dev //caffe2/test:quantization

Reviewed By: vkuzo, z-a-f

Differential Revision: D30949749

fbshipit-source-id: 9e5d4d039c8a0a0820bc9040e224f0d2c26886d3
2021-09-22 09:29:15 -07:00
762c2276e1 feed model merge net lower benchmark (#65191)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65191

Test Plan:
run command:
buck run mode/opt -c python.package_style=inplace hpc/new/models/feed/benchmark:feed_lower_benchmark

example output:
Eager, BS: 2048, TFLOP/s: 253.25, Time per iter: 4.49ms, QPS: 456289.25
TensorRT, BS: 2048, TFLOP/s: 162.30, Time per iter: 7.00ms, QPS: 292426.58

Reviewed By: yinghai

Differential Revision: D31010288

fbshipit-source-id: f30b520eca9508439588bcf48476b1b1edfb09af
2021-09-22 09:21:18 -07:00
bcc6e3ab5e add python API to print all operators that have kernels registered to a particular DispatchKey (#63575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63575

Test Plan: Imported from OSS

Reviewed By: ezyang, Chillee

Differential Revision: D30426919

Pulled By: bdhirsh

fbshipit-source-id: b0e487e48dfe02f7b9d678403f0a2b5bfe146f4e
2021-09-22 09:15:55 -07:00
9324d682fd Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31082693

Pulled By: anjali411

fbshipit-source-id: cb551cd438c6ca40b0f18a4d0009e0861cf0fd4e
2021-09-22 07:49:52 -07:00
f90d9b48db test_neg_view: preseve sign of sample input (#63010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63010

This changes `test_neg_view` to call the operator with the same numeric values as the original sample input.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31082824

Pulled By: anjali411

fbshipit-source-id: 7d50f99dc0d1343247e366cbe9b0ca081bd0a9b1
2021-09-22 07:47:42 -07:00
9d17f21e46 Added PandasDataframeWrapper (#65411)
Summary:
- Added `PandasDataframeWrapper` around `pandas` functions to easily drop-and-replace `torcharrow` for Facebook internal use cases
- Updated relevant datapipe/dataframe use sites to use the new `PandasDataframeWrapper` instead of calling `pandas` functions directly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65411

Reviewed By: VitalyFedyunin, hudeven

Differential Revision: D31087746

Pulled By: Nayef211

fbshipit-source-id: 299901f93a967a5fb8ed99d6db9b8b9203634b8f
2021-09-22 07:42:45 -07:00
3c6d9fd124 Eagerly populate python_error::what() when TORCH_SHOW_CPP_STACKTRACES=1 (#65376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65376

Let's suppose there's a bug in PyTorch and python_error gets thrown
and never gets caught.  Typically, you'll get a very useless error
message like this:

```
terminate called after throwing an instance of 'python_error'
  what():
  Aborted (core dumped)
```

Now, you'll get:

```
what():  unknown Python error (for more information, try rerunning with TORCH_SHOW_CPP_STACKTRACES=1)
```

and with TORCH_SHOW_CPP_STACKTRACES=1 you'll get:

```
what():  error message from Python object
```

If we're OK with making Python exceptions go even slower, we could
eagerly populate unconditionally.  I'm also not so happy we don't get
a Python backtrace or the Python error name, that's worth improving
(this is a minimal diff to get the discussion going.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31067632

Pulled By: ezyang

fbshipit-source-id: 9cfda47cafb349ee3d6853cdfb0f319073b87bff
2021-09-22 07:12:28 -07:00
2c7df1360a Bump torch version to 1.11 (#65435)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65435

Reviewed By: zhouzhuojie

Differential Revision: D31099045

Pulled By: malfet

fbshipit-source-id: 6ae6ca8a4b652fc51ee3138c800d067e144acbaa
2021-09-22 07:07:16 -07:00
96383ca704 Unify the output pathname of archive reader and extractor (#65424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65424

This PR is re-implementation for https://github.com/facebookexternal/torchdata/pull/93
Same PR has landed into torchdata https://github.com/facebookexternal/torchdata/pull/157

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31090447

Pulled By: ejguan

fbshipit-source-id: 45af1ad9b24310bebfd6e010f41cff398946ba65
2021-09-22 06:34:29 -07:00
e331beef20 Delete code coverage jobs from CI (#65362)
Summary:
As it does not seem useful to lots of people, see https://fb.workplace.com/groups/1144215345733672/posts/2062909540530910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65362

Reviewed By: janeyx99, bdhirsh

Differential Revision: D31061945

Pulled By: malfet

fbshipit-source-id: 912ed92cc901a370a40448f1127c3ba43640ac43
2021-09-22 05:38:35 -07:00
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851db6f3ca52f9a4455ce2090c64a223d.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
feefc94573 [fx2trt] Use itensor_to_tensor_meta to track the TensorMeta info for ITensor node (#65427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65427

Previously we added an input_tensor_meta argument to the dequantize function. This is a bit hacky since it creates a dependency
on the arguments of dequantize: if there are passes that change the input, we would need to update the tensor meta as well.

Test Plan:
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py

Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31094274

fbshipit-source-id: 5e40648d3081e2363f3a70bcc9745df4a8190ad3
2021-09-22 00:02:31 -07:00
64d3c7388f [RELAND] Enable ncclAvg for reductions (#62835)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/62303.

Reverts the revert, and restores some diffs that were mysteriously missing from the reverted revert. I think some of the diffs I pushed to the original PR raced with its import or landing, such that the original PR's merge didn't pick up all the diffs I wanted. I don't know enough about the landing process to do more than speculate wildly, but hopefully this resubmit sorts things out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62835

Reviewed By: zhouzhuojie, seemethere, janeyx99, heitorschueroff

Differential Revision: D30999982

Pulled By: malfet

fbshipit-source-id: 1f70ab4055208f1c6a80c9fc9fbe292ce68ecaa9
2021-09-21 18:09:45 -07:00
3f5f721ab3 Pass through allow-list from prepare_qat into propagate_qconfig_ to allow custom mapping and custom QAT module (#65119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65119

PyTorch Quantization: allow prepare_qat to include custom modules by passing an allow_list into prepare_qat.

When implementing a custom module and custom mapping for Quantization Aware Training (QAT), we need to add the custom module to the mappings and to the allow_list during prepare_qat. The allow_list needs to be surfaced to propagate_qconfig_.

Test Plan: relying on general unit test

Reviewed By: supriyar

Differential Revision: D30982060

fbshipit-source-id: 1114115b6a3b853238d33d72b5cbaafc60f463e0
2021-09-21 17:15:25 -07:00
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
27faa7a560 [ONNX] Support torch.isfinite export (#64759)
Summary:
Pull Request resolved:  https://github.com/pytorch/pytorch/issues/64754

1. onnx::IsInf was introduced in opset 10 and onnx::IsNaN in opset 9 -> isfinite = not(or(isinf, isnan)) -> requires opset 10
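
A hedged sketch of that composition as an ONNX symbolic (not necessarily the exact code added in this PR):

```
def isfinite(g, self):
    # isfinite(x) := not(isinf(x) or isnan(x)); needs opset 10 for IsInf.
    return g.op("Not", g.op("Or", g.op("IsInf", self), g.op("IsNaN", self)))
```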

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64759

Test Plan: Imported from OSS

Reviewed By: seemethere, bdhirsh

Differential Revision: D31060760

Pulled By: malfet

fbshipit-source-id: 499ecd6cc55ea881b8a57e6a9a4fb38eaaee5242
2021-09-21 15:47:48 -07:00
5aa33770f5 .circleci: Remove Windows workflows from Circle (#64959)
Summary:
Removes Windows CI from Circle

Will go in after https://github.com/pytorch/pytorch/pull/65094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64959

Reviewed By: soulitzer

Differential Revision: D31095374

Pulled By: janeyx99

fbshipit-source-id: b0d13a59aa8c6e2f85dbd9c343cac395c4e64475
2021-09-21 15:32:24 -07:00
a1216061c1 [DataPipe] Fix deepcopy filehandle for Mapper and in-place modification for IterableWrapper (#65220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220

Fixes #65221

- Remove deepcopy from Mapper to support file handles
- Convert `IterableWrapper` to deepcopy iterable instance within each iterator to prevent in-place modification (different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (prevent using `dp` that is module reference)

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31021886

Pulled By: ejguan

fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
2021-09-21 14:29:40 -07:00
73c4bfc30a [ONNX] Add log10 symbolic (#63418) (#64374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64374

Fixes #61332

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30919609

Pulled By: msaroufim

fbshipit-source-id: f474376bbf7b59677b10565f316384eca59dba43

Co-authored-by: Shubham Bhokare <shubhambhokare@gmail.com>
2021-09-21 13:30:59 -07:00
1fec9cd76b [Fixed] Enable Half, BFloat16, and Complex dtypes for coo-coo sparse matmul [CUDA] (#59980)
Summary:
This PR enables Half, BFloat16, ComplexFloat, and ComplexDouble support for matrix-matrix multiplication of COO sparse matrices.
The change is applied only to CUDA 11+ builds.

`cusparseSpGEMM` also supports `CUDA_C_16F` (complex float16) and `CUDA_C_16BF` (complex bfloat16). PyTorch also supports the complex float16 dtype (`ScalarType::ComplexHalf`), but there is no convenient dispatch, so this dtype is omitted in this PR.

cc nikitaved pearu cpuhrsch IvanYashchuk ezyang anjali411 dylanbespalko mruberry Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59980

Reviewed By: ngimel

Differential Revision: D30994115

Pulled By: cpuhrsch

fbshipit-source-id: 4f55b99e8e25079d6273b4edf95ad6fa85aeaf24
2021-09-21 13:03:40 -07:00
8bab468943 Reduce test size for max_pool (#65336)
Summary:
Fixes OOM in slow gradcheck tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65336

Reviewed By: malfet

Differential Revision: D31059007

Pulled By: albanD

fbshipit-source-id: 2dd6967d88663558e37f8c0836ad33333c92dfb5
2021-09-21 12:57:02 -07:00
cd813f16bf Add functional api for nn.Module (#61447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58839

After discussing with albanD he proposed this simple design.

Let's iterate over the idea here :).

Thanks.

The main idea of this PR is to use reparametrization that is reverted at the end of the functional call.
This leaves the original model unchanged. In this scenario the module is created without parameters, so the forward pass will hard-error if not all parameters are specified.

``` python
import torch
import torch.nn.utils._stateless

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.l1(x)

mod = MyModule()
print('weight before', mod.l1.weight)
x = torch.rand((1, 1))
parameters = {"l1.weight": torch.nn.Parameter(torch.tensor([[1.0]])),
              "l1.bias": torch.nn.Parameter(torch.tensor([0.0]))}
res = torch.nn.utils._stateless.functional_call(mod, parameters, x)
print('Functional call input ', x, ' and result ', res)
print('weight after', mod.l1.weight)
```
Output
```
weight before Parameter containing:
tensor([[-0.4419]], requires_grad=True)

Functional call input tensor([[0.3531]]) and result tensor([[0.3531]], grad_fn=<AddmmBackward>)

weight after Parameter containing:
tensor([[-0.4419]], requires_grad=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61447

Reviewed By: soulitzer

Differential Revision: D31082765

Pulled By: albanD

fbshipit-source-id: ba814d0f9162fb39c59989ca9a8efe160405ba76
2021-09-21 12:39:43 -07:00
c245632e2e Use higher timeout for TSAN tests. (#65391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65391

TSAN tests are much slower than the usual dev/opt modes, about 5-10x.

As a result, for TSAN build mode we use a much higher timeout for distributed
tests.
ghstack-source-id: 138584613

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D31076575

fbshipit-source-id: 44a485f07101deac536470ceeff2a52cac4f9e0b
2021-09-21 12:08:27 -07:00
28bfdbb066 OpInfo for nn.functional.batch_norm (#63218)
Summary:
Addresses https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

* `torch.batch_norm` exists, but it takes an extra argument, `cudnn_enabled`, that is absent from the functional variant. The functional variant passes it to `torch.batch_norm` here: https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L2282. When an alias is registered, `test_variant_consistency_jit` fails with this error:
    ```python
    File "/home/krshrimali/Documents/Projects/Quansight/pytorch/test/test_ops.py", line 457, in _test_consistency_helper
    variant_forward = variant(cloned,
    TypeError: batch_norm() missing 1 required positional arguments: "cudnn_enabled"
    ```
    * I'm not sure of a solution to this since, AFAIK, there is no way to pass a lambda wrapper for an alias. Hence, I've skipped adding this as an alias there (the signature mismatch is sketched below).
    * On second thought, is this even an alias?
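
For reference, a small sketch of the signature mismatch (values are illustrative):

```python
import torch

x = torch.randn(4, 3)
running_mean, running_var = torch.zeros(3), torch.ones(3)

# Functional variant: no cudnn_enabled argument.
torch.nn.functional.batch_norm(x, running_mean, running_var)

# torch.batch_norm requires cudnn_enabled explicitly, so it cannot be
# registered as a drop-in alias.
torch.batch_norm(x, None, None, running_mean, running_var,
                 False, 0.1, 1e-5, torch.backends.cudnn.enabled)
```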

cc: mruberry zou3519 kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63218

Reviewed By: bdhirsh

Differential Revision: D31019785

Pulled By: zou3519

fbshipit-source-id: 2a834d05835da975289efc544a7ad7e98c99438f
2021-09-21 11:35:34 -07:00
9afdf017dc Add force_on_cpu test to win cuda10.2 on GHA (#65094)
Summary:
Part of migrating from Circle.

Once we get a successful force_on_cpu test, we can move it to trunk only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65094

Reviewed By: seemethere

Differential Revision: D31086289

Pulled By: janeyx99

fbshipit-source-id: e1d135cc844d51f0b243b40efb49edca277d9de8
2021-09-21 11:14:15 -07:00
00b732e98b Remove orphan from cuDNN persistent note (#65160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60009.

Since the document is properly [included](https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/rnn.py#L799), [`:orphan:` doesn't need to be used in included documents](https://github.com/sphinx-doc/sphinx/issues/6787#issuecomment-549256840), and no warning is emitted in my local build when it is removed, I think it can be removed.

The artifact reported in https://github.com/pytorch/pytorch/issues/60009 can be seen in 3 pages: [torch.nn.RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN), [torch.nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM), and [torch.nn.GRU](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU).

cc ezyang suo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65160

Reviewed By: bdhirsh

Differential Revision: D31020280

Pulled By: ezyang

fbshipit-source-id: 6c3541e5a856a91cf1ce1d2db4d04f5d13118ee4
2021-09-21 11:09:47 -07:00
c0eb266c02 [Static runtime] Micro-optimization pass on GetLivenessMap (#65175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65175

More efficient use of the map API, and a more efficient way to insert all pairs of inputs/outputs into the liveness map
ghstack-source-id: 138547815

Test Plan: Time to enable static runtime down from ~8.7s to ~8.4s

Reviewed By: mikeiovine

Differential Revision: D30983897

fbshipit-source-id: fa6000bfd0fa0adfcd7c5922199ee32ada8c430e
2021-09-21 10:52:08 -07:00
6d7bc34b67 Make new_empty/new_ones/new_zeros/new_full respect subclass (#65169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65169

Previously these composite functions created a new tensor
using at::empty (or some other factory function) with TensorOptions,
which doesn't preserve the Python subclass. Making new_empty a
non-composite op and then routing the others through it makes them
respect the subclass. We could also make all of these non-composite,
but the current approach reduces the number of derivatives.yaml entries
and allows tracing the fill calls.
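
A minimal sketch of the behavior this change targets, using a trivial subclass (illustrative only):

```python
import torch

class MyTensor(torch.Tensor):
    pass

x = torch.randn(3).as_subclass(MyTensor)
# With new_empty and friends routed through a non-composite op, the
# instance factory methods preserve the Python subclass:
assert type(x.new_zeros(2, 2)) is MyTensor
```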

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31003713

Pulled By: ezyang

fbshipit-source-id: 19f906f1404a6b724769c49f48d123f407a561ff
2021-09-21 10:50:48 -07:00
04a5e45aeb [PyTorch] Compare Type pointers before calling operator== in EqualNode (#65352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65352

This can be a big win when it saves the virtual call to operator==, and the cost of the check is tiny.
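
The pattern, sketched in Python for illustration (the actual change is in C++):

```python
def types_equal(a, b):
    # Cheap identity check first; only fall back to the full structural
    # comparison (Type::operator== in the C++ code) when the objects differ.
    return a is b or a == b
```
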
ghstack-source-id: 138497657

Test Plan: Profiled ptvsc2_predictor_bench startup, inclusive time spent in EqualNode::operator() dropped from 0.8% to negligible

Reviewed By: hlu1

Differential Revision: D30974969

fbshipit-source-id: 9c3af36cffe709dfce477dcc49722536470264a0
2021-09-21 10:46:24 -07:00
88232b4cee Fix ENABLE_RECORD_KERNEL_FUNCTION_DTYPE build (#65370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65370

Forgot a wrapping 'namespace at' here!  And no contbuilds to test it.
ghstack-source-id: 138565579

Test Plan:
```
buck build --show-output -c pt.disable_per_op_profiling=0 -c pt.enable_record_kernel_dtype=1 -c pt.has_backtraces=1 fbsource//xplat/caffe2/fb/model_tracer:model_tracer
```

Reviewed By: JacobSzwejbka

Differential Revision: D31065923

fbshipit-source-id: ed4563fbd8f3c29f6b10ac8999c9010bd4359c97
2021-09-21 10:42:33 -07:00
eb4fb1ed81 THCTensor cleanup (#65369)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65369

Reviewed By: bhosmer

Differential Revision: D31071406

Pulled By: ngimel

fbshipit-source-id: bbc3f2781003333641524aeb692b944fd3ad8d7a
2021-09-21 10:28:19 -07:00
600df80296 [PT/ShardedTensor]Allow zero size local shard (#65007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65007

Relax the shard size check in ShardMetadata to allow zero-size local shards.

When sharding a tensor across N ranks, some ranks may be allocated an empty shard. Since we assume SPMD, the ranks with an empty shard still need to participate in all collectives, so ShardMetadata needs to allow this.

Test Plan: Unit tests and CLI

Reviewed By: jiaqizhai, wanchaol

Differential Revision: D30926566

fbshipit-source-id: afa562c94ffa8f8d91d65ddb4c348156d871dc36
2021-09-21 09:58:54 -07:00
7f6580a868 OpInfo: nn.functional.conv2d (#65233)
Summary:
Reland : https://github.com/pytorch/pytorch/issues/63517
Reference: https://github.com/pytorch/pytorch/issues/54261

Reference: facebookresearch/functorch#78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65233

Reviewed By: malfet

Differential Revision: D31025538

Pulled By: zou3519

fbshipit-source-id: b1cd38c22f4cb8eedd3f958e02dd7410dcbb8d8d
2021-09-21 09:26:23 -07:00
9324181d0a [JIT] Re-land "Add aten::slice optimization" (#65341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65341

The changes in D30231044 (babd449978) were removed due to a downstream issue in glow. Now that the issue has been fixed by D30849396, we can safely re-introduce the changes.

Test Plan:
`buck test //caffe2/test:jit -- TestPeephole`

Glow test:
* `buck test //glow/fb/torch_glow/tests:unfuse_glow_ops_test`
* qxy11 confirmed that the problematic glow model now loads correctly with these changes

Reviewed By: eellison

Differential Revision: D31056878

fbshipit-source-id: 049903ee04ba88885cc9d1a91427af0f1f44f681
2021-09-21 07:29:51 -07:00
9c23f6eb7d [nn] TripletMarginLoss and PairwiseDistance : no batch dim (#64882)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64882

Reviewed By: malfet

Differential Revision: D31055577

Pulled By: jbschlosser

fbshipit-source-id: 2f0a5a08619b672026b48a78bc7d83a6dccba0bf
2021-09-21 07:29:48 -07:00
d35ee431d8 correlate forward and backward op (#62553)
Summary:
Use the startThreadId+seqNumber of the forward op and the fwdThreadId+seqNumber of the backward op to correlate pairs of them.
third_party/kineto should be updated accordingly: https://github.com/pytorch/kineto/pull/372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62553

Reviewed By: malfet

Differential Revision: D30125728

Pulled By: gdankel

fbshipit-source-id: 9877a54392ba043d0eac56ce5b7bbf244277fa7e
2021-09-21 07:28:29 -07:00
f0ada4bd54 [docs] Remove .data from some docs (#65358)
Summary:
Related to https://github.com/pytorch/pytorch/issues/30987. Fix the following task:

- [ ] Remove the use of `.data` in all our internal code:
  - [ ] ...
  - [x] `docs/source/scripts/build_activation_images.py` and `docs/source/notes/extending.rst`

In `docs/source/scripts/build_activation_images.py`, I used `nn.init` because the snippet already assumes `nn` is available (the class inherits from `nn.Module`).
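
For example, a sketch of the substitution (the module here is illustrative, not the one in the script):

```python
import torch.nn as nn

layer = nn.Linear(4, 4)

# Instead of mutating through .data:
#   layer.weight.data.normal_(0, 1)
# use the in-place init helpers, which are natural in an nn.Module context:
nn.init.normal_(layer.weight, mean=0.0, std=1.0)
```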

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65358

Reviewed By: malfet

Differential Revision: D31061790

Pulled By: albanD

fbshipit-source-id: be936c2035f0bdd49986351026fe3e932a5b4032
2021-09-21 06:32:31 -07:00
daa50f1e9f Adds keyword only args to gradcheck (#65290)
Summary:
Changes the call signature of gradcheck so that kwargs are kwargs only.

Also modifies return call from gradgradcheck, to reflect these changes.
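
A quick sketch of the effect (argument values are illustrative):

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)

gradcheck(torch.sin, (x,), eps=1e-6)   # OK: kwargs passed by keyword
# gradcheck(torch.sin, (x,), 1e-6)     # TypeError after this change
```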

Fixes https://github.com/pytorch/pytorch/issues/65165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65290

Reviewed By: soulitzer

Differential Revision: D31061316

Pulled By: albanD

fbshipit-source-id: 3505569a33a497a8be4347bdd425bb2b8e536999
2021-09-21 06:31:07 -07:00
880098a7e3 [PyTorch Edge] Backport function for defaults args with out args, flag on (#63651)
Summary:
1. Enable support for operators with default args and out args. For `torch.add(x, h, out=x)`, the number of specified arguments will be 3 instead of 4.
2. Bump bytecode version from 6 to 7
3. Implement the backport_v7_to_v6 function. Also slightly refactor the local_thread to allow re-emitting operators.
4. Add a unit test to cover the backport function
5. Update the expected result from 4 to 3 in the DefaultArgsWithOutArg unit test to cover the number of specified arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63651

ghstack-source-id: 138539912

Test Plan:
```
caffe2/test/cpp/jit:jit - LiteInterpreterTest.DefaultArgsWithOutArg
caffe2/test/cpp/jit:jit - LiteInterpreterTest.DefaultArgsPinvWithOutArg
caffe2/test/cpp/jit:jit - LiteInterpreterTest.BackPortByteCodeModelAllVersions
```

Reviewed By: raziel, tugsbayasgalan

Differential Revision: D30454080

fbshipit-source-id: 357c50b96682430675142d20d688d1f64e1de307
2021-09-20 22:50:30 -07:00
5826d207ad [JIT] Delete obsolete message: or if you absolutely have to, use c10::impl::GenericDict(c10::impl::deprecatedUntypedDict()) (#65164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65164

Looks like it was forgotten in https://github.com/pytorch/pytorch/pull/25439

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31072625

Pulled By: pbelevich

fbshipit-source-id: a5ffcfb0836f962ab6952a187ba7717c4d4a6e33
2021-09-20 22:50:28 -07:00
19a1063888 [JIT] Support device as Dict key (#65079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65079

This is required to use RPC DeviceMap aka Dict[torch.device, torch.device] in torchscript

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31072626

Pulled By: pbelevich

fbshipit-source-id: 51cfa5653db86de73b624e9157d68d1b319bfc64
2021-09-20 22:49:15 -07:00
512834b61d Reduce PyTorch warnings: Cast fix xplat/caffe2/aten/src/ATen/core/DeprecatedTypeProperties.h (#65031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65031

Test Plan:
```
buck build --show-output //caffe2/torch/fb/sparsenn:sparsenn_operators

buck test caffe2/torch/fb/sparsenn:test
```

Reviewed By: r-barnes

Differential Revision: D30948791

fbshipit-source-id: 13046e1d0ce2c24864ad38f318ca5e34b1bb9552
2021-09-20 20:29:58 -07:00
0dc98728bc Basic implementation of ShardedLinear using ShardedTensor. (#64128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64128

This PR implements a sharded nn.Linear layer using ShardedTensors with
the following limitations:

1) Works only for ChunkShardingSpec.
2) The implementation is only meant to demonstrate functionality and is most likely
not performant at all.

The PR also introduces a `shard_parameter` API to easily shard parameters of
`nn.Modules`. This also has the following limitations:

1) Works only for ChunkShardingSpec.
2) Is not performant since it uses broadcast instead of scatter, because
ProcessGroupNCCL doesn't yet support scatter.

Overall user API for running a sharded linear would be something like this:

```
# SPMD programming paradigm running same code on all nodes.
fc = nn.Linear(10, 10)

# Setup sharding.
sharding_spec=ChunkShardingSpec(...)
shard_parameter(fc, 'weight', sharding_spec, src_rank=0)

# Run as a normal linear layer.
inp = torch.rand(10, 10)
output = fc(inp)
```
ghstack-source-id: 138500985

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: wanchaol, bowangbj

Differential Revision: D30621215

fbshipit-source-id: 1aa7478568c18a4572f6c3462fdf24a4cbde01d6
2021-09-20 18:31:11 -07:00
257a18d951 Track peak memory usage (#65157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65157

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31029049

Pulled By: driazati

fbshipit-source-id: 3e87e94e4872d118ad191aef2b77b8cefe90aeb6
2021-09-20 17:25:16 -07:00
58909395ab Fix logic to determine master vs PR (#65155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65155

This was previously buggy with empty strings, which caused the hook to write on any job, not just `master`, regardless of the `only_on_master` flag.

Test Plan: see `[scribe] Skipping RDS write on PR` in the logs for `linux-xenial-cuda11.3-py3.6-gcc7`

Reviewed By: malfet

Differential Revision: D31029048

Pulled By: driazati

fbshipit-source-id: 77c4a60e443d8fc19990755a3a346576afee86d8
2021-09-20 17:25:14 -07:00
60915eb810 [quant] Add fp32/fp16 zero_point support for CPU fakeQuant (#65055)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65055

Test Plan: Imported from OSS

Reviewed By: jingsh, supriyar

Differential Revision: D30975238

Pulled By: b-koopman

fbshipit-source-id: 2000660ffe71cb85d00fdabaf8fc3ba7323f9a1e
2021-09-20 17:25:12 -07:00
ce101fed02 [PyPer] copy-free freeze_module (#65118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118

Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.

Reviewed By: eellison, movefast1990

Differential Revision: D30955053

fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
2021-09-20 17:25:10 -07:00
ca649851c6 Reduce PyTorch warnings: Cast fix xplat/caffe2/c10/core/TensorOptions.h (#65030)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65030

Test Plan:
```
buck build --show-output //caffe2/torch/fb/sparsenn:sparsenn_operators

buck test caffe2/torch/fb/sparsenn:test
```

Reviewed By: r-barnes

Differential Revision: D30948721

fbshipit-source-id: 16fe42daab35709c56a4d3ccc276ea635a3510c1
2021-09-20 17:23:58 -07:00
2465a103b8 [iOS] Zero out NSError to avoid heap corruptions for the OSS builds (#65355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65355

I've been seeing heap corruptions in the CMake builds due to the NSError* not being initialized with `nil`. However, I haven't seen this issue in the BUCK builds.
ghstack-source-id: 138502708

Test Plan:
1. Test the OSS builds to make sure the heap corruption has gone.
2. Test the Buck build in the playground app
3. Circle CI

Reviewed By: hanton

Differential Revision: D31048010

fbshipit-source-id: cfd8d614f3f91f09caee4aab61237007ec080481
2021-09-20 16:31:23 -07:00
b7adb3350a Add crow_/col_indices to view types (#63176)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63176

Reviewed By: malfet, albanD

Differential Revision: D30315882

Pulled By: cpuhrsch

fbshipit-source-id: eedae5265a757ed68fd69e4f9d07070b05de4bd8
2021-09-20 14:35:58 -07:00
31f61122da Creating a helper function to generate a unique name for an attr in a module (#64970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64970

Add a helper function to create a unique name for an attr.
This can be used when we want to add a weight to a module.

Test Plan: run CI.

Reviewed By: jfix71

Differential Revision: D30921497

fbshipit-source-id: 598569d107df8b516ff12920a4bef3a42577e987
2021-09-20 14:35:56 -07:00
b45ec16310 Add support to lower acc_ops.transpose (#65036)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65036

Reviewed By: jfix71, 842974287

Differential Revision: D30934503

fbshipit-source-id: 51880d3d36492f5206f77c9d1a994d8532597b62
2021-09-20 14:35:54 -07:00
e33a1fa680 [fx] give warning instead of fatal the program when submod not found during adding get_attr (#65225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65225

Currently, when creating a get_attr node, if the attribute is in a submodule, we'll first find the submodule. If the submodule isn't in the owning module, we throw an exception.

However, if the attribute can't be found, we give a warning but still allow creating the get_attr node. To align with this behavior, we change the reaction when the submodule is not found from fatal to a warning.

Test Plan: CI

Reviewed By: jamesr66a, jfix71

Differential Revision: D31021535

fbshipit-source-id: 4c0b471448c09cc927d0f47b5bf56594f25a8863
2021-09-20 14:35:52 -07:00
8fb253757d Remove @balioglu from PyTorch Distributed code owners (#65239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65239

Due to too much noise caused by the GitHub notifications, going forward I prefer to track PRs manually.
ghstack-source-id: 138386041

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D31027792

fbshipit-source-id: 6578e41d4ab53ad2c64a41584716f4903298cd6b
2021-09-20 14:34:37 -07:00
e3210ca184 [CUDA graphs] Beta, not prototype (#65247)
Summary:
Powers have decided this API should be listed as beta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65247

Reviewed By: malfet

Differential Revision: D31057940

Pulled By: ngimel

fbshipit-source-id: 137b63cbd2c7409fecdc161a22135619bfc96bfa
2021-09-20 13:32:36 -07:00
b71f01f70d Fix full backward hook when grad is disabled (#65335)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59901. See discussion in the issue.

cc albanD soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65335

Reviewed By: malfet

Differential Revision: D31055865

Pulled By: albanD

fbshipit-source-id: 53605df62bc73c99d8908248087ab400b81ac495
2021-09-20 13:31:19 -07:00
2abf3594d5 Fix unassigned ciflow trigger (#65354)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65250#issuecomment-923120764

This is a limitation of GitHub Actions triggers: it's hard to introduce a condition before the workflow runs, which is why we intentionally pick the rare event ("unassigned"). For people who didn't opt into ciflow and who unassign manually, I think the fix is to run all the CI (otherwise we'd introduce a new condition for this, which isn't worth the added complexity).

The `unassigned` event payload looks like this, just to confirm that `github.event.assignee.login` points to the right location.

```
  {
    "action": "unassigned",
    "assignee": {
      "avatar_url": "https://avatars.githubusercontent.com/u/658840?v=4",
      "events_url": "https://api.github.com/users/zhouzhuojie/events{/privacy}",
      "followers_url": "https://api.github.com/users/zhouzhuojie/followers",
      "following_url": "https://api.github.com/users/zhouzhuojie/following{/other_user}",
      "gists_url": "https://api.github.com/users/zhouzhuojie/gists{/gist_id}",
      "gravatar_id": "",
      "html_url": "https://github.com/zhouzhuojie",
      "id": 658840,
      "login": "zhouzhuojie",
      "node_id": "MDQ6VXNlcjY1ODg0MA==",
      "organizations_url": "https://api.github.com/users/zhouzhuojie/orgs",
      "received_events_url": "https://api.github.com/users/zhouzhuojie/received_events",
      "repos_url": "https://api.github.com/users/zhouzhuojie/repos",
      "site_admin": false,
      "starred_url": "https://api.github.com/users/zhouzhuojie/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/zhouzhuojie/subscriptions",
      "type": "User",
      "url": "https://api.github.com/users/zhouzhuojie"
    },
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65354

Reviewed By: malfet, seemethere, janeyx99

Differential Revision: D31060212

Pulled By: zhouzhuojie

fbshipit-source-id: ce815cc96e8a00016646d6f02f0917169fa652dc
2021-09-20 12:33:23 -07:00
378949b83c fix typo missing f string (#65226)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65226

Reviewed By: malfet

Differential Revision: D31055793

Pulled By: albanD

fbshipit-source-id: fafac53e75223c4f599bd2162095aacad7b690df
2021-09-20 12:31:54 -07:00
0430d1da12 [iOS] Fix the TestApp (#65319)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65319

Test Plan: Imported from OSS

Reviewed By: hanton

Differential Revision: D31049543

Pulled By: xta0

fbshipit-source-id: ff0d0baac30682c63b2a28254ee0a5d8d9b8ca6f
2021-09-20 11:28:40 -07:00
3e64c9e176 [Pipe] Add a WithDevice wrapper to specify device execution for a module. (#65190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65190

As described in https://github.com/pytorch/pytorch/issues/65093, there
could be modules which don't have any parameters/buffers. In this case, Pipe
determines that the module should be executed on CPU. However, this might result
in unnecessary GPU-to-CPU transfers when the user expected the module to be
executed on the GPU itself, keeping its inputs and outputs on the GPU.

For this use case, we introduce a `WithDevice` wrapper which can be used to
override which device a particular module should be executed on as part of the
pipeline.
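
A hypothetical usage sketch (the import path and exact call signature are assumptions based on this description):

```python
import torch
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe, WithDevice  # assumed import path

fc1 = nn.Linear(16, 16).to('cuda:0')
dropout = nn.Dropout()  # parameter-less: would otherwise be placed on CPU
fc2 = nn.Linear(16, 16).to('cuda:1')

# Pin the parameter-less module to cuda:0 to avoid a GPU->CPU->GPU round trip.
model = nn.Sequential(fc1, WithDevice(dropout, torch.device('cuda:0')), fc2)
# pipe = Pipe(model, chunks=4)  # requires an initialized RPC framework
```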

Closes: https://github.com/pytorch/pytorch/issues/65093
ghstack-source-id: 138376272

Test Plan:
1) waitforbuildbot
2) unit tests

Reviewed By: SciPioneer

Differential Revision: D31010027

fbshipit-source-id: 4c1c61d3c6feeef341e002e5f7e83dd33ff3a516
2021-09-20 11:27:27 -07:00
0a3cf8886a Torchhub: More robust assumption regarding main or master branch (#64364)
Summary:
Closes https://github.com/pytorch/pytorch/issues/63753

This PR changes the assumption regarding the default branch of a repo to the following:

> If `main` exists then use `main`, otherwise use `master`

This will make torchhub more robust w.r.t. the ongoing changes where repos use `main` instead of `master` as the development/default branch.
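
A minimal sketch of the stated rule (not the actual torchhub implementation):

```python
def default_branch(branches):
    # If main exists then use main, otherwise use master.
    return "main" if "main" in branches else "master"

assert default_branch(["main", "master"]) == "main"
assert default_branch(["master"]) == "master"
```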

cc nairbv NicolasHug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64364

Reviewed By: saketh-are

Differential Revision: D30731551

Pulled By: NicolasHug

fbshipit-source-id: 7232a30e956dcccca21933a29de5eddd711aa99b
2021-09-20 10:36:13 -07:00
99e4ab5d44 [Static Runtime] Implement and enable variadic tuple unpack (#64934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934

Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```

The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`

Reviewed By: d1jang

Differential Revision: D30872109

fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
2021-09-20 10:36:11 -07:00
14347d0dd5 [quant][fx][graphmode] Fix a bug for sub (#65109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65109

Previously for sub we set the dtype with the qconfig since it matched a QuantizeHandler.
However, this is incorrect: the dtype for sub is decided by whether the output is quantized or not,
so we added an is_output_quantized check when deciding the dtype for the output of sub.

Follow-up: is_output_quantized now depends on is_reference, which is pretty confusing and may cause problems down the road; we should remove this dependency in the future.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_sub_scalar

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30977826

fbshipit-source-id: 551fd63bd61b43b3c3415944ff73174e3a21cc8a
2021-09-20 10:36:09 -07:00
c562ebca23 Revert "Revert D30558877: Ported std/var to ReductionOpInfo (#65262)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/63978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65262

Reviewed By: mruberry

Differential Revision: D31037360

Pulled By: ngimel

fbshipit-source-id: 1c60f40c547229767cba3bbe7e11ca0fbbc8f95f
2021-09-20 10:36:06 -07:00
fb1e6835cc simplify torch.meshgrid's shape computation (#62905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62905

Reviewed By: mruberry

Differential Revision: D31021274

Pulled By: dagitses

fbshipit-source-id: c219389bdc543e9592f7b1c707acfbf752ee6f34
2021-09-20 10:34:45 -07:00
cf60d24028 [DataPipe] Unlimited buffer for Forker and Demultiplexer (#64994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64994

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30934362

Pulled By: ejguan

fbshipit-source-id: d3b774d7e28c0b9659e999511e5a68c3929857d4
2021-09-20 09:30:39 -07:00
88032d8943 Automated submodule update: FBGEMM (#64640)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: d1ecc7dbe2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64640

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30805660

fbshipit-source-id: 9f783862e89fe3974badd5194ef793db55e7d275
2021-09-18 16:29:30 -07:00
d8189db80f [quant][fx2trt] Generate engine graph for explicit quant/implicit quant and fp16 graph (#65289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65289

Turn on VERBOSE logging and use engine visualizer to generate the graph.

Runtime:
```
explicit quant result diff max tensor(0.0771)
implicit quant result diff max tensor(0.1909)
trt fp16 time (ms/iter) 1.0740923881530762
trt int8 time (ms/iter) 0.5288887023925781
trt implicit int8 time (ms/iter) 0.6334662437438965
PyTorch time (CUDA) (ms/iter) 4.448361396789551
PyTorch time (CPU) (ms/iter) 45.13296604156494
```

Generated Graphs:
```
explicit int8: https://www.internalfb.com/intern/graphviz/?paste=P458669571
implicit int8: https://www.internalfb.com/intern/graphviz/?paste=P458669656
fp16: https://www.internalfb.com/intern/graphviz/?paste=P458669708
```

Test Plan:
```
buck run mode/opt -c python.package_style=inplace caffe2:fx2trt_quantized_resnet_test 2>log
buck run //deeplearning/trt/fx2trt/tools:engine_layer_visualize -- --log_file log
```

Reviewed By: 842974287

Differential Revision: D30955035

fbshipit-source-id: 24949458ad9823fb026d56d78a6ee1c6874b6034
2021-09-18 13:30:37 -07:00
7f8d622d70 [Static Runtime] Add perf metrics for number of managed tensors & unmanaged values (#64992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64992

This change lets Static Runtime print out the number of managed tensors & unmanaged values as performance metrics during profile runs.

We will use and enhance these metrics to guide the effort of managing output tensors.

Test Plan:
Confirmed that a profile run prints out the added metric values on inline_cvr nets:
```
(inline_cvr/local)
...
Total number of managed tensors: 2754
Total number of unmanaged values: 3240
...
(inline_cvr/local_ro)
Total number of managed tensors: 1554
Total number of unmanaged values: 2966
...
(inline_cvr/remote_ro)
Total number of managed tensors: 1439
Total number of unmanaged values: 28
...
```

Reviewed By: hlu1

Differential Revision: D30926617

fbshipit-source-id: b86e071003ac941b9663db103eaa7c614466b4e0
2021-09-18 11:26:37 -07:00
4a128ed811 Remove incorrect stride assert in Reduce.cuh (#65227)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583

Per discussion with ngimel, the condition asserted here may not always hold after TensorIterator's dimension coalescing and reordering. However, the reduction output should still be correct when `sub_iter.strides(0)[0]` is non-zero.

I've verified correctness empirically by:
1. Lowering the threshold ([configured here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/TensorIterator.cpp#L1127)) at which iterators are split into sub-iterators, making it easier to trigger.
2. Generating many tensors with random dimensions and randint elements which produce a non-zero `sub_iter.strides(0)[0]` in the CUDA kernel.
3. Verifying that the reduction `t.sum(dim=0)` produces the same results for those tensors on CPU and on CUDA, as sketched below.
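
A minimal sketch of that CPU-vs-CUDA consistency check (shapes and ranges are illustrative, not the exact tensors used in the verification):

```python
import torch

for _ in range(100):
    dims = torch.randint(1, 5, (3,)).tolist()   # random dimensions
    t = torch.randint(0, 100, tuple(dims))      # random integer elements
    cpu = t.sum(dim=0)
    cuda = t.cuda().sum(dim=0).cpu()
    torch.testing.assert_close(cpu, cuda)       # reductions must agree exactly
```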

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65227

Reviewed By: ngimel

Differential Revision: D31031406

Pulled By: saketh-are

fbshipit-source-id: 5cbf2001224454c74f6db42455c507365ad1f2b1
2021-09-18 10:29:13 -07:00
543185a0fd support using gradients named for outputs in derivatives (#63947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63947

Fixes #62196

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30541485

Pulled By: dagitses

fbshipit-source-id: ea1dd0edd1a51936a295631e52b85e9c022a9c87
2021-09-18 07:31:45 -07:00
926a3d2e85 clarify implementation of check_grad_usage (#64439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64439

1) remove unused fully_implemented
2) rename used_grad to uses_grad and make it a boolean
3) rename used_grads to num_grads_uses
4) add comments explaining what some of the checks mean

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30733904

Pulled By: dagitses

fbshipit-source-id: dccbbef8a4be8713215ef91aa97a34124f06a7a1
2021-09-18 07:30:30 -07:00
d3e36fade2 [quant][fx2trt] Enable comparison with implicit quant mode (#65043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65043

Currently getting the following results; will take another look at the executed graph:
```
trt fp16 time (ms/iter) 1.0483217239379883
trt int8 time (ms/iter) 0.5329632759094238
trt implicit int8 time (ms/iter) 0.6769704818725586
PyTorch time (ms/iter) 6.453146934509277
```

Test Plan:
```
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py
```

Imported from OSS

Reviewed By: 842974287

Differential Revision: D30954871

fbshipit-source-id: 8d7ff82b8f5d0b7946fbd38a7cddede7d40b28aa
2021-09-17 23:29:35 -07:00
4150b672aa [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D31039372

fbshipit-source-id: a5e54a9b1d2ef97e9bc206b9e2a82124e5a22a7a
2021-09-17 20:33:12 -07:00
6707dfeefb Remove 9.2 related macros for CONSTEXPR (#65066)
Summary:
Removes C10_HOST_CONSTEXPR_EXCEPT_CUDA92 references in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65066

Reviewed By: driazati

Differential Revision: D31022520

Pulled By: janeyx99

fbshipit-source-id: f02cdc6caba5b48405575242921f5845ff18f729
2021-09-17 17:31:20 -07:00
1cd9018b6f Make github.com in noproxy list (#65256)
Summary:
An attempt to solve some rate-limiting issues we saw when calling GitHub APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65256

Reviewed By: seemethere

Differential Revision: D31035115

Pulled By: zhouzhuojie

fbshipit-source-id: 7efd5d5af7d91805e4bf27b86847791e991b741e
2021-09-17 17:31:18 -07:00
50c29fef3e remove utils.cpp (#65184)
Summary:
Dead code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65184

Reviewed By: mruberry

Differential Revision: D31031777

Pulled By: ngimel

fbshipit-source-id: 13633888229a7af8cfd8ea7e55ea2880b2e47273
2021-09-17 17:31:15 -07:00
19471c54a6 [fx const fold] fix a case when some inputs are unused (#65223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65223

If there are unused inputs, they won't appear in `submod_1`. We need to add all the unused inputs so that the model after const fold has the same inputs as the original model.

Reviewed By: jfix71

Differential Revision: D31021217

fbshipit-source-id: b7452c90d133b747e0699936a81d3fee14af9cc9
2021-09-17 17:29:55 -07:00
992dad1855 [Profiler] Update kineto submodule (#65236)
Summary:
Update to latest kineto revision. See Kineto repo for change log.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65236

Reviewed By: malfet

Differential Revision: D31031638

Pulled By: gdankel

fbshipit-source-id: 681655b2e8e151895afa91445ced0fd57a11fa93
2021-09-17 16:26:30 -07:00
4408b755bc [fx2trt] re-enable profiler and some miscs for TRTModule (#65072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65072

We previously disabled attaching the TRT profiler to the execution context in TRTModule because https://fburl.com/mc33n880 states that `enqueue()` doesn't support profiling. That appears not to be the case, so this diff re-enables attaching the profiler.

Also added a bunch of checks for dtype and shape, and fixed saving state_dict and loading back.

Test Plan: buck run mode/opt -c python.package_style=inplace -j 40 deeplearning/trt/fx2trt:acc2trt_test

Reviewed By: yinghai

Differential Revision: D30962757

fbshipit-source-id: 9c664b0500a8169b7952f6f912239a5a05772aea
2021-09-17 16:26:28 -07:00
afa25c77f1 [package] Make it possible to re-save a PackageImporter module (#65101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65101

As title. Previously this was guarded against for implementation
simplicity, as we didn't really think there was a use case for saving a
mangled module name directly.

But people started doing stuff like:
```
exporter.save_module(my_imported_obj.__module__)
```
which implicitly passes along the mangled module name.

This PR makes it so that a given `PackageImporter` instance can always
import modules that it created, and changes `PackageExporter` to
properly demangle the resulting module name when writing the package to
the export archive.

Differential Revision: D30975712

Test Plan: Imported from OSS

Pulled By: suo

fbshipit-source-id: d9e849bf651713890e72dccdcef74fa52d377149
2021-09-17 16:25:11 -07:00
487c771593 [FX] Fix tracing of bitwise and/or (#65196)
Summary:
Previously resulted in `AttributeError: module 'operator' has no attribute 'and'`

`and`/`or` are Python keywords, so the operator-module forms are named `operator.and_` and `operator.or_`.
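
A quick illustration:

```python
import operator

# `and` and `or` are keywords, so the bitwise functional forms carry a
# trailing underscore:
assert operator.and_(0b1100, 0b1010) == 0b1000   # same as 0b1100 & 0b1010
assert operator.or_(0b1100, 0b1010) == 0b1110    # same as 0b1100 | 0b1010
```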

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65196

Reviewed By: Chillee

Differential Revision: D31020336

Pulled By: jansel

fbshipit-source-id: 51d888151fe78c0c1197ecaf161976b219c59694
2021-09-17 14:33:02 -07:00
6596173811 Revert D30731191: [pytorch][PR] Torchhub: rewrite commit hash check to avoid using unnecessary GitHub API credits
Test Plan: revert-hammer

Differential Revision:
D30731191 (f9bf144a0c)

Original commit changeset: d1ee7c2ef259

fbshipit-source-id: 5c7207f66c5354ce7b9ac2594e4f5b8307619b0c
2021-09-17 14:33:00 -07:00
3d32dec5ba [ONNX] Deprecate enable_onnx_checker argument in torch.onnx.export() (#61708) (#64369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64369

Until now, the "enable_onnx_checker" parameter was described as below:

enable_onnx_checker (bool, default True): If True the ONNX model checker will be run to ensure the exported model is a valid ONNX model.

An invalid ONNX graph is useless to users, so this check should be done for every call.

In this PR, we still write the model to an ONNX file even if it is invalid, and the exception is thrown after the ONNX file has been created. This enables users to output an invalid ONNX graph for debugging.

This PR keeps the parameter in the torch.onnx.export() function for backward compatibility, while all backend logic has been changed to behave as if enable_onnx_checker were set to True.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30905267

Pulled By: malfet

fbshipit-source-id: 3ad3f68e77fcec012cc7ef674cc9a61755eebc9e

Co-authored-by: fatcat-z <zhang-ji@outlook.com>
2021-09-17 14:31:41 -07:00
ae00075ac7 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123

This change re-lands D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`.

```
ep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57   return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57    auto sr = getStaticRuntime();
Sep 15 15:39:57         ^    ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 2 errors generated.
```

This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).

Original Summary:

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: - Confirm that OSS build went well (See External Tests section).

Reviewed By: mikeiovine

Differential Revision: D30983292

fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
2021-09-17 13:32:01 -07:00
eaf85fad62 [PyTorch] Extract parseOperator() into a standalone source file (#65179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65179

This follows up on https://github.com/pytorch/pytorch/pull/61862. The purpose is to modularize operator parsing so that it can be used as needed without pulling the whole `import.cpp` into the build.

Test Plan: Added a unit test in `test_lite_predictor.cpp` called `ParseOperators`, similar to `ParseBytecode`.

Reviewed By: iseeyuan

Differential Revision: D31006555

fbshipit-source-id: c38e221800af4cf72963a353c452c5437f56a0ac
2021-09-17 13:31:59 -07:00
35084ee451 [PyTorch] Improve OperatorEntry::getKernelForDispatchKey (#64838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64838

The returned pointer, if present, could never be nullptr, so there is no reason to wrap it in an optional rather than just using the nullptr state. The repeated calls to kernels_.at() were not getting optimized away, so just use the perfectly good iterator that find() already gave us.
ghstack-source-id: 138304429

Test Plan: CI

Reviewed By: bdhirsh

Differential Revision: D30875748

fbshipit-source-id: 9cbb875715b7a582380c7402155fdbe21944dc85
2021-09-17 13:31:56 -07:00
fcaf526815 avoid moving Argument in infer_schema (#64822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64822

Turns out the suppressed lint message was trying to tell us something: we can construct our Argument in-place rather than creating a temporary and moving it into the argument vector.
ghstack-source-id: 138304423

Test Plan: CI, profile op registration and observe reduced Argument move ctor and dtor costs

Reviewed By: smessmer

Differential Revision: D30860718

fbshipit-source-id: c8da45ab7e61b5df9fa1273301896309bca108b5
2021-09-17 13:31:54 -07:00
79cbcd3e7c [PyTorch] Fix missing move in Argument ctor (#64821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64821

Not moving adds excess refcounting overhead.
ghstack-source-id: 138304432

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30860720

fbshipit-source-id: de695e5cdfb1fa314b53a8bcb291343ae4eb87a5
2021-09-17 13:31:51 -07:00
5a3475df21 [PyTorch] shrink Argument (#64820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64820

Putting boolean fields next to each other avoids wasting space for padding.
ghstack-source-id: 138304433

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30860717

fbshipit-source-id: ad45c37574a7c857958978aad42fd1333c6b29ee
2021-09-17 13:31:48 -07:00
132d65ed25 [PyTorch] Compare pointers before calling expensive Type comparison (#64784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64784

See code comment for explanation.
ghstack-source-id: 138304431

Test Plan: Reduced overhead in findSchemaDifferences while profiling registration at startup in a case where I forced duplicates to be registered (by looping in RegisterDispatchKey.cpp).

Reviewed By: dhruvbird

Differential Revision: D30854036

fbshipit-source-id: 568733c3cf449697cdeb74cf57fed0926729fa68
2021-09-17 13:31:46 -07:00
cf5c00f155 CI: Consolidate Build and Test naming for better stats collection (#65232)
Summary:
All PyTorch build steps should now be named "Build" and test steps "Test" in workflows that test PyTorch on Linux and Windows.

I left the binary stuff alone as that build is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65232

Reviewed By: driazati, seemethere

Differential Revision: D31024232

Pulled By: janeyx99

fbshipit-source-id: 24b1a1e2b1b25aba70b7adc41603ec8fa4ce7dd6
2021-09-17 13:30:31 -07:00
45bd0f6181 Back out "Revert D30745960: [DDP] Remove SPMD from self.modules_buffers" (#64778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64778

Original commit changeset: d3f3fb813c45
ghstack-source-id: 138326910

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849443

fbshipit-source-id: 15dab8a959a29d2e2fefac6ad52b8d8168eacc02
2021-09-17 12:28:36 -07:00
70f286c1e2 Back out "Revert D30745961: [DDP] Remove self.modules_params" (#64777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64777

Original commit changeset: 59f7cc50d369
ghstack-source-id: 138326909

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849442

fbshipit-source-id: bb87ba83935374d8a3ebbc29365df1417dd4f26f
2021-09-17 12:28:34 -07:00
61dfcbf4bc Back out "Revert D30745921: [DDP] Fix when buffers are reassigned in module" (#64776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64776

Original commit changeset: 343ead86bf1e
ghstack-source-id: 138326914

Test Plan: ci

Reviewed By: H-Huang

Differential Revision: D30849444

fbshipit-source-id: 9a72805416fe7d6c68e51bdcdb88f6e1fecb614d
2021-09-17 12:28:32 -07:00
cce5381238 [xplat][pytorch]: Disabling too many logging. (#65170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65170

Disabling excessive logging. These are per-frame log statements
that output lots of logs on the Skylight command line.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: SS-JIA

Differential Revision: D30778852

fbshipit-source-id: bcf75ec417dfe3e9ce3df92a1894352772bd663d
2021-09-17 12:28:30 -07:00
047e68235f delegate parallelism to Ninja when possible (#64733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64733

The previous implementation was wrong when CPU scheduling affinity is
set. In fact, it is still wrong if Ninja is not being used.

When there is CPU scheduling affinity set, the number of processors
available on the system likely exceeds the number of processors that
are usable by the build. We ought to use
`len(os.sched_getaffinity(0))` to determine the effective parallelism.
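
For illustration, the difference between the two counts (os.sched_getaffinity is a Linux-only API):

```python
import os

usable = len(os.sched_getaffinity(0))  # processors usable by this process
total = os.cpu_count()                 # processors present on the system
# Under taskset/cgroup affinity, usable <= total; the build should size its
# parallelism from `usable`, which is what Ninja effectively does.
```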

This change is more minimal and instead just delegates to Ninja (which
handles this correctly) when it is used.

Test Plan:
I verified this worked as correctly using Ninja on a 96-core machine
with 24 cores available for scheduling by checking:
 * the cmake command did not specify "-j"
 * the number of top-level jobs in top/pstree never exceeded 26 (24 +
   2)

And I verified we get the legacy behavior by specifying USE_NINJA=0 on
the build.

Reviewed By: jbschlosser, driazati

Differential Revision: D30968796

Pulled By: dagitses

fbshipit-source-id: 29547dd378fea793957bcc2f7d52d5def1ecace2
2021-09-17 12:28:28 -07:00
b936a10074 add test for number of jobs when building (#65162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65162

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30998006

Pulled By: dagitses

fbshipit-source-id: 8b8d45668acf0e6c0f16df0f705a1af8c6d4f22d
2021-09-17 12:28:25 -07:00
1ee66a5278 Remove CUDA 9.2 references conditionals and workarounds (#65070)
Summary:
Title says it all

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65070

Reviewed By: malfet

Differential Revision: D30966464

Pulled By: janeyx99

fbshipit-source-id: e454906fd5d7d321d390939ba5d237e1d9b150f8
2021-09-17 12:28:23 -07:00
51e12f0071 fix torch.distributed.elastic event docs (#64974)
Summary:
the example code wasn't working for me.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64974

Reviewed By: kiukchung, cbalioglu

Differential Revision: D30926481

Pulled By: edward-io

fbshipit-source-id: f5e32cc2b948b6ee30d84a8247856f39fc786f67
2021-09-17 12:27:09 -07:00
bbe25af0df [nnc] Updated inlining to handle cases when producer indices are constants after eval (#65044)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65044

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30954655

Pulled By: navahgar

fbshipit-source-id: dfaedb5af710b2625ceec3a443a6c4e34158ab16
2021-09-17 11:28:48 -07:00
03fc636d5c [nnc] Updated inliner to remove assertions and exception (#64719)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64719

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30828583

Pulled By: navahgar

fbshipit-source-id: 9826a59085a210e44d101a843ff2cae440dfd633
2021-09-17 11:28:46 -07:00
340531f2e0 [ONNX] Do not use numpy in ONNX opsets (#65188)
Summary:
Replace `torch.tensor([numpy.arange(a, b, c)])` with `torch.arange(a, b, c).unsqueeze(0)`
Replace `tuple(numpy.add(a, b))` with `tuple(x + y for (x, y) in zip(a, b))`

As `numpy` is an optional dependency, it shouldn't be used in PyTorch core by default
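
A quick check that the replacements are equivalent (values are illustrative):

```python
import torch

a, b, c = 0, 10, 2
# torch.tensor([numpy.arange(a, b, c)])  ->
t = torch.arange(a, b, c).unsqueeze(0)   # shape (1, 5), no numpy needed

u, v = (1, 2), (3, 4)
# tuple(numpy.add(u, v))  ->
assert tuple(x + y for x, y in zip(u, v)) == (4, 6)
```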

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65188

Reviewed By: mruberry

Differential Revision: D31009490

Pulled By: malfet

fbshipit-source-id: 528e48f055bf9ac1de1fd7e94c0be41915df9a0b
2021-09-17 11:28:44 -07:00
7ced25eee3 [CoreML][OSS] Include Core ML in iOS/MacOS nightlies (#65075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65075

Need to drop one line at - https://github.com/pytorch/builder/blob/master/conda/pytorch-nightly/meta.yaml#L65
ghstack-source-id: 138324213

Test Plan:
- Check the iOS nightly builds
  - `pod install LibTorch-Lite-Nightly`

Reviewed By: hanton

Differential Revision: D30912269

fbshipit-source-id: b07679b75ecf89beae2975c37cf17d2449a3304f
2021-09-17 11:27:20 -07:00
f9c0a39ad9 add a test case for const fold (#65224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65224

Add a test case for the fix D30996277 (8c38d141df).

Test Plan: buck test mode/opt -c python.package_style=inplace -c fbcode.nvcc_arch=v100,a100 -c fbcode.enable_gpu_sections=true -j 40 caffe2/test:fx_const_fold -- test_const_fold_module_attr

Reviewed By: jfix71

Differential Revision: D31000386

fbshipit-source-id: f444361839decc583bf93ac946cfe2049376719e
2021-09-17 10:32:07 -07:00
3c003aa6ae [PyTorchEdge] promote prim ops by using ops table for mobile runtime (#64816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64816

## Context:
Promoting prim ops:
Certain prim ops are more frequent than others (like tupleIndex, raiseException, ...). These ops are frequent enough that they were chosen to be promoted to first-class instructions. Promoting them requires multiple steps and support from the TS team, as it changes how the bytecode is serialized and deserialized. So, to prevent multiple bytecode version bumps and to provide stability while these changes happen, an interim iterative process is proposed that uses a table to look up a "promoted" op's function. This allows us to rapidly update the op list and test on production models without having to change the bytecode. In case of failure, we can quickly revert this change.

## Observation
The ops were chosen based on notebook N1135657, which examines the most frequent ops.

## Fix
An interim solution: a static table which, given a prim op name, returns the function to be applied to the stack. This lets us check in `function.cpp` to get the "promoted" op. As a fallback, the "promoted" op still resides in `register_prim_ops.cpp` so that the prim op's function is never missed.

ghstack-source-id: 138261338

Test Plan:
```
[pavithran@67109.od ~/fbsource/fbcode (eddab7da6)]$ buck test caffe2/test/cpp/jit:jit -- BackendTest.TestComposite
Building: finished in 5.4 sec (100%) 7284/7284 jobs, 0/7284 updated
  Total time: 5.8 sec
More details at https://www.internalfb.com/intern/buck/build/480191aa-a1ba-42ca-99e9-ee4bf2b06d65
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 867382eb-327f-43d7-a45c-875b7f484b15
Trace available for this run at /tmp/tpx-20210914-100224.283682/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425134506115
    ✓ ListingSuccess: caffe2/test/cpp/jit:jit - main (12.159)
    ✓ Pass: caffe2/test/cpp/jit:jit - BackendTest.TestCompositeWithSetStates (0.797)
    ✓ Pass: caffe2/test/cpp/jit:jit - BackendTest.TestComposite (0.779)
Summary
  Pass: 2
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425134506115
```

{F663491347}

Reviewed By: iseeyuan

Differential Revision: D30819926

fbshipit-source-id: 4cbe05d5761bdc9d62ef08e18172dcf64cb49526
2021-09-17 10:32:05 -07:00
ecfc784e67 Revert D30993855: [pytorch][PR] OpInfo: nn.functional.conv2d
Test Plan: revert-hammer

Differential Revision:
D30993855 (873255c6d9)

Original commit changeset: 7402f99addb4

fbshipit-source-id: b0539daa195dc6a3739bce5c264cb2177b7721ff
2021-09-17 10:32:02 -07:00
18fa58c4e9 [CoreML][OSS] Integrate with CMake (#64523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64523

- Build Pytorch with CoreML delegate - ` USE_PYTORCH_METAL=ON python setup.py install --cmake`
- Build iOS static libs - `IOS_PLATFORM=SIMULATOR USE_COREML_DELEGATE=1  ./scripts/build_ios.sh`
ghstack-source-id: 138324216

Test Plan:
- Test the HelloWorld example

{F657778559}

Reviewed By: iseeyuan

Differential Revision: D30594041

fbshipit-source-id: 8cece0b2d4b3ef82d3ef4da8c1054919148beb16
2021-09-17 10:32:00 -07:00
c1415a0a72 [Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197

1. The constructor accepts a local optimizer instance instead of the local optimizer's constructor inputs and class type.
2. The parameters are read from the local optimizer's param_groups instead of from a separate input.

Proposal: https://github.com/pytorch/pytorch/issues/59699
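
A minimal usage sketch of the simplified API (assumes an initialized process group; the model and averager settings below are illustrative):

```python
import torch
import torch.distributed.algorithms.model_averaging.averagers as averagers
from torch.distributed.optim import PostLocalSGDOptimizer

model = torch.nn.Linear(4, 4)  # placeholder model
local_optim = torch.optim.SGD(model.parameters(), lr=0.1)

# Pass the constructed optimizer instance; its param_groups supply the params.
opt = PostLocalSGDOptimizer(
    optim=local_optim,
    averager=averagers.PeriodicModelAverager(period=4, warmup_steps=100),
)
```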
ghstack-source-id: 138307226

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D31007439

fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
2021-09-17 10:31:58 -07:00
752a820230 Bf16 matmul (#64619)
Summary:
Re-create PR to fix https://github.com/pytorch/pytorch/pull/61891.

Drop the support for addbmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64619

Reviewed By: jbschlosser

Differential Revision: D30902995

Pulled By: VitalyFedyunin

fbshipit-source-id: dc318d73adff8f6974c9752d0d097e69276f8206
2021-09-17 10:31:56 -07:00
f9bf144a0c Torchhub: rewrite commit hash check to avoid using unnecessary GitHub API credits (#64362)
Summary:
This PR adds more detailed error messages to torchhub if the commit hash validation goes wrong, providing suggestions to the users on how to resolve the issue.

It also documents why such validation is important.

EDIT: it also avoids validating things when we know they aren't a commit, since there's no risk in that case
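
A hedged sketch of the "skip validation when the ref clearly isn't a commit" idea (the helper name and heuristic are illustrative, not the exact torchhub code):

```python
import re

def looks_like_commit_hash(ref: str) -> bool:
    # A full SHA-1 is 40 hex characters; a tag like "v0.10.0" or a branch
    # name is not, so validation (and its GitHub API calls) can be skipped.
    return bool(re.fullmatch(r"[0-9a-f]{40}", ref))

assert looks_like_commit_hash("a" * 40)
assert not looks_like_commit_hash("v0.10.0")
```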

CC malfet mthrok

cc nairbv NicolasHug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64362

Reviewed By: gchanan, malfet

Differential Revision: D30731191

Pulled By: NicolasHug

fbshipit-source-id: d1ee7c2ef2591dd7a5291977af1635ada2552d1b
2021-09-17 10:30:39 -07:00
0559cb37cd [FX] Ensure BC coverage for all of torch.fx.passes (#65081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65081

Test Plan: Imported from OSS

Reviewed By: jbschlosser, khabinov

Differential Revision: D30967428

Pulled By: jamesr66a

fbshipit-source-id: 2ff83da728dc469f086cf504e71b43396db612d8
2021-09-17 09:32:43 -07:00
cf7409e184 [FX] Move graph_manipulation and param_fetch out of experimental and into passes (#65183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65183

ghstack-source-id: 138309655

Test Plan: waitforsadcastle

Reviewed By: protonu

Differential Revision: D31007630

fbshipit-source-id: 77d14b284737aabbe2b9e6394177a0c2e40aafba
2021-09-17 09:32:40 -07:00
6aa04b6843 [fx2trt] make gpu trace better (#65168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65168

Add record_function to TRTModule and EngineHolder so each part appears in the GPU trace.
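
For context, `torch.autograd.profiler.record_function` scopes are what make a region show up as a named block in the trace; a generic sketch (not the actual TRTModule code):

```python
import torch
from torch.autograd.profiler import record_function

class WrappedModule(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        # Everything inside this scope appears as one named block in traces.
        with record_function("WrappedModule.forward"):
            return self.inner(x)
```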

Test Plan: CI

Reviewed By: wushirong

Differential Revision: D30997968

fbshipit-source-id: b90662f20a8c0d321846c222f3e8c8eb7e010eba
2021-09-17 09:32:37 -07:00
a8d7b885c5 [CoreML][iOS/MacOS] Add the CoreML executor (#64522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64522

The `PTMCoreMLExecutor` serves as a bridge between the delegate APIs and Core ML runtime.
ghstack-source-id: 138324217

allow-large-files

Test Plan:
iOS:
Run the CoreML tests in the playground app

MacOS:

```
buck test pp-macos

PASS     633ms  1 Passed   0 Skipped   0 Failed   CoreMLTests
```

{F657776101}

Reviewed By: raziel, iseeyuan

Differential Revision: D30594042

fbshipit-source-id: a42a5307a24c2f364333829f3a84f7b9a51e1b3e
2021-09-17 09:32:34 -07:00
aafeea3a6c Allow extra unused arguments in symbolic shape function (#65095)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65095

The reason I didn't do this initially was that I was worried that matching one schema to another schema with an extra argument might change semantics, e.g. Add(Tensor, Tensor) vs. Add(Tensor, Tensor, Tensor) might be different. However, we don't actually need to worry about this because, unlike in symbolic_script.cpp, the graph schema isn't used for node matching.
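
A sketch of what this permits (the `broadcast` helper below is hypothetical and simplified): a symbolic shape function can carry an extra, unused trailing argument and still be matched.

```python
from typing import List

def broadcast(a: List[int], b: List[int]) -> List[int]:
    # Hypothetical helper, simplified: assumes equal rank and no size-1 rules.
    return [max(x, y) for x, y in zip(a, b)]

# `unused` has no counterpart in the matched op schema; with this change the
# extra argument no longer prevents the match.
def add_shape(self: List[int], other: List[int], unused: float = 1.0) -> List[int]:
    return broadcast(self, other)
```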

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30972081

Pulled By: eellison

fbshipit-source-id: d4089e8feafc330df2ca158866fe779a7da0b073
2021-09-17 09:31:02 -07:00
6eafe7f15e Actually deprecate __torch_function__ as plain methods (#64843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64843

Fix for https://github.com/pytorch/pytorch/issues/63767

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30991425

Pulled By: albanD

fbshipit-source-id: 1214143b8aea87e6ff406c7fc13096bd15d1a768
2021-09-17 08:32:53 -07:00
1ed9c33d08 Update fx proxy to use classmethod for __torch_function__ (#64842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64842

Change `__torch_function__` to follow the best-practice guideline of using classmethods.
I am not sure how to handle the case where multiple tracer objects are given as input, but given that before we were getting an arbitrary tracer based on the `self` that was arbitrarily chosen by the `__torch_function__` caller, the new implementation is no worse.
Let me know what you think!
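
For reference, a sketch of the classmethod form this change moves toward (the logging body is illustrative):

```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Classmethod form: dispatch no longer hinges on an arbitrary `self`
        # picked by the caller.
        if kwargs is None:
            kwargs = {}
        print(f"called {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)
```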

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30991423

Pulled By: albanD

fbshipit-source-id: d28940df230b543952b278a0eb2d61cf7ae123ce
2021-09-17 08:32:51 -07:00
473e55d5b2 Use classmethods for overrides (#64841)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64841

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30991424

Pulled By: albanD

fbshipit-source-id: 551e2119768f3a4292713f3bfa83930f5506adbd
2021-09-17 08:32:49 -07:00
a95fabfecb Fix port allocation race condition for elastic test (#65149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149

Fixes #64789

There is a race condition between when the free port is acquired and when it is used to create the store; in that window another process may have taken the port. Since this test only checks that the timeout is triggered for TCPStore, we can bind to any port on TCPStore creation.

This only affects the test on the server (since that is where the port is used), but I changed both tests for clarity.
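
A sketch of the race and the fix, with plain sockets standing in for TCPStore:

```python
import socket

def get_free_port():
    with socket.socket() as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]

# Racy: the port is free at this instant...
port = get_free_port()
# ...but another process may grab it before the store binds to it.

# Fix: bind to port 0 directly and let the OS pick an unused port atomically.
server = socket.socket()
server.bind(("localhost", 0))
print(server.getsockname()[1])
```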

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30993166

Pulled By: H-Huang

fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
2021-09-17 08:32:47 -07:00
f101070587 Small improvements to compare_models_torch binary (#65171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65171

Add the model comparison binary to BUCK, and also add some quality of life features such as controlling the input range.

Test Plan:
```
# Build the binary
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:ptmobile_compareAndroid\#android-arm64 --show-ou
# Push it to the device
adb push buck-out/gen/xplat/caffe2/ptmobile_compareAndroid\#android-arm64 /data/local/tmp/compare_models

# Run the benchmark binary
BENCH_CMD="/data/local/tmp/compare_models"
BENCH_CMD+=" --model=$PATH_TO_MODEL"
BENCH_CMD+=" --refmodel=$PATH_TO_REFERENCE_MODEL"
BENCH_CMD+=" --input_type=float --input_dims=$MODEL_INPUT_SIZE"
BENCH_CMD+=" --iter=100"
BENCH_CMD+=" --tolerance 1e-5"

```

Reviewed By: beback4u

Differential Revision: D30371322

fbshipit-source-id: 5e520aaf119c90985a1d5a135f76e4057148333b
2021-09-17 08:32:45 -07:00
9601deb1b3 Disable autograd fallback tests on Windows (#65147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65147

I think they trigger an MSVC bug per https://github.com/pytorch/pytorch/issues/48763
ghstack-source-id: 138247203

Test Plan: breakpointed https://www.internalfb.com/intern/sandcastle/job/9007199738584981/ and ssh'ed into the host and ran `buck build arvr/mode/win/opt //xplat/caffe2:autograd_libtorch_test_ovrsource` in `/cygdrive/d/ovrsource-null-hg`

Reviewed By: soulitzer

Differential Revision: D30992685

fbshipit-source-id: 06c6fb2c18d55490f89fc91ee5b7a4c5a7faf1c6
2021-09-17 08:32:43 -07:00
aaffcfe9cd implement "xy" indexing for torch.meshgrid (#62724)
Summary:
This is step 4/7 of https://github.com/pytorch/pytorch/issues/50276. This allows the use of `"xy"` indexing but doesn't change any defaults.
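
A quick example of the new mode:

```python
import torch

x = torch.arange(3)
y = torch.arange(5)
gx, gy = torch.meshgrid(x, y, indexing="xy")
print(gx.shape)  # torch.Size([5, 3]); the default "ij" would give (3, 5)
```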

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62724

Reviewed By: heitorschueroff

Differential Revision: D30995290

Pulled By: dagitses

fbshipit-source-id: 08a6a6144b20bc019f68bc3c52e3bbf967976d8f
2021-09-17 08:31:17 -07:00
d37c02be08 Allow parametrization to be nested (#65167)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65163
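
A hedged sketch of the kind of nesting this enables (module names are illustrative): a parametrization module whose own parameter is itself parametrized.

```python
import torch
import torch.nn.utils.parametrize as parametrize

class Positive(torch.nn.Module):
    def forward(self, X):
        return torch.abs(X)

class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.factor = torch.nn.Parameter(torch.tensor(2.0))

    def forward(self, X):
        return self.factor * X

scale = Scale()
# Parametrize the parametrization's own parameter...
parametrize.register_parametrization(scale, "factor", Positive())

layer = torch.nn.Linear(3, 3)
# ...then register the already-parametrized module as a parametrization.
parametrize.register_parametrization(layer, "weight", scale)
print(layer.weight.shape)  # torch.Size([3, 3])
```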

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65167

Reviewed By: jbschlosser

Differential Revision: D31002318

Pulled By: albanD

fbshipit-source-id: b1f1c6c9efa9e83af9789ed13efc133f777f418e
2021-09-17 07:29:01 -07:00
9157a2889f Pass GITHUB_TOKEN to linux CI jobs and avoid skipping torchhub tests (#64807)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64760

This should hopefully put the torchhub tests back.

This also avoids skipping the torchhub tests: currently the tests are skipped if they fail, which pretty much defeats the purpose of having a test in the first place since we're never notified when they do fail.

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra nairbv NicolasHug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64807

Reviewed By: seemethere

Differential Revision: D30994585

Pulled By: NicolasHug

fbshipit-source-id: 561782c22462b5cfec99cca153eb59623db5660a
2021-09-17 03:30:56 -07:00
7dc3858deb [CoreML][fbcode] Add the preprocess python APIs (#64521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64521

Add the preprocess part for the CoreML delegate; see `example.py` for usage.
ghstack-source-id: 138324214

Test Plan:
```
(base) [taox@devvm2780.vll0 ~/fbsource/fbcode/caffe2/fb]  buck run coreml:example -- --model="/home/taox/mobilenetv2/mobilenetv2.pt" --out="/home/taox/mobilenetv2/mobilenetv2_coreml.pt"
Parsing buck files: finished in 0.5 sec
Downloaded 0/1 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 10.6 sec (100%) 12611/57623 jobs, 1/57623 updated
  Total time: 11.1 sec
Converting Frontend ==> MIL Ops: 100%|██████████████████████████████████████████▉| 382/383 [00:00<00:00, 692.58 ops/s]
Running MIL optimization passes: 100%|███████████████████████████████████████████| 18/18 [00:00<00:00, 45.55 passes/s]
Translating MIL ==> MLModel Ops: 100%|███████████████████████████████████████████| 704/704 [00:01<00:00, 468.56 ops/s]
input {
  name: "input_0"
  type {
    multiArrayType {
      shape: 1
      shape: 3
      shape: 224
      shape: 224
      dataType: FLOAT32
    }
  }
}
output {
  name: "645"
  type {
    multiArrayType {
      dataType: FLOAT32
    }
  }
}
metadata {
  userDefined {
    key: "com.github.apple.coremltools.source"
    value: "torch==1.10.0a0+fb"
  }
  userDefined {
    key: "com.github.apple.coremltools.version"
    value: "4.1"
  }
}

{'inputs': '[["input_0", "0", "[1, 3, 224, 224]"]]', 'outputs': '[["645", "0", "[1, 1000]"]]', 'config': '{"spec_ver": "4", "backend": "cpu", "allow_low_precision": "True"}', 'metadata': '{"coremltool_ver": "4.1", "torch_ver": "torch==1.10.0a0+fb"}'}
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0826 13:27:12.690302 2477051 backend_detail.cpp:376] Warning: Backend [coreml] is not available. Execution of this Module is still possible by saving and loading on a device where the backend is available. (function codegen_backend_module)
graph(%self.1 : torch.jit.LoweredModule.coreml.__torch__.torchvision.models.mobilenetv2.MobileNetV2,
      %x.1 : Tensor):
  %51 : str = prim::Constant[value="Exception: Backend is not available."]()
  %50 : str = prim::Constant[value="AssertionError: "]()
  %14 : str = prim::Constant[value="forward"]() # <string>:5:62
  %48 : Tensor = prim::Uninitialized()
  %44 : Tensor = prim::Uninitialized()
  %typed_inputs.1 : Any[] = prim::ListConstruct(%x.1)
  %__backend.3 : __torch__.torch.classes.__backends__.coreml = prim::GetAttr[name="__backend"](%self.1)
  %8 : bool = prim::CallMethod[name="is_available"](%__backend.3) # <string>:4:19
  %49 : Tensor = prim::If(%8) # <string>:4:16
    block0():
      %__backend : __torch__.torch.classes.__backends__.coreml = prim::GetAttr[name="__backend"](%self.1)
      %__handles : Dict(str, Any) = prim::GetAttr[name="__handles"](%self.1)
      %15 : Any = aten::__getitem__(%__handles, %14) # <string>:5:47
      %17 : Any[] = prim::CallMethod[name="execute"](%__backend, %15, %typed_inputs.1) # <string>:5:24
      %18 : Any = prim::ListUnpack(%17)
      %20 : bool = prim::isinstance[types=[Tensor]](%18)
      %39 : Tensor = prim::If(%20) # <string>:6:18
        block0():
          %22 : Tensor = prim::unchecked_cast(%18)
          -> (%22)
        block1():
           = prim::RaiseException(%50) # <string>:6:18
          -> (%44)
      -> (%39)
    block1():
       = prim::RaiseException(%51) # <string>:9:18
      -> (%48)
  return (%49)

```

Reviewed By: raziel

Differential Revision: D30585154

fbshipit-source-id: 66c7d2e931be6eaa3c43a0ee131ea8046452449d
2021-09-17 00:25:14 -07:00
8241193d76 [Static Runtime] Introduce static_runtime::dict_unpack (#64771)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771

Test Plan:
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict`
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict`
- TBD: Perf impact measurement

Reviewed By: mikeiovine

Differential Revision: D30685083

fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9
2021-09-16 23:25:13 -07:00
e6c39a521b [ONNX] Update submodule to 1.10.1 (#63716) (#64576)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/64576 [ONNX] Update submodule to 1.10.1 (https://github.com/pytorch/pytorch/issues/63716)**

* [ONNX] Update IR version to 7

* [ONNX] update submodule to 1.10.1

* Disable some tests in caffe2 that fail b/c caffe2 doesn't support the
  new ops.
* Update Bazel file.

* Update expect files for new ONNX IR version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64576

Reviewed By: jansel

Differential Revision: D31006896

Pulled By: msaroufim

fbshipit-source-id: f3bf97709f23a5a2cd49c708e7363231f2c1961a
2021-09-16 22:29:54 -07:00
9117eed6ed [FX] Add torch.ops.profiler._record_function_{enter,exit} as stateful ops for DCE (#65180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65180

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D31007115

Pulled By: jamesr66a

fbshipit-source-id: 823b15db712a382a4f2a4fd409983d47bc067150
2021-09-16 21:31:54 -07:00
02dec91212 [quant] AO migration of the torch/quantization/utils.py (phase 1) (#64919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64919

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly. This migrates the quantization utilities.
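
The backward-compatibility pattern is to keep the old module as a thin re-export of the new location; a sketch (exact shim contents assumed, not verified):

```python
# torch/quantization/utils.py (BC shim)
from torch.ao.quantization.utils import *  # noqa: F401,F403
```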
ghstack-source-id: 138303325

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: jerryzh168

Differential Revision: D30899082

fbshipit-source-id: 85eb38c419e417147e71758b682cd095308dd0c9
2021-09-16 21:30:18 -07:00
64641eaee6 [acc_utils] Add print_model_info (#65045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65045

This is a useful tool for printing out all of the ops found in a model after acc_tracer. It assumes the provided model has no `call_module` or `call_method`, which is generally reasonable for a model that has been successfully traced by the acc_tracer.

Test Plan:
Tested locally. Sample output:
```
Model Info:
> placeholder: 1184
> get_attr: 655
> output: 2
> torch.fx.experimental.fx_acc.acc_ops.add: 2
> torch.fx.experimental.fx_acc.acc_ops.cat: 23
> torch.fx.experimental.fx_acc.acc_ops.embedding_bag: 576
> torch.fx.experimental.fx_acc.acc_ops.layer_norm: 15
> torch.fx.experimental.fx_acc.acc_ops.linear: 27
> torch.fx.experimental.fx_acc.acc_ops.matmul: 3
> torch.fx.experimental.fx_acc.acc_ops.mul: 17
> torch.fx.experimental.fx_acc.acc_ops.permute: 2
> torch.fx.experimental.fx_acc.acc_ops.reshape: 419
> torch.fx.experimental.fx_acc.acc_ops.sigmoid: 16
> torch.fx.experimental.fx_acc.acc_ops.slice_tensor: 630
> torch.fx.experimental.fx_acc.acc_ops.sum: 4
> torch.fx.experimental.fx_acc.acc_ops.tanh: 315
```

Reviewed By: 842974287

Differential Revision: D30954829

fbshipit-source-id: 5c4f0770667b72859b74099d9f4575284fc48bd2
2021-09-16 20:29:22 -07:00
8c38d141df Add back the owning_module fix (#65159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65159

This was a legit fix originally introduced in D30905949 (446d95a7f6). But we hesitated and removed it for some reason. Putting it back.

Reviewed By: 842974287

Differential Revision: D30996277

fbshipit-source-id: 3f5eede11dba2072e7cd5ae6ca7ac81d55fb75fa
2021-09-16 19:29:56 -07:00
c886406ce0 Add dropout shape inference as no-op in acc_tracer (#65113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65113

Register dropout as no-op in acc_tracer & Add shape inference for no-op

Test Plan:
buck test glow/fb/fx/acc_tracer:test_acc_shape_inference --- test_unary_15_dropout_no_op
buck test glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_dropout

Reviewed By: jfix71

Differential Revision: D30880679

fbshipit-source-id: 592fe50e17137c94c12727658191dedf08daf8cf
2021-09-16 18:26:55 -07:00
6f120ada50 Pin SciPy to 1.6.2 on Windows (#65017)
Summary:
Re-enable previously disabled test_distributions

Note: conda does not have SciPy-1.6.3, only 1.6.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65017

Reviewed By: seemethere

Differential Revision: D31003199

Pulled By: malfet

fbshipit-source-id: 96b9d2a833f703008bb1f4df9361db8ec6f8ccc6
2021-09-16 18:25:43 -07:00
0a5149019f Added logging for the Reducer's non-member functions. (#65023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65023

Added an optional logging parameter for non-member functions `compute_bucket_assignment_by_size` and `verify_replica0_across_processes`. If a logger is provided then `TORCH_CHECK` assertions are replaced with a wrapper that logs the error to the DDP reducer's logger before calling `TORCH_CHECK`. If a logger is not provided `TORCH_CHECK` is still called.

Modified python-side calls to `_compute_bucket_assignment_by_size` and `_verify_model_across_ranks` to include a logger whenever possible. A notable exception is when these non-member functions are called in DDP's constructor: we cannot pass in a logger there, as it may not have been initialized yet.

We also added 4 new tests: `test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger` which tests the `_compute_bucket_assignment_by_size` function to ensure that sparse tensors are rejected and the errors are logged.  `test_verify_model_across_rank_{with, without}_logger` calls `_verify_model_across_ranks` to ensure that ill-formed models (different ranks have different number of parameters compared to rank 0) are rejected and the errors are logged. The test `test_ddp_model_diff_across_ranks` remains unchanged - while it does construct a ill-formed DDP instance which triggers the error in `_verify_model_across_ranks`, we cannot check the logger because this error appears in the constructor.

Lastly, did some cleanup of the `test_ddp_model_diff_across_ranks` function to make the logic of choosing which context manager and error message to use more clean.

Test Plan:
**Build commands**
`buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --keep-going`

`buck build mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --keep-going`

**Test commands**
Test for `_compute_bucket_assignment_by_size` (Python)/ `compute_bucket_assignment_by_size` (C++)
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`

Test for `_verify_model_across_ranks` (Python)/`verify_replicas0_across_process` (C++)
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_verify_model_across_ranks_{with, without}_logger`

Test that constructs an ill-formed DDP instance. Only did cleanup of this function.
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_ddp_model_diff_across_ranks`

Reviewed By: rohan-varma

Differential Revision: D30924790

fbshipit-source-id: dae6fa82485a204a6a4b022f2d073417d07ebb2f
2021-09-16 16:39:39 -07:00
873255c6d9 OpInfo: nn.functional.conv2d (#63517)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Reference: https://github.com/facebookresearch/functorch/issues/78

Mostly inspired from https://github.com/pytorch/pytorch/issues/62882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63517

Reviewed By: heitorschueroff

Differential Revision: D30993855

Pulled By: zou3519

fbshipit-source-id: 7402f99addb4ef8f19c2ce1a09ed9006e737cc7e
2021-09-16 14:27:36 -07:00
4c4c03124b Remove old references to 9.2 in documentation (#65059)
Summary:
Removes references in .rst and README.md and comments in the Dockerfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65059

Reviewed By: malfet

Differential Revision: D30961110

Pulled By: janeyx99

fbshipit-source-id: 702a9a81bf08125ec4ac38bc656fc2c128c30018
2021-09-16 13:24:05 -07:00
4c15f8e8b4 Provide function interface for remove_duplicate_output_args (#65134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65134

So that its implementation can be abstracted and replaced

Test Plan: Run linter, CI

Reviewed By: 842974287

Differential Revision: D30966916

fbshipit-source-id: 92ec78c7410d0be14faecb0ba1eafdc74bab5a5d
2021-09-16 13:17:37 -07:00
f9c341fdf2 Add type annotation for TRTInterpreter.run (#65135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65135

Opportunistically adding type annotation as I work through fx2trt code base.

Test Plan: run linter and CI

Reviewed By: houseroad, 842974287

Differential Revision: D30903185

fbshipit-source-id: 3f700b57f4433f2d312c1ff2e6b99948e3c8845c
2021-09-16 13:16:06 -07:00
8a094e3270 [quant] ao migration for quantization mappings and fuser method mappings hg mv (#64985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64985

moving quantization_mappings.py and fuser_method_mappings.py to the ao folder while retaining backwards compatibility

also added dict test

ghstack-source-id: 138215312

Test Plan:
buck test mode/dev //caffe2/test:quantization

https://www.internalfb.com/intern/testinfra/testrun/7036874471986444

buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization

https://www.internalfb.com/intern/testinfra/testrun/5348024625792701

Reviewed By: z-a-f

Differential Revision: D30982551

fbshipit-source-id: 00f53bd44009d6012a7de852000aad6885131edb
2021-09-16 12:59:20 -07:00
9af6fe991c Remove CUDA 9.2 and older references from our cmake (#65065)
Summary:
Removes old CUDA references in our cuda.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65065

Reviewed By: malfet

Differential Revision: D30992673

Pulled By: janeyx99

fbshipit-source-id: 85b524089ed57e5acbc71720267cf05e24a8c20a
2021-09-16 12:54:49 -07:00
67570a60ba Disable ParallelTBB (#65092)
Summary:
As ParallelTBB's `at::get_thread_num` is not compatible with the general model used by OpenMP and ParallelNative (where it is a contiguous thread index within a parallel loop), see https://github.com/pytorch/pytorch/issues/64571#issuecomment-914691883

More examples of similar regressions: https://github.com/pytorch/pytorch/runs/3612142217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65092

Reviewed By: zhouzhuojie

Differential Revision: D30995936

Pulled By: malfet

fbshipit-source-id: db145b6a850d794f2c954f59f30249b291473e36
2021-09-16 12:38:45 -07:00
96cb05b49a Introduce tensorRT as builtin module for torch::deploy. (#63818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63818

ghstack-source-id: 138156957

Test Plan: next diff

Reviewed By: wconstab

Differential Revision: D30499309

fbshipit-source-id: 4ab1bc9896243c0c1503afb18fbfb196fc37404e
2021-09-16 11:27:51 -07:00
8eb21488fd [JIT] Improve BatchMM mutability handling (#65097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65097

Previously, BatchMM would skip any block containing any mutable
operators. Now it will avoid batching any operation whose inputs or
outputs are ever mutated. Specifically: consider a tree of ADD, T,
and MM nodes rooted at an ADD node.  If any input or output to any
node in the tree is ever mutated, then the entire tree will be ignored
by BatchMM.
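
A sketch of what the new check distinguishes:

```python
import torch

@torch.jit.script
def batchable(x, y, w1, w2):
    # An ADD-rooted tree of mm results: eligible for batching.
    return x.mm(w1) + y.mm(w2)

@torch.jit.script
def not_batchable(x, y, w1, w2):
    out = x.mm(w1) + y.mm(w2)
    w1.add_(1.0)  # an input of the tree is mutated, so the tree is skipped
    return out
```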

Test Plan: python test/test_jit.py TestBatchMM

Reviewed By: eellison

Differential Revision: D30973515

Pulled By: davidberard98

fbshipit-source-id: 9d836faa1ef0c9e3fefe0ffc0bd265f275471f48
2021-09-16 10:46:14 -07:00
f309f8fbd4 [quant] ao migration of observer and qconfig (#64982)
Summary:
(Had to recreate this diff so it wasn't dependent on the stack)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64982

migration of qconfig.py and observer.py to torch/ao/quantization using new test format
ghstack-source-id: 138215256

Test Plan:
buck test mode/opt //caffe2/test:quantization

https://www.internalfb.com/intern/testinfra/testconsole/testrun/8444249354294701/

buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization

https://www.internalfb.com/intern/testinfra/testrun/3940649742829796

Reviewed By: z-a-f

Differential Revision: D30982534

fbshipit-source-id: 48d08969b1984311ceb036eac0877c811cd6add9
2021-09-16 10:33:16 -07:00
97e86cf319 [Fix] Raise error when empty index tensor is passed (gather) (#65006)
Summary:
See https://github.com/pytorch/pytorch/pull/63312#issuecomment-919330081 for context.

cc: ezyang ysiraichi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65006

Reviewed By: mruberry

Differential Revision: D30937730

Pulled By: ezyang

fbshipit-source-id: a8f77b1f40d07e7e3bef6caaafa119685f297638
2021-09-16 10:14:26 -07:00
874f9bd509 [FX] Gate FXGraphDrawer on whether pydot is installed (#65088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65088
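
The usual gating pattern, as a sketch (not necessarily the exact code in `torch.fx.passes.graph_drawer`):

```python
try:
    import pydot
    HAS_PYDOT = True
except ImportError:
    HAS_PYDOT = False

if not HAS_PYDOT:
    # Importing the module still works; instantiating the drawer raises a
    # clear error instead of an ImportError at import time.
    class FxGraphDrawerStub:
        def __init__(self, *args, **kwargs):
            raise RuntimeError("FXGraphDrawer requires pydot to be installed")
```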

Test Plan: Imported from OSS

Reviewed By: khabinov

Differential Revision: D30967951

Pulled By: jamesr66a

fbshipit-source-id: dba2f13a47889b3d4187de925b4fe74ee90b7f79
2021-09-16 10:04:33 -07:00
2c57bbf521 add support for indexing to meshgrid (#62722)
Summary:
This is step 3/7 of https://github.com/pytorch/pytorch/issues/50276. It only adds support for the argument but doesn't implement new indexing modes yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62722

Test Plan:
Verified this is not FC breaking by adding logging to both meshgrid
overloads and then calling meshgrid twice:

`meshgrid(*tensors)`
  and
`meshgrid(*tensors, indexing='ij')`

This confirmed that the former signature triggered the original native
function and the latter signature triggered the new native function.

Reviewed By: H-Huang

Differential Revision: D30394313

Pulled By: dagitses

fbshipit-source-id: e265cb114d8caae414ee2305dc463b34fdb57fa6
2021-09-16 09:59:49 -07:00
67bd2a31b5 [Reland] Add python mode (#64360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64360

This PR adds a (private) enable_python_mode context manager.
(see torch/utils/_python_dispatch.py).
enable_python_mode accepts the type of a __torch_dispatch__ object
as its argument. Whenever an operator gets called inside of the
context manager, it dispatches to the __torch_dispatch__ of
the passed-in type.

Example usage:
```
with enable_python_mode(LoggingTensor):
    z = torch.empty([])
    assert isinstance(z, LoggingTensor)
```

There are quite a few changes that were made to support this.

First, we added TorchDispatchTypeObject, a C++ struct that represents the
type of a `__torch_dispatch__` object (e.g. LoggingTensor).
It holds both the PyObject* representing the class and a PyInterpreter*
so we know which Python interpreter it came from.

Next, we updated the concrete_dispatch_fn in python_variable.cpp to accept
a `const std::shared_ptr<TorchDispatchTypeObject>&` argument. When this
is null, dispatching happens as usual. When it is non-null, we prepend
the TorchDispatchTypeObject's PyObject* to the overloaded args list so that
it is considered first for dispatch.

To get that to work, we changed how `handle_torch_dispatch_no_python_arg_parser`
works. The "overloaded args list" previously only consisted of Tensor PyObjects,
but now it can have types in addition to Tensors!
- We renamed `append_overloaded_arg` to `append_overloaded_tensor`
- We added a new `append_overloaded_type` that appends a type to
overloaded_args
- We added special handling in `handle_torch_dispatch_no_python_arg_parser`
and `append_overloaded_arg` to handle types in addition to Tensors.

Then, there is PythonMode and PythonModeTLS.
- We reuse the DispatchKey::Python dispatch key as a mode key
- We use PythonMode::enter and PythonMode::exit to enable/disable
DispatchKey::Python and set the PythonModeTLS.
- PythonModeTLS stores a TorchDispatchTypeObject as metadata.
- PythonMode is in libtorch_python, and PythonModeTLS is in ATen.
This split is due to the libtorch_python library boundary (because we need
to save TLS in ATen/ThreadLocalState)
- We modify the PythonFallbackKernel to look up
the relevant TorchDispatchTypeObject (if Python Mode is active) and
dispatch using it.

There are two more miscellaneous changes:
- internal_new_from_data (torch/csrc/utils/tensor_new.cpp) gets an
exclude guard. enable_python_mode currently does not handle
torch.tensor and the exclude guard is to prevent a bug.

Future:
- This PR does not allow for the nesting of Python modes. In the future we
should be able to enable this with a more sane no_dispatch API and by changing
the TLS to a stack. For now I did not need this for CompositeImplicitAutograd testing.

Test Plan: - new tests

Reviewed By: ezyang

Differential Revision: D30698082

Pulled By: zou3519

fbshipit-source-id: 7094a90eee6aa51f8b71bc4d91cfb6f49e9691f8
2021-09-16 09:02:30 -07:00
8800a8b428 Revert D30888794: [Model Averaging] Simplify PostLocalSGD Optimizer API
Test Plan: revert-hammer

Differential Revision:
D30888794 (3d312b3b8e)

Original commit changeset: 21261b480f6b

fbshipit-source-id: 87abb7e8cd9ecaac909ec6c3ee053fa7c4ae1975
2021-09-16 06:39:57 -07:00
83878e19ff Improve LSTM documentation for proj_size > 0 (#65102)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65053. Although the documentation states that:

fe0f9d1daf/torch/nn/modules/rnn.py (L500-L506)

It seems that the definition of `weight_ih_l[k]` could be improved by specifying what happens when `k > 0` and `proj_size > 0`. As `proj_size` is only used in LSTM, no changes are needed for the other RNNs.
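
A quick shape check of the documented behavior:

```python
import torch

lstm = torch.nn.LSTM(input_size=10, hidden_size=20, num_layers=2, proj_size=5)
# Layer 0 consumes the input; layers k > 0 consume the projected hidden state.
print(lstm.weight_ih_l0.shape)  # torch.Size([80, 10]): (4*hidden_size, input_size)
print(lstm.weight_ih_l1.shape)  # torch.Size([80, 5]):  (4*hidden_size, proj_size)
print(lstm.weight_hr_l0.shape)  # torch.Size([5, 20]):  (proj_size, hidden_size)
```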

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65102

Reviewed By: supriyar

Differential Revision: D30975781

Pulled By: jbschlosser

fbshipit-source-id: 12df06e5e6a8d5de0ad10fb15e33c3e6311c11d3
2021-09-16 06:35:27 -07:00
f69cf3cf2f [Static Runtime] Use FastSet instead of std::set everywhere (#65114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65114

There doesn't seem to be any reason to use std::set for sets of pointers, right?
ghstack-source-id: 138198504

Reviewed By: hlu1

Differential Revision: D30978450

fbshipit-source-id: 4599c6249fda3a89959f839d3bf6400c5891f82c
2021-09-15 21:44:54 -07:00
0bda7476cf Reduce PyToch Warnings - Cast fixes from D26624430 (#65015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65015

Split out the existing fixes into a diff we can land separately.

Test Plan:
pooled_embeddings_modules_test

Parsing buck files: finished in 8.3 sec
Creating action graph: finished in 38.3 sec
[RE] Metadata: Session ID=[https://fburl.com/b/reSessionID-9bea421c-875e-4168-9e00-7d67479b1a9f]
[RE] Waiting on 46 remote actions. Completed 905 actions remotely, action cache hit rate: 5.08%.
Downloaded 7002/8869 artifacts, 560.00 Mbytes, 11.6% cache miss (for updated rules)
Building: finished in 13:12.4 min (100%) 31964/31964 jobs, 17344/31964 updated
  Total time: 13:59.1 min
More details at https://www.internalfb.com/intern/buck/build/b9a58bba-e0aa-4c2b-8824-a0c4074b0954
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 28cbe2b1-6fbc-450c-91c9-c06a7ff1d53b
Trace available for this run at /tmp/tpx-20210914-114921.005504/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1407375088325000
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - main (23.849)
    {emoji:2702} Omit: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_1_cuda)
Test output:
> This test was disabled.
To run this test locally, add the command line flag --run-disabled to your test command (prefix with -- if using buck).
To view why this is disabled or re-enable this test in the test console, visit https://our.intern.facebook.com/intern/testinfra/testdetail/562949981577936
    ↻ Skip: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_0_cpu) (13.201)
Test output:
> Repro command : $(cat "/tmp/tpx-20210914-114921.005504/dc174692-8d92-4459-8b8f-201643c6ab7d/execution_command")
Skipped: CUDA is not available or no GPUs detected
stdout:

stderr:

    ↻ Skip: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation_autograd (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_1_cuda) (13.201)
Test output:
> Repro command : $(cat "/tmp/tpx-20210914-114921.005504/dc174692-8d92-4459-8b8f-201643c6ab7d/execution_command")
Skipped: CUDA is not available or no GPUs detected
stdout:

stderr:

    ✓ Pass: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_compatibility (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_1_cuda) (13.201)
    ↻ Skip: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation_autograd (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_0_cpu) (13.201)
Test output:
> Repro command : $(cat "/tmp/tpx-20210914-114921.005504/dc174692-8d92-4459-8b8f-201643c6ab7d/execution_command")
Skipped: CUDA is not available or no GPUs detected
stdout:

stderr:

    ✓ Pass: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_compatibility (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_0_cpu) (13.201)
    ✓ Pass: caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - main (13.201)
Summary
  Pass: 3
  Skip: 3
    ↻ caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_0_cpu)
    ↻ caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation_autograd (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_1_cuda)
    ↻ caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation_autograd (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_0_cpu)
  Omit: 1
    {emoji:2702} caffe2/torch/fb/sparsenn:pooled_embeddings_modules_test - test_permutation (caffe2.torch.fb.sparsenn.tests.pooled_embeddings_modules_test.PooledEmbeddingModulesTest_1_cuda)
  ListingSuccess: 1

shape_inference_mode_test

[amrelshennawy@devvm855.ftw0 /data/users/amrelshennawy/fbsource/fbcode] buck test caffe2/torch/fb/sparsenn:shape_inference_mode_test
Downloaded 6/18 artifacts, 11.69 Kbytes, 53.8% cache miss (for updated rules)
Building: finished in 1.6 sec (100%) 110/110 jobs, 26/110 updated
  Total time: 1.8 sec
More details at https://www.internalfb.com/intern/buck/build/0e5f45b2-5777-49e9-a3b0-09bd05687b2b
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 99509108-5ff3-4b1a-b7b3-2f43c4036209
Trace available for this run at /tmp/tpx-20210914-120119.723607/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/6192449502564504
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:shape_inference_mode_test - main (0.374)
    ✓ Pass: caffe2/torch/fb/sparsenn:shape_inference_mode_test - test_set_upper_bound_mode (torch.python.fb.shape_inference_mode_test.TestShapeInferenceMode) (0.249)
    ✓ Pass: caffe2/torch/fb/sparsenn:shape_inference_mode_test - test_set_upper_bound_settings (torch.python.fb.shape_inference_mode_test.TestShapeInferenceMode) (0.253)
Summary
  Pass: 2
  ListingSuccess: 1

test
[amrelshennawy@devvm855.ftw0 /data/users/amrelshennawy/fbsource/fbcode] buck test caffe2/torch/fb/sparsenn:test
Parsing buck files: finished in 1.1 sec
Creating action graph: finished in 38.6 sec
Downloaded 6/30 artifacts, 11.29 Kbytes, 66.7% cache miss (for updated rules)
Building: finished in 41.6 sec (100%) 26783/26783 jobs, 43/26783 updated
  Total time: 01:21.4 min
More details at https://www.internalfb.com/intern/buck/build/8f794eb0-3d3c-4ee3-9aec-5ec5cec1b0f4
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: a06164b5-d7d7-444c-a4ff-e312cb9970d9
Trace available for this run at /tmp/tpx-20210914-120428.464799/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/3377699789132066
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (16.637)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_dense_mlp_quantize_ops (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.870)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges_shape_inference_mode (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.922)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges_to_dense_caffe2 (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.348)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_simple (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.370)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_recat_embedding_grad_output_mixed_D_batch (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.516)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_xl_embedding_bag_byte_rowwise_offsets (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.515)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.861)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_xl_embedding_bags (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.873)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges_out (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.969)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_pack_segments_pad_minf (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.104)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_deprecated_multiple_runs (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.342)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_deprecated_sigrid_transform (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.664)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges_out_empty_batch (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.745)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_lengths (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.771)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_multiple_runs_torch_bind (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.944)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges_empty_batch (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.944)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges_shape_inference_mode (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.245)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_prior_correction_calibration_prediction_nonbinary (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.328)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_8bitfakefused (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.501)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_deprecated_ranges (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (20.608)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_lengths_inference_tests (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (22.403)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_broadcast_cat_out (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (23.025)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_lengths_negatives_tests (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (23.956)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_broadcast_cat (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (24.100)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_transform_torch_bind (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (17.384)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_expand_values_scores_tensor (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.672)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_expand_empty_values_scores_tensor (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.679)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_pack_segments (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.726)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_expand_ranges_tensor (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (17.567)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_box_cox_all_zeros (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.036)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_rowwise_prune_op_32bit_indices (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.430)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_transform_torch_bind_upper_bound (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.176)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_expand_dense_feature_tensor (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.006)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges_gather (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.555)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_xl_int_nbit_split_embedding_codegen_lookup_function (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.791)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_pack_segments_smaller_max_len (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.737)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_pos (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.212)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_xl_embedding_bag_2bit_rowwise_offsets (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.612)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_prior_correction_calibration_prediction_binary (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.858)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_tracing_torch_bind_upper_bound (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.002)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_deprecated_tracing (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (20.824)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_1d_counts (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.976)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_recat_embedding_grad_output_mixed_D (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.832)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_one_hot_lengths (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.844)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.558)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_box_cox_non_zeros (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.418)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_prior_correction_calibration_accumulate (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.222)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_unsqueeze_vector (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.327)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_xl_embedding_bag_4bit_rowwise_offsets (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.772)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.425)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_broadcast_cat_backward (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.956)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_expand_offsets_tensor (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.320)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.923)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_one_hot (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.549)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_deprecated_sigrid_transforms_create (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.932)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges_gather_lengths_to_offsets (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.807)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_length_to_row_idx (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (17.738)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_tracing_torch_bind (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (20.175)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_box_cox_mixed (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.116)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_1d_bins (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.671)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_permute_out (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.002)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_create_sigrid_transforms_torch_bind (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (18.151)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_ranges_torch_bind (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (16.780)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_no_bins (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.185)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_cumsum (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.242)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_le_one (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.876)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_pack_and_unpack_segments (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (19.222)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_self_binning_histogram_quantile_dims (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (20.007)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_sigrid_hash_op (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.959)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_rowwise_prune_op_64bit_indices (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (18.601)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_ranges_torch_bind_upper_bound (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (17.977)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_broadcast_stack (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (22.588)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_multiple_runs_torch_bind_upper_bound (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (15.342)
Summary
  Pass: 73
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/3377699789132066

Did not run (no GPU on my devserver):
gpu_test
cpp_gpu_test

Reviewed By: r-barnes

Differential Revision: D30940399

fbshipit-source-id: d867ca646723340775a49c1b983cdab64f2d67d8
2021-09-15 21:20:41 -07:00
db601434ef Bug fix (#65105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65105

Using buildErrorMessage in external_functions.cpp was breaking the build target nnc_cpu_backend_lib, because buildErrorMessage is defined in tensorexpr/kernel.cpp, which is not included in mobile builds and which we don't want to include there.
Also, buildErrorMessage wraps error messages for the fuser, whereas nnc_aten_conv2d is now only used in the AOT workflow and is not called by the fuser, so wrapping assertion failures with a fuser error message would be misleading for the AOT workflow.

Test Plan:
Before fix:
```
+ buck build //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc
Downloading... 3/3 artifacts, 24.81 Kbytes, 0.0% cache miss (for updated rules)
Building... 1.7 sec (99%) 4639/4641 jobs, 3/4641 updated
     - //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc#binary... 0.7 sec (running c++ link[0.6 sec])
Command failed with exit code 1.

command: [/data/users/priyaramani/fbsource/buck-out/cells/fbcode/gen/aab7ed39/tools/build/buck/wrappers/__ld__/ld.sh, --ld=/data/users/priyaramani/fbsource/fbcode/third-party-buck/platform009/build/llvm-fb/9.0.0/bin/clang++, --cc=/data/users/priyaramani/fbsource/buck-out/cells/fbcode/gen/aab7ed39/tools/build/buck/wrappers/__fbc...
<truncated>
...

stderr: clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
ld.lld: error: undefined symbol: torch::jit::tensorexpr::buildErrorMessage(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
>>> referenced by external_functions.cpp:69 (xplat/caffe2/torch/csrc/jit/tensorexpr/external_functions.cpp:69)
>>>               ../nnc_cpu_backend_lib#compile-external_functions.cpp.o50e02bc2,platform009-clang/torch/csrc/jit/tensorexpr/external_functions.cpp.o:(nnc_aten_conv2d) in archive /data/users/priyaramani/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/nnc_cpu_backend_lib#platform009-clang,static/libnnc_cpu_backend_lib.a
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)

    When running <c++ link>.
    When building rule //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc#binary (ovr_config//platform/linux:x86_64-fbcode).
clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
ld.lld: error: undefined symbol: torch::jit::tensorexpr::buildErrorMessage(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
>>> referenced by external_functions.cpp:69 (xplat/caffe2/torch/csrc/jit/tensorexpr/external_functions.cpp:69)
>>>               ../nnc_cpu_backend_lib#compile-external_functions.cpp.o50e02bc2,platform009-clang/torch/csrc/jit/tensorexpr/external_functions.cpp.o:(nnc_aten_conv2d) in archive /data/users/priyaramani/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/nnc_cpu_backend_lib#platform009-clang,static/libnnc_cpu_backend_lib.a
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)

Command failed with exit code 1.

command: [/data/users/priyaramani/fbsource/buck-out/cells/fbcode/gen/aab7ed39/tools/build/buck/wrappers/__ld__/ld.sh, --ld=/data/users/priyaramani/fbsource/fbcode/third-party-buck/platform009/build/llvm-fb/9.0.0[DEBUG kernel.cpp:2766]       }
```

After fix:
```
+ buck build //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc
Action graph will be rebuilt because files have been added or removed.
clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 11/15 artifacts, 78.37 Kbytes, 15.4% cache miss (for updated rules)
Building: finished in 7.4 sec (100%) 4718/4718 jobs, 46/4718 updated
  Total time: 7.5 sec
More details at https://www.internalfb.com/intern/buck/build/b87be016-340c-49f8-b832-0c1de70aae9e
```

Reviewed By: ZolotukhinM

Differential Revision: D30975952

fbshipit-source-id: 85c028cc6af63c03b505b51302f5158c23e1a047
2021-09-15 20:11:30 -07:00
2bb898e039 [acc_ops] Add support for torch variants of squeeze and mul (#65037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65037

att

Test Plan: updated unit tests

Reviewed By: yuhc

Differential Revision: D30952224

fbshipit-source-id: aaf75b27b4fc6c0436ba7bfcf324f761b900171b
2021-09-15 19:41:04 -07:00
206646d6ed Add NNC AOT Compiler executable (#63994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63994

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30582149

Pulled By: priyaramani

fbshipit-source-id: 3bbf085428824c3cb308e006c18bb0a57f50fef6
2021-09-15 19:18:24 -07:00
e0ecd09011 [quant] AO migration of the _correct_bias.py, _equalize.py, and _learnable_fake_quantize.py (#64917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64917

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates from torch.quantization to torch.ao.quantization the following files:
- `_correct_bias.py`
- `_equalize.py`
- `_learnable_fake_quantize.py`

**Note:** These files are migrated completely without any warning. The old location is thus silently deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestBiasCorrection`

Reviewed By: vkuzo

Differential Revision: D30898565

fbshipit-source-id: 1d39be2539dd1adfcb42e16bdcc0daf5c8316bbd
2021-09-15 18:15:39 -07:00
3ceecebed0 .circleci/.jenkins: Remove 9.2 references in CI (#65024)
Summary:
Removes 9.2 references in CI scripts and configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65024

Reviewed By: driazati

Differential Revision: D30945948

Pulled By: janeyx99

fbshipit-source-id: 77890a00520c61500a934a90a74e3fcca84c09b5
2021-09-15 18:06:57 -07:00
d9d8250e3f .github: GHA add retry for docker run in chown workspace step (#65104)
Summary:
This should help prevent further errors in GHA workflows during the Chown Workspace step such as https://github.com/pytorch/pytorch/runs/3614067053

I did not add retries to other steps with docker run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65104

Reviewed By: seemethere

Differential Revision: D30976330

Pulled By: janeyx99

fbshipit-source-id: e403008548aa01c9a0a4ccebe56df0e889dd045c
2021-09-15 18:02:07 -07:00
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
c151d62f45 [quant] AO migration of the quant_types.py (phase 1) (#64916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64916

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates the quant_type.py from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization`

Reviewed By: vkuzo

Differential Revision: D30898422

fbshipit-source-id: 3e6126b49f0565a4136d6928cea9eb25368927ff
2021-09-15 17:30:00 -07:00
a42996f16e [quant] AO migration of the fuse_modules.py (phase 1) (#64913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64913

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates fuse_modules.py from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: vkuzo

Differential Revision: D30882819

fbshipit-source-id: 1926ad6aa49136aceb5b625dcef4bfde3a2860d4
2021-09-15 17:28:47 -07:00
7e9c599784 [TensorExpr] Add a method for sanitizing Var and Buf names in Stmt. (#65010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65010

This pass ensures all names are legal identifiers and not duplicated.

Fixes #52727.
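
For intuition, a minimal Python sketch of the general sanitize-and-deduplicate technique (an illustration only, not the actual C++ pass):

```python
import re

def sanitize_names(names):
    """Make every name a legal identifier and unique within the list."""
    seen = {}
    out = []
    for name in names:
        base = re.sub(r"\W", "_", name) or "v"  # replace illegal characters
        if base[0].isdigit():
            base = "_" + base                   # identifiers cannot start with a digit
        n = seen.get(base, 0)
        seen[base] = n + 1
        out.append(base if n == 0 else f"{base}_{n}")
    return out

print(sanitize_names(["x", "x", "1y", "a-b"]))  # ['x', 'x_1', '_1y', 'a_b']
```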

Test Plan: Imported from OSS

Reviewed By: bertmaher, navahgar

Differential Revision: D30939717

Pulled By: ZolotukhinM

fbshipit-source-id: 7dbe7f937de41f22ad49137a5e067d698443ed63
2021-09-15 17:15:06 -07:00
3d5923366d .github: Enable only specific workflows for canary (#65099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65099

Utilizes ciflow to enable only specific workflows for
pytorch/pytorch-canary to reduce noise on that specific repository

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30973691

Pulled By: seemethere

fbshipit-source-id: 371765535b42a00bd72c2551c4faebf733d759f0
2021-09-15 16:53:12 -07:00
59c486f2f3 ci: Disable jit legacy on circleci, enable on gha (#65106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65106

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D30976186

Pulled By: seemethere

fbshipit-source-id: 8958f821eab9aa284496c57915894ed70f6b2fff
2021-09-15 16:11:38 -07:00
b75d3cae4c CI: Upgrade windows 10.1 jobs to 10.2 (#65080)
Summary:
These are the first 2 steps in the following task:
1. Upgrade 10.1 to 10.2
2. Migrate force_on_cpu job to GHA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65080

Test Plan: https://github.com/pytorch/pytorch/pull/65086

Reviewed By: seemethere

Differential Revision: D30973655

Pulled By: janeyx99

fbshipit-source-id: 67ab69ea99ff9e0336400a7173efef6d7daac07c
2021-09-15 16:04:50 -07:00
3f27c1ae78 Replace windows 10.2 smoke tests on PRs to be 11.3 (#65090)
Summary:
As we default to linux CUDA 11.3 on PRs, we should do the same thing with Windows (instead of having 10.2 be the default). This means that 10.2 will now be master only, and 11.3 windows smoke tests will run on every PR.

This also copies over the "run smoke tests only" config; removing it will happen in a separate PR once a firmer decision is made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65090

Reviewed By: seemethere

Differential Revision: D30968382

Pulled By: janeyx99

fbshipit-source-id: c73f9a2cc800b678909365c4d80627d29fc09f94
2021-09-15 16:01:07 -07:00
ec1af11c2e Revert D30883290: [Static Runtime] Move MemoryPlanner out into memory_planner.cpp
Test Plan: revert-hammer

Differential Revision:
D30883290 (0e11454d19)

Original commit changeset: a37570f8d943

fbshipit-source-id: 65c57a2b0d2e3c7006765195dd519e8cf2472f72
2021-09-15 15:40:34 -07:00
37bcefa248 [quant] Removing hardcoded "torch.quantization.observer" for migration (#64981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64981

This would have caused errors when observer.py was moved to ao.

see: D30391189
ghstack-source-id: 138118430

Test Plan:
buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_dynamic_quant_multi_uses (quantization.jit.test_quantize_jit.TestQuantizeDynamicJitPasses)'

buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_save_load_state_dict_script (quantization.core.test_workflow_module.TestObserver)'

Reviewed By: supriyar

Differential Revision: D30432008

fbshipit-source-id: 754727a89c78f6ceada6f8ff92c304f3953f38fc
2021-09-15 15:22:19 -07:00
fe0f9d1daf [Caffe2][easy] Avoid spurious vector copy in TransposeOp (#64403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64403

No need to copy to the heap here.
ghstack-source-id: 138033019

Test Plan: CI

Reviewed By: smacke

Differential Revision: D30712506

fbshipit-source-id: 5f4131b2569ebb1f5092262aaddb17215dea88f1
2021-09-15 15:15:51 -07:00
208cf051d4 [Caffe2] Don't pass vector by value in SqueezeOp (#64400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64400

There appears to be no need to copy this vector.
ghstack-source-id: 138033020

Test Plan: CI

Reviewed By: smacke

Differential Revision: D30711014

fbshipit-source-id: b9fcf3d496a663b8478aa22d52b2c41f8f85e90f
2021-09-15 15:14:30 -07:00
177ebea4c5 Use RDS for build size tracking (#64303)
Summary:
This adds 2 utilities: `register_rds_table` and `rds_write`. `register_rds_table` needs to be called once with the schema for the data that `rds_write` will write. These go to a lambda called `rds-proxy`, which will write to/read from the DB as necessary. This data can then be arbitrarily queried via `rds-proxy` (for use in CI) or on metrics.pytorch.org (for analysis).

It also hooks these up for build size tracking (which previously was not working on GHA)
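
A hypothetical usage sketch of the two helpers (the endpoint, payload shape, and signatures are assumptions; the real implementations live in the CI tooling):

```python
import json
import urllib.request

RDS_PROXY_URL = "https://example.invalid/rds-proxy"  # placeholder endpoint

def _post(payload):
    req = urllib.request.Request(
        RDS_PROXY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

def register_rds_table(table, schema):
    # Called once so the proxy can create the table with the right columns.
    _post({"op": "register", "table": table, "schema": schema})

def rds_write(table, rows):
    # Appends rows that match the registered schema.
    _post({"op": "write", "table": table, "rows": rows})

# register_rds_table("build_size", {"build": "str", "size_bytes": "int"})
# rds_write("build_size", [{"build": "linux-xenial-py3.6", "size_bytes": 123456789}])
```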

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64303

Reviewed By: mruberry

Differential Revision: D30941182

Pulled By: driazati

fbshipit-source-id: 12c5575ddd29902477464fc989ad76a052306b9b
2021-09-15 14:47:37 -07:00
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants that the codegen requires (e.g. reduction axes).

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
59988f81bd Add embedding shape analysis (#64323)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64323

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738145

Pulled By: eellison

fbshipit-source-id: be12408330d671bc65cf645aa2c20fafd954e6a9
2021-09-15 13:45:48 -07:00
29514bfcdb Max Pool with indices (#64121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64121

Add support for aten operators which return multiple outputs

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738142

Pulled By: eellison

fbshipit-source-id: 0d7e51187bd5e3e9b43f0fdb5178366a97aec943
2021-09-15 13:45:46 -07:00
2626cd3ba4 Add Maxpool to shape analysis / Opinfo (#63530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63530

how to review: check that the generated inputs are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738147

Pulled By: eellison

fbshipit-source-id: cf52339e572ee04e0d6167fd95d8a82d58ea7706
2021-09-15 13:44:33 -07:00
425f173f9d [quant][refactor] Change the structure of the ao migration tests (#64912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64912

The test naming was confusing and ambiguous. The file was changed to reflect the framework that is being migrated ("quantization" instead of "quantize"). Also, the common testing class was extracted out
ghstack-source-id: 138157450

Test Plan: `buck test mode/dev //caffe2/test:quantization -- TestAOMigrationQuantization`

Reviewed By: vkuzo

Differential Revision: D30898214

fbshipit-source-id: 017f95995271d35bcdf6ff6a1b3974b837543e84
2021-09-15 13:15:43 -07:00
2967a48b78 Add retries to ECR login step (#65013)
Summary:
Switch retry mode from `legacy` to `standard` (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html#cli-usage-retries-configure) and up the number of retries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65013

Reviewed By: zhouzhuojie, mruberry

Differential Revision: D30943292

Pulled By: driazati

fbshipit-source-id: 0a21e9b4eacbb77e6aca22f9256d94cd591b23cd
2021-09-15 13:12:57 -07:00
df3d649380 To add state dict and load_dict for Chained Scheduler (#65034)
Summary:
Adding state_dict() and load_state_dict() methods for Chained Scheduler
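
A minimal usage sketch of the new methods (the inner schedulers are illustrative):

```python
import torch
from torch.optim.lr_scheduler import ChainedScheduler, ConstantLR, ExponentialLR

opt = torch.optim.SGD(torch.nn.Linear(4, 2).parameters(), lr=0.1)
sched = ChainedScheduler([ConstantLR(opt, factor=0.5, total_iters=2),
                          ExponentialLR(opt, gamma=0.9)])

state = sched.state_dict()    # can now be saved in a checkpoint
sched.load_state_dict(state)  # and restored later
```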

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65034

Reviewed By: prabhat00155, nateanl

Differential Revision: D30958207

Pulled By: datumbox

fbshipit-source-id: 1a587a330d34e0548e891a39f8fb5a3d251b71fa
2021-09-15 13:11:41 -07:00
6512838fab [ONNX] Enhance shape (two changes merged) (#64585)
Summary:
Enhanced shape inference by introducing typeReliableMap.
[ONNX] exporter changes for torch hub models (https://github.com/pytorch/pytorch/issues/62856)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64585

Reviewed By: ezyang

Differential Revision: D30870418

Pulled By: msaroufim

fbshipit-source-id: 87a294799cb87d649d1d13b6114a5cfbac9be15c

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-09-15 13:02:19 -07:00
0e11454d19 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65011

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing the graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D30883290

fbshipit-source-id: a37570f8d9430224a6987d2190bcf81cf875043d
2021-09-15 12:57:39 -07:00
db134a6843 (torch.distributed.elastic) properly format traceback on error (#65041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65041

Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/64036 where the traceback of the error handler is printed out rather than the traceback of the actual exception.

Fixes https://github.com/pytorch/pytorch/issues/60910
Closes https://github.com/pytorch/pytorch/issues/60910

BEFORE (note that the `py_callstack` is NOT the traceback of the RuntimeError):
```
**************************************************************************************************************************************************************************************************************************************************
                                                                                                              run_script_path FAILED
==================================================================================================================================================================================================================================================
Root Cause:
[0]:
  time: 2021-09-14_22:01:06
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1092727)
  error_file: /tmp/torchelastic_aeyvjbpe/none_8zuih7tj/attempt_0/0/error.json
  msg:
    {
      "message": "RuntimeError: rasing error since --throw was specified",
      "extraInfo": {
        "py_callstack": [
          "  File \"<string>\", line 1, in <module>\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 116, in spawn_main\n    exitcode = _main(fd, parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/spawn.py\", line 129, in _main\n    return self._bootstrap(parent_sentinel)\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 315, in _bootstrap\n    self.run()\n",
          "  File \"/usr/local/fbcode/platform009/lib/python3.8/multiprocessing/process.py\", line 108, in run\n    self._target(*self._args, **self._kwargs)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/multiprocessing/spawn.py\", line 59, in _wrap\n    fn(i, *args)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/api.py\", line 382, in _wrap\n    ret = record(fn)(*args_)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1631682066"
      }
    }

==================================================================================================================================================================================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
**************************************************************************************************************************************************************************************************************************************************
```

AFTER (note the traceback is the traceback of the RuntimeError):
```
********************************************************************************
                             run_script_path FAILED
================================================================================
Root Cause:
[0]:
  time: 2021-09-14_21:49:25
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1014681)
  error_file: /tmp/torchelastic_q0zods2c/none_qwmz5dgj/attempt_0/0/error.json
  msg: Traceback (most recent call last):
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/run.py", line 671, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/kiuk/tmp/test.py", line 55, in <module>
      main()
    File "/data/users/kiuk/fbsource/fbcode/buck-out/dev/gen/caffe2/run#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
      return f(*args, **kwargs)
    File "/home/kiuk/tmp/test.py", line 25, in main
      raise RuntimeError("rasing error since --throw was specified")
  RuntimeError: rasing error since --throw was specified

================================================================================
Other Failures:
  <NO_OTHER_FAILURES>
********************************************************************************
```

Test Plan:
(see summary for before and after)

`test.py` contents:
```
import argparse
import os
import sys

import torch
import torch.distributed as dist
import torch.nn.functional as F

from torch.distributed.elastic.multiprocessing.errors import record

def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--init_method", type=str, default="env://")
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--throw", action="store_true", default=False)
    parser.add_argument("--exit", action="store_true", default=False)
    return parser.parse_args()

@record
def main():
    args = parse_args(sys.argv[1:])

    if args.throw:
        raise RuntimeError("rasing error since --throw was specified")

    if args.exit:
        sys.exit(1)

    init_method=args.init_method
    backend=args.backend

    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])

    print(f"initializing `{backend}` process group with rank={rank}, world_size={world_size} at {init_method}")

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        world_size=world_size,
        rank=rank)

    print(f"successfully initialized process group with rank={dist.get_rank()}, world_size={dist.get_world_size()}")

    t = F.one_hot(torch.tensor(rank), num_classes=world_size)
    dist.all_reduce(t)
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(f"derived world size: {derived_world_size} != actual world size: {world_size}")
    else:
        print(f"sucessfully derived world size: {derived_world_size} (expected: {world_size}). Exiting")

if __name__ == "__main__":
    main()
```

run it as:

```
$ python -m torch.distributed.run --nproc_per_node 2 test.py --throw
```

Reviewed By: cbalioglu

Differential Revision: D30953731

fbshipit-source-id: bbea04c59c2aec58969cf44d8e3723d5f8abe8a8
2021-09-15 12:50:21 -07:00
4bf7959de2 Remove run_functional_checks from test_autograd and create necessary OpInfos (#64993)
Summary:
OpInfo tracker: https://github.com/pytorch/pytorch/issues/54261

 - Eliminate duplicated testing logic in test_autograd
 - Moved tests that rely on this testing logic to use OpInfos
   - `cat` already has OpInfo (no action needed)
   - Created OpInfo for `block_diag` and `broadcast_tensors`

Running into some FX errors. Added op to skip-list and created an issue here: https://github.com/pytorch/pytorch/issues/64997
Both `block_diag` and `broadcast_tensors` are variadic, so skipping `test_variant_consistency_jit` (from comments on other OpInfos, it looks like JIT does not support variadic tensors)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64993

Reviewed By: jbschlosser

Differential Revision: D30961736

Pulled By: soulitzer

fbshipit-source-id: e169305384a683acae1178c4e12e9e214a67226a
2021-09-15 12:45:38 -07:00
21017ad1a1 Dispatch.h: Avoid including ivalue (#64165)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64165

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728587

Pulled By: ezyang

fbshipit-source-id: d0d2e97491d9d5e2d2fc2d6e51420a4467c1bba4
2021-09-15 12:16:44 -07:00
211ad231dc To add state_dict and load_state_dict to SequentialLR (#65035)
Summary:
To add state_dict() and load_state_dict() methods to SequentialLR
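
As with the ChainedScheduler change above, a minimal usage sketch (the inner schedulers and milestones are illustrative):

```python
import torch
from torch.optim.lr_scheduler import SequentialLR, ConstantLR, ExponentialLR

opt = torch.optim.SGD(torch.nn.Linear(4, 2).parameters(), lr=0.1)
sched = SequentialLR(opt,
                     schedulers=[ConstantLR(opt, factor=0.5, total_iters=2),
                                 ExponentialLR(opt, gamma=0.9)],
                     milestones=[2])  # switch schedulers after 2 steps

state = sched.state_dict()    # checkpoint
sched.load_state_dict(state)  # restore
```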

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65035

Reviewed By: prabhat00155, nateanl

Differential Revision: D30958204

Pulled By: datumbox

fbshipit-source-id: 65114e1b07146526ae2680233f5cd42b2534d67a
2021-09-15 12:01:51 -07:00
8a652e0e91 [CircleCI] Disable pytorch_linux_xenial_cuda10_2 test jobs (#65071)
Summary:
As all of them have been migrated to GHA:
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_distributed_test -> "linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (default, 1, 2, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (default, 2, 2, linux.8xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX2_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_nogpu_NO_AVX_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (nogpu_NO_AVX, 1, 1, linux.2xlarge)"
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test -> "linux-xenial-cuda10.2-py3.6-gcc7 / test (slow, 1, 1, linux.8xlarge.nvidia.gpu)"

"pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build" is still a holdout due to slow gradchecks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65071

Reviewed By: driazati, seemethere, janeyx99

Differential Revision: D30963413

Pulled By: malfet

fbshipit-source-id: d9a5188ce7eb2f60547b91b854a5db83af2b10e7
2021-09-15 11:59:40 -07:00
f1ce64a58e Starter Task 1 (#64927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64927

Mypy error corrections

Test Plan: Corrected mypy errors to make the code less prone to bugs by modifying types or adding lines that rule out undesired special cases, e.g. asserting that a variable is not None.
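
An illustrative example of the assert-not-None pattern mentioned above (not a snippet from this diff):

```python
from typing import Optional

def scale(x: Optional[float]) -> float:
    assert x is not None  # narrows Optional[float] to float for mypy
    return 2.0 * x
```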

Reviewed By: wushirong

Differential Revision: D30901654

fbshipit-source-id: daae8692603b8b38203a98f673c455749c2fb855
2021-09-15 11:55:07 -07:00
dab6496dbe [ROCm] Update CI images for ROCm 4.3.1 (#64610)
Summary:
Signed-off-by: Kyle Chen <kylechen@amd.com>

reference:
https://github.com/pytorch/pytorch/issues/58017

jithunnair-amd
jeffdaily
arindamroy-eng

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64610

Reviewed By: seemethere

Differential Revision: D30964582

Pulled By: malfet

fbshipit-source-id: a8335d3d32d7f1557d3cf6cb055ad0f9c49ef7aa
2021-09-15 11:49:54 -07:00
54d060a8c9 Port all and any full reductions to structured kernels. (#64642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64642

Tracking issue: #55070

This PR creates out overloads for both `all` and `any` kernels (full reduction overload),
and ports them to structured kernels.
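
A sketch of the new overload from Python (assuming the `out=` variant is exposed like other reductions):

```python
import torch

x = torch.tensor([True, False, True])
out = torch.empty((), dtype=torch.bool)
torch.all(x, out=out)  # full-reduction `all` now has an out= overload
torch.any(x, out=out)  # likewise for `any`
```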

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867354

Pulled By: ezyang

fbshipit-source-id: 46bccaf6c94a09ed77cc6c724d1183c82f801751
2021-09-15 11:06:47 -07:00
54cdf651fd [PyTorch] remove string_view::operator[] bounds check (#64670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64670

Bounds checking is not required for `std::string_view`, and the checking hoses performance for the following performance prototype diff.
ghstack-source-id: 138037531

Test Plan: CI

Reviewed By: ezyang, bhosmer

Differential Revision: D30747515

fbshipit-source-id: 1f4374415a82dfdccce76ea2c6885c13cb93d369
2021-09-15 09:57:58 -07:00
57420a6063 [PyTorch][easy] Add cbegin/cend to SmallVector (#64682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64682

Looks like it was forked from llvm before cbegin and cend existed.
ghstack-source-id: 138036981

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30814434

fbshipit-source-id: 9740fa8d3df1c90b77298a95ab9f1d0cf8c90320
2021-09-15 09:57:56 -07:00
bdbc622988 [PyTorch] Avoid extra std::vector in parseSchemaOrName (#64678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64678

We know we only want one declaration, so let's not create an excess std::vector (and thus a heap allocation) for that.
ghstack-source-id: 138036978

Test Plan: CI

Reviewed By: dhruvbird, tugsbayasgalan

Differential Revision: D30813785

fbshipit-source-id: c67e0100cdef5d894282939fb6d39a57309bc240
2021-09-15 09:56:41 -07:00
0f1bccb692 [quant] Removing unnecessary import from torch/quantization/quantize.py (#64910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64910

This bled through from the original location. Removing it is not just refactoring, but also prevents potential recursive imports.
ghstack-source-id: 138112663

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: vkuzo

Differential Revision: D30882924

fbshipit-source-id: 8652a334a5186c635761ea5e50f978d1f1078c12
2021-09-15 09:39:04 -07:00
3fb33b38b9 [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change enhances the current memory-overlap check to cover outputs: it enforces the constraint that all outputs of a node must NOT overlap with each other, since they are all updated by the node at the same time.

This check will detect a problem like T97393697 immediately in debug mode.
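
A simplified Python sketch of the invariant being enforced (the real check is C++ inside Static Runtime and is more careful about strides):

```python
import torch

def byte_range(t):
    # Assumes contiguous tensors for simplicity.
    start = t.data_ptr()
    return start, start + t.numel() * t.element_size()

def outputs_may_overlap(outputs):
    ranges = sorted(byte_range(t) for t in outputs)
    return any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(ranges, ranges[1:]))

a = torch.empty(4)
print(outputs_may_overlap([a[:2], a[2:]]))  # False: disjoint ranges
print(outputs_may_overlap([a[:3], a[1:]]))  # True: ranges intersect
```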

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
2021-09-15 08:38:05 -07:00
26e43fe9f3 Forward fix SkipInfo missing mypy (#65063)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65063

Reviewed By: malfet

Differential Revision: D30961556

Pulled By: janeyx99

fbshipit-source-id: 9618e12ba873fb48fe5c846a48d4560ad521eb3e
2021-09-15 08:30:38 -07:00
fb8bdb8039 When test set_affinity, don't hardcode the CPU ID (#65042)
Summary:
The set_affinity test always fails when the number of CPUs is smaller than 3. Changed the test to pick the CPU ID dynamically based on the number of CPUs on the system.
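
The general shape of the fix (illustrative; the actual test lives in the PyTorch test suite):

```python
import multiprocessing
import os

# Pick a CPU ID that is guaranteed to exist instead of hardcoding one.
cpu_id = min(2, multiprocessing.cpu_count() - 1)
os.sched_setaffinity(0, {cpu_id})  # Linux-only API
assert cpu_id in os.sched_getaffinity(0)
```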

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65042

Reviewed By: jbschlosser

Differential Revision: D30960554

Pulled By: ejguan

fbshipit-source-id: 55ac12714b4b0964b48c3617b79a7a345d40ebce
2021-09-15 08:10:59 -07:00
c625f971d3 [DataPipe] Make TarArchiveReader and ZipArchiveReader accepts FileSream with attempt to close and additional warning (#64788)
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).

This PR removes TarArchiveReader's dependency on the FileLoader DataPipe by allowing it to use an IterDataPipe of path names as input rather than a tuple of path name and stream.

It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset half way through reading.

The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.

Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788

Reviewed By: jbschlosser, ejguan

Differential Revision: D30901176

Pulled By: NivekT

fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
2021-09-15 07:34:29 -07:00
32c5da8cd2 add OpInfo for torch.nn.functional.dropout (#62315)
Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62315

Reviewed By: mruberry

Differential Revision: D30932765

Pulled By: zou3519

fbshipit-source-id: 481c67b59a966b4d640973d252b3e392d8db728e
2021-09-15 07:18:04 -07:00
d6d286f651 [dnnlowp] reduce num of test cases to avoid time out (#64935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64935

As title

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D30889157

fbshipit-source-id: 316c808806b084bd2e44c56e1cdb61adf2369a9d
2021-09-14 21:32:12 -07:00
b7ec7d760d Generic test parametrization functionality (#60753)
Summary:
This PR plays around with implementation & usage of a `parametrize` decorator for test parametrization similar to `pytest.mark.parametrize`, based on previous work introducing a `_TestParametrizer` class. It works with the internal `DeviceTest` hierarchy & composes with `dtype`, `skip*`, and other decorators. Basic usage is demonstrated in `test/test_blah.py`:

```python
import unittest
from itertools import product
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, deviceCountAtLeast, ops)
from torch.testing._internal.common_methods_invocations import op_db
from torch.testing._internal.common_utils import (
    TestCase, run_tests, parametrize, instantiate_parametrized_tests, subtest)

class TestBlah(TestCase):
    parametrize("x", range(5))
    def test_default_names(self, x):
        print('Passed in:', x)

    # Use default names but add an expected failure.
    parametrize("x", [subtest(0, decorators=[unittest.expectedFailure]),
                       *range(1, 5)])
    def test_default_names_expected_failure(self, x):
        if x == 0:
            raise RuntimeError('Boom')
        print('Passed in:', x)

    parametrize("bias", [False, True], name_fn=lambda b: 'bias' if b else 'no_bias')
    def test_custom_names(self, bias):
        print('Passed in:', bias)

    parametrize("bias", [subtest(True, name='bias'),
                          subtest(False, name='no_bias')])
    def test_custom_names_alternate(self, bias):
        print('Passed in:', bias)

    parametrize("x,y", [(1, 2), (1, 3), (1, 4)])
    def test_two_things_default_names(self, x, y):
        print('Passed in:', x, y)

    parametrize("x", [1, 2, 3])
    parametrize("y", [4, 5, 6])
    def test_two_things_composition(self, x, y):
        print('Passed in:', x, y)

    parametrize("x", [subtest(0, decorators=[unittest.expectedFailure]),
                       *range(1, 3)])
    parametrize("y", [4, 5, subtest(6, decorators=[unittest.expectedFailure])])
    def test_two_things_composition_expected_failure(self, x, y):
        if x == 0 or y == 6:
            raise RuntimeError('Boom')
        print('Passed in:', x, y)

    parametrize("x", [1, 2])
    parametrize("y", [3, 4])
    parametrize("z", [5, 6])
    def test_three_things_composition(self, x, y, z):
        print('Passed in:', x, y, z)

    parametrize("x", [1, 2], name_fn=str)
    parametrize("y", [3, 4], name_fn=str)
    parametrize("z", [5, 6], name_fn=str)
    def test_three_things_composition_custom_names(self, x, y, z):
        print('Passed in:', x, y, z)

    parametrize("x,y", product(range(2), range(3)))
    def test_two_things_product(self, x, y):
        print('Passed in:', x, y)

    parametrize("x,y", [subtest((1, 2), name='double'),
                         subtest((1, 3), name='triple'),
                         subtest((1, 4), name='quadruple')])
    def test_two_things_custom_names(self, x, y):
        print('Passed in:', x, y)

    parametrize("x,y", [(1, 2), (1, 3), (1, 4)], name_fn=lambda x, y: '{}_{}'.format(x, y))
    def test_two_things_custom_names_alternate(self, x, y):
        print('Passed in:', x, y)

class TestDeviceBlah(TestCase):
    parametrize("x", range(10))
    def test_default_names(self, device, x):
        print('Passed in:', device, x)

    parametrize("x,y", [(1, 2), (3, 4), (5, 6)])
    def test_two_things(self, device, x, y):
        print('Passed in:', device, x, y)

    deviceCountAtLeast(1)
    def test_multiple_devices(self, devices):
        print('Passed in:', devices)

    ops(op_db)
    parametrize("flag", [False, True], lambda f: 'flag_enabled' if f else 'flag_disabled')
    def test_op_parametrized(self, device, dtype, op, flag):
        print('Passed in:', device, dtype, op, flag)

instantiate_parametrized_tests(TestBlah)
instantiate_device_type_tests(TestDeviceBlah, globals())

if __name__ == '__main__':
    run_tests()
```

Generated tests:
```
TestBlah.test_custom_names_alternate_bias
TestBlah.test_custom_names_alternate_no_bias
TestBlah.test_custom_names_bias
TestBlah.test_custom_names_no_bias
TestBlah.test_default_names_expected_failure_x_0
TestBlah.test_default_names_expected_failure_x_1
TestBlah.test_default_names_expected_failure_x_2
TestBlah.test_default_names_expected_failure_x_3
TestBlah.test_default_names_expected_failure_x_4
TestBlah.test_default_names_x_0
TestBlah.test_default_names_x_1
TestBlah.test_default_names_x_2
TestBlah.test_default_names_x_3
TestBlah.test_default_names_x_4
TestBlah.test_three_things_composition_custom_names_1_3_5
TestBlah.test_three_things_composition_custom_names_1_3_6
TestBlah.test_three_things_composition_custom_names_1_4_5
TestBlah.test_three_things_composition_custom_names_1_4_6
TestBlah.test_three_things_composition_custom_names_2_3_5
TestBlah.test_three_things_composition_custom_names_2_3_6
TestBlah.test_three_things_composition_custom_names_2_4_5
TestBlah.test_three_things_composition_custom_names_2_4_6
TestBlah.test_three_things_composition_x_1_y_3_z_5
TestBlah.test_three_things_composition_x_1_y_3_z_6
TestBlah.test_three_things_composition_x_1_y_4_z_5
TestBlah.test_three_things_composition_x_1_y_4_z_6
TestBlah.test_three_things_composition_x_2_y_3_z_5
TestBlah.test_three_things_composition_x_2_y_3_z_6
TestBlah.test_three_things_composition_x_2_y_4_z_5
TestBlah.test_three_things_composition_x_2_y_4_z_6
TestBlah.test_two_things_composition_expected_failure_x_0_y_4
TestBlah.test_two_things_composition_expected_failure_x_0_y_5
TestBlah.test_two_things_composition_expected_failure_x_0_y_6
TestBlah.test_two_things_composition_expected_failure_x_1_y_4
TestBlah.test_two_things_composition_expected_failure_x_1_y_5
TestBlah.test_two_things_composition_expected_failure_x_1_y_6
TestBlah.test_two_things_composition_expected_failure_x_2_y_4
TestBlah.test_two_things_composition_expected_failure_x_2_y_5
TestBlah.test_two_things_composition_expected_failure_x_2_y_6
TestBlah.test_two_things_composition_x_1_y_4
TestBlah.test_two_things_composition_x_1_y_5
TestBlah.test_two_things_composition_x_1_y_6
TestBlah.test_two_things_composition_x_2_y_4
TestBlah.test_two_things_composition_x_2_y_5
TestBlah.test_two_things_composition_x_2_y_6
TestBlah.test_two_things_composition_x_3_y_4
TestBlah.test_two_things_composition_x_3_y_5
TestBlah.test_two_things_composition_x_3_y_6
TestBlah.test_two_things_custom_names_alternate_1_2
TestBlah.test_two_things_custom_names_alternate_1_3
TestBlah.test_two_things_custom_names_alternate_1_4
TestBlah.test_two_things_custom_names_double
TestBlah.test_two_things_custom_names_quadruple
TestBlah.test_two_things_custom_names_triple
TestBlah.test_two_things_default_names_x_1_y_2
TestBlah.test_two_things_default_names_x_1_y_3
TestBlah.test_two_things_default_names_x_1_y_4
TestBlah.test_two_things_product_x_0_y_0
TestBlah.test_two_things_product_x_0_y_1
TestBlah.test_two_things_product_x_0_y_2
TestBlah.test_two_things_product_x_1_y_0
TestBlah.test_two_things_product_x_1_y_1
TestBlah.test_two_things_product_x_1_y_2
TestDeviceBlahCPU.test_default_names_x_0_cpu
TestDeviceBlahCPU.test_default_names_x_1_cpu
TestDeviceBlahCPU.test_default_names_x_2_cpu
TestDeviceBlahCPU.test_default_names_x_3_cpu
TestDeviceBlahCPU.test_default_names_x_4_cpu
TestDeviceBlahCPU.test_default_names_x_5_cpu
TestDeviceBlahCPU.test_default_names_x_6_cpu
TestDeviceBlahCPU.test_default_names_x_7_cpu
TestDeviceBlahCPU.test_default_names_x_8_cpu
TestDeviceBlahCPU.test_default_names_x_9_cpu
TestDeviceBlahCPU.test_multiple_devices_cpu
TestDeviceBlahCPU.test_op_parametrized_<opname>_<variant>_cpu_uint8_flag_enabled_cpu
TestDeviceBlahCPU.test_two_things_x_1_y_2_cpu
TestDeviceBlahCPU.test_two_things_x_3_y_4_cpu
TestDeviceBlahCPU.test_two_things_x_5_y_6_cpu
TestDeviceBlahMETA.test_default_names_x_0_meta
TestDeviceBlahMETA.test_default_names_x_1_meta
TestDeviceBlahMETA.test_default_names_x_2_meta
TestDeviceBlahMETA.test_default_names_x_3_meta
TestDeviceBlahMETA.test_default_names_x_4_meta
TestDeviceBlahMETA.test_default_names_x_5_meta
TestDeviceBlahMETA.test_default_names_x_6_meta
TestDeviceBlahMETA.test_default_names_x_7_meta
TestDeviceBlahMETA.test_default_names_x_8_meta
TestDeviceBlahMETA.test_default_names_x_9_meta
TestDeviceBlahMETA.test_multiple_devices_meta
TestDeviceBlahMETA.test_op_parametrized_<opname>_<variant>_meta_uint8_flag_enabled_meta
TestDeviceBlahMETA.test_two_things_x_1_y_2_meta
TestDeviceBlahMETA.test_two_things_x_3_y_4_meta
TestDeviceBlahMETA.test_two_things_x_5_y_6_meta
```

Caveats:
* `parametrize` decorators cannot be "stacked" yet; each one overwrites the previous. This will change to either:
  * Allow stacking of multiple decorators
  * Error out with a nice error message if multiple decorators are specified

The PR introduces `instantiate_parametrized_tests()` in addition to `instantiate_device_type_tests()`. The former should be used for non-device-specific tests, and the latter should be used for device-specific tests, as usual. Both of these support the `parametrize` decorator. Only the latter supports the `ops` decorator (no change here- this was already the case).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60753

Reviewed By: saketh-are

Differential Revision: D30606615

Pulled By: jbschlosser

fbshipit-source-id: a34f36d643f68a6e221f419d9bb3e1ae1d84dd65
2021-09-14 19:52:59 -07:00
6ab97fbc28 [vulkan] Use volk to load vulkan libraries and fix Windows build errors (#64988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64968

The current wrapper (provided by [Vulkan-Tools](https://github.com/KhronosGroup/Vulkan-Tools/tree/master/common)) can't handle dynamically loading Vulkan on Windows/Mac. Therefore, we can bring in [volk](https://github.com/zeux/volk) to load the vulkan libraries for other platforms.

1. Use `volk` with `link_style="static"` only on Windows. Use `vulkan_wrapper` for all others (temporary solution)
2. Make DotSlash work on Windows when resolving glslc path

Test Plan:
For Android:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

For Mac:
```
buck build //xplat/caffe2:pt_vulkan_api_test_binAppleMac
./buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAppleMac\#macosx-x86_64
```

On Local OSS repo with `pr/64988` branch:

The build and test are fine. Note that `VulkanAPITest.log_softmax()` has been broken for the past month. Ivan will take a look when he is available.

Build: `BUILD_TEST=1 USE_VULKAN=1 USE_VULKAN_SHADERC_RUNTIME=1 USE_VULKAN_WRAPPER=0 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install`

Test: `$PYTORCH_ROOT/build/bin/vulkan_api_test /data/local/tmp`

```
Running main() from ../third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 69 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 69 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.adaptive_avg_pool2d
[       OK ] VulkanAPITest.adaptive_avg_pool2d (228 ms)
[ RUN      ] VulkanAPITest.add
[       OK ] VulkanAPITest.add (51 ms)
[ RUN      ] VulkanAPITest.add_broadcast0
[       OK ] VulkanAPITest.add_broadcast0 (13 ms)
[ RUN      ] VulkanAPITest.add_broadcast1
[       OK ] VulkanAPITest.add_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.add_broadcast2
[       OK ] VulkanAPITest.add_broadcast2 (9 ms)
[ RUN      ] VulkanAPITest.add_
[       OK ] VulkanAPITest.add_ (60 ms)
[ RUN      ] VulkanAPITest.add_broadcast0_
[       OK ] VulkanAPITest.add_broadcast0_ (10 ms)
[ RUN      ] VulkanAPITest.add_broadcast1_
[       OK ] VulkanAPITest.add_broadcast1_ (1 ms)
[ RUN      ] VulkanAPITest.add_scalar
[       OK ] VulkanAPITest.add_scalar (24 ms)
[ RUN      ] VulkanAPITest.add_scalar_
[       OK ] VulkanAPITest.add_scalar_ (8 ms)
[ RUN      ] VulkanAPITest.addmm
[       OK ] VulkanAPITest.addmm (22 ms)
[ RUN      ] VulkanAPITest.addmm_expand
[       OK ] VulkanAPITest.addmm_expand (12 ms)
[ RUN      ] VulkanAPITest.avg_pool2d
[       OK ] VulkanAPITest.avg_pool2d (9 ms)
[ RUN      ] VulkanAPITest.clamp
[       OK ] VulkanAPITest.clamp (92 ms)
[ RUN      ] VulkanAPITest.clamp_
[       OK ] VulkanAPITest.clamp_ (60 ms)
[ RUN      ] VulkanAPITest.conv2d
[       OK ] VulkanAPITest.conv2d (15 ms)
[ RUN      ] VulkanAPITest.conv2d_dw
[       OK ] VulkanAPITest.conv2d_dw (15 ms)
[ RUN      ] VulkanAPITest.conv2d_pw
[       OK ] VulkanAPITest.conv2d_pw (34 ms)
[ RUN      ] VulkanAPITest.conv2d_winograd
[       OK ] VulkanAPITest.conv2d_winograd (10 ms)
[ RUN      ] VulkanAPITest.copy
[       OK ] VulkanAPITest.copy (1 ms)
[ RUN      ] VulkanAPITest.div
[       OK ] VulkanAPITest.div (32 ms)
[ RUN      ] VulkanAPITest.div_broadcast0
[       OK ] VulkanAPITest.div_broadcast0 (11 ms)
[ RUN      ] VulkanAPITest.div_broadcast1
[       OK ] VulkanAPITest.div_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.div_broadcast2
[       OK ] VulkanAPITest.div_broadcast2 (7 ms)
[ RUN      ] VulkanAPITest.div_
[       OK ] VulkanAPITest.div_ (46 ms)
[ RUN      ] VulkanAPITest.div_broadcast0_
[       OK ] VulkanAPITest.div_broadcast0_ (9 ms)
[ RUN      ] VulkanAPITest.div_broadcast1_
[       OK ] VulkanAPITest.div_broadcast1_ (2 ms)
[ RUN      ] VulkanAPITest.div_scalar
[       OK ] VulkanAPITest.div_scalar (95 ms)
[ RUN      ] VulkanAPITest.div_scalar_
[       OK ] VulkanAPITest.div_scalar_ (18 ms)
[ RUN      ] VulkanAPITest.empty
[       OK ] VulkanAPITest.empty (0 ms)
[ RUN      ] VulkanAPITest.hardsigmoid
[       OK ] VulkanAPITest.hardsigmoid (76 ms)
[ RUN      ] VulkanAPITest.hardsigmoid_
[       OK ] VulkanAPITest.hardsigmoid_ (80 ms)
[ RUN      ] VulkanAPITest.hardshrink
[       OK ] VulkanAPITest.hardshrink (630 ms)
[ RUN      ] VulkanAPITest.hardshrink_
[       OK ] VulkanAPITest.hardshrink_ (573 ms)
[ RUN      ] VulkanAPITest.leaky_relu
[       OK ] VulkanAPITest.leaky_relu (271 ms)
[ RUN      ] VulkanAPITest.leaky_relu_
[       OK ] VulkanAPITest.leaky_relu_ (254 ms)
[ RUN      ] VulkanAPITest.hardswish
[       OK ] VulkanAPITest.hardswish (83 ms)
[ RUN      ] VulkanAPITest.hardswish_
[       OK ] VulkanAPITest.hardswish_ (72 ms)
[ RUN      ] VulkanAPITest.max_pool2d
[       OK ] VulkanAPITest.max_pool2d (16 ms)
[ RUN      ] VulkanAPITest.mean
[       OK ] VulkanAPITest.mean (17 ms)
[ RUN      ] VulkanAPITest.mean2d
[       OK ] VulkanAPITest.mean2d (20 ms)
[ RUN      ] VulkanAPITest.mm
[       OK ] VulkanAPITest.mm (12 ms)
[ RUN      ] VulkanAPITest.mul
[       OK ] VulkanAPITest.mul (28 ms)
[ RUN      ] VulkanAPITest.mul_broadcast0
[       OK ] VulkanAPITest.mul_broadcast0 (9 ms)
[ RUN      ] VulkanAPITest.mul_broadcast1
[       OK ] VulkanAPITest.mul_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.mul_broadcast2
[       OK ] VulkanAPITest.mul_broadcast2 (9 ms)
[ RUN      ] VulkanAPITest.mul_
[       OK ] VulkanAPITest.mul_ (43 ms)
[ RUN      ] VulkanAPITest.mul_broadcast0_
[       OK ] VulkanAPITest.mul_broadcast0_ (8 ms)
[ RUN      ] VulkanAPITest.mul_broadcast1_
[       OK ] VulkanAPITest.mul_broadcast1_ (1 ms)
[ RUN      ] VulkanAPITest.mul_scalar
[       OK ] VulkanAPITest.mul_scalar (64 ms)
[ RUN      ] VulkanAPITest.mul_scalar_
[       OK ] VulkanAPITest.mul_scalar_ (17 ms)
[ RUN      ] VulkanAPITest.reflection_pad2d
[       OK ] VulkanAPITest.reflection_pad2d (7 ms)
[ RUN      ] VulkanAPITest.reshape
[       OK ] VulkanAPITest.reshape (73 ms)
[ RUN      ] VulkanAPITest.reshape_
[       OK ] VulkanAPITest.reshape_ (41 ms)
[ RUN      ] VulkanAPITest.sigmoid
[       OK ] VulkanAPITest.sigmoid (81 ms)
[ RUN      ] VulkanAPITest.sigmoid_
[       OK ] VulkanAPITest.sigmoid_ (68 ms)
[ RUN      ] VulkanAPITest.softmax
[       OK ] VulkanAPITest.softmax (28 ms)
[ RUN      ] VulkanAPITest.log_softmax
Max Diff allowed: 5.87862e-05
../aten/src/ATen/test/vulkan_api_test.cpp:1470: Failure
Value of: check
  Actual: false
Expected: true
[  FAILED  ] VulkanAPITest.log_softmax (19 ms)
[ RUN      ] VulkanAPITest.tanh
[       OK ] VulkanAPITest.tanh (63 ms)
[ RUN      ] VulkanAPITest.tanh_
[       OK ] VulkanAPITest.tanh_ (68 ms)
[ RUN      ] VulkanAPITest.sub
[       OK ] VulkanAPITest.sub (28 ms)
[ RUN      ] VulkanAPITest.sub_broadcast0
[       OK ] VulkanAPITest.sub_broadcast0 (9 ms)
[ RUN      ] VulkanAPITest.sub_broadcast1
[       OK ] VulkanAPITest.sub_broadcast1 (9 ms)
[ RUN      ] VulkanAPITest.sub_broadcast2
[       OK ] VulkanAPITest.sub_broadcast2 (8 ms)
[ RUN      ] VulkanAPITest.sub_
[       OK ] VulkanAPITest.sub_ (43 ms)
[ RUN      ] VulkanAPITest.sub_broadcast0_
[       OK ] VulkanAPITest.sub_broadcast0_ (10 ms)
[ RUN      ] VulkanAPITest.sub_broadcast1_
[       OK ] VulkanAPITest.sub_broadcast1_ (2 ms)
[ RUN      ] VulkanAPITest.upsample_nearest2d
[       OK ] VulkanAPITest.upsample_nearest2d (5 ms)
[ RUN      ] VulkanAPITest.mobilenetv2
[       OK ] VulkanAPITest.mobilenetv2 (82 ms)
[----------] 69 tests from VulkanAPITest (3885 ms total)

[----------] Global test environment tear-down
[==========] 69 tests from 1 test suite ran. (3885 ms total)
[  PASSED  ] 68 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] VulkanAPITest.log_softmax

 1 FAILED TEST
```

Differential Revision: D30925995

fbshipit-source-id: 1b1b7f7f22090064424a5379d2f0559d0da7846a
2021-09-14 19:35:05 -07:00
ff6b475d4a [fix] don't expose unique_dim in torch (#63080)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62793

This is mostly a quick fix. I think the more correct fix could be updating `unique_dim` to `_unique_dim`, which could be BC-breaking for C++ users (maybe). Maybe there is something else I am missing.

~~Not sure how to add a test for it.~~ Have tested it locally.

We can add a test like the following. Tested this locally: it currently fails but passes with the fix.
```python
        def test_wildcard_import(self):
            exec('from torch import *')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63080

Reviewed By: gchanan

Differential Revision: D30738711

Pulled By: zou3519

fbshipit-source-id: b86d0190e45ba0b49fd2cffdcfd2e3a75cc2a35e
2021-09-14 18:19:17 -07:00
36cac2be4d [CUDA graphs] moves memory sharing intro paragraph (#64996)
Summary:
Puts memory sharing intro under Sharing memory... header, where it should have been all along.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64996

Reviewed By: mruberry

Differential Revision: D30948619

Pulled By: ngimel

fbshipit-source-id: 5d9dd267b34e9d3fc499d4738377b58a22da1dc2
2021-09-14 17:53:43 -07:00
36a0d97281 Revert D30558877: Ported std/var to ReductionOpInfo and minimum/maximum to BinaryUfuncInfo
Test Plan: revert-hammer

Differential Revision:
D30558877 (382e008fbf)

Original commit changeset: 3e62ff24a935

fbshipit-source-id: 3b9f03c1f43c6d5f2738ed139d0236f2ded78dbf
2021-09-14 17:33:38 -07:00
3d312b3b8e [Model Averaging] Simplify PostLocalSGD Optimizer API (#64885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64885

1) The constructor accepts a local optimizer instance instead of the local optimizer's constructor inputs plus the class type.
2) The parameters are read from local optimizer's `param_groups` instead of a separate input.
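
A hedged sketch of the simplified construction (parameter names follow torch.distributed.optim at the time; actually running this requires an initialized process group):

```python
import torch
from torch.distributed.optim import PostLocalSGDOptimizer
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

model = torch.nn.Linear(4, 2)
local_opt = torch.optim.SGD(model.parameters(), lr=0.1)  # a ready-made optimizer instance
opt = PostLocalSGDOptimizer(
    optim=local_opt,  # instead of an optimizer class plus constructor inputs
    averager=PeriodicModelAverager(period=4, warmup_steps=100),
)
```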

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 137865867

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30888794

fbshipit-source-id: 21261b480f6bbb9b2333426020e3f350da3f73c2
2021-09-14 16:37:14 -07:00
382e008fbf Ported std/var to ReductionOpInfo and minimum/maximum to BinaryUfuncInfo (#63978)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63978

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30558877

Pulled By: heitorschueroff

fbshipit-source-id: 3e62ff24a935784fc93a76a0f46a1deb060ba680
2021-09-14 16:18:09 -07:00
c65128679b [DataPipe] Improve Mapper to accept input/output index when apply fn (#64951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64951

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30910035

Pulled By: ejguan

fbshipit-source-id: d687fe10939920a3617a60552fe743e8526438a0
2021-09-14 15:46:42 -07:00
670853295a [quant][tensorrt] Add tensorrt backend config (#64623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64623

The config api will change, but we'll add configs gradually for TensorRT to unblock experimentation

Test Plan:
python torch/fx/experimental/fx2trt/example/unittests.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30800474

fbshipit-source-id: 3c4640de1205a0f19b62943ab84f386d80394ec2
2021-09-14 15:27:33 -07:00
85222c050f [PyTorch] Add c10::hash<c10::ArrayRef<T>> (#64277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64277

Just moved the vector implementation to ArrayRef and re-implemented the former using the latter.
ghstack-source-id: 137978947

Test Plan: existing CI

Reviewed By: dhruvbird

Differential Revision: D30647666

fbshipit-source-id: c0f4f06c348d36882ec0db802be44d8c7749562f
2021-09-14 14:22:12 -07:00
5d4efed83e [PyTorch] Add OpCode cache in ByteCodeDeserializer (#64110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64110

As the code comment says, we can exploit pickler string interning to accelerate OpCode parsing. No more strcmp!
ghstack-source-id: 137978946

Test Plan:
Pixel 3 before: https://www.internalfb.com/intern/aibench/details/591414145082422
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/484557404703261

new mean is 292 ms, down from 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30615052

fbshipit-source-id: 9707625e778388a7920ab72704d71ad57ddaac17
2021-09-14 14:22:10 -07:00
a9121df09c [PyTorch] Remove implicit conversion from Tuple to vector reference (#63993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63993

This seems to be unused, and it's pretty scary.
ghstack-source-id: 137978949

Test Plan: CI

Reviewed By: lw

Differential Revision: D30560441

fbshipit-source-id: 08b7ce971fd1e2dbeddbf37b02413fef513b4753
2021-09-14 14:22:08 -07:00
452402b984 [PyTorch] Fix SourceRangeDeserializer vector copy (#64031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64031

More copies of tuple elements.
ghstack-source-id: 137978948

Test Plan:
Pixel 3 before: https://our.intern.facebook.com/intern/aibench/details/724509739115867
Pixel 3 after: https://our.intern.facebook.com/intern/aibench/details/232361457767293

Top-line number doesn't seem to have moved, but we can see that the vector copy disappeared in the flame graph.

Reviewed By: raziel

Differential Revision: D30559545

fbshipit-source-id: e5343abae96b8e80e0ccec482ad316884ae231ea
2021-09-14 14:20:45 -07:00
57eda69219 [fx2trt] fix elementwise op converter with one operand being a literal and has different type (#65004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65004

If we have some code like `torch.add(x, 1)` where x is a float tensor, then conversion would fall apart, because we currently add a constant layer of int32 dtype for `1` while we actually need float dtype.

This diff adds an arg to `get_trt_tensor` which specifies the dtype of the constant layer we create.

Also, this starts adding docstrings to functions.
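
The eager-mode behavior the converter needs to match (illustrative):

```python
import torch

x = torch.randn(3)   # float32 tensor
y = torch.add(x, 1)  # type promotion treats the literal 1 as float32; the TRT
                     # converter must likewise emit a float32 constant layer
print(y.dtype)       # torch.float32
```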

Reviewed By: yinghai

Differential Revision: D30852156

fbshipit-source-id: 650ce72d2794093a4616e640ea503dcc1c6b2bc4
2021-09-14 12:27:37 -07:00
3727baea6f [PyTorch Edge][Model Loading] Operator Call De-dup at TorchScript Serialization Level [2/2] (#64269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64269

Revert changes in D29826210 (693d8f2f07) (we don't need operator lambda caching since there aren't duplicate operators anymore)

This diff stack results in an additional approx 12% speedup in model loading time (from 229ms to 200ms) when run against an 87MB speech model that jiatongzhou provided.
ghstack-source-id: 138014904

Test Plan:
**Speech Transducer v25 model (as in D29826210 (693d8f2f07))**

|| Before | After |
|Load Time|[229ms](https://www.internalfb.com/intern/aibench/details/160889436133243)|[200ms](https://www.internalfb.com/intern/aibench/details/837884532607514)|
|Save File Size|[86.23 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658544950)|[86.1 MB](https://lookaside.facebook.com/intern/diff/file/data/?number=658554403)|

The "after" flamegraph shows significantly less time is spent on ```append_operator``` than before.

Steps
- Check out desired commit in devserver (base branch or this diff)
- ```buck build bento/kernels:bento_kernel_pytorch```
- Use N1094068 with pytorch_local kernel to save model for lite interpreter
- Edit ```aibench/specifications/models/pytorch/speech_transducer/v25.json ``` to have new model location and md5
- ```buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/speech_transducer/v25.json --framework pytorch --platform android/arm64 --devices "S8US" --force_profile --remote ```

**Test that saving a model with de-dup ops doesn't change its output**
https://www.internalfb.com/intern/anp/view/?id=1137434

Reviewed By: iseeyuan

Differential Revision: D30615710

fbshipit-source-id: bb4052f0f16eccab386585e94411056f94bce43c
2021-09-14 12:12:46 -07:00
86e6bed0d4 [PyTorch Edge][Model Loading] Operator Call De-dup at TorchScript Serialization Level [1/2] (#64268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64268

If the same pair of operator name and num inputs has previously been used to add an instruction to the operator table (and the operator's schema is not vararg), reuse that instruction's index rather than creating a new one.
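
A minimal sketch of the de-dup bookkeeping described above (an illustration, not the serializer's actual code):

```python
op_table = []  # serialized operator table
op_index = {}  # (name, num_inputs) -> table index, for non-vararg ops

def add_operator(name, num_inputs, is_vararg):
    key = (name, num_inputs)
    if not is_vararg and key in op_index:
        return op_index[key]  # reuse the existing entry
    op_table.append(key)
    idx = len(op_table) - 1
    if not is_vararg:
        op_index[key] = idx
    return idx

i = add_operator("aten::add", 2, False)
j = add_operator("aten::add", 2, False)
assert i == j and len(op_table) == 1
```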
ghstack-source-id: 138014905

Test Plan: Phabricator tests, and test performance changes in next diff

Reviewed By: iseeyuan, tugsbayasgalan

Differential Revision: D30615434

fbshipit-source-id: f442f557f12412693a73004ce44733ccef063b82
2021-09-14 12:11:32 -07:00
97df69eac6 .github: Add render test results step (#64937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64937

Adds CLI output for rendered test results to go alongside test execution; users should be able to quickly diagnose test failures like so:
![fdsfdsfdsfdsf](https://user-images.githubusercontent.com/1700823/133156245-ba939cbf-8aa2-47a7-b1fb-7cc876ca75c4.png)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30917897

Pulled By: seemethere

fbshipit-source-id: f51ea499462e3cfd64496cb711b84a93971c91bd
2021-09-14 11:25:14 -07:00
d188204323 remove SkipInfo class (#64972)
Summary:
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64972

Reviewed By: mruberry

Differential Revision: D30924598

Pulled By: ngimel

fbshipit-source-id: 1ac1ec8fd50ca27e3cd36c12a588d334e7466899
2021-09-14 11:23:54 -07:00
eedc234e33 [PyTorch] Don't store multiple kernels per key on mobile (#64447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64447

As the code comment says, we needn't worry about Jupyter notebooks on mobile.
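
A rough sketch of the trade-off (hypothetical data structures, not the dispatcher's actual code): a per-key kernel list only pays off when registrations can be interactively shadowed and restored:

```python
kernels = {}  # dispatch key -> list of kernels, newest first

def register(key, kernel):
    kernels.setdefault(key, []).insert(0, kernel)  # shadow any older kernel

def deregister(key):
    kernels[key].pop(0)  # e.g. re-running a notebook cell restores the old one

# Without interactive re-registration (the mobile case), a plain
# key -> kernel mapping suffices and avoids the per-key list overhead.
```
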
ghstack-source-id: 137951718

Test Plan: Profiled startup of //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark on devserver with -niter 0 -nrep 0 and `C10_DISPATCHER_ONE_KERNEL_PER_DISPATCH_KEY` defined. Time spent in sherwood_v3_table lookups went way down.

Reviewed By: ezyang, bhosmer

Differential Revision: D30736094

fbshipit-source-id: bcc22cd0d9adceba259a03898c992759d501fe89
2021-09-14 10:36:43 -07:00
446d95a7f6 [fx const fold] fix some cases with deep model hierarchy (#64945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64945

In the const folding pass, we try to create `get_attr` nodes in submod_1 for `get_attr` nodes that are in the main graph. But we don't have the real attributes in submod_1. To fix this, we assign the main module as the owning module of submod_1's graph.

The fix above would cause a problem for `call_module` nodes in submod_1, because during split, modules get inlined into submod_1 (target changed from "mod.a.b" -> "mod_a_b"). Changing the owning module would make those `call_module` nodes unable to find the module they refer to. To fix this, we set the target module to the main module.

Reviewed By: jfix71

Differential Revision: D30905949

fbshipit-source-id: cd67bc8fe4b8ad4344ae97b8e36753fdce3ece6d
2021-09-14 09:45:44 -07:00
00e6e0c593 [Model Averaging] Revert #63895 (#64903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903

Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30894688

fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485
2021-09-14 09:45:42 -07:00
882b67dff4 Drop incremental linking on Windows with REL_WITH_DEB_INFO=1. (#64892)
Summary:
The library will no longer link properly on VS 2019 (14.29.30133). To
ensure that engineers building on Windows can use and debug with this
build type, incremental linking needs to be turned off for this build
flag.

Verified that this build type successfully builds, links, and provides
debuggable Python modules on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64892

Reviewed By: jbschlosser

Differential Revision: D30902565

Pulled By: malfet

fbshipit-source-id: e5286a4c6f45c7cbe4cdc1b98560129bd386970b
2021-09-14 09:44:18 -07:00
01cfea9485 Disable target determination for now (#64921)
Summary:
There were several reports of the target determinator incorrectly skipping
tests; the most recent one is https://github.com/pytorch/pytorch/issues/64902

Let's disable it until it can be further stabilized

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64921

Reviewed By: seemethere, janeyx99

Differential Revision: D30901186

Pulled By: malfet

fbshipit-source-id: 531afd2d390c6b51f727330d5dd1882d70b6fdde
2021-09-14 09:40:13 -07:00
4e225da363 print_test_stats.py: dedup test report upload name with TEST_CONFIG (#64948)
Summary:
Connected with issue https://github.com/pytorch/pytorch/issues/64845, takeover of https://github.com/pytorch/pytorch/issues/64091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64948

Reviewed By: malfet, seemethere

Differential Revision: D30908592

Pulled By: janeyx99

fbshipit-source-id: dc31b0bbc9f4e35d23412aa14acbbab7422b4146
2021-09-14 09:01:06 -07:00
e884554008 Make {select,slice,diagonal}_backward primitives wrt autograd (#64933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64933

Fixes https://github.com/facebookresearch/functorch/issues/108

This is a short-term fix. A longer-term fix would be to either:
1. have proper {select,slice,diagonal}_embed functions
2. have efficient {select,slice,diagonal}_scatter functions (and
efficient zero tensors).

NB: I didn't use diag_embed because diag_embed is slightly different
from diagonal_backward.

There are no BC concerns because TorchScript (luckily) does not
serialize the backwards graph.
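
For reference, a hedged Python restatement of what `select_backward` computes (the derivative rule itself, not the PR's C++ code):

```python
import torch

def select_backward_reference(grad, input_sizes, dim, index):
    # Gradient of x.select(dim, index): zeros shaped like x, with grad
    # scattered back into the selected slice.
    out = grad.new_zeros(input_sizes)
    out.select(dim, index).copy_(grad)
    return out
```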

Test Plan:
- run tests
- run benchmarks.
https://gist.github.com/zou3519/e7c0774d1ac97f32aa02ec44d81e60e1.
Surprisingly the instruction count goes down. This is probably because
we create fewer autograd nodes now.

Reviewed By: ezyang

Differential Revision: D30909333

Pulled By: zou3519

fbshipit-source-id: 3b33e13010ba13b4d487b346aa9bee8a0e8c378c
2021-09-14 08:10:59 -07:00
2853c7da22 Replace composite dispatch with CompositeExplicitAutograd (#64641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64641

`sum`, `mean`, and `norm` were ported to structured kernels in #61642, #61643, and #62711,
respectively. Those PRs changed related overloads into composite kernels. However, their
dispatch section remained the same, when they really should be marked as
`CompositeExplicitAutograd`. This PR fixes this issue.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867122

Pulled By: ezyang

fbshipit-source-id: b951aee41a3cab9ca546df826a285d60013e3b3a
2021-09-14 07:56:54 -07:00
09d221e8d4 Revert D30711934: [pytorch][PR] Use RDS for build size tracking
Test Plan: revert-hammer

Differential Revision:
D30711934 (1cd0252eed)

Original commit changeset: 0af808ddf528

fbshipit-source-id: 6f67ed5cbaf333cc55729be2a23e385772e31b10
2021-09-14 06:10:03 -07:00
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
199031c48e [TensorExpr] PyBinds: improve QoL of pybind users. (#64886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64886

Bind methods for implicit conversions and constructors to avoid
boilerplate code.

Differential Revision: D30889193

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Pulled By: ZolotukhinM

fbshipit-source-id: 137c0c98f7f1576e1bb97c8de8a900b28407a30e
2021-09-14 00:21:28 -07:00
caaa6efc1a Fix use of deprecated tensor.type() in SegmentReduce.cpp (#64151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64151

Reviewed By: mruberry

Differential Revision: D30917268

Pulled By: ngimel

fbshipit-source-id: 63427372b651ac495d48ef552eba5fbf0e4378e9
2021-09-13 23:16:47 -07:00
d4b4d83521 [quant] handle empty input in fused_moving_avg_obs_fake_quant op (#64829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64829

If an empty input is passed in, the aminmax operator fails with a runtime error like
```
RuntimeError: aminmax(): cannot compute aminmax over an empty dimension as the operation has no identity.
```

To avoid this during training we just return the input if we find it to be empty
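
A minimal sketch of the guard (a hypothetical wrapper; the real change is inside the fused observer op):

```python
import torch

def observe(x):
    if x.numel() == 0:
        return x  # nothing to observe; avoids the aminmax error above
    lo, hi = torch.aminmax(x)
    # ... update the moving-average min/max with (lo, hi) ...
    return x
```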

Test Plan:
python test/test_quantization.py TestFusedObsFakeQuant

Imported from OSS

Reviewed By: jingsh

Differential Revision: D30870879

fbshipit-source-id: 0cb4b187449a45a37150a77510d2292f93a7d1cd
2021-09-13 22:22:31 -07:00
0aef44cb3d Add forward AD for torch.linalg.eigh (#62163)
Summary:
This PR adds forward mode differentiation for `torch.linalg.eigh` and a few other functions required for tests to pass.

For some reason running tests for `torch.linalg.eigvalsh` and complex `torch.linalg.eigh` hangs. These tests are skipped for now.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry heitorschueroff walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62163

Reviewed By: jbschlosser

Differential Revision: D30903988

Pulled By: albanD

fbshipit-source-id: d6a74adb9e6d2f4be8ac707848ecabf06d629823
2021-09-13 21:15:38 -07:00
35c82dbf5c [THC] remove TensorTypeUtils and TensorInfo (#64965)
Summary:
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64965

Reviewed By: mruberry

Differential Revision: D30916754

Pulled By: ngimel

fbshipit-source-id: b24020d6a7ce8a05a5ab6c579d176dd94dd3b1d7
2021-09-13 20:36:28 -07:00
816048e7e6 EmbeddingBag sort thrust->cub (#64498)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/57505

Also fixes a warning I found when compiling:
```
/home/gaoxiang/pytorch-cub/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu(7): warning: inline qualifier ignored for "__global__" function
```
I also updated the bfloat16 guard to CUDA 11.5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64498

Reviewed By: mruberry

Differential Revision: D30917077

Pulled By: ngimel

fbshipit-source-id: fb9df08fd469038478a563014b5af7452b4b28c0
2021-09-13 19:51:12 -07:00
ed30afd480 Speed up torch.unique_consecutive() (#64835)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62690

Following the approach of `unique_consecutive_cpu_template`, this PR reimplements `_unique_dim_cpu_impl` to get better performance.
Also, because the overhead of `unique_dim_consecutive_cpu` is quite large, this PR calls `unique_consecutive_cpu_template` directly when the given input is a 1-D array.

## Benchmark
### Script
```python
import torch
import time

torch.manual_seed(0)
t = torch.randint(500, (10000000, ))
t = torch.sort(t)[0]

start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=0, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=0) time:", end - start)

start = time.time()
uniques2, inverse2, counts2 = torch.unique_consecutive(t, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive() time:", end - start)

t = torch.randint(500, (10000000, 2))
t = torch.sort(t)[0]

start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=0, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=0) time:", end - start)

start = time.time()
uniques, inverse, counts = torch.unique_consecutive(t, dim=1, return_inverse=True, return_counts=True)
end = time.time()
print("torch.unique_consecutive(dim=1) time:", end - start)
```

### Before
```
torch.unique_consecutive(dim=0) time: 78.64345622062683
torch.unique_consecutive() time: 0.029544353485107422
torch.unique_consecutive(dim=0) time: 91.49796152114868
torch.unique_consecutive(dim=1) time: 0.30872368812561035
```

### After
```
torch.unique_consecutive(dim=0) time: 0.08256125450134277
torch.unique_consecutive() time: 0.08162403106689453
torch.unique_consecutive(dim=0) time: 35.58408498764038
torch.unique_consecutive(dim=1) time: 1.6258199214935303
```

## System Information
```
Collecting environment information...
PyTorch version: 1.10.0a0+git7f1932e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  2 2021, 10:49:15)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0a0+gitbe09195
[conda] Could not collect
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64835

Reviewed By: jbschlosser

Differential Revision: D30894906

Pulled By: ngimel

fbshipit-source-id: 42ab76d638391ce6c4e589d9c71bdf7579310ad9
2021-09-13 19:00:36 -07:00
ab5e1c69a7 [WIP] Example of DataPipes and DataFrames integration (#60840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60840

Test Plan: Imported from OSS

Reviewed By: wenleix, ejguan

Differential Revision: D29461080

Pulled By: VitalyFedyunin

fbshipit-source-id: 4909394dcd39e97ee49b699fda542b311b7e0d82
2021-09-13 18:50:15 -07:00
ee554e2e96 Re-land Fix test report uploading (#64958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64958

This is a re-do of #64846, which was missing a path prefix for Windows test reports

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D30915253

Pulled By: driazati

fbshipit-source-id: d14d0a64d2f8aabc335db9c4d0d2b63512887c66
2021-09-13 18:36:26 -07:00
f159f12fee [iOS][OSS][BE] Add Simulator tests for full JIT (#64851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64851

ghstack-source-id: 137970229

Test Plan: CircleCI

Reviewed By: hanton, cccclai

Differential Revision: D30877963

fbshipit-source-id: 7bb8ade1959b85c3902ba9dc0660cdac8f558d64
2021-09-13 18:16:08 -07:00
fd09e564d6 add acc_ops.max, acc_ops.maximum, consolidate acc_ops.min and acc_ops.minimum
Summary:
This diff adds `acc_ops.max` and `acc_ops.maximum` support.
It further consolidates the logic for `acc_ops.min` and `acc_ops.minimum` to match the logic for max.

torch.max has three behaviors:
```
1. max(input)
2. max(input, dim, keepdim=False, *, out=None)
3. max(input, other, *, out=None)
```

Likewise, `torch.min` has three identical behaviors.

I've chosen to implement each as an acc_op, then map to the appropriate one.

The third `max` behavior is effectively `torch.maximum`, so I've implemented it as that.
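
The three behaviors side by side (standard `torch.max` usage, shown for context):

```python
import torch

t = torch.tensor([[1., 5.], [3., 2.]])
torch.max(t)                           # 1) full reduction -> tensor(5.)
values, indices = torch.max(t, dim=1)  # 2) per-dim reduction with indices
torch.max(t, torch.full_like(t, 2.5))  # 3) elementwise, same as torch.maximum
```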

Reviewed By: yinghai, jfix71, 842974287

Differential Revision: D30551464

fbshipit-source-id: 0a2eec10e5185cbf7d9984eec3fd399b23528b2a
2021-09-13 18:04:33 -07:00
3855c24639 Add BFloat16 support for cross, tril, triu, tril_indices, triu_indices and cumsum operators on CPU (#62454)
Summary:
Add BFloat16 support for cross, tril, triu, tril_indices, triu_indices and cumsum operators on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62454

Reviewed By: albanD

Differential Revision: D30845805

Pulled By: heitorschueroff

fbshipit-source-id: f83836862e38109ec929e83567133e9e88096b8b
2021-09-13 17:59:43 -07:00
1cd0252eed Use RDS for build size tracking (#64303)
Summary:
This adds 2 utilities: `register_rds_table` and `rds_write`. `register_rds_table` needs to be called once with the schema for the data that `rds_write` will write. These go to a lambda called `rds-proxy`, which will write to/read from the DB as necessary. This data can then be arbitrarily queried via `rds-proxy` (for use in CI) or on metrics.pytorch.org (for analysis).

It also hooks these up for build size tracking (which previously was not working on GHA)

TODO:
* verify output in logs + clean up prints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64303

Reviewed By: malfet, seemethere

Differential Revision: D30711934

Pulled By: driazati

fbshipit-source-id: 0af808ddf528a24875a378caeb1aa9cb0693f802
2021-09-13 17:48:44 -07:00
c4073af61d Add skipIfTBB decorator (#64942)
Summary:
And replace two existing usages in the codebase with it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64942

Reviewed By: jbschlosser

Differential Revision: D30906382

Pulled By: malfet

fbshipit-source-id: e7f20f53aff734b0379eded361255543dab4fa4b
2021-09-13 17:11:51 -07:00
8131bc85d0 Raise TypeError on assigned grad with wrong type (#64876)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64813

Raises a TypeError when assigned value to a grad is not a Tensor or
None.

Adds tests.
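
A sketch of the behavior after this change (illustrative, not the PR's test code):

```python
import torch

x = torch.ones(3, requires_grad=True)
x.grad = torch.zeros(3)   # OK: a Tensor
x.grad = None             # OK: clears the gradient
x.grad = [0.0, 0.0, 0.0]  # now raises TypeError instead of being accepted
```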

cc ezyang gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64876

Reviewed By: anjali411

Differential Revision: D30901678

Pulled By: soulitzer

fbshipit-source-id: dbb3cb5fd0bbac6918e0b2e2f51d340daa43dee0
2021-09-13 16:41:45 -07:00
1e25a84993 kill SkipInfo (#64878)
Summary:
Per offline discussion, replaces SkipInfo with DecorateInfo. SkipInfo class itself is not removed yet to give functorch time to replace its SkipInfos.
cc zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64878

Reviewed By: mruberry

Differential Revision: D30908052

Pulled By: ngimel

fbshipit-source-id: 5124180b25c6e32517722883b9f3a2b488e3fe20
2021-09-13 16:32:36 -07:00
3710edc86b Fix TRTOperatorSupport (#64873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64873

Fix TRTOperatorSupport's key naming to match the keys generated by torch.fx.passes.tools_common.get_node_target. get_node_target is used by splitter_base to check, by name, whether an operator is supported.

Test Plan:
Print out the supported operator dict and check the names.
Run TRTSplitter with lrm_split_model_generator and verify the split result is correct, with all supported operators printed.
Current split result:
```
Supported node types in the model:
acc_ops.size: ((), {'input': torch.float32})
acc_ops.getitem: ((), {'input': torch.float32})
acc_ops.getitem: ((), {'input': None})
acc_ops.reshape: ((), {'input': torch.float32})
acc_ops.unsqueeze: ((), {'input': torch.float32})
acc_ops.linear: ((), {'input': torch.float32, 'weight': torch.float32})
acc_ops.linear: ((), {'input': torch.float32, 'weight': torch.float32, 'bias': torch.float32})
acc_ops.mul: ((), {'input': torch.float32, 'other': torch.float32})
acc_ops.cat: ((), {})
acc_ops.add: ((), {'input': torch.float32, 'other': torch.float32})
acc_ops.add: ((), {'input': torch.float32})
acc_ops.tanh: ((), {'input': torch.float32})
acc_ops.transpose: ((), {'input': torch.float32})
acc_ops.matmul: ((), {'input': torch.float32, 'other': torch.float32})
acc_ops.div: ((), {'input': torch.float32, 'other': torch.float32})
acc_ops.squeeze: ((), {'input': torch.float32})
acc_ops.noop: ((), {'input': torch.float32})
acc_ops.layer_norm: ((), {'input': torch.float32, 'weight': torch.float32, 'bias': torch.float32})
acc_ops.permute: ((), {'input': torch.float32})
acc_ops.sigmoid: ((), {'input': torch.float32})
acc_ops.flatten: ((), {'input': torch.float32})
acc_ops.softmax: ((), {'input': torch.float32})
acc_ops.sum: ((), {'input': torch.float32})

Unsupported node types in the model:
torch.ops.fb.pad_sequence_embeddings: ((), {'embeddings': torch.float32, 'offsets': torch.int32})
acc_ops.linalg_norm: ((), {'input': torch
```

Reviewed By: yinghai

Differential Revision: D30884463

fbshipit-source-id: 22442aa6a69cd148ce9bc8be8f62157dd6d19954
2021-09-13 15:55:15 -07:00
914e3a861a Revert D30878101: [pytorch][PR] Fix test report uploading
Test Plan: revert-hammer

Differential Revision:
D30878101 (fba40bfc1a)

Original commit changeset: 0730f17fa3f4

fbshipit-source-id: dad89e68b4daf656dd0b592bc9b2758f00af38c6
2021-09-13 15:24:44 -07:00
6101cbcedb torch.ao migration: fake_quantize.py, phase 1 (#64814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64814

1. move the file
```
hg mv caffe2/torch/quantization/fake_quantize.py caffe2/torch/ao/quantization/
```

2. create a new file in the old location and copy the imports
3. fix all callsites inside `torch`

Test Plan:
```
buck test mode/dev //caffe2/test:quantization
```

Reviewed By: z-a-f

Differential Revision: D30866792

fbshipit-source-id: 7a221cb46c0ab01f1c5de9be061f09ecc83ce23e
2021-09-13 15:22:28 -07:00
e4314dac57 [PyTorch] Reduce heap allocations in OperatorName::setNamespaceIfNotSet (#64673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64673

We are now guaranteed to allocate at most one time in this function.
ghstack-source-id: 137786392

Test Plan: Previous diff adds test coverage for this function.

Reviewed By: dhruvbird

Differential Revision: D30813014

fbshipit-source-id: 17d844a1cc8c30574afcc6b0b41b219e62c0b723
2021-09-13 14:33:55 -07:00
000f3310d7 [PyTorch] Add test for operator_name (#64672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64672

Just a small struct missing test coverage. Next diff changes it.
ghstack-source-id: 137786388

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30813013

fbshipit-source-id: 05f39494bb9512a71a928bfe6fcfa710016bdf61
2021-09-13 14:32:50 -07:00
c99277e177 handle the case in acc_ops.sum when dim == 0, differentiating it from the case when dim is None (#64869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64869

handle the case in acc_ops.sum when dim == 0, differentiating it from the case when dim is None
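
Why the distinction matters (plain PyTorch semantics; the falsy-zero reading of the bug is my assumption):

```python
import torch

t = torch.ones(2, 3)
torch.sum(t)         # no dim: reduce everything -> tensor(6.)
torch.sum(t, dim=0)  # dim=0: reduce the first dim -> tensor([2., 2., 2.])
# A converter guard like `if dim:` would treat dim=0 the same as dim=None,
# since 0 is falsy in Python.
```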

Reviewed By: 842974287

Differential Revision: D30872739

fbshipit-source-id: 2755d3230804a16ef1c9289f804138c6dd7766b3
2021-09-13 14:24:16 -07:00
0561e104d9 fix build error when system cmake3 version >=3.5 but <=3.10 (#64914)
Summary:
For a PyTorch source build using conda, an error is raised at 8535418a06/CMakeLists.txt (L1) when CMake < 3.10 is found; it can be fixed by upgrading CMake in the conda env. But CentOS ships cmake3, and PyTorch first checks only whether cmake3's version is >= 3.5, so if the system's cmake3 is, say, 3.6, PyTorch will use it, which leads to a build error like:
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  CMake 3.10 or higher is required.  You are running version 3.6.3

-- Configuring incomplete, errors occurred!
```

We need to check that cmake3 is also >= 3.10; if not, fall back to checking conda's CMake version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64914

Reviewed By: jbschlosser

Differential Revision: D30901673

Pulled By: ezyang

fbshipit-source-id: 064e2c5bc0b9331d6ecd65cd700e5a42c3403790
2021-09-13 13:26:06 -07:00
fba40bfc1a Fix test report uploading (#64846)
Summary:
Previously we just weren't uploading Windows test report XML files to S3, only to GitHub Actions. This was different from Linux, where we use both (though maybe we can kill the GHA upload in a follow-up PR since I don't think it's very useful anymore). This factors it all out into a macro so they both do the same thing. It also fixes the naming of uploaded files to include info about the job name (the full config, so files can be matched to the job visually or by the included job id).

See https://hud.pytorch.org/pr/64846 for results

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64846

Reviewed By: seemethere

Differential Revision: D30878101

Pulled By: driazati

fbshipit-source-id: 0730f17fa3f46a32c131f52669084c3103b0e616
2021-09-13 13:22:54 -07:00
af984c78a9 Pin SciPy to 1.6.3 on Mac (take 2) (#64922)
Summary:
It's already pinned via the docker install on Linux

`scipy.stats.`[`poisson`|`geom`|`binom`] returns quite different results between 1.6.x and 1.7+ versions of SciPy, which makes several distribution tests fail their accuracy thresholds

Reland of https://github.com/pytorch/pytorch/pull/64844 but limited to just Mac platform
A follow-up PR for Windows is coming as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64922

Reviewed By: janeyx99

Differential Revision: D30901257

Pulled By: malfet

fbshipit-source-id: 0543e7bae9d3bbeb8b6be7b3ecf605880f97665f
2021-09-13 12:48:11 -07:00
1bea49c716 [Deploy] Avoid use-after-free during autograd shutdown (#64620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64620

`autograd` extension module's shutdown logic destructs `PyThreadState` by `pybind11::gil_scoped_acquire` using the RAII pattern.

The problem is that torch.deploy also destructs `PyThreadState` as part of its shutdown process (https://www.internalfb.com/phabricator/paste/view/P456363738), causing double destruction, use-after-free.

This change adds `defined(USE_DEPLOY)` as a special case to avoid destruction of `PyThreadState` to the existing special treatment for  `IS_PYTHON_3_9_PLUS`.

Test Plan: Added `TorchpyTest.Autograd` unittest to ensure that torch.deploy can create multiple instances that use autograd without causing a crash.

Reviewed By: albanD

Differential Revision: D30779080

fbshipit-source-id: 4de3283cc2d394acc9b8141c17cacbfab5eea052
2021-09-13 12:43:10 -07:00
fd716fcda2 [Pytorch Edge] Quantized Ops Dtype Selective (#63680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63680

Quantized ops were not covered by dtype selectivity. Add the check, and adjust call sites to be constexpr-friendly.

Test Plan: CI (this covers all model unit tests), verified that segmentation (a model that uses some of these quant ops) still works on instagram.

Reviewed By: dhruvbird, raymondethan

Differential Revision: D30457626

fbshipit-source-id: 5ba850d2b53a18558dfbb1cfaa78d8f53b5dbad8
2021-09-13 11:04:07 -07:00
4ca40aeb83 Disable more of the pragma warning stuff (#64899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64899

ghstack-source-id: 137882055

Test Plan: sandcastle, ossci

Reviewed By: malfet, ngimel

Differential Revision: D30893691

fbshipit-source-id: 67ec8cc9f212aa16a201771603236e429944b561
2021-09-13 10:58:31 -07:00
8cfc74400a [PyTorch] Gate tls_local_dispatch_key_set off on iOS too (#64753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64753

This may be causing problems on iOS. (Maybe we should just revert inlining access to this thing? I really don't understand what's wrong with it, though.)
ghstack-source-id: 137830520

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D30826897

fbshipit-source-id: 0438dee9d49e7601c26cdca0e8540229c777eddb
2021-09-13 10:54:28 -07:00
d4b031b31e typo fix (#64615)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64615

Reviewed By: jbschlosser

Differential Revision: D30884298

Pulled By: ngimel

fbshipit-source-id: 230f9d06aa85abcdd69828a1ea0a83f36cbfcb17
2021-09-13 10:50:01 -07:00
01e92f2a56 [nn] no batch dim support: CosineEmbeddingLoss (#64590)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

TODO
* [x] Add tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64590

Reviewed By: H-Huang

Differential Revision: D30900775

Pulled By: jbschlosser

fbshipit-source-id: d24e72787017e79afbf8f04a94901a290485b81a
2021-09-13 10:45:33 -07:00
2ae938e15e Fixes failure in test_dataloader.py that occurs on jetson boards (#64757)
Summary:
CUDA IPC is not supported for jetsons

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64757

Reviewed By: jbschlosser

Differential Revision: D30900593

Pulled By: ejguan

fbshipit-source-id: c6b2e8a9746276fdb4a009b6412e47cc8aac69f2
2021-09-13 10:11:04 -07:00
8e63199c7c .github: Always run chown workspace (#64854)
Summary:
In some workflow runs, like https://github.com/pytorch/pytorch/runs/3568714658, the chown workspace step is duplicated.

Is that intentional? Unfortunately it is pretty necessary since (w/ docker) the folder can sometimes be in a broken permission state before and after we run jobs.

So this PR makes the second chown workspace run always because that's the true intention of the step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64854

Reviewed By: jbschlosser, seemethere

Differential Revision: D30879289

Pulled By: janeyx99

fbshipit-source-id: 4157ff826c86e8c912deb1ba0cb5c47ea7596529
2021-09-13 10:06:31 -07:00
70e64feda7 Reland .circleci: Skip cuda /cudnn install if existing (#64880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64880

This reverts commit 5836a116d0de214d6d759e70671f23150a5deaba.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30885675

Pulled By: seemethere

fbshipit-source-id: 8c96584d5a632170e29f91c5daf0206680a78661
2021-09-13 09:52:16 -07:00
3d976d9ceb torch.ao migration: quantize_jit.py phase1 (#64860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64860

ghstack-source-id: 137885395

Test Plan: buck test mode/dev //caffe2/test:quantization

Reviewed By: jerryzh168

Differential Revision: D30880574

fbshipit-source-id: 9629027dd3b00bb8d45633e1564fc03a866f8c31
2021-09-13 08:41:48 -07:00
9d52651d4e torch.ao migration: stubs.py phase 1 (#64861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64861

1. move the file
   ```
   hg mv caffe2/torch/quantization/stubs.py caffe2/torch/ao/quantization/
   ```
2. create a new file in the old location and copy the imports
3. fix all call sites inside `torch`
ghstack-source-id: 137885365

Test Plan: buck test mode/dev //caffe2/test:quantization

Reviewed By: jerryzh168

Differential Revision: D30879678

fbshipit-source-id: a2d24f25d01064212aca15e94e8c78240ba48953
2021-09-13 08:40:29 -07:00
c08b2491cc add BFloat16 operators on CPU: cummax, cummin (#63307)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63307

Reviewed By: nikithamalgifb

Differential Revision: D30342002

Pulled By: anjali411

fbshipit-source-id: eee6e640da996ef0e983960119608d9c12405336
2021-09-13 08:00:17 -07:00
d932ddd24b fix quantization.rst doc (#64802)
Summary:
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64802

Reviewed By: jbschlosser

Differential Revision: D30887210

Pulled By: vkuzo

fbshipit-source-id: 0267883d3065d724ea654a28db78f5fe5702ef06
2021-09-13 07:19:54 -07:00
9c73a48ecf ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707

Use torch.randn instead of torch.from_numpy to generate the tensor

Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test

Reviewed By: jingsh

Differential Revision: D30817302

fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
2021-09-13 06:47:35 -07:00
b37503e452 Initial implementation of nanmean (#62671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62671

Very crude first implementation of `torch.nanmean`. The current reduction kernels do not have good support for implementing nan* variants. Rather than implementing new kernels for each nan* operator, I will work on new reduction kernels with support for a `nan_policy` flag and then I will port `nanmean` to use that.
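
A hedged reference implementation of the intended semantics (mean over non-NaN elements; not the PR's kernel):

```python
import torch

def nanmean_reference(x, dim=None, keepdim=False):
    mask = torch.isnan(x)
    filled = torch.where(mask, torch.zeros_like(x), x)  # zero out NaNs
    if dim is None:
        return filled.sum() / (~mask).sum()
    return filled.sum(dim, keepdim=keepdim) / (~mask).sum(dim, keepdim=keepdim)
```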

**TODO**

- [x] Fix autograd issue

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30515181

Pulled By: heitorschueroff

fbshipit-source-id: 303004ebd7ac9cf963dc4f8e2553eaded5f013f0
2021-09-13 05:53:58 -07:00
8535418a06 [Reland] Added reference tests to ReductionOpInfo (#64273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64273

Reintroduced sample_inputs_prod and constrained the range of values for large reference tests.

This reverts commit e4fd2ab59ce8645f5ae9477c7724b6af82124b3b.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30672097

Pulled By: heitorschueroff

fbshipit-source-id: b44ed8dfd5eb0c74c194164dafc3242f6728a78f
2021-09-12 20:05:43 -07:00
1cb3507ed3 Adds DLPack support (#57110)
Summary:
Partially Fixes https://github.com/pytorch/pytorch/issues/55090
Depends on https://github.com/pytorch/pytorch/issues/55365

Inspired by https://github.com/dmlc/dlpack/issues/57#issuecomment-774482973

Questions, in PyTorch we can't create streams or easily synchronize them from just an integer. Should we add an [`ExternalStream`](https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.ExternalStream.html) object like the one we have in CuPy?

TODO: Add tests

Would like some feedback as this design needs quite a few iterations
rgommers leofang
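
A minimal round-trip with the capsule-based DLPack API (the part of the interface I'm confident exists; the stream questions above concern extending it):

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

t = torch.arange(4)
capsule = to_dlpack(t)    # export t as a DLPack capsule
u = from_dlpack(capsule)  # zero-copy import: u shares t's memory
u[0] = 42
assert t[0] == 42
```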

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57110

Reviewed By: saketh-are

Differential Revision: D30761481

Pulled By: mruberry

fbshipit-source-id: e85d78df3c1f8defc2a698878da89cd843cb1209
2021-09-12 19:47:15 -07:00
d46ea03871 [fix] fix test_python_dispatch with pytest (#64574)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62501

Another approach for fixing the same issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64574

Reviewed By: ngimel

Differential Revision: D30867237

Pulled By: ezyang

fbshipit-source-id: c632a1e0b241effdc21ae929abe42fccec88aa24
2021-09-12 17:06:55 -07:00
be79da3303 Revert D30876591: [pytorch][PR] Pin scipy to 1.6.3 on Windows and Mac
Test Plan: revert-hammer

Differential Revision:
D30876591 (39f2b9de2a)

Original commit changeset: 4946e0922063

fbshipit-source-id: b8beff3d973b21fe09c158baef25344030f8fb08
2021-09-12 15:56:40 -07:00
1577c106dc torch.ao migration: numeric suite, eager and fx (#64817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64817

This migrates `torch.quantization._numeric_suite` to `torch.ao.ns._numeric_suite`, and `torch.quantization._numeric_suite_fx` to `torch.ao.ns._numeric_suite_fx`.

1. move the files
```
HG: move eager mode
hg mv caffe2/torch/quantization/_numeric_suite.py caffe2/torch/ao/ns/
HG: move fx
hg mv caffe2/torch/quantization/_numeric_suite_fx.py caffe2/torch/ao/ns/
hg mv caffe2/torch/quantization/ns/* caffe2/torch/ao/ns/fx/
```

2. create new versions of `_numeric_suite.py` and `_numeric_suite_fx.py` with imports

3. update all FB callsites

Test Plan: buck test mode/dev //caffe2/test:quantization

Reviewed By: z-a-f

Differential Revision: D30867538

fbshipit-source-id: 120ee830434ca490c1183a187a518eebcbbaf22c
2021-09-12 12:00:45 -07:00
39f2b9de2a Pin scipy to 1.6.3 on Windows and Mac (#64844)
Summary:
It's already pinned via the docker install on Linux

As `scipy.stats.`[`poisson`|`geom`|`binom`] returns quite different results in 1.7+ versions of SciPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64844

Reviewed By: driazati

Differential Revision: D30876591

Pulled By: malfet

fbshipit-source-id: 4946e0922063e9ac320c218a0b089f73486466f7
2021-09-12 10:53:48 -07:00
47144de473 Revert D30867266: [pytorch][PR] TST Adds gradcheck and gradgradcheck to module info
Test Plan: revert-hammer

Differential Revision:
D30867266 (67ebde5645)

Original commit changeset: cbc073326151

fbshipit-source-id: 00234e01eafc45fb999f7c83a397f9d6b3e01e46
2021-09-12 10:30:28 -07:00
30a7c768d7 [RFC] Modularize functions of parsing bytecode (#61862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61862

Modularize the functions for parsing bytecode tables so that they can be used as needed in situations other than the mobile lite interpreter.
* The decoupled functions are re-used by current lite interpreter loader.
* The bytecode can be serialized/deserialized from other formats.
* The decoupled functions have minimum dependencies on other PyTorch components.

Next:
Build a driver binary to include the parser and interpreter, but only has necessary dependency on other PyTorch components.
ghstack-source-id: 137867287

Test Plan:
As an example, a simple bytecode is parsed to a mobile function, and directly run in the added unit test, `RunTimeTest:ParseBytecode`. It contains basic control flow (if, else) and basic data orchestration (list construction).
CI

Reviewed By: larryliu0820

Differential Revision: D29798382

Pulled By: iseeyuan

fbshipit-source-id: 1c173a5f5d37097e3a97baec3f3e48e1eea1400f
2021-09-11 22:24:05 -07:00
dd2d48df07 Revert D30875977: [caffe2] [aten] Remove loose (unpaired) #pragma warning ( pop ) in TensorBase.h
Test Plan: revert-hammer

Differential Revision:
D30875977 (1f35d20a89)

Original commit changeset: bd593feb5a75

fbshipit-source-id: 4c82dbc857fdb28e0240dacc1a0e607a76552bb4
2021-09-11 17:18:37 -07:00
d13e0c9c39 [iOS][OSS][BE] Update XCode to use 12.5.1 (#64850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64850

ghstack-source-id: 137827895

Test Plan: CircleCI

Reviewed By: hanton

Differential Revision: D30877964

fbshipit-source-id: 803f2506a755b3815024704e6177c7826bc42de8
2021-09-11 11:24:06 -07:00
c9eb312ce9 [iOS][OSS][BE] Remove unused files (#64849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64849

ghstack-source-id: 137827893

Test Plan: CircleCI

Reviewed By: hanton

Differential Revision: D30877962

fbshipit-source-id: a76f7fe888b990ba6cad650f72be7f4a1e58a2f1
2021-09-11 11:22:55 -07:00
82ac3f108d [TensorExpr] Move 2 graph passes from kernel.cpp to graph_opt.cpp (#64828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64828

Also, make `removeUnusedSelfArgument` more consistent with other passes
by mutating the graph in-place rather than returning a copy.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30870776

Pulled By: ZolotukhinM

fbshipit-source-id: 4873f01b013921143a5aa43746d655a2d8d620c9
2021-09-11 10:23:15 -07:00
ff65f637df [TensorExpr] Add debug logging (store/load tracing) to IREval. (#64848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64848

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D30878278

Pulled By: ZolotukhinM

fbshipit-source-id: bd946075336ba2e9786602161c236a0ff8a5a011
2021-09-11 09:25:55 -07:00
180e4fbfae [TensorExpr] LLVMCodegen: fix lowering for UInt->Float casts. (#64862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64862

Previously we were erroneously looking at the destination's signedness rather than the source's.
This was discovered when we tried to implement quantize/dequantize ops.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30881696

Pulled By: ZolotukhinM

fbshipit-source-id: 34af842e5e52a3b6b5d2e70c4ef32f910a20341f
2021-09-11 09:24:36 -07:00
1f35d20a89 [caffe2] [aten] Remove loose (unpaired) #pragma warning ( pop ) in TensorBase.h (#64870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64870

Remove loose (unpaired) #pragma warning ( pop ) in TensorBase.h
Issue started with D30728580 (d701357d92), was fixed with D30846958 (40098f48a1), and brought back again with the reversion of D30846958 (40098f48a1).

Reviewed By: H-Huang

Differential Revision: D30875977

fbshipit-source-id: bd593feb5a75245470e43ad568ebdd3f1738da7c
2021-09-11 00:43:19 -07:00
d4a86c1f3b [quant][fx2trt] Add lowering support for reference linear/conv modules (#64368)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64368

Test Plan:
python torch/fx/experimental/fx2trt/example/quantized_resnet_test.py

Imported from OSS

Reviewed By: 842974287

Differential Revision: D30708738

fbshipit-source-id: 88142b7ce43ed96093597112dab03a2d277de993
2021-09-10 22:25:27 -07:00
4481c87ac4 [tensorexpr] Simplify x/100 -> 0 if x is a non-negative integer less than 100. (#64763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64763

Simplification pattern:
  x/N -> 0; N is a constant positive integer and x is a for-loop index whose range is a subset of [0, N).
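
A one-line check of the underlying arithmetic fact:

```python
# For any integer x with 0 <= x < N, integer division x / N is exactly 0.
assert all(x // 100 == 0 for x in range(100))
```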

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30845854

Pulled By: huiguoo

fbshipit-source-id: 814d69ed4be05e57405c222183cc1c6c526721cd
2021-09-10 20:33:02 -07:00
5836a116d0 Revert D30869803: .circleci: Skip cuda /cudnn install if existing
Test Plan: revert-hammer

Differential Revision:
D30869803 (717d267e19)

Original commit changeset: 9eb3bd20875d

fbshipit-source-id: bef8d0c693696307a3be7abd5331b7fa813d754a
2021-09-10 18:56:50 -07:00
67ebde5645 TST Adds gradcheck and gradgradcheck to module info (#64444)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/61935

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64444

Reviewed By: ngimel

Differential Revision: D30867266

Pulled By: jbschlosser

fbshipit-source-id: cbc0733261517dbfcdd3415d969b9e802b62b7ac
2021-09-10 16:53:11 -07:00
c60075d4b5 Preserve types during empty container assignment (#58911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58911

Stack from [ghstack](https://github.com/ezyang/ghstack):
* __->__ #58911
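
My reading of the kind of pattern this preserves (an illustrative sketch, not taken from the PR's tests):

```python
import torch
from typing import List

@torch.jit.script
def fn() -> List[int]:
    xs: List[int] = [1, 2]
    xs = []  # the declared List[int] type now survives this reassignment
    return xs
```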

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D30785623

Pulled By: ansley

fbshipit-source-id: 4e05d6369318974290fea02ad2bc148293c25090
2021-09-10 16:49:21 -07:00
b4855619d1 Always upload stats to S3 (#64853)
Summary:
It's not very useful when stats are only uploaded when the tests all pass.

Like for this failing run, the stats were not uploaded to Scribe or S3: https://github.com/pytorch/pytorch/runs/3568714658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64853

Reviewed By: seemethere

Differential Revision: D30878361

Pulled By: janeyx99

fbshipit-source-id: 19a4c520efdd5575785a3ffbc60e6c09456b9e0d
2021-09-10 16:49:19 -07:00
f3f410880a [DataPipe] Remove ZipArchiveReader's dependency on FileLoader (#64786)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/64788
* __->__ https://github.com/pytorch/pytorch/issues/64786

This PR removes ZipArchiveReader's dependency on the FileLoader DataPipe by allowing it to use an IterDataPipe of path names as input, rather than a tuple of path name and stream.

It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.

The whole stack fixes issues related to unclosed buffer stream (see https://github.com/pytorch/pytorch/issues/64281).

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64786

Reviewed By: ngimel

Differential Revision: D30870968

Pulled By: NivekT

fbshipit-source-id: 64b04d1697b99772f2fa20fc141668e6b8e18c41
2021-09-10 16:49:17 -07:00
717d267e19 .circleci: Skip cuda /cudnn install if existing (#64825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64825

Rewrites this script to only install the CUDA tools if they are not already
pre-installed

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30869803

Pulled By: seemethere

fbshipit-source-id: 9eb3bd20875df0f2b18f5314ac825dbdf91637b5
2021-09-10 16:49:14 -07:00
dafa0a5a3b [doc][hackathon] To add Adadelta Optimizer to the documentation (#63255)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The following tracking issue lists all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we are adding a description of the AdaDelta algorithm to the documentation. For more details, we refer to the paper: https://arxiv.org/abs/1212.5701

<img width="654" alt="AdaDeltaalg" src="https://user-images.githubusercontent.com/73658284/132770544-82ccf90a-1d54-4ad5-8fc4-51c8dec63a12.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63255

Reviewed By: ngimel

Differential Revision: D30867589

Pulled By: iramazanli

fbshipit-source-id: 5ba602c20c724a4486bdd38b73e1b64c0e767bdc
2021-09-10 16:49:12 -07:00
d8ae3cc318 Add more error checking in subclass creation (#64746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64746

This extracts the error checking that used to be in the PR above.
We are not going to land the proposed fix there, but I think we want this error checking in right now, as these issues would otherwise lead to a memory leak and arbitrary memory reads/writes, respectively.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867569

Pulled By: albanD

fbshipit-source-id: bf468033fb8b49fcb26eed423f5fad82b4a46c56
2021-09-10 16:49:10 -07:00
89f94fc15f Move THPVariable_NewWithVar around (#64550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64550

Just moves a function around to make the next PR easier to read.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30867570

Pulled By: albanD

fbshipit-source-id: 99ae925568ed29ca7fdea059762c21d430d4a204
2021-09-10 16:49:08 -07:00
2cc9778495 [MicroBench] Added a log_vml version of the signed log1p kernel (#64205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205

The log_vml version of the micro-bench is over **2x** faster than the log1p version. Here are the perf numbers:

```
---------------------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------
SignedLog1pBench/ATen/10/1467           45915 ns        45908 ns        14506 GB/s=2.5564G/s
SignedLog1pBench/NNC/10/1467            40469 ns        40466 ns        17367 GB/s=2.9002G/s
SignedLog1pBench/NNCLogVml/10/1467      19560 ns        19559 ns        35902 GB/s=6.00016G/s
```

Thanks to bertmaher for pointing this out.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30644716

Pulled By: navahgar

fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19
2021-09-10 16:49:06 -07:00
cad7a4b0ea [nnc] Added an implementation of sign op (#64033)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64033

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30579197

Pulled By: navahgar

fbshipit-source-id: f9f7fa7f2ffa109cf4e441eb1af821b8b891d4d3
2021-09-10 16:49:04 -07:00
3fbb49e75d Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647

Add support for benchmarking 8-bit quantization of N-D batched embeddings. Currently this only works for 3-dim embeddings and still requires thought on ramping up from 3-dim to N-dim.

Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```

Reviewed By: jingsh

Differential Revision: D30770085

fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
2021-09-10 16:49:02 -07:00
227aafd1d9 Revert D30846958: [caffe2/aten] Remove loose #pragma warning ( pop ) in TensorBase.h
Test Plan: revert-hammer

Differential Revision:
D30846958 (40098f48a1)

Original commit changeset: 52a3fb66e426

fbshipit-source-id: 1d749f6981756f2169d6867538555a945cbb8ca6
2021-09-10 16:47:08 -07:00
5060b69d62 [DataPipe] fixing tests related fork() to remove warnings (#64827)
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827

Reviewed By: ejguan

Differential Revision: D30870528

Pulled By: NivekT

fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
2021-09-10 11:01:42 -07:00
ade4bf3e82 [tensorexpr] Add 'pre_alloc' argument in python API of tensorexpr kernel (#64718)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64718

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30826582

Pulled By: huiguoo

fbshipit-source-id: 6c173c8964f2643039273cdc83e64fb02bb5f381
2021-09-10 10:03:00 -07:00
92cd5ab1cb Skip conjugate and negate fallback for view ops and their in-place versions (#64392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64392

cc ezyang anjali411 dylanbespalko mruberry Lezcano nikitaved

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30866330

Pulled By: anjali411

fbshipit-source-id: 7b2f51486bf1d610ad2b1472306bab608ee69c37
2021-09-10 09:57:27 -07:00
54b72a99ef To add Rprop documentation (#63866)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The following tracking issue lists all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we are adding a description of Rprop to the documentation. For more details, we refer to the paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.1417

<img width="657" alt="Rpropalg" src="https://user-images.githubusercontent.com/73658284/132750009-a5ec059e-6d53-4c67-917b-57174c8ca27b.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63866

Reviewed By: ngimel

Differential Revision: D30867590

Pulled By: iramazanli

fbshipit-source-id: 0d2d4ffc6c4d939290bbbaa84d2c6e901ed8b54a
2021-09-10 09:49:10 -07:00
c7b03e2b83 [ROCm] define C10_WARP_SIZE to warpSize HIP constant (#64302)
Summary:
warpSize is defined as a constexpr in HIP headers. It is incorrect to assume a warp size of 64. This change fixes the C10_WARP_SIZE definition in torch sources, similar to [how it was done in caffe2](https://github.com/pytorch/pytorch/blob/master/caffe2/utils/GpuDefs.cuh#L10-L14).

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64302

Reviewed By: mrshenli

Differential Revision: D30785975

Pulled By: malfet

fbshipit-source-id: 68f8333182ad4d02bd0c8d02f1751a50bc5bafa7
2021-09-10 09:43:47 -07:00
db3fcf0af3 fix typo in torch/onnx/utils.py (#63396)
Summary:
fixes minor typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63396

Reviewed By: pbelevich

Differential Revision: D30644295

Pulled By: SplitInfinity

fbshipit-source-id: c506f67383909aa2c0c7c533038446b4b3d76a3a
2021-09-10 09:37:44 -07:00
c12df2dc23 build: bump bazel to 4.2.1 (#64455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64455

Reviewed By: saketh-are

Differential Revision: D30752580

Pulled By: malfet

fbshipit-source-id: 4f5cc6f820396348181c09463f7e5628b5f69471
2021-09-10 08:30:10 -07:00
63b180beed ROCm MIOpen NHWC Convolution support (#63617)
Summary:
- Added 2D-Convolution NHWC support
  - on ROCm 4.3, with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` flag
  - May need to force MIOpen to search for solutions ( see examples below for flags )

**PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag**
MIOpen does not officially support NHWC yet, although convolution support has been added to the tip-of-tree of MIOpen. This flag is intended to be short-lived, explicitly turning on NHWC support until ROCm officially supports NHWC and performance is verified.

**Examples**
1. Example usage 1 : Run test on ROCm4.3
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc" `
2. Example usage 2: Run the following with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` on ROCm4.3.
```
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride() )

out = model(input)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :",  out.shape, "out stride :", out.stride() )
```

See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63617

Reviewed By: saketh-are

Differential Revision: D30730800

Pulled By: ezyang

fbshipit-source-id: 61906a0f30be8299e6547d312ae6ac91cc7c3238
2021-09-10 08:06:32 -07:00
2a81e8b8f1 Let all_reduce_coalesced and all_gather_coalesced return Future objects (#64722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722

`all_reduce_coalesced` and `all_gather_coalesced` have never been publicly
released in our API docs, so I would assume the blast radius to be small.

The motivation for this change is to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by re-using the `allreduce`
and `allgather` C++ cores and performing flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from C++ ProcessGroup APIs. For the async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future and use the chained child Future
as the return value (otherwise, we will need to wrap the child Future
into another work handle). This PR tries to test if we can directly
return a Future without breaking tests and internal use cases. If yes,
it will make the consolidation a lot easier.
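
A rough Python-side sketch of that direction (a hypothetical function name; it uses the private `torch._utils` flatten helpers):

```python
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def all_reduce_coalesced_sketch(tensors, group=None):
    flat = _flatten_dense_tensors(tensors)
    work = dist.all_reduce(flat, group=group, async_op=True)

    def copy_back(fut):
        # Unflatten the reduced buffer and copy results into the originals.
        for t, r in zip(tensors, _unflatten_dense_tensors(flat, tensors)):
            t.copy_(r)
        return tensors

    # Chain the copy-back on the communication Future; return the child Future.
    return work.get_future().then(copy_back)
```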

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30830994

Pulled By: mrshenli

fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
2021-09-10 07:45:25 -07:00
88fff22023 torch.lu: forward AD support (#64742)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64742

Reviewed By: H-Huang

Differential Revision: D30841227

Pulled By: albanD

fbshipit-source-id: dc4d043ab94358594adb110fbbbb60750c98262a
2021-09-10 07:19:11 -07:00
be091950d0 [const_fold] Keep around node.meta for replaced folded ops (#64782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64782

Previously, get_attrs that were added to the graph did not retain node.meta after folding. Add such support, and improve coverage in general here.

Test Plan: Added test coverage.

Reviewed By: protonu

Differential Revision: D30852704

fbshipit-source-id: ece87a61c69b2e68982964c6adc4dde14dae12c7
2021-09-09 23:52:39 -07:00
40098f48a1 [caffe2/aten] Remove loose #pragma warning ( pop ) in TensorBase.h (#64773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64773

Remove loose `#pragma warning ( pop )` in TensorBase.h.

Reviewed By: ezyang

Differential Revision: D30846958

fbshipit-source-id: 52a3fb66e426bc16ef7bde2a13e26e8293969026
2021-09-09 23:45:45 -07:00
95d98dfeec Add TRTSplitter (#64762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64762

Extract and format TRTSplitter from the fx2trt_example code. The current implementation is tentative, subject to change based on feeds model lowering progress.

Test Plan:
Manual print of the supported operator dict:
`{<class 'torch.nn.modules.activation.ReLU'>: None, <function relu at 0x7f9b1abd0790>: None, <class 'torch.nn.modules.activation.Sigmoid'>: None, <class 'torch.nn.modules.pooling.AdaptiveAvgPool2d'>: None, <built-in method add of type object at 0x7f9b7f402498>: None, <built-in function add>: None, <built-in method add of PyCapsule object at 0x7f9b1a3dc690>: None, <built-in method add_relu of PyCapsule object at 0x7f9b1a34cf90>: None, <class 'torch.nn.modules.batchnorm.BatchNorm2d'>: None, <class 'torch.nn.quantized.modules.batchnorm.BatchNorm2d'>: None, <class 'torch.nn.modules.conv.Conv2d'>: None, <class 'torch.nn.quantized.modules.conv.Conv2d'>: None, <class 'torch.nn.intrinsic.quantized.modules.conv_relu.ConvReLU2d'>: None, <class 'torch.nn.modules.linear.Linear'>: None, <class 'torch.nn.quantized.modules.linear.Linear'>: None, <class 'torch.nn.modules.pooling.MaxPool2d'>: None, <built-in function mul>: None, <built-in method mul of type object at 0x7f9b7f402498>: None, <built-in method mul of PyCapsule object at 0x7f9b1a3dc6c0>: None, <built-in method flatten of type object at 0x7f9b7f402498>: None, <class 'torch.nn.quantized.modules.DeQuantize'>: None, <built-in method dequantize of type object at 0x7f9b7f402498>: None, 'dequantize': None, <class 'torch.nn.quantized.modules.Quantize'>: None, <built-in method quantize_per_tensor of type object at 0x7f9b7f402498>: None, <class 'torch.nn.modules.linear.Identity'>: None, <function conv2d at 0x7f9b1a1fe9d0>: None, <function flatten at 0x7f9b1a1f5ca0>: None, <function size at 0x7f9b1a1f5b80>: None, <function batch_norm at 0x7f9b1a1feaf0>: None, <function layer_norm at 0x7f9b1a1feb80>: None, <function softmax at 0x7f9b1a1f9550>: None, <function relu at 0x7f9b1a1fe040>: None, <function sin at 0x7f9b1a2030d0>: None, <function cos at 0x7f9b1a203160>: None, <function tan at 0x7f9b1a2031f0>: None, <function sinh at 0x7f9b1a1fe160>: None, <function cosh at 0x7f9b1a1fe280>: None, <function tanh at 0x7f9b1a1fe310>: None, <function asin at 0x7f9b1a1fe3a0>: None, <function acos at 0x7f9b1a1fe430>: None, <function atan at 0x7f9b1a1fe4c0>: None, <function exp at 0x7f9b1a1fe550>: None, <function log at 0x7f9b1a1fe5e0>: None, <function sqrt at 0x7f9b1a1fe670>: None, <function reciprocal at 0x7f9b1a1fe700>: None, <function abs at 0x7f9b1a1fe790>: None, <function neg at 0x7f9b1a1fe820>: None, <function floor at 0x7f9b1a1fe8b0>: None, <function ceil at 0x7f9b1a1fe940>: None, <function sum at 0x7f9b1a1f9c10>: None, <function max_pool2d at 0x7f9b1a1f5d30>: None, <function squeeze at 0x7f9b1a1f5c10>: None, <function add at 0x7f9b1a1f91f0>: None, <function sub at 0x7f9b1a1f9ca0>: None, <function div at 0x7f9b1a1f9dc0>: None, <function mul at 0x7f9b1a1f9d30>: None, <function pow at 0x7f9b1a1f9e50>: None, <function min_two_tensors_input at 0x7f9b1a1f9940>: None, <function unsqueeze at 0x7f9b1a1f9280>: None, <function topk at 0x7f9b1a203280>: None, <function adaptive_avg_pool2d at 0x7f9b1a1f5dc0>: None, <function avg_pool2d at 0x7f9b1a1f5e50>: None, <function reshape at 0x7f9b1a203550>: None, <function slice_tensor at 0x7f9b1a1fee50>: None, <function split at 0x7f9b1a1fec10>: None, <function linear at 0x7f9b1a1f51f0>: None, <function clamp at 0x7f9b1a1f93a0>: None, <function tuple_construct at 0x7f9b1a1fed30>: None, <function contiguous at 0x7f9b1a1f9430>: None, <function getitem at 0x7f9b1a203310>: None, <function cat at 0x7f9b1a1f9310>: None, <function transpose at 0x7f9b1a1f94c0>: None, <function matmul at 0x7f9b1a1f98b0>: None, <function sigmoid at 
0x7f9b1a1fe1f0>: None, <function permute at 0x7f9b1a1f9670>: None, <function quantize_per_tensor at 0x7f9b1a1f9b80>: None, <function dequantize at 0x7f9b1a1f99d0>: None, <function sign at 0x7f9b1a1f5ee0>: None}`

Reviewed By: 842974287

Differential Revision: D30798047

fbshipit-source-id: 69076a550874425b7186fbbf2ecf03da4a99b42f
2021-09-09 21:08:57 -07:00
88c0ea9131 [PyTorch] Fix missing move in torch::jit::Lexer::next (#64653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64653

Saves shared_ptr refcount inc/dec in SourceRange.
ghstack-source-id: 137608457

Test Plan: Profiled startup of framework overheads benchmark from high_per_models; self time spent in next() is way down.

Reviewed By: dhruvbird

Differential Revision: D30739240

fbshipit-source-id: ac455678c9d46e657b111d3788d4369983028674
2021-09-09 19:01:07 -07:00
b7b4f63bbc [PyTorch] Use std::find in the JIT lexer (#64652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64652

If nothing else, it is slightly clearer code.
ghstack-source-id: 137608456

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D30739239

fbshipit-source-id: bc7917b59883ca4a33fc6916b4e422bad79cf04b
2021-09-09 18:59:27 -07:00
a17d6c7f80 [TensorExpr] Simplify TE IR before applying any transformations. (#64717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64717

This also exposed several bugs, which are fixed in this PR.

Differential Revision: D30826408

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: a67ec5739aceed9ffdf0d24f77eb3787cefe4560
2021-09-09 18:50:51 -07:00
ef2c9d7d8a [quant][fix] Fix quantization for sub_scalar (#64603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64603

We'll insert an observer only when both the operator and the dtype are supported

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_sub_scalar

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30797025

fbshipit-source-id: a77c21e2749405534fc245374cf33a0657a3d2c8
2021-09-09 17:18:31 -07:00
1b5b210f2c [Android] print type name for IValues (#64602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64602

Print the type name in the error message for easier debugging.

Test Plan:
Example:
java.lang.IllegalStateException: Expected IValue type Tensor, actual type TensorList

Reviewed By: beback4u

Differential Revision: D30782318

fbshipit-source-id: 60d88a659e7b4bb2b574b12c7652a28f0d5ad0d2
2021-09-09 17:06:15 -07:00
11ef68938c [caffe2][tiny] Add logging to report what the current lengths are when mismatched lengths are detected (#64768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64768

as title

Differential Revision: D30846637

fbshipit-source-id: 266768c81b315fdebba854135ea2db1faf67fd6a
2021-09-09 16:46:55 -07:00
d4b09dbab3 [doc][hackathon] To add Adagrad Optimizer to the documentation (#63254)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms and links to the originally published papers.

In this PR we are adding a description of Adagrad to the documentation. For more details, we refer to the paper
http://jmlr.org/papers/v12/duchi11a.html

<img width="658" alt="AdaGradAlgo" src="https://user-images.githubusercontent.com/73658284/132743276-a52ea3fb-70a5-4788-94b7-f99367907a26.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63254

Reviewed By: albanD

Differential Revision: D30852139

Pulled By: iramazanli

fbshipit-source-id: 9e496560a97e92be8386585b01d9bd3bba4b0c66
2021-09-09 15:41:29 -07:00
9ad75281f6 [Static Runtime] Fix resize_output_check warning coming from prim::VarConcat (#64765)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64765

Test Plan: Tested the fix with BR v1 model predictor-replayer setup.

Reviewed By: ajyu

Differential Revision: D30846506

fbshipit-source-id: 3ef3c93f11285c7cd1e2b188ca298a7ab4fba579
2021-09-09 14:38:50 -07:00
7f1932e1b9 Rename profiler metadata key (#63743)
Summary:
Rename the metadata key to match the variable name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63743

Reviewed By: albanD

Differential Revision: D30839501

Pulled By: gdankel

fbshipit-source-id: b9b4e670dcc9557b8d8d0730baea0ad39a1a0ca4
2021-09-09 13:06:16 -07:00
6cc8cc6e56 Add support for lowering info during serialize_module, and add padding/partial to it (#5810)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64725

- Any info added to the dict in node.meta["lowering_info"] will be added to the node_rep during serialization.
- Use this to add annotations on placeholders that allow partial inputs and require padding.
- Check for these annotations and set them in the NNPICompiledFunction as expected

Test Plan: Validated working on inline_cvr in stack. Additionally existing fx_glow end to end tests should still pass.

Reviewed By: 842974287

Differential Revision: D30824192

fbshipit-source-id: def64ef097aa35c337abb494415f7d437c6c7fa9
2021-09-09 13:01:28 -07:00
d43fb75a21 cat_shape_check: Fixes dimension in the error message for CUDA cat shape check and removes unnecessary offending index information (#64556)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/64207

Thank you, SsnL for providing the reproducing script.

cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64556

Reviewed By: albanD

Differential Revision: D30843859

Pulled By: ngimel

fbshipit-source-id: 457ebe80eaef793d9f5d35ee962b6697e5de1907
2021-09-09 12:51:11 -07:00
2c243ed112 Enable the on-demand performance PR testing to run on a specified TB branch (#64701)
Summary:
This is to enable performance testing of experimental features such as LazyTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64701

Test Plan:
TorchBench CI

RUN_TORCHBENCH: BERT_pytorch, mobilenet_v3_large
TORCHBENCH_BRANCH: v1.0

Reviewed By: seemethere

Differential Revision: D30847389

Pulled By: xuzhao9

fbshipit-source-id: 6853b368fa6f1ba8ffde517805c74bf318dcb35b
2021-09-09 12:41:21 -07:00
517033916c .github: Remove add_annotations workflow (#64449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64449

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: suo, janeyx99

Differential Revision: D30738460

Pulled By: seemethere

fbshipit-source-id: f1259fcba2f0c14a9bcfbe811ec0a4bf61106619
2021-09-09 12:22:12 -07:00
9797a32faf [Dist/CI] Remove dist from target determinator (#64721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64721

There are a couple of PRs where distributed CI did not run and we expected
it to. Examples:

https://github.com/pytorch/pytorch/pull/64513/checks?check_run_id=3539190960,
https://github.com/pytorch/pytorch/pull/64113. All distributed tests should've
been run on these PRs, but we can see they were not:

```
Determination is skipping distributed/test_c10d_common
Determination is skipping distributed/test_c10d_gloo
Determination is skipping distributed/test_c10d_nccl
Determination is skipping distributed/test_c10d_spawn_gloo
Determination is skipping distributed/test_c10d_spawn_nccl
Running distributed/test_data_parallel without determination
Determination is skipping distributed/test_distributed_spawn
Determination is skipping distributed/test_jit_c10d
```

Since it is important to run distributed tests on PRs that touch distributed,
exclude distributed from target_det_list for now.
ghstack-source-id: 137654015

Test Plan: CI

Reviewed By: driazati, mrshenli

Differential Revision: D30830455

fbshipit-source-id: 8b0fdf5b57c2c647b0d82c48e2bb8e2bdbe4d307
2021-09-09 12:07:43 -07:00
46c886e8a6 fix acc topk's handling of the case when dim=0, fix tests as well (#64727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64727

The acc_ops converter for topk has a subtle bug (found while trying to introduce max/min):
the code does not differentiate between dim == None and dim == 0, but these are different computations.
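
For context, a minimal illustration of the distinction using plain `torch.topk` (the buggy code is the acc_ops converter itself, not shown here):

```python
import torch

x = torch.randn(3, 4)
v_default, _ = torch.topk(x, k=2)        # dim omitted: top-k along the last dimension
v_dim0, _ = torch.topk(x, k=2, dim=0)    # top-k along dimension 0 -- a different computation
assert v_default.shape == (3, 2)
assert v_dim0.shape == (2, 4)
```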

Reviewed By: jfix71, 842974287

Differential Revision: D30833621

fbshipit-source-id: 6cd84e6ca4e95bb1a6d6465e61830b76808a9c78
2021-09-09 10:35:23 -07:00
3d3ff4a9e7 Fix a shadowed variable (#64695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64695

Resolves this warning:
```
caffe2/aten/src/ATen/ParallelOpenMP.h:109:63: warning: declaration of 'int64_t begin' shadows a parameter [-Wshadow=compatible-local]
  109 |   internal::invoke_parallel(begin, end, grain_size, [&](int64_t begin, int64_t end) {
      |                                                       ~~~~~~~~^~~~~
caffe2/aten/src/ATen/ParallelOpenMP.h:86:1: note: shadowed declaration is here
   85 | inline scalar_t parallel_reduce(
      |                 ~~~~~~~~~~~~~~~~
   86 |     const int64_t begin,
      | ^   ~
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30816128

fbshipit-source-id: 3adff6d94eea9fbd65885e88283cae10b87dba18
2021-09-09 10:34:01 -07:00
8deaa476ac Added more version comparison operations (#63848)
Summary:
Currently the [TorchVersion](1022443168/torch/torch_version.py (L13)) class only supports 'greater than' and 'equal to' operations for comparing torch versions, so something like `TorchVersion('1.5.0') < (1,5,1)` or `TorchVersion('1.5.0') >= (1,5)` will throw an error.

I have added 'less than' (`__lt__()`), 'greater than or equal to' (`__ge__()`) and 'less than or equal to' (`__le__()`) operations, so that the TorchVersion object can be used for a wider range of version comparisons.
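
A minimal sketch of the comparisons this enables, based on the examples in this summary (the import path is taken from the linked `torch/torch_version.py`):

```python
from torch.torch_version import TorchVersion

v = TorchVersion('1.5.0')
assert v < (1, 5, 1)     # __lt__, added in this PR
assert v >= (1, 5)       # __ge__, added in this PR
assert v <= '1.5.0'      # __le__, added in this PR
```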

cc seemethere zsol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63848

Reviewed By: fmassa, heitorschueroff

Differential Revision: D30526996

Pulled By: seemethere

fbshipit-source-id: 1db6bee555043e0719fd541cec27810852590940
2021-09-09 10:30:20 -07:00
cfa6162e5e Reverts cat and stack warning when out= is not the expected shape (#64714)
Summary:
These warnings are being thrown too aggressively at the moment. See https://github.com/pytorch/pytorch/issues/64709 for a follow-up to reenable them once internal call sites are reviewed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64714

Reviewed By: ngimel

Differential Revision: D30822965

Pulled By: mruberry

fbshipit-source-id: 3ad7c92d381d42ac6187ed84afab477c579a8f35
2021-09-09 10:03:22 -07:00
2b41bf40c5 To add SequentialLR to PyTorch Core Schedulers (#64037)
Summary:
Partially resolves https://github.com/pytorch/vision/issues/4281

In this PR we are proposing a new scheduler, SequentialLR, which chains a list of different schedulers to be called in different phases of the training process.

The main motivation for this scheduler is the recently gained popularity of a warm-up phase at training time. It has been shown that taking small steps in the initial stages of training can help the convergence procedure get faster.

With the help of SequentialLR we can run a small constant (or linearly increasing) learning rate schedule followed by the actual target learning rate scheduler.

```PyThon
scheduler1 = ConstantLR(optimizer, factor=0.1, total_iters=2)
scheduler2 = ExponentialLR(optimizer, gamma=0.9)
scheduler = SequentialLR(optimizer, schedulers=[scheduler1, scheduler2], milestones=[5])

for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
```

This code snippet calls `ConstantLR` in the first 5 epochs and follows up with `ExponentialLR` in the subsequent epochs.

This scheduler can be used to chain any group of schedulers one after another. The main consideration is that every time we switch to a new scheduler, that scheduler starts from the beginning, i.e., the zeroth epoch.

We also add the chained scheduler to the `optim.rst` and `lr_scheduler.pyi` files here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64037

Reviewed By: albanD

Differential Revision: D30841099

Pulled By: iramazanli

fbshipit-source-id: 94f7d352066ee108eef8cda5f0dcb07f4d371751
2021-09-09 09:36:32 -07:00
c3203efe80 [pytorch] Make qlinear weight packing thread safe (#63804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63804

Adding a lock around weight packing section of qlinear + qlinear_dynamic

Test Plan: automated tests

Reviewed By: kimishpatel

Differential Revision: D30340957

fbshipit-source-id: 1c9faf796c4ffbc74345396188a6f1154a76bea6
2021-09-09 09:31:48 -07:00
dc53546655 torch.lu_solve: forward AD support (#64646)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64646

Reviewed By: VitalyFedyunin

Differential Revision: D30807898

Pulled By: albanD

fbshipit-source-id: 1f943c22357dd1b3662cfe0d2a26af68e3a2df4c
2021-09-09 08:58:00 -07:00
b7c86365d1 [nnc] Handled cast in index expression during inlining (#64716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64716

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30826388

Pulled By: navahgar

fbshipit-source-id: 7e446602f650527e0d954e437f0370602019e040
2021-09-09 08:30:52 -07:00
652a8bf7d0 [nnc] Updated indices during broadcast to use int64_t (#64627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64627

This fixes the root cause of S242719

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30801686

Pulled By: navahgar

fbshipit-source-id: b6d3ebdc7eb57116eaced53c2f35c7798bb17e80
2021-09-09 08:29:37 -07:00
459653a0f6 Revert D30745921: [DDP] Fix when buffers are reassigned in module
Test Plan: revert-hammer

Differential Revision:
D30745921 (d59ecc02df)

Original commit changeset: 25eb1edbf445

fbshipit-source-id: 343ead86bf1e2d0b2d4124be331ea2fa437303ad
2021-09-09 08:23:16 -07:00
5bc53ac5ef Revert D30745961: [DDP] Remove self.modules_params
Test Plan: revert-hammer

Differential Revision:
D30745961 (8c09510294)

Original commit changeset: 32d102502570

fbshipit-source-id: 59f7cc50d369b6cc2856cf4ebd0f58b96202336d
2021-09-09 08:23:14 -07:00
f1aaf8afcd Revert D30745960: [DDP] Remove SPMD from self.modules_buffers
Test Plan: revert-hammer

Differential Revision:
D30745960 (1553259520)

Original commit changeset: 66a8f9847e9f

fbshipit-source-id: d3f3fb813c45ac1b0ff15c6154b2e99e5dbab433
2021-09-09 08:22:12 -07:00
3bf93d769c [JIT] Add gradient check in constants (#64613)
Summary:
fixes internal issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64613

Reviewed By: Gamrix

Differential Revision: D30799016

Pulled By: eellison

fbshipit-source-id: 48ef52d1cac627919e6cd232216d24878a2a8b58
2021-09-09 08:13:57 -07:00
d4b1016850 Filter out _disabled_torch_function_impl from handle_torch_function (#64689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64689

This brings it in line with the C++ implementation.

Fixes https://github.com/pytorch/pytorch/issues/64687

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30816215

Pulled By: ezyang

fbshipit-source-id: ed36af6c35467ae678d9548197efd97c36d38dec
2021-09-09 07:29:09 -07:00
239366c9c2 To add Rectified Adam Description to Documentation (#63772)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms and links to the originally published papers.

In this PR we are adding a description of the Rectified Adam algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1908.03265

<img width="446" alt="RadamAlgo" src="https://user-images.githubusercontent.com/73658284/132587815-4764b642-df53-4e41-975f-72e0f40fdc48.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63772

Reviewed By: datumbox

Differential Revision: D30839694

Pulled By: iramazanli

fbshipit-source-id: 6f5629ce56e10c66a451433334b587b99eda1610
2021-09-09 07:10:36 -07:00
5b21f172a4 [doc][hackathon] To add AdamW Optimizer to the documentation (#63252)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms and links to the originally published papers.

In this PR we are adding a description of the AdamW algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1711.05101

<img width="442" alt="AdamWalgo" src="https://user-images.githubusercontent.com/73658284/132589957-6d381e96-cb62-40d0-990f-82a32ec455be.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63252

Reviewed By: datumbox

Differential Revision: D30839685

Pulled By: iramazanli

fbshipit-source-id: 1a426c874ab86408d286a34f41aefcf5b21167c0
2021-09-09 07:05:31 -07:00
39ce801d1f To add Adamax algorithm to documentation (#63903)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms and links to the originally published papers.

In this PR we are adding a description of the Adamax algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1412.6980

<img width="447" alt="Adamx" src="https://user-images.githubusercontent.com/73658284/132577306-878ce64c-627a-4086-808c-d0482868d4a1.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63903

Reviewed By: albanD

Differential Revision: D30819055

Pulled By: iramazanli

fbshipit-source-id: 37f748cbea9f93bf37193ee30fc295fb1a1e9ffd
2021-09-09 06:42:33 -07:00
15c21fa45f [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D30835585

fbshipit-source-id: a7d35319fd3ae3eddd29b69d299d842f68d587f6
2021-09-09 04:23:50 -07:00
233e3e5bb4 Fix log1p lowering bug (#64724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64724

The literal `1` introduces an int tensor instead of a float tensor, which doesn't work well with downstream elementwise operators. The error looks like:
```
[TensorRT] WARNING: IElementWiseLayer with inputs (Unnamed Layer* 1) [Unary]_output and (Unnamed Layer* 2) [Constant]_output: first input has type Float but second input has type Int32.
```
Changing the constant to float type fixes this.
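
A minimal sketch of the idea behind the fix (a hypothetical decomposition; the actual change lives in the lowering code):

```python
import torch

def log1p_decomposed(x: torch.Tensor) -> torch.Tensor:
    # Using the float constant 1.0 keeps the elementwise add in floating point;
    # a literal 1 would introduce an Int32 constant and trigger the TensorRT
    # type-mismatch warning quoted above.
    return torch.log(x + 1.0)
```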

Reviewed By: 842974287

Differential Revision: D30796959

fbshipit-source-id: 0538e4dd960df9ce87a2d4cafe8f1a0c061b6bad
2021-09-09 00:59:44 -07:00
d0b207e68b Migrate uses of THCReduceApplyUtils to cuda_utils::BlockReduce (#64713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64713

Resubmit of #64442

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30825646

Pulled By: ngimel

fbshipit-source-id: 66b06bd0b30b401833e337920681d19d96b11f9d
2021-09-08 22:09:01 -07:00
1553259520 [DDP] Remove SPMD from self.modules_buffers (#64474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64474

No need for a nested list here.
ghstack-source-id: 137526312

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745960

fbshipit-source-id: 66a8f9847e9fe1e02c51b79647e93bf7665cf4d9
2021-09-08 19:16:15 -07:00
8c09510294 [DDP] Remove self.modules_params (#64473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64473

Unused after SPMD deprecated.
ghstack-source-id: 137526305

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745961

fbshipit-source-id: 32d102502570291e01579e5b47a6d74dc71013bb
2021-09-08 19:16:13 -07:00
d59ecc02df [DDP] Fix when buffers are reassigned in module (#64472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472

Sometimes a user module can reassign a tensor buffer, as in:

```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```

In this case, `self.modules_buffers` becomes outdated, so we repopulate it whenever we need to sync module buffers.

See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
ghstack-source-id: 137526309

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30745921

fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
2021-09-08 19:14:55 -07:00
b6544ef815 [PyTorch] Fix MobileDebugInfo vector copy (#64030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64030

ghstack-source-id: 137566816

Test Plan:
Pixel 3 before:  https://our.intern.facebook.com/intern/aibench/details/320277034999340
Pixel 3 after: https://our.intern.facebook.com/intern/aibench/details/724509739115867

can see the vector copy disappear in the flame graph. Overall mean decreased from 354 ms to 348 ms (though I'm not sure if this is outside usual noise).

Reviewed By: raziel

Differential Revision: D30559032

fbshipit-source-id: 6d8bb5396d3449cc63023ee7acf694b5d146ddc1
2021-09-08 18:32:50 -07:00
0d0d2f2ac5 [PyTorch] move from input ivalues in ByteCodeDeserializer (#64029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64029

This should save us a separate pass over the data structure to destroy it.
ghstack-source-id: 137566821

Test Plan:
Pixel3
before:
https://www.internalfb.com/intern/aibench/details/503337445067962
after:
https://our.intern.facebook.com/intern/aibench/details/320277034999340

overall mean time decreased from 373 ms to 358 ms. In flame graph, we
can see that some time spent destroying a vector of IValues was moved
into parseMethods, and the new parseMethods time is less than the old
time plus the recursive destruction time.

Reviewed By: dhruvbird

Differential Revision: D30559530

fbshipit-source-id: d080295a846745ea03ac50f08f4f6c95f4eaf3d8
2021-09-08 18:32:48 -07:00
f5e76b4e38 [PyTorch] Copy vectors less in Function::append_operator (#63977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63977

Doesn't seem to be any reason to copy these argument vectors.
ghstack-source-id: 137566815

Test Plan: CI

Reviewed By: dhruvbird, raziel

Differential Revision: D30550301

fbshipit-source-id: 33c199f975e4fb62c50a8210dc08aa9bb7a3e2f2
2021-09-08 18:31:38 -07:00
0ef32625a8 [FX] make visualizer produce different formatted output (#64699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64699

Previously we hardcoded the svg format. We should give folks a choice of output format. If a weird extension like .abc is given, this errors out, which we consider the right behavior.

Reviewed By: houseroad

Differential Revision: D30718883

fbshipit-source-id: fe8827262f94ea6887999bb225de763d1909eef8
2021-09-08 18:22:12 -07:00
86e3b2727e Re-enable nightly doc pushes (#64708)
Summary:
That were accidentally disabled by https://github.com/pytorch/pytorch/pull/64222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64708

Reviewed By: seemethere

Differential Revision: D30822089

Pulled By: malfet

fbshipit-source-id: 056b5c006f236c78ffe8afa4a5eab2f35e1bce89
2021-09-08 18:07:54 -07:00
9a6c2a75b8 [acc_tracer] Enable check_mutable_operations (#64456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64456

att

Test Plan: CI

Reviewed By: protonu

Differential Revision: D30679174

fbshipit-source-id: 73f3a07d58380cd44fb3481aa97d463c0a964de8
2021-09-08 16:11:15 -07:00
5c27a580ec [tensorexpr] Allocate intermediate buffers at compile time (#64227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64227

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30652220

Pulled By: huiguoo

fbshipit-source-id: cd75005cdfa42751318de7174b44e14a3a01634e
2021-09-08 15:34:44 -07:00
527348a6fe [tensorexpr] Add 'is_allocated' flag for buffers and use it to insert 'Alloc/Free' stmts (#64226)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64226

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30652221

Pulled By: huiguoo

fbshipit-source-id: ef9bb0e3db2c444b476e5fc23956bc34ae0f0111
2021-09-08 15:34:42 -07:00
f90153cda3 [acc_normalizer] Improve error when kwarg normalization fails (#64408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64408

att

Test Plan: NFC

Reviewed By: protonu

Differential Revision: D30716392

fbshipit-source-id: e1c3bb1afcd5363a9d502549d8a46b90226be40c
2021-09-08 15:33:32 -07:00
4533e76e7c Update breakpad to an existing commit: 7d188f6 (#64666)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/64561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64666

Reviewed By: driazati

Differential Revision: D30814127

Pulled By: hyuen

fbshipit-source-id: 511a30fc26153569b1cd39f34e4a1a6bb99cc5e4
2021-09-08 15:29:10 -07:00
149f1114fe To add Stochastic Gradient Descent to Documentation (#63805)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms and links to the originally published papers.

In this PR we are adding a description of Stochastic Gradient Descent to the documentation.

<img width="466" alt="SGDalgo" src="https://user-images.githubusercontent.com/73658284/132585881-b351a6d4-ece0-4825-b9c0-126d7303ed53.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63805

Reviewed By: albanD

Differential Revision: D30818947

Pulled By: iramazanli

fbshipit-source-id: 3812028e322c8a64f4343552b0c8c4582ea382f3
2021-09-08 15:22:30 -07:00
ff18195df9 .github: Upgrade windows CUDA 10.1 -> 10.2 (#64658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64658

We don't release 10.1 anymore so let's bump to 10.2

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D30811178

Pulled By: seemethere

fbshipit-source-id: c504ebf7f0d4c0d6229319d774f808b4ba0facd9
2021-09-08 14:43:33 -07:00
cc0565326c Add plugin for linalg norm operation (#64611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64611

Add a plugin for torch.linalg.norm. This plugin deliberately supports only norm operations that do not change the batch size, so vector inputs, or matrix inputs whose reduction dims include '0', are not supported by this plugin.
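
To illustrate the supported/unsupported split with plain `torch.linalg.norm` (the plugin itself is internal):

```python
import torch

x = torch.randn(8, 16)                   # a batch of 8 vectors
per_row = torch.linalg.norm(x, dim=1)    # batch dim preserved: expressible by the plugin
collapsed = torch.linalg.norm(x)         # also reduces over dim 0: batch size changes, unsupported
```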

Test Plan: Unit test

Reviewed By: 842974287

Differential Revision: D30525958

fbshipit-source-id: 0d66b60a390bb6235166e5a80390090d0acf691a
2021-09-08 14:33:20 -07:00
a97015f22c Revert D30735341: Migrate uses of THCReduceApplyUtils to cuda_utils::BlockReduce
Test Plan: revert-hammer

Differential Revision:
D30735341 (a5ad08ec70)

Original commit changeset: 3cb58bed8f1f

fbshipit-source-id: 874dd0f93b24a99694db42a15714834069d402bc
2021-09-08 14:27:40 -07:00
b12150608e [fx] make const fold code more pythonic (#64451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64451

No functional change.

Test Plan:
```
buck test caffe2/test:fx_const_fold
```

Reviewed By: jfix71, RoshanPAN, houseroad

Differential Revision: D30718255

fbshipit-source-id: 95f98561c7f33fcc6c839db68683c85eb152c949
2021-09-08 13:55:10 -07:00
24e1315d4b [quant] Enable jit tracing on quantizable LSTM (resubmission) (#64638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64638

The quantizable LSTM didn't support jit tracing because it had several non-traceable paths. We sacrifice some of the user experience to enable tracing.
The main UX feature removed is the user-friendly message shown when accessing the backward path of a non-bidirectional LSTM: when the bidirectional flag is False, we used to throw a nice error message when the user tried accessing the backward weights. Now the message is the default one (the properties were removed).

Test Plan: `buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm`

Reviewed By: HDCharles

Differential Revision: D30803753

fbshipit-source-id: a639955a96cee22538d9436f1c952a5d121f50f9
2021-09-08 13:34:18 -07:00
d701357d92 Factor out TensorBase that doesn't depend on native operators (#63612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63612

This makes Tensor inherit from a new class TensorBase, that provides a subset of Tensor that doesn't
directly depend on native_functions.yaml. Code that only includes TensorBase.h with thus not need to
be rebuilt every time someone changes an operator signature.

Making `Tensor` inherit from this class means that `const TensorBase&` parameters will be callable
with an ordinary `Tensor`. I've also made `Tensor` constructible and assignable from `TensorBase` to
minimize friction in code mixing the two types.

To help enforce that `Tensor.h` and `Functions.h` aren't accidentally included, I've added an error
into `Operators.h` if `TORCH_ASSERT_NO_OPERATORS` is defined. We can either set this in the build
system for certain folders, or just define it at the top of any file.

I've also included an example of manually special-casing the commonly used `contiguous` operator.
The inline function's slow path defers to `TensorBase::__dispatch_contiguous` which is defined in
`Tensor.cpp`. I've made it so `OptionalTensorRef` is constructible from `TensorBase`, so I can
materialize a `Tensor` for use in dispatch without actually increasing its refcount.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30728580

Pulled By: ezyang

fbshipit-source-id: 2cbc8eee08043382ee6904ea8e743b1286921c03
2021-09-08 13:28:54 -07:00
92318a9116 Make doc previews use its own S3 bucket (#64594)
Summary:
We had been using the gha-artifacts bucket (which previously only stored workflow artifacts) to keep the docs around. This makes it hard to see how our storage for artifacts vs docs is trending.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64594

Reviewed By: seemethere

Differential Revision: D30794328

Pulled By: driazati

fbshipit-source-id: 6b2721a3d76e8a273bde055783d56551f8409edd
2021-09-08 11:36:50 -07:00
43c0f033fc TST Adds inplace checks to module_info (#63739)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/61935

This PR adds inplace checks to `test_modules`. This version checks the constructor for `inplace` and performs the check automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63739

Reviewed By: saketh-are

Differential Revision: D30737774

Pulled By: jbschlosser

fbshipit-source-id: 8813534511e9296c8424d1ca878412726ddd4043
2021-09-08 11:08:12 -07:00
a5ad08ec70 Migrate uses of THCReduceApplyUtils to cuda_utils::BlockReduce (#64442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64442

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30735341

Pulled By: ngimel

fbshipit-source-id: 3cb58bed8f1f5aa32fd49fd37b10c8490bcc645a
2021-09-08 11:02:12 -07:00
deb9775c07 .github: Run docker containers in detach mode (#64459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64459

Should allow users to exec into the docker container if using with-ssh,
even if the build / test command has finished executing

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30742797

Pulled By: seemethere

fbshipit-source-id: 969ed8799216c6051439c7d41ab709b2d40938ac
2021-09-08 11:01:08 -07:00
18d24bb537 [NNC] Add Softplus operator (#64589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64589

Adding a softplus operator lowering to NNC and enabling elementwise fusion for it as well.

Test Plan: Added a test in test_jit_fuser.py

Reviewed By: bertmaher

Differential Revision: D30736449

fbshipit-source-id: 6c5fc3bceb5cef2322ecd4449f827e4af018ea93
2021-09-08 10:49:58 -07:00
35413a16f7 Add __matmul__ to the magic methods for FX tracing (#64512)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64483
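
A minimal sketch of what this enables (symbolically tracing a module that uses the `@` operator):

```python
import torch
from torch import fx

class MatMul(torch.nn.Module):
    def forward(self, x, y):
        return x @ y  # __matmul__ is now recorded by the tracer

gm = fx.symbolic_trace(MatMul())
print(gm.graph)  # contains a call_function node for the matmul
```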

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64512

Reviewed By: mrshenli

Differential Revision: D30797265

Pulled By: Chillee

fbshipit-source-id: 7630e048a960e0b27c4309d04d85301abe325189
2021-09-08 10:03:48 -07:00
195cb4efa8 update scatter formula (#64546)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63430

Already tested OpInfo gradient tests
544c8e6a5d/torch/testing/_internal/common_methods_invocations.py (L8575-L8577)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64546

Reviewed By: saketh-are

Differential Revision: D30768759

Pulled By: albanD

fbshipit-source-id: 27d144971c51a956a232fc7d02df5c9d2706d565
2021-09-08 10:02:35 -07:00
1409492fdb fixing trapezoid() comments for clarity (#64592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64592

cc mruberry rgommers heitorschueroff

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30785663

Pulled By: NivekT

fbshipit-source-id: e968687fbb83a59bb46ce6858c6caafa5aa04412
2021-09-08 09:45:46 -07:00
dd8f6ac597 Add forward mode differentiation for torch.linalg.cholesky and transpose (#62159)
Summary:
This PR adds forward mode differentiation for `torch.linalg.cholesky`, `torch.linalg.cholesky_ex`, and `transpose` functions.
Complex tests for Cholesky fail because, for some reason, gradcheck sends matrices full of zeros to the `cholesky_jvp` function.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry heitorschueroff walterddr IvanYashchuk xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62159

Reviewed By: mrshenli

Differential Revision: D30776829

Pulled By: albanD

fbshipit-source-id: 32e5539ed6423eed8c18cce16271330ab0ea8d5e
2021-09-08 09:44:30 -07:00
a2934b38f8 Fix typo embedding_renorm_cuda_ (#64542)
Summary:
Fixes #{issue number}

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64542

Reviewed By: mrshenli

Differential Revision: D30792842

Pulled By: ngimel

fbshipit-source-id: c9a548256d02b3ce6fb77dd9fb058084f2c91608
2021-09-08 09:36:24 -07:00
e0e832c2ba [c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts nccl communicators via `ncclCommAbort`, but one issue is that often the error can be set to `ncclSystemError` (see  https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) when that might not be the true cause of the issue; the actual issue is that some prior work timed out, the communicator was aborted on another rank, etc.

This results in a lot of confusion when debugging jobs with a large no. of processes as the current message for ncclSystemError is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix here is to pass in a string exception message from PG NCCL down to `NCCLUtils` which will aim to raise that as the actual issue and not the confusing `ncclSystemError` message.

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
2021-09-08 09:19:24 -07:00
7205ca0210 Change MaxUnpool to accept tensors with 0-dim batch sizes. (#64082)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/38115.

Changes the `MaxUnpool` module to work with batches of size 0.
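
A minimal sketch of the newly supported case (shapes chosen for illustration):

```python
import torch

unpool = torch.nn.MaxUnpool2d(kernel_size=2)
x = torch.empty(0, 3, 4, 4)                          # batch size 0
indices = torch.empty(0, 3, 4, 4, dtype=torch.int64)
out = unpool(x, indices)                             # previously raised; now returns an empty batch
assert out.shape == (0, 3, 8, 8)
```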

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64082

Reviewed By: mrshenli

Differential Revision: D30793907

Pulled By: jbschlosser

fbshipit-source-id: d21aa665be5aa18f592b39ef7b4e3cbc632e21ed
2021-09-08 08:41:09 -07:00
ba8c1fc648 Add Half conversion of bit cast for SYCL kernel (#64340)
Summary:
## Motivation
Enhance the performance of Half/float conversion in SYCL kernels.

## Solution
Add the native SYCL half type to help convert the half from/to float in the kernel code.

## Additional Context
`__SYCL_DEVICE_ONLY__` is a MACRO only valid when compiling the kernel code for SYCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64340

Reviewed By: gchanan

Differential Revision: D30720823

Pulled By: ezyang

fbshipit-source-id: e7e770d02df5b2d45da61d2fed3ba59383b3dc3a
2021-09-08 08:25:47 -07:00
7f0feafa55 [nnc] Provide helpful error messages about turning off the fuser (#64516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64516

If fuser compilation fails due to a bug (which should be highly
unlikely at this point), we want to show the user how to unblock themselves by
disabling fusion, in addition to requesting that they report a bug.
ghstack-source-id: 137398537

Test Plan: existing tests

Reviewed By: ZolotukhinM

Differential Revision: D30758051

fbshipit-source-id: 98be89f1b1d4fb3bc816f5b2634c618b9297930e
2021-09-08 08:10:22 -07:00
768014b3e6 Allow disabling cache in autocast (automatic mixed precision) (#63552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63552

In this PR, we want to exclude these two cases from the `Autocast` weight cache usage:

- Using `torch.jit.trace` under `Autocast`
As reported in https://github.com/pytorch/pytorch/issues/50231 and several other discussions, tracing with `torch.jit.trace` under `Autocast` hits Autocast's weight cache and fails, so we should disable the weight cache during tracing.
- Using `Autocast` with `Grad mode`

  - We usually use `Grad mode` for training. Since the weights change at every step during training, there is no point caching them.
  - For the recommended `Autocast` training pattern in the [doc](https://pytorch.org/docs/stable/amp.html), `Autocast` clears the cache every time it leaves the context. We should disable the cache to avoid these clear operations.
    ```
    model = Net().cuda()
    optimizer = optim.SGD(model.parameters(), ...)

    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    ```
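
A sketch of the tracing case, assuming the knob is exposed as a `cache_enabled` flag on autocast (the exact argument name is an assumption based on this PR's title, not settled API):

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
example = torch.randn(1, 4, device="cuda")

# cache_enabled=False is assumed to be the flag added by this PR.
with torch.cuda.amp.autocast(cache_enabled=False):
    traced = torch.jit.trace(model, example)  # tracing no longer hits the weight cache
```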

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30644913

Pulled By: ezyang

fbshipit-source-id: ad7bc87372e554e7aa1aa0795e9676871b3974e7
2021-09-08 07:47:18 -07:00
b616132403 Adding support for lowering 4Bit EmbeddingBag Operator (#5806)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64001

Add a 4-bit EmbeddingBag operator to acc_ops.

Test Plan: Let CI run.

Reviewed By: jfix71

Differential Revision: D30532824

fbshipit-source-id: bf476c9710477792aae202dacf64e23539c33bd9
2021-09-08 07:13:16 -07:00
2223737da9 restore test_inplace_comparison_ops_require_inputs_have_same_dtype Expected behavior (#64267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64267

This test expects every such operation to throw a runtime error.

This change also reinserts the in-place operation test and fixes a bug in the comparison operations.

fix: #64018
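
A minimal sketch of the asserted behavior, assuming a float32/float64 mismatch is representative of the cases the test covers:

```python
import torch

x = torch.rand(3)            # float32
y = torch.rand(3).double()   # float64
try:
    x.lt_(y)                 # in-place comparison with mismatched dtypes
except RuntimeError as e:
    print("raised as expected:", e)
```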

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30720915

Pulled By: ezyang

fbshipit-source-id: 215a6556d20770f70f4ced1c1f9a9753933f1d37
2021-09-08 06:42:12 -07:00
9cc44aad21 [quant] AO migration of the quantize.py (resubmission) (#64445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64445

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates the quantize.py from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.

Test Plan: `buck test mode/dev //caffe2/test:quantization`

Reviewed By: HDCharles

Differential Revision: D30734870

fbshipit-source-id: dc204f3cc46bff2cc81c95159eab9d333b43bb4b
2021-09-08 04:58:47 -07:00
72274e2a2f [TensorExpr] Don't rely on exceptions in Vectorizer. (#64609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64609

We've been using exceptions to indicate whether vectorization succeeded or not, but that posed some problems (e.g. we spent too much time symbolizing these exceptions). This change converts this mechanism to a standard error return code.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30795342

Pulled By: ZolotukhinM

fbshipit-source-id: 16e38b37bcdd78ceb438ac814cc377f35b058e17
2021-09-08 00:25:34 -07:00
2341ec9ef1 [fx_const_fold] Fix constant folding for attrs in submodule hierarchies (#64342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64342

Previously we weren't handling the case where an attribute was in a module that wasn't the root.

Test Plan: Added unit test coverage.

Reviewed By: yinghai

Differential Revision: D30691730

fbshipit-source-id: b39b5cf748c4c882f315a4f32b51ad88cc7a43ed
2021-09-07 22:44:39 -07:00
5721205417 Add __ge__ to TorchVersion (#64565)
Summary:
This PR adds a greater-or-equal comparison so that the base class's (str) comparison method is not used.
This is necessary for correct comparison with a version string.

Previously the following was the case:
```py
>>> torch.__version__
'1.10.0.dev20210830+cpu'
>>> torch.__version__>"1.9"
True
>>> torch.__version__>="1.9"
False  # Wrong output since the base class (str) was used for __ge__ comparison
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64565

Reviewed By: raghuramank100

Differential Revision: D30790463

Pulled By: mrshenli

fbshipit-source-id: 79c680f8b448001b34d3e5d5332124a78bea4e34
2021-09-07 20:16:09 -07:00
81fe2c5e49 add out variant of linear (#61801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61801

resubmitting because the last one was unrecoverable due to making changes incorrectly in the stack

Test Plan: Imported from OSS

Reviewed By: desertfire

Differential Revision: D29812510

Pulled By: makslevental

fbshipit-source-id: ba9685dc81b6699724104d5ff3211db5852370a6
2021-09-07 19:58:52 -07:00
71ba76b1b5 Fix building docs instructions (#64508)
Summary:
Fixes #64507

Removed a duplicate instruction and linted the file a bit (consistent spacing around code blocks/headers, adding code types to code blocks, removing `$` from bash code blocks when unnecessary).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64508

Reviewed By: raghuramank100

Differential Revision: D30791164

Pulled By: mrshenli

fbshipit-source-id: a00db32dcfdd1ecc194c836f31174c806062eb6d
2021-09-07 19:01:52 -07:00
4e98304eb9 Fix quicklint (#64612)
Summary:
Fixes land-race introduced by a22c936b63

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64612

Reviewed By: ngimel

Differential Revision: D30798648

Pulled By: malfet

fbshipit-source-id: ca546f68141d44493deba7bbf840e5f9662e8558
2021-09-07 18:52:22 -07:00
e777e1b01c Revert D29998114: [pytorch][PR] enable bf16 mkldnn path for gemm
Test Plan: revert-hammer

Differential Revision:
D29998114 (acc9f9afc8)

Original commit changeset: 459dc5874c63

fbshipit-source-id: 1994623a3afc22a94bd0cf5de766b023185f5238
2021-09-07 18:45:13 -07:00
1a033b45dd [JIT] Fix a bug of rejecting ops with AliasAnalysisKind::CONSERVATIVE (#64336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64336

Currently AliasDB rejects any user-defined op with `AliasAnalysisKind::CONSERVATIVE` if it does not have special treatment for alias analysis. For example, the following alias schema gets rejected:

```
  m.def(torch::schema(
      "namescope::my_op(...) -> ...",
      c10::AliasAnalysisKind::CONSERVATIVE));
```

This rejection condition is contradictory: AliasDB can handle ops with `CONSERVATIVE` in a general way, without any special casing, at https://fburl.com/diffusion/op5u72sk (calling https://fburl.com/diffusion/h3aws5dd), which is exactly the appropriate conservative treatment for alias analysis.

This change corrects the rejection condition so that it fires for ops that *do* have special casing but are also marked `CONSERVATIVE`, since the two cannot be used simultaneously.

Test Plan:
Confirmed that
```
  m.def(torch::schema(
      "namescope::my_op(...) -> ...",
      c10::AliasAnalysisKind::CONSERVATIVE));
```
gets accepted and `my_op`'s all inputs and outputs are put to point to wildcard(*) by AliasDB.

Reviewed By: eellison

Differential Revision: D30690121

fbshipit-source-id: 431cc1a84edd5227f52b44a0fd85d5eb16f3c288
2021-09-07 18:26:31 -07:00
8e1fdd4cd3 Add symbolic shape comparison optimization (#64300)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64300

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738146

Pulled By: eellison

fbshipit-source-id: 96287798535b367f23d3e9430d70fc02c59744ab
2021-09-07 18:22:32 -07:00
474a51b6bf Refactor to use shape arguments (#64299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64299

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738141

Pulled By: eellison

fbshipit-source-id: 37ca30de81349ecf23d8656291863737b6ad6d96
2021-09-07 18:22:30 -07:00
bccbe310ef Add view with negative dim (#63516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63516

How to review: pretty much just check that the generated inputs are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738143

Pulled By: eellison

fbshipit-source-id: c7cd01cb2c8a13cb2664415f3d98aedec19a8e07
2021-09-07 18:22:28 -07:00
5a1f8b8573 Generalize expand logic (#63615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63615

How to review: pretty much just check that the generated inputs are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738148

Pulled By: eellison

fbshipit-source-id: 4ef74a9c9b39c0beb73949e63aa844c46ab637eb
2021-09-07 18:22:26 -07:00
5eb8cec663 Add permute, arange (#63407)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63407

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738149

Pulled By: eellison

fbshipit-source-id: 36d572488408d38b0643aa93cb08aab5c45218ad
2021-09-07 18:22:24 -07:00
cf2d15bf84 Add support for slice, select with int, index_select (#63365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63365

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30738144

Pulled By: eellison

fbshipit-source-id: 7e0c572209bdc6e62ecb4fd1f06f80291de69803
2021-09-07 18:22:22 -07:00
c8a608b197 Add squeeze, unsqueeze, transpose shape functins (#63099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63099

These are checked by OpInfos, which represent all of the inputs and semantics of the operators, so it should be an easy stamp.

Test Plan: Imported from OSS

Reviewed By: desertfire, astaff

Differential Revision: D30347514

Pulled By: eellison

fbshipit-source-id: 37b4c9ecd8c222cc12bf39166181464b43218830
2021-09-07 18:22:19 -07:00
a39f3c68b7 Add batch of unary functions (#63050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63050

Test Plan: Imported from OSS

Reviewed By: priyaramani, astaff

Differential Revision: D30347513

Pulled By: eellison

fbshipit-source-id: abaf641778671d17df87a2b7b47bad7501a91b5a
2021-09-07 18:21:04 -07:00
c1b701bc3e Back out "update rpc tensorpipe logic for sparse tensors" (#64575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64575

Original commit changeset: daee9a567645

Test Plan: unit test

Reviewed By: gcramer23

Differential Revision: D30778736

fbshipit-source-id: 8d9386158fb6a3d025c149cdc37558d57c615e9f
2021-09-07 18:00:39 -07:00
566ee1217f Use trsm for triangular_solve in CPU (#63567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63567

The current implementation called trtrs for CPU and trsm for CUDA.
See https://github.com/pytorch/pytorch/issues/56326#issuecomment-825496115 for a discussion on the differences between
these two functions and why we prefer trsm vs trtrs on CUDA.

This PR also exposes the `side` argument of this function, which is used
in the second PR of this stack to optimize the number of copies one needs to make
when preparing the arguments to be sent to the backends.

It also replaces the use of `bool`s with a common enum type representing
whether a matrix is transposed / conj-transposed, etc. This makes the API
consistent: before, the behaviour of these functions with both `transpose=True`
and `conjugate_transpose=True` was not well defined.
Functions to transform this type into the specific types / chars for the different
libraries are provided under the names `to_blas`, `to_lapack`, `to_magma`, etc.

This is the first of a stack of PRs that aim to improve the performance of
`linalg.solve_triangular`. `trsm` has an extra parameter (`side`), which allows
eliding the copy of the triangular matrix in many cases.

Fixes https://github.com/pytorch/pytorch/issues/56326
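
A quick sketch of the op this stack targets, using `torch.triangular_solve` (the op that exists at this point; `linalg.solve_triangular` lands later in the stack):

```python
import torch

A = torch.randn(3, 3).tril() + 3 * torch.eye(3)    # well-conditioned lower-triangular matrix
b = torch.randn(3, 2)
x, _ = torch.triangular_solve(b, A, upper=False)   # on CPU this now goes through BLAS trsm
assert torch.allclose(A @ x, b, atol=1e-5)
```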

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30566479

Pulled By: mruberry

fbshipit-source-id: 3831af9b51e09fbfe272c17c88c21ecf45413212
2021-09-07 17:26:17 -07:00
52ff9bc639 [iOS][Metal] Add aten:hardswish (#64588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64588

Add `aten::hardswish` to run the mobilenetv3 model from torchvision.
ghstack-source-id: 137479323

Test Plan:
- buck test pp-macos
- circleCI

Reviewed By: beback4u

Differential Revision: D30781008

fbshipit-source-id: 83454869195ef4ab50570ea9b3bf2a55f32a3e86
2021-09-07 15:41:29 -07:00
2c351c76e0 [special] Alias igamma, igammac to special.gammaninc, special.gammaincc (#61902)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Also added a relevant OpInfo. A short usage sketch follows the checklist below.

TODO:
* [x] Check rendered docs gammainc : https://docs-preview.pytorch.org/61902/special.html#torch.special.gammainc
* [x] Check rendered docs gammaincc: https://docs-preview.pytorch.org/61902/special.html#torch.special.gammaincc
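
A short usage sketch of the new aliases (regularized incomplete gamma functions; the two halves sum to 1):

```python
import torch

a = torch.tensor([1.0, 2.0])
x = torch.tensor([0.5, 1.5])
lower = torch.special.gammainc(a, x)    # alias of torch.igamma
upper = torch.special.gammaincc(a, x)   # alias of torch.igammac
assert torch.allclose(lower + upper, torch.ones_like(lower))
```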

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61902

Reviewed By: ngimel

Differential Revision: D30761428

Pulled By: mruberry

fbshipit-source-id: 06a16432873357958d53364f12a4e91c29779d26
2021-09-07 15:31:26 -07:00
b01d2d1d3e Disables four failing distributions tests on windows (#64596)
Summary:
Per title. Unblocks CI. See https://github.com/pytorch/pytorch/issues/64595.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64596

Reviewed By: mrshenli

Differential Revision: D30787296

Pulled By: mruberry

fbshipit-source-id: 84b90cb25c0185f1851db02425ea40aa13d3e598
2021-09-07 15:29:13 -07:00
a22c936b63 Add lint to ensure .github/ pypi dependencies are pinned (#64463)
Summary:
Example failing run: https://github.com/pytorch/pytorch/pull/64463/checks?check_run_id=3501249102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64463

Reviewed By: janeyx99

Differential Revision: D30744930

Pulled By: driazati

fbshipit-source-id: 4dd97054db1d4c776a4512bc3d664987cd7b6d23
2021-09-07 15:28:11 -07:00
7e88d0b370 Update explicit_ci_jobs to work with GHA (#64598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64598

This adds a filter option rather than an all-or-nothing so it's easier to iterate on a specific job.

```bash
python tools/testing/explicit_ci_jobs.py --filter-gha '*generated-linux-*gcc5.4*'
```

See #64600 for an example usage

NB: If you regenerate the workflows you will need to re-run that command to re-delete everything.

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D30788850

Pulled By: driazati

fbshipit-source-id: a32c266bbd876c396665bceef9a0a961b4586564
2021-09-07 15:21:12 -07:00
a48d83a575 Move ParallelTBB to GHA (take 2) (#64193)
Summary:
2nd attempt to do the same
Skip failing `TestTensorCreationCPU.test_trilu_indices_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64193

Reviewed By: mrshenli

Differential Revision: D30779469

Pulled By: malfet

fbshipit-source-id: 5c51fcbb383d0823d0e953d7af181b5f22eda9ab
2021-09-07 15:11:00 -07:00
369db8924f [Static Runtime] Add first iter metric (#64457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64457

The first iteration is special since it initializes the memory planner. This change logs and reports first iteration time during benchmarking. It also generates a FAI-PEP output when `generate_ai_pep_output` is set.

Test Plan:
Run any benchmark, and observe:
```
I0902 15:19:32.528977 2492358 impl.cpp:948] PyTorchObserver {"value":6.415958881378174,"unit":"ms","metric":"latency","type":"static_runtime_first_iter"}
...
First iter time: 6.41596 ms
```

Note that this metric is likely to have significantly more noise than the others since we don't have as many data points.

Unit tests: `buck test //caffe2/test:static_runtime`

Reviewed By: d1jang

Differential Revision: D30740619

fbshipit-source-id: 4dcfccd5629f4fa34254fd355073ef19e151245a
2021-09-07 15:00:30 -07:00
3bd69d3020 add bundle input into AIBench (#64557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64557

MaskRCNN speed depends on how many people are detected in the detection stage, so a random input from the dataloader does not give standardized measurements. To standardize the benchmarking, we use two standard images, with two and three people respectively.

Test Plan: AIBench result: https://www.internalfb.com/intern/aibench/details/945883114818980

Reviewed By: axitkhurana

Differential Revision: D30446049

fbshipit-source-id: a2826fdb69e9f840c0afc566c4cbbcde1c2fba89
2021-09-07 14:46:23 -07:00
3c87f55752 Automated submodule update: FBGEMM (#64582)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 3ce04fc664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64582

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D30779695

fbshipit-source-id: 22460a4047e2462e672eb4931e44648ae6bde627
2021-09-07 14:16:22 -07:00
acc9f9afc8 enable bf16 mkldnn path for gemm (#61891)
Summary:
# Goal: Integrate mkldnn bf16 gemm into PyTorch

## BF16 support for mm, addmm, bmm, addbmm, baddbmm, mv, addmv, dot (with the mkldnn matmul primitive):
https://oneapi-src.github.io/oneDNN/group__dnnl__api__matmul.html
For gemm-related ops, we keep all inputs in plain format, so we do not introduce opaque tensors for these ops, saving memory copies.

![mkldnn bf16 gemm integration](https://user-images.githubusercontent.com/54701539/126263077-4b5134e1-52a7-4fad-94fb-19e13a0377f6.png)

The minimal integration would dispatch to mkldnn only in addmm, but for gemm with 3-D input (with an additional "batch" dim) this would call mkldnn gemm "batch" times. Since mkldnn matmul supports inputs with multiple dims, we directly dispatch to mkldnn gemm in {bmm, addbmm, baddbmm} to reduce the time spent creating mkldnn memory descriptors, primitives, etc.

To handle the different definitions of "bias" between mkldnn (which must have shape (1, N)) and PyTorch (which can have the same shape as the GEMM result, (M, N)), we use a fused sum.

## User Case:
The user-facing code is exactly the same as before because no opaque tensor is introduced. Since PyTorch already supports the bf16 data type for CPU tensors, we can leverage the existing bf16 GEMM unit tests.
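
A minimal usage sketch (not part of the PR): bf16 CPU GEMM calls that this integration can route to the oneDNN matmul primitive.

```python
import torch

M, K, N = 128, 128, 128
a = torch.randn(M, K, dtype=torch.bfloat16)
b = torch.randn(K, N, dtype=torch.bfloat16)
bias = torch.randn(M, N, dtype=torch.bfloat16)

out = torch.addmm(bias, a, b)                            # (M, N) bias handled via the fused sum
out_batched = torch.bmm(a.unsqueeze(0), b.unsqueeze(0))  # batched path dispatches directly to mkldnn
```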

## Gemm performance gain on CPX 28Cores/Socket:
Note: data is collected using the PyTorch operator benchmarks: https://github.com/pytorch/pytorch/tree/master/benchmarks/operator_benchmark (with the bfloat16 dtype added)

### use 1 thread on 1 core
### torch.addmm (M, N) * (N, K) + (M, K)
| impl |16x16x16|32x32x32| 64x64x64 | 128x128x128| 256x256x256| 512x512x512|1024x1024x1024|
|:---:|:---:| :---: | :---: | :---: | :---: | :---: | :---: |
| aten-fp32| 4.115us|4.583us|8.230us|26.972us|211.857us|1.458ms|11.258ms|
| aten-bf16 | 15.812us| 105.087us|801.787us|3.767ms|20.274ms|122.440ms|836.453ms|
| mkldnn-bf16 |20.561us |22.510us|24.551us|37.709us|143.571us|0.835ms|5.76ms|

We can see that mkldnn-bf16 is better than aten-bf16, but for smaller shapes mkldnn-bf16 is not better than aten-fp32. This is due to oneDNN overhead, which behaves like a constant cost and can be ignored as problems get larger. We are also continuing to optimize kernel efficiency and to decrease this overhead.

More shapes
| impl |1x2048x2048|2048x1x2048| 2048x2048x1 |
|:---:|:---:| :---: | :---: |
| aten-fp32| 0.640ms|3.794ms|0.641ms|
| aten-bf16 | 2.924ms| 3.868ms|23.413ms|
| mkldnn-bf16 |0.335ms |4.490ms|0.368ms|

### use 1 socket (28 threads, 28 cores)
| impl | 256x256x256| 512x512x512|1024x1024x1024| 2048x2048x2048|4096x4096x4096|
|:---:| :---: | :---: | :---: | :---: | :---: |
| aten-fp32| 35.943us |140.315us|643.510us|5.827ms|41.761ms|
| mkldnn-bf16 |53.432us|114.716us|421.858us|2.863ms|23.029ms|

More shapes
| impl |128x2048x2048|2048x128x2048| 2048x2048x128 |
|:---:|:---:| :---: | :---: |
| aten-fp32| 0.561ms|0.458ms|0.406ms|
| mkldnn-bf16 |0.369ms |0.331ms|0.239ms|

We do not show aten-bf16 for this case since aten-bf16 always computes on a single thread and its performance is extremely poor. The trend for this case is similar to 1 thread on 1 core.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61891

Reviewed By: iramazanli

Differential Revision: D29998114

Pulled By: VitalyFedyunin

fbshipit-source-id: 459dc5874c638d62f290c96684ca0a694ded4b5a
2021-09-07 13:00:37 -07:00
337c71be05 Array API: Add torch.linalg.matmul alias to torch.matmul (#63227)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62811

Add `torch.linalg.matmul` alias to `torch.matmul`. Note that the `linalg.matmul` doesn't have a `method` variant.
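
A quick sanity check of the alias (a usage sketch, not from the PR):

```python
import torch

a, b = torch.randn(2, 3), torch.randn(3, 4)
assert torch.allclose(torch.linalg.matmul(a, b), torch.matmul(a, b))
# torch.matmul has a Tensor method (a.matmul(b)); the alias does not add one.
```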

Also cleaning up `torch/_torch_docs.py` when formatting is not needed.

cc IvanYashchuk Lezcano mruberry rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63227

Reviewed By: mrshenli

Differential Revision: D30770235

Pulled By: mruberry

fbshipit-source-id: bfba77dfcbb61fcd44f22ba41bd8d84c21132403
2021-09-07 12:35:32 -07:00
8407ce7e38 [small BE] .github: refactor concurrency into a common macro (#64587)
Summary:
By using a macro for these concurrency groups, we can edit just one place for the Linux and Windows workflows (instead of two).

I wanted to loop all the other workflow files in as well, but since those aren't generated, the macros won't work the same way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64587

Reviewed By: mrshenli

Differential Revision: D30783224

Pulled By: janeyx99

fbshipit-source-id: ae16ebb12d2d63a563d28f0ce88e280f68ed4b9b
2021-09-07 12:31:55 -07:00
7e4ebe06ca Fixes issue related torch.trapezoid broadcasting behavior and documentation (#64054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64054

Fixes #63608

cc mruberry rgommers heitorschueroff

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30617078

Pulled By: NivekT

fbshipit-source-id: 815896ec56d447562790df4d662e94fd13457e2a
2021-09-07 11:41:55 -07:00
c9d6ca4c54 Add space in Feature Request issue template (#64563)
Summary:
Add space between emoji and text in Feature Request issue template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64563

Reviewed By: janeyx99

Differential Revision: D30779429

Pulled By: seemethere

fbshipit-source-id: 3625299923a7022fa66473633524a6620d58188b
2021-09-07 11:36:06 -07:00
85eeb4d682 Clean up op BC check list (#64584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64584

It has been a while since the last cleanup. The list is really long.

Test Plan: ci

Reviewed By: hl475

Differential Revision: D30779350

fbshipit-source-id: 908b47d0b9a16b784aad6a34c5c87f923500c247
2021-09-07 11:25:40 -07:00
43248d9112 [doc][hackathon] To add Adam Optimizer to the documentation (#63251)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch Core documentation may result in a nice optimization research tutorial. In the following tracking issue we listed all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we add a description of the Adam algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1412.6980

<img width="442" alt="Screen Shot 2021-08-27 at 6 37 54 PM" src="https://user-images.githubusercontent.com/73658284/131195297-35fce613-3691-4fed-b42d-db234d4fcd7c.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63251

Reviewed By: albanD

Differential Revision: D30779163

Pulled By: iramazanli

fbshipit-source-id: 319a80fc3952793b0d064d0e641ddc1de3c05a86
2021-09-07 11:03:35 -07:00
adb85b32d3 minor fix for elastic doc (#64531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64531

fix #64530

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D30760879

fbshipit-source-id: 94ed1476e886513427d928a36f5be6b9bfff0826
2021-09-07 09:31:01 -07:00
26b7ff5aea deprecate dtype getters from torch.testing namespace (#63554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554

Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reasons for this are twofold:

1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example, `torch.testing.floating_types()` will only give you `float32` and `float64`, skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.

We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: by keeping it, either the namespace gets messy again after a new dtype is added, or we need to somehow version the return values of the getters.
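
A minimal sketch of emulating one of the deprecated getters downstream; per point 1 above, `floating_types()` mirrors the C++ dispatch macro and covers only `float32`/`float64`:

```python
import torch

def floating_types():
    return (torch.float32, torch.float64)

assert torch.float16 not in floating_types()
```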

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30662206

Pulled By: mruberry

fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
2021-09-07 08:58:51 -07:00
f767cf6683 To change WarmUp Scheduler with ConstantLR and LinearLR (#64395)
Summary:
Partially unblocks https://github.com/pytorch/vision/issues/4281

Previously we added WarmUp schedulers to PyTorch Core in PR https://github.com/pytorch/pytorch/pull/60836, which had two modes of execution, linear and constant, depending on the warm-up function.

In this PR we change this interface to a more direct form, separating the linear and constant modes into separate schedulers. In particular,

```Python
scheduler1 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")
scheduler2 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="linear")
```

will look like

```Python
scheduler1 = ConstantLR(optimizer, warmup_factor=0.1, warmup_iters=5)
scheduler2 = LinearLR(optimizer, warmup_factor=0.1, warmup_iters=5)
```

correspondingly.
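
A self-contained usage sketch of the new interface (kwarg names follow this PR's description and may differ in later releases):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer, warmup_factor=0.1, warmup_iters=5)

for epoch in range(10):
    optimizer.step()   # training step elided
    scheduler.step()   # lr is 0.1 for the first 5 epochs, then back to 1.0
```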

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64395

Reviewed By: datumbox

Differential Revision: D30753688

Pulled By: iramazanli

fbshipit-source-id: e47f86d12033f80982ddf1faf5b46873adb4f324
2021-09-07 08:42:31 -07:00
75b9e4a128 [JIT] Freeze unrolls constant loops (#63614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63614

There are a number of optimizations (`RemoveListMutation` in particular) that are tied to loop unrolling in `runOptimizations`. However, these were not invoked from `freeze_module` since the freezing pass should be idempotent.

This diff makes `runOptimizations` run `UnrollConstantLoops` instead of `UnrollLoops`. `freeze_module` is then able to run these optimizations.

Test Plan: Observed that `freeze_module` applies `RemoveListMutation`

Reviewed By: eellison

Differential Revision: D30437356

fbshipit-source-id: cba04bd958a48ad51b151aa3264f3d5bbb1fc2a4
2021-09-07 08:06:47 -07:00
adbcc819cd Fix fx2trt SplitterBase non_tensor_input logic (#64286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64286

During graph splitting, `_SplitterBase` supports taking into consideration whether the subnet boundary nodes
produces "supported" outputs that will cross the acc/non-acc boundary. Specifically, if the backend only
supports Tensor-based data passing cross boundary, then we cannot split the graph at a place where the node
output is a non-Tensor type (e.g., `Tuple[Tensor]`).

There's currently a bug in this logic: it does not correctly detect the output type of a Node. Instead of
using `Node.meta['tensor_meta']`, we should instead check `Node.meta['type']`.

`Node.meta['tensor_meta']` is not appropriate because this key will exist if the node output is an iterable
and one of the elements is of type `Tensor`. So `Tuple[Tensor]` will be wrongly considered "supported".
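
A hedged sketch of the corrected check (the helper name is hypothetical; `Node` is from `torch.fx`):

```python
import torch
from torch.fx import Node

def output_is_tensor(node: Node) -> bool:
    # 'tensor_meta' may be present even for Tuple[Tensor] outputs,
    # so rely on the recorded Python type instead.
    return node.meta.get('type', None) is torch.Tensor
```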

Test Plan:
arc lint
run CI tests

Reviewed By: yinghai, 842974287

Differential Revision: D30617147

fbshipit-source-id: e8ba70dfaddc05cafb8037d58fca73b7ccbb1a49
2021-09-07 04:02:29 -07:00
32fbeb170d Update error messages that use LAPACK error codes (#63864)
Summary:
This PR updates the `batchCheckErrors` and `singleCheckErrors` functions so that the error messages are defined only once.
`batchCheckErrors` function reuses `singleCheckErrors` now.

Fixes https://github.com/pytorch/pytorch/issues/63220, fixes https://github.com/pytorch/pytorch/issues/59779

cc jianyuh nikitaved pearu mruberry heitorschueroff walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63864

Reviewed By: ngimel

Differential Revision: D30672933

Pulled By: mruberry

fbshipit-source-id: 0ba37ff98ef278efdb12c3890aa07d687047da7a
2021-09-07 00:05:46 -07:00
1a1fb31cfa Support torch.concat alias, add cat OpInfo & remove OpInfo test_out skips {cat, stack, hstack, vstack, dstack} (#62560)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61767

## Changes

- [x] Add `torch.concat` alias to `torch.cat`
- [x] Add OpInfo for `cat`/`concat`
- [x] Fix `test_out` skips (Use `at::native::resize_output` or `at::native::resize_output_check`)
  - [x] `cat`/`concat`
  - [x] `stack`
  - [x] `hstack`
  - [x] `dstack`
  - [x] `vstack`/`row_stack`
- [x] Remove redundant tests for `cat`/`stack`

~I've not added `cat`/`concat` to OpInfo `op_db` yet, since cat is a little more tricky than other OpInfos (should have a lot of tests) and currently there are no OpInfos for that. I can try to add that in a subsequent PR or maybe here itself, whatever is suggested.~
**Edit**: cat/concat OpInfo has been added.

**Note**: I've added the named tensor support for `concat` alias as well, maybe that's out of spec in `array-api` but it is still useful for consistency in PyTorch.
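
A quick usage sketch of the new alias:

```python
import torch

xs = [torch.randn(2, 3), torch.randn(2, 3)]
assert torch.equal(torch.concat(xs, dim=0), torch.cat(xs, dim=0))
```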

Thanks to krshrimali for guidance on my first PR :))

cc mruberry rgommers pmeier asmeurer leofang AnirudhDagar asi1024 emcastillo kmaehashi heitorschueroff krshrimali

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62560

Reviewed By: saketh-are

Differential Revision: D30762069

Pulled By: mruberry

fbshipit-source-id: 6985159d1d9756238890488a0ab3ae7699d94337
2021-09-06 23:57:18 -07:00
0a1aaff0de Remove dead code from THC (THCApply.cuh) (#64559)
Summary:
cc peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64559

Reviewed By: mruberry

Differential Revision: D30769526

Pulled By: ngimel

fbshipit-source-id: 034a5c778a2b902cffa57b76511fa0dcdea26825
2021-09-06 21:26:08 -07:00
571a2becf3 Move ParallelNative and PureTorch to GHA (#64452)
Summary:
The ParallelTBB move was separated out into https://github.com/pytorch/pytorch/pull/64193 as it requires some further investigation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64452

Reviewed By: seemethere, janeyx99

Differential Revision: D30738337

Pulled By: malfet

fbshipit-source-id: 81c46423e903058bd1a3e8553e8a10ce978eeefd
2021-09-06 11:40:44 -07:00
544c8e6a5d Mark functions in backend header as inline to suppress warning (#64098)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64098

Reviewed By: kimishpatel, iseeyuan

Differential Revision: D30593104

fbshipit-source-id: 328196b9bc4a89a28ad89bede7e337107976c303
2021-09-05 16:45:23 -07:00
bcc7e82371 Revert D30745610: [nnc] Make our exceptions c10::Errors, get C++ stacktraces
Test Plan: revert-hammer

Differential Revision:
D30745610 (18b2751ea1)

Original commit changeset: a1cfaa7364ef

fbshipit-source-id: 9b716053b96a65745240ddef1c456c44d5d09671
2021-09-05 16:08:09 -07:00
49fe829cae [Vulkan] Code Quality: Remove duplicate code for hardshrink and leaky_relu functions (#64405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64405

Code quality improvement: removed duplicate code for hardshrink and leaky_relu functions.
ghstack-source-id: 137319378

Test Plan:
```
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Reviewed By: SS-JIA

Differential Revision: D30690251

fbshipit-source-id: 5729d1f32946e42f41df77756a8313f297dd822f
2021-09-05 12:53:58 -07:00
1901c675e1 Back out "nn.functional.linear OpInfo" (#64517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64517

Original commit changeset: ca41dbd98176

Test Plan: PyTorch CI

Reviewed By: ngimel

Differential Revision: D30758201

fbshipit-source-id: 2d3274293d340373b8af86083336607818019619
2021-09-05 02:25:00 -07:00
008bf6689b Back out "D30740897 Add fusion enabled apis" (#64500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64500

D30740897 (39aeb3bf63) broke caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage (https://fburl.com/test/mb46jxon) and blocked training_platform_unit_tests

{F660271297}

Multisect results confirm

```
multisect --config FBCODE_TEST bisect 844424966128796 --workers 16 revisions --begin 09629edc --end fc86b434
D30740897 (39aeb3bf63)

````

{F660271232}

Test Plan:
```
buck test mode/opt //caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4785074671474181
    ✓ Pass: caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage - main (3.729)
Summary
  Pass: 1

```

Differential Revision: D30753916

fbshipit-source-id: 302fd4113ef1f3069846be03edc2300d82b66719
2021-09-04 20:55:58 -07:00
18b2751ea1 [nnc] Make our exceptions c10::Errors, get C++ stacktraces (#64332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64332

With this diff, if a compiler bug occurs (unlikely, I know!) we'll be able to get a c++ stacktrace leading to the exception, rather than just a terse message.  E.g.,
```
RuntimeError: UNSUPPORTED DTYPE
Exception raised from compilation_error at ../torch/csrc/jit/tensorexpr/exceptions.h:32 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f966659b2eb in /fsx/users/bertrand/c\
onda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x376f099 (0x7f966a195099 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x3763bf5 (0x7f966a189bf5 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0xdd8 (0x7f966a193368 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda\
.so)
```

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D30745610

Pulled By: bertmaher

fbshipit-source-id: a1cfaa7364ef4120de834e9cbe57ced1d082ab4e
2021-09-04 20:31:54 -07:00
6cac7ca980 Ensure num_threads is initialized in get_num_threads (#64486)
Summary:
Possible source of the recent layernorm CI failures. `lazy_init_num_threads` appears at the top of `parallel_for` and can change the number of threads set. So, we need to ensure `num_threads` is initialized during `get_num_threads` calls as well. It's already done this way for OpenMP, but is missing from other parallel backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64486

Reviewed By: mruberry

Differential Revision: D30752615

Pulled By: ngimel

fbshipit-source-id: 085873ce312edbee1254c0aaae30dec7fcfe2c57
2021-09-04 12:38:09 -07:00
604e885925 Automated submodule update: FBGEMM (#64338)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9ccb2714a9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64338

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30690319

fbshipit-source-id: 884d1f950cd1f7d2a77b79affb9215f285d5d0da
2021-09-04 00:44:28 -07:00
a91a278d60 Fix copy_transpose_valid condition for copy_same_type_transpose_ (#64425)
Summary:
Thanks to ngimel for the hint where the problem might be (https://github.com/pytorch/pytorch/issues/64358#issuecomment-910868849)!

I added a test that fails on master to verify the fix. The shape `(60, 60)` was chosen because of `MIN_SZ = 60 * 60` in `copy_transpose_valid`.
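
An illustrative sketch of the path under test (the exact failing configuration is in the linked issue):

```python
import torch

# Shape (60, 60) matches MIN_SZ = 60 * 60, the threshold in copy_transpose_valid.
src = torch.randn(60, 60).t()   # non-contiguous, transposed source
dst = torch.empty(60, 60)
dst.copy_(src)                  # may take the fast transpose-copy path
assert torch.equal(dst, src)
```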

Fixes https://github.com/pytorch/pytorch/issues/64358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64425

Reviewed By: mruberry

Differential Revision: D30752725

Pulled By: ngimel

fbshipit-source-id: f40370ea8365c94e30f8e8a3dcab5f3b3462464a
2021-09-03 18:50:33 -07:00
e4ff14ad59 [CUDA graphs] Error if attempting to capture uncapturable nccl (#64440)
Summary:
NCCL < 2.9.6 is not capturable. Attempting to capture it can cause nasty behavior (for example, I've seen capture succeed but replay silently hang). PyTorch should preempt this with a friendlier error.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64440

Reviewed By: mruberry

Differential Revision: D30733884

Pulled By: ngimel

fbshipit-source-id: 5f2df3cf5cc0e5e68f49bf22a80d9f58064dc7ec
2021-09-03 13:23:07 -07:00
0e3b45eaef Fix logical typo in _compare_trilu_indices (#64468)
Summary:
I'm pretty sure that repeating the same call twice is meaningless; the intent was to call `tril`/`tril_indices` in the first case and `triu`/`triu_indices` in the other.
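
The two calls the test intended to compare do produce different results, so comparing either one against itself was vacuous:

```python
import torch

print(torch.tril_indices(3, 3))  # coordinates of the lower triangle
print(torch.triu_indices(3, 3))  # coordinates of the upper triangle
```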

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64468

Reviewed By: mruberry

Differential Revision: D30744978

Pulled By: malfet

fbshipit-source-id: 7cd36789a7ebf1cc263fb2d875e479c05e7588a4
2021-09-03 10:22:49 -07:00
6831d8e379 Support Union in TorchScript (#64234)
Summary:
This PR is created to replace the https://github.com/pytorch/pytorch/pull/53180 PR stack, which has all the review discussions. The replacement is needed due to a messy Sandcastle issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64234

Reviewed By: gmagogsfm

Differential Revision: D30656444

Pulled By: ansley

fbshipit-source-id: 77536c8bcc88162e2c72636026ca3c16891d669a
2021-09-03 06:12:24 -07:00
91b926fab3 Add fx2trt pass for removing duplicate output args (#64461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64461

Fx2TRT does not support duplicate nodes in the output args tuple.

This pass removes duplicate output args from the target subnets and fixes their uses in the top level module where the subnets are called. This pass must be called after acc split on the top-level net and subsequent calls to the acc trace on the subnets.

This pass will change both the subnets and top level module.

Test Plan:
Run:

```
buck run mode/opt -c python.package_style=inplace //caffe2/torch/fb/fx2trt/tests/passes/:test_remove_duplicate_output_args

```

Reviewed By: yinghai

Differential Revision: D30740499

fbshipit-source-id: 98459f7677980b21c7bffda918158001285572db
2021-09-02 23:04:12 -07:00
39aeb3bf63 Add fusion enabled apis (#64429)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64429

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30740897

Pulled By: eellison

fbshipit-source-id: 446aa63b5d763f1cfffea62547db7294368e3438
2021-09-02 22:19:09 -07:00
7031fbdc63 update optimize_for_inference docs (#64428)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64428

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30740898

Pulled By: eellison

fbshipit-source-id: b94d2c3deb661a6ba048f19e8c1d5e1799667eeb
2021-09-02 22:17:58 -07:00
e1c3e5f830 [resubmit][FX] Prototype for guarding against mutable operations in tracing (#64467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64467

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30744870

Pulled By: jamesr66a

fbshipit-source-id: fc652f8b17748f90dbeb83fabf3bd5bb57d6ff1a
2021-09-02 21:13:21 -07:00
cd82bc1af9 Skips layer norm OpInfo on tbb platform (#64469)
Summary:
The OpInfo tests appear to be discovering a layer norm x tbb issue that requires investigation. Skipping tests on that platform for now to restore CI signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64469

Reviewed By: ngimel

Differential Revision: D30745746

Pulled By: mruberry

fbshipit-source-id: 282484cc00b867fac85b7df61430d64277da6421
2021-09-02 20:53:01 -07:00
c19bd05e84 THC: Cleanup dead code (#64441)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64441

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30735342

Pulled By: ngimel

fbshipit-source-id: 84ab36f7aec6b8cd7f1f34c19a58a382c06ad68d
2021-09-02 17:45:16 -07:00
db692ec0b3 Regenerate generated github workflows (#64465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64465

These were out of date and causing master failures

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D30744594

Pulled By: driazati

fbshipit-source-id: 09a21c3c5d9bc83b368d66cabbafd1ba83302dd3
2021-09-02 17:31:29 -07:00
e161872aab Revert D30732630: [quant] Enable jit tracing on quantizable LSTM
Test Plan: revert-hammer

Differential Revision:
D30732630 (116142143c)

Original commit changeset: 443e351ebb0e

fbshipit-source-id: 49001392f01366f3b1ccc31139f824c80b86cd40
2021-09-02 17:08:26 -07:00
046ed57a4d Revert D30055886: [quant] AO migration of the quantize.py
Test Plan: revert-hammer

Differential Revision:
D30055886 (44e3ed88c9)

Original commit changeset: 8ef7470f9fa6

fbshipit-source-id: c5bd3ead43a2d44b9e56872ec5bd7a195bdac725
2021-09-02 16:59:59 -07:00
4968d0b34f [POC] .github: Add event name to concurrency (#64402)
Summary:
This would ensure that manually/API triggered workflows would not cancel other triggered workflows. For example, the manually triggered periodic 11.1 linux job cancelled the scheduled one here, which we may not want:
![image](https://user-images.githubusercontent.com/31798555/131752175-1c99d56e-d344-46e1-b8ac-9c12bba0569a.png).

This would be helpful later as we use more dispatched workflows (e.g., for bisect functionality)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64402

Reviewed By: malfet

Differential Revision: D30734860

Pulled By: janeyx99

fbshipit-source-id: 220016716094666e9af836fcd716dd529cf23d8a
2021-09-02 16:24:05 -07:00
b12f34e8c2 update rpc tensorpipe logic for sparse tensors (#62960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62960

A bug was filed a few years ago about sending sparse tensors over RPC: #30807.

This PR updates the rpc/tensorpipe logic for CUDA sparse tensors. During serialization, the pickler.cpp implementation breaks a sparse tensor down into two tensors plus metadata. torch/csrc/distributed/rpc/tensorpipe_agent.cpp needs to be updated because it has no logic for sparse tensors: it pushes a single device for a sparse tensor, which is wrong because after serialization there are two tensors, and the second tensor would be left without a device and end up on the wrong target device. tensorpipe_utils.cpp needs to be updated because deserialization happens after the data is received on the target pipe; it takes the two tensors and metadata sent and rebuilds the sparse tensor. Since there are two tpDescriptors but only one tensor after deserialization, the logic is updated to verify that the sparse tensor is on the correct device using the first tpDescriptor.

This PR also updates ivalue.cpp and ivalue.h to support more paths for sparse COO tensors.

I tested these changes by adding sparse tests to rpc_test.py and dist_autograd_test.py.
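
A hedged usage sketch of what this enables (assumes an RPC agent is already initialized via `rpc.init_rpc` and that a peer named "worker1" exists):

```python
import torch
import torch.distributed.rpc as rpc

indices = torch.tensor([[0, 1], [2, 0]])
values = torch.tensor([3.0, 4.0])
s = torch.sparse_coo_tensor(indices, values, (3, 3))

# Sending a sparse COO tensor over RPC now serializes correctly.
result = rpc.rpc_sync("worker1", torch.add, args=(s, s))
```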

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30717285

Pulled By: gcramer23

fbshipit-source-id: daee9a56764550f56b131f9dd8e74e23113d6714
2021-09-02 16:16:19 -07:00
32a93c2424 Revert D30675780: [FX] Prototype for guarding against mutable operations in tracing
Test Plan: revert-hammer

Differential Revision:
D30675780 (795387477f)

Original commit changeset: b2116b51dcc8

fbshipit-source-id: d4f1173f4989556ea54974f4c2739ef85a705fae
2021-09-02 16:07:29 -07:00
116142143c [quant] Enable jit tracing on quantizable LSTM (#64438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64438

The quantizable LSTM didn't support jit tracing because it had several non-traceable paths. We sacrifice some of the user experience to enable tracing.

The main UX feature removed is a user-friendly message when trying to access the backward path in a bidirectional LSTM: when the bidirectional flag is `False`, we used to throw a nice error message if the user tried to access the backward weights. Now the message is the default one (the properties were removed).

Test Plan: `buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm`

Reviewed By: mtl67

Differential Revision: D30732630

fbshipit-source-id: 443e351ebb0e2b636c86dea9691b9bf42ffe618f
2021-09-02 15:59:20 -07:00
795387477f [FX] Prototype for guarding against mutable operations in tracing (#64295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64295

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30675780

Pulled By: jamesr66a

fbshipit-source-id: b2116b51dcc87357f0c84192c4c336680875e27a
2021-09-02 15:17:04 -07:00
3c79e0b314 .github: Migrate pytorch_linux_bionic_py_3_6_clang9 to GHA (#64218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64218

Relies on https://github.com/fairinternal/pytorch-gha-infra/pull/11

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra bdhirsh

Test Plan: Imported from OSS

Reviewed By: malfet, H-Huang, janeyx99

Differential Revision: D30651516

Pulled By: seemethere

fbshipit-source-id: e5843dfe84f096f2872d88f2e53e9408ad2fe399
2021-09-02 14:51:00 -07:00
257623da39 Switch Shuffler to use iter-local buffer (#64195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64195

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30642947

Pulled By: ejguan

fbshipit-source-id: d4b52479b4ae37ad693388b9cdb8eed83a136474
2021-09-02 13:40:28 -07:00
f555348aaa Disable CircleCI ROCm build (#64434)
Summary:
Per jithunnair-amd suggestion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64434

Reviewed By: seemethere, janeyx99

Differential Revision: D30732289

Pulled By: malfet

fbshipit-source-id: 1932d0a7d1e648006f8030c8237b187d0709f688
2021-09-02 13:32:02 -07:00
4ce9c530d6 [DataPipe] removing filter's inheritance from map (#64404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404

This PR removes `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function, which is the behavior we want.

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30713120

Pulled By: NivekT

fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
2021-09-02 13:09:47 -07:00
4f43480186 [DataPipe] adding/removing __len__ for different DataPipe (#64398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64398

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30710437

Pulled By: NivekT

fbshipit-source-id: 524eda43a2faa0db0c1a662bf9bb4283f0ade83c
2021-09-02 13:08:32 -07:00
3cd0a4ac15 Fix test_ind_worker_queue by setting max_num_worker based on system resource (#63779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63779

Fixes #63657

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30494185

Pulled By: ejguan

fbshipit-source-id: d1bd24299b25d589889604aaf18ad347bdff4df4
2021-09-02 12:36:56 -07:00
7d010539c9 ENH Adds test and docs for modules that already support no batch dims (#62729)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62729

Reviewed By: H-Huang

Differential Revision: D30669546

Pulled By: jbschlosser

fbshipit-source-id: c771c98c1fd9d28fa984b72893585c738c736505
2021-09-02 12:36:54 -07:00
d0cb26ba57 [DDP] Fix logging iterations (#64411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64411

These are not actually the training iterations, but are offset by how
frequently DDP stats collection actually runs (the default being
kDDPRuntimeLoggingSampleRate = 100). So with this change, they are actually
logged to Scuba at iterations 10, 10 * 100, 40 * 100, etc.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D30718274

fbshipit-source-id: 146bd2428753c93363bee37e487f40104fce3c18
2021-09-02 12:35:01 -07:00
22f3bcd164 .github: Move squid vars to common vars (#64436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64436

Moves the squid variables to our common jinja template so that when we
have to update them they're all in the same place.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet, zhouzhuojie

Differential Revision: D30732776

Pulled By: seemethere

fbshipit-source-id: 22e3757c4eec775baa8abbaac2ba2a0c69c2b2a9
2021-09-02 11:31:54 -07:00
c932afe39b .github: Move upload-artifact-s3 to common var (#64435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64435

Move upload-artifact-s3 to a common variable to be used amongst our
jinja templates, this should make it easier in the future to update
these images

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30732777

Pulled By: seemethere

fbshipit-source-id: 51cd485f5abae134c3c49dfa878e6303ba8e5f25
2021-09-02 11:31:52 -07:00
1519b6084f nn.functional.linear OpInfo (#61971)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61971

Test Plan: - wait for tests

Reviewed By: heitorschueroff

Differential Revision: D30013750

Pulled By: zou3519

fbshipit-source-id: ca41dbd98176c12e50ad1410a658f4b06fe99a1e
2021-09-02 11:31:50 -07:00
c0cdbb1cc5 Revert D30468409: Add fx2trt pass for removing duplicate output args
Test Plan: revert-hammer

Differential Revision:
D30468409 (6da7552a8e)

Original commit changeset: b4d91b76ab5d

fbshipit-source-id: e138dc425fe55ffe3585ea5fac4db476931bafed
2021-09-02 11:31:49 -07:00
9214450b7f [tensorexpr] Wrap error msgs with buildErrorMessages for internal asserts (#64409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64409

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30717786

Pulled By: huiguoo

fbshipit-source-id: a3b147d339ff4927f14efa24407cd3b63d80001d
2021-09-02 11:30:34 -07:00
6da7552a8e Add fx2trt pass for removing duplicate output args (#64433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64433

Fx2TRT does not support duplicate nodes in the output args tuple.

This pass removes duplicate output args from the target subnets and fixes their uses in the top level module where the subnets are called. This pass must be called after acc split on the top-level net and subsequent calls to the acc trace on the subnets.

This pass will change both the subnets and top level module.

Test Plan:
Run:

```
buck run mode/opt -c python.package_style=inplace //caffe2/torch/fb/fx2trt/tests/passes/:test_remove_duplicate_output_args

```

Reviewed By: 842974287

Differential Revision: D30468409

fbshipit-source-id: b4d91b76ab5d8a5275d68dd48d1327a44c22568e
2021-09-02 10:40:37 -07:00
aeafcde087 CI: Enable using labels to control GHA workflows (#64314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62852

Sets a global environment variable containing a list of PR labels. For this PR, the PR_LABELS variable looks like:
```
[
  "cla signed",
  "ciflow/default"
]
```
confirmed in a run: https://github.com/pytorch/pytorch/runs/3490072161?check_suite_focus=true

This information can be used in other workflow steps to control the logic. For example, if I want to force a build, I can label my PR with "force-build" and do something like the following in my build script:
```
if [[ "${PR_LABELS}" = *force-build* ]]; then
   python setup.py install
else
   #use cached wheel or something
fi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64314

Reviewed By: driazati

Differential Revision: D30714570

Pulled By: janeyx99

fbshipit-source-id: 80b060ee32643ddd22eb7b8ec548579c7ccf6441
2021-09-02 10:34:44 -07:00
66ddc6ef9e Fixes and details to torchhub docs (#63783)
Summary:
This PR:

- adds a few details regarding the newly added `skip_validation` parameter https://github.com/pytorch/pytorch/pull/62139
- uses double-backticks instead of single-backticks since this is rst, not markdown.
- adds a few minor doc nits here and there

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63783

Reviewed By: zou3519

Differential Revision: D30696658

Pulled By: NicolasHug

fbshipit-source-id: 6f01c7eb3cfcd7e17e4c33c09d193054fa18ad36
2021-09-02 09:32:57 -07:00
50067c020a TST Adds __repr__ and str to module info (#63737)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/61935

This PR adds `test_repr` to `test_modules`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63737

Reviewed By: gchanan

Differential Revision: D30729642

Pulled By: jbschlosser

fbshipit-source-id: c11a28bc0739abd3ed40727389dd28ed4069edad
2021-09-02 09:32:55 -07:00
2c258d91cc Fix torch.istft length mismatch and window runtime error (#63469)
Summary:
The PR fixes two issues:
- See https://github.com/pytorch/pytorch/issues/62747 and https://github.com/pytorch/audio/issues/1409: the length mismatch when the given ``length`` parameter is longer than expected. Adds padding logic consistent with librosa.
- See https://github.com/pytorch/pytorch/issues/62323: the current implementation checks that the min value of window_envelop.abs() is greater than zero. In librosa they normalize the signal on the non-zero values by indexing, like:
```
approx_nonzero_indices = ifft_window_sum > util.tiny(ifft_window_sum)
y[approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices]
```
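
A minimal round-trip sketch exercising the padded-length path (not from the PR):

```python
import torch

x = torch.randn(2, 4096)
spec = torch.stft(x, n_fft=400, return_complex=True)
y = torch.istft(spec, n_fft=400, length=4096)  # longer than the raw reconstruction; now zero-padded
assert y.shape[-1] == 4096
```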

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63469

Reviewed By: fmassa

Differential Revision: D30695827

Pulled By: nateanl

fbshipit-source-id: d034e53f0d65b3fd1dbd150c9c5acf3faf25a164
2021-09-02 09:31:47 -07:00
616fd9219d [Static Runtime] Add sign/abs/lop1p/mul fusion pass (#64209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209

Add a new fusion pass that transforms the following pattern:
```
graph(%input):
    %0 : Tensor = aten::sign(%input)
    %1 : Tensor = aten::abs(%input)
    %2 : Tensor = aten::log1p(%1)
    %res : Tensor = aten::mul(%0, %2)
    return (%res)
```
Into a single op:
```
graph(%input):
    %res : Tensor = static_runtime::signed_log1p(%input)
    return (%res)
```

The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.

Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
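
For reference, a Python equivalent of the unfused math (the actual fused kernel is a C++ Static Runtime op):

```python
import torch

def signed_log1p(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * torch.log1p(torch.abs(x))
```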

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Test passed with new graph pass disabled and enabled.

Reviewed By: hlu1

Differential Revision: D30559929

fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
2021-09-02 08:31:40 -07:00
cd3be4675f [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D30710635

fbshipit-source-id: e8dae05a7e3a19d656067a4f102aab4a3c93ac42
2021-09-02 08:31:37 -07:00
f04e6594ed Fix broken caffe2 test: PlanExecutorTest.BlockingErrorPlan (#64401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64401

PlanExecutorTest.BlockingErrorPlan uses `ASSERT_DEATH` which internally performs a `fork()`. This can cause problems under certain configurations that use threads. This change updates this test to use the "threadsafe" style for GTest death tests in order to improve its quality in multithreaded environments.

Test Plan:
I confirmed that this change fixes the issue on my devvm with the following command:
```
buck test mode/dev //caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest.BlockingErrorPlan
```

Reviewed By: praihan

Differential Revision: D30709447

fbshipit-source-id: 12ffd9ad0371e2e5b43a9873c80568e5ab02d246
2021-09-02 08:30:29 -07:00
b737629ff0 simplify op name determination into a single forward pass (#64261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64261

Note that this does not preserve byte-for-byte compatibility with
existing names.

Test Plan:
* Rely on CI to catch gross errors.
* Merge after release cut to catch subtle issues.

Reviewed By: albanD

Differential Revision: D30700647

Pulled By: dagitses

fbshipit-source-id: 7b02f34b8fae3041240cc78fbc6bcae498c3acd4
2021-09-02 07:32:11 -07:00
b2c7c1dfcf fix copy.deepcopy on LinearPackedParams (#64367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64367

This is the same thing as https://github.com/pytorch/pytorch/pull/56154
but for quantized linear. It fixes the behavior of `copy.deepcopy` on
these modules. Before this PR, copied instances of `LinearPackedParams`
were not properly initialized, and inspecting them raised errors about
missing `_modules`. After this PR, inspecting and using the copies
works.
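
A minimal sketch of the fixed behavior (module names per the stable quantized API at the time):

```python
import copy
import torch

qlinear = torch.nn.quantized.Linear(4, 4)
qlinear_copy = copy.deepcopy(qlinear)   # previously left the copy half-initialized

x = torch.quantize_per_tensor(torch.randn(2, 4), scale=0.1, zero_point=0,
                              dtype=torch.quint8)
out = qlinear_copy(x)                   # inspecting and using the copy now works
```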

Test Plan:
```
python test/test_quantization.py TestStaticQuantizedModule.test_linear_api
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D30702667

fbshipit-source-id: 38c26d1e72663416eeb989985b77ffc2052c12b9
2021-09-02 06:30:42 -07:00
99b064fac4 [jit] shape propagation for prepack (#63585)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63585

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30428905

Pulled By: IvanKobzarev

fbshipit-source-id: c18f6605a69b2e000bdf14a23e637c5a1c2ec64c
2021-09-02 05:30:38 -07:00
cdb46f4c6e extract TestAutogradComplex into its own test file (#63400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63400

This is the first step to break up test_autograd.py for #63205.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30541499

Pulled By: dagitses

fbshipit-source-id: 8d9d32007938b9eade0e88f95a6a3190e7e2ef01
2021-09-02 04:34:35 -07:00
be5b05c1dc require that TARGET_DET_LIST is sorted (and sort it here) (#64102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64102

We sort this list so that we may add comments to indicate the absence
of a file right where that file would need to be put. This makes it
difficult to wrongly add such a file.

The sorting itself was done programmatically to ensure that no entries
were inadvertently removed.

I printed the sorted list with:

```
  for p in sorted(TARGET_DET_LIST):
    print(f'    "{p}",')
```

Then copied it back into the file.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30625076

Pulled By: dagitses

fbshipit-source-id: cf36fcb3e53e274b76d1f4aae83da1f53c03f9ed
2021-09-02 04:34:33 -07:00
aedd70fcfe Fix list() and help() torchhub functions for Windows (#63773)
Summary:
This PR fixes the help() and list() torchhub functions, which were probably failing on Windows since the `/` OS separator was hardcoded.

Before merging this I need to double-check whether the CI actually runs the corresponding tests on Windows or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63773

Reviewed By: zou3519

Differential Revision: D30695664

Pulled By: NicolasHug

fbshipit-source-id: fac328163fd05db804a8186ae28f22b3cc3a6404
2021-09-02 04:34:31 -07:00
030154e241 Remove outdated comment in hub.py (#63757)
Summary:
This PR removes an outdated comment about Python2 that was originally introduced in https://github.com/pytorch/pytorch/pull/25083/files. The code has changed since then, but the comment wasn't removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63757

Reviewed By: zou3519

Differential Revision: D30695656

Pulled By: NicolasHug

fbshipit-source-id: 431cf414588b9e5a1ad6acdae724ff5af1b16971
2021-09-02 04:34:29 -07:00
1c735768ed Update hub.load() signature to avoid polluting kwargs param (#63755)
Summary:
This PR addresses an old comment about Python2 EOL, directly putting some parameters in the function signature instead of in a `**kwargs` dict.

I believe the changes are fully backward compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63755

Reviewed By: zou3519

Differential Revision: D30695634

Pulled By: NicolasHug

fbshipit-source-id: 398f347c5a04bfb58e77e46773a869cb9d0eb225
2021-09-02 04:32:22 -07:00
6db8f7a709 Fix TRTModule not adding outputs in order (#64418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64418

In T99368564, we found that when running a TRT-lowered module, the output tensors are out of order compared to the output from the original, non-lowered module. It turns out that in `TRTModule.forward()`, we cannot rely on the `ICudaEngine` bindings' natural order indices to create the output tensors; rather, we should explicitly construct the output tensors from the bindings' names, in an order that we supply.
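
A hedged sketch of gathering outputs by binding name in a supplied order rather than relying on the engine's natural binding order (function and variable names are hypothetical):

```python
def gather_outputs(engine, binding_tensors, output_names):
    # output_names is the order we supply; engine is a tensorrt.ICudaEngine.
    outputs = []
    for name in output_names:
        idx = engine.get_binding_index(name)
        outputs.append(binding_tensors[idx])
    return tuple(outputs)
```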

Test Plan:
* Arc lint
* Run CI/sandcastle tests
* Run GPU lowering using commands and code changes in D30171741 and ensure we don't observe out-of-order outputs

Reviewed By: yinghai

Differential Revision: D30693545

fbshipit-source-id: 32a894ceeb148fcf4e8d279be3835c7d1f1aa2ba
2021-09-02 01:36:23 -07:00
76e187aa08 Port gather to structured kernel (#63312)
Summary:
Will add a description once this is ready for review.

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63312

Reviewed By: iramazanli

Differential Revision: D30597447

Pulled By: ezyang

fbshipit-source-id: d36e59835c2f4b38e286032dd2a1111a7e16b7e5
2021-09-02 01:36:21 -07:00
ee8a6c1d14 Replace std::unordered_map<c10::Device, c10::Device> with DeviceMap (#64393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64393

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30708384

Pulled By: pbelevich

fbshipit-source-id: 1c565727e4f09cd9e560874dd90aa403470b4a97
2021-09-02 01:36:19 -07:00
8d5b95019d [PyTorch Edge] Support default args with out arg, flag off (#63540)
Summary:
1. Allow consuming operators with default arguments and out arguments. The flag is off to keep the same behavior as v6; PR #63651 turns the flag on.
2. Add two unit tests to cover this type of operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63540

ghstack-source-id: 137211562

Test Plan:
```
caffe2/test/cpp/jit:jit - LiteInterpreterTest.DefaultArgsWithOutArg
caffe2/test/cpp/jit:jit - LiteInterpreterTest.DefaultArgsPinvWithOutArg
```

Reviewed By: raziel, iseeyuan, tugsbayasgalan

Differential Revision: D30414156

fbshipit-source-id: 0f3a219a22aee10ac53184cbd95940726c459d1f
2021-09-02 01:36:16 -07:00
0addd75be9 Remove unnecessary resize_output (#64272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64272

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang, bdhirsh

Differential Revision: D30686941

Pulled By: ezyang

fbshipit-source-id: de60e6f1115648f8cf7daaa1e652594fe8b06742
2021-09-02 01:34:17 -07:00
69e1207084 Move graph util to fx2trt (#64064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64064

Move the original util from torch2trt to the fx2trt dir, since torch2trt is going to be deprecated. This is a follow-up diff to D30379124

Test Plan: manual

Reviewed By: yinghai, mikekgfb

Differential Revision: D30591687

fbshipit-source-id: ae0e59dfbc2d2e2aa4f3ccea7cff2291c7deb388
2021-09-01 22:34:11 -07:00
71e149834b Add a warning about DataLoader num_workers > 0 "memory leak" (#64337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64337

See https://github.com/pytorch/pytorch/issues/13246

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30690320

Pulled By: ezyang

fbshipit-source-id: 2751aca05a94e63d25162599f458855988516fad
2021-09-01 21:49:41 -07:00
d067f15622 [Dist CI] Move rest of distributed tests to their own CI job (#64253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64253

Follow-up to D30496178 (f4aff3a346) to move the rest of the distributed tests to their own jobs for Linux GHA.
ghstack-source-id: 137233785

Test Plan: CI

Reviewed By: walterddr

Differential Revision: D30662999

fbshipit-source-id: f7cfbc0d1223aca52120f17f9da987d70fda8de6
2021-09-01 21:43:41 -07:00
4d6314a16e [DDP] Log num threads (#64072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64072

Log gloo threads to DDP logging.
ghstack-source-id: 137119480

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D30596083

fbshipit-source-id: 2b4f6e762cb5d850be6056bcc5922029a1af3c91
2021-09-01 18:36:15 -07:00
59c6ceb6a8 add documentation to shape inference algorithm (#64312)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64312

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30709254

Pulled By: migeed-z

fbshipit-source-id: 3297d26fe6727c5b9ca176625b1683d787f59659
2021-09-01 18:34:17 -07:00
778af56504 [DDP Comm Hook] Add debugging communication hooks to ddp_comm_hooks.rst (#64352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64352

as title
ghstack-source-id: 137246253

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30694089

fbshipit-source-id: a78110b11d59bb0718f43c99ede23f2fd8ab21d0
2021-09-01 17:37:19 -07:00
bf9d66586c [DDP Comm Hook] Create a noop hook for performance debugging (#64344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64344

As title.

Additionally, avoid using numpy array in test_ddp_hooks.py.
ghstack-source-id: 137170449

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks -- test_ddp_comm_hook_noop_hook

Reviewed By: rohan-varma

Differential Revision: D30693220

fbshipit-source-id: e17f0d1c6198863cf20a53566f586a6bff602522
2021-09-01 17:36:22 -07:00
baceea4426 [DDP] Add more logging iterations (#64071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64071

Adding more logging iterations to get additional data.
ghstack-source-id: 137119476

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D30579367

fbshipit-source-id: 57195266ada5e5926f0d8eaf4fb4e01dc98924d7
2021-09-01 17:32:17 -07:00
59fcbd172b Fix incorrect DDP test (#64074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64074

Previous PR https://github.com/pytorch/pytorch/pull/63831 did not actually test the error in https://github.com/pytorch/pytorch/issues/63812. Introduce a test
directly from the repro that simulates it.
ghstack-source-id: 137171460

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30569719

fbshipit-source-id: fd61250ef6d291c093607663d91d6d2cb5574eb7
2021-09-01 16:34:06 -07:00
9b8f9d5a25 [c10d] Prefer use of torch_check (#63928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63928

Throwing std::invalid_argument results in not getting stack traces with
TORCH_SHOW_CPP_STACKTRACES=1, so instead prefer torch_check here.
ghstack-source-id: 137135328

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D30533955

fbshipit-source-id: 33e5bf4f449e3043dec68da93f8022f6624d9675
2021-09-01 16:34:05 -07:00
5d80a48cef Add fast path for addmm when the inputs are conjugate (#59380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59380

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28898374

Pulled By: anjali411

fbshipit-source-id: eab0e64d37bb57c18b54cabb8e5c00666338ba04
2021-09-01 16:34:02 -07:00
a8f9aab840 [DDP Comm Hook] Add bf16 gradient compression to ddp_comm_hooks.rst (#64346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64346

as title
ghstack-source-id: 137170288

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30693513

fbshipit-source-id: 8c64b8404ff3b0322e1bbbd93f6ef051ea91307d
2021-09-01 16:34:00 -07:00
ed89937d2c [quant][graphmode][fx] Add fbgemm backend_config_dict (#64288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64288

This is just to set up the file structure and unblock experimentation.
The format of backend_config_dict will change in the future.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: zou3519

Differential Revision: D30699457

fbshipit-source-id: 28211a4def05d34757850c045a36e311f54760fe
2021-09-01 16:32:43 -07:00
69f4401b7b Make datasets in ConcatDataset not need to be sized (#64114)
Summary:
`datasets` needs to be iterable, but also sized, because its length is checked; yet immediately afterwards it is converted to a list. By swapping the order of these two lines, it no longer needs to be sized.
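
A minimal sketch of the reordering (names approximate):

```python
def init_concat(datasets):
    datasets = list(datasets)  # materialize first; input now only needs to be iterable
    assert len(datasets) > 0, 'datasets should not be an empty iterable'
    return datasets
```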

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64114

Reviewed By: H-Huang

Differential Revision: D30641480

Pulled By: ejguan

fbshipit-source-id: 7e16548c2123afa65b83845f9929271fa07fe1e8
2021-09-01 15:32:50 -07:00
535526b95c Restore LayerNorm numerics test (#64385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64385

It was deleted in https://github.com/pytorch/pytorch/pull/63276.

The numerics test was meant to check LayerNorm behavior on large inputs,
but we deleted it without realizing that.

Test Plan: - wait for tests.

Reviewed By: ngimel

Differential Revision: D30702950

Pulled By: zou3519

fbshipit-source-id: a480e26c45ec38fb628938b70416cdb22d976a46
2021-09-01 15:32:49 -07:00
7ffcf15503 [quant][graphmode][api] Add backend_config_dict to prepare_fx api (#64135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64135

We want to start aligning the api with the design in https://github.com/pytorch/pytorch/wiki/Extending-PyTorch-Quantization-to-Custom-Backends

We plan to gradually move things from `prepare_custom_config_dict` and `convert_custom_config_dict`
to `backend_config_dict` and allow custom backend developer to define their own way of quantizing operators.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: zou3519

Differential Revision: D30699456

fbshipit-source-id: e3c068da8d3da2270f57719f7159cc71cafa8598
2021-09-01 15:32:47 -07:00
93bc03622e Silent rm error for sccache log file (#64388)
Summary:
Sample reporting from dr.ci

![image](https://user-images.githubusercontent.com/658840/131724645-75afa04f-7554-4674-8e7c-cf139c84d994.png)

The `rm` command is not actually running into problems; we just need to silence the console output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64388

Reviewed By: walterddr, malfet, seemethere

Differential Revision: D30704439

Pulled By: zhouzhuojie

fbshipit-source-id: ecd35531decf05b75cef30d08d46635f81112f67
2021-09-01 15:32:45 -07:00
9495674905 [xplat][metal] Add getters and setters for ivars in Conv2dOpContext (#57395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57395

As title
ghstack-source-id: 137223806

(Note: this ignores all push blocking failures!)

Test Plan:
### Lib Build
- `buck build caffe2:aten_metal_prepack`

### Integration Test
- `arc focus2 pp-ops -a ModelRunner`
- Click "Test Person/Hair Segmentation Model"

{F612831435}

- Image Classification Demo

{F614144868}

Reviewed By: xta0

Differential Revision: D28132020

fbshipit-source-id: 73560263a9d14e9ecfa39c69deb158a2ed8cb179
2021-09-01 15:31:12 -07:00
968d7ee46a [structured] Preserve computed elements from meta func to impl (#61746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61746

**Summary**
This commit introduces a new feature for structured kernels that allows
kernels to declare quantities as "precomputed" in
`native_functions.yaml`, compute them once in the `meta` function and
reuse them again in the `impl`. The names and types of these quantities
are used to generate code for a struct containing them that the `meta`
function must return. In the case of a handful of surveyed kernels
(`all`, `any`, `avg_pool2d`), these quantities that are used both in
the `meta` and `impl` have the same meaning as certain kernel arguments
and in fact supersede them. Accordingly, the correspondence between a
kernel argument and the precomputed elements that supersede it is also
captured in `native_functions.yaml`. This information is used to unpack
the struct returned by `meta` and pass its contents correctly to the
`impl` function.

The primary goal is to avoid recomputation and enhance the developer
experience (e.g., people can sometimes forget to compute these elements while
porting a kernel).

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D30407831

Pulled By: SplitInfinity

fbshipit-source-id: 00975525ea373721fe52d06f75cd4ac91f3dc556
2021-09-01 14:34:25 -07:00
4aad366111 [Static Runtime] Make per-op latency readable by FAI-PEP (#64315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64315

Add a new flag `generate_ai_pep_output` to `StaticRuntime::benchmark`. If set, produces per-op-kind average total latency in milliseconds in a JSON format recognized by [Facebook AI performance evaluation platform (FAI-PEP)](https://github.com/facebook/FAI-PEP).

This is useful for observing the impact of changes that make a big difference for a specific op, but do not affect the overall SR latency by more than a few percent.

Reviewed By: hlu1

Differential Revision: D30679352

fbshipit-source-id: c847fa6ea20774aaf1e7949b11db4421d1f70b7e
2021-09-01 14:34:22 -07:00
86c9654291 Update optimize_for_mobile to preserve node's debug information (#63106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63106

Propagate debug info to the re-written nodes in the graph.

Test Plan:
- Clone open source repo and build
- ``` python3 test/test_jit.py TestOptimizeForMobilePreserveDebugInfo ```
- Tests pass

Reviewed By: kimishpatel

Differential Revision: D28654659

fbshipit-source-id: 2d7c87f2fb95a3be53246375f35639bbd97c237e
2021-09-01 14:34:20 -07:00
15ff25d1fc Break up "@generated" string so Phabricator shows changes
Summary: Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
CI

Sandcastle run

Reviewed By: larryliu0820

Differential Revision: D30701781

fbshipit-source-id: 3acab8b65a327c4ec7da90bc855ecf02f801c40a
2021-09-01 14:34:18 -07:00
e322547fe6 Add forward AD support for custom Functions (#64061)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64061

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30640868

Pulled By: albanD

fbshipit-source-id: b0e6610430a879074d6d5306443772fc154b431f
2021-09-01 14:33:09 -07:00
25e2578967 Fix bytes_written and bytes_read (#64244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64040

In operator cost inference functions, in many places we are using sizeof(x.data_type()). Since data_type() returns a 32 bit integer from [this enum](https://www.internalfb.com/code/fbsource/[15e7ffe4073cf08c61077c7c24a4839504b964a2]/fbcode/caffe2/caffe2/proto/caffe2.proto?lines=20), we are basically always getting 4 for sizeof(x.data_type()) no matter what actual data type x has. Big thanks to Jack Langman for specifically pointing to this bug.

We now use the size in bytes based on the actual data type.

Test Plan:
Added unit tests BatchMatMulMemCostTest:

buck test //caffe2/caffe2/fb/fbgemm:batch_matmul_op_test -- BatchMatMulMemCostTest

Extended existing unit test test_columnwise_concat for different data types:

buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test -- test_columnwise_concat

Reviewed By: CrazySherman

Differential Revision: D30656698

fbshipit-source-id: d42c0c9a0c5b0ddc5dba39e4994f1f85a5e618bf
2021-09-01 13:35:41 -07:00
03a58a2ba0 [Caffe2] Create fewer strings during argument fetching (#64285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64285

With C++14 heterogeneous ordered container lookup, it is no longer necessary to create a `std::string` in order to look up elements of a `CaffeMap` keyed by std::string. Accordingly, this diff reworks the argument-getting operator functions to avoid that in favor of `c10::string_view`.
ghstack-source-id: 137139818

Test Plan: buildsizebot iOS apps -- code size win. Fewer strings is probably marginally good for perf, but this only happens at setup time anyway.

Reviewed By: dzhulgakov

Differential Revision: D26826676

fbshipit-source-id: ee653b14dc2c528bae8c90f0fc6a7a419cbca1d6
2021-09-01 13:30:54 -07:00
468001600c Back out "Revert D30327514: [Pytorch lite predictor] Use KinetoEdgeCPUProfiler for operator profiling." (#64307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64307

Original commit changeset: 0b2aa7c57d08

Restores original changes.
This diff changes the way operator profiling is done in lite predictor
benchmarking binary.
Instead of using custom callbacks it uses KinetoEdgeCPUProfiler to profile
events and then generate operator level metrics from them.
Since KinetoEvents do not contain cpu clock time, we now report only wallclock
time.
This unifies the various profiling efforts that we have for benchmarking purposes. In
production we will still use the observer based mechanism, but the advantage of
using the kineto profiler is that we get a few other things for free, such as:
- chrome trace generation
- operator level memory profiling (to be added)
- flop counts (to be added)

Furthermore, we could use a python post-processing script to parse the chrome
trace and generate output similar to torch.profiler. (To be done)

Furthermore, this removes some tests from test_lite_interpreter.cpp which were testing module hierarchy in debug info. They should be covered by test_mobile_profiler.cpp.

Test Plan:
aibench run
Model without debug info:
https://www.internalfb.com/intern/aibench/details/219598441154763
Model with debug info and --print_module_info true (see Operator summary has now module hierarchy information).
https://www.internalfb.com/intern/aibench/details/617154236292985

Reviewed By: raziel

Differential Revision: D30680354

fbshipit-source-id: b6ba0d59c510c13d13d9935b1d8051cc82ffa4e9
2021-09-01 13:29:35 -07:00
421d8f86b6 Add a record scope around autograd::engine::evaluate_function (#63619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63619

Adds a RECORD_FUNCTION scope around the function that is being evaluated as part
of backwards execution. This has been useful in picking up some operations
in the backwards pass that otherwise would not show up, for example custom
autograd functions implemented in C++.
ghstack-source-id: 137041723

Test Plan:
CI

benchmark:
buck run mode/opt //scripts/rvarm1/ddp:bench

Reviewed By: albanD

Differential Revision: D30439492

fbshipit-source-id: 955917770cdf2a2edb0303223ace710b668ba388
2021-09-01 12:32:30 -07:00
0b48d96895 [Bootcamp] Include both python unittest and parser parameters in --help and -h flag (#64297)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45945

Creates a new thread to run -h or --help with unittest.main if the help flag is present, and keeps the add_help default for parameters.

Includes both python unittest and parser parameters in the --help and -h flags, and will remain up to date since both messages are displayed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64297

Test Plan:
Imported from GitHub

`python test/test_spectral_ops.py --help`

Output:
```
% python test/test_spectral_ops.py --help
usage: test_spectral_ops.py [-h] [-v] [-q] [--locals] [-f] [-c] [-b] [-k TESTNAMEPATTERNS] [tests [tests ...]]

positional arguments:
  tests                a list of any number of test modules, classes and test methods.

optional arguments:
  -h, --help           show this help message and exit
  -v, --verbose        Verbose output
  -q, --quiet          Quiet output
  --locals             Show local variables in tracebacks
  -f, --failfast       Stop on first fail or error
  -c, --catch          Catch Ctrl-C and display results so far
  -b, --buffer         Buffer stdout and stderr during tests
  -k TESTNAMEPATTERNS  Only run tests which match the given substring

Examples:
  test_spectral_ops.py                           - run default set of tests
  test_spectral_ops.py MyTestSuite               - run suite 'MyTestSuite'
  test_spectral_ops.py MyTestCase.testSomething  - run MyTestCase.testSomething
  test_spectral_ops.py MyTestCase                - run all 'test*' test methods
                                       in MyTestCase

usage: test_spectral_ops.py [-h] [--subprocess] [--seed SEED] [--accept] [--jit_executor JIT_EXECUTOR] [--repeat REPEAT]
                            [--test_bailouts] [--save-xml [SAVE_XML]] [--discover-tests] [--log-suffix LOG_SUFFIX]
                            [--run-parallel RUN_PARALLEL] [--import-slow-tests [IMPORT_SLOW_TESTS]]
                            [--import-disabled-tests [IMPORT_DISABLED_TESTS]]

optional arguments:
  -h, --help            show this help message and exit
  --subprocess          whether to run each test in a subprocess
  --seed SEED
  --accept
  --jit_executor JIT_EXECUTOR
  --repeat REPEAT
  --test_bailouts
  --save-xml [SAVE_XML]
  --discover-tests
  --log-suffix LOG_SUFFIX
  --run-parallel RUN_PARALLEL
  --import-slow-tests [IMPORT_SLOW_TESTS]
  --import-disabled-tests [IMPORT_DISABLED_TESTS]
  ```

Also ran some other tests to make sure they still worked, and ran other tests with the --help or -h flag.

Reviewed By: seemethere

Differential Revision: D30677776

Pulled By: PatrickKan

fbshipit-source-id: eb3d6e3fa677137ec703ec3a23808efb99acc896
2021-09-01 12:30:47 -07:00
c6505cc383 [FX] Fix python code generation for wrapped getattr() with default value (#64271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64271

Closes #60417

Modified emit_node() in fx/graph.py to generate a getattr() call with the default value when len(node.args) != 2, instead of accessing the attribute directly.
Added test_torch_fx_getattr() in test/test_fx.py.
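
A minimal sketch of the case being fixed (the attribute name `foo` is just an
example):

```python
import torch
import torch.fx as fx

graph = fx.Graph()
x = graph.placeholder("x")
# A call_function node targeting builtin getattr with a default value;
# codegen should now emit `getattr(x, 'foo', 0)` instead of `x.foo`.
attr = graph.call_function(getattr, (x, "foo", 0))
graph.output(attr)

gm = fx.GraphModule(torch.nn.Module(), graph)
print(gm.code)              # should contain getattr(x, 'foo', 0)
print(gm(torch.randn(2)))   # tensors have no 'foo' attribute, so prints 0
```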

Test Plan:
pytest test/test_fx.py

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30671265

fbshipit-source-id: f2db9ea47e0cb247547e200684f715aab006c374
2021-09-01 11:30:27 -07:00
87d8ab6e50 [nnc] Updated generic error message with info about turning off the fuser (#64316)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64316

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30683942

Pulled By: navahgar

fbshipit-source-id: d86607563672213f99a1436dcf4f5dc28053b713
2021-09-01 10:31:50 -07:00
c4f3f6e62d Fixes reduction launch config (#64304)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48573
See also https://github.com/pytorch/pytorch/pull/64194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64304

Reviewed By: janeyx99

Differential Revision: D30689600

Pulled By: ngimel

fbshipit-source-id: bf2103ca177fd3b6e27bc0324b81925234483a29
2021-09-01 10:30:40 -07:00
d5bfdd3dac OpInfo for nn.functional.layer_norm (#63276)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

Note:

* This PR also adds a reference test inspired by existing tests in `test_nn.py`.

cc: mruberry zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63276

Reviewed By: ejguan

Differential Revision: D30452483

Pulled By: zou3519

fbshipit-source-id: 2578d01ca34e031668a41bd284db60c31ae1fba8
2021-09-01 09:31:45 -07:00
d1f3d85fd8 fix GradBucket.is_last() logic (#63768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63768

Passed the number of buckets to the GradBucket constructor, to check whether the index is equal to num_buckets - 1 in the .is_last() function.
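
In pure-Python terms the fixed logic amounts to the following (a hypothetical
sketch; the real GradBucket is a C++ class exposed through torch.distributed):

```python
class GradBucketSketch:
    # Hypothetical Python rendering of the fix; names are illustrative only.
    def __init__(self, index: int, bucket_count: int):
        self.index = index
        self.bucket_count = bucket_count  # now passed to the constructor

    def is_last(self) -> bool:
        # Previously this could not be answered from the index alone.
        return self.index == self.bucket_count - 1
```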

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

test output: https://www.internalfb.com/intern/testinfra/testconsole/testrun/8162774375985873/

Reviewed By: SciPioneer, mrshenli

Differential Revision: D30455913

fbshipit-source-id: 8c67ca69cbf191d6e189e09248407eb167bb24b6
2021-09-01 09:29:13 -07:00
92b31b59af Revert D29699456: [pytorch][PR] Enable Half, BFloat16, and Complex dtypes for coo-coo sparse matmul [CUDA]
Test Plan: revert-hammer

Differential Revision:
D29699456 (ad4848565e)

Original commit changeset: 407ae53392ac

fbshipit-source-id: b6c70ba8bb28c0c38de47857030b69792a8470de
2021-09-01 07:32:24 -07:00
0c4e4e588e [FX] Rename reduce functions back to their old, public names (#64324)
Summary:
Unfortunately pickle serializes the names of these functions. Also put them under backward-compatibility enforcement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64324

Test Plan: Local repro https://fb.workplace.com/groups/3440841732711443/permalink/4018921611570116/

Reviewed By: SplitInfinity, TailofJune

Differential Revision: D30684185

Pulled By: jamesr66a

fbshipit-source-id: 900701220155d15115cd0c07cf7774a2891bd04f
2021-08-31 22:36:11 -07:00
05ecaefbbf [Metal][GPU] Enable metal for simulators and fix test failures if possible (#64322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64322

As title
ghstack-source-id: 137143877

Test Plan:
- `aibench-cli mobile`
- Select iOS -> `y` -> `1` -> `n` -> "--metal_op_test"
- Select all iPhone 6 + iPhone 7 + iPhone 8 and an iPhone X or 11 or 12
```
Benchmark Submitted. Find more details at: https://our.intern.facebook.com/intern/aibench/details/318120612514604
Benchmark Status:
        D10AP-12.0.1: DONE
        N71mAP-14.3: DONE
DUMMY latency:
        D10AP-12.0.1: 4319.3
        N71mAP-14.3: 8868.51
I0831 16:06:27.210558 605277 ClientSingletonManager.cpp:99] Shutting down Manifold ClientSingletonManager
```

Reviewed By: xta0

Differential Revision: D30147163

fbshipit-source-id: 2de6bbd9bd525e32ca92b2845eb435800855edcc
2021-08-31 22:36:09 -07:00
24e50b8453 [CUDA graphs] hotfix for test_graph_ (#64339)
Summary:
Graphed workloads that try to capture a full backward pass must do warmup on a non-default stream. If warmup happens on the default stream, AccumulateGrad functions might tag themselves to run on the default stream, and therefore won't be capturable.

ngimel and I suspect some test_cuda.py tests run with the default stream as the ambient stream, which breaks `test_graph_grad_scaling` because `test_graph_grad_scaling` does warmup on the ambient stream _assuming_ the ambient stream is a non-default stream.

This PR explicitly sets a side stream for the warmup in `test_graph_grad_scaling`, which is what I should have done all along because it's what the new documentation recommends.
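
The recommended warmup pattern looks roughly like this (a minimal sketch; the
linear layer and input are toy stand-ins for the real workload):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
inp = torch.randn(4, 8, device="cuda")

# Warm up on a non-default side stream so AccumulateGrad functions don't
# tag themselves to the default stream, which would break graph capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        model(inp).sum().backward()
torch.cuda.current_stream().wait_stream(side_stream)
```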

I pushed the PR branch straight to the main pytorch repo because we need to run ci-all on it, and I'm not sure what the requirements are these days.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64339

Reviewed By: mruberry

Differential Revision: D30690711

Pulled By: ngimel

fbshipit-source-id: 91ad75f46a11f311e25bc468ea184e22acdcc25a
2021-08-31 22:34:10 -07:00
479fc4e412 Remove outdated warning about RecursiveScriptModule not being copiable (#64085)
Summary:
RecursiveScriptModule has its customized `__copy__` and `__deepcopy__` defined. The warning/error that says it is not copyable is outdated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64085

Reviewed By: rohan-varma

Differential Revision: D30598623

Pulled By: gmagogsfm

fbshipit-source-id: 0701d8617f42d818bc7b88244caee4cd47fbe976
2021-08-31 21:31:32 -07:00
8337a3fb3f [TensorExpr] Wrap error messages with buildErrorMessage call. (#64330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64330

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30687226

Pulled By: ZolotukhinM

fbshipit-source-id: ade1be2ad6847c6afbba60307ef854696821b4e3
2021-08-31 20:31:16 -07:00
a87808de93 Fix bug in ShardedTensorMetadata serde. (#63902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63902

The 'memory_format' field was not being serialized correctly and used
the same encoding for different fields.
ghstack-source-id: 137142406

Test Plan: waitforbuildbot

Reviewed By: bowangbj

Differential Revision: D30527324

fbshipit-source-id: f0f223e2d660ef6e4abae9649d9992acc36e1278
2021-08-31 20:31:14 -07:00
fa5676a41b Delete some dead code from RRefMessageBase (#64298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64298

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30676702

Pulled By: pbelevich

fbshipit-source-id: 77dbc0f8064c3518376454ff573d45ed0274956b
2021-08-31 20:30:04 -07:00
6bb4b5d150 disallow empty named dims list to flatten(names, name) (#61953)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61137 by raising an error if an empty tuple is passed in for the names:
```
>>> torch.empty((2, 3), names=['a', 'b']).flatten((), 'abc')
RuntimeError: flatten(tensor, dims, out_dim): dims cannot be empty
```

or from the original issue:
```
>>> torch.empty((2, 3)).flatten((), 'abc')
RuntimeError: flatten(tensor, dims, out_dim): dims cannot be empty
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61953

Reviewed By: iramazanli

Differential Revision: D30574571

Pulled By: malfet

fbshipit-source-id: e606e84458a8dd66e5da6d0eb1a260f37b4ce91b
2021-08-31 19:32:30 -07:00
c59970db6b [caffe2][easy] Save heap allocation in ConcatOp (#63529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63529

Output() takes an IntArrayRef, so we can just use a std::initializer_list (stack-allocated array) instead of std::vector here.
ghstack-source-id: 137085908

Test Plan: existing CI

Reviewed By: mruberry

Differential Revision: D29687400

fbshipit-source-id: 9f2a7c6679f2552c098bb1bf7befaca18e0e5d4d
2021-08-31 18:33:32 -07:00
b23e4f6086 Convert mul to use opmath_gpu_kernel_with_scalars (#64019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64019

Note that previously the functor operated on scalar_t and
this modifies it to operate on opmath_t, but this is not
a problem as half precision was implemented by performing the
compute in float anyway.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30575282

Pulled By: ezyang

fbshipit-source-id: cc6900ef996e755740afe48f9cb4d0366858dd47
2021-08-31 18:33:30 -07:00
0733582087 Use the correct overloaded name to skip boxed autograd not implemented kernel registration (#64182)
Summary:
Some internal use_count tests are failing for `dequantize_self` because we only compare the skip list with the base name `dequantize`, when we should be comparing with the full name, including the overload.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64182

Reviewed By: albanD

Differential Revision: D30639909

Pulled By: soulitzer

fbshipit-source-id: d4d22dd1a5c8f7180251ce7739830764cce6f151
2021-08-31 18:33:28 -07:00
09e610e36d [Static Runtime] Out version for softmax (#64243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243

Test Plan:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
...
V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1)
...
[       OK ] StaticRuntime.IndividualOps_Softmax (803 ms)
```

Reviewed By: hlu1

Differential Revision: D30656149

fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c
2021-08-31 18:33:26 -07:00
0b9cdeb295 .circleci: Remove already migrated CUDA configs (#64231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64231

This migrates over the CUDA 11.1 and CUDA 10.2 configs that we had
previously migrated to GHA

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D30683811

Pulled By: seemethere

fbshipit-source-id: 71b0761461557d871c26eb02f665a2e4d9b1d9fb
2021-08-31 18:33:24 -07:00
23da90ab84 .github: Consolidate linux setup / teardown (#64229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64229

Consolidates linux setup / teardown into easy to use jinja2 macros

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie, driazati

Differential Revision: D30683810

Pulled By: seemethere

fbshipit-source-id: 2578630df3e212fb79392a699090553baef44cc2
2021-08-31 18:31:48 -07:00
5ecb966e0f Add ciflow-tracking issue to pytorch-probot (#64125)
Summary:
Doesn't do anything yet...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64125

Reviewed By: zhouzhuojie

Differential Revision: D30620283

Pulled By: malfet

fbshipit-source-id: 91869d35c1b70a55e32261d2c32fb0136ec33960
2021-08-31 17:38:34 -07:00
9e25634833 [TensorExpr] Move declaration of buildErrorMessage to exception.h (#64301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64301

Test Plan: Imported from OSS

Reviewed By: navahgar, huiguoo

Differential Revision: D30678215

Pulled By: ZolotukhinM

fbshipit-source-id: 599c83b3890450a0fb6526815f037eec9563661c
2021-08-31 17:37:29 -07:00
44fcb00a56 Fix redundant class definition in GraphModule singleton constructor (#64274)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64274

Reviewed By: jamesr66a

Differential Revision: D30675970

Pulled By: jayleverett

fbshipit-source-id: e74ef2a28013f0fa7c58d14f38e66cfe48d26b74
2021-08-31 17:34:14 -07:00
c2da103fe6 Discover new tests in run_tests.py (#64246)
Summary:
Introduce a `discover_tests` function that globs for all Python files
starting with `test_` in the test folder, excluding subfolders which are
executed differently.

Fixes https://github.com/pytorch/pytorch/issues/64178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64246

Reviewed By: walterddr, seemethere

Differential Revision: D30661652

Pulled By: malfet

fbshipit-source-id: a52e78ec717b6846add267579dd8d9ae75326bf9
2021-08-31 17:32:55 -07:00
0457a85d45 Revert D30543236: Add python mode
Test Plan: revert-hammer

Differential Revision:
D30543236 (4bd03b0242)

Original commit changeset: ef5444d96a5a

fbshipit-source-id: b0042ac2c22765fa11d6d00bf751f6a4489eb6d8
2021-08-31 15:28:33 -07:00
6c8cb9bd76 [DataPipe] export fork, mux, demux for public usage (#64279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64279

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30671971

Pulled By: NivekT

fbshipit-source-id: 056ac12ef7183b254d1eec341145594639e47ef6
2021-08-31 14:34:30 -07:00
491bf7cb74 [DataPipe] adding description, __len__, tests for mux() (#64224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64224

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30651551

Pulled By: NivekT

fbshipit-source-id: f8af98ba71a592900b992a8077432062ec57bb48
2021-08-31 14:34:28 -07:00
9a0456939b Try the forked checkout action with retry (#64120)
Summary:

The main difference is:
ffc6f93ad4

We can test multiple times in this PR to see if it works; I will make the `retry` number configurable if it's usable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64120

Reviewed By: malfet

Differential Revision: D30656099

Pulled By: zhouzhuojie

fbshipit-source-id: a89932196bb0c44e412a34664ed6a061b02ef92e
2021-08-31 14:34:26 -07:00
13484084a6 fix syntax error in bfloat16 PR (#64122)
Summary:
Fixes a prior syntax error from the bfloat16 PR. cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64122

Reviewed By: H-Huang

Differential Revision: D30643596

Pulled By: ngimel

fbshipit-source-id: 0a2d5a40fb6dc7339cd03112e57ef0e1bf8a000e
2021-08-31 14:33:12 -07:00
8d08b103be [CUDA graphs] Prototype API and documentation (#63269)
Summary:
RFC: https://github.com/pytorch/pytorch/issues/61880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269

Reviewed By: mruberry

Differential Revision: D30596643

Pulled By: ngimel

fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be
2021-08-31 13:34:23 -07:00
1c2b5e59ae Remove ref to test_distributed_fork (#64197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64197

Removes this line as test is gone.
ghstack-source-id: 136986275

Test Plan: CI

Reviewed By: walterddr

Differential Revision: D30642929

fbshipit-source-id: a0c7dfdfb35a4a7f7ec1b881dbea53d85136012c
2021-08-31 13:31:27 -07:00
555171a273 .circleci: Remove migrated jobs, move docs builds (#64222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64222

Removes both backwards_compat as well as docs_test from the general
gcc5.4 config and moves the docs build from being run on every PR to
only being run on master.

We can remove docs builds when we migrate the docs push job (including
all secrets associated with that)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30650953

Pulled By: seemethere

fbshipit-source-id: ac11da6a551a6c81f3dc1d47fd81846cbfe9975a
2021-08-31 13:30:13 -07:00
347ef69529 [ao][docs] Clarify operator support for quantization (#63270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63270

Add table to quantization main page showing supported modules
for static and dynamic quantization.
ghstack-source-id: 137087204

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D30658654

fbshipit-source-id: a82c998e1db6370596d5b0ca4c7cc96c1c90f30e
2021-08-31 12:32:47 -07:00
3a46edb8d8 ns for fx: make layer types more readable (#64270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64270

Before this PR, layer types were populated by doing
`str(module_instance)` and `str(function)`. This resulted
in moderately readable strings for modules, and poorly readable
strings for functions.

This PR switches the logic to use `torch.typename` utility instead.
The results are significantly more readable.

Example function type:

```
# before
'<built-in method linear of PyCapsule object at 0x7fe9b20ce7b0>'

# after
'torch._ops.quantized.PyCapsule.linear'
```

Example module type:

```
# before
"<class 'torch.nn.quantized.modules.conv.Conv2d'>"

# after
'torch.nn.quantized.modules.conv.Conv2d'
```

Test Plan:
Manually inspect NS results for modules and functions, verify they are
more readable.

Imported from OSS

Differential Revision: D30669545

Reviewed By: jerryzh168

Pulled By: vkuzo

fbshipit-source-id: 60959e5cafa0a4992b083bf99f5d8260f9acdac0
2021-08-31 12:31:34 -07:00
845bc89811 [fx2trt] Add acc_ops.sign and converter for it (#63876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63876

Add `acc_ops.sign` which maps from `torch.sign`.

Add a plugin (not support dynamic shape currently) for `acc_ops.sign`. The plugin calls `at::sign` directly.

Test Plan: buck test mode/opt -c python.package_style=inplace -c fbcode.nvcc_arch=a100 caffe2/torch/fb/fx2trt:test_unary_ops

Reviewed By: yinghai

Differential Revision: D30518081

fbshipit-source-id: a0b9e6c30deac0b04b8cb09a162579e229985330
2021-08-31 11:31:53 -07:00
83e28a7d28 Use stacklevel for floordiv deprecation warnings (#64034)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60548

`Tensor.__floordiv__` was indirectly deprecated by deprecation of `torch.floor_divide` (see https://github.com/pytorch/pytorch/issues/43874). Deprecating it directly provides clearer feedback.

Repro:
```
import torch
x = torch.tensor(0)
x // 1
```

Before this change, a deprecation warning was triggered within the C++ implementation of floor_divide:
```
UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:571.)
  return torch.floor_divide(self, other)
```

After this change, the warning instead cites the user's offending line of Python code:
```
UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  x // 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64034

Reviewed By: mruberry

Differential Revision: D30658010

Pulled By: saketh-are

fbshipit-source-id: b0e6c5008d741897509d102f4a89efb47de4aa2a
2021-08-31 11:27:56 -07:00
b9275a4003 [ao][docs] Add description of qconfig and qengine to quantization page (#63582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63582

Current quantization docs do not define qconfig and qengine. Added text to define these concepts before they are used.
ghstack-source-id: 137051719

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D30658656

fbshipit-source-id: a45a0fcdf685ca1c3f5c3506337246a430f8f506
2021-08-31 10:33:07 -07:00
ca8dd296ee Add OpInfo for nn.functional.cosine_similarity (#62959)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

Notes:

* Some redundant tests from `test_nn.py` have been removed. I'm unsure whether the precision checks can be removed as well.
* Broadcasting is also checked in the OpInfo for `cosine_similarity`.

cc: mruberry zou3519 Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62959

Reviewed By: heitorschueroff

Differential Revision: D30520176

Pulled By: zou3519

fbshipit-source-id: 14e902eb4bcce875edab28a1669a2ea021052b9b
2021-08-31 10:31:36 -07:00
0ef8760bf6 [DataPipe] implementing __len__ for fork (no valid length for demux) (#64215)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64215

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30648672

Pulled By: NivekT

fbshipit-source-id: 4780f2f6a79ae15a4009092475e7d92f96dd09a2
2021-08-31 08:32:31 -07:00
0deb7a0bc0 [DataPipe] implementing demux() (#63650)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63650

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493944

Pulled By: NivekT

fbshipit-source-id: 0aa06dee8c7fb1744975b8f6a0694b90c11ef80d
2021-08-31 08:32:29 -07:00
eee054e6ea [DataPipe] implementing fork() (#63649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63649

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493945

Pulled By: NivekT

fbshipit-source-id: 40db7d4134facd266d86bc0dc2edf2729c4e5842
2021-08-31 08:32:27 -07:00
67cb131458 Revert D30327514: [Pytorch lite predictor] Use KinetoEdgeCPUProfiler for operator profiling.
Test Plan: revert-hammer

Differential Revision:
D30327514 (bc9277dca3)

Original commit changeset: 3bb2f2daaaed

fbshipit-source-id: 0b2aa7c57d08de77c9aaa75e546a7d0938610f64
2021-08-31 08:30:36 -07:00
3c15822f5f [Static Runtime] Implement aten::nonzero out variant (#64126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126

Test Plan:
Confirm out variant is called:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30617729

fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b
2021-08-31 00:51:15 -07:00
a3d6dae319 Automated submodule update: FBGEMM (#64213)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9d69998df6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64213

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30647878

fbshipit-source-id: b903b39441b4e28dda7eab226ac874e2227e750a
2021-08-30 21:33:17 -07:00
bc9277dca3 [Pytorch lite predictor] Use KinetoEdgeCPUProfiler for operator profiling. (#63367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63367

This diff changes the way operator profiling is done in lite predictor
benchmarking binary.
Instead of using custom callbacks it uses KinetoEdgeCPUProfiler to profile
events and then generate operator level metrics from them.
Since KinetoEvents do not contain cpu clock time, we now report only wallclock
time.
This unifies the various profiling efforts that we have for benchmarking purposes. In
production we will still use the observer based mechanism, but the advantage of
using the kineto profiler is that we get a few other things for free, such as:
- chrome trace generation
- operator level memory profiling (to be added)
- flop counts (to be added)

Furthermore, we could use a python post-processing script to parse the chrome
trace and generate output similar to torch.profiler. (To be done)

Test Plan:
aibench run
Model without debug info:
https://www.internalfb.com/intern/aibench/details/219598441154763
Model with debug info and `--print_module_info true` (see Operator summary has now module hierarchy information).
https://www.internalfb.com/intern/aibench/details/617154236292985

Reviewed By: raziel

Differential Revision: D30327514

fbshipit-source-id: 3bb2f2daaaedfb04bd6f5d9c91292783f9c4344f
2021-08-30 20:54:51 -07:00
7ca4728e6d Compile BatchLinearAlgebra without nvcc (#64146)
Summary:
These files only use cuda libraries interfaces, so don't actually need to be compiled with nvcc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64146

Reviewed By: ezyang

Differential Revision: D30633189

Pulled By: ngimel

fbshipit-source-id: c9d0ae5259a10cb49332d31f0da89ad758736ea8
2021-08-30 20:18:21 -07:00
e7fb35021a [nnc] Enable fusion of bfloat16 ops (#64196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64196

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30643864

Pulled By: bertmaher

fbshipit-source-id: e95edeaf7089464d713ea1d1f951743d3e5f61c5
2021-08-30 20:09:36 -07:00
538647fe1f [WIP][FX] BC guarantees for 1.10 (#63888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63888

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30523133

Pulled By: jamesr66a

fbshipit-source-id: b04cc0d842a74862f42ecba98b757310cd2ec7b0
2021-08-30 19:56:46 -07:00
09dfaa0339 add operation list for AutocastCPU (#63534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63534

In this PR:
* We have changed the default dtype of `AutocastCPU` from `float16` to `bfloat16`, as discussed in https://github.com/pytorch/pytorch/pull/61002 (see the sketch below).
* We also update the list of operations which need casting to `lower_precision_fp` or `float32`.
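
A minimal sketch of the behavior change (assuming the torch.cpu.amp.autocast
entry point):

```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)
# With this change the default autocast dtype on CPU is bfloat16.
with torch.cpu.amp.autocast():
    c = torch.mm(a, b)
print(c.dtype)  # expected: torch.bfloat16, as mm is cast to lower_precision_fp
```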

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30644914

Pulled By: ezyang

fbshipit-source-id: 8b93485ba452b3759611e3f0ac88e920fe495ac1
2021-08-30 19:30:33 -07:00
93f1090267 Update contribution_guide.rst (#64142)
Summary:
Grammatical update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64142

Reviewed By: mruberry

Differential Revision: D30639394

Pulled By: ezyang

fbshipit-source-id: cf1a4dfbd8e34b0772f1b09f5d820278e8ef8574
2021-08-30 19:26:59 -07:00
6b85c99ce5 Avoid an unnecessary list creation in DataChunk (#64111)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64111

Reviewed By: mruberry

Differential Revision: D30639383

Pulled By: ezyang

fbshipit-source-id: 96b243307413c99a67d55d862a71937e1ef210f4
2021-08-30 19:25:42 -07:00
c7c711bfb8 Add optional tensor arguments to (#63967)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63435

Adds optional tensor arguments to the torch function handling checks. The only one I didn't do this for in the functional file was `multi_head_attention_forward`, since that already took care of some optional tensor arguments but not others, so it seemed like the arguments were specifically chosen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63967

Reviewed By: albanD

Differential Revision: D30640441

Pulled By: ezyang

fbshipit-source-id: 5ef9554d2fb6c14779f8f45542ab435fb49e5d0f
2021-08-30 19:21:26 -07:00
cb7cf823b3 add BFloat16 support for fold and unfold on CPU (#62880)
Summary:
Add BFloat16 support for fold and unfold operators on CPU.
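
A minimal sketch exercising the new dtype support:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, dtype=torch.bfloat16)
patches = F.unfold(x, kernel_size=2, stride=2)                    # (1, 12, 16)
y = F.fold(patches, output_size=(8, 8), kernel_size=2, stride=2)
print(patches.dtype, y.shape)  # torch.bfloat16 torch.Size([1, 3, 8, 8])
```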

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62880

Reviewed By: iramazanli

Differential Revision: D30576387

Pulled By: zou3519

fbshipit-source-id: c48f6e56702bfea34448db1b3a1634c49c5d8ec8
2021-08-30 19:14:10 -07:00
ffc2612087 Add acc_gpu_kernel_with_scalars and port add to use it (#63884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63884

See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302
for explanation of what's going on here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30545296

Pulled By: ezyang

fbshipit-source-id: f0da52153ae63599fe1d57e90e73f50ca2116939
2021-08-30 19:10:16 -07:00
a49907f984 Modify inline doc for DataPipe (#64221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64221

List of tasks in this PR
- [x]  Add inline doc for DataPipe
- [x] Improve the inline doc
- [x] Expose DataPipe to `datapipes.iter` (`UnBatcher`). Note: `Forker`, `Demux`, and `Mux` are exposed in another PR authored by Kevin
- [x] Add correct typing to DataPipe
- [x] Unify the argument to `datapipe` rather than `source_datapipe`

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30650541

Pulled By: ejguan

fbshipit-source-id: c09d1b9742b8097d8e645c15947cef80c876877b
2021-08-30 18:45:46 -07:00
af85bc5ffd Replace group_by_key by group_by IterDataPipe (#64220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220

Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30650542

Pulled By: ejguan

fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
2021-08-30 18:45:44 -07:00
4bd03b0242 Add python mode (#63496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63496

This PR adds a (private) enable_python_mode context manager
(see torch/utils/_python_dispatch.py).
enable_python_mode accepts the type of a __torch_dispatch__ object
as its argument. Whenever an operator gets called inside of the
context manager, it dispatches to the __torch_dispatch__ of
the passed-in type.

Example usage:
```
with enable_python_mode(LoggingTensor):
    z = torch.empty([])
    assert isinstance(z, LoggingTensor)
```
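
For context, the kind of class enable_python_mode accepts looks roughly like
this (a minimal sketch, not the actual LoggingTensor used in the tests):

```python
import torch

class LoggingTensorSketch(torch.Tensor):
    # A minimal sketch of a __torch_dispatch__ type: log each op,
    # then redispatch to the regular implementation.
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        print("called:", func)
        return func(*args, **(kwargs or {}))
```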

There are quite a few changes that were made to support this.

First, we added TorchDispatchTypeObject, a C++ struct that represents the
type of a `__torch_dispatch__` object (e.g. LoggingTensor).
It holds both the PyObject* representing the class and a PyInterpreter*
so we know which Python interpreter it came from.

Next, we updated the concrete_dispatch_fn in python_variable.cpp to accept
a `const std::shared_ptr<TorchDispatchTypeObject>&` argument. When this
is null, dispatching happens as usual. When it is non-null, we prepend
the TorchDispatchTypeObject's PyObject* to the overloaded args list so that
it is considered first for dispatch.

To get that to work, we changed how `handle_torch_dispatch_no_python_arg_parser`
works. The "overloaded args list" previously only consisted of Tensor PyObjects,
but now it can have types in addition to Tensors!
- We renamed `append_overloaded_arg` to `append_overloaded_tensor`
- We added a new `append_overloaded_type` that appends a type to
overloaded_args
- We added special handling in `handle_torch_dispatch_no_python_arg_parser`
and `append_overloaded_arg` to handle types in addition to Tensors.

Then, there is PythonMode and PythonModeTLS.
- We reuse the DispatchKey::Python dispatch key as a mode key
- We use PythonMode::enter and PythonMode::exit to enable/disable
DispatchKey::Python and set the PythonModeTLS.
- PythonModeTLS stores a TorchDispatchTypeObject as metadata.
- PythonMode is in libtorch_python, and PythonModeTLS is in ATen.
This split is due to the libtorch_python library boundary (because we need
to save TLS in ATen/ThreadLocalState)
- We modify the PythonFallbackKernel to look up
the relevant TorchDispatchTypeObject (if Python Mode is active) and
dispatch using it.

There are two more miscellaneous changes:
- internal_new_from_data (torch/csrc/utils/tensor_new.cpp) gets an
exclude guard. enable_python_mode currently does not handle
torch.tensor and the exclude guard is to prevent a bug.

Future:
- This PR does not allow for the nesting of Python modes. In the future we
should be able to enable this with a more sane no_dispatch API and by changing
the TLS to a stack. For now I did not need this for CompositeImplicitAutograd testing.

Test Plan: - new tests

Reviewed By: malfet, albanD

Differential Revision: D30543236

Pulled By: zou3519

fbshipit-source-id: ef5444d96a5a957d1657b7e37dce80f9a497d452
2021-08-30 18:44:35 -07:00
ebc0aacf83 [nnc] Fix half2float conversion and re-enable float16 (#64199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64199

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30643865

Pulled By: bertmaher

fbshipit-source-id: 9de6adca53bd08839328cbaf6364f7de9550264b
2021-08-30 18:37:55 -07:00
1f16c22dc8 [Static Runtime] Implement aten::cumsum out variant (#64159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30622819

fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
2021-08-30 16:18:22 -07:00
5401159b8f OpInfo for nn.functional.interpolate (#61956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61956

Each mode goes through a different implementation so they are listed as
different variants.

Test Plan: - run tests

Reviewed By: malfet

Differential Revision: D30013751

Pulled By: zou3519

fbshipit-source-id: 4253b40b55667d7486ef2d98b441c13d807ab292
2021-08-30 16:00:43 -07:00
a7ae73a238 BUG Fixes regression for nllloss gradcheck (#64203)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64163

This PR includes the fix and the opinfo from https://github.com/pytorch/pytorch/pull/63854/ for non-regression testing.

cc albanD mruberry jbschlosser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64203

Reviewed By: albanD

Differential Revision: D30647522

Pulled By: jbschlosser

fbshipit-source-id: 2974d299763505908fa93532aca2bd5d5b71f2e9
2021-08-30 15:13:09 -07:00
ad4848565e Enable Half, BFloat16, and Complex dtypes for coo-coo sparse matmul [CUDA] (#59980)
Summary:
This PR enables Half, BFloat16, ComplexFloat, and ComplexDouble support for matrix-matrix multiplication of COO sparse matrices.
The change is applied only to CUDA 11+ builds.

`cusparseSpGEMM` also supports `CUDA_C_16F` (complex float16) and `CUDA_C_16BF` (complex bfloat16). PyTorch also supports the complex float16 dtype (`ScalarType::ComplexHalf`), but there is no convenient dispatch, so this dtype is omitted in this PR.
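
A minimal sketch of the newly supported dtypes (requires a CUDA 11+ build, per
the above):

```python
import torch

a = torch.randn(64, 64, device="cuda", dtype=torch.half).to_sparse()
b = torch.randn(64, 64, device="cuda", dtype=torch.half).to_sparse()
c = torch.sparse.mm(a, b)    # COO x COO matmul, now supported for half
print(c.dtype, c.is_sparse)  # torch.float16 True
```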

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59980

Reviewed By: ngimel

Differential Revision: D29699456

Pulled By: cpuhrsch

fbshipit-source-id: 407ae53392acb2f92396a62a57cbaeb0fe6e950b
2021-08-30 15:06:25 -07:00
c3464e78a4 Revert D30561459: Fix bytes_written and bytes_read
Test Plan: revert-hammer

Differential Revision:
D30561459 (e98173ff34)

Original commit changeset: 976fa5167097

fbshipit-source-id: 43f4c234ca400820fe6db5b4f37a25e14dc4b0dd
2021-08-30 14:59:54 -07:00
e4fd2ab59c Back out "Added reference tests to ReductionOpInfo" (#64183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64183

Original commit changeset: 6a1f82ac2819

Test Plan: CI

Reviewed By: soulitzer

Differential Revision: D30639835

fbshipit-source-id: e238043c6fbd0453317a9ed219e348298f98aaca
2021-08-30 14:48:10 -07:00
8f88f797db [quant][graphmode][fx] Add reference quantized conv module (#63828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63828

Added a reference quantized conv module for the custom backend flow; the reference quantized module will
have the following code:
```
        w(float) -- quant - dequant \
        x(float) ------------- F.conv2d ---
```
In the full model, we will see
```
        w(float) -- quant - *dequant \
        x -- quant --- *dequant --  *F.conv2d --- *quant - dequant
```
and the backend should be able to fuse the ops with `*` into a quantized conv2d

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_conv_linear_reference

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30504749

fbshipit-source-id: e1d8c43a0e0d6d9ea2375b8ca59a9c0f455514fb
2021-08-30 14:23:17 -07:00
65050ec924 Back out "[JIT] Add aten::slice optimization"
Summary:
Original commit changeset: d12ee39f6828
build-break
overriding_review_checks_triggers_an_audit_and_retroactive_review
Oncall Short Name: dskhudia

Test Plan: Local run succeeds

Differential Revision: D30633990

fbshipit-source-id: 91cf7cc0ad7e47d919347c2a1527688e062e0c62
2021-08-30 14:05:04 -07:00
09e53c0cfe .github: Adding configuration for backwards_compat (#64204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64204

Adds backwards_compat to our existing test matrix for github actions

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30646764

Pulled By: seemethere

fbshipit-source-id: f0da6027e29fab03aff058cb13466fae5dcf3678
2021-08-30 13:59:00 -07:00
9035a1cb4d .github: Adding configuration for docs_test (#64201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64201

Adds docs_test to our existing test matrix for github actions

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30646765

Pulled By: seemethere

fbshipit-source-id: 946adae01ff1f1f7ebe626e408e161b77b19a011
2021-08-30 13:57:20 -07:00
85df73658c Make name() part of IMethod interface (#63995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63995

JIT methods already have name() in their interface, and Py methods have names in their implementation.  I'm adding this for a particular case where someone tried to use name() on a JIT method that we're replacing with an IMethod.

Test Plan: add case to imethod API test

Reviewed By: suo

Differential Revision: D30559401

fbshipit-source-id: 76236721f5cd9a9d9d488ddba12bfdd01d679a2c
2021-08-30 13:31:55 -07:00
b9933f08b9 Fix type annotation in tools/nightly.py (#64202)
Summary:
`tempfile.TemporaryDirectory` is a generic only in Python 3.9 and above.

Work around this by wrapping the type annotation in quotes.

Fixes https://github.com/pytorch/pytorch/issues/64017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64202

Reviewed By: janeyx99

Differential Revision: D30644215

Pulled By: malfet

fbshipit-source-id: 3c16240b9fa899bd4d572c1732a7d87d3dd0fbd5
2021-08-30 13:27:43 -07:00
f3e329cbec Implements the orthogonal parametrization (#62089)
Summary:
Implements an orthogonal / unitary parametrisation.

It passes the tests and I have trained a couple of models with this implementation, so I believe it should be somewhat correct. Now, the implementation is very subtle. I'm tagging nikitaved and IvanYashchuk as reviewers in case they have comments or see some room for optimisation of the code, in particular of the `forward` function.
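
Usage looks roughly like this (a minimal sketch using the
torch.nn.utils.parametrizations.orthogonal entry point added here):

```python
import torch
from torch import nn
from torch.nn.utils import parametrizations

layer = nn.Linear(5, 5)
parametrizations.orthogonal(layer, name="weight")  # register the parametrization
Q = layer.weight                                   # recomputed on access
print(torch.allclose(Q.T @ Q, torch.eye(5), atol=1e-5))  # True: Q is orthogonal
```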

Fixes https://github.com/pytorch/pytorch/issues/42243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62089

Reviewed By: ezyang

Differential Revision: D30639063

Pulled By: albanD

fbshipit-source-id: 988664f333ac7a75ce71ba44c8d77b986dff2fe6
2021-08-30 13:12:07 -07:00
e98173ff34 Fix bytes_written and bytes_read (#64040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64040

In operator cost inference functions, in many places we are using sizeof(x.data_type()). Since data_type() returns a 32 bit integer from [this enum](https://www.internalfb.com/code/fbsource/[15e7ffe4073cf08c61077c7c24a4839504b964a2]/fbcode/caffe2/caffe2/proto/caffe2.proto?lines=20), we are basically always getting 4 for sizeof(x.data_type()) no matter what actual data type x has. Big thanks to Jack Langman for specifically pointing to this bug.

We now use the size in bytes based on the actual data type.

Test Plan:
Added unit tests BatchMatMulMemCostTest:

buck test //caffe2/caffe2/fb/fbgemm:batch_matmul_op_test -- BatchMatMulMemCostTest

Extended existing unit test test_columnwise_concat for different data types:

buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test -- test_columnwise_concat

Differential Revision: D30561459

fbshipit-source-id: 976fa5167097a35af548498480001aafd7851d93
2021-08-30 12:57:31 -07:00
eafe33c995 remove componentwise comparison of complex values in torch.testing.assert_close (#63841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63841

Closes #61906.

cc ezyang gchanan

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30633526

Pulled By: mruberry

fbshipit-source-id: ddb5d61838cd1e12d19d0093799e827344382cdc
2021-08-30 12:38:44 -07:00
401bbb2aa0 remove componentwise comparison of complex values in TestCase.assertEqual (#63572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63572

Addresses #61906. The issue will be fixed later in the stack when `torch.testing.assert_close` gets the same treatment.

cc ezyang gchanan

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30633527

Pulled By: mruberry

fbshipit-source-id: c2002a4998a7a75cb2ab83f87190bde43a9d4f7c
2021-08-30 12:36:45 -07:00
a8ffe81b2c Bring back old algorithm for sorting on small number of segments (#64127)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63456
The code was copy-pasted from the previous commit without modification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64127

Reviewed By: mruberry

Differential Revision: D30632090

Pulled By: ngimel

fbshipit-source-id: 58bbdd9b0423f01d4e65e2ec925ad9a3f88efc9b
2021-08-30 12:30:50 -07:00
d37636901e [Doc] make_tensor to torch.testing module (#63925)
Summary:
This PR aims to add `make_tensor` to the `torch.testing` module in PyTorch docs.
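
For reference, a minimal usage sketch (assuming the signature at the time of
this PR, where device and dtype are positional):

```python
import torch
from torch.testing import make_tensor

# Random float32 CPU tensor with values drawn from [0, 1).
t = make_tensor((2, 3), torch.device("cpu"), torch.float32, low=0, high=1)
print(t.shape, t.dtype)
```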

TODOs:

* [x] Add examples

cc: pmeier mruberry brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63925

Reviewed By: ngimel

Differential Revision: D30633487

Pulled By: mruberry

fbshipit-source-id: 8e5a1f880c6ece5925b4039fee8122bd739538af
2021-08-30 12:25:40 -07:00
5b0dfd0f8a Fix bad use of channels last kernel in sync batch norm backward (#64100)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64039

There are two distinct problems here.
1. If `grad_output` is channels last but the input is not, then the input would be read as if it were channels last, i.e. the wrong values were being read.
2. `use_channels_last_kernels` doesn't guarantee that `suggest_memory_format` will actually return channels last, so use `empty_like` instead so the strides always match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64100

Reviewed By: mruberry

Differential Revision: D30622127

Pulled By: ngimel

fbshipit-source-id: e28cc57215596817f1432fcdd6c49d69acfedcf2
2021-08-30 12:16:30 -07:00
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Misuse of raw pointer in here where stack is never nullable.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
93d2e5090f Improve performance of index_select by avoiding item (#63008)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/61788

From a CUDA perspective: item already pulls all Tensor content onto the host (albeit one-by-one), which incurs very expensive memory transfers. This way we'll do it all at once.
From a CPU perspective: item has a lot of overhead as a native function in comparison to simply using a pointer.

Overall there are still lots of performance gains to be had, but this is a small change that should take us into a more usable landscape. This doesn't land a separate benchmark, but I postulate that's not necessary to decide on the benefit of this change (we'll also see if it shows up indirectly); a dedicated benchmark is still a good follow-up item.
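
A toy sketch contrasting the per-element .item() pattern with a single
index_select call (the hot path fixed here is internal; this only illustrates
the cost model):

```python
import torch

src = torch.randn(10_000, device="cuda")
idx = torch.randint(0, 10_000, (1_000,), device="cuda")

# Each .item() call copies one scalar to the host and synchronizes.
slow = torch.stack([src[i.item()] for i in idx])
# index_select moves everything in one kernel launch.
fast = src.index_select(0, idx)
print(torch.equal(slow, fast))  # True
```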

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63008

Reviewed By: zou3519

Differential Revision: D30211160

Pulled By: cpuhrsch

fbshipit-source-id: 70b752be5df51afc66b5aa1c77135d1205520cdd
2021-08-30 09:50:41 -07:00
e24c3644d8 [Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157

The UseVariadicCat optimization is not applied to aten::cat if the list input to the op cannot be moved to the position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we need an out version for Static Runtime.

Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30598574

fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
2021-08-30 09:42:38 -07:00
16ecdbbaa2 [PyTorch] Fix missing move in unpickler (#63974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63974

Saw some time spent in this during model loading; there's no reason not to move here.
ghstack-source-id: 136760979

Test Plan: Re-profile model loading on devserver; IValue copy ctor time has gone down

Reviewed By: dhruvbird

Differential Revision: D30548923

fbshipit-source-id: 42000f2e18582762b43353cca10ae094833de3b3
2021-08-30 09:38:55 -07:00
9777887f0e [PyTorch] Reduce copies/refcount bumps in BytecodeDeserializer::parseMethods (#63961)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63961

Saw a report that this function was slow and was doing unexplained vector copies. First pass to remove a bunch of copying.
ghstack-source-id: 136760976

Test Plan:
Pixel 3
before: https://our.intern.facebook.com/intern/aibench/details/461850118893980
after: https://www.internalfb.com/intern/aibench/details/48965886029524

MilanBoard failed to return data from simpleperf

Reviewed By: dhruvbird

Differential Revision: D30544551

fbshipit-source-id: 0e2b5471a10c0803d52c923e6fb5625f5542b99d
2021-08-30 09:37:10 -07:00
dc4fd3bdda [MicroBench] Added a micro benchmark for a signed log1p kernel. (#64032)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64032

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30579198

Pulled By: navahgar

fbshipit-source-id: a53d68225fba768b26491d14b535f8f2dcf50c0e
2021-08-30 09:27:51 -07:00
f79df24859 Automated submodule update: FBGEMM (#64149)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: f6dfed87a1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64149

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30632209

fbshipit-source-id: aa1cebaf50169c3a93dbcb994fa47e29d6b6a0d7
2021-08-30 08:30:57 -07:00
82174330d0 [DataLoader2] Adding Messages, Protocols, Loop wrappers (#63882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63882

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30627452

Pulled By: VitalyFedyunin

fbshipit-source-id: 561ea2df07f3572e04401171946154024126387b
2021-08-30 07:57:20 -07:00
7701ea48be remove one more distributed test (#64108)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/62896: one more place where the distributed test should be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64108

Reviewed By: janeyx99, soulitzer

Differential Revision: D30614062

Pulled By: walterddr

fbshipit-source-id: 6576415dc2d481d65419da19c5aa0afc37a86cff
2021-08-30 07:51:11 -07:00
093a12aaa9 [nnc] Updated internal asserts to include more detailed error messages (#64118)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64118

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30616944

Pulled By: navahgar

fbshipit-source-id: 35289696cc0e7faa01599304243b86f0febc6daf
2021-08-30 04:40:51 -07:00
a836d83957 [nnc] Fixed warning due to implicit parameter conversion (#64117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64117

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30616945

Pulled By: navahgar

fbshipit-source-id: eaf69232ac4a684ab5f97a54a514971655f86ef3
2021-08-30 04:39:34 -07:00
d3bcba5f85 ENH Adds label_smoothing to cross entropy loss (#63122)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/7455

Partially resolves pytorch/vision#4281
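
A minimal usage sketch of the new argument (shapes and the smoothing value are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)        # (batch, classes)
target = torch.randint(10, (8,))   # class indices

loss = F.cross_entropy(logits, target, label_smoothing=0.1)

# Equivalent module form:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, target)
```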

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63122

Reviewed By: iramazanli

Differential Revision: D30586076

Pulled By: jbschlosser

fbshipit-source-id: 06afc3aa1f8b9edb07fe9ed68c58968ad1926924
2021-08-29 23:33:04 -07:00
8af1407eab [Static Runtime] Out version for torch.linalg.norm (#64070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30595816

fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
2021-08-29 21:00:11 -07:00
44e3ed88c9 [quant] AO migration of the quantize.py (#64086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64086

AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.

This migrates the `quantize.py` from torch.quantization to `torch.ao.quantization`.

At this point both locations will be supported. Eventually torch.quantization will be deprecated.

Test Plan: `buck test mode/opt //caffe2/test:quantization`

Reviewed By: jerryzh168, raghuramank100

Differential Revision: D30055886

fbshipit-source-id: 8ef7470f9fa640c0042bef5bb843e7a05ecd0b9f
2021-08-29 20:30:01 -07:00
29ad84f252 Removes beta warning from the special module documentation (#64148)
Summary:
Updates documentation per feature review. torch.special is now stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64148

Reviewed By: ngimel

Differential Revision: D30632049

Pulled By: mruberry

fbshipit-source-id: 8f6148ec7737e7b3a90644eeca23eb217eda513d
2021-08-29 19:38:46 -07:00
c5ed31e4a7 add channel last support for MaxUnpool2d (#49984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49984

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007051

Pulled By: VitalyFedyunin

fbshipit-source-id: 6c54751ade4092e03c1651aaa60380f7d6e92f6b
2021-08-29 18:37:10 -07:00
9db56531f7 Revert D30620966: [pytorch][PR] Move Parallel[Native|TBB] to GHA
Test Plan: revert-hammer

Differential Revision:
D30620966 (223f886032)

Original commit changeset: 9a23e4b3e168

fbshipit-source-id: b9248d377b9a7b850dfb3f10f3350fbc9855acfe
2021-08-29 15:51:27 -07:00
710a2e933f [DOC] Add doc for maybe_wrap_dim (#63161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63161

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30629451

Pulled By: tugsbayasgalan

fbshipit-source-id: b03f030f197e10393a8ff223b240d23c30858028
2021-08-29 14:19:28 -07:00
7ebdbf82dc add support for sending cpu sparse tensors over rpc (#62794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62794

This PR updates JIT serialization to support pickling sparse COO tensors.
It also updates message.cpp to support sparse COO tensors.
A bug was filed a few years ago https://github.com/pytorch/pytorch/issues/30807.

I tested the fix by adding sparse tensor tests to rpc_test.py and dist_autograd_test.py.
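
For illustration, a minimal hedged sketch of sending a sparse COO tensor over RPC (the worker names and RPC initialization are assumed to be set up elsewhere):

```python
import torch
import torch.distributed.rpc as rpc

def double(t):
    return t + t

# Assumes rpc.init_rpc(...) has already been called on "worker0"/"worker1".
i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([3.0, 4.0])
sparse = torch.sparse_coo_tensor(i, v, (2, 2))

result = rpc.rpc_sync("worker1", double, args=(sparse,))
```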

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23 gmagogsfm

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30608848

Pulled By: gcramer23

fbshipit-source-id: 629ba8e4a3d8365875a709c9b87447c7a71204fb
2021-08-29 11:35:00 -07:00
52d7dd7398 [DOC] improve docstring for Optimizer.state_dict (#63153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63153

Fixes: https://github.com/pytorch/pytorch/issues/60121
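
For reference, a minimal sketch of the structure the improved docstring describes (values illustrative):

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(1, 2)).sum().backward()
opt.step()

sd = opt.state_dict()
print(sd.keys())                        # dict_keys(['state', 'param_groups'])
# 'param_groups' refers to parameters by integer id, not by tensor:
print(sd["param_groups"][0]["params"])  # e.g. [0, 1]
```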

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D30629462

Pulled By: tugsbayasgalan

fbshipit-source-id: a9160e02ac53bb1a6219879747d73aae9ebe4d2f
2021-08-29 10:20:58 -07:00
371c6612b3 Automated submodule update: FBGEMM (#64141)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9939bac9de

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64141

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30629417

fbshipit-source-id: 1b1ad3d4caff925f798b86b358ab193554c9b8e0
2021-08-29 09:58:04 -07:00
2e6221a232 [nnc] Make 64-bit dimensions work (#64077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077

We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
ghstack-source-id: 136933272

Test Plan: unit tests; new IR level test with huge sizes

Reviewed By: ZolotukhinM

Differential Revision: D30596689

fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
2021-08-28 19:59:47 -07:00
405c15516c Parse int64 sizes/strides (#64076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64076

We were parsing sizes into int32s, so if you had a tensor with more
than 2^32 elements, you couldn't represent it.
ghstack-source-id: 136933273

Test Plan: parseIR with size of 4e9

Reviewed By: ZolotukhinM

Differential Revision: D30521116

fbshipit-source-id: 1e28e462cba52d648e0e2acb4e234d86aae25a3e
2021-08-28 19:58:34 -07:00
4f969db325 [nnc] Fix batchnorm implementation (#64112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64112

Fixes #64062

Test Plan: Imported from OSS

Reviewed By: zhxchen17

Differential Revision: D30622897

Pulled By: bertmaher

fbshipit-source-id: 7d7c6131aa786e61fa1d0a517288396a0bdb1d22
2021-08-28 19:20:35 -07:00
aefa2f3e64 To add RMSProp algorithm documentation (#63721)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The following tracking issue lists all the necessary algorithms with links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we add a description of RMSProp to the documentation. For more details, we refer to the paper https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

<img width="464" alt="RMSProp" src="https://user-images.githubusercontent.com/73658284/131179226-3fb6fe5a-5301-4948-afbe-f38bf57f24ff.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63721

Reviewed By: albanD

Differential Revision: D30612426

Pulled By: iramazanli

fbshipit-source-id: c3ac630a9658d1282866b53c86023ac10cf95398
2021-08-28 15:55:56 -07:00
8b6266fe4f Automated submodule update: FBGEMM (#64129)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: f14e794814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64129

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30621549

fbshipit-source-id: 34c109e75c96a261bf370f7a06dbb8b9004860ab
2021-08-28 11:56:17 -07:00
223f886032 Move Parallel[Native|TBB] to GHA (#64123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64123

Reviewed By: driazati

Differential Revision: D30620966

Pulled By: malfet

fbshipit-source-id: 9a23e4b3e16870f77bf18df4370cd468603d592d
2021-08-28 11:50:30 -07:00
d0c63e857d Enhancement for smart serialization for out schemas (#63096)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63096

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D30415255

Pulled By: tugsbayasgalan

fbshipit-source-id: eb40440a3b46258394d035479f5fc4a4baa12bcc
2021-08-28 11:46:27 -07:00
f4496528e3 [Light] Fix error message (#64010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64010

Fixing typos in an error message

Test Plan:
Error message before fix:
Lite Interpreter verson number does not match. The model version must be between 3 and 5But the model version is 6

Error message after fix:
Lite Interpreter version number does not match. The model version must be between 3 and 5 but the model version is 6

Reviewed By: larryliu0820

Differential Revision: D30568367

fbshipit-source-id: 205f3278ee8dcf38579dbb828580a9e986ccacc1
2021-08-27 22:54:38 -07:00
0d0605eaa9 [quant][graphmode][fx] Add reference quantized linear module (#63627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63627

Added reference quantized linear module for the custom backend flow, the reference quantized module will
have the following code:
```
        w(float) -- quant - dequant \
        x(float) ------------- F.linear ---
```
In the full model, we will see
```
        w(float) -- quant - *dequant \
        x -- quant --- *dequant --  *F.linear --- *quant - dequant
```
and the backend should be able to fuse the ops with `*` into a quantized linear

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_conv_linear_reference

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30504750

fbshipit-source-id: 5729921745c2b6a0fb344efc3689f3b170e89500
2021-08-27 22:53:24 -07:00
a3a7a67048 [iOS][GPU] Consolidate array and non-array kernel for hardswish (#63369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63369

ghstack-source-id: 136918152

(Note: this ignores all push blocking failures!)

Test Plan:
- `buck test pp-macos`
- Op tests in PyTorchPlayground app
- Run mobilenetv3 test

https://pxl.cl/1Ncls

Reviewed By: xta0

Differential Revision: D30354454

fbshipit-source-id: 88bf4f8b5871e63170161b3f3e44f99b8a3086c6
2021-08-27 19:31:08 -07:00
9ccb9299e0 To add Nesterov Adam algorithm description to documentation (#63793)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The following tracking issue lists all the necessary algorithms with links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we add a description of the Nesterov Adam algorithm to the documentation. For more details, we refer to the paper https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ

<img width="439" alt="NAdam" src="https://user-images.githubusercontent.com/73658284/131185124-e81b2edf-33d9-4a9d-a7bf-f7e5eea47d7c.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63793

Reviewed By: NivekT

Differential Revision: D30617057

Pulled By: iramazanli

fbshipit-source-id: cd2054b0d9b6883878be74576e86e307f32f1435
2021-08-27 19:29:34 -07:00
07c5cb8c48 [Static Runtime] Optimize memory planner initialization (#64101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101

Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.

There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`. This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType`
* Precompute all `isOptimizableContainerType` during static runtime initialization to avoid a pass over all of each node's inputs.

Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: movefast1990

Differential Revision: D30595579

fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
2021-08-27 17:40:43 -07:00
2d75ab0c8f [TensorExpr] Update tutorial. (#64109)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64109

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30614050

Pulled By: ZolotukhinM

fbshipit-source-id: e8f9bd9ef2483e6eafbc0bd5394d311cd694c7b2
2021-08-27 16:19:29 -07:00
3abbcf079d .github: Add cpp_docs job to current gcc5 workflow (#64044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64044

Adds the cpp_docs job to the current workflow, also modifies the scripts
surrounding building docs so that they can be powered through
environment variables with sane defaults rather than having to have
passed arguments.

Ideally should not break current jobs running in circleci but those
should eventually be turned off anyways.

Coincides with work from:
* https://github.com/seemethere/upload-artifact-s3/pull/1
* https://github.com/seemethere/upload-artifact-s3/pull/2

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30610010

Pulled By: seemethere

fbshipit-source-id: f67adeb1bd422bb9e24e0f1ec0098cf9c648f283
2021-08-27 16:06:12 -07:00
6ccb74b837 Update codegen to use boxed kernel (#63459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63459

 - Replaces the usual registration when "requires_derivative" is True (i.e., we still need a grad_fn) but `fn.info` is `None` (TODO: maybe also make sure the number of differentiable inputs > 0 to match requires_derivative).
 - Adds some (temporary?) fixes to some sparse functions. See: https://github.com/pytorch/pytorch/issues/63549
 - To remove the codegen that generates the NotImplemented node (though that should only be one line): there are some ops listed under `RESET_GRAD_ACCUMULATOR` that have an extra function call. We would need to make this list of ops available to C++, which would mean either codegen-ing a list of strings or moving RESET_GRAD_ACCUMULATOR to C++ land. We could do this in a future PR if necessary.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30518571

Pulled By: soulitzer

fbshipit-source-id: 99a35cbced46292d1b4e51594ae4d534c2caf8b6
2021-08-27 15:01:50 -07:00
90a6498a12 Add autograd not implemented boxed fallback (#63458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63458

See description and discussion from https://github.com/pytorch/pytorch/pull/62450

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30518572

Pulled By: soulitzer

fbshipit-source-id: 3b1504d49abb84560ae17077f0dec335749c9882
2021-08-27 15:00:28 -07:00
8406dba65a Removing references to ProcessGroupAgent in comments (#64051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64051

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30587076

Pulled By: jaceyca

fbshipit-source-id: 414cb95faad0b4da0eaf2956c0668af057f93574
2021-08-27 14:47:37 -07:00
bdde898d9c Add README to datapipes (#63982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63982

Add a README to `datapipes` for developers. This can serve as a replacement for https://github.com/pytorch/pytorch/blob/master/torch/utils/data/datapipes_tutorial_dev_loaders.ipynb

After this PR is landed, the README.md will be added to PyTorch Wiki

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30554198

Pulled By: ejguan

fbshipit-source-id: 6091aae8ef915c7c1f00fbf45619c86c9558d308
2021-08-27 14:17:08 -07:00
358c46f99e Implement leaky relu op
Summary: Implemented leaky relu op as per: https://www.internalfb.com/tasks/?t=97492679

Test Plan:
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"

all tests pass, including new ones

Reviewed By: SS-JIA

Differential Revision: D30186225

fbshipit-source-id: fdb1f8f7b3a28b5504581822185c0475dcd53a3e
2021-08-27 13:52:49 -07:00
18cb3fc910 [FX] Validate data type of target on Node Construction (#64050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64050

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30585535

Pulled By: yqhu

fbshipit-source-id: 96778a87e75f510b4ef42f0e5cf76b35b7b2f331
2021-08-27 13:40:57 -07:00
ff4569ae29 Sparse CUDA: rename files *.cu -> *.cpp (#63894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63894

This PR introduces a few code structure changes. There is no need to use
.cu extension for pure c++ code without cuda. Moved
`s_addmm_out_csr_sparse_dense_cuda_worker` to a separate cpp file from
cu file.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30548771

Pulled By: cpuhrsch

fbshipit-source-id: 6f12d36e7e506d2fdbd57ef33eb73192177cd904
2021-08-27 13:22:54 -07:00
8fc1064b7f [PyTorch] Reduce code size of register_prim_ops.cpp (#61494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61494

Creating a constexpr array and then looping over it is much cheaper than emitting a function call per item.
ghstack-source-id: 136639302

Test Plan:
fitsships

Buildsizebot some mobile apps to check size impact.

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D29646977

fbshipit-source-id: 6144999f6acfc4e5dcd659845859702051344d88
2021-08-27 12:56:35 -07:00
6a76ee04de Adding alltoall_single collective to collective quantization API (#63154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154

The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16

Reviewed By: wanchaol

Differential Revision: D30255251

fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
2021-08-27 12:46:31 -07:00
04108592a3 New TLS to disable forward mode AD (#63117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63117

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388097

Pulled By: albanD

fbshipit-source-id: f1bc777064645db1ff848bdd64af95bffb530984
2021-08-27 11:59:24 -07:00
6257f5b168 [pruner] add README to repo (#64099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64099

adding readme to pruner in OSS
ghstack-source-id: 136867516

Test Plan: should not affect behavior

Reviewed By: z-a-f

Differential Revision: D30608045

fbshipit-source-id: 3e9899a853395b2e91e8a69a5d2ca5f3c2acc646
2021-08-27 11:52:59 -07:00
101a626330 Improve distributed.get_rank() API docstring (#63296)
Summary:
See discussion in https://pytorch.slack.com/archives/CBHSWPNM7/p1628792389008600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63296

Reviewed By: cbalioglu

Differential Revision: D30332042

Pulled By: mrshenli

fbshipit-source-id: 3a642fda2e106fd35b67709ed2adb60e408854c2
2021-08-27 11:34:55 -07:00
196fd3ee7a Modules note v2 (#63963)
Summary:
This PR expands the [note on modules](https://pytorch.org/docs/stable/notes/modules.html) with additional info for 1.10.

It adds the following:
* Examples of using hooks
* Examples of using apply()
* Examples for ParameterList / ParameterDict
* register_parameter() / register_buffer() usage
* Discussion of train() / eval() modes
* Distributed training overview / links
* TorchScript overview / links
* Quantization overview / links
* FX overview / links
* Parametrization overview / link to tutorial

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63963

Reviewed By: albanD

Differential Revision: D30606604

Pulled By: jbschlosser

fbshipit-source-id: c1030b19162bcb5fe7364bcdc981a2eb6d6e89b4
2021-08-27 11:30:18 -07:00
19c1b45f25 Detect out argument in the schema (#62755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62755

After this change, the out argument can be checked by calling is_out()

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30415256

Pulled By: tugsbayasgalan

fbshipit-source-id: b2e1fa46bab7c813aaede1f44149081ef2df566d
2021-08-27 11:20:33 -07:00
9f1f22b9bc [Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081

This change adds an out variant of `quantized::embedding_bag_byte_prepack`.

Test Plan:
- Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`.

- Observed

```
V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input)
```

Reviewed By: hlu1

Differential Revision: D30504216

fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2
2021-08-27 10:53:23 -07:00
6ab3a21098 fix resize bug (#61166)
Summary:
I think the original intent here was for this code path to take effect only in the align_corners case (because with output_size = 1 the divisor would be 0), but it also affects the non-align_corners case. For example:

```python
import numpy as np
import torch

# float32 is used here so that bilinear interpolation is supported.
input = torch.tensor(
        np.arange(1, 5, dtype=np.float32).reshape((1, 1, 2, 2)))
m = torch.nn.Upsample(scale_factor=0.5, mode="bilinear")
out = m(input)
```

The expected result is [[[[2.5]]]], but PyTorch returns [[[[1.0]]]], which differs from OpenCV and PIL; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61166

Reviewed By: malfet

Differential Revision: D30543178

Pulled By: heitorschueroff

fbshipit-source-id: 21a4035483981986b0ae4a401ef0efbc565ccaf1
2021-08-27 10:49:31 -07:00
538c30a713 [caffe2] fixes to allow stricter compilation flag (#64016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64016

In order to increase the strictness of compilation for some targets depending on caffe2, we need to fix some errors uncovered when raising such flags.

This change introduces the required override tokens for virtual destructors

Test Plan: CI. Moreover, targets depending on caffe2 that use clang strict warnings now compile

Reviewed By: kalman5

Differential Revision: D30541714

fbshipit-source-id: 564af31b4a9df3536d7d6f43ad29e1d0c7040551
2021-08-27 10:38:37 -07:00
eca87f729d Added reference tests to ReductionOpInfo (#62900)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62900

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30408815

Pulled By: heitorschueroff

fbshipit-source-id: 6a1f82ac281920ff7405a42f46ccd796e60af9d6
2021-08-27 10:32:16 -07:00
babd449978 [JIT] Add aten::slice optimization (#63049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63049

Given a graph produced from a function like this:
```
def foo():
    li = [1, 2, 3, 4, 5, 6]
    return li[0:2]
```
This pass produces a graph like this:
```
def foo():
    li = [1, 2]
    return li
```

These changes are mostly adapted from https://github.com/pytorch/pytorch/pull/62297/

Test Plan: `buck test //caffe2/test:jit -- TestPeephole`

Reviewed By: eellison

Differential Revision: D30231044

fbshipit-source-id: d12ee39f68289a574f533041a5adb38b2f000dd5
2021-08-27 10:12:45 -07:00
3abb606091 Add doc for nn.MultiMarginLoss (shape, example) (#63760)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63760

Reviewed By: malfet

Differential Revision: D30541581

Pulled By: jbschlosser

fbshipit-source-id: 99560641e614296645eb0e51999513f57dfcfa98
2021-08-27 09:51:05 -07:00
a9983ac09c Refactor structured set_output in Register{DispatchKey}.cpp (#62188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62188

These parts of the `set_output` code are identical for all operators in the
kernel registration files. So, this moves them from being copied into every
class to two helper functions at the top of the file.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29962045

Pulled By: albanD

fbshipit-source-id: 753b8aac755f3c91b77ffa2c30a89ac91a84b7c4
2021-08-27 09:38:27 -07:00
f922b58b5f [bazel] GPU-support: add @local_config_cuda and @cuda (#63604)
Summary:
## Context

We take the first step at tackling the GPU-bazel support by adding bazel external workspaces `local_config_cuda` and `cuda`, where the first one has some hardcoded values and lists of files, and the second one provides a nicer, high-level wrapper that maps into the already expected by pytorch bazel targets that are guarded with `if_cuda` macro.

The prefix `local_config_` signifies the fact that we are breaking the bazel hermeticity philosophy by explicitly relying on the CUDA installation that is present on the machine.

## Testing

Notice an important scenario that is unlocked by this change: compilation of cpp code that depends on cuda libraries (i.e. cuda.h and so on).

Before:
```
sergei.vorobev@cs-sv7xn77uoy-gpu-1628706590:~/src/pytorch4$ bazelisk build --define=cuda=true //:c10
ERROR: /home/sergei.vorobev/src/pytorch4/tools/config/BUILD:12:1: no such package 'tools/toolchain': BUILD file not found in any of the following directories. Add a BUILD file to a directory to mark it as a package.
 - /home/sergei.vorobev/src/pytorch4/tools/toolchain and referenced by '//tools/config:cuda_enabled_and_capable'
ERROR: While resolving configuration keys for //:c10: Analysis failed
ERROR: Analysis of target '//:c10' failed; build aborted: Analysis failed
INFO: Elapsed time: 0.259s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (2 packages loaded, 2 targets configured)
```

After:
```
sergei.vorobev@cs-sv7xn77uoy-gpu-1628706590:~/src/pytorch4$ bazelisk build --define=cuda=true //:c10
INFO: Analyzed target //:c10 (6 packages loaded, 246 targets configured).
INFO: Found 1 target...
Target //:c10 up-to-date:
  bazel-bin/libc10.lo
  bazel-bin/libc10.so
INFO: Elapsed time: 0.617s, Critical Path: 0.04s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
```

The `//:c10` target is a good testing one for this, because it has such cases where the [glob is different](075024b9a3/BUILD.bazel (L76-L81)), based on whether we compile for CUDA or not.

## What is out of scope of this PR

This PR is a first in a series of providing the comprehensive GPU bazel build support. Namely, we don't tackle the [cu_library](11a40ad915/tools/rules/cu.bzl (L2)) implementation here. This would be a separate large chunk of work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63604

Reviewed By: soulitzer

Differential Revision: D30442083

Pulled By: malfet

fbshipit-source-id: b2a8e4f7e5a25a69b960a82d9e36ba568eb64595
2021-08-27 09:33:42 -07:00
22d38bd10d [OSS] Enable Metal in PyTorch MacOS nightly builds (#63718)
Summary:
Build on https://github.com/pytorch/pytorch/pull/63825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63718

Test Plan:
1.Add `ci/binaries` label to PR, so the CI will build those nightly builds

2.Make sure the following CI jobs build with `USE_PYTORCH_METAL_EXPORT` option is `ON`:
```
ci/circleci: binary_macos_arm64_conda_3_8_cpu_nightly_build
ci/circleci: binary_macos_arm64_conda_3_9_cpu_nightly_build
ci/circleci: binary_macos_arm64_wheel_3_8_cpu_nightly_build
ci/circleci: binary_macos_arm64_wheel_3_9_cpu_nightly_build
ci/circleci: binary_macos_conda_3_6_cpu_nightly_build
ci/circleci: binary_macos_conda_3_7_cpu_nightly_build
ci/circleci: binary_macos_conda_3_8_cpu_nightly_build
ci/circleci: binary_macos_conda_3_9_cpu_nightly_build
ci/circleci: binary_macos_libtorch_3_7_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_6_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_7_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_8_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_9_cpu_nightly_build
```

3.Test `conda` and `wheel` builds locally on [HelloWorld-Metal](https://github.com/pytorch/ios-demo-app/tree/master/HelloWorld-Metal) demo with [(Prototype) Use iOS GPU in PyTorch](https://pytorch.org/tutorials/prototype/ios_gpu_workflow.html)

(1) conda
```
conda install https://15667941-65600975-gh.circle-artifacts.com/0/Users/distiller/project/final_pkgs/pytorch-1.10.0.dev20210826-py3.8_0.tar.bz2
```
(2) wheel
```
pip3 install https://15598647-65600975-gh.circle-artifacts.com/0/Users/distiller/project/final_pkgs/torch-1.10.0.dev20210824-cp38-none-macosx_10_9_x86_64.whl
```

Reviewed By: xta0

Differential Revision: D30593167

Pulled By: hanton

fbshipit-source-id: 471da204e94b29c11301c857c50501307a5f0785
2021-08-27 09:25:05 -07:00
a43e7a51d7 Adds return type annotation for fork_rng function (#63724)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63723

Since it's a generator function, the return type annotation should be `Generator`.
![image](https://user-images.githubusercontent.com/47299190/130318830-29ef9529-0daa-463c-90b2-1b11f63ade8a.png)
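
A minimal usage sketch of the annotated context manager:

```python
import torch

torch.manual_seed(0)
with torch.random.fork_rng():
    torch.manual_seed(123)
    inside = torch.rand(3)  # drawn from the temporary RNG state
outside = torch.rand(3)     # global RNG state restored on exit
```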

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63724

Reviewed By: iramazanli

Differential Revision: D30543098

Pulled By: heitorschueroff

fbshipit-source-id: ebdd34749defe1e26c899146786a0357ab4b4b9b
2021-08-27 09:03:40 -07:00
ad8eddbd80 More robust check of whether a class is defined in torch (#64083)
Summary:
This would prevent bugs (see the sketch below) for classes that
1) are defined in a module whose name happens to start with `torch`, say `torchvision`
2) are defined in torch but referenced via an import alias like `import torch as th`
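
A hedged sketch of the distinction, using a hypothetical `defined_in_torch` helper (the PR's actual check may differ):

```python
def defined_in_torch(cls) -> bool:
    module = cls.__module__
    # Naive prefix check: wrongly matches "torchvision", "torchaudio", ...
    #   module.startswith("torch")
    # More robust: accept "torch" itself or a proper "torch." submodule;
    # __module__ is also unaffected by aliases like `import torch as th`.
    return module == "torch" or module.startswith("torch.")
```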

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64083

Reviewed By: soulitzer

Differential Revision: D30598369

Pulled By: gmagogsfm

fbshipit-source-id: 9d3a7135737b2339c9bd32195e4e69a9c07549d4
2021-08-27 08:55:35 -07:00
f2c47cf4db [Static Runtime] Out version for fmod (#64046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046

Test Plan:
Confirm out variant is used:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1

V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1)
```

Reviewed By: mikeiovine

Differential Revision: D30581228

fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74
2021-08-27 03:05:06 -07:00
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
3c3bba4169 [Static Runtime] Use F14FastMap/F14FastSet (#63999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63999

Use folly::F14FastMap/F14FastSet instead of std::unordered_map/unordered_set in the Static Runtime code base. folly::F14FastMap/F14FastSet implements the same APIs as std::unordered_map/unordered_set but faster. For details see https://github.com/facebook/folly/blob/master/folly/container/F14.md

Reviewed By: d1jang

Differential Revision: D30566149

fbshipit-source-id: 20a7fa2519e4dde96fb3fc61ef6c92bf6d759383
2021-08-27 01:40:41 -07:00
3f1c809470 [static runtime] port c2 argmin kernel (#63632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63632

Local benchmarking with 1 input repeated for 10k iters on the 290331537_4 local net. Reduces argmin runtime by about 80% and local net execution time by about 0.71-0.77 ms.

Before:
```
I0826 17:25:53.972786 1104614 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 7.37599. Iters per second: 135.57
```
```
Static runtime ms per iter: 8.22086. Iters per second: 121.642
Time per node type:
        4.13527 ms.    50.9157%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.868506 ms.    10.6935%. aten::argmin (1 nodes, out variant)
...
```

After:
```
I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987
```
```
Static runtime ms per iter: 7.68172. Iters per second: 130.179
Time per node type:
         4.1452 ms.    54.0612%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.656778 ms.    8.56562%. fb::quantized_linear (8 nodes)
       0.488229 ms.    6.36741%. static_runtime::to_copy (827 nodes, out variant)
       0.372678 ms.    4.86042%. aten::argmin (1 nodes, out variant)
...Time per node type:
        3.39387 ms.    53.5467%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.636216 ms.    10.0379%. fb::quantized_linear (8 nodes, out variant)
       0.410535 ms.    6.47721%. fb::clip_ranges_to_gather_to_offsets (304 nodes, out variant)
       0.212721 ms.     3.3562%. fb::clip_ranges_gather_sigrid_hash_precompute_v3 (157 nodes, out variant)
       0.173736 ms.    2.74111%. aten::matmul (1 nodes, out variant)
       0.150514 ms.    2.37474%. aten::argmin (1 nodes, out variant)
```
P447422384

Test Plan:
Test with local replayer sending traffic to `ansha_perf_test_0819.test`, and compare outputs to jit interpreter.

Start compute tier:
```
RUN_UUID=ansha_perf_test_0819.test.storage JOB_EXPIRE_TIME=864000 MODEL_ID=290331537_4 PREDICTOR_TAG= PREDICTOR_VERSION=405 PREDICTOR_TYPE=CPU ADDITIONAL_FLAGS="--enable_disagg_file_split=true --enable_adx=false --load_remote_file_locally=true --pytorch_predictor_static_runtime_whitelist_by_id=290331537" GFLAGS_CONFIG_PATH=sigrid/predictor/gflags/predictor_gflags_ads_perf_cpu_pyper SMC_TIER_NAME=sigrid.predictor.perf.ansha_per_test_0819.test.storage CLUSTER=tsp_rva ENTITLEMENT_NAME=ads_ranking_infra_test_t6 PREDICTOR_LOCAL_DIRECTORY= ICET_CONFIG_PATH= NNPI_COMPILATION_CONFIG_FILE= NUM_TASKS=1 NNPI_NUM_WORKERS=0 tw job start /data/users/ansha/fbsource/fbcode/tupperware/config/admarket/sigrid/predictor/predictor_perf_canary.tw
```

Start nnpi tier:
```
RUN_UUID=ansha_perf_test_0819.test JOB_EXPIRE_TIME=247200 MODEL_ID=290331537_4 PREDICTOR_TAG= PREDICTOR_VERSION=343 PREDICTOR_TYPE=NNPI_TWSHARED ADDITIONAL_FLAGS="--torch_glow_min_fusion_group_size=30 --pytorch_storage_tier_replayer_sr_connection_options=overall_timeout:1000000,processing_timeout:1000000 --predictor_storage_smc_tier=sigrid.predictor.perf.ansha_perf_test_0819.test.storage --pytorch_predictor_static_runtime_whitelist_by_id=290331537" GFLAGS_CONFIG_PATH=sigrid/predictor/gflags/predictor_gflags_ads_perf_glow_nnpi_pyper_v1 SMC_TIER_NAME=sigrid.predictor.perf.ansha_perf_test_0819.test CLUSTER=tsp_rva ENTITLEMENT_NAME=ads_ranking_infra_test_t17 PREDICTOR_LOCAL_DIRECTORY= ICET_CONFIG_PATH= NNPI_COMPILATION_CONFIG_FILE= NUM_TASKS=1 NNPI_NUM_WORKERS=0 tw job start /data/users/ansha/fbsource/fbcode/tupperware/config/admarket/sigrid/predictor/predictor_perf_canary.tw
```

```buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.IndividualOps_Argmin --print-passing-details```

Compared outputs to jit interpreter to check for no differences greater than 1e-3 (with nnc on) https://www.internalfb.com/intern/diff/view-version/136824794/

Reviewed By: hlu1

Differential Revision: D30445635

fbshipit-source-id: 048de8867ac72f764132295d1ebfa843cde2fa27
2021-08-26 23:19:19 -07:00
294db0603f [quant] Add support for linear_relu fusion for FP16 dynamic quant (#63826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63826

Support the conversion of the intrinsic LinearReLU module to the dynamically quantized LinearReLU module.
Verify that the support works for both linear-module and functional-linear fusion (see the sketch below).
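
A hedged sketch of the intended flow (the fusion list and module mapping here are assumptions; the real convert path may differ):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).eval()

# Fuse Linear + ReLU into an intrinsic LinearReLU module, then apply
# dynamic quantization; with dtype=torch.float16 the fused module should
# be swapped for its dynamically quantized FP16 counterpart.
fused = torch.quantization.fuse_modules(model, [["0", "1"]])
quantized = torch.quantization.quantize_dynamic(
    fused, {nn.intrinsic.LinearReLU}, dtype=torch.float16
)
out = quantized(torch.randn(2, 16))
```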

Test Plan:
python test/test_quantization.py test_dynamic_with_fusion

Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30503513

fbshipit-source-id: 70446797e9670dfef7341cba2047183d6f88b70f
2021-08-26 21:12:06 -07:00
cec44aa574 [quant] Add op support for linear_relu_dynamic_fp16 (#63824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63824

Add a fused operator implementation that will work with the quantization fusion APIs.
Once the FBGEMM FP16 kernel supports relu fusion natively, we can remove the addition from the PyTorch operator.

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30503514

fbshipit-source-id: 6bf3bd53f47ffaa3f1d178eaad8cc980a7f5258a
2021-08-26 21:12:04 -07:00
975f4ccad6 [quant] support linear_relu_dynamic for qnnpack backend (#63820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63820

Adds support in the operator to call the relu operator directly if relu fusion is enabled.
Once QNNPACK natively supports relu fusion in linear_dynamic, this can be removed.

Test Plan:
python test/test_quantization.py TestDynamicQuantizedLinear.test_qlinear

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30502813

fbshipit-source-id: 3352ee5f73e482b6d1941f389d720a461b84ba23
2021-08-26 21:12:02 -07:00
c7027f19ef [quant][fx] Add support for dynamic linear + relu fusion (INT8) (#63799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63799

Add a new module that can be used for a module swap with the nni.LinearReLU module in the convert function.
Currently supports INT8 only (since the FP16 op doesn't have relu fusion yet).

Fixes #55393

Test Plan:
python test/test_quantization.py test_dynamic_fusion

Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30502812

fbshipit-source-id: 3668e4f001a0626d469e17ac323acf582ee28a51
2021-08-26 21:10:46 -07:00
63c90ec3bf [torch/deploy] add torch.distributed to build (#63918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63918

Previously we were building with `USE_DISTRIBUTED` off, because c10d was built as a separate library for historical reasons. Since then, lw has merged the c10d build into libtorch, so this is fairly easy to turn on.

Differential Revision:
D30492442

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.intern.facebook.com/intern/diff/D30492442/)!

Test Plan: added a unit test

Reviewed By: wconstab

Pulled By: suo

fbshipit-source-id: 843b8fcf349a72a7f6fcbd1fcc8961268690fb8c
2021-08-26 20:58:44 -07:00
65e6194aeb Introduce the torchrun entrypoint (#64049)
Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type and gives a nicer syntax than a rather cryptic `python -m ...` command line. Along with the new entrypoint the documentation is also updated and places where `torch.distributed.run` are mentioned are replaced with `torchrun`.
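
For example (the script name is a placeholder):

```
# before
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 train.py

# after, equivalent
torchrun --nnodes=1 --nproc_per_node=4 train.py
```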

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
2021-08-26 20:17:48 -07:00
510d2ece81 Merge script and _script_pdt API (#62420)
Summary:
Merge the `torch.jit.script` and `torch.jit._script_pdt` APIs. This PR merges profile-directed typing with the script API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62420

Reviewed By: iramazanli

Differential Revision: D30579015

Pulled By: nikithamalgifb

fbshipit-source-id: 99ba6839d235d61b2dd0144b466b2063a53ccece
2021-08-26 18:58:19 -07:00
0e8c3c51d9 port glu to use structured kernel approach (#61800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61800

resubmitting because the [last one](https://github.com/pytorch/pytorch/pull/61433) was unrecoverable due to making changes incorrectly in the stack

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29812492

Pulled By: makslevental

fbshipit-source-id: c3dfeacd1e00a526e24fbaab02dad48069d690ef
2021-08-26 18:01:28 -07:00
a5f35ac7cd Run through failures on trunk (#64063)
Summary:
This PR runs all the tests on trunk instead of stopping on first failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64063

Reviewed By: malfet, seemethere

Differential Revision: D30592020

Pulled By: janeyx99

fbshipit-source-id: 318b225cdf918a98f73e752d1cc0227d9227f36c
2021-08-26 17:38:19 -07:00
0c9dce90ed [pytorch] add per_sample_weights support for embedding_bag_4bit_rowwise_offsets (#63605)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63605

Reviewed By: houseroad

Differential Revision: D30434664

fbshipit-source-id: eb4cbae3c705f9dec5c073a56f0f23daee353bc1
2021-08-26 17:31:45 -07:00
81764d1153 document that torch.triangular_solve has optional out= parameter (#63253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63253

Fixes #57955

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30312134

Pulled By: dagitses

fbshipit-source-id: 4f484620f5754f4324a99bbac1ff783c64cee6b8
2021-08-26 17:28:17 -07:00
ed573a8e08 Enable test_api IMethodTest in OSS (#63345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63345

This diff did the following few things to enable the tests:
1. Exposed IMethod as TORCH_API.
2. Linked torch_deploy to test_api if USE_DEPLOY == 1.
3. Generated torch::deploy examples when building torch_deploy library.

Test Plan: ./build/bin/test_api --gtest_filter=IMethodTest.*

Reviewed By: ngimel

Differential Revision: D30346257

Pulled By: alanwaketan

fbshipit-source-id: 932ae7d45790dfb6e00c51893933a054a0fad86d
2021-08-26 16:50:52 -07:00
0bd8d0951d [Static Runtime] Remove unnecessary fb::equally_split nodes (#64022)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64022

Test Plan: - Added unittest `StaticRuntime.RemoveEquallySplitListUnpack`.

Reviewed By: hlu1

Differential Revision: D30472189

fbshipit-source-id: 36040b0146f4be9d0d0fda293f7205f43aad0b87
2021-08-26 16:29:43 -07:00
dfa35ab3e7 [pytorch][quant][oss] Support 2-bit embedding_bag op "embedding_bag_2bit_rowwise_offsets" (#63658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63658

Support 2-bit embedding_bag op "embedding_bag_2bit_rowwise_offsets"

Reviewed By: jingsh, supriyar

Differential Revision: D30454994

fbshipit-source-id: 7aa7bfe405c2ffff639d5658a35181036e162dc9
2021-08-26 16:09:35 -07:00
92a154aa29 Move variabletype functions around (#63330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63330

 - This is in preparation for templated/boxed autograd-not-implemented fallback
 - Make sure VariableTypeUtils does not depend on generated code
 - Lift `isFwGradDefined` into `autograd/functions/utils.cpp` so it's available to mobile builds
 - Removes `using namespace at` from VariableTypeUtils; previously we needed this for the templated version. It's no longer strictly necessary, but it is still a good change to avoid name conflicts if this header is included elsewhere in the future.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30518573

Pulled By: soulitzer

fbshipit-source-id: a0fb904baafc9713de609fffec4b813f6cfcc000
2021-08-26 16:02:39 -07:00
49353e319c More sharded_tensor creation ops: harded_tensor.zeros, sharded_tensor.full, sharded_tensor.rand (#63732)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63732

Test Plan:
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py  --v

$ python test/distributed/_sharded_tensor/test_sharded_tensor.py TestCreateTensorFromParams --v
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py TestShardedTensorChunked --v

Imported from OSS

Differential Revision:
D30472621

Reviewed By: pritamdamania87

Pulled By: bowangbj

fbshipit-source-id: fd8ebf9b815fdc292ad1aad521f9f4f454163d0e
2021-08-26 16:01:38 -07:00
49b782b2cb Add shard number to print_test_stats.py upload name (#64055)
Summary:
Now that the render test results job is gone, each shard on GHA is uploading a JSON test stats report. To ensure differentiation, this PR includes the shard number in the report name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64055

Reviewed By: iramazanli

Differential Revision: D30586869

Pulled By: janeyx99

fbshipit-source-id: fd19f347131deec51486bb0795e4e13ac19bc71a
2021-08-26 15:43:29 -07:00
085278f8b1 Derivatives of relu (#63027) (#63089)
Summary:
Optimizes the relu and leaky_relu derivatives to reduce the VRAM needed for the backward passes

Fixes https://github.com/pytorch/pytorch/issues/63027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63089

Reviewed By: iramazanli

Differential Revision: D30582049

Pulled By: albanD

fbshipit-source-id: a9481fe8c10cbfe2db485e28ce80cabfef501eb8
2021-08-26 15:33:25 -07:00
7861dba7f6 Automated submodule update: FBGEMM (#62879)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ce54703857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62879

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D30154801

fbshipit-source-id: b2ce185da6f6cadf5128f82b15097d9e13e9e6a0
2021-08-26 15:20:06 -07:00
aeec177833 [JIT] UseVariadicOp takes list_idx parameter (#63915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63915

Previously, this function only worked for variadic op substitutions of the form `op(list, args) -> variadic_op(list_1, ..., list_n, args)`. This change allows for transformations of the form `op(args_0, list, args_1) -> variadic_op(args_0, list_1, ..., list_n, args_1)`.

Test Plan:
`buck test caffe2/test/cpp/jit:jit -- Stack Concat`

(tests exercising `list_idx != 0` will be added further up in this diff stack)

Reviewed By: navahgar

Differential Revision: D30529729

fbshipit-source-id: 568080679c3b40bdaedee56bef2e8a5ce7985d2f
2021-08-26 14:10:35 -07:00
d8d8e4902a [torch/elastic] Pretty print the failure message captured by @record (#64036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64036

This PR slightly revises the implementation of the internal `_format_failure()` method in order to pretty print the error message captured in a subprocess by the `record` annotation.

With this PR a failure log is formatted as below:

```
Root Cause:
[0]:
  time: 2021-08-26_17:12:07
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 8045)
  error_file: /tmp/torchelastic_6cj9eppm/6d9d844a-6ce4-4838-93ed-1639a9525b00_rec9kuv3/attempt_0/0/error.json
  msg:
    {
      "message": "ValueError: Test",
      "extraInfo": {
        "py_callstack": [
          "  File \"/data/home/balioglu/fail.py\", line 7, in <module>\n    main()\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 373, in wrapper\n    error_handler.record_exception(e)\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 86, in record_exception\n    _write_error(e, self._get_error_file_path())\n",
          "  File \"/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/error_handler.py\", line 26, in _write_error\n    \"py_callstack\": traceback.format_stack(),\n"
        ],
        "timestamp": "1629997927"
      }
    }
```

in contrast to the old formatting:

```
Root Cause:
[0]:
  time: 2021-08-26_17:15:50
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 9417)
  error_file: /tmp/torchelastic_22pwarnq/19f22638-848c-4b8f-8379-677f34fc44e7_u43o9vs7/attempt_0/0/error.json
  msg: "{'message': 'ValueError: Test', 'extraInfo': {'py_callstack': 'Traceback (most recent call last):\n  File "/fsx/users/balioglu/repos/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper\n    return f(*args, **kwargs)\n  File "/data/home/balioglu/fail.py", line 5, in main\n    raise ValueError("BALIOGLU")\nValueError: BALIOGLU\n', 'timestamp': '1629998150'}}"
```
ghstack-source-id: 136761768

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D30579025

fbshipit-source-id: 37df0b7c7ec9b620355766122986c2c77e8495ae
2021-08-26 13:56:46 -07:00
5a12cb611f To add Chained Scheduler to the list of PyTorch schedulers. (#63491)
Summary:
In this PR we are introducing ChainedScheduler which initially proposed in the discussion https://github.com/pytorch/pytorch/pull/26423#discussion_r329976246 .

The idea is to provide a user-friendly chaining method for schedulers, especially for cases where many of them are involved and we want a clean, easy-to-read interface. This method will be even more crucial once composite schedulers and schedulers for different types of parameters are involved.

The immediate application of ChainedScheduler is expected to be in the TorchVision library, to combine WarmUpLR and MultiStepLR https://github.com/pytorch/vision/blob/master/references/video_classification/scheduler.py#L5 . However, this method can be expected to apply to many other use cases as well.

### Example
The usage is as simple as below:

```python
sched=ChainedScheduler([ExponentialLR(self.opt, gamma=0.9),
                        WarmUpLR(self.opt, warmup_factor=0.2, warmup_iters=4, warmup_method="constant"),
                        StepLR(self.opt, gamma=0.1, step_size=3)])
```

Then calling
```python
sched.step()
```
would trigger the step function of all three schedulers consecutively

Partially resolves https://github.com/pytorch/vision/issues/4281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63491

Reviewed By: datumbox, mruberry

Differential Revision: D30576180

Pulled By: iramazanli

fbshipit-source-id: b43f0749f55faab25079641b7d91c21a891a87e4
2021-08-26 13:30:21 -07:00
7cfbc85821 [fx_acc] [fx2trt] add acc op mapper for argmin and converter for topk (#63823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63823

Add mapper for `torch.argmin` which maps it to `acc_ops.flatten` (optional) + `acc_ops.topk` + `acc_ops.getitem` + `acc_ops.squeeze` (optional). This diff doesn't allow mapping if `dim=None && keepdim=True` in `torch.argmin`.

Add fx2trt converter for `acc_ops.topk`.

Test Plan:
buck test mode/opt glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_argmin
buck run mode/opt caffe2/torch/fb/fx2trt:test_topk

Reviewed By: jfix71

Differential Revision: D30501771

fbshipit-source-id: 0babc45e69bac5e61ff0b9b4dfb98940398e3e57
2021-08-26 13:16:22 -07:00
cbfec02007 [Static Runtime] Add native op for aten::expand_as (#64024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024

`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.
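
For reference, a small sketch of the op's view semantics:

```python
import torch

x = torch.randn(3, 1)
y = torch.randn(3, 4)
v = x.expand_as(y)  # view with shape (3, 4); no data is copied
assert v.shape == y.shape
```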

Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`

Reviewed By: hlu1

Differential Revision: D30546851

fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
2021-08-26 13:05:53 -07:00
95d0b3199b Back out "[ONNX] Fix an issue that optimizations might adjust graph inputs unexpectedly. (#61280)" (#64004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63904

Fixes T98808160

Test Plan: T98808160

Reviewed By: msaroufim

Differential Revision: D30527450

fbshipit-source-id: 6262901a78ca929cecda1cf740893139aa26f1b4
2021-08-26 12:49:42 -07:00
c5cc185b6d Allow uncompiled strings as input to checkScriptRaisesRegex (#63901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63901

cc gmagogsfm

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D30579472

Pulled By: ansley

fbshipit-source-id: 59ee09c1f25278d4f6e51f626588251bd095c6ea
2021-08-26 12:17:07 -07:00
48c57b9b2e Leverage TensorPipe's automatic SHM address selection (#63028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63028

Until now, TensorPipe required PyTorch to come up with and provide a unique identifier to use as the address for the UNIX domain socket used in the SHM transport. However, the Linux kernel can automatically assign an available address (like it does with IP ports), and TensorPipe now supports this, so we can remove that now-unnecessary PyTorch logic.

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D30220352

fbshipit-source-id: 78e8a6ef5916b2a72df26cdc9cd367b9d083e821
2021-08-26 12:15:53 -07:00
ad47fb8858 Rename IterableAsDataPipe to IterableWrapper (#63981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63981

Rename `IterableAsDataPipe` to `IterableWrapper` based on our naming convention `Op-er`

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30554197

Pulled By: ejguan

fbshipit-source-id: c2eacb20df5645d83ca165d6a1591f7e4791990f
2021-08-26 10:23:25 -07:00
0f6b524665 [NNC] Add C++ codegen backend to NNC (#62869)
Summary:
Adds a C++ codegen backend to NNC to generate C++ for CPU instead of generating LLVM IR.
Tensors are represented as blobs of float. Vector operations are devectorized/unrolled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62869

Test Plan:
https://github.com/pytorch/pytorch/tree/mvz-nnc-aot-prototype makes it able to AOT compile the whole MobileNetV3 model into binary code through LLVM codegen in NNC.

I forked that branch to https://github.com/cheng-chang/pytorch/tree/cc-aot-cpp, merged this PR into it, and modified `fancy_compile` to compile MobileNetV3 into C++ through

```
import torch

m = torch.jit.load('mobnet.pt')
m.eval()
f = torch.jit.freeze(m)
torch._C._fancy_compile(f.graph, [1, 3, 224, 224])
```

The generated C++ file `mobnet.cc` can be found at https://gist.github.com/cheng-chang/e2830cc6920b39204ebf368035b2bcec.

I manually compiled the generated C++ through `g++ -o mobnet -std=c++14 -L./build/lib -ltorch_cpu -ltorch mobnet.cc`, and it succeeded.

Reviewed By: ZolotukhinM

Differential Revision: D30149482

Pulled By: cheng-chang

fbshipit-source-id: e77b189f0353e37cd309423a48a513e668d07675
2021-08-26 09:56:37 -07:00
6d31ba6ddc [nnc] Sanitized the names of constants in the input graph. (#63990)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63923

The input graph can contain constants whose names contain special characters. So, all names of constants in the input graph need to be sanitized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63990

Reviewed By: ZolotukhinM

Differential Revision: D30558432

Pulled By: navahgar

fbshipit-source-id: de5b0c23d50ee8997f40f2c0fc605dda3719186f
2021-08-26 09:52:02 -07:00
ba5f1b1076 [nnc] Fix dtype promotion involving scalars (#64002)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64002

Fixes https://github.com/pytorch/vision/issues/4315

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30566979

Pulled By: bertmaher

fbshipit-source-id: eaa98b9534a926be7fcd337d46c5a0acb3243179
2021-08-26 09:43:15 -07:00
1354ee417a run_test.py: add option to run only core tests (#63976)
Summary:
This is in response to a feature request from some folks in the core team to have a local command that would only run relevant "core" tests. The idea is to give developers a local smoke-test option to run before making a PR, in order to verify their changes did not break core functionality. These smoke tests are not meant to be short, but rather relevant.

This PR enables that by allowing developers to run `python test/run_test.py --core` or `python test/run_test.py -core` in order to run the CORE_TEST_LIST, which is currently test_nn.py, test_torch.py, and test_ops.py.

I am not the best person to judge what should be considered "core", so please comment which tests should be included and/or excluded from the CORE_TEST_LIST!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63976

Test Plan:
```
(pytorch) janeyx@janeyx-mbp test % python run_test.py --core -v
Selected tests: test_nn, test_ops, test_torch
Running test_nn ... [2021-08-25 14:48:28.865078]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_nn.py', '-v'] ... [2021-08-25 14:48:28.865123]
test_to (__main__.PackedSequenceTest) ... ok
test_to_memory_format (__main__.PackedSequenceTest) ... ok
```

Reviewed By: walterddr

Differential Revision: D30575560

Pulled By: janeyx99

fbshipit-source-id: 3f151982c1e315e50e60cb0d818adaea34556a04
2021-08-26 09:29:57 -07:00
fbe7133b58 [Static Runtime] Disable out variant of aten::clone (#63980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63980

The out variant implementation of `aten::clone` causes a crash, which needs further investigation. This change disables it until the problem gets fixed.

Note that `inline_cvr` doesn't use `aten::clone` as of now, so no perf implication: https://www.internalfb.com/phabricator/paste/view/P446858755?lines=121

Test Plan: N/A

Reviewed By: hlu1

Differential Revision: D30544149

fbshipit-source-id: facb334d67473f622b36862fbdb2633358556fdf
2021-08-26 08:10:13 -07:00
7ccc4b5cc8 [CI] move distributed test into its own CI job (#62896)
Summary:
Moving distributed to its own job.

- [x] ensure there should be a distributed test job for every default test job matrix (on GHA)
- [x] ensure that circleci jobs works for distributed as well
- [x] waiting for test distributed to have its own run_test.py launch options, see https://github.com/pytorch/pytorch/issues/63147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62896

Reviewed By: seemethere

Differential Revision: D30230856

Pulled By: walterddr

fbshipit-source-id: 0cad620f6cd9e56c727c105458d76539a5ae976f
2021-08-26 08:02:20 -07:00
733755f72c remove special grad_mode tls handling (#63116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63116

This PR removes the special flag to disable grad mode tracking on the ThreadLocalState and replaces it with an explicit setter that users can use.
This allows us to reduce the complexity of ThreadLocalState.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388098

Pulled By: albanD

fbshipit-source-id: 85641b3d711179fb78ff6a41ed077548dc821a2f
2021-08-26 07:51:30 -07:00
950f7c0237 Added API tests to ReductionOpInfo and ported amax/amin/nansum tests (#62899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62899

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30408816

Pulled By: heitorschueroff

fbshipit-source-id: 6cb0aa7fa7edba93549ef873baa2fb8a003bd91d
2021-08-26 07:18:43 -07:00
10da1fc3f8 Deify opmath_t into its own header, align with accscalar_t (#63986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63986

Fixes #63985

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30555996

Pulled By: ezyang

fbshipit-source-id: b6e4d56a5658ed028ffc105cc4b479faa6882b65
2021-08-26 06:59:46 -07:00
774ae0851d [OpInfo] Added ReductionOpInfo subclass of OpInfo and ported sum test (#62737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62737

ReductionOpInfo is a specialization of OpInfo for reduction operators. For now, it is designed to work with reductions that return a single tensor and that reduce all elements along one or more dimensions to a single value. In particular this excludes operators such as `max` and `min` that return multiple tensors and `quantile` that can return multiple values.

fixes https://github.com/pytorch/pytorch/issues/49746

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30406568

Pulled By: heitorschueroff

fbshipit-source-id: 218b1da1902f67bcf4c3681e2a0f0029a25d51f1
2021-08-26 06:06:38 -07:00
c02eda8166 Update TensorPipe submodule
Summary: The bot failed to do it.

Test Plan: D30542677

Reviewed By: beauby

Differential Revision: D30573500

fbshipit-source-id: 50abd6fc415cead0a6b6d9290fa0e5f97d0e4989
2021-08-26 05:44:38 -07:00
61d88cdd1c use const auto& as type for grad alias (#63949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63949

This is an extension of the discussion in
https://github.com/pytorch/pytorch/pull/63040#discussion_r687793027.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30546789

Pulled By: dagitses

fbshipit-source-id: 3046aff4f129d5492d73dfb67717a824e16ffee8
2021-08-26 04:44:03 -07:00
5757d03145 Add logging for _MinimizerBase
Summary: Add logging so we know which nodes are currently being visited

Test Plan: lint & SC tests

Reviewed By: 842974287

Differential Revision: D30509865

fbshipit-source-id: 09e77e44c97c825242e0b24f90463b50f3ca19c6
2021-08-26 00:52:58 -07:00
a6f767ed3d Fix issue re: DDP and create_graph=True (#63831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63831

Closes https://github.com/pytorch/pytorch/issues/63812

`at::mul_out` is not supported when `grad` itself requires grad, as is the case when computing higher-order derivatives.

In this case, fall back to a mul + copy instead of mul_out.
ghstack-source-id: 136614644
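
A small repro of the underlying limitation, with the fallback sketched at the end (illustrative Python, not the PR's C++):

```
import torch

grad = torch.randn(3, requires_grad=True)
scale = torch.tensor(0.5)
out = torch.empty(3)

try:
    torch.mul(grad, scale, out=out)  # out= ops reject inputs that require grad
except RuntimeError:
    out.copy_(grad * scale)          # fallback: out-of-place mul, then a copy
```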

Test Plan: UT

Reviewed By: SciPioneer

Differential Revision: D30505573

fbshipit-source-id: 83532b6207b3d80116fcc4dff0e5520d73b3454f
2021-08-25 23:50:25 -07:00
3b284ab024 Adding BFP16 quantization/dequantization support to OSS (#63059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059

Adds support for the BFP16 quantization method to OSS. Currently only CPU is supported.
ghstack-source-id: 136639528

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D30194538

fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
2021-08-25 23:41:34 -07:00
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds an `is_torchelastic_launched()` check function that users can use in training scripts to see whether the script is launched via torchelastic.
2. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
3. Makes `init_method=tcp://` torchelastic-compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous; it is the old rendezvous module, which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()`, and if so, only create TCPStore clients (no daemons, not even for rank 0).
4. Adds a bunch of unittests to cover the different code paths.

NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible with torchelastic as well, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
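
A sketch of how a training script might use the new check (backend, host, and port are placeholders):

```
import torch.distributed as dist

if dist.is_torchelastic_launched():
    # The elastic agent already runs the store; just attach via env://.
    dist.init_process_group(backend="gloo", init_method="env://")
else:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
        rank=0,
        world_size=1,
    )
```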

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
b629ea4620 Update persons_of_interest.rst (#63907)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63907

Reviewed By: jspisak

Differential Revision: D30534972

Pulled By: dzhulgakov

fbshipit-source-id: ba726fc53e292a362c387cc8b5f7776ca2a2544c
2021-08-25 22:50:54 -07:00
b1154cc774 enable equal_nan for complex values in isclose (#63571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63571
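
A quick illustration of the enabled behavior, using NaN in both components to stay independent of tie-breaking details (previously `equal_nan=True` was not supported for complex inputs):

```
import torch

a = torch.tensor([complex(float("nan"), float("nan"))])
b = torch.tensor([complex(float("nan"), float("nan"))])

print(torch.isclose(a, b))                  # tensor([False])
print(torch.isclose(a, b, equal_nan=True))  # tensor([True])
```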

Test Plan: Imported from OSS

Reviewed By: malfet, ngimel

Differential Revision: D30560127

Pulled By: mruberry

fbshipit-source-id: 8958121ca24e7c139d869607903aebbe87bc0740
2021-08-25 22:05:49 -07:00
49c8fbc92f Clean up related to type refinements (#62444)
Summary:
Creates a helper function to refine the types into a TorchScript-compatible format in the MonkeyType config for profile-directed typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62444

Reviewed By: malfet

Differential Revision: D30548159

Pulled By: nikithamalgifb

fbshipit-source-id: 7c09ce5f5e043d069313b87112837d7e226ade1f
2021-08-25 21:53:00 -07:00
80a61142e4 inference for algebraic expressions (#63822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63822

Infers algebraic expressions and adds this to our symbolic inferencer. It works for conv2D and can be extended to other operations.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30518469

Pulled By: migeed-z

fbshipit-source-id: b92dfa40b2d834a535177da42b851701b8f7178c
2021-08-25 20:47:23 -07:00
124ae597fb [quant] Fixing the conversion of the quantizable RNN (#63879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63879

Quantizable RNN had a bug where `from_observed` was an instance method instead of a class method, which caused `tq.convert` to fail. This fixes the issue by making `from_observed` a classmethod.

The tests were passing before because the unittests were not using the custom module path, but a conventional `from_float`, which is also supported.
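
The shape of the fix, as an illustrative stub (not the actual `torch.nn.quantizable` code):

```
class LSTM:
    # Before the fix this was `def from_observed(self, observed)`, an instance
    # method that the custom-module convert path could not call on the class.
    @classmethod
    def from_observed(cls, observed):
        converted = cls()
        # ... transfer observed weights/qparams onto `converted` here ...
        return converted
```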

Test Plan:
`buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm`

```
buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm
Parsing buck files: finished in 0.5 sec
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 9.2 sec (100%) 12622/12622 jobs, 2/12622 updated
  Total time: 9.7 sec
More details at https://www.internalfb.com/intern/buck/build/0d87b987-649f-4d06-b0e2-97b5077
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: cb99305f-65c9-438b-a99f-a0a2a3089778
Trace available for this run at /tmp/tpx-20210824-115652.540356/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5066549645030046
    ✓ ListingSuccess: caffe2/test:quantization - main (12.550)
    ✓ Pass: caffe2/test:quantization - test_custom_module_lstm (quantization.core.test_quantized_op.TestQuantizedOps) (174.867)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5066549645030046
```

Reviewed By: jerryzh168, mtl67

Differential Revision: D30520473

fbshipit-source-id: bc5d0b5bb079fd146e2614dd42526fc7d4d4f3c6
2021-08-25 20:39:02 -07:00
2ea2711501 Make frozen symbol name customizable in torch deploy. (#63817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63817

ghstack-source-id: 136699671

Test Plan: eyes

Reviewed By: wconstab

Differential Revision: D29571559

fbshipit-source-id: 8e3caa4932ef8d7c8559f264f0e9bb5474ad2237
2021-08-25 20:10:35 -07:00
f4bc28990f Compute cuda reduction buffer size in elements (#63969)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/63885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63969

Reviewed By: mruberry

Differential Revision: D30549423

Pulled By: ngimel

fbshipit-source-id: b16d25030d44ced789c125a333d72b02a8f45067
2021-08-25 18:18:37 -07:00
01b8162d00 Back out "Revert D30384746: [fx2trt] Add a test for quantized resnet18" (#63973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63973

Original commit changeset: b93235323e22

Test Plan: buck run mode/opt -c python.package_style=inplace caffe2:fx2trt_quantized_resnet_test

Reviewed By: 842974287

Differential Revision: D30546036

fbshipit-source-id: 2c8302456f072d04da00cf9ad97aa8304bc5e43e
2021-08-25 17:52:22 -07:00
57d4c6cf42 replace self.assertTrue(torch.allclose(..)) with self.assertEqual(…) (#63637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63565
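
The mechanical rewrite, sketched on a toy test. PyTorch's `TestCase.assertEqual` compares tensors with dtype-appropriate tolerances and prints richer failure messages than a bare boolean assert:

```
import torch
from torch.testing._internal.common_utils import TestCase

class ExampleTest(TestCase):
    def test_add(self):
        actual = torch.tensor([1.0]) + torch.tensor([2.0])
        expected = torch.tensor([3.0])
        # Before: self.assertTrue(torch.allclose(actual, expected))
        self.assertEqual(actual, expected)
```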

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63637

Reviewed By: malfet

Differential Revision: D30541266

Pulled By: mruberry

fbshipit-source-id: ab461949782c6908a589ea098fcfcf5c3e081ee6
2021-08-25 16:47:40 -07:00
1be1c901aa Remove render_test_results job (#63877)
Summary:
This removes the `render_test_results` job we had before, which had been causing some confusion among devs when it failed and isn't really necessary now that we can render test results directly on the PR HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63877

Reviewed By: walterddr, janeyx99

Differential Revision: D30546705

Pulled By: driazati

fbshipit-source-id: 55fdafdb6f80924d941ffc15ee10787cb54f34a1
2021-08-25 15:55:55 -07:00
ba0e6a1e03 [EASY] Update the clang-tidy error message (#63370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63370

As shown by this CI run, the actual thing that is incorrect is the prompt.
https://github.com/pytorch/pytorch/actions/runs/1137298261

The CI runs the below command instead of the original command.
The original command errors out when importing another file on line 1.
Trying to fix the code to work with the original command causes the CI to error out.

We should actually ask the user to run
`python3 -m tools.linter.install.clang_tidy`

Test Plan: Imported from OSS

Reviewed By: janeyx99, heitorschueroff

Differential Revision: D30530216

Pulled By: Gamrix

fbshipit-source-id: 2a2b8d539dcc2839e4000c13e82c207fa89bfc9f
2021-08-25 15:30:13 -07:00
44ede71751 Shard python_torch_functions.cpp (#62187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62187

This file can take 3 minutes on its own to compile and, after
python_functions.cpp, is the second-largest contributor to the compile time of
`libtorch_python` on a 32-core threadripper. This change splits it into 3 files
that take around 1 minute each to compile.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D29962048

Pulled By: albanD

fbshipit-source-id: 99016d75912bff483fe21b130cef43a6882f8c0e
2021-08-25 15:10:43 -07:00
730ce29baf Add note on ifdefing based on CUDA_VERSION for ROCm path (#62850)
Summary:
CUDA_VERSION and HIP_VERSION follow very unrelated versioning schemes, so it does not make sense to use CUDA_VERSION to determine the ROCm path. This note explicitly addresses it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62850

Reviewed By: mruberry

Differential Revision: D30547562

Pulled By: malfet

fbshipit-source-id: 02990fa66a88466c2330ab85f446b25b78545150
2021-08-25 15:02:03 -07:00
b5b9ce146f Small fixes to the Contributing.txt (#63385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63385

Correcting a mistake for the pytorch uninstall, and
adding an extra note for Darwin.

Test Plan: Imported from OSS

Reviewed By: janeyx99, heitorschueroff

Differential Revision: D30530234

fbshipit-source-id: e0f88a1725eeadabfb4b28c1da11e369ee878ab4
2021-08-25 14:50:37 -07:00
52ebe7e14e Back out "Temporary fix for remote gpu execution issue" (#63983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63983

Test for fixes in D30545351. it should resolve the remote execution flag being populated incorrectly issue.

Test Plan: CI

Reviewed By: malfet, seemethere

Differential Revision: D30549443

fbshipit-source-id: b3895909f5cd654ba163b77950872b332fbad3fe
2021-08-25 14:37:01 -07:00
5b548f6f64 Shape Propagation Pass: Fix AdaptiveAveragePooling2d (#63629)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63629

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30461727

Pulled By: priyaramani

fbshipit-source-id: 3873d1d636f79185680b82de06174d8de288c941
2021-08-25 13:13:41 -07:00
ab5cf5a1eb Move existing target determinator to tools (#63809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63809

This moves out the modulefinder determinator to `tools/testing` since it is supposed to be CI-only. This also simplifies run_test.py a little bit.

Test Plan: Imported from OSS

Reviewed By: malfet, seemethere, janeyx99

Differential Revision: D30497438

Pulled By: driazati

fbshipit-source-id: 1d203037af5af6a20c1e7812da935e7cbb5cd82f
2021-08-25 13:03:53 -07:00
7edeead796 Add a comment on the potential implicit type up-casting (#63905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63905

as title
ghstack-source-id: 136590703

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D30527929

fbshipit-source-id: 69402bbfa87cfd8fc166ce313cde9736ee072589
2021-08-25 12:47:45 -07:00
b0782f0f32 add BFloat16 support for bernoulli and Dropout on CPU (#56372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56372

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28836792

Pulled By: VitalyFedyunin

fbshipit-source-id: ede951d172a59276e11383fd767778ab959b5a6b
2021-08-25 12:01:27 -07:00
7299565768 Update torch.distributed.run OMP_NUM_THREADS message to log.warning (#63953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63953

Closes #61138

Test:
`python -m torch.distributed.run --nproc_per_node 2 test.py`
Still outputs message

`LOGLEVEL=ERROR python -m torch.distributed.run --nproc_per_node 2 test.py`
Does not output message anymore

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30542997

Pulled By: H-Huang

fbshipit-source-id: e7da30dcda51516abf4e56f1f510132e44397027
2021-08-25 11:55:06 -07:00
3d4aabfc48 Fix ciflow/all label generation (#63954)
Summary:
The `ciflow/all` label is automatically added, but it needs to be added before we call `gen_root_job_condition`.

- fix the order of adding `ciflow/all`
- refactor all the string into global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63954

Reviewed By: malfet

Differential Revision: D30545596

Pulled By: zhouzhuojie

fbshipit-source-id: 83ab668f0234488afb855a72e3ebd4503f7f1a78
2021-08-25 11:32:32 -07:00
67d8e7b659 Reformat run_test.py (#63808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63808

`black run_test.py`

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D30497437

Pulled By: driazati

fbshipit-source-id: 41b29b73f41fa4bb15fce5eaa69f8efe614e02f7
2021-08-25 11:27:18 -07:00
64d605bab8 [Static Runtime] Added caching for the NNC code generated for Logit. (#63840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63840

Added NNC generated code for Logit to the cache.

```
Logit NNC Benchmark      Time (ns)
                         w/o cache    w/ cache
logit_nnc_sleef/64       543          536
logit_nnc_sleef/512      3517         3465
logit_nnc_sleef/8192     88483        85881
logit_nnc_sleef/32768    337016       323090
logit_nnc_fast/64        167          163
logit_nnc_fast/512       866          817
logit_nnc_fast/8192      13069        12801
logit_nnc_fast/32768     53429        52530
logit_nnc_vml/64         164          151
logit_nnc_vml/512        783          769
logit_nnc_vml/8192       11563        11674
logit_nnc_vml/32768      46720        46452
```

Test Plan: Unit tests and inline_cvr model.

Reviewed By: hlu1

Differential Revision: D30405424

fbshipit-source-id: 938b1b74758e2612ae151bac890c5f8ebbc42d50
2021-08-25 11:19:58 -07:00
dde07cad6f [Static Runtime] Added a variable for clamp in the NNC code for Logit. (#63839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63839

Replaced the use of a constant for clamp in the NNC code for Logit
with a variable. This makes it easier to enable caching for Logit.

There is no performance difference with this change, as shown in the micro-benchmarks below.

```
Logit NNC Benchmark      Time (ns)
                         const-clamp    var-clamp
logit_nnc_sleef/64       550            543
logit_nnc_sleef/512      3514           3517
logit_nnc_sleef/8192     85537          82900
logit_nnc_sleef/32768    347635         337016
logit_nnc_fast/64        173            167
logit_nnc_fast/512       829            866
logit_nnc_fast/8192      13286          13069
logit_nnc_fast/32768     51116          53429
logit_nnc_vml/64         146            164
logit_nnc_vml/512        773            783
logit_nnc_vml/8192       11556          11563
logit_nnc_vml/32768      44815          46720
```

Test Plan: SR unit tests and the inline_cvr model.

Reviewed By: bertmaher

Differential Revision: D30405466

fbshipit-source-id: adb891fdae5746439931ce5f43165291fec08f52
2021-08-25 11:19:55 -07:00
a2399a76e1 [Static Runtime] Moved NNC operator definitions to separate files. (#63838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63838

Refactored NNC operator definitions code into separate files.

Made `TEWrapper` a class with a fixed set of methods and added separate definitions for them based on `TORCH_ENABLE_LLVM` to keep the same functionality as before.

Test Plan: Build and ran Static Runtime tests.

Reviewed By: hlu1

Differential Revision: D30405467

fbshipit-source-id: 606ef852bb820d5e23a0f8af1bf5dc122e90bceb
2021-08-25 11:18:32 -07:00
8a22d4fa5c [Reland] Replacing the p.data access in utils with tensor.set_. Passes both test_post_localSGD_optimizer_parity and test_periodic_model_averager tests (#63895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.
ghstack-source-id: 136593433
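
A minimal sketch of the replacement on plain tensors (the PR applies this inside the model-averaging utils):

```
import torch

t = torch.zeros(3)
new_values = torch.ones(3)

# Discouraged: t.data = new_values  -- the .data field is slated for deprecation.
t.set_(new_values)   # t now shares new_values' storage
print(t)             # tensor([1., 1., 1.])
```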

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: SciPioneer

Differential Revision: D30526178

fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
2021-08-25 11:12:55 -07:00
ab954cb0d1 clean up engine.cpp thread state (#63115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63115

This actually changes:
- callbacks now run with proper grad mode even in worker threads
- graphtask's Future callbacks now run with proper TLS when erroring
  out from a worker thread

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388100

Pulled By: albanD

fbshipit-source-id: 7ae9c461c2f0040548dd9e1e314f25e8da0c2e67
2021-08-25 11:08:43 -07:00
c06dfd7c26 [fx2trt] Check input device in TRTModule (#63893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63893

Add a check to ensure all the inputs are on cuda device.

Test Plan: CI

Reviewed By: kflu, houseroad

Differential Revision: D30525265

fbshipit-source-id: 6e50b70fd535defc1f802d51e8bb991b2dd73741
2021-08-25 10:25:34 -07:00
6324d98e9e bf16 Error message cleanup as well as addition of is_bf16_supported (#63798)
Summary:
ngimel
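
A sketch of the intended use of the new helper:

```
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    x = torch.zeros(8, device="cuda", dtype=torch.bfloat16)
else:
    x = torch.zeros(8)  # fall back to float32 where bf16 isn't supported
```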

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63798

Reviewed By: heitorschueroff

Differential Revision: D30526187

Pulled By: ngimel

fbshipit-source-id: c484aec14638097c96c720095d3491249b6b2d14
2021-08-25 09:59:59 -07:00
eebac46282 [pruner] add getter for pruned outputs in base pruner (#63520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63520

Rather than having to call `module.parametrizations.weight[0].pruned_outputs` each time we need to access the set of pruned indices, we add a getter `get_module_pruned_outputs` which takes the module as an argument and returns the set.

This is used for testing.
ghstack-source-id: 136561130

Test Plan:
` buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1N4gK

Reviewed By: z-a-f

Differential Revision: D30374558

fbshipit-source-id: e38dfee0879cadde52b942e899a3d8d7151ee493
2021-08-25 09:57:29 -07:00
83b132b112 [pruner] add support for pruning BatchNorm2d (#63519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63519

If the pruner should be pruning biases along with weights, then if the model has BatchNorm2d following pruned Conv2d layers, then the corresponding channels of the BatchNorm must also be pruned.

Specifically, they need to be zeroed out rather than fully removed, since in eager mode the dimensions between layers need to be preserved.

To do this, we add a pruning parametrization called `ZeroesParametrization` which zeroes out pruned channels, rather than removing them.

The user must provide in the config, a tuple of the Conv2d and BatchNorm layers that go together. The `prepare` method will add the tuple to the `module_groups`; then it will add a PruningParametrization to the Conv2d layer, and a ZeroesParametrization to BatchNorm, and then set their pruned sets to be the same set. That way, during `step`, both masks are updated with the same pruned indices.
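
An illustrative zeroing parametrization in the spirit of `ZeroesParametrization`, built on the public `torch.nn.utils.parametrize` API (class and variable names here are assumptions, not the pruner's actual code):

```
import torch
from torch import nn
from torch.nn.utils import parametrize

class ZeroChannels(nn.Module):
    def __init__(self, pruned):
        super().__init__()
        self.pruned = sorted(pruned)

    def forward(self, w):
        mask = torch.ones_like(w)
        mask[self.pruned] = 0  # zero out pruned channels, keep dimensions
        return w * mask

bn = nn.BatchNorm2d(4)
parametrize.register_parametrization(bn, "weight", ZeroChannels({1, 3}))
print(bn.weight)  # channels 1 and 3 read as zero; the shape is unchanged
```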

ghstack-source-id: 136562278

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1N1P6

Reviewed By: z-a-f

Differential Revision: D30349855

fbshipit-source-id: 3199d3688d5a70963f9b32d7a8fdac3962ae6a65
2021-08-25 09:56:19 -07:00
c1dfd58715 Minor OptionalTensorRef updates (#63611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63611

A few minor updates to `OptionalTensorRef`:
1. use `Tensor`'s `unsafe_borrow_t` constructor, which avoids an unnecessary `nullptr` check.
2. copy constructor cannot defer to the `const Tensor&` constructor because it checks the tensor is
defined, and so would fail for disengaged optionals.
3. use copy-swap idiom to avoid issues with self-assignment. `x = x` should be a no-op, but the old
version would clear `x`.
4. Add pointer-like access for consistency with `optional` and `MaybeOwned`

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30484704

Pulled By: ezyang

fbshipit-source-id: 738f4bd22359eaecd0a519a04e89a4b44d92da5b
2021-08-25 09:37:02 -07:00
5ab356ffe6 Update CMake minimum version to 3.10 (#63660)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63660

Test Plan: Imported from OSS

Reviewed By: janeyx99, mruberry

Differential Revision: D30543878

fbshipit-source-id: a7d938807653f39727f2cc7d7ca167200567b6a0
2021-08-25 09:25:43 -07:00
34ed16ffef Temporary fix for remote gpu execution issue (#63899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63899

See: T99020845

Test Plan: sandcastle

Reviewed By: heitorschueroff

Differential Revision: D30527384

fbshipit-source-id: ce9933e5e181322c02d4ed17f3fdaabe4c5ba29e
2021-08-25 09:14:03 -07:00
01c35115d8 Fix bug in check_empty_containers (#63492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63492

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30402749

Pulled By: ansley

fbshipit-source-id: 7de533355fe91ca4f45b2bafc3bfb205a028c1ed
2021-08-25 09:05:08 -07:00
8c897d254d Swap CUDA 11.1 and 11.3 in CI to make 11.1 periodic (#63900)
Summary:
Preparing for supporting 11.3 in the next release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63900

Reviewed By: malfet

Differential Revision: D30541437

Pulled By: janeyx99

fbshipit-source-id: a7297da7f7818a4291b1c321d62d76fc2c0f1f90
2021-08-25 09:01:26 -07:00
3926fdbaa4 [skip ci] Add generated comment to ruleset json (#63896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63896

Reviewed By: heitorschueroff

Differential Revision: D30529820

Pulled By: zhouzhuojie

fbshipit-source-id: 7529803af23ea36a7bcb673cd399da80da8e3feb
2021-08-25 08:53:33 -07:00
87a661c79f Revert D30526034: [pytorch][PR] compute reduction intermediate buffer size in elements
Test Plan: revert-hammer

Differential Revision:
D30526034 (e69a1398cb)

Original commit changeset: 0aca7f887974

fbshipit-source-id: a22472723818d6fe0c11a6e134080df1ac408038
2021-08-25 07:17:22 -07:00
839eaa2e91 Revert D30384746: [fx2trt] Add a test for quantized resnet18
Test Plan: revert-hammer

Differential Revision:
D30384746 (10dfa58eba)

Original commit changeset: 1a8638777116

fbshipit-source-id: b93235323e229b391f5456f6e3543988062dd0d4
2021-08-25 00:43:06 -07:00
10dfa58eba [fx2trt] Add a test for quantized resnet18 (#63446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63446

Add a test for quantized resnet18 running in TensorRT

Test Plan: buck run mode/opt -c python.package_style=inplace caffe2:fx2trt_quantized_resnet_test

Reviewed By: 842974287

Differential Revision: D30384746

fbshipit-source-id: 1a863877711618cd23d887694269ed9e44ee606c
2021-08-24 21:34:23 -07:00
0301c3bc01 [quant][graphmode][fx] Make maxpool and flatten produce the reference pattern (#63501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63501

Currently some of the ops are considered to work with both float and quantized input,
so we may have patterns like "quant - some_op - dequant". This might not work well with the backend,
and we may consider changing everything to produce "quant - dequant - some_op - quant - dequant" instead
in the future. This PR fixes it for maxpool and flatten only, to unblock resnet benchmarking on TensorRT.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: mruberry

Differential Revision: D30402788

fbshipit-source-id: 892c5ff6552775070e2c1453f65846590fb12735
2021-08-24 21:31:01 -07:00
d388a1a5df [TensorExpr] LLVMCodegen: Use addFnAttr instead of addAttribute which was deleted. (#63886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63886

cc gmagogsfm

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30523135

Pulled By: ZolotukhinM

fbshipit-source-id: 62e125f917b2a0153eb30879d93cf956587a05e0
2021-08-24 21:23:06 -07:00
c8527bc398 [qunat][graphmode][fx] Add a separate lower_to_native_backend function for relu (#62861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62861

This PR adds a lower_to_native_backend function to lower a quantized reference model
to a model that uses fbgemm/qnnpack ops. We'll gradually add support and remove
the fbgemm/qnnpack specific handling in quantization_patterns.py

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30165828

fbshipit-source-id: de1149cd7e7c1840c17c251cd4d35004afd015b7
2021-08-24 21:07:03 -07:00
e69a1398cb compute reduction intermediate buffer size in elements (#63885)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63869
`iter` strides are in bytes, and we were additionally multiplying the size computed using those strides by `sizeof(arg_t)`. Computing `output_memory_size` in elements is enough.
This doesn't fix the real underlying problem of allocating a large intermediate tensor, but it makes this tensor smaller, typically by a factor of 4.
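
The bookkeeping error as back-of-the-envelope arithmetic (a sketch; a `sizeof(arg_t)` of 4 corresponds to a float accumulator and the factor of 4 above):

```
num_outputs = 1024
sizeof_arg_t = 4                                     # e.g. float accumulator

size_from_byte_strides = num_outputs * sizeof_arg_t  # strides already count bytes
buggy_bytes = size_from_byte_strides * sizeof_arg_t  # scaled by sizeof twice
fixed_bytes = num_outputs * sizeof_arg_t             # count elements, scale once

assert buggy_bytes == 4 * fixed_bytes
```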

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63885

Reviewed By: mruberry

Differential Revision: D30526034

Pulled By: ngimel

fbshipit-source-id: 0aca7f887974b7776e380463bbd82d32a5786ee8
2021-08-24 19:39:21 -07:00
ba126df614 TST Adds more modules into common module tests (#62999)
Summary:
This PR moves some modules into `common_modules` to see what it looks like.

While migrating some no-batch modules into `common_modules`, I noticed that `desc` is not used for the name. This means we cannot use `-k` to filter tests. This PR moves the sample generation into `_parametrize_test` and passes the already-generated `module_input` into users of `modules(modules_db)`.

I can see this is a little different from opsinfo and would be happy to revert to the original implementation of `modules`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62999

Reviewed By: heitorschueroff

Differential Revision: D30522737

Pulled By: jbschlosser

fbshipit-source-id: 7ed1aeb3753fc97a4ad6f1a3c789727c78e1bc73
2021-08-24 19:16:32 -07:00
544af391b5 Allow arbitrary objects in state_dicts (#62976)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62094

Introduces functionality for adding arbitrary objects to module state_dicts. To take advantage of this, the following functions can be defined on a module:
* `get_extra_state(self) -> dict` - Returns a dict defining any extra state this module wants to save
* `set_extra_state(self, state)` - Subsumes the given state within the module

Under the hood, a sub-dictionary is stored in the state_dict under the key `_extra_state` for each module that requires extra state, as sketched below.
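
A minimal sketch of the new hooks in use:

```
import torch
from torch import nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return x

    def get_extra_state(self):
        return {"calls": self.calls}

    def set_extra_state(self, state):
        self.calls = state["calls"]

m = Counter()
m(torch.zeros(1))
sd = m.state_dict()        # includes an '_extra_state' entry for Counter
fresh = Counter()
fresh.load_state_dict(sd)  # restores fresh.calls == 1
```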

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62976

Reviewed By: heitorschueroff

Differential Revision: D30518657

Pulled By: jbschlosser

fbshipit-source-id: 5fb35ab8e3d36f35e3e96dcd4498f8c917d1f386
2021-08-24 19:06:14 -07:00
58ef99bd5a TST Adds pickle testing for ModuleInfo (#63736)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/61935

This PR adds `test_pickle` to `test_modules`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63736

Reviewed By: heitorschueroff

Differential Revision: D30522462

Pulled By: jbschlosser

fbshipit-source-id: a03b66ea0d81c6d0845c4fddf0ddc3714bbf0ab1
2021-08-24 19:04:46 -07:00
8dda299d96 Re-apply: [nnc] Support thread level parallelism in fused kernels (#63776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63776

I reverted this out of an abundance of caution because some test
failures occurred, but they were all due to precision issues fixed lower in
this stack.  Let's try again.

I've rolled the elimination of the allow-parallelism-in-fusions toggle into
this diff since they're pretty tightly coupled.
ghstack-source-id: 136529847

Test Plan: CI

Reviewed By: huiguoo

Differential Revision: D30484555

fbshipit-source-id: 38fd33520f710585d1130c365a8c60c9ce794a59
2021-08-24 18:56:55 -07:00
1787b905c4 Don't switch executors mid test (#63830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63830

It's really not safe to change the executor out from under models that may have
already been partially compiled.
ghstack-source-id: 136526228

Test Plan:
```
DEBUG=1 CFLAGS="-fsanitize=address" CXXFLAGS="-fsanitize=address" USE_LLVM=$(realpath ../llvm-project/install) CMAKE_PREFIX_PATH=$CONDA_PREFIX python setup.py install
LD_PRELOAD=/lib64/libasan.so.5 numactl -C3 pytest -v --cov --cov-report xml:test/coverage.xml --cov-append onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset11 -s
```

Reviewed By: desertfire

Differential Revision: D30504489

fbshipit-source-id: 188581cb53f0cf5bd3442d1e9d46e8c0c7e124f8
2021-08-24 18:56:53 -07:00
543130511a [nnc] Disable erf and erfc (#63775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63775

These introduce small accuracy differences that cause some internal
tests to fail, and it's not worth fixing the tests right now because they're
slower than the ATen ops anyways.
ghstack-source-id: 136526229

Test Plan:
```
buck test mode/dev //aml/eccv/mcm/training:tests -- --exact 'aml/eccv/mcm/training:tests - test_build_torch_script_model (aml.eccv.mcm.training.tests.publish_helper_tests.TransformerPredictorPublishHelperTests)'
```

Reviewed By: navahgar

Differential Revision: D30484557

fbshipit-source-id: 095a9c810539a499105b76e1d96843dbc61b0079
2021-08-24 18:55:45 -07:00
d454c9e76e Migrate THCTensor_copyIgnoringOverlaps to ATen (#63505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63505

This isn't a public operator, just a helper function used in CUDA_tensor_apply.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30441305

Pulled By: ngimel

fbshipit-source-id: 84fabc701cbd8479e02d80f373a3dd62d70df2ce
2021-08-24 18:50:28 -07:00
5b28e3c183 [quant][graphmode][fx] Add reference option support for binary ops (#62698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62698

We also removed the special handling in match_utils for binary ops

Test Plan:
python test/test_quantize.py TestQuantizeFx
python test/test_quantize.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30093781

fbshipit-source-id: 58cc972de8211a80dd4d111e25dc4ad36057933f
2021-08-24 18:22:11 -07:00
6fa646ad54 [StaticRuntime] Fix bug in HasInplaceOp (#63842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63842

Reviewed By: mikeiovine

Differential Revision: D30506914

fbshipit-source-id: b2e358cfb991dacdb295b61bbc37beb36b73b852
2021-08-24 17:07:45 -07:00
956c8fa01e Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654

Test Plan:
```
> buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 27.970

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 41.830

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 499.114

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 6.268

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 12.676

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 438.219

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 7.657

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 18.523

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 55.103

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 2.501

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 10.589

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 50.102
```

Reviewed By: ajyu

Differential Revision: D30455179

fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b
2021-08-24 16:26:26 -07:00
6d58c83007 Turn off layer norm in jit symbolic differentiation (#63816)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63816

Test Plan:
Confirmed this can rescue the NE:

https://www.internalfb.com/mast/job/torchx_xdwang-SparseNNApplication_72cf593d

Reviewed By: ngimel

Differential Revision: D30498746

fbshipit-source-id: 4a387f32ee2f70685de6104459c7f21bfbddc187
2021-08-24 15:47:13 -07:00
41ffec07ce Add a common autograd TLS state (#63860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63860

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30513253

Pulled By: albanD

fbshipit-source-id: 97d76ed54dfbdf4ba3fc7051ce3b9bb636cefb4b
2021-08-24 15:34:06 -07:00
865d127a66 .github: Enable with-ssh for Windows (#63440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63440

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D30521460

Pulled By: seemethere

fbshipit-source-id: e987e170e73fb4f9d9f024bed0e58404ed206848
2021-08-24 14:14:27 -07:00
4e37a015c7 [FX] Fix _replicate_for_data_parallel (#63821)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63821

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D30502115

Pulled By: jamesr66a

fbshipit-source-id: 0f004f95def6e1ba21ccbeab40cb0a739a0ad20c
2021-08-24 13:48:15 -07:00
5be17ec1fc Do not modify saved variables in-place for spectral norm during power iteration (#62293)
Summary:
Interestingly enough, the original code did have a mechanism that aims to prevent this very issue, but it performs a clone AFTER modifying u and v in-place.
This wouldn't work though because we can later use the cloned u and v in operations that save for backward, and the next time we execute forward, we modify the same cloned u and v in-place.
So if the idea is that we want to avoid modifying a saved variable in-place, we should clone it BEFORE the in-place operation.
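
A minimal repro of the hazard being fixed, on generic tensors rather than the spectral-norm code itself:

```
import torch

w = torch.randn(3, requires_grad=True)
v = torch.randn(3)

out = w * v       # mul saves `v` for the backward pass
v.div_(v.norm())  # in-place update mutates the saved tensor

try:
    out.sum().backward()
except RuntimeError as e:
    # "...modified by an inplace operation"; cloning `v` BEFORE the
    # in-place update avoids corrupting the saved value.
    print(e)
```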

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62293

Reviewed By: bdhirsh

Differential Revision: D30489750

Pulled By: soulitzer

fbshipit-source-id: cbe8dea885aef97adda8481f7a822e5bd91f7889
2021-08-24 13:08:59 -07:00
4a0776100e Migrate legacy lstsq from THC to ATen (CUDA) (#63504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63504

Closes gh-24592

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30441304

Pulled By: ngimel

fbshipit-source-id: ec176596f54bc084af48a73d1dbb0dcb82fec593
2021-08-24 12:47:16 -07:00
699c764d2e Revert D30513613: Removing tensor.data usage in utils with tensor set_ method
Test Plan: revert-hammer

Differential Revision:
D30513613 (d08a36f831)

Original commit changeset: 402efb9c30fa

fbshipit-source-id: 911c66a9852de77dc5274b5fb373258c0c97739a
2021-08-24 12:20:37 -07:00
835dac0869 Merge common fields from TensorInitParams and ShardedTensorMetadata into TensorProperties (#63731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63731
1) Follow-up to the [last comment on PR 63378](https://github.com/pytorch/pytorch/pull/63378#discussion_r693143053)
2) Also updated the caller side (usage of ShardedTensorMetadata) in fbcode

Ref: [landing workflow 3](https://www.internalfb.com/intern/wiki/PyTorch/PyTorchDev/Workflow/Landing/#landing-your-prs-from-gi-1)

Test Plan:
Imported from OSS

OSS: (pytorch).. $ python test/distributed/_sharded_tensor/test_sharded_tensor.py --v
FB:  fbcode $ buck test mode/dev //aiplatform/modelstore/checkpointing/pyper/tests:checkpoint_utils_test

Reviewed By: wanchaol, heitorschueroff

Differential Revision: D30472281

fbshipit-source-id: 727fb0e7f10eab4eb7a10476194e9008f2ac1fb5
2021-08-24 11:49:06 -07:00
d08a36f831 Removing tensor.data usage in utils with tensor set_ method (#63867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.

ghstack-source-id: 136531233

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager

Reviewed By: SciPioneer

Differential Revision: D30513613

fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
2021-08-24 11:20:44 -07:00
73431449b3 update readme and contributing.md (#63843)
Summary:
1. In fact, Visual Studio isn't supported as a CMake generator
2. I was asked many times why there's an error like 'Could NOT find OpenMP'
3. Add the newly added Best Practices link in contributing.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63843

Reviewed By: seemethere, heitorschueroff

Differential Revision: D30514095

Pulled By: janeyx99

fbshipit-source-id: 76715a1d8c049122546e5a7778cafe54e4dfd5d6
2021-08-24 10:52:11 -07:00
e6dc7bc61b Subprocess encoding fixes for cpp extension (#63756)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63756

Reviewed By: bdhirsh

Differential Revision: D30485046

Pulled By: ezyang

fbshipit-source-id: 4f0ac383da4e8843e2a602dceae85f389d7434ee
2021-08-24 10:46:11 -07:00
14d4723abd add bf16 support for bucketize (#55588)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55588

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28836796

Pulled By: VitalyFedyunin

fbshipit-source-id: c9ae5b969c30a45473533be5f29bb497f8da5143
2021-08-24 10:31:42 -07:00
1256dcd509 [pruner] modify base pruner to prune bias by default (#63202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63202

By default, the pruner will also prune biases, such that the whole output channel is removed. The user can manually set `also_prune_bias` to False when calling `prepare` if they don't want the bias to be pruned.
ghstack-source-id: 136466671

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1MV32

modify `fusion_tests` according to API change
`buck test mode/opt //scripts/kazhou:fusion_tests`

https://pxl.cl/1NbKz

Reviewed By: z-a-f

Differential Revision: D30294494

fbshipit-source-id: c84655648bee0035559195ca855b98fb7edaa134
2021-08-24 10:25:45 -07:00
16ba20507a [pruner] amend base pruner API to match base sparsifier (#63178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63178

Update base pruner API to match base sparsifier API as defined in D28970960 / PR58955

Changes include:
- `enable_mask_update = True` in `__init__`
- `prepare` takes model and config instead of constructor
- convert functionality renamed to `squash_mask`, `convert` method call now raises Error
- `activation_handles` and `bias_handles` initialized in `_prepare` instead of the constructor
ghstack-source-id: 136467595

Test Plan:
Function names updates according to changes

`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1MTgH

TODO will need to modify `fbcode/scripts/kazhou/fusion_tests.py` to use new API

Reviewed By: z-a-f

Differential Revision: D30287179

fbshipit-source-id: d4727bea1873b500f2d4bb784db26d532bf26cce
2021-08-24 10:25:43 -07:00
5dee15401c [pruner] refactor ActivationReconstruction forward hooks (#63158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63158

Combined functionality for `ActivationReconstruction` for both Linear and Conv2d in one class. The only difference between the old classes was the size and indexing of the reconstructed tensor -- that logic can be generalized by iterating over the size of `output`.
ghstack-source-id: 136467465

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1MSSv

Reviewed By: raghuramank100

Differential Revision: D30282765

fbshipit-source-id: 08a1e4e0650511019fff85cf52b41dd818b0c7f8
2021-08-24 10:24:29 -07:00
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
227cb268bc [Reland] Embedding thrust->cub migration (#63806)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63806

Reviewed By: bdhirsh

Differential Revision: D30498255

Pulled By: ngimel

fbshipit-source-id: 78b7085a92a168cf0163f53dcb712bac922f5235
2021-08-24 09:30:32 -07:00
94d621584a optimize BFloat16 elemwise operators CPU: sigmoid, sigmoid_backward, tanh_backward, addcmul, addcdiv (#55221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55221

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28836797

Pulled By: VitalyFedyunin

fbshipit-source-id: 6b79098c902ffe65d228668118ef36fb49bab800
2021-08-24 08:56:17 -07:00
33a163d886 Enable BFloat16 LeakyReLU and RReLU in CPU path (#61514)
Summary:
Enable and optimize BFloat16 LeakyReLU and RReLU in CPU path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61514

Reviewed By: ejguan

Differential Revision: D30257612

Pulled By: VitalyFedyunin

fbshipit-source-id: 8cc0d1faacd02dcc9827af724a86d95b6952748f
2021-08-24 08:34:56 -07:00
2ca2761f3c ENH Adds no_batch_dim for NLLLoss (#62651)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62651

Reviewed By: VitalyFedyunin

Differential Revision: D30303340

Pulled By: jbschlosser

fbshipit-source-id: 7ab478cf63bf6cd1f850cad5fd101e74a2cfe3f5
2021-08-24 08:27:27 -07:00
d3be02d100 fix batchnorm2d issue when input is non contiguous (#63392)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63392

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30476317

Pulled By: VitalyFedyunin

fbshipit-source-id: 03055a0aec21cf2c029b6f32315da2b09cb722d0
2021-08-24 08:24:01 -07:00
1385f9fb12 [JIT] Add variadic stack op (#63578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63578

Added a new op `prim::VarStack` and a pass that transforms instances of `aten::stack(list, dim)` into `prim::VarStack(list[0], ..., list[n], dim)`. Also provided a JIT interpreter implementation.

Most of the implementation/tests are the same as `prim::VarConcat`.

Test Plan: `buck test caffe2/test/cpp/jit:jit -- TestStackOpt`

Reviewed By: navahgar

Differential Revision: D30426232

fbshipit-source-id: 9829a7db6e0a5038c9b7528c43c25b0c221aa2ce
2021-08-24 08:20:54 -07:00
f4aff3a346 [BE] add distributed run_test options (#63147)
Summary:
Currently distributed tests are mixed within test_python.
We would like to split the distributed tests into their own batch, thus we need to split them out.

Adding an option to include/exclude distributed tests with CUSTOM_HANDLERS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63147

Test Plan:
- locally run with the addition run_test.py options.
- CI

Dependency: found a bug in mpiexec test and need https://github.com/pytorch/pytorch/issues/63580 to fix it first.

Reviewed By: bdhirsh

Differential Revision: D30496178

Pulled By: walterddr

fbshipit-source-id: 7903a57b619f2425028028f944211938823918a6
2021-08-24 08:03:01 -07:00
688f06cac3 Revert D30388099: Add a common autograd TLS state
Test Plan: revert-hammer

Differential Revision:
D30388099 (83d9bad44a)

Original commit changeset: 8e03f940150f

fbshipit-source-id: f6d60fec66e8292f5268335bb8a3e7e1a662f23b
2021-08-24 07:22:39 -07:00
9914fb6615 ENH Adds no_batch_dim tests/docs for LPPool1d and Identity (#62190)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62190

Reviewed By: ejguan

Differential Revision: D29942385

Pulled By: jbschlosser

fbshipit-source-id: 00df6f6f01ad039631bb8679f8de94863aac7650
2021-08-24 06:59:41 -07:00
83d9bad44a Add a common autograd TLS state (#63114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63114

This PR collapses the GradMode and InferenceMode thread local booleans into a single thread local uint8.
This helps reduce the number of thread-local variable accesses done when we propagate ThreadLocalStates.

Note that this is even more beneficial as we will add a forward mode AD TLS (similar to GradMode) higher in this stack and this new structure should reduce the perf impact of adding this new TLS.

Here is the full benchmark result between master and the top of this stack: https://gist.github.com/albanD/e421101e9ed344e94999bef3a54bf0f3
tl;dr: give a benefit in most cases. It is never detrimental.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30388099

Pulled By: albanD

fbshipit-source-id: 8e03f940150ff063c2edd792733663413ae2f486
2021-08-24 06:54:02 -07:00
c545b099aa Separating quantization test from distributed_test (#63058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63058

Dedicating separate tests to different quantization methods. Currently only the FP16 method is supported.
ghstack-source-id: 136499767

Test Plan: buck test mode/dev //caffe2/test/distributed/algorithms/quantization:quantization_gloo_fork -- name_of_the_test

Reviewed By: wanchaol

Differential Revision: D30142580

fbshipit-source-id: 3aacec1a231a662067d2b48c001f0c69fefcdd60
2021-08-24 01:44:55 -07:00
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
4e15a6f495 [TensorExpr] Switch Exprs and Stmt from kernel-arena to shared_ptr. (#63216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63216

Currently there are three classes managed by KernelArena: Expr, Stmt,
and Tensor (and derived classes). KernelArena has been a long-standing
pain point for NNC devs, and we're moving away from that memory-management
model to a ref-count-based memory model (using shared_ptr). This commit
switches Expr and Stmt to shared_ptr and is the biggest change in this
transition. Later commits will detach Tensor from KernelArena and kill
the arena + scope altogether.

Differential Revision: D30353195

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 9575225ada3d0fb65087ae40435f3dfea4792cae
2021-08-24 00:32:11 -07:00
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
5b7cdc5a3d add channels last for GroupNorm (#49821)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49821

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007053

Pulled By: VitalyFedyunin

fbshipit-source-id: 34a48d5d3b66a159febf3c3d96748fbaba1b9e31
2021-08-23 22:54:59 -07:00
f5d585391d Add ROCm as a platform for which tests can be disabled (#63813)
Summary:
Realized we were missing ROCm as a platform on which one could disable a flaky test. (like how this issue specifies windows https://github.com/pytorch/pytorch/issues/61655)

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63813

Reviewed By: seemethere

Differential Revision: D30498478

Pulled By: janeyx99

fbshipit-source-id: f1abe8677e1ddd01de3291e1618272ad8e287dc4
2021-08-23 18:50:04 -07:00
d96ef8c1b1 [Static Runtime] SR clones graph input (#63704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63704

Previously SR did not clone the graph. This led to subtle bugs in `testStaticRuntime`; static runtime would modify its graph, and the graph used by the JIT interpreter would change as well. The JIT interpreter would then crash if SR-only ops were added!

Cloning the graph is more consistent with the behavior of the `Module` ctor.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D30463294

fbshipit-source-id: b771551a1f55f95fde79373b23babcf3e5ddf726
2021-08-23 18:45:41 -07:00
195c60d844 [fx2trt] Add acc op and converter for torch.pow (#63795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63795

att

Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_binary_ops

Reviewed By: jackm321, wushirong

Differential Revision: D30492488

fbshipit-source-id: 6d615770567b13720316f06fd2f866ea2fdc2995
2021-08-23 18:18:31 -07:00
e1bdebf685 Adding DataLoader2 class as future replacement of DataLoader (#63742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63742

Supports sharding and batching on the loader level.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30494506

Pulled By: VitalyFedyunin

fbshipit-source-id: 6648e09d955055ac38e3a4e3973f701acefca762
2021-08-23 18:09:07 -07:00
fc07489ec5 [BE] Enable PostLocalSGD tests on windows (#63463)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63463

Now that `torch.distributed.optim` gates DistributedOptimizer on RPC availability, local sgd optimizer can be used on windows.
ghstack-source-id: 136437632

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D30358922

fbshipit-source-id: 9b56aebf1075f026637296d338805ad8851c9d40
2021-08-23 17:49:03 -07:00
16a4434422 [BE] Enable functional optim tests for windows (#63462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63462

Now that `torch.distributed.optim` gates DistributedOptimizer on RPC availability, these tests can be run on windows.
ghstack-source-id: 136437635

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30358923

fbshipit-source-id: 36739bdfe7214789f17de652d30c62c2bc124c73
2021-08-23 17:49:01 -07:00
630ec2e190 [fx_acc] Add mapper for torch.log1p (#63792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63792

Map `torch.log1p` to `acc_ops.add` + `acc_ops.log`.
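
In plain PyTorch terms, the mapping decomposes log1p into the two ops already supported (note that log1p exists because log(x + 1) is less accurate for tiny x, so the decomposition trades a little precision):

```python
import torch

def log1p_decomposed(x: torch.Tensor) -> torch.Tensor:
    # what the acc_ops mapping lowers to: an add followed by a log
    return torch.log(torch.add(x, 1.0))

x = torch.rand(8)
assert torch.allclose(torch.log1p(x), log1p_decomposed(x))
```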

Test Plan: buck test mode/opt glow/fb/fx/oss_acc_tracer:test_acc_tracer -- test_log1p

Reviewed By: wushirong

Differential Revision: D30491706

fbshipit-source-id: bcbeddf06131113185d2019cfd7cf5e9193a8a78
2021-08-23 17:48:59 -07:00
e4f44bec27 Fix pocketfft include path in mobile build (#63714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63714

PocketFFT was disabled for CMake < 3.9, but CMake 3.11 is the first version to support `INCLUDE_DIRECTORIES` as a target property. So updating to CMake 3.10 causes the mobile builds to fail. Instead of limiting the CMake support, this just adds the include directory to the entire target.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30498369

Pulled By: malfet

fbshipit-source-id: 83372e29c477c97e7015763b7c29d6d7e456bcef
2021-08-23 17:48:57 -07:00
fc47497905 Simplify ccache instructions in CONTRIBUTING.md (#62549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62549

When building CUDA files with native CMake support, it will respect the
`CMAKE_CUDA_COMPILER_LAUNCHER` setting. So, there's no need for symlinks.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30498488

Pulled By: malfet

fbshipit-source-id: 71c2ae9d4570cfac2a64d777bc95cda3764332a0
2021-08-23 17:47:38 -07:00
d9231dc3df Skip archiving useless build artifacts (#63785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63785

We currently zip up everything in `build/`, which includes a lot of cruft (`.o` files, random things copied in from dependencies, etc.). This makes the artifact bigger (slower upload/download times) and takes about 1.5 minutes to archive. This change makes archiving take ~15 seconds instead and removes the 50-second upload-to-GitHub step, which isn't as useful now that we have the HUD PR page that lists out all artifacts.

Test Plan: Imported from OSS

Reviewed By: seemethere, janeyx99

Differential Revision: D30494444

Pulled By: driazati

fbshipit-source-id: 93202dba7387daeb4859a938110b02ff2dc2ccc4
2021-08-23 17:40:01 -07:00
172e5c76ab Fix some memory bugs in onnx passes (#63754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63754

Running onnx tests with ASAN uncovers several memory errors.  These two are caused by: (1) iterating the uses list of a node after mutation, and (2) accessing the `blocks` attribute of a possibly deleted node.
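
Bug class (1) is the classic mutate-while-iterating error; in ordinary Python terms the fix looks like this (pure illustration, not the actual pass code):

```python
# `uses` stands in for a node's use list in the graph
uses = ["use0", "use1", "use2"]

# BUG: removing elements from the list being iterated skips entries
# (and in C++ it invalidates the iterators outright, which is what ASAN caught)
# for u in uses:
#     uses.remove(u)

# FIX: iterate over a snapshot, mutate the original
for u in list(uses):
    uses.remove(u)
assert uses == []
```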

To reproduce (this is on a CentOS 7 box):
```
DEBUG=1 CFLAGS="-fsanitize=address" CXXFLAGS="-fsanitize=address" USE_LLVM=$(realpath ../llvm-project/install) CMAKE_PREFIX_PATH=$CONDA_PREFIX python setup.py install
LD_PRELOAD=$(realpath /lib64/libasan.so.5) numactl -C3 pytest -v --cov --cov-report xml:test/coverage.xml --cov-append onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset11 -s
```

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30493939

Pulled By: bertmaher

fbshipit-source-id: e16e19dc9b4c9896e102ca8bf04c8bedfdde87af
2021-08-23 17:31:45 -07:00
fc6dd0bc00 [JIT] Move UseVariadicCat internals (#63577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63577

Since other variadic ops will have an almost identical implementation, we can generalize the `UseVariadicCat` implementation and put it in a common folder.

Also moved some test utilities that other variadic op tests will likely need.

Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOptTest`

Reviewed By: navahgar

Differential Revision: D30409937

fbshipit-source-id: 925c11c27b58ce98cb8368d2a205e26ba66d3db9
2021-08-23 17:30:36 -07:00
130549d61b Fix typo in NNAPI tests (#63797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63797

The NNAPI memory format test has a typo.

Test Plan:
pytest test/test_nnapi.py::TestNNAPI

Imported from OSS

Reviewed By: Amyh11325

Differential Revision: D30495473

fbshipit-source-id: 8edad7c01a080847a64a2797e077ec4d6077552a
2021-08-23 16:34:24 -07:00
84890aae35 [Static Runtime] Add an out variant op for aten::abs (#63675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675

This change adds an out variant implementation for `aten::abs`.
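
For context, the "out variant" pattern at the Python level writes into a preallocated output instead of allocating on every call, which is what lets Static Runtime reuse memory across iterations:

```python
import torch

x = torch.randn(3)
out = torch.empty(3)       # allocated once, reused across iterations
torch.abs(x, out=out)      # writes into `out`, no fresh allocation
```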

Test Plan:
- Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)`

- Perf impact: TBD

Reviewed By: hlu1

Differential Revision: D30461317

fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191
2021-08-23 16:25:10 -07:00
55f8f95ad4 fix git diff issue (#63408)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60111, ideally we should merge this before https://github.com/pytorch/pytorch/issues/63360 but we can also test this with https://github.com/pytorch/pytorch/issues/63360 easily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63408

Test Plan:
- This is confirmed working with a local test.sh run by setting PR_NUMBER
- should be validated by GHA CI as well

Concern:
- currently GHA CI is consistently running into a proxy 403 rate-limit-exceeded issue. However, the worst case is not generating any git diff files, which is exactly the same as the current behavior.
- depends on https://github.com/pytorch/pytorch/issues/63770.

Reviewed By: driazati, janeyx99

Differential Revision: D30489355

Pulled By: walterddr

fbshipit-source-id: a638b7ae5820f29a7aca6cc40ff390ab253cb174
2021-08-23 15:38:18 -07:00
49be16d50a .github: Add ec2 information as a step (#63784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63784

Also creates the common.yml.j2 file as a place to store common code
amongst the templates

Should look like:
![image](https://user-images.githubusercontent.com/1700823/130495226-f18b8c0f-1ea7-4097-8bbb-e998fabb71f2.png)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, driazati

Differential Revision: D30490682

Pulled By: seemethere

fbshipit-source-id: 18028b4acff938ef54cd6e4877561b2d830a11cf
2021-08-23 15:04:04 -07:00
7946f8a9f6 Rename DataPipe to Op-er (#63325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63325

Rename each DataPipe to an operation name ending in -er. The functional API should remain a verb, such as `read_from_tar`, `shuffle`, ... (Discussed [here](https://github.com/facebookexternal/torchdata/pull/97#discussion_r688553905))
- Batch -> Batcher
- Collate -> Collator
- Concat -> Concater
- GroupByKey -> ByKeyGrouper (?)
- ListDirFiles -> FileLister
- LoadFilesFromDisk -> FileLoader
- Map -> Mapper
- ReadFilesFromTar -> TarArchiveReader
- ReadFilesFromZip -> ZipArchiveReader
- ReadLinesFromFile -> LineReader
- Shuffle -> Shuffler
- ToBytes -> StreamReader
- Transforms -> Transformer
- Zip -> Zipper

Let me know if you have a better name for any of these DataPipes.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30466950

Pulled By: ejguan

fbshipit-source-id: 72909dca7b3964ab83b965891f96cc1ecf62d049
2021-08-23 14:36:10 -07:00
a781340bf7 Add equality constraints for some acc opeartions for symbolic inference (#63689)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63689

Test Plan:
buck run mode/opt-clang caffe2/torch/fb/model_transform/experimental:fx_ir_lower_inline_cvr -- \
    --action=lower_and_run \
    --filename=inline_cvr_7x_dec_2020.model \
    --print_glow_glog=True

Reviewed By: jamesr66a

Differential Revision: D30462113

fbshipit-source-id: 0b2a1ce9770561248527d47c07b80112491dc949
2021-08-23 14:11:08 -07:00
0bc7fef406 [Static Runtime] Remove unused fusion patterns (#63636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63636

Reviewed By: d1jang

Differential Revision: D30446573

fbshipit-source-id: 3abb7f697380f3b4e865b98c594de359b5e26b96
2021-08-23 12:55:09 -07:00
a709ab34a8 [nnc] Re-enable CPU fusion (#63665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63665

This reverts commit 125e2d02e575612eb427104e7c67f1c28f090db8.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30471646

Pulled By: bertmaher

fbshipit-source-id: 4189869566f03b5f9ada78d78830f6a34946eed6
2021-08-23 12:42:42 -07:00
560cd88195 Kill THCUNN (#63429)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63429

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30441308

Pulled By: ngimel

fbshipit-source-id: 3ae342a2f8d5c7f8827b637c4055c5d1b0a1be26
2021-08-23 12:07:16 -07:00
db1b27fa8d fix mpi ssh runtime error (#63580)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/60756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63580

Test Plan:
- this CI.
- validated by running on the bionic_cuda container: https://app.circleci.com/pipelines/github/pytorch/pytorch/366632/workflows/478602fb-698f-4210-ac09-d9c61af5c62b/jobs/15472104

Reviewed By: malfet

Differential Revision: D30486472

Pulled By: walterddr

fbshipit-source-id: d83ab88d163d4a468f03961a13d891b658668a7f
2021-08-23 09:45:33 -07:00
98449f5bba hotfix clone issue (#63770)
Summary:
This was discovered during https://github.com/pytorch/pytorch/issues/63408. For some reason, this checkout action alone does not set fetch-depth correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63770

Reviewed By: malfet, janeyx99

Differential Revision: D30486110

Pulled By: walterddr

fbshipit-source-id: a67395cca2487407ed0d49c8c89587935ca5f212
2021-08-23 09:30:48 -07:00
f1d865346f [ONNX] add test images to repo (#63717)
Summary:
This is better than the status quo:
* Test doesn't download files from the internet -> faster and more
  reliable.
* Test doesn't leave the git working directory dirty.

Rather than using the original images, I've copied some images from
the pytorch/vision repo. This will keep the tests in the two repos
in sync, while avoiding adding new assets to the vision repo.

See https://github.com/pytorch/vision/pull/4176.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63717

Reviewed By: janeyx99

Differential Revision: D30466016

Pulled By: malfet

fbshipit-source-id: 2c56d4c11b5c74db1764576bf1c95ce4ae714574
2021-08-23 07:43:21 -07:00
bafd875f74 Allow implementing either backward or vjp for Function (#63434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63434

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30431968

Pulled By: albanD

fbshipit-source-id: 0bb88664283486a9fd3364e6c3d79442a44625c2
2021-08-23 07:07:11 -07:00
726fd26b3e Update ROCm PyTorch persons of interest (#55206)
Summary:
cc jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55206

Reviewed By: VitalyFedyunin

Differential Revision: D30296584

Pulled By: dzhulgakov

fbshipit-source-id: 6e5c610cc6b7c7fd58b80fa3f9de31f269341a88
2021-08-22 22:31:09 -07:00
d6133b2fe6 Remove _fork_processes from common_distributed.py (#63711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63711

This removes `_fork_process` from common_distributed.py and fixes all
other callpoints to use `spawn_process` instead.
ghstack-source-id: 136395719

Test Plan: waitforbuildbot

Reviewed By: xush6528

Differential Revision: D30463834

fbshipit-source-id: 0c09e8a996d0e5b912c8cdd45488a39951bac4db
2021-08-22 18:57:12 -07:00
2289a12f21 Made FuncTorchBatched decompose CompositeImplicitAutograd (#63616)
Summary:
See https://github.com/facebookresearch/functorch/issues/56

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63616

Reviewed By: zou3519

Differential Revision: D30438316

Pulled By: Chillee

fbshipit-source-id: e84446d9f68b87daa0cfff75b3b8a972f36ec85a
2021-08-21 17:14:39 -07:00
e926f75b0b BatchNorm autodiff re-enabled (#57321)
Summary:
Turns on BN in autodiff:

1. outputs an empty tensor for running stats to bypass the autodiff issue with None;
2. fixes BN inference backward in cuDNN & MIOpen, where backward falls back to the native batchnorm kernel instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57321

Reviewed By: albanD, ngimel

Differential Revision: D30250419

Pulled By: jansel

fbshipit-source-id: a62553789c20fb50a820003a056f40d9d642dfaa
2021-08-21 09:07:31 -07:00
37d60c08e5 Revert D30360382: [nnc] Support thread level parallelism in fused kernels
Test Plan: revert-hammer

Differential Revision:
D30360382 (d6d86efb1c)

Original commit changeset: 29acf4e932c6

fbshipit-source-id: e0531113135d30eabb172dc1537d5dd6d65dc438
2021-08-21 03:46:43 -07:00
76da46ccdc Revert D30417127: Remove flag to toggle CPU fusion in the presence of parallelism
Test Plan: revert-hammer

Differential Revision:
D30417127 (6600bc9651)

Original commit changeset: b77d7c68364f

fbshipit-source-id: 6b52fb83a84fe241945e3cb3eeb71050d1d9c8f1
2021-08-21 03:38:07 -07:00
8871ff29b7 [sharded_tensor] add readonly tensor properties (#63679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63679

This PR add read only tensor properties to sharded tensor, to match the torch.Tensor behaviors.

Test Plan: test_sharded_tensor_metadata

Reviewed By: pritamdamania87

Differential Revision: D30459343

fbshipit-source-id: 9aec8ecfe76479eed25f3b843495e5719ed2956d
2021-08-20 22:17:11 -07:00
b2a601ffe5 [Static Runtime] Implement out variant for fb::quantized_linear (#63635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635

Reviewed By: ajyu

Differential Revision: D30446234

fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030
2021-08-20 21:42:22 -07:00
2d58f3f56d NNAPI: Support const values in binary ops
Summary:
The NNAPI converter previously failed with one const value and one tensor.
Code suggestions from dreiss.

Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_pointwise_binary

Imported from OSS

Reviewed By: anshuljain1

Differential Revision: D28893881

fbshipit-source-id: 59240373fb03c6fdafa4cb2fa4d8408dd20092f6
2021-08-20 21:10:26 -07:00
b4f5809db8 Migrate thnn_conv2d from THC to ATen (#63428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63428

Closes gh-24644, closes gh-24645

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30441307

Pulled By: ngimel

fbshipit-source-id: 9c3dec469c0525831ae398df261cf41b7df7e373
2021-08-20 18:29:02 -07:00
3ee1f81dce Extend _sharded_tensor constructor to support other ops like torch.ones (#63378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63378

a) Introduce InitCommonParams to wrap tensor creation params
b) Factor local tensor initialization into common_params so that the tensor value is not hard-coded in the ShardedTensor constructor
c) Add _sharded_tensor.ones(...) to exemplify (rough usage sketch below) - note the memory_format arg is not provided, to stay consistent with torch.ones
d) Follow-up: more ops like torch.full, torch.zeros, torch.rand
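
A rough usage sketch of (c); the exact signature is assumed from the description above, and the spec construction is illustrative:

```python
# hedged sketch: mirrors torch.ones, but takes a sharding spec first
from torch.distributed._sharding_spec import ChunkShardingSpec
from torch.distributed import _sharded_tensor

spec = ChunkShardingSpec(
    dim=0,  # shard rows across two ranks
    placements=["rank:0/cuda:0", "rank:1/cuda:1"],
)
st = _sharded_tensor.ones(spec, 10, 4)  # sharded analogue of torch.ones(10, 4)
```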

Test:
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py TestCreateTensorFromParams --v
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py TestShardedTensorChunked.test_create_sharded_tensor_with_ones --v
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py TestShardedTensorEnumerable.test_create_sharded_tensor_with_ones --v

Test Plan: Imported from OSS

Reviewed By: pritamdamania87, wanchaol

Differential Revision: D30359245

Pulled By: bowangbj

fbshipit-source-id: 85768fcb36e9d9d40213036884b1266930a91701
2021-08-20 17:11:34 -07:00
7c0f5b9aa4 [clang-tidy] Enable more folders (#63380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63380

Crosses off some more of #62011, see the test in the stacked PR #63381

Test Plan: Imported from OSS

Reviewed By: malfet, seemethere

Differential Revision: D30455843

Pulled By: driazati

fbshipit-source-id: d473545d05ffa0b2476968f0b1c55f3a16a2c755
2021-08-20 16:40:42 -07:00
e0fe5699c4 enable increment build for build_libtorch (#63074)
Summary:
Since issue https://github.com/pytorch/pytorch/issues/59859 is resolved, rerun_cmake in build_libtorch should no longer be hardcoded.

build_libtorch is necessary to generate a debug-version libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63074

Reviewed By: VitalyFedyunin, seemethere

Differential Revision: D30306705

Pulled By: malfet

fbshipit-source-id: f2077d334191f4973da0681560937bc8bab730c1
2021-08-20 16:30:34 -07:00
efe01c59e3 [Doc] Deprecation notice for only_inputs argument (#63631)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63544.

Changed docstring accordingly. I'm new here, not sure if the style is okay. Please check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63631

Reviewed By: ejguan

Differential Revision: D30459439

Pulled By: soulitzer

fbshipit-source-id: 8df3c509d1dd39764815b099ab47229550126cbe
2021-08-20 15:49:49 -07:00
bcf8e2f57e Remove breakpad from docker image (#63598)
Summary:
As of https://github.com/pytorch/pytorch/issues/63186 we're doing this properly via a third_party cmake build, so we don't need it here anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63598

Reviewed By: walterddr, malfet

Differential Revision: D30432250

Pulled By: driazati

fbshipit-source-id: d0d5db14355cf574e42c0d0ed786bb26230180bd
2021-08-20 15:48:39 -07:00
da0820e553 add BFloat16 operators on CPU: range, sinh, cosh, frexp, nan_to_num (#61826)
Summary:
Added BFloat16 support for range, sinh, cosh, frexp, and nan_to_num on CPU, and collected the benchmark data of these OPs(range, sinh, cosh, frexp, and nan_to_num) for BFloat16 and Float32 data type by using the operator_benchmark tool of PyTorch on the platform of Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz

Number of cores: 1 core, 28 cores(1 socket)
[cosh_sinh_benchmark.txt](https://github.com/pytorch/pytorch/files/6974313/cosh_sinh_benchmark.txt)
[frexp_benchmark.txt](https://github.com/pytorch/pytorch/files/6974315/frexp_benchmark.txt)
[nan_to_num_benchmark.txt](https://github.com/pytorch/pytorch/files/6974317/nan_to_num_benchmark.txt)
[range_benchmark.txt](https://github.com/pytorch/pytorch/files/6974318/range_benchmark.txt)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61826

Reviewed By: saketh-are

Differential Revision: D30257259

Pulled By: VitalyFedyunin

fbshipit-source-id: 394cd713e6394050a8c90b2160633beb675d71dd
2021-08-20 14:56:52 -07:00
a8de0d83fe empty caching allocator before test_avg_pool2d large subtest (#63528)
Summary:
Otherwise, an unrecoverable OOM occurs on MI25. Fixes broken ROCm CI test1.
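
The fix amounts to releasing the caching allocator's unused blocks before the memory-hungry subtest, along the lines of:

```python
import torch

torch.cuda.empty_cache()  # hand cached, unused blocks back to the device driver
```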

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63528

Reviewed By: malfet, zhouzhuojie

Differential Revision: D30459151

Pulled By: walterddr

fbshipit-source-id: 63e205c4f486fcbdd514cfb0ed8e38584f894585
2021-08-20 14:01:45 -07:00
b008bb4443 Include iostream in ProcessGroupMPI.cpp (#63656)
Summary:
As it uses `std::cerr`, which in turn results in compilation regression introduced by https://github.com/pytorch/pytorch/pull/61500
Fixes https://github.com/pytorch/pytorch/issues/63653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63656

Reviewed By: ejguan

Differential Revision: D30455824

Pulled By: malfet

fbshipit-source-id: 29f316e7f7fd8e7dcbee2666e7a985f25bf56515
2021-08-20 13:15:40 -07:00
07e41cf2d7 [easy]Unbreak caffe2benchmarking build (#63655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63655

ghstack-source-id: 136324310

Test Plan: buck build //fbobjc/Apps/Internal/Caffe2Benchmarking:Caffe2Benchmarking fbobjc/mode/iphonesimulator

Reviewed By: hl475, JacobSzwejbka

Differential Revision: D30455659

fbshipit-source-id: b6da6be4f89b6e84753ef0849ffedea04785034a
2021-08-20 12:57:27 -07:00
1dd648f1c4 [ONNX] Support torch.dot and torch.nn.utils.spectral_norm (#62596) (#62765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62765

Fixes #27723

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30375181

Pulled By: msaroufim

fbshipit-source-id: 715f4745899757ec405877980cd20c826028eb2c

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-08-20 12:46:56 -07:00
db0771b05d [ONNX] Update repeat_interleave for dynamic repeats (#59979) (#62764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62764

Fixes #58733

- Support dynamic interleave for cases with dynamic repeat values (example below)
- Moved the repeat_interleave symbolic from opset 11 to opset 13, since sequence output types for loop outputs are needed for this change
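
For reference, the dynamic case is the one where `repeats` is itself a tensor whose values are only known at runtime:

```python
import torch

x = torch.tensor([1, 2, 3])
repeats = torch.tensor([2, 0, 1])           # runtime-valued per-element counts
print(torch.repeat_interleave(x, repeats))  # tensor([1, 1, 3])
```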

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30375179

Pulled By: msaroufim

fbshipit-source-id: 787f96bf91d124fd0483761088c5f4ae930d96a9

Co-authored-by: Shubham Bhokare <shubhambhokare@gmail.com>
2021-08-20 12:46:54 -07:00
8760254911 [ONNX] Fix an issue that optimizations might adjust graph inputs unexpectedly. (#61280) (#62763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62763

This PR fixes an issue where the graph inputs might be updated when we export the model in inference mode.

When a model is exported in inference mode, some optimizations are applied. One side effect of these optimizations is that the graph inputs might be adjusted. Such optimizations include:

	1. Conv and BatchNorm op fusion.
	2. Do constant folding.

If the user sets export_params=False, or sets keep_initializers_as_inputs=True, it's highly likely that the user wants to provide the corresponding parameters or initializers as the inputs of the graph.
In such a situation, no matter whether the model is exported in inference mode or training mode, the exporter needs to prevent the above optimizations from adjusting the graph inputs. This way, the graph inputs match the inputs the user provided.

The changes in this PR add a common check to decide whether the above optimizations should be applied. From the values of the export_params and keep_initializers_as_inputs arguments, we infer whether the graph inputs are allowed to be adjusted.
If not, these optimizations are skipped, even if the other requirements are met.
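
A sketch of the affected export call (the model and filename are placeholders): with these argument values the exporter now leaves the graph inputs untouched rather than folding or fusing them away.

```python
import torch

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)

# The user intends to feed weights/initializers as graph inputs, so
# input-adjusting optimizations (Conv+BN fusion, constant folding) are skipped.
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    export_params=False,
    keep_initializers_as_inputs=True,
)
```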

Besides these code changes, the documentation of the parameters below has been updated so that users have clearer guidance when considering how to leverage them for different purposes:

	1. export_params
	2. training
	3. do_constant_folding
	4. keep_initializers_as_inputs

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30375183

Pulled By: msaroufim

fbshipit-source-id: 4db8b9695649eb32a3a0fefa950ee2e5651bdba0

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-08-20 12:46:52 -07:00
a65d1ae7cc [ONNX] Fix controlflow shape inference with contrib op (#60707) (#62762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62762

`ONNXShapeTypeInference` for node `n` is skipped if `n` is non ONNX namespace, or if `n` contains any non ONNX namespace nodes. This prevents controlflow nodes containing contrib ops from running `SpecialPostProcess`, which sets up correct node output shape/type information in rare cases.

This PR depends on opset 14 export https://github.com/pytorch/pytorch/pull/59486

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30375180

Pulled By: msaroufim

fbshipit-source-id: 5deacec39f091deb4d75ddd9e660e12fca7f16c5

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-08-20 12:45:53 -07:00
125e2d02e5 Revert D30417370: [nnc] Enable CPU fusion
Test Plan: revert-hammer

Differential Revision:
D30417370 (b9fc656cf2)

Original commit changeset: 84ce7a578a36

fbshipit-source-id: cd23774cdc3273fd72f8a05f1900eaf36f373e6b
2021-08-20 12:30:21 -07:00
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
71da114412 Revert D30426527: Adding DataLoader2 class as future replacement of DataLoader
Test Plan: revert-hammer

Differential Revision:
D30426527 (5a7133b87f)

Original commit changeset: e5905d3364c4

fbshipit-source-id: 794d8a4e9256ccff8cf894aee10eff6adc30d502
2021-08-20 12:06:52 -07:00
70a3210eca Add BinaryUfuncOpInfo and broadcasting tests (#61964)
Summary:
As proof of concept, this PR uses the new `BinaryUfuncOpInfo` in broadcasting tests for `add`, `sub`, `mul`, `div`, `floor_div`, and `true_div`.
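
The behavior under test, in miniature: binary ufuncs broadcast their operands' shapes.

```python
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
assert (a + b).shape == (3, 4)            # add broadcasts to the common shape
assert torch.mul(a, b).shape == (3, 4)    # and so do the other binary ufuncs
```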

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61964

Reviewed By: ngimel

Differential Revision: D30407734

Pulled By: mruberry

fbshipit-source-id: ada28994f43b0635f279f45a02ecba18bc8ee033
2021-08-20 11:44:15 -07:00
b9fc656cf2 [nnc] Enable CPU fusion (#63545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63545

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30417370

Pulled By: bertmaher

fbshipit-source-id: 84ce7a578a3678d5562bab99d1dc00330c4f72d1
2021-08-20 11:18:21 -07:00
6600bc9651 Remove flag to toggle CPU fusion in the presence of parallelism (#63514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63514

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30417127

Pulled By: bertmaher

fbshipit-source-id: b77d7c68364f2af73570740540f3b1152313016e
2021-08-20 11:18:19 -07:00
d6d86efb1c [nnc] Support thread level parallelism in fused kernels (#63386)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63386

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30360382

Pulled By: bertmaher

fbshipit-source-id: 29acf4e932c669ce0f35823faea9099bcd8119b6
2021-08-20 11:18:17 -07:00
c78ab28441 Add support for the ONNX Runtime Eager Mode backend (#58248)
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.

We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).

The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248

Reviewed By: astaff

Differential Revision: D30344992

Pulled By: albanD

fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
2021-08-20 11:17:13 -07:00
b95ce1591d Add docs describing saved tensor hooks (#62362)
Summary:
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (https://github.com/pytorch/pytorch/issues/52451), how to register packing / unpacking
hooks (https://github.com/pytorch/pytorch/issues/60975) and how to use default hooks (https://github.com/pytorch/pytorch/issues/61834)

Sister PR: https://github.com/pytorch/pytorch/issues/62361 (will add a link from autograd.rst to notes/autograd in whatever PR does not land first)
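
A minimal sketch of the mechanism the new section documents, assuming the `torch.autograd.graph.saved_tensors_hooks` context manager from the linked PRs:

```python
import torch

def pack(t):                     # called when autograd saves a tensor for backward
    return t.detach().clone()    # e.g. one could offload or compress here

def unpack(packed):              # called when backward needs the tensor again
    return packed

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()            # x is saved through pack()
y.backward()                     # ...and retrieved through unpack()
```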

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62362

Reviewed By: soulitzer

Differential Revision: D30453177

Pulled By: Varal7

fbshipit-source-id: f5759977b069ff0ef36a47b08856d297691a6caa
2021-08-20 11:10:51 -07:00
03cc46a0ac [fx2trt] Add layernorm plugin for dynamic shape (#63620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63620

Added layernorm dynamic plugin, so that it works when explicit batch dim is required. Needed for ig model.

Changed the way we create a plugin layer: from instantiating the plugin directly to using the plugin creator with a `PluginFieldCollection`.

Follow ups:
Another way to convert layernorm is by breaking it down to supported trt layers. T97398182

Test Plan: layernorm unittest

Reviewed By: yinghai

Differential Revision: D30138205

fbshipit-source-id: aebe021d8de818e20376634f30e84579b9807f9b
2021-08-20 10:52:42 -07:00
5f997a7d2f [PyTorch][Edge] Improve InflatableArgs for Bundled Inputs (#62368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62368

# Context
The bundled-inputs API accepts an expression in the form of the string InflatableArg.fmt, which is applied to the stored inputs to inflate them. InflatableArg.fmt provides the flexibility to apply a custom inflation transformation. When the input arguments to a function are not of Tensor type, TorchScript casts the inputs from type T to Optional[T] and expects the function to handle the Nullable (None) case as well. This becomes tricky to handle in one-line code or lambda functions.

We propose an alternative that allows InflatableArg to include the text of a TorchScript function, which is defined on the module as a helper and then used in the inflation expression. This can be provided via InflatableArg.fmt_fn. Please refer to pytorch/test/test_bundled_inputs.py for an example of how to use it.

Also refer JacobSzwejbka comment on the same [here](https://github.com/pytorch/pytorch/pull/62368#issuecomment-892012812)

# Mitigation
Allow InflatableArg to include the text of a TorchScript function that is defined on the module as a helper, then use that in its inflation expression.
ghstack-source-id: 135158680

Test Plan:
To run `test_dict_args`

```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource/fbcode] buck test //caffe2/test:test_bundled_inputs -- test_dict_args
Action graph will be rebuilt because files have been added or removed.
Building: finished in 5.4 sec (100%) 12180/12180 jobs, 0/12180 updated
  Total time: 5.8 sec
More details at https://www.internalfb.com/intern/buck/build/fafcf277-1095-4cba-978d-6022f0d391ad
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 5ef9de71-c1b1-406b-a6c0-3321c2368b8d
Trace available for this run at /tmp/tpx-20210727-163946.454212/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7036874465805934
    ✓ ListingSuccess: caffe2/test:test_bundled_inputs - main (11.365)
    ✓ Pass: caffe2/test:test_bundled_inputs - test_dict_args (test_bundled_inputs.TestBundledInputs) (12.307)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7036874465805934
```

To check the py code of TS module:
P433043973

Reviewed By: dreiss

Differential Revision: D29950421

fbshipit-source-id: c819ec5c94429b7fbf6c4beb0259457f169b08ec
2021-08-20 09:36:08 -07:00
5a7133b87f Adding DataLoader2 class as future replacement of DataLoader (#63523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63523

Supports sharding and batching on the loader level.
* #63522 Adding IterableAsDataPipe IterDataPipe (useful for tests and simple cases)

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30426527

Pulled By: VitalyFedyunin

fbshipit-source-id: e5905d3364c4880e720dd62fb066f08881c71a6e
2021-08-20 09:01:55 -07:00
99e28baeba Small custom function refactor which doesn't change anything (#63433)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63433

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30431970

Pulled By: albanD

fbshipit-source-id: 905fa4d2ddeca18005b1bcb13dd6f8a080327e7c
2021-08-20 08:44:23 -07:00
0f2c60f0e3 Adding IterableAsDataPipe IterDataPipe (#63522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63522

Supports sharding and batching on the loader level.
* **#63522 Adding IterableAsDataPipe IterDataPipe (useful for tests and simple cases)**

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30426528

Pulled By: VitalyFedyunin

fbshipit-source-id: 535b5cc1505bb58731fcca8170541ac5ee7bd417
2021-08-20 08:38:23 -07:00
ae901e372e [Static Runtime] Enable RemoveListMutation (#63536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63536

Enable a pass that transforms sequences like this:
```
li = []
li.append(1)
li.append(2)
```
into this:
```
li = [1, 2]
```
Initially I implemented this pass myself (D30387213), but I discovered that there is an existing pass that does the same thing.

Reviewed By: hlu1

Differential Revision: D30412970

fbshipit-source-id: 0810ef03480878d5039bd800a40f5fd31c2652ec
2021-08-20 06:15:41 -07:00
913c1f83f4 [Static Runtime] Add native op for aten::detach (#63625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63625

This change adds a static runtime's native op implementation for `aten::detach` op.

See the standard `aten::detach` implementation (https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp.html#_ZN2at6native6detachERKNS_6TensorE) for comparison.

Test Plan:
- Added `StaticRuntime.IndividualOps_Detach`.

- Observed

```
V0819 18:55:33.181188 3092034 impl.cpp:1398] Switch to native impl for node: %a.1 : Tensor = aten::detach(%input.1)
```

Reviewed By: hlu1

Differential Revision: D30443187

fbshipit-source-id: d6e0eadb1b817e0a126c4fc97526abc276ee8a17
2021-08-20 00:46:27 -07:00
bec75daa77 Update protobuf to 3.13.1 (#62571)
Summary:
Update bazel to 4.10.0

Update ASAN_SYMBOLIZER_PATH to llvm-7
Suppress `vptr` ubsan violations in `test_jit`
Fix ProtoBuf patching for ONNX which caused Windows builds to crash while attempting to free `std::string` allocated on stack

Fixes https://github.com/pytorch/pytorch/issues/62569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62571

Reviewed By: walterddr

Differential Revision: D30048685

Pulled By: malfet

fbshipit-source-id: 6462c1bef9c42318551d2cf906bbab41e1d4e1cd
2021-08-19 23:43:55 -07:00
d82667f7e2 [nnc] Updated sliceTail to do inplace mutation (#63532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63532

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30412184

Pulled By: navahgar

fbshipit-source-id: e7669d3b9d24e14501f3feb6505c88d1d42030c6
2021-08-19 22:55:30 -07:00
5e31a3b904 [nnc] Updated sliceHead to do inplace mutation (#63531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63531

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30412183

Pulled By: navahgar

fbshipit-source-id: 47ee9482a36e606788d28d22eee4edaca45ffa50
2021-08-19 22:54:05 -07:00
0a66d5b325 [PyTorch] Remove unnecessary iostream includes in headers (#61500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61500

libstdc++ defines a static variable called `std::__ioinit` in iostream that adds global constructor size overhead to each translation that includes iostream. To reduce the size overhead from that, we can often include ostream instead.
ghstack-source-id: 136163529

Test Plan: buildsizebot some mobile apps

Reviewed By: dhruvbird

Differential Revision: D29648016

fbshipit-source-id: 9c3139712c71248513cc5032d21e77f3ecbae8fe
2021-08-19 18:54:51 -07:00
b99a299c60 [PyTorch] Remove unused dump() methods in vec headers (#63533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63533

These methods don't seem to be used, and they use std::cout, which incurs a small code size overhead on platforms using libstdc++ due to std::__ioinit (see #61500). Seems like we can just delete them?
ghstack-source-id: 136163409

Test Plan:
CI

Reviwers: #sentinel, dhruvbird

Reviewed By: dskhudia

Differential Revision: D30412269

fbshipit-source-id: 380b9aa2f9aabc4107188b6b209d2afc1769c0ee
2021-08-19 18:53:49 -07:00
0b6cc8daf2 [PyTorch][Edge] Support backtrace symbolication for Android builds (#63339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63339

# Context
https://fb.workplace.com/groups/pytorch.dev/permalink/900474523864362/?comment_id=901125403799274&reply_comment_id=905023386742809

##### WHAT IS A STACK TRACE?
A stack trace (also called stack backtrace or stack traceback) is a report of the active stack frames at a certain point in time during the execution of a program.

Typically when an exception is thrown, one would expect to see the code (file:line) that threw the exception, and every intermediate frame up to and including the main function.

We are enabling Android stack traces to help with debugging on Android devices.

Test Plan:
## Steps to test
```
buck build fbsource//xplat/caffe2/mode/aibench_pytorch_android -c pt.enable_qpl=0 -c pt.has_backtraces=1 fbsource//xplat/caffe2/fb/lite_predictor:lite_predictorAndroid#android-x86_64

one_world android emulator android-28

adb push ~/fbsource/buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictorAndroid#android-x86_64 /data/local/tmp

cd /data/local/tmp
./lite_predictorAndroid#android-x86_64

./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
```

## See how the stack traces look when the model file is not found:

### before
```
./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true

Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
(no backtrace available)
Aborted
```

### after
```
134|generic_x86_64:/data/local/tmp $ ./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
 frame #0       c10::get_backtrace(unsigned long, unsigned long, bool)[0x59494274f10e]
 frame #1       [0x5949427b1eee]
 frame #2       [0x5949427b1eb2]
 frame #3       [0x5949427b1cdc]
 frame #4       std::__ndk1::function<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > ()>::operator()() const[0x5949427afc34]
 frame #5       c10::Error::Error(c10::SourceLocation, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >)[0x5949427b05b1]
 frame #6       c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949427aca5f]
 frame #7       caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b37b2]
 frame #8       caffe2::serialize::FileAdapter::FileAdapter(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b3903]
 frame #9       torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>, std::__ndk1::unordered_map<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::hash<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::equal_to<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::allocator<std::__ndk1::pair<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > > >&)[0x5949422737bd]
 frame #10      torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>)[0x594942273769]
 frame #11      benchmark(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x59494189b21d]
 frame #12      main[0x594941882aff]
 frame #13      __libc_init[0x7b699d08578d]
```

### what we get for os:linux
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 24 threads
Run with 24 threads
Loading model...
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
frame #0: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb7fe]
frame #1: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb6c6]
frame #2: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x54 (0x20ca4e4 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #3: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x57 (0x20ca9a7 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #4: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7a (0x20c823a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #5: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x96 (0x206f3d6 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #6: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x42 (0x206f502 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #7: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x30 (0x1be826c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #8: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x35 (0x1be8214 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #9: benchmark(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x16d (0x12093ad in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #10: main + 0x25c (0x11f933c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #11: __libc_start_main + 0x105 (0x7fc7b9f2ed95 in /usr/local/fbcode/platform009/lib/libc.so.6)
frame #12: _start + 0x2a (0x11f902a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)

Aborted (core dumped)
````

Reviewed By: dhruvbird

Differential Revision: D30135947

fbshipit-source-id: f50c634ef4545843305cad4b4a14a8776b1aec76
2021-08-19 18:41:29 -07:00
f2bf0f229f Revert D30359218: [pytorch][PR] [doc] pre-commit fix instructions
Test Plan: revert-hammer

Differential Revision:
D30359218 (4e1d84ae8f)

Original commit changeset: 61771babeac4

fbshipit-source-id: c2ac0a4a7463fafa03ad0b20bfb0701a8c1476c4
2021-08-19 16:48:04 -07:00
d0d27f6971 Add concurrency group for more workflows (#63606)
Summary:
Fixes unnecessary duplicated workflows runs

![image](https://user-images.githubusercontent.com/658840/130146332-ecf54e49-3538-49c1-88de-b099f1c1e41f.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63606

Reviewed By: malfet, mruberry

Differential Revision: D30436889

Pulled By: zhouzhuojie

fbshipit-source-id: aafbad1edc45e3ab9bceb00e8f3b4204f18e43d0
2021-08-19 15:39:28 -07:00
71ab48ed3b acc type inference (#63119)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63119

Test Plan:
buck run mode/opt-clang caffe2/torch/fb/model_transform/experimental:fx_ir_lower_inline_cvr -- \
    --action=lower_and_run \
    --filename=inline_cvr_7x_dec_2020.model \
    --print_glow_glog=True

Reviewed By: jamesr66a, jfix71, ansley

Differential Revision: D30235895

fbshipit-source-id: dab7f96e1799b99eeae0ee519cf0ddd636fddf2e
2021-08-19 15:23:56 -07:00
ccca66597a Replace hardcoded values in IndexKernel.cu (#63372)
Summary:
This is a small change that helps maintain the Cruise PyTorch fork, since we use a different hardcoded value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63372

Reviewed By: mruberry

Differential Revision: D30396171

Pulled By: ejguan

fbshipit-source-id: cc0023f58b5922d3d98c7283495e6dc8d35049b6
2021-08-19 15:02:28 -07:00
e5ab0d1013 DataLoader: allow non-integer Samplers (#63500)
Summary:
Not entirely sure how to use TypeVar but if someone could give me a hint it would be appreciated. Also let me know if you want me to add tests so we can make sure non-integer samplers actually work. It seems like `test/test_dataloader.py` is the correct location but that's a big file.

Fixes https://github.com/pytorch/pytorch/issues/63483
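
A sketch of what this enables (names are illustrative): a map-style dataset keyed by strings, driven by a Sampler that yields those keys.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class DictDataset(Dataset):
    def __init__(self, table):
        self.table = table
    def __getitem__(self, key):          # indexed by string, not int
        return self.table[key]
    def __len__(self):
        return len(self.table)

class KeySampler(Sampler):
    def __init__(self, keys):
        self.keys = list(keys)
    def __iter__(self):
        return iter(self.keys)
    def __len__(self):
        return len(self.keys)

table = {"a": torch.zeros(2), "b": torch.ones(2)}
loader = DataLoader(DictDataset(table), sampler=KeySampler(table))
for batch in loader:
    print(batch.shape)                   # torch.Size([1, 2])
```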

ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63500

Reviewed By: mruberry

Differential Revision: D30403689

Pulled By: ejguan

fbshipit-source-id: 464e09e5aad3215b94a29cc5e21cb4b10ec136e3
2021-08-19 14:55:46 -07:00
11a40ad915 [Pytorch] Fix callstack pointer serialization bug (#63576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63576

We serialize the function name associated with an InlinedCallStackPtr. This is derived
by querying the Function* stored in the InlinedCallStack. However, this is a raw
pointer that is not guaranteed to be valid when serialization happens. On
the other hand, we also store the function name separately when constructing
the InlinedCallStack anyway, so this change uniformly relies on function_name
instead of Function*.

Test Plan: Internal build's asan failure + CI

Reviewed By: larryliu0820

Differential Revision: D30427029

fbshipit-source-id: de9617482404785920ed2e67b72f38461590fba3
2021-08-19 13:35:52 -07:00
6c3ebccc00 Updating the names of these functions (#63513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63513

Updating these names per Jerry's nits in the previous PR.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D30406710

fbshipit-source-id: a9f1577a2b8c4a93f5005e0f6278b7d7348d8b66
2021-08-19 13:34:34 -07:00
ce6fe50158 Revert embedding thrust->cub migration (#63451)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63451

Reviewed By: mruberry

Differential Revision: D30398482

Pulled By: ngimel

fbshipit-source-id: e153786d204215555a6571688eabae712facad7e
2021-08-19 13:03:33 -07:00
99203580a9 Updates internal assert_allclose callsites in favor of assert_close (#61841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841

Redo of #60863.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30408145

Pulled By: mruberry

fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58
2021-08-19 12:50:41 -07:00
efd70b7ce6 Modernizes add and mul documentation (#63309)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39329.

The documentation for torch.add and torch.mul was sorely out of date and even included deprecated references. This PR modernizes their descriptions consistent with torch.sub.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63309

Reviewed By: ngimel

Differential Revision: D30338004

Pulled By: mruberry

fbshipit-source-id: ee1c2a8106af8341253cafb0003b06e8f652624d
2021-08-19 12:49:30 -07:00
d986d4bf63 [special] use __all__ to hide internal imports (#63135)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345
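
The pattern in miniature: underscore-prefixed imports plus an explicit `__all__` keep module internals out of `from module import *` and out of tab completion.

```python
# sketch of the module-level pattern being applied
import math as _math        # internal import, hidden from star-imports

__all__ = ["gammaln"]       # only the intended public surface is exported

def gammaln(x: float) -> float:
    return _math.lgamma(x)
```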

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63135

Reviewed By: ngimel

Differential Revision: D30364287

Pulled By: mruberry

fbshipit-source-id: 20078668943fafa45ce09610634b1d2c424b1922
2021-08-19 12:45:43 -07:00
0c3904d180 [BF16] Add a missing thread local specifier to autocast_gpu_dtype (#63416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63416

Fixes a missing thread_local specifier introduced by a recent PR:

https://github.com/pytorch/pytorch/pull/61002

Test Plan: Unit Tests

Reviewed By: ngimel

Differential Revision: D30376154

fbshipit-source-id: c70d37ec85c3eba88eb87f766f1c4e7aeff8eaf9
2021-08-19 12:39:27 -07:00
535d44141b [7/N] Remove fork tests for RPC. (#63443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63443

After https://github.com/pytorch/pytorch/pull/63442, all distributed
tests can run with opt-asan. As a result, we can now remove all of our fork
based tests.

This is the first PR in a stack, which first removes fork based tests from RPC.
ghstack-source-id: 136177744

Test Plan: waitforbuildbot

Reviewed By: lw

Differential Revision: D30384905

fbshipit-source-id: 86d438aebaa6cb02ae2a966fea244849849a1889
2021-08-19 11:22:40 -07:00
bd8608cd5c Use CMake for breakpad (#63186)
Summary:
We currently build breakpad from [this fork](https://github.com/driazati/breakpad) to include extra logic to restore signal handlers that were previously present. With some [new additions](https://github.com/google/breakpad/compare/main...driazati:main) this fork now includes a CMake based build, so we can add breakpad as a proper dependency rather than rely on including it in Docker images as a system library which is error prone (we have a bunch of images) and hard to extend to MacOS / Windows. This also includes some changes to the crash handling code to support MacOS / Windows in a similar way to Linux.

```python
import torch

# On Windows this writes crashes to C:\Users\<user>\AppData\pytorch_crashes
# On MacOS/Linux this writes crashes to /tmp/pytorch_crashes
torch.utils._crash_handler.enable_minidumps()

# Easy way to cause a segfault and trigger the handler
torch.bincount(input=torch.tensor([9223372036854775807]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63186

Reviewed By: malfet, seemethere

Differential Revision: D30318404

Pulled By: driazati

fbshipit-source-id: 0d7daf3701cfaba5451cc529a0730272ab1eb1dc
2021-08-19 10:42:01 -07:00
e030b81356 [easy] Fix missing move in TupleType::createNamed (#61572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61572

ghstack-source-id: 136161829

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D29672872

fbshipit-source-id: d8ba2d54f7914dbeb3fc52aa21dd77025951c4b5
2021-08-19 10:38:52 -07:00
3aa4521fe8 [hpc] use fx2trt for exploration track (#63535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63535

Reviewed By: yinghai, jianyuh

Differential Revision: D30272810

fbshipit-source-id: 61f3edf2a2282cd8c268a92acf92feb05a6ae3e1
2021-08-19 10:18:56 -07:00
885e312ce0 Add permute021 fx2trt converter (#63238)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63238

Reviewed By: yinghai

Differential Revision: D30295373

fbshipit-source-id: 2a189fe485edaa978fd03e4b8d8582edb34ec648
2021-08-19 10:17:48 -07:00
e7831fe5de [PyTorch] Test IValue move/copy/assign/swap more (#54717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54717

Hit more tags in these tests
ghstack-source-id: 136140508

Test Plan: buck test //caffe2/aten:ivalue_test

Reviewed By: anjali411

Differential Revision: D27339736

fbshipit-source-id: 610c8e92846bb70ba725ab117440326ab50af5ce
2021-08-19 09:50:40 -07:00
79693bb86a Use linecache.lazycache to cache generated code. (#63453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63453

Instead of patching linecache.getlines, use linecache.lazycache and
parts of the loader protocol described in PEP-302

Test Plan:
python3 test/test_fx.py

Imported from OSS

Reviewed By: suo

Differential Revision: D30388176

fbshipit-source-id: 92933711ecf3a21a07e1d6b0d1185ab0efd8341c
2021-08-19 09:17:01 -07:00
e1334512a3 Add fastpath for dot and vdot when the inputs have conj bit set to True (#62915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62915

As much as 45% and 20% perf improvement on CUDA and CPU, respectively.
Consistent improvement in perf for all cases -- see perf numbers in the comments below.
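
A sketch of the call pattern this fast path targets (conjugate views via `.conj()` carry a conj bit instead of materializing):

```python
import torch

a = torch.randn(1000, dtype=torch.complex64)
b = torch.randn(1000, dtype=torch.complex64)

# a.conj() is a lazy view with the conj bit set; dot/vdot can now handle
# it directly instead of materializing the conjugated tensor first.
print(torch.dot(a.conj(), b))
print(torch.vdot(a, b.conj()))
```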

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30404006

Pulled By: anjali411

fbshipit-source-id: 565940da28c7761d993cf43346932c24292e8a4d
2021-08-19 08:42:24 -07:00
f596aa8b77 Poisson zero rate (#61511)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/53485 by allowing zero rates for the Poisson distribution. This implementation is consistent with `scipy.stats.poisson` which admits zero rates. In addition to addressing the aforementioned issue, this PR makes two supporting changes:

1. add a `nonnegative` constraint to enforce non-negative rates for the Poisson distribution.
2. adjust the evaluation of the gradient of `xlogy` such that it is well defined for `x == 0 and y == 0`.
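
A minimal illustration of the new behavior (a sketch; the values noted in the comments follow from the formula, since `xlogy(0, 0) == 0`):

```python
import torch
from torch.distributions import Poisson

d = Poisson(torch.tensor([0.0, 1.5]))  # a rate of 0 is now accepted
print(d.sample())                      # the first entry is always 0
print(d.log_prob(torch.zeros(2)))      # first entry is 0.0 (= log 1), finite
```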

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61511

Reviewed By: ejguan

Differential Revision: D30352917

Pulled By: albanD

fbshipit-source-id: f3d33da58360e80d75eb83519f199b93232a2a2d
2021-08-19 08:30:28 -07:00
be9be9bfdd add distributed/_sharded_tensor/test_sharded_tensor to ROCM_BLOCKLIST (#63508)
Summary:
Fixes current ROCm CI test2 brokenness until tensorpipe is fully supported by ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63508

Reviewed By: ejguan

Differential Revision: D30406450

Pulled By: walterddr

fbshipit-source-id: c07509271d5d33901f3eaf7ffb916dc3626e1f9a
2021-08-19 07:50:55 -07:00
e7c4988b52 To fix the chainability at epoch zero for some schedulers (#63457)
Summary:
It has been discussed in the https://github.com/pytorch/pytorch/pull/60836#issuecomment-899084092 that we have observed an obstacle to chain some type of learning rate schedulers. In particular we observed

* some of the learning rate schedulers return their initial learning rates at epoch 0 as

```
       return self.base_lrs
```

* This can be a problem when two schedulers are chained, as in

```
     scheduler1.step()
     scheduler2.step()
```

in particular, the effect of scheduler1 at epoch 0 is completely ignored. This would not be an issue if scheduler1 were ineffective at epoch 0, as it is for many schedulers; however, for schedulers such as warm-up schedulers, whose multiplicative value at epoch 0 is smaller than 1, this can lead to undesired behavior.

The following code snippet illustrates the problem better

## Reproducing the bug

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR, ExponentialLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 1.0)
scheduler1 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")
scheduler2 = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
     print(epoch, scheduler2.get_last_lr()[0])
     optimizer.step()
     scheduler1.step()
     scheduler2.step()
```

### Current Result

```
0 1.0
1 0.9
2 0.81
3 0.7290000000000001
4 0.6561000000000001
5 5.904900000000001
6 5.314410000000001
7 4.782969000000001
8 4.304672100000001
9 3.874204890000001
```

### Expected Result

```
0 1.0
1 0.9
2 0.81
3 0.7290000000000001
4 0.6561000000000001
5 0.5904900000000001
6 0.5314410000000001
7 0.4782969000000001
8 0.4304672100000001
9 0.3874204890000001
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63457

Reviewed By: datumbox

Differential Revision: D30424160

Pulled By: iramazanli

fbshipit-source-id: 3e15af8d278c872cd6f53406b55f4d3ce5002867
2021-08-19 07:17:03 -07:00
2d5b19f62b Update full backward hook doc with not-same-object note (#63245)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63245

Reviewed By: ejguan

Differential Revision: D30352656

Pulled By: albanD

fbshipit-source-id: 7000ecb54a80f2da968ec7600b98574b608578ae
2021-08-19 06:50:56 -07:00
47a9e8ff32 [Static Runtime] Support __getitem__ for lists (#63398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398

This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.

Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D30368464

fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
2021-08-19 06:38:51 -07:00
ce61100923 Revert D29399533: Hoisting common expressions out of If blocks
Test Plan: revert-hammer

Differential Revision:
D29399533 (9477211e7d)

Original commit changeset: 9336b9dc48c0

fbshipit-source-id: f081c7280203f40328bcbb0c03a7c6a007acedb7
2021-08-19 06:20:40 -07:00
6bb68ba507 Fix interpreter debug logging message (#63499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63499

https://github.com/pytorch/pytorch/pull/62418 combined the instruction and debug handle. This change fixes the debugging message.
ghstack-source-id: 136184053

Test Plan: Uncomment and it works

Reviewed By: kimishpatel, raziel

Differential Revision: D30390699

fbshipit-source-id: e32b7b297ad3b7d8bffebd025d15519083a244c4
2021-08-19 02:14:13 -07:00
5254e3adb8 layernorm inplace (#63437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63437

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388824

Pulled By: Krovatkin

fbshipit-source-id: 852d19bf238544c5de177ed5854dcd01c7ae5572
2021-08-18 23:07:25 -07:00
531262fe2e layernorm (#63436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63436

use MKLDNN layernorm

use mkldnn version 2

address Elias feedback

fix build CI errors

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388825

Pulled By: Krovatkin

fbshipit-source-id: fb909bfbf53cb8567a43aac40f51c491daeec908
2021-08-18 23:05:39 -07:00
6e00b31b15 [TensorExpr] Make CacheReplacer and IndexFlattener mutate stmts/exprs inplace. (#63527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63527

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30411411

Pulled By: ZolotukhinM

fbshipit-source-id: efb14ee57b36537fa4fefa89bdd6bafe7151c012
2021-08-18 22:59:31 -07:00
1d62fb8a63 [TensorExpr] Speedup ExternalCall.ComputeInterop test by reducing tensor sizes. (#63526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63526

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30411410

Pulled By: ZolotukhinM

fbshipit-source-id: d9a99afac14d2238b5100c98ae9ed4467f9f05ea
2021-08-18 22:58:25 -07:00
773c8b6440 support optional comparisons with different but comparable types (#62890)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62890

Reviewed By: ejguan

Differential Revision: D30396008

Pulled By: dagitses

fbshipit-source-id: fca02207509f882973d54484f89c4d116505fc66
2021-08-18 21:40:38 -07:00
2544664e54 Beef up comment in AccumulateType (#63503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63503

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30403160

Pulled By: ezyang

fbshipit-source-id: 6cb24418152d9fb146f86b6f973ec50f1a397a58
2021-08-18 20:59:37 -07:00
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
9477211e7d Hoisting common expressions out of If blocks (#59492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59492

Adding code to find common expressions from the two subblocks of an if
operation and hoist them before the if block.
This also allows Dead Code Elimination to
then eliminate some if blocks.

Also eliminated some dead code in the codebase.
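
An illustrative TorchScript input pattern the pass targets (a sketch, not taken from the PR's tests):

```python
import torch

@torch.jit.script
def f(x: torch.Tensor, cond: bool) -> torch.Tensor:
    # `x * 2` is computed in both branches; the pass hoists it above the if,
    # which can then let dead code elimination simplify the block.
    if cond:
        y = x * 2 + 1
    else:
        y = x * 2 - 1
    return y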

Test Plan:
python test_jit.py TestIfHoisting

Imported from OSS

Reviewed By: ngimel

Differential Revision: D29399533

fbshipit-source-id: 9336b9dc48c02c38862f98f98cd72fc1767a1802
2021-08-18 16:29:30 -07:00
d9547b9bb2 Nnapi Delegation: Quick improvements (#63489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63489

A few quick improvements to the Android NNAPI Delegate, some of which were discussed here https://github.com/pytorch/pytorch/pull/62272:
1) `throw std::exception` replaced with `TORCH_CHECK` to reduce runtime
size (nnapi_backend_lib.cpp)
2) weights processing moved from compile to preprocess step, since it can
be done AOT (nnapi_backend_lib.cpp & nnapi_backend_preprocess.cpp)
3) `ser_model_` and `shape_compute_module_` member variables removed, since they are never used after
`init()`, so they are not needed (nnapi_backend_lib.cpp)

Test Plan:
Unit tests: `python test/test_jit.py TestNnapiBackend`
Run SparkAR segmentation with delegated NNAPI as done here D30259033 (can use `jf download GAekdAwsyGKXhggFALN4LnSBTzcubsIXAAAz --file "v303-nnd-mod.ptl"` to get a preprocessed model from these changes)

Imported from OSS

Reviewed By: raziel, iseeyuan

Differential Revision: D30398880

fbshipit-source-id: b6872e1e9ccd583622b80659da00c83fdd82580e
2021-08-18 16:25:01 -07:00
4dcc2197ce [fix] tensor_split : non-contiguous indices tensor (#63390)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63390

Reviewed By: ejguan

Differential Revision: D30362649

Pulled By: mruberry

fbshipit-source-id: 3ea3ad02199e4345beb0b580d056babd56112309
2021-08-18 16:10:17 -07:00
1f4e019d8e [Vulkan] Fix incorrect input range for Hardshrink tests (#63515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63515

Fixed inappropriate input range for Hardshrink tests:
The range -10 to +10 for input tensors is more appropriate when we use the test set of lambda values {-4.2, -1.0, -0.42, 0.0, 0.42, 1.0, 4.2, 42.42}.
ghstack-source-id: 136141416

Test Plan:
```build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Note that the test can fail sporadically due to precision loss between FP16 (Vulkan) and FP32 (CPU). This issue will be handled separately after some design discussions.

Reviewed By: SS-JIA

Differential Revision: D30389646

fbshipit-source-id: 7224bd8ba4e4972f5fc147df8a0cb84808f8c62e
2021-08-18 15:52:12 -07:00
15eec8e1d1 using PR number instead of IN_PULL_REQUEST (#63360)
Summary:
PR numbers should be available on GHA after this.

This fixes an issue where the target determinator was not working, discovered when manually running https://github.com/pytorch/pytorch/issues/63412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63360

Reviewed By: malfet, zhouzhuojie, seemethere

Differential Revision: D30374615

Pulled By: walterddr

fbshipit-source-id: eee8d8bb7aa4308a6a50cfdcd4423a96d846777f
2021-08-18 15:05:10 -07:00
779a3d47b0 [Static Runtime] Benchmark reports native nodes (#63346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63346

We have seen that we can get significant perf wins essentially for free by implementing native ops for ops that we cannot write out variants for (e.g. TupleUnpack D30306955 (078b8004a6), append D30326461 (9d9e7a8d72)). Therefore, whether or not SR is using a native implementation is valuable information. By capturing this in the benchmarking suite, we can hopefully avoid wasting time profiling/manually inspecting `native_ops.cpp`

Reviewed By: hlu1

Differential Revision: D30346752

fbshipit-source-id: 205b090513b6a5a6ce4cb92f75ab0395b15d08f9
2021-08-18 15:05:08 -07:00
139413078f [FX] make ASTRewriter patch wrapped functions properly (#62987)
Summary:
Reference the same global namespace (instead of copying it) in ASTRewriter to patch wrapped functions properly.

Fixes #62071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62987

Test Plan:
To test it you may write this snippet and ensure the results are as shown in the comments:

```
import torch
import torch.fx

@torch.fx.wrap
def to_be_wrapped(x):
    return torch.relu(x)

class Foo(torch.nn.Module):
    def forward(self, x):
        return to_be_wrapped(x)

traced = torch.fx.symbolic_trace(Foo())
print(traced.graph)
"""
graph():
    %x : [#users=1] = placeholder[target=x]
    %to_be_wrapped : [#users=1] = call_function[target=__main__.to_be_wrapped](args = (%x,), kwargs = {})
    return to_be_wrapped
"""

from torch.fx.experimental.rewriter import RewritingTracer

rt = RewritingTracer()
graph = rt.trace(Foo())
print(graph)
"""
### AFTER FIX (CORRECT):
graph():
    %x : [#users=1] = placeholder[target=x]
    %to_be_wrapped : [#users=1] = call_function[target=__main__.to_be_wrapped](args = (%x,), kwargs = {})
    return to_be_wrapped

### BEFORE FIX (WRONG):
graph():
    %x : [#users=1] = placeholder[target=x]
    %relu : [#users=1] = call_function[target=torch.relu](args = (%x,), kwargs = {})
    return relu
"""
```

Reviewed By: ansley

Differential Revision: D30396176

Pulled By: mostafaelhoushi

fbshipit-source-id: f61eddf32e9ef42b5f5c3ce21d559945214ee833
2021-08-18 15:03:57 -07:00
9bbf80969e [PyTorch] Avoid using std::regex for device string parsing in Device.cpp (#63464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63464

This was previously committed as D30281388 (4d6f98ecad), but was reverted due to t98478641. jnkwok1 confirmed that this change was not the root cause, so trying to land it again.

Currently, `std::regex` is used for parsing device strings. This is undesirable for a few reasons.

1. Increases binary size
2. Slows down model loading
3. Potentially uses more memory at runtime
4. Takes marginally longer time to build code that uses std::regex v/s not using std::regex

This change avoids the use of `std::regex` for parsing the device string since we don't need to.
ghstack-source-id: 136006963
ghstack-source-id: 136081898

Test Plan:
### AI Bench Runs

**Before this change:**
1. Model Load time: [252ms](https://www.internalfb.com/intern/aibench/details/332471502816548)
2. Model unload time: 3.5ms

**After this change:**
1. Model Load time: [240ms](https://www.internalfb.com/intern/aibench/details/652195589031318), which is an approx 5% reduction for the current model. I suspect percentage wise, it will be larger for smaller models since this is a fixed cost reduction.
2. Model unload time: 3.3ms (probably too small to be meaningfully impactful to an end user).

### BSB Results

```
D30281388 (4d6f98ecad)-V1 (https://www.internalfb.com/intern/diff/D30281388 (4d6f98ecad)/?dest_number=135713848)

messenger-pika-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -7.1 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.6 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:551399955987465@base/bsb:551399955987465@diff/
```

Reviewed By: raziel, pavithranrao

Differential Revision: D30388269

fbshipit-source-id: 10942e7aa56f9ea47aa479a8f50187f2ce2899bf
2021-08-18 14:55:12 -07:00
7fdba4564a [TensorExpr] IRSimplifier: sort terms in polynomials, terms, minterms, maxterms. (#63197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197

This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292776

Pulled By: ZolotukhinM

fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
2021-08-18 14:49:27 -07:00
8bdd542417 [TensorExpr] Add debug logging to LoopNest::computeInline. (#63196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63196

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292778

Pulled By: ZolotukhinM

fbshipit-source-id: d8a111b75466a9354f6d048119cc6f814c9d5abb
2021-08-18 14:48:05 -07:00
feba6806c9 clarify that torch.finfo.tiny is the smallest normal number (#63241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63241

This is a common source of confusion, but it matches the NumPy
behavior.
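
A quick check of the distinction being documented:

```python
import torch

finfo = torch.finfo(torch.float32)
print(finfo.tiny)  # 1.1754943508222875e-38: the smallest *normal* number
# Subnormals below `tiny` are still representable:
print(torch.nextafter(torch.tensor(finfo.tiny), torch.tensor(0.0)))  # > 0
```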

Fixes #44010
Fixes #59526

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30307646

Pulled By: dagitses

fbshipit-source-id: d848140ba267560387d83f3e7acba8c3cdc53d82
2021-08-18 13:44:52 -07:00
9253dc1e58 Fix segmentation fault due to access to destroyed CudaIPCGlobalEntities instance (#56141)
Summary:
There is an instance of the static destruction order fiasco where cuda_ipc_global_entities may be accessed after it is destroyed. See https://github.com/pytorch/pytorch/issues/51961

This change uses a flag and avoids accesses to the destroyed class when it is set to false.

Fixes https://github.com/pytorch/pytorch/issues/51961

This removes the function to clear shared_blocks introduced by https://github.com/pytorch/pytorch/issues/53080, which had multiple issues: unprotected access to a shared structure, and modification of the vector being cleared by the destructors of the contained objects.
I.e., what happened was:

- `CudaIPCSentDataLimbo_.clear_shared_blocks();` is called from the destructor of CudaIPCGlobalEntities as of your PR
- This deletes instances of `CudaIPCSentData` which hold `at::DataPtr` created by `GetNewRefCountedSentData`
- This means `CudaIPCSentDataDelete` is called with still active pointers
- Hence `CudaIPCSentDataLimbo_.add` is called adding a new value to `shared_blocks_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56141

Reviewed By: ejguan

Differential Revision: D30397279

Pulled By: VitalyFedyunin

fbshipit-source-id: ce4b8b90fa1c90d275e5eca93ba84321cbc6140a
2021-08-18 13:38:55 -07:00
877e6f2be3 Bugfix for fuse qconfig comparison (#63384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63384

In some cases the changes to qconfig on module would cause the
fusions to fail. This bugfix solves that problem by adding a
qconfig_function_comparison that compares the functions within the
qconfig rather than the modules the qconfigs are on. The comparison
looks at the partial object within QConfig.activation/weight.p and
compares args, keywords, and func. This is necessary to do manually
because partial doesn't implement __eq__, so == falls back to is.
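
The underlying Python behavior, for reference (plain `functools`, not PyTorch-specific):

```python
from functools import partial

f = partial(int, base=2)
g = partial(int, base=2)

print(f == g)  # False: partial has no __eq__, so == falls back to identity
# A manual comparison along the lines of the fix:
print(f.func == g.func and f.args == g.args and f.keywords == g.keywords)  # True
```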

Test Plan:
python test/test_quantization.py
TestFuseFx.test_problematic_fuse_example

Imported from OSS

Reviewed By: supriyar, ejguan

Differential Revision: D30386264

fbshipit-source-id: 51e358c021c39d6f48dc12ad2a82b2838677b9de
2021-08-18 13:31:56 -07:00
2aa19f33c6 [ONNX] Fix for batchnorm training op mode (#52758) (#62760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62760

* Rebase

# Conflicts:
#	torch/csrc/jit/passes/onnx/eval_peephole.cpp

# Conflicts:
#	test/onnx/test_utility_funs.py
#	torch/onnx/symbolic_opset9.py

* Update symbolic_opset12.py

* Update test.sh
# Conflicts:
#	.jenkins/caffe2/test.sh

* Merge

* Fix utility tests

# Conflicts:
#	test/onnx/test_pytorch_onnx_onnxruntime.py
#	test/onnx/test_utility_funs.py

* Fix for comment

* Enable BN tests

* Fix for test

* Update test_pytorch_onnx_onnxruntime.py

* Update test_pytorch_onnx_onnxruntime.py

* Update test_utility_funs.py

* Update test_pytorch_onnx_onnxruntime.py

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30349060

Pulled By: msaroufim

fbshipit-source-id: 93312c17607974731c17099ae181acb6e4c1c409
2021-08-18 13:29:07 -07:00
e182401062 [ONNX] Remove aten parameter (#61652) (#62759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62759

* remove aten argument in export()

* add export_to_pretty_string default value OperatorExportTypes.ONNX

* add DPYTORCH_ONNX_CAFFE2_BUNDLE description

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30349062

Pulled By: msaroufim

fbshipit-source-id: d9738f3aa8b80eac54548d0b9494f9f1e544f20f

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-08-18 13:29:04 -07:00
3a7bbf5fb7 [ONNX] Add support for opset14 in PT-ONNX exporter (#59486) (#62758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62758

* Add initial changes for opset14

* Fixed flake

* Add onnx submodule changes and removed utility func tests

* Add updated batchNorm symbolic

* Add triu/tril symbolics

* Fix lint

* Fixed test failures

* Add reshape with allowzero

* Added tests/refactored opset versioning

* Bump onnxruntime version

* Fix clang/lint failures

* Add reshape shape inference for opset 14

* Changes for allowzero

* Fix lint/clang and test failures

* Updated PR

* Flake fixes

* Fix flake

* Remove new_jit_api tests

* Add opset14 models

* Update allowzero

* Fix test failures

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30349063

Pulled By: msaroufim

fbshipit-source-id: 54724246149b01a2f627c43d7396253a7e9c9eb9

Co-authored-by: Shubham Bhokare <sbhokare@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-08-18 13:29:01 -07:00
99b154b8be [ONNX] Support lstm_cell symbolic (#61476) (#62757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62757

Support lstm_cell symbolic

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30349061

Pulled By: msaroufim

fbshipit-source-id: f236177e3e5c62a30b7e4d91a623bcaef21b5eb1

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-08-18 13:27:46 -07:00
d661e646ad [FX] Fix GraphModule deepcopy to use deepcopied graph (#63090)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63090

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D30252471

Pulled By: jamesr66a

fbshipit-source-id: cafd7d7917935a5ea6ffa2a7fe9e9b2a9578b3e3
2021-08-18 13:17:14 -07:00
11fbd3958c MaybeOwned page for dev wiki (#63450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63450

Brief guide to understanding `MaybeOwned<Tensor>`, aimed at C++ PT devs who are obliged to interact with existing uses of it, rather than encouraging new usage.

For reviewers: I haven't yet added a link to this page from anywhere. I'm thinking the right place is the [dev wiki main page C++ section](https://github.com/pytorch/pytorch/wiki#c) but happy to put it wherever makes sense, suggestions welcome.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30402313

Pulled By: bhosmer

fbshipit-source-id: 69b15909ecafcd8d88e44f664f88c3ad4eb26d84
2021-08-18 12:08:58 -07:00
9bb1371cc2 Disable RDYNAMIC check with MSVC (#62949)
Summary:
When testing with clang-cl, the flag is added even though it is unsupported, which generates a few warnings. I tried a few alternatives like https://cmake.org/cmake/help/latest/module/CheckLinkerFlag.html, but they just don't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62949

Reviewed By: zhouzhuojie, driazati

Differential Revision: D30359206

Pulled By: malfet

fbshipit-source-id: 1bd27ad5772fe6757fa8c3a4bddf904f88d70b7b
2021-08-18 11:51:23 -07:00
d4593d9d08 document why wrappers exist in torch.functional (#62847)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62844.

These wrappers are not super obvious, but ultimately stem from the lack of support for functions with variadic args in native_functions.yaml. https://github.com/pytorch/pytorch/issues/62845 tracks that issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62847

Reviewed By: VitalyFedyunin

Differential Revision: D30305016

Pulled By: dagitses

fbshipit-source-id: 716fcecb0417b770bc92cfd8c54f7ead89070896
2021-08-18 11:51:21 -07:00
f0f5cffde9 [DDP] Add a debug check in cpp fp16 compress (#63379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63379

This codepath has been prone to bugs, as seen in the diff below; this
will help guard against changes/refactors that touch it, as a basic sanity
check. Enabled only in debug builds so as not to affect perf.
ghstack-source-id: 136056093

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30358440

fbshipit-source-id: e1b3893a223722c2593ceed8696a09c7d07d47c1
2021-08-18 11:51:19 -07:00
ac1ece054b [DDP][Grad compression] Fix fp16 cpp hook (#63375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63375

I think tensor.copy_(tensor.to(torch::kFloat16)); will keep it as
float32, since copy_ casts back into the destination tensor's existing dtype.
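
A Python analogue of the bug (illustrative; the actual fix is in the C++ hook):

```python
import torch

t = torch.randn(4)            # float32
t.copy_(t.to(torch.float16))  # copy_ casts back into the float32 destination
print(t.dtype)                # torch.float32 -- the "compression" was a no-op
```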

Tested by adding the following line:

```
LOG(INFO) << "Type is: " << compressed_tensor.scalar_type();
```

before:

```
I0816 17:03:09.823688 364141 default_comm_hooks.cpp:21] Type is: Float
```
after:

```
I0816 17:01:16.779052 353924 default_comm_hooks.cpp:21] Type is: Half
```
ghstack-source-id: 136056092

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D30356256

fbshipit-source-id: 8208a705acd7628541cd43c8bf61d007dfdd2435
2021-08-18 11:49:35 -07:00
4e1d84ae8f [doc] pre-commit fix instructions (#61717)
Summary:
fix invalid instruction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61717

Reviewed By: zhouzhuojie, driazati

Differential Revision: D30359218

Pulled By: malfet

fbshipit-source-id: 61771babeac4d34425a61ce49f38a7099b521eec
2021-08-18 11:42:25 -07:00
50a3b6a6a8 Make SkipInfo with expected_failure an XFAIL (#63481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63481

This PR changes the SkipInfo decorators to use unittest.expectedFailure so that the test reports as XFAIL as opposed to PASSED.

Note that changing the expectedFailure here 30e1c74dc1/torch/testing/_internal/common_device_type.py (L879) to an XFAIL is not possible because the decision of whether to decorate is delayed until the wrapper function is called.
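
For reference, the plain-unittest behavior being adopted:

```python
import unittest

class T(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_bug(self):
        self.assertEqual(1, 2)  # reported as an expected failure ("x"), not a pass

if __name__ == "__main__":
    unittest.main()
```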

fixes https://github.com/pytorch/pytorch/issues/63363

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30397154

Pulled By: heitorschueroff

fbshipit-source-id: c5e4911969ad8667763eec4203dbbc6a51178592
2021-08-18 11:36:18 -07:00
2f615f6313 Improve custom function docs (#60312)
Summary:
- Adds some code examples for `ctx` methods and makes the requirements of arguments clearer (see the sketch after this list)
- Type annotations for `save_for_backward`, `mark_dirty`, `mark_non_differentiable`, and `set_materialize_grads` (BC-breaking?)
- Refactor `torch.autograd.Function` doc
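
A minimal example of the `ctx.save_for_backward` pattern the docs now cover (a sketch, not taken from the PR):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # tensors needed in backward go here
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)  # equals 2 * x
```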

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60312

Reviewed By: VitalyFedyunin

Differential Revision: D30314961

Pulled By: soulitzer

fbshipit-source-id: a284314b65662e26390417bd2b6b12cd85e68dc8
2021-08-18 11:31:31 -07:00
d565a7bd68 [6/N] Enable opt-asan for elastic and launcher tests. (#63442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442

Continuing https://github.com/pytorch/pytorch/pull/62051, I've
enabled elastic and launcher tests to run in opt-asan mode, which is supported
with spawn multiprocessing.

This allows us to completely get rid of fork based tests from torch.distributed
and have all tests run in spawn mode.
ghstack-source-id: 136057123

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D30384267

fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
2021-08-18 10:48:49 -07:00
af3cbfed95 Add validation check in fx2trt interpreter (#63424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63424

Add a validation check in fx2trt for missing converter operators. If any op is missing, interpreter init will report the missing operators.

Test Plan:
for call_function and call_method:
manual test with feeds benchmark and verify init failed with expected message.
{F642390780}

for call_module:
specify a module as a leaf node so that acc_tracer traces it as a single node; then, in fx2trt.py, make the CONVERTER initialization stage skip recording all modules; finally, initialize the interpreter and call the validator function, verifying that the output includes the missing module name (return value printed in the screenshot below).

{F643458718}

Reviewed By: 842974287

Differential Revision: D30294832

fbshipit-source-id: 243dca3fdfc6a174ded65248938e2a234aec19c6
2021-08-18 10:41:10 -07:00
7df2324120 [pytorch] Make qconv forward() thread safe (#63432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63432

There's a race condition in quantized models when multiple threads call forward(), due to qnnpack packing the weights the first time the operator is called. This change locks the entire apply_impl function.
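
A minimal Python sketch of the locking pattern (the real change is in C++; `pack`/`run_packed` below are illustrative stand-ins for qnnpack's packing and execution):

```python
import threading

def pack(weight):            # stand-in for qnnpack's one-time weight packing
    return list(weight)

def run_packed(packed, x):   # stand-in for running the packed operator
    return [w * x for w in packed]

class LazyPackedOp:
    def __init__(self, weight):
        self._weight = weight
        self._packed = None
        self._lock = threading.Lock()

    def forward(self, x):
        # Serialize the lazy first-call packing that previously raced
        # when multiple threads entered forward() concurrently.
        with self._lock:
            if self._packed is None:
                self._packed = pack(self._weight)
        return run_packed(self._packed, x)
```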

Test Plan:
https://github.com/pytorch/pytorch/issues/58055

Ran the script before and after, original crashes went away

Reviewed By: kimishpatel

Differential Revision: D30229520

fbshipit-source-id: d06cabe24199a80325cd57f24a7fd60624be2cf7
2021-08-18 10:37:13 -07:00
565578cdab Use fastAtomicAdd in EmbeddingBag (mode "max") backward (#63298)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695

### This PR
|   n_tokens |   num_embeddings |   embedding_dim | mode   |    bwd_fp32 |    bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
|       4096 |             4096 |            4096 | max    | 0.000326228 | 0.000181448 |
|       4096 |             4096 |           16384 | max    | 0.00102805  | 0.000618136 |
|       4096 |            16384 |            4096 | max    | 0.000907326 | 0.000530422 |
|       4096 |            16384 |           16384 | max    | 0.00334988  | 0.00264645  |
|      16384 |             4096 |            4096 | max    | 0.000366449 | 0.000320232 |
|      16384 |             4096 |           16384 | max    | 0.00126421  | 0.00104183  |
|      16384 |            16384 |            4096 | max    | 0.00087738  | 0.00065068  |
|      16384 |            16384 |           16384 | max    | 0.00379229  | 0.00298201  |

### Original
|   n_tokens |   num_embeddings |   embedding_dim | mode   |    bwd_fp32 |    bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
|       4096 |             4096 |            4096 | max    | 0.00032407  | 0.000188231 |
|       4096 |             4096 |           16384 | max    | 0.00104356  | 0.000624001 |
|       4096 |            16384 |            4096 | max    | 0.000902069 | 0.000527382 |
|       4096 |            16384 |           16384 | max    | 0.00302202  | 0.00255153  |
|      16384 |             4096 |            4096 | max    | 0.000384343 | 0.000403249 |
|      16384 |             4096 |           16384 | max    | 0.00126445  | 0.00135069  |
|      16384 |            16384 |            4096 | max    | 0.000880814 | 0.000825679 |
|      16384 |            16384 |           16384 | max    | 0.00337611  | 0.00319515  |

cc xwang233 ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63298

Reviewed By: mruberry

Differential Revision: D30383583

Pulled By: ngimel

fbshipit-source-id: 14dd9d67002c53a153721812709033c198f68c1e
2021-08-18 10:14:40 -07:00
e2ddaec5cf Reverting launch bounds change in topK that induced a regression in perf (#63431)
Summary:
[topkwsyncs.zip](https://github.com/pytorch/pytorch/files/7003077/topkwsyncs.zip)

Running this script on nvidia containers 21.08 vs 21.07 we see the following perf drops:
topk(input=(dtype=torch.float16,shape=[60, 201600]), k=2000, dim=1, sorted=True) - 0.63

topk(input=(dtype=torch.float32,shape=[120000]), k=12000, dim=0, sorted=False) - 0.55

topk(input=(dtype=torch.float16,shape=[5, 201600]), k=2000, dim=1, sorted=True) - 0.55

topk(input=(dtype=torch.float32,shape=[1, 10000]), k=1000, dim=1, sorted=False) - 0.33

The relative perf drop is reported as (21.08_time - 21.07_time) / 21.07_time

I narrowed down the source of the regression to this commit: https://github.com/pytorch/pytorch/pull/60314
which reduced launch bounds from 1024 to 512.

The regression did not show up in the original benchmarks used to justify changing 1024 to 512, because their input shapes were much smaller than those of the tensors in which I am observing the regression. I suggest reverting to 1024: with 512 there was no considerable perf improvement for small inputs, and there is a major perf regression for large tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63431

Reviewed By: mruberry

Differential Revision: D30384087

Pulled By: ngimel

fbshipit-source-id: 11eecbba82a069b1d4579d674c3f644ab8060ad2
2021-08-18 09:44:07 -07:00
383a33a0eb Make DataChunk support list in-place ops (#63422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422

Fixes #63095

Make `DataChunk` delegate to list methods so that it supports the following in-place operations (see the sketch after this list):
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
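
A hedged sketch of the resulting behavior (the public import path for `DataChunk` is an assumption):

```python
import random
from torch.utils.data import DataChunk  # assumed import path

chunk = DataChunk([3, 1, 2])
chunk.sort()             # delegates to list.sort, mutating in place
chunk.append(4)
chunk.extend([6, 5])
chunk.reverse()
random.shuffle(chunk)    # works via the delegated __setitem__/__len__
print(list(chunk))
```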

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30379027

Pulled By: ejguan

fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
2021-08-18 08:48:47 -07:00
cyy
93582e3bba A tiny fix in MT19937RNGEngine (#63219)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63219

Reviewed By: VitalyFedyunin

Differential Revision: D30341484

Pulled By: ezyang

fbshipit-source-id: 0ff4499d0f4a3dfeb991c0f10fe3248c6ca1c992
2021-08-18 08:05:23 -07:00
c508433617 Implement subclass priority for __torch_dispatch__ (#63411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63411

In order to get this behavior, you have to use append_overloaded,
which I forgot to use in the previous implementation.  I exposed
an internal helper function which is more appropriate for dispatch
to Python where we know that an argument is definitely a Tensor (and
this test no longer needs to be done).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30374489

Pulled By: ezyang

fbshipit-source-id: 43b08c00d1958c9b26d82a025d19f0b67bb85590
2021-08-18 07:49:03 -07:00
061b36e2f5 [fx2trt] Add dequantize support (#63448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63448

Only available after TensorRT 8.0

Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_dequantize

Reviewed By: 842974287

Differential Revision: D30296863

fbshipit-source-id: 44b9630ef0d210e7f20e650dc81c519f7e41f5f3
2021-08-18 07:44:17 -07:00
a00d587849 add OpInfo for torch.linalg.tensorinv (#62326)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53739.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62326

Reviewed By: H-Huang

Differential Revision: D30136376

Pulled By: zou3519

fbshipit-source-id: 04ec9450e8866667649af401c7559b96ddc91491
2021-08-18 07:37:34 -07:00
30e1c74dc1 Update cuda amp to also check xla device (#63413)
Summary:
Fixes https://github.com/pytorch/xla/issues/3086. PyTorch/XLA:GPU also uses CUDA amp. I verified the PT/XLA `test_autocast` with this fix and all tests passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63413

Reviewed By: ngimel

Differential Revision: D30380785

Pulled By: bdhirsh

fbshipit-source-id: fd1a1de7d224c616fc3fa90b80a688a21f6b1ecc
2021-08-18 06:44:10 -07:00
4a390a56c4 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D30391472

fbshipit-source-id: d4eb1e7debea8905e7fee5f026c082bee65e78f3
2021-08-18 04:20:05 -07:00
2b303f3f31 enhance comparison tests for c10::optional (#62887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62887

Reviewed By: VitalyFedyunin

Differential Revision: D30305044

Pulled By: dagitses

fbshipit-source-id: d0a3a9e4ea186915ef087543aaf81a606f943380
2021-08-18 04:08:05 -07:00
0f2f6a79cb clarify the documentation of torch.meshgrid (#62977)
Summary:
Also warn about the behavior differences from `numpy.meshgrid`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62977

Reviewed By: mruberry, ngimel

Differential Revision: D30220930

Pulled By: dagitses

fbshipit-source-id: ae6587b41792721cae2135376c58121b4634e296
2021-08-18 04:01:22 -07:00
f8a84a80cd [5/N] Run opt-asan with detect_leaks=0 (#63361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361

Python multiprocessing doesn't support LSAN and produces false positives
instead. As a result, LSAN is disabled for these tests so that we can still run
them with opt-asan.
ghstack-source-id: 135962489

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D30352269

fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
2021-08-18 01:59:56 -07:00
d431c77d76 [sharded_tensor] fix typing issue for placement (#63426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63426

placement should be either a string or a _remote_device; this fixes the type annotation to match the behavior
ghstack-source-id: 136041125

Reviewed By: pritamdamania87

Differential Revision: D30379702

fbshipit-source-id: 34e226494240923b433e3a39cc08c84d42cdad6b
2021-08-17 23:11:48 -07:00
2fd14735d6 [easy][PyTorchEdge] print error message when failing to load model file (#63404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63404

# Context
Loading a model file using `fopen` might error out for multiple reasons. Reproducing the error on devices takes time and effort. Logging the errno will help in debugging and fixing the error quickly.

# Mitigation
Print out the error message from `fopen` to help users debug the issue.

Test Plan:
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] buck run xplat/caffe2/fb/lite_predictor:lite_predictor -- --model=/home/pavithran/models/prod/GAaNhAoTIV6cIvgJAHn30m8NR1QgbmQwAAAA.ptl --use_bundled_input=0
Building: finished in 0.5 sec (100%) 354/354 jobs, 0/354 updated
  Total time: 0.6 sec
Run with 24 threads
Run with 24 threads
Loading model...
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed because of errno 2 on fopen: No such file or directory, file path: /home/pavithran/models/prod/GAaNhAoTIV6cIvgJAHn30m8NR1QgbmQwAAAA.ptl
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:15 (most recent call first):
(no backtrace available)
```

Reviewed By: dhruvbird

Differential Revision: D30372308

fbshipit-source-id: 5346e828f53f6bc5d871b403586566a3332a389a
2021-08-17 22:27:49 -07:00
15144ade25 [fx2trt] Add quantize_per_tensor support (#63447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63447

Only available in TRT 8.0 and above

Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_quantize_per_tensor

Reviewed By: 842974287

Differential Revision: D30322844

fbshipit-source-id: dfd925e3432de128f2925b1aa55d6125e63359af
2021-08-17 21:37:26 -07:00
3fd8e09102 Fix RPC Python User Function Error Handling (#63406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63406

The `RemoteException` will be thrown on the caller side when converting
the response message to an IValue. Since it is a Python error, the error
message needs to be extracted explicitly and the `PyErr` cleared.

Test Plan: Imported from OSS

Reviewed By: rohan-varma, ngimel

Differential Revision: D30372741

Pulled By: mrshenli

fbshipit-source-id: 1f72a7ee0c39cc2ef070f99884c142f7b3e0543d
2021-08-17 20:14:03 -07:00
f12f667e12 [torch] Set default log level for torch elastic (#63214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63214

The default log level in fb and oss is different: in oss we use WARNING and in fb we use INFO.

Test Plan: unittests, f291441502

Reviewed By: cbalioglu

Differential Revision: D30296298

fbshipit-source-id: 89067352be767255fbc66e790ec333582de64c6c
2021-08-17 19:58:13 -07:00
dcf90b797c [BE] remove _SUPPORTED_OPTIM_MAP from tests (#63383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63383

Per title
ghstack-source-id: 135966157

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30358921

fbshipit-source-id: 965e054e525194b1ee55980340df275bab355c9b
2021-08-17 17:17:25 -07:00
5b8862abf1 [DDP] Support step_param for AdamW (#63382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382

Per title
ghstack-source-id: 135966156

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30255446

fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75
2021-08-17 17:16:11 -07:00
cd5e9dcc1d [quant][graphmode][fx][fix] Fix quantization for tuple arguments (#63376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63376

Previously, when a tuple was an argument to a quantizable op, it would mistakenly be transformed into a list;
this PR fixes that.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_preserve_tuple

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D30357642

fbshipit-source-id: 82d10805d9c00c003cc99983dca68b6455ff7b2e
2021-08-17 17:01:24 -07:00
975542c314 Add more ciflow labels for more workflows (#63410)
Summary:
- Add more ciflow labels and enable it for more workflows.
- Only the 'ciflow/default' workflows run by default at pull_request time
- Other labels can be triggered manually (by adding the labels and unassigning pytorchbot), or by waiting for pytorchbot's comment opt-in rollout
- Labels combine with logical `OR`: adding 'ciflow/cuda' + 'ciflow/win' triggers the union of their workflows. (design feedback is needed here)

Typical default workflows for normal PRs.

<details>
<summary>Generated label rules</summary>

![image](https://user-images.githubusercontent.com/658840/129779905-eb5e56dd-a696-4040-9eb6-71ecb6487dc1.png)

```
{
  "label_rules": {
    "ciflow/all": [
      "libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
      "libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-bionic-cuda10.2-py3.9-gcc7",
      "linux-bionic-py3.8-gcc9-coverage",
      "linux-xenial-cuda10.2-py3.6-gcc7",
      "linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-xenial-py3.6-gcc5.4",
      "linux-xenial-py3.6-gcc7-bazel-test",
      "periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-win-vs2019-cuda11.3-py3",
      "win-vs2019-cpu-py3",
      "win-vs2019-cuda10.1-py3",
      "win-vs2019-cuda11.1-py3"
    ],
    "ciflow/bazel": [
      "linux-xenial-py3.6-gcc7-bazel-test"
    ],
    "ciflow/coverage": [
      "linux-bionic-py3.8-gcc9-coverage"
    ],
    "ciflow/cpu": [
      "linux-bionic-py3.8-gcc9-coverage",
      "linux-xenial-py3.6-gcc5.4",
      "linux-xenial-py3.6-gcc7-bazel-test",
      "win-vs2019-cpu-py3"
    ],
    "ciflow/cuda": [
      "libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
      "libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-bionic-cuda10.2-py3.9-gcc7",
      "linux-xenial-cuda10.2-py3.6-gcc7",
      "linux-xenial-cuda11.1-py3.6-gcc7",
      "periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-win-vs2019-cuda11.3-py3",
      "win-vs2019-cuda10.1-py3",
      "win-vs2019-cuda11.1-py3"
    ],
    "ciflow/default": [
      "linux-bionic-py3.8-gcc9-coverage",
      "linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-xenial-py3.6-gcc5.4",
      "linux-xenial-py3.6-gcc7-bazel-test",
      "win-vs2019-cpu-py3",
      "win-vs2019-cuda10.1-py3"
    ],
    "ciflow/libtorch": [
      "libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
      "libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
      "periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7"
    ],
    "ciflow/linux": [
      "libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
      "libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-bionic-cuda10.2-py3.9-gcc7",
      "linux-bionic-py3.8-gcc9-coverage",
      "linux-xenial-cuda10.2-py3.6-gcc7",
      "linux-xenial-cuda11.1-py3.6-gcc7",
      "linux-xenial-py3.6-gcc5.4",
      "linux-xenial-py3.6-gcc7-bazel-test",
      "periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-linux-xenial-cuda11.3-py3.6-gcc7"
    ],
    "ciflow/scheduled": [
      "periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-linux-xenial-cuda11.3-py3.6-gcc7",
      "periodic-win-vs2019-cuda11.3-py3"
    ],
    "ciflow/slow": [
      "linux-bionic-cuda10.2-py3.9-gcc7",
      "linux-xenial-cuda10.2-py3.6-gcc7"
    ],
    "ciflow/win": [
      "periodic-win-vs2019-cuda11.3-py3",
      "win-vs2019-cpu-py3",
      "win-vs2019-cuda10.1-py3",
      "win-vs2019-cuda11.1-py3"
    ]
  },
  "version": "v1"
}
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63410

Reviewed By: ngimel

Differential Revision: D30378553

Pulled By: zhouzhuojie

fbshipit-source-id: 4e0953740793e5e72b95018f8ab2ce4a6a364c38
2021-08-17 17:00:09 -07:00
da87d648b3 F.avg_pool3 CUDA backward: gpuAtomicAddNoReturn -> fastAtomicAdd (#63387)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695

In the following two tables, I set `kernel_size` to 3 and `stride` to 2.
In benchmark, input tensors have the shape of (N, C, n_features, n_features, n_features).
Tested on RTX3080 w/ CUDA11.4 Update 1.

## This PR

|   N |   C |   n_features | dtype         |        time |
|----:|----:|-------------:|:--------------|------------:|
|  32 |   3 |            8 | torch.float16 | 7.46846e-05 |
|  32 |   3 |            8 | torch.float32 | 8.18968e-05 |
|  32 |   3 |           32 | torch.float16 | 0.000156748 |
|  32 |   3 |           32 | torch.float32 | 0.000165236 |
|  32 |   3 |          128 | torch.float16 | 0.00549854  |
|  32 |   3 |          128 | torch.float32 | 0.008926    |

## master (6acd87f)

|   N |   C |   n_features | dtype         |        time |
|----:|----:|-------------:|:--------------|------------:|
|  32 |   3 |            8 | torch.float16 | 7.60436e-05 |
|  32 |   3 |            8 | torch.float32 | 7.55072e-05 |
|  32 |   3 |           32 | torch.float16 | 0.000189292 |
|  32 |   3 |           32 | torch.float32 | 0.000168645 |
|  32 |   3 |          128 | torch.float16 | 0.00699538  |
|  32 |   3 |          128 | torch.float32 | 0.00890226  |

master's time divided by PR's time is as follows:

| N | C | n_features | master / PR |
|---:|---:|---------------:|----------------:|
| 32 | 3 | 8 | 1.018 |
| 32 | 3 | 32 | 1.208 |
| 32 | 3 | 128 | 1.272|

cc: xwang233 ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63387

Reviewed By: mruberry

Differential Revision: D30381434

Pulled By: ngimel

fbshipit-source-id: 3b97aee4b0d457a0277a0d31ac56d4151134c099
2021-08-17 16:53:13 -07:00
6e5d065b2b Add pocketfft as submodule (#62841)
Summary:
Using https://github.com/mreineck/pocketfft

Also delete explicit installation of pocketfft during the build as it will be available via submodule

Limit PocketFFT support to cmake-3.10 or newer, as `set_source_files_properties` does not seem to work as expected with cmake-3.5

Partially addresses https://github.com/pytorch/pytorch/issues/62821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62841

Reviewed By: seemethere

Differential Revision: D30140441

Pulled By: malfet

fbshipit-source-id: d1a1cf1b43375321f5ec5b3d0b538f58082f7825
2021-08-17 15:29:56 -07:00
078dcc4e97 [wip] Move smallest bucket to end after rebuild buckets (#62279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62279

Before rebuild buckets, `kDefaultFirstBucketBytes` is actually misleading: because we reverse the parameter indices when initializing the reducer, it is actually the size of the last bucket.

Currently, rebuild buckets sets this to be the first bucket size; we are seeing whether keeping it as the last can help perf.

This is currently experimental only and don't plan to land it unless experiments show a clear win.
ghstack-source-id: 135966897

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29927931

fbshipit-source-id: 55b949986fa2c3bade6fcb4bf5b513461bf0f490
2021-08-17 15:04:50 -07:00
e0e2796fa9 adding a note to the documentation of polar (#63259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63259

Fix #52919

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D30342536

Pulled By: NivekT

fbshipit-source-id: 4c61a86f96a6370cc64652bf652c4ae25c9f4601
2021-08-17 14:48:32 -07:00
bcddc71f26 [quant][graphmode][fx][bc-breaking] Support for reference pattern for fixqparam ops in eval mode (#62608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62608

Insert extra fixeqparam fake quant in the output of fixed qparam ops in fbgemm e.g. sigmoid
so that we can produce reference patterns for these ops

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30053978

fbshipit-source-id: c527944b6e791bb4d45ebe96265af52794203695
2021-08-17 14:42:40 -07:00
9cd24e12a1 Revert D30281388: [PyTorch] Avoid using std::regex for device string parsing in Device.cpp
Test Plan: revert-hammer

Differential Revision:
D30281388 (4d6f98ecad)

Original commit changeset: 4d998e9f313e

fbshipit-source-id: 11134b3400cc3e851155c9c1b6fb59308ff1567b
2021-08-17 14:40:27 -07:00
495e7e4815 Fix zero-dim handling in torch.matmul (#63359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63359

Fixes #63352. The problem was that in e.g. `torch.matmul(A, B)` with A,
B having shapes [3, 2, 0] and [0, 2], the code attempts to call
`A.view(-1, 0)` which fails due to "-1 being ambiguous". The solution is
to manually compute what we want the shape of the view to be.
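
A small repro of the case being fixed (shapes taken from the summary above):

```python
import torch

A = torch.randn(3, 2, 0)
B = torch.randn(0, 2)
# Previously this raised because the batched path called A.view(-1, 0),
# where -1 is ambiguous; now the view shape is computed explicitly.
print(torch.matmul(A, B).shape)  # torch.Size([3, 2, 2]) (all zeros)
```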

Test Plan: - new tests

Reviewed By: ngimel

Differential Revision: D30351583

Pulled By: zou3519

fbshipit-source-id: 7625691fe8b85d96a4073409596a932c303e3e8c
2021-08-17 13:44:47 -07:00
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
a2db5d34a5 OpInfo fix: conv_transpose2d (#63389)
Summary:
Addresses comment: https://github.com/pytorch/pytorch/pull/62882#issuecomment-899679606.

cc: mruberry ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63389

Reviewed By: mruberry

Differential Revision: D30377481

Pulled By: ngimel

fbshipit-source-id: 0fa21acc3503c259c9b27463e8555247c43d9e2e
2021-08-17 13:42:36 -07:00
9d9e7a8d72 [Static Runtime] Implement aten::append (#63350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350

Add a native implementation for `aten::append`, the list append op.

Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`

Reviewed By: hlu1

Differential Revision: D30326461

fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
2021-08-17 13:40:18 -07:00
6621df9a6a [vulkan] Add log_softmax (#63193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63193

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D30291987

fbshipit-source-id: 89c6560274e5a841e5af249f6963b67ef6826f4c
2021-08-17 13:36:02 -07:00
b0396e39f4 [quant][fx] Ensure qconfig works for QAT with multiple modules (#63343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63343

The previous implementation had a bug where we were trying to modify an ordered dict value while iterating through it.
This fixes it by creating a copy before modifying it.
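
The underlying Python pitfall and the fix pattern, for reference:

```python
d = {"a": 1, "b": 2}

# Mutating a dict while iterating over it raises RuntimeError:
#     for k in d:
#         d[k + "_new"] = d[k]

# Iterating over a copy of the keys avoids the error:
for k in list(d):
    d[k + "_new"] = d[k]
print(d)  # {'a': 1, 'b': 2, 'a_new': 1, 'b_new': 2}
```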

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qconfig_qat_module_type

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D30346116

fbshipit-source-id: 0e33dad1163e8bff3fd363bfd04de8f7114d7a3a
2021-08-17 11:40:51 -07:00
e000dfcf97 Add return type hint and improve the docstring of consume_prefix_in_state_dict_if_present method (#63388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63388

Context: https://discuss.pytorch.org/t/how-to-use-the-helper-function-consume-prefix-in-state-dict-if-present/129505/3

Make it clear that this method strips the prefix in place rather than returning a new value.

Additional reformatting is also applied.
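
A short usage sketch (the import path is believed current for this version, but is an assumption):

```python
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torch.nn.Linear(2, 2)
# Simulate a DDP-style state dict with a "module." prefix.
state_dict = {"module." + k: v for k, v in model.state_dict().items()}

# Strips the prefix *in place* and returns None.
consume_prefix_in_state_dict_if_present(state_dict, "module.")
model.load_state_dict(state_dict)
```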
ghstack-source-id: 135973393

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D30360931

fbshipit-source-id: 1a0c7967a4c86f729e3c810686c21dec43d1dd7a
2021-08-17 11:30:27 -07:00
fcc840eae0 Add handling of ifs to shape propagation (#62914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62914

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30196945

Pulled By: eellison

fbshipit-source-id: 1c0c7f938c4547330fd1dba8ab7dd0b99a79b6a9
2021-08-17 11:26:42 -07:00
3975c08e5d Small shape analysis changes (#62911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62911

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D30196946

Pulled By: eellison

fbshipit-source-id: 2562bab323088d9c1440ae0431e533f9bcc513d3
2021-08-17 11:26:40 -07:00
e2227e86e4 Add a few peepholes (#62910)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62910

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30196947

Pulled By: eellison

fbshipit-source-id: d88c92616d4de4f47ff4fcf5c1994e629ca20395
2021-08-17 11:26:38 -07:00
9a60759453 Propagate symbolic dimensions through idioms like x.view(y.size()) (#61975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61975

Propagate symbolic dimensions through size calls. We did this by associating SymbolicSizes with integer inputs, looking through their constructors for `x.size(1)` or `x.size()` nodes.
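
A tiny scripted example of the idiom this enables (illustrative only):

```
import torch

@torch.jit.script
def reshape_like(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The integers produced by y.size() come from aten::size nodes, so
    # shape propagation can now carry y's symbolic dimensions into view.
    return x.view(y.size())
```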

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30196948

Pulled By: eellison

fbshipit-source-id: 377fc1d2f6d396c52dc0e87fa814b15720f1414e
2021-08-17 11:25:22 -07:00
60cadd0bd1 [fx2trt] Refactor linear op to use mm + add
Summary:
Previously, linear was translated to fully_connected, which only works when the weight is a constant.
This diff changes it to mm + add so that the weight can be an ITensor, letting us keep the weight - quantize - dequantize
pattern in the produced TensorRT network
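
A quick numerical check of the decomposition in plain PyTorch (this is the math, not the converter code itself):

```
import torch

x = torch.randn(4, 3)
w = torch.randn(5, 3)  # linear weight: (out_features, in_features)
b = torch.randn(5)

ref = torch.nn.functional.linear(x, w, b)
mm_add = torch.mm(x, w.t()) + b  # the mm + add form used by the converter
assert torch.allclose(ref, mm_add, atol=1e-6)
```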

Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_linear

Reviewed By: 842974287

Differential Revision: D30294751

fbshipit-source-id: 596fbd4c81caef8df41a002a2e14fbf22d9d2a80
2021-08-17 10:52:28 -07:00
517aa8965a Updates set_default_dtype documentation (#63233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60560.

The description of set_default_dtype is updated to clarify that it affects the interpretation of Python numbers as either float32 (complex64) or float64 (complex128) and that default (floating) dtypes other than float32 or float64 are unsupported.
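
For example, a short illustration of the clarified behavior:

```
import torch

torch.set_default_dtype(torch.float64)
print(torch.tensor(1.5).dtype)   # torch.float64
print(torch.tensor(1.5j).dtype)  # torch.complex128

torch.set_default_dtype(torch.float32)
print(torch.tensor(1.5).dtype)   # torch.float32
print(torch.tensor(1.5j).dtype)  # torch.complex64
```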

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63233

Reviewed By: VitalyFedyunin

Differential Revision: D30306396

Pulled By: mruberry

fbshipit-source-id: bbee62f323c773b23b2fa45cb99122bc28197432
2021-08-17 10:41:03 -07:00
63554cfb3d Remove backend_debug from torch_core srcs and replace with library dependency (#63111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63111

### Problem:
Buck contains at least two libraries which have `backend_debug_info.cpp` as a source, `torch_core` and `backend_interface_lib`. `backend_debug_info.cpp` registers BackendDebugInfo as a class. If targets contain both libraries (e.g. sparkAR debug build with NNAPI delegation), then BackendDebugInfo is registered twice, causing a runtime error.
### Solution:
These changes remove `backend_debug_info.cpp` and `backend_interface.cpp` as a source in `torch_core` and adds backend_interface_lib as a dependency instead.

**build_variables.bzl:**
- Added a list that excludes `backend_debug_info.cpp` and `backend_interface.cpp` ( both srcs already included by `backend_interface_lib`)

**buck:**
- torch_core: Removed `backend_debug_info.cpp` from srcs and added `backend_interface_lib` deps
- backend_interface_lib: Replaced `torch_mobile_core` dep with more specific deps
  - to avoid an indirect dep between `torch_core` and `torch_mobile_core`

ghstack-source-id: 135981061

Test Plan:
### Test Plan:
Build and run SparkAR internally with Android NNAPI Delegation (`buck build --show-output arstudioplayer_arm64_debug`)
and internal tests.

Reviewed By: iseeyuan

Differential Revision: D30259034

fbshipit-source-id: 0c14c827732f07fb9b9bd25a999828b51793cdcc
2021-08-17 10:33:35 -07:00
3aecec609f Move Android Nnapi srcs from aten_native_cpu to aten_cpu (#62919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62919

Move Android NNAPI srcs (nnapi_bind.cpp, nnapi_wrapper.cpp, nnapi_model_loader.cpp) from aten_native_cpu to aten_cpu, so that later the NNAPI delegate's execution library can depend on it.

aten_native_cpu is built selectively per app, but the srcs have no selective components and are required for the NNAPI delegate library in D30259033.

See Buck Dependencies: https://docs.google.com/document/d/17RuWkqWKCO6sc5fKzIDkGeNhhvMk7BvJOqeSnGsHZ8o/edit?usp=sharing
ghstack-source-id: 135981062

Test Plan: `buck build --show-output arstudioplayer_arm64_debug` and internal tests

Reviewed By: iseeyuan

Differential Revision: D30164867

fbshipit-source-id: 0beff481ff250e75664ce8393beabbeb9db66770
2021-08-17 10:32:30 -07:00
c982f13a80 [android][vulkan] Fix model loading for Vulkan backend (#63402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63402

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D30370692

Pulled By: IvanKobzarev

fbshipit-source-id: 73311b9b767fe9ed3ae390db59d6aa2c4a98f06d
2021-08-17 10:20:32 -07:00
f70b9ee5de Advertise USE_PRECOMPILED_HEADERS in CONTRIBUTING.md (#62827)
Summary:
This option was added in https://github.com/pytorch/pytorch/issues/61940 and fits with this section's theme of improving build times.

I've also changed it to a `cmake_dependent_option` instead of `FATAL_ERROR`ing for older CMake versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62827

Reviewed By: astaff

Differential Revision: D30342102

Pulled By: malfet

fbshipit-source-id: 3095b44b7085aee8a884ec95cba9f8998d4442e7
2021-08-17 10:14:40 -07:00
011fdc3b7e [fx] persist tracer_cls on fx.Graph when deep copying (#63353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63353

Custom deepcopy method copies all nodes but does not copy the tracer_cls attribute

Reviewed By: houseroad

Differential Revision: D30349424

fbshipit-source-id: 3e98bdac8a8a992eb0b4ec67fe80bb2e5cf3884d
2021-08-17 09:57:48 -07:00
4d6f98ecad [PyTorch] Avoid using std::regex for device string parsing in Device.cpp (#63204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63204

Currently, `std::regex` is used for parsing device strings. This is undesirable for a few reasons.

1. Increases binary size
2. Slows down model loading
3. Potentially uses more memory at runtime
4. Marginally increases build time for code that uses std::regex compared to code that does not

This change avoids the use of `std::regex` for parsing the device string since we don't need to.
ghstack-source-id: 136006963

Test Plan:
### AI Bench Runs

**Before this change:**
1. Model Load time: [252ms](https://www.internalfb.com/intern/aibench/details/332471502816548)
2. Model unload time: 3.5ms

**After this change:**
1. Model Load time: [240ms](https://www.internalfb.com/intern/aibench/details/652195589031318), which is an approx 5% reduction for the current model. I suspect the percentage will be larger for smaller models, since this is a fixed-cost reduction.
2. Model unload time: 3.3ms (probably too small to be meaningfully impactful to an end user).

### BSB Results

```
D30281388-V1 (https://www.internalfb.com/intern/diff/D30281388/?dest_number=135713848)

messenger-pika-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -7.1 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.6 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:551399955987465@base/bsb:551399955987465@diff/
```

Reviewed By: raziel

Differential Revision: D30281388

fbshipit-source-id: 4d998e9f313e6366d9d89a6a73cd090ddfb059fc
2021-08-17 09:23:48 -07:00
013a42bdb1 [PyTorch] Add Device_test.cpp (#63203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63203

Currently, `c10::Device` isn't being tested - i.e. there's no test to ensure that the device string parsing works as expected. This diff adds very basic tests to assert that the stuff we expect to work works, and the stuff that we don't expect to work doesn't work.

ghstack-source-id: 136006962

Test Plan:
New test. Ran as:

```
cd fbsource/fbcode/
buck test //caffe2/c10:c10_test_0 -- -r '.*DeviceTest.*'
```

Reviewed By: dreiss, raziel

Differential Revision: D30286910

fbshipit-source-id: b5699068dcbba89d5d224dbaf74b175f3f785a00
2021-08-17 09:22:35 -07:00
336aa9cd85 change with_callable_args to return a fresh _PartialWrapper (#63374)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63326

Currently `get_callable_args` has the side effect of mutating the input _PartialWrapper. When that input is one of the global defaults, there are all sorts of lifetime issues that crop up. (Details in the linked issue.) So far as I can tell, we only need to make a constructor which is module (and by extension device) aware, so making a fresh one should have the same effect without leaking the last call's module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63374

Test Plan: the repro in https://github.com/pytorch/pytorch/issues/63326 now reports no leaked Tensors, and all quantization tests pass locally.

Reviewed By: HDCharles

Differential Revision: D30359360

Pulled By: robieta

fbshipit-source-id: aef33261ac49952d8d90da868a57ab063dfc456e
2021-08-17 09:11:38 -07:00
7bad9ac78a Fix flaky test for dp saved tensor hooks (#63324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63324

Fix for https://www.internalfb.com/tasks/?t=98258963
`catch_warnings` seems to only trigger once in certain cases where it
should trigger twice.
This test is only meant to check whether hooks are triggered or not,
so changing it to self.assertGreater is ok.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30340833

Pulled By: Varal7

fbshipit-source-id: 1bfb9437befe9e8ab8f95efe5f513337fa9bdc5c
2021-08-17 08:56:58 -07:00
2992d92b5a Add mode to TarArchiveReader (#63332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63332

Add a corresponding PR from [torchdata](https://github.com/facebookexternal/torchdata/pull/101)

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D30350151

Pulled By: ejguan

fbshipit-source-id: bced4a1ee1ce89d4e91e678327342e1c095dbb9e
2021-08-17 07:28:37 -07:00
cae5ddc427 add torch.meshgrid() OpInfo (#62720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62720

Reviewed By: astaff

Differential Revision: D30344574

Pulled By: dagitses

fbshipit-source-id: ed42d9fe20741df98018efb08e640fca370583fb
2021-08-17 04:04:24 -07:00
22f78144c7 Extends warning on norm docs (#63310)
Summary:
torch.norm has a couple documentation issues, like https://github.com/pytorch/pytorch/issues/44552 and https://github.com/pytorch/pytorch/issues/38595, but since it's deprecated this PR simply clarifies that the documentation (and implementation) of torch.norm may be incorrect. This should be additional encouragement for users to migrate to torch.linalg.vector_norm and torch.linalg.matrix_norm.
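
A short sketch of the recommended migration (using the torch.linalg APIs the docs point to):

```
import torch

x = torch.randn(3, 4)

# Instead of the deprecated torch.norm:
v = torch.linalg.vector_norm(x)  # 2-norm over all elements
m = torch.linalg.matrix_norm(x)  # Frobenius norm by default
```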

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63310

Reviewed By: ngimel

Differential Revision: D30337997

Pulled By: mruberry

fbshipit-source-id: 0fdcc438f36e4ab29e21e0a64709e4f35a2467ba
2021-08-16 22:23:45 -07:00
ad94248b57 Cleanup dead code (#63328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63328

This code supported the old `at::_fft_with_size` operator which no longer exists.

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D30343557

Pulled By: mruberry

fbshipit-source-id: 7a71585e013acb46c98f14fd40e15bdfbf026bac
2021-08-16 22:13:08 -07:00
877b649bc3 Workaround for cuFFT bug (#63327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63327

Fixes #63152

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D30343558

Pulled By: mruberry

fbshipit-source-id: 68e17a07650f65f397e26efc417e97e2ab302f82
2021-08-16 22:11:52 -07:00
794b04c6c8 Add step to report code coverage from GHA (#63373)
Summary:
Similar to the logic provided in b2069e7d01/.circleci/verbatim-sources/job-specs/pytorch-job-specs.yml (L197-L201)

Fixes https://github.com/pytorch/pytorch/issues/63366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63373

Reviewed By: walterddr

Differential Revision: D30357737

Pulled By: malfet

fbshipit-source-id: 20b115eb4d6412bd9895680308a9097742d2ae7b
2021-08-16 20:42:38 -07:00
548c717cbd [TensorExpr] Remove test_train from tensorexpr tests. (#63194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63194

This test implements functionality that is used nowhere, and its author no
longer works on it. This PR also adds test_approx to CMakeLists, where
it had been missing before.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30292777

Pulled By: ZolotukhinM

fbshipit-source-id: ab6d98e729320a16f1b02ea0c69734f5e7fb2554
2021-08-16 20:36:31 -07:00
e7724bb100 [JIT] Set future's error to current exception as is when --torch_jit_enable_rethrow_caught_exception=true (#63348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63348

This change addresses singlaiiit's comment on D30241792 (61b49c8e41), which makes the JIT interpreter's behavior consistent between the cases where `future` is set and where it is not.

Test Plan: Enhanced `EnableRethrowCaughtExceptionTest.EnableRethrowCaughtExceptionTestRethrowsCaughtException` to cover the modified code path.

Reviewed By: singlaiiit

Differential Revision: D30347782

fbshipit-source-id: 79ce57283154ca4372e5341217d942398db21ac8
2021-08-16 17:32:13 -07:00
075024b9a3 [Static Runtime] Fix a bug that assigns multiple outputs to single storage (#63012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63012

This change fixes a bug where the static runtime's memory optimizer assigns multiple outputs of a node to the same storage. Fixing this bug enables the static runtime to run `inline_cvr` with its memory optimizer enabled.

A problematic line from `inline_cvr` was as follows:
```
  %7767 : Tensor, %getitem_6419.1 : Tensor = fb::gather_ranges(%tensor74.1, %7764)
```
where enabling the memory optimizer assigned `%7767` and `%getitem_6419.1` to the same storage, which corrupted their data during the 2nd iteration.

This change fixed the aforementioned bug by marking all inputs & outputs of a node as `alive` during our liveness analysis. By doing that, no inputs / outputs will collide with each other. I believe this is a fair assumption that most ops' implementations already make, but it was missing from our analysis before this change.

Test Plan: - Added a unittest `StaticRuntime.ValuesShareSameStorageDoesNotContainOutputsFromSameNode` to cover the new code.

Reviewed By: hlu1

Differential Revision: D30202018

fbshipit-source-id: 10287a1bee9e86be16a5201e9a7cd7c7f046bab9
2021-08-16 16:52:02 -07:00
068d6fec5c [Model Averaging] Add a few member methods of PostLocalSGDOptimizer (#63340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63340

Some methods are needed, such as those for accessing optimizer states. These are necessary for integration with PyTorch Lightning.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135912246

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: rohan-varma

Differential Revision: D30328794

fbshipit-source-id: e585b874313bd266fdc7c79936e2af98700c7bad
2021-08-16 16:39:01 -07:00
aa63c0d9df [PyPer] Skip printing out per node time when do_profile is on (#63256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63256

This suppresses printing the per-node time, which is very long when the net has too many ops. It can be easily turned on by setting `--pt_sr_print_per_node_time=1`.

Reviewed By: ajyu, mikeiovine

Differential Revision: D30298331

fbshipit-source-id: 32b3f93b3fe19d335654168311fda93331a1e706
2021-08-16 16:32:19 -07:00
b2069e7d01 Refactor NnapiCompilation registration into its own file (#63183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63183

Move registration of NnapiCompilation into its own file, so that `nnapi_bind.cpp` (which contains the implementation of NnapiCompilation) can be moved to `aten_cpu`, while maintaining the selectiveness for registration.

`nnapi_bind.cpp` is moved to `aten_cpu` in https://github.com/pytorch/pytorch/pull/62919. See the PR for more details on why it's needed.

ghstack-source-id: 135900318

Test Plan: Nnapi unit tests: `python test/test_nnapi.py`

Reviewed By: iseeyuan

Differential Revision: D30288708

fbshipit-source-id: 6ed5967fa6bd018075469d18e68f844d413cf265
2021-08-16 15:45:26 -07:00
da36bbcd35 Add section to CONTRIBUTING.md explaining developer docs (#63228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63228

It is a quick summary and links to a page on the Developer Wiki that has
more detail.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30347109

Pulled By: zou3519

fbshipit-source-id: a6242986d275e5279ca3f61ade2294a132d268c4
2021-08-16 15:44:10 -07:00
4982fc4ecf test: Add ability to set CONTINUE_THROUGH_ERROR (#63357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63357

Adds the ability to set CONTINUE_THROUGH_ERROR as an environment
variable so that we can easily set it without having to add the flag
directly

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D30351108

Pulled By: seemethere

fbshipit-source-id: 767fa9bd24e1399f359eb24d16f6cc985a2d7173
2021-08-16 15:35:40 -07:00
6acd87fe6a Add driver function to run test_sharded_tensor.py and test_sharding_spec.py (#63189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63189

Add a main --> run_tests function to each test file, which is needed to launch the real test cases in the OSS flow.
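
The driver being added follows the usual OSS pattern, sketched here (assuming the common_utils helper):

```
from torch.testing._internal.common_utils import run_tests

# ... test classes such as TestShardingSpec defined above ...

if __name__ == "__main__":
    run_tests()
```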

Test Plan:
before:
$ python test/distributed/_sharding_spec/test_sharding_spec.py --v   ==> nothing happened
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py --v ==> nothing happened

after:

$ python test/distributed/_sharding_spec/test_sharding_spec.py --v   ==>

test_chunked_sharding_spec (__main__.TestShardingSpec) ... ok
test_device_placement (__main__.TestShardingSpec) ... ok
test_enumerable_sharding_spec (__main__.TestShardingSpec) ... ok

$ python test/distributed/_sharded_tensor/test_sharded_tensor.py --v

test_complete_world_size (__main__.TestShardedTensorChunked) ... ok
test_insufficient_sharding_dims (__main__.TestShardedTensorChunked) ... ok
test_invalid_pg_rpc_ranks (__main__.TestShardedTensorChunked) ... [W tensorpipe_agent.cpp:699] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
ok
test_invalid_sharding (__main__.TestShardedTensorChunked) ... ok
test_load_state_dict_errors (__main__.TestShardedTensorChunked) ... ok
test_multiple_local_shards (__main__.TestShardedTensorChunked) ... ok
test_new_group (__main__.TestShardedTensorChunked) ... ok
test_partial_world_size (__main__.TestShardedTensorChunked) ... ok
test_sharded_tensor_metadata (__main__.TestShardedTensorChunked) ... ok
test_sharded_tensor_sizes (__main__.TestShardedTensorChunked) ... ok
test_sharding_columns (__main__.TestShardedTensorChunked) ... ok
test_state_dict (__main__.TestShardedTensorChunked) ... ok
test_state_dict_new_group (__main__.TestShardedTensorChunked) ... ok
test_state_dict_no_sharded_tensors (__main__.TestShardedTensorChunked) ... ok
test_grid_sharding (__main__.TestShardedTensorEnumerable) ... ok
test_multiple_local_shards (__main__.TestShardedTensorEnumerable) ... ok
test_new_group (__main__.TestShardedTensorEnumerable) ... ok
test_partial_world_size (__main__.TestShardedTensorEnumerable) ... ok
test_sharded_tensor_metadata (__main__.TestShardedTensorEnumerable) ... ok
test_uneven_shards (__main__.TestShardedTensorEnumerable) ... ok
test_with_rpc_names (__main__.TestShardedTensorEnumerable) ... ok
test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards) ... ok
test_init_from_local_shards_invalid_shards (__main__.TestShardedTensorFromLocalShards) ... ok
test_init_from_local_shards_invalid_shards_gaps (__main__.TestShardedTensorFromLocalShards) ...

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30294094

fbshipit-source-id: 08f0431a12ea854abe00dc920205b10ba43ae6b6
2021-08-16 15:25:32 -07:00
f4f2c1231a [fx2trt] add unsqueeze converter (#63355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63355

Added converter for acc_ops.unsqueeze. Needed for ig model.

Didn't add support for inputs that have more than one dynamic dim. This is not needed right now and I feel it would be a rare case.

Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D30138293

fbshipit-source-id: 899fe8eb68387de83195a2f6e199618d96f09a9e
2021-08-16 15:18:43 -07:00
078b8004a6 [Static Runtime] Implement prim::TupleUnpack (#63243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63243

Add `prim::TupleUnpack` native op to static runtime.

Test Plan: Unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D30306955

fbshipit-source-id: 21923d6cbd5545c144ac051b3d48b37ec6e610cf
2021-08-16 14:56:30 -07:00
a12b371f7c [fx2trt] Factor out add_matrix_multiply_layer
Summary: Factor out the function so that it can be reused in future diffs

Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_matmul

Reviewed By: 842974287

Differential Revision: D30322823

fbshipit-source-id: 069b945de2c744cdbcca1618b62827692dfb4174
2021-08-16 14:13:37 -07:00
dc5ce22a1a A re-open PR: Avoid re-creating the random number generator in RandomSampler (#63026)
Summary:
More details can be found in the old pr: https://github.com/pytorch/pytorch/pull/53085

ejguan  Thanks for your guidance. I tried to reopen this PR following your instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63026

Reviewed By: anjali411

Differential Revision: D30224920

Pulled By: ejguan

fbshipit-source-id: 2fa83bd4a2661485e553447fe3e57ce723f2716d
2021-08-16 14:08:37 -07:00
3f06f29577 Improve pip package determination (#63321)
Summary:
Invoking `pip` or `pip3` lists the packages for whichever `pip` alias is first on the PATH, rather than for the interpreter currently being executed. Changed `get_pip_packages` to use `sys.executable + '-mpip'`

Also, add mypy to the list of packages of interest

Discovered while looking at https://github.com/pytorch/pytorch/issues/63279
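
A minimal sketch of the interpreter-pinned query (illustrative, not the exact collect_env code):

```
import subprocess
import sys

# List packages for the interpreter actually running this script,
# not whichever `pip` happens to be first on the PATH.
out = subprocess.run(
    [sys.executable, "-mpip", "list", "--format=freeze"],
    capture_output=True, text=True, check=True,
).stdout
print(out.splitlines()[:5])
```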

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63321

Reviewed By: walterddr

Differential Revision: D30342099

Pulled By: malfet

fbshipit-source-id: fc8d17cf2ddcf18236cfde5c1b9edb4e72804ee0
2021-08-16 13:54:39 -07:00
4a59f0b9d9 [Profiler] Change FLOP/s to Total FLOPs (#62779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62779

Change from floating point operations per second to total floating point operations. This requires removing the division by execution time from the Kineto-computed FLOPs and updating the necessary documentation.

Test Plan:
Running the following script:

```
import torch
from torch.profiler import profile
import torchvision.models as models

model = models.resnet18().eval()
inputs = torch.randn(5, 3, 224, 224)
with torch.no_grad():
    with profile(record_shapes=True, with_flops=True) as prof:
        model(inputs)
print(prof.key_averages().table(sort_by="cpu_time_total"))
```

Before diff results in:

{F636640118}

And after the diff it should be about `(27.78 * 10^9) FLOP/s * .652838 seconds = 18135839640 FLOP = 18.136 GFLOP`. Running the script again yields this answer:

{F636655686}

------------------------------------

Reviewed By: gdankel

Differential Revision: D29972997

fbshipit-source-id: 0f8d9f264b7d9f8f6bb3f10ab7c2c9794291e28b
2021-08-16 13:43:32 -07:00
d2e8359971 Fix triage workflow when the card already exists in project (#63347)
Summary:
Fixes issues like https://github.com/pytorch/pytorch/runs/3336787242

```
RequestError [HttpError]: Validation Failed: {"resource":"ProjectCard","code":"unprocessable","field":"data","message":"Project already has the associated issue"}
Error: Unhandled error: HttpError: Validation Failed: {"resource":"ProjectCard","code":"unprocessable","field":"data","message":"Project already has the associated issue"}
    at /home/runner/work/_actions/actions/github-script/v2/dist/index.js:7531:23
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async eval (eval at callAsyncFunction (/home/runner/work/_actions/actions/github-script/v2/dist/index.js:7985:56), <anonymous>:63:1)
    at async main (/home/runner/work/_actions/actions/github-script/v2/dist/index.js:8011:20) {
  name: 'HttpError',
  status: 422,

...
```

The card may already exist, so a `422` status code is expected and need not be treated as an error. Anything else will re-throw the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63347

Reviewed By: malfet

Differential Revision: D30348529

Pulled By: zhouzhuojie

fbshipit-source-id: 36647837bfccad43ce01eb5dfe6642e685615037
2021-08-16 13:33:58 -07:00
3ce67efea2 [opinfo] nn.functional.pad (#62814)
Summary:
Reference: https://github.com/facebookresearch/functorch/issues/78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62814

Reviewed By: VitalyFedyunin

Differential Revision: D30307492

Pulled By: zou3519

fbshipit-source-id: 4f6062eb4a3c91ed1795df1f82846afa0abafcdc
2021-08-16 13:29:34 -07:00
1e8de64c66 Add expecttest to requirements.txt (#63320)
Summary:
This PR closes the developer environment gap left by https://github.com/pytorch/pytorch/issues/60658 by adding [expecttest](https://github.com/ezyang/expecttest) to `requirements.txt`. Thus it provides a solution to one of the short-term problems that https://github.com/pytorch/pytorch/issues/60697 tries to solve, but does not provide a long-term solution to https://github.com/pytorch/pytorch/issues/61375.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63320

Reviewed By: malfet

Differential Revision: D30340654

Pulled By: samestep

fbshipit-source-id: 26c8f8c9889cce4a94fafb1bf2f0d6df4c70503f
2021-08-16 13:22:43 -07:00
e75ed4a4b5 add comma to prevent syntax errors (#62492)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62492

Reviewed By: VitalyFedyunin

Differential Revision: D30304684

Pulled By: ezyang

fbshipit-source-id: db08ca39bcecbfd79ea50df18536bf4e87f51e15
2021-08-16 12:27:31 -07:00
0074a099a8 Retry apt-get during setup_ci_workspace (#63319)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63319

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D30346067

Pulled By: bertmaher

fbshipit-source-id: 2aafa97e78f9297553d772b2524d6f1c0ebaa46e
2021-08-16 12:20:51 -07:00
dbcfd7739f Make torch.lu differentiable for wide/tall inputs + jit (#61564)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61564

Reviewed By: astaff

Differential Revision: D30338136

Pulled By: mruberry

fbshipit-source-id: f01436fc90980544cdfa270feee16bb3dda21b93
2021-08-16 11:40:57 -07:00
979180cd01 [Model Averaging] Allow subgroup to be None in PostLocalSGDState (#63277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277

`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with lightning DDP plugin setup where hook state should be provided before distributed environment initialization.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: cbalioglu

Differential Revision: D30325041

fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
2021-08-16 10:07:41 -07:00
d5d5f42ea9 Revert "[docs] Update docs for NegativeBinomial (#45693)" (#63192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63192

**Summary**
This reverts commit 402caaeba513929dcfe12df183c764b0ef43f688. As per the
dicussion in #62178, this commit was not needed.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30293202

Pulled By: SplitInfinity

fbshipit-source-id: 91ee7ad0523a9880605d83fe9712c39df67384a8
2021-08-16 09:14:44 -07:00
d1cbee7b2b Refactor BucketBatch (#63185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63185

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288893

Pulled By: ejguan

fbshipit-source-id: b88b792d12a83c99d8ea9e516e3b4c54a82100f6
2021-08-16 06:42:56 -07:00
56d609d93e Replace str by repr for DataChunk (#63184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288892

Pulled By: ejguan

fbshipit-source-id: 45c88fdd3987e234f2c22ebbbfd8d5044983c34c
2021-08-16 06:41:38 -07:00
e50e8b07d8 [nnc] Updated IRMutator and IRSimplifier to perform in-place mutations. (#63246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63246

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30309636

Pulled By: navahgar

fbshipit-source-id: 409ea8d6982888cfee9127e6248044dd2ed9d8d4
2021-08-16 00:09:22 -07:00
a421cba325 [docs][ao] Add overload information for fake_quantize_per_tensor_affine (#63258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63258

This function supports scalar and tensor qparams

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D30316432

fbshipit-source-id: 8b2f5582e7e095fdda22c17d178abcbc89a2d1fc
2021-08-15 22:47:05 -07:00
0831b59cf5 [docs][ao] Add missing docstrings for quantized_max_pool1d and quantized_max_pool2d (#63242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63242

These functions are part of the native functions namespace as well as the quantized namespace

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D30316430

fbshipit-source-id: cd9c839e5c1a961e3c6944e514c16fbc256a2f0c
2021-08-15 22:47:03 -07:00
a090073fe4 [docs][ao] Add missing documentation for torch.quantized_batch_norm (#63240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63240

Op is exposed via torch.quantized_batch_norm to the end user without any existing documentation

Test Plan:
CI

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30316431

fbshipit-source-id: bf2dc8b7b6f497cf73528eaa2bedef9f65029d84
2021-08-15 22:45:56 -07:00
50fc8e8250 [OpInfo] Add expected_failure kwarg to SkipInfo (#62963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62963

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30327199

Pulled By: heitorschueroff

fbshipit-source-id: 45231eca11d1697a4449d79849fb17264d128a6b
2021-08-15 18:09:20 -07:00
8987726cc6 Small refactor for OpInfo decorators (#62713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62713

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30327200

Pulled By: heitorschueroff

fbshipit-source-id: 1899293990c8c0a66da88646714b38f1aae9179d
2021-08-15 18:08:12 -07:00
3ca3349555 [Pytorch Edge] Fix broken test post changes in error reporting format. (#63287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63287

Recent changes in https://github.com/pytorch/pytorch/pull/62419 changed
the way module hierarchy is reported. Now it includes information about
function names as well.

Test Plan:
python test/mobile/test_lite_script_module.py
TestLiteScriptModule.test_save_mobile_module_with_debug_info_with_trace

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D30328512

fbshipit-source-id: ddd6b11b9ab01cc725f4568a35eff7a92f17204b
2021-08-15 16:14:11 -07:00
cec08e7032 To add warm-up scheduler to optim (#60836)
Summary:
Warm-up of learning rate scheduling was initially discussed by Priya et al. in the paper: https://arxiv.org/pdf/1706.02677.pdf .

In section 2.2 of the paper they discuss and propose the idea of warming up learning rate schedulers in order to prevent large variance / noise in the learning rate. The idea has been further discussed in the following papers:
  * Akilesh Gotmare et al. https://arxiv.org/abs/1810.13243
  * Bernstein et al  http://proceedings.mlr.press/v80/bernstein18a/bernstein18a.pdf
  * Liyuan Liu et al: https://arxiv.org/pdf/1908.03265.pdf

There are two types of popularly used learning rate warm-up ideas:
  * Constant warmup (start with a very small constant learning rate)
  * Linear warmup (start with a small learning rate and gradually increase it)

In this PR we are adding warm-up as a learning rate scheduler. Note that learning rate schedulers are chainable, which means that we can combine the warmup scheduler with any other learning rate scheduler to make a more sophisticated learning rate scheduler.

## Linear Warmup

Linear warmup multiplies the learning rate by a pre-defined constant - warmup_factor - in the first epoch (epoch 0), then increases this multiplication constant towards one over warmup_iters epochs. Hence at the i-th step the multiplication constant equals:

                    warmup_factor + (1-warmup_factor) * i /  warmup_iters

Moreover, the ratio of this quantity at step i to step i-1 gives

           1 + (1.0 - warmup_factor) / [warmup_iters*warmup_factor+(i-1)*(1-warmup_factor)]

which is used in the get_lr() method of our implementation. Below we provide an example of how to use the linear warmup scheduler and show how it works.

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=10, warmup_method="linear")

for epoch in range(15):

    print(epoch, scheduler.get_last_lr()[0])

    optimizer.step()
    scheduler.step()
```

```
0 0.010000000000000002
1 0.019000000000000003
2 0.028000000000000008
3 0.03700000000000001
4 0.04600000000000001
5 0.055000000000000014
6 0.06400000000000002
7 0.07300000000000002
8 0.08200000000000003
9 0.09100000000000004
10 0.10000000000000005
11 0.10000000000000005
12 0.10000000000000005
13 0.10000000000000005
14 0.10000000000000005
```

## Constant Warmup

Constant warmup has a straightforward idea: multiply the learning rate by warmup_factor until epoch warmup_iters is reached, then do nothing for the following epochs

```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")

for epoch in range(10):

    print(epoch, scheduler.get_last_lr()[0])

    optimizer.step()
    scheduler.step()
```

```
0 0.010000000000000002
1 0.010000000000000002
2 0.010000000000000002
3 0.010000000000000002
4 0.010000000000000002
5 0.10000000000000002
6 0.10000000000000002
7 0.10000000000000002
8 0.10000000000000002
9 0.10000000000000002
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60836

Reviewed By: saketh-are

Differential Revision: D29537615

Pulled By: iramazanli

fbshipit-source-id: d910946027acc52663b301f9c56ade686e62cb69
2021-08-15 12:31:45 -07:00
8e0998ca70 Move fx2trt and oss_acc_tracer to oss (#63101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63101

Move internal fx2trt to torch/fx/experimental/fx2trt and merge the two TRT interpreters we have right now. cc: mortzur as this might affect the uru exporting script.

Move oss_acc_tracer to torch/fx/experimental/fx_acc.

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D30257909

fbshipit-source-id: 4e374965fbf88d72e91844d9e9b6ff9b98f467d1
2021-08-15 11:53:36 -07:00
0ce4d30c44 Hide all symbols in llvm namespace (#63272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63272

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D30331695

Pulled By: bertmaher

fbshipit-source-id: d35130c96f7e2a31fa86d9d80de59002e96301df
2021-08-15 11:29:43 -07:00
045c4cb82f Add copy button to code snippets in docs (#63149)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63149

Test Plan: Imported from OSS

Reviewed By: navahgar, albanD

Differential Revision: D30308891

Pulled By: anjali411

fbshipit-source-id: ad51180ab2f27c4525682b2603bbf753bb8f1ce9
2021-08-15 06:25:32 -07:00
38c185189c [Pytorch Edge] Enable kineto profiler on mobile via EdgeKinetoProfiler (#62419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419

This diff adds support for the CPU-only Kineto profiler on mobile, thus
enabling Chrome trace generation on mobile. This brings the C++ API for
mobile profiling on par with TorchScript.
This is done via:
1. Utilizing debug handle annotations in KinetoEvent.
2. Adding post-processing capability, via callbacks, to
KinetoThreadLocalState.
3. Creating a new RAII-style profiler, KinetoEdgeCPUProfiler, which can be
used in the surrounding scope of model execution. This will write the Chrome
trace to the location specified in the profiler constructor.

Test Plan:
MobileProfiler.ModuleHierarchy

Imported from OSS

Reviewed By: raziel

Differential Revision: D29993660

fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
2021-08-13 21:40:19 -07:00
77a6436cac [Pytorch Mobile] Combine instructions and debug handles in a single struct (#62418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62418

Debug handles have a one-to-one correspondence with instructions, so just
combine them into one struct.

Test Plan:
CI

Imported from OSS

Reviewed By: raziel

Differential Revision: D29993661

fbshipit-source-id: 125c7163174cf66624dd95f110fdc8208fea8a07
2021-08-13 21:40:17 -07:00
1b04d99f55 [Pytorch Profiler] Introduce scopes to enableProfiler (#62417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62417

This diff adds an option to make enableProfiler enable callbacks only
for certain RecordScopes.
Why?
Profiling has some overhead when we repeatedly execute callbacks for
all scopes. On the mobile side, where we often have small quantized models,
this overhead can be large. We observed that by only profiling the top-level
op and skipping profiling of the other aten ops called within it, we can limit
this overhead. For example, instead of profiling at::conv2d -> at::convolution ->
at::convolution_ (and furthermore ops like transpose etc. if they are called),
we skip profiling of those inner ops. Of course this limits visibility, but
at least this way we get a choice.

Test Plan: Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29993659

fbshipit-source-id: 852d3ae7822f0d94dc6e507bd4019b60d488ef69
2021-08-13 21:40:15 -07:00
b00afe135d [Pytorch Profiler] Add debug_handles to KinetoEvent (#62228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62228

This diff adds debug handles to events and provides a way to use
RECORD_FUNCTIONs that will pass debug_handles down to profiler, which
will record it in the events.

Why add debug_handles?
For pytorch mobile, with the lite interpreter, we generate debug handles
that can be used to lazily symbolicate exception traces into a model-level
stack trace, similar to the model-level stack trace you get in
TorchScript models. The debug_handles also enable getting the module
hierarchy for a lite interpreter model, support for which was added to
KinetoProfiler in previous diffs.

Followup plan:
1. Enable scope callbacks such that the lite interpreter can use them to
profile only top-level ops.
2. Enable post processing callbacks that take KinetoEvents and populate
module hierarchy using debug handles.

This will let us use KinetoProfiler for lite interpreter use cases on
mobile. The aim is to use an RAII guard to similarly generate Chrome traces
for mobile use cases as well, although only for top-level ops.

Test Plan:
test_misc : RecordDebugHandles.Basic

Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29935899

fbshipit-source-id: 4f06dc411b6b5fe0ffaebdd26d3274c96f8f389b
2021-08-13 21:40:14 -07:00
44b12ba862 [Pytorch Profiler] Move start timestamp to end of start callback (#62191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62191

This moves start timestamping to the end of the callback. This way we don't
account for callstack/module-hierarchy-related overhead in op runtime.

Test Plan:
CI

Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D29910519

fbshipit-source-id: f462031a81ae12b3db7993cf482e5ad93a35e096
2021-08-13 21:40:12 -07:00
54f2eb6e7e [Pytorch Profiler] Add support for adding module hierarchy to (#61792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61792

KinetoEvent

This PR adds module hierarchy information to events.
What is module hierarchy information attached to events?
While profiling a TorchScript module, when events are added, we ask the JIT
what module hierarchy is associated with the node being
executed. At the time of execution of that node, there might be multiple
frames in the interpreter's stack. For each frame, we find the
corresponding node, and the corresponding module hierarchy is queried.
Module hierarchy corresponding to the node is associated with node's
InlinedCallStack. InlinedCallStack of node tracks the path via which the
node is inlined. Thus during the inlining process we annotate
module information corresponding to the CallMethod nodes being inlined.

With this PR, chrome trace will contain additional metadata:
"Module Hierarchy". This can look like this:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward
It contains module instance, type name and the method name in the
callstack.

Test Plan:
test_profiler

Imported from OSS

Reviewed By: raziel, ilia-cher

Differential Revision: D29745442

fbshipit-source-id: dc8dfaf7c5b8ab256ff0b2ef1e5ec265ca366528
2021-08-13 21:39:10 -07:00
385b082854 add substract of max and testcase (#63132)
Summary:
As discussed here https://github.com/pytorch/pytorch/pull/62897, in the path of BF16/non-last-dim Softmax, we are missing the subtraction of the max value, which causes overflow in the `exp()` calculation when the values of the input tensor are large, such as `1000.0`.
To avoid this issue, we add the subtraction of the max value and the corresponding test cases in this PR.
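
A small demonstration of why the max subtraction matters (plain PyTorch, float32 for clarity):

```
import torch

x = torch.tensor([1000.0, 999.0])

naive = torch.exp(x) / torch.exp(x).sum()  # exp overflows: inf / inf -> nan
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()

print(naive)   # tensor([nan, nan])
print(stable)  # tensor([0.7311, 0.2689])
```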

Note that without the subtraction of the max value (e.g. due to accidental reverts or changes), we will get the following error message from the test case:
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.05 and atol=0.05, found 103984 element(s) (out of 126720) whose difference(s) exceeded the margin of error (including 103984 nan comparisons). The greatest difference was nan (0.0 vs. nan), which occurred at index (0, 0, 0, 1).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63132

Reviewed By: VitalyFedyunin

Differential Revision: D30280792

Pulled By: cpuhrsch

fbshipit-source-id: 722821debf983bbb4fec878975fa8a4da0d1d866
2021-08-13 20:50:49 -07:00
baedb559e3 OpInfo: nn.functional.conv_transpose2d (#62882)
Summary:
See https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

cc: mruberry zou3519 Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62882

Reviewed By: bdhirsh

Differential Revision: D30280804

Pulled By: zou3519

fbshipit-source-id: e40cdf43e98c1f11e45df6b8bc13110b4d29c45f
2021-08-13 17:11:23 -07:00
f8e217a17e refactor fx2trt example script so it can be imported as a library (#63262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63262

Just create a `__main__` guard.

Test Plan: run linter, sandcastle tests

Reviewed By: 842974287

Differential Revision: D30263617

fbshipit-source-id: 8044ce5d815b043c3778591384cb13d9a89d0048
2021-08-13 16:59:29 -07:00
3f43a8b9a3 [iOS] Add LibTorch-Lite-Nightly pod (#63239)
Summary:
D30090760 (e182b459d9) was reverted by D30303292 because of a lint issue in `LibTorch-Lite-Nightly.podspec.template`. Resubmit the diff after fixing the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63239

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D30315690

Pulled By: hanton

fbshipit-source-id: f0fa719ffc3b8181ab28c123584ae5c1da8992c0
2021-08-13 16:21:41 -07:00
809e1e7457 Allow TransformerEncoder and TransformerDecoder to accept 0-dim batch sized tensors. (#62800)
Summary:
This PR fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.

This PR allows TransformerEncoder and Decoder (along with the inner `Layer` classes) to accept inputs with 0-dimensional batch sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62800

Reviewed By: VitalyFedyunin

Differential Revision: D30303240

Pulled By: jbschlosser

fbshipit-source-id: 8f8082a6f2a9f9d7ce0b22a942d286d5db62bd12
2021-08-13 16:11:57 -07:00
ab7a472980 [ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786)
Summary:
- HIP_VERSION semantic versioning will change in ROCm4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786

Reviewed By: bdhirsh

Differential Revision: D30281682

Pulled By: seemethere

fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
2021-08-13 15:00:43 -07:00
e711b5ce6c Respect user-set CMAKE_PREFIX_PATH (#61904)
Summary:
Fixes the case where the `CMAKE_PREFIX_PATH` variable gets silently overwritten by a user specified environment variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61904

Reviewed By: walterddr, malfet

Differential Revision: D29792014

Pulled By: cbalioglu

fbshipit-source-id: babacc8d5a1490bff1e14247850cc00c6ba9e6be
2021-08-13 13:49:05 -07:00
90a96e0642 Remove left-over print in test_diff_graph_inline_threshold (#63231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63231

Reviewed By: VitalyFedyunin

Differential Revision: D30305851

Pulled By: gmagogsfm

fbshipit-source-id: 43da3b5f49ad4a6a2d6d174acf792f3ccf41a463
2021-08-13 13:11:27 -07:00
cc6b023cba Add CostInferenceFunction for SplitOp (#63133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63133

SplitOp is costly but is missing a cost inference function, which hurts cost-based balancing. The changes are:
(1) Addition of CostInferenceFunction for SplitOp
(2) Small fix in CostInferenceFunction for ConcatOp

Test Plan:
Added unit tests:

buck test //caffe2/caffe2/python/operator_test:split_op_cost_test

buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test

Reviewed By: smacke

Differential Revision: D30247360

fbshipit-source-id: 989e962f3a981acc85b73aac3fb23e603b7d1591
2021-08-13 12:28:15 -07:00
acdad8bc63 [docs] Merge note block in torch.lu documentation (#63156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63156

**Summary**
This commit merges the four successive `Note` blocks that appear in the
documentation for `torch.lu`. Each one only has one line in it, so all
of them have been merged into one block with a bulleted list that
contains the original items.

**Test Plan**
Continuous integration.

*Before*
<img width="888" alt="Captura de Pantalla 2021-08-12 a la(s) 10 48 39 a  m" src="https://user-images.githubusercontent.com/4392003/129244443-b7d1594e-8833-4c20-a911-e1bf7ca88a8d.png">

*After*
<img width="932" alt="Captura de Pantalla 2021-08-12 a la(s) 10 48 46 a  m" src="https://user-images.githubusercontent.com/4392003/129244462-1f39dcdb-90e0-4fd9-a95f-343b0b6be1f1.png">

**Fixes**
This commit fixes #62339.

Test Plan: Imported from OSS

Reviewed By: navahgar, pbelevich

Differential Revision: D30292633

Pulled By: SplitInfinity

fbshipit-source-id: cb9071165629bfe7316b1d2fe952e4354c75d48f
2021-08-13 12:11:35 -07:00
e5c32cdde7 [docs] Remove input parameter from Tensor.flatten docs (#63180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63180

**Summary**
This commit removes the `input` parameter from the signature for
`Tensor.flatten` shown in its documentation. This parameter is accepted
by `torch.flatten` but not `Tensor.flatten` (since the input is the
`Tensor` on which `flatten` is invoked).

**Test Plan**
Continuous integration.

**Fixes**
This commit fixes #57478.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30293156

Pulled By: SplitInfinity

fbshipit-source-id: 4ad70d638af009fb6bdeb703433b306904d39a76
2021-08-13 12:10:16 -07:00
548fe682e2 [docs] Add cross references to torch.transpose and torch.t (#63177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63177

**Summary**
This commit adds a link in the documentation for `torch.transpose` that
directs to `torch.t` and vice versa. These two functions are related and
it is useful for users of one to know about the other.

**Test Plan**
Continuous integration.

**Fixes**
This commit fixes #56267.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30292654

Pulled By: SplitInfinity

fbshipit-source-id: 8e60cd7a598ff8b4756cb30141399dfe8e118338
2021-08-13 11:51:55 -07:00
7107c367b5 [docs] Mention vsplit, hsplit and tensor_split in Tensor views doc (#63191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63191

**Summary**
This commit adds `vsplit`, `hsplit` and `tensor_split` to the list of
view ops on the Tensor Views documentation page.

**Test Plan**
Continuous integration.

*Before*
<img width="195" alt="Captura de Pantalla 2021-08-12 a la(s) 2 55 07 p  m" src="https://user-images.githubusercontent.com/4392003/129275921-c1cfdf6c-9f1f-45f3-98b6-1de7a0f0cc84.png">

*After*
<img width="197" alt="Captura de Pantalla 2021-08-12 a la(s) 2 55 15 p  m" src="https://user-images.githubusercontent.com/4392003/129275936-de4afde7-0143-4e1d-b38f-c86256f4896c.png">

**Fixes**
This commit fixes #62727.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30293181

Pulled By: SplitInfinity

fbshipit-source-id: 283783a4ccc3ebc50cb0a427e55c7a6cb618ffd7
2021-08-13 11:44:38 -07:00
38a825c648 Allow Average Pooling modules to accept tensors with 0-dim batch sizes. (#62025)
Summary:
This PR fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.

It introduces changes and tests that allow the Average Pooling layers to accept tensors with 0-sized batch dimensions and return meaningful results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62025

Reviewed By: VitalyFedyunin

Differential Revision: D30303256

Pulled By: jbschlosser

fbshipit-source-id: 5f727e62a7c58d2b8bb49fcc3bd7688474917ba5
2021-08-13 11:31:17 -07:00
de7ae9e9b6 [skip ci] fix workflow code generation (#63235)
Summary:
Fixes a failing clean-git check on generated workflow code, introduced by https://github.com/pytorch/pytorch/pull/63148

`generated-win-vs2019-cuda10-py3.yml` was renamed as `generated-win-vs2019-cuda10.1-py3.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63235

Reviewed By: VitalyFedyunin

Differential Revision: D30306474

Pulled By: zhouzhuojie

fbshipit-source-id: cbae1ace064e360e8ca0c0e997116bdb20d54d46
2021-08-13 10:38:30 -07:00
000e3a0881 [Static Runtime] Add pass to eliminate __getitem__/DictConstruct calls (#62429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62429

Introduce a new pass to eliminate calls to `prim::DictConstruct/aten::__getitem__`. Given a graph like this:
```
%2 : Dict = prim::DictConstruct(%key, %value)
%3 : Tensor = aten::__getitem__(%2, %key)
%4 : Tensor = op(%3)
```
This pass produces a graph like this (after dead code elimination):
```
%4 : Tensor = op(%value)
```

This optimization is applied in the static runtime.

Test Plan:
`buck test //caffe2/test:jit -- TestPeephole`

**local.forward performance summary**
About 3% runtime benefit. All `DictConstruct` calls optimized out, `__getitem__` calls reduced significantly (~50% of them are cut out)
P438354810

**local_request_only.forward performance summary**
About 14% runtime benefit. Again, all `DictConstruct` calls optimized out, 50% `__getitem__` calls removed.
P438359742

There is some variance with runtime measurements, so take these numbers with a grain of salt. Also note that the benefit does not exist in the shrunk model since there are no `DictConstruct` calls

Reviewed By: hlu1

Differential Revision: D29995087

fbshipit-source-id: f376376a46ff808115afd2d60446e5db8f6f752f
2021-08-13 10:21:16 -07:00
fcc1f87b6a Fixing user inputs for low, high in make_tensor (#61108)
Summary:
**TODOs:**

* [x] Do not clamp inputs for low and high when given and valid.
* [x] Devise rules for modifying `low` and `high` when extremals/invalid values passed.
* [x] Testing with `test_references_numerics_hard` with the revised changes. _(I've tested locally, the changes will take place in a separate PR though after offline discussion with mruberry)_
* [x] Revise comments/documentation for `make_tensor`

See https://github.com/pytorch/pytorch/issues/61758 for tracker issue.

cc: mruberry pmeier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61108

Reviewed By: VitalyFedyunin

Differential Revision: D30296167

Pulled By: mruberry

fbshipit-source-id: 67e8d15b173209a9c97ca013231494a5fa99f8c7
2021-08-13 10:13:12 -07:00
720a7a0d81 [hackathon] fix benchmarking script in CONTRIBUTING (#63199)
Summary:
[skip ci]
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63199

Reviewed By: mruberry

Differential Revision: D30305487

Pulled By: ngimel

fbshipit-source-id: 2704c4f08ab976a55c9f8c2fe54cd4f3f39412cf
2021-08-13 09:50:48 -07:00
bd9fad25c2 [codemod][lint][caffe2] Extend BLACK coverage
Test Plan: Sandcastle

Reviewed By: zsol

Differential Revision: D30302716

fbshipit-source-id: f9724d4f4d1b8950f581cc2c6c77eedf19b4b6fc
2021-08-13 09:28:10 -07:00
c5f3ab6982 ENH Adds no_batch_dim to FractionalMaxPool2d (#62490)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62490

Reviewed By: bdhirsh

Differential Revision: D30287143

Pulled By: jbschlosser

fbshipit-source-id: 1b9dd932157f571adf3aa2c98c3c6b56ece8fa6e
2021-08-13 08:48:40 -07:00
61b49c8e41 [JIT] Add a flag to rethrow caught exception in jit interpreter (#63073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63073

It turned out to be less than ideal to print a verbose stacktrace in exception messages in high-QPS services (see the related task) with a non-negligible failure rate, because the truncation of a long stacktrace results in losing the original exception message thrown from native code. It is actually desirable to retain only the message of the original exception directly thrown from native code in such a use case.

This change adds a new flag `torch_jit_disable_exception_stacktrace` to the pytorch jit interpreter to suppress stacktrace in the messages of exception thrown from the interpreter.

Reviewed By: Krovatkin

Differential Revision: D30241792

fbshipit-source-id: c340225c69286663cbd857bd31ba6f1736b1ac4c
2021-08-13 08:44:24 -07:00
32b6104f37 Port norm kernel to structured kernels. (#62711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62711

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D30109866

Pulled By: ezyang

fbshipit-source-id: 894c9496894d059c7690a174b75bbd4db7ed6016
2021-08-13 08:27:48 -07:00
07bb6e4fd0 Port prod kernel to structured kernels. (#62024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62024

Tracking issue: #55070

In this PR, I also broke down the meta functions of other reduction kernels (e.g. `all`,
`argmax`, `sum`) into the composition of common patterns.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29847122

Pulled By: ezyang

fbshipit-source-id: a6680a6cf6e59bb46b8ffe7bf2a3a611d6e0fd14
2021-08-13 08:27:46 -07:00
1280363bad Port mean kernel to structured kernels. (#61643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61643

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29783866

Pulled By: ezyang

fbshipit-source-id: dc95baf593096c03fb5f292ee6c36de3cc7f2b35
2021-08-13 08:26:01 -07:00
2d75703c6a Remove req to call step() in training loop (#63164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284616

Pulled By: andwgu

fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1
2021-08-13 08:22:44 -07:00
28f9e108b1 Pass _allow_empty_param_list into func opt ctor (#63163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63163

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284615

Pulled By: andwgu

fbshipit-source-id: 4857f5b618ec5b007648737ab532ce605e5d70dc
2021-08-13 08:22:42 -07:00
bd81c9178a Simplify data structures, add uniform approximation, fix mem leak (#63162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284617

Pulled By: andwgu

fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f
2021-08-13 08:20:59 -07:00
75f198d48d [docs][ao] update quantize_per_tensor to mention overloads (#63165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63165

Add details about the overloads for
* list of tensors input
* supporting tensor scale/zero-point inputs

Test Plan:
CI

Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30291045

fbshipit-source-id: 9fc6418792c5e3a35417eeb8d31de4a4bfcbb7a5
2021-08-13 08:00:10 -07:00
5abeac3ef7 Make saved tensors default hooks thread local (#62909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62909

This PR makes saved tensors default hooks thread local.
This allows using default hooks in a multithreaded context.
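
A minimal sketch of what thread-local defaults enable, using the `torch.autograd.graph.saved_tensors_hooks` context manager from the sister PRs referenced in nearby commits (which installs the default hooks under the hood):

```python
import threading
import torch

def worker():
    # The default hooks are now thread-local: installing them here no longer
    # affects autograd running concurrently in other threads.
    with torch.autograd.graph.saved_tensors_hooks(lambda t: t, lambda t: t):
        x = torch.randn(4, requires_grad=True)
        (x * x).sum().backward()

threads = [threading.Thread(target=worker) for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```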

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30165416

Pulled By: Varal7

fbshipit-source-id: 10a7d580661d3d94bdaf398c4e076b7bea11c16b
2021-08-13 07:49:20 -07:00
cb23976f9f Allow 0-dim batch sizes for AdaptiveMaxPool and MaxPool. (#62088)
Summary:
This PR fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.

This PR allows `MaxPool` and `AdaptiveMaxPool` to accept tensors whose batch size is 0. Some changes have been made to modernize the tests so that they show the name of the C++ function that throws an error.
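
A quick sketch of the newly accepted inputs (shapes chosen arbitrarily):

```python
import torch
import torch.nn as nn

x = torch.randn(0, 3, 8, 8)  # batch size 0 is now accepted
print(nn.MaxPool2d(kernel_size=2)(x).shape)    # torch.Size([0, 3, 4, 4])
print(nn.AdaptiveMaxPool2d((2, 2))(x).shape)   # torch.Size([0, 3, 2, 2])
```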

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62088

Reviewed By: bdhirsh

Differential Revision: D30281285

Pulled By: jbschlosser

fbshipit-source-id: 52bffc67bfe45a78e11e4706b62cce1469eba1b9
2021-08-13 07:33:17 -07:00
72bc6dc8c3 DOC Improve documentation for LayerNorm (#63144)
Summary:
In this [commit](7026995f3c) and [issue](https://github.com/pytorch/pytorch/pull/59178#issuecomment-897485295), [line 134](47e286d024/torch/nn/modules/normalization.py (L134)) overwrites the "embedding" variable, which causes an error when constructing `nn.LayerNorm` in the example.

I suggest renaming "embedding" on [line 133](47e286d024/torch/nn/modules/normalization.py (L133)) to "embedding_dim".

The final example is:
```
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
```

Fixes #59178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63144

Reviewed By: bdhirsh

Differential Revision: D30288778

Pulled By: jbschlosser

fbshipit-source-id: e74b11430e302dae5661bf6e830ee5ac6c1838c4
2021-08-13 07:04:40 -07:00
aa665e1ab8 Revert D30090760: [iOS] Add podspec for libTorch-lite nightly build
Test Plan: revert-hammer

Differential Revision:
D30090760 (e182b459d9)

Original commit changeset: 361aa2ed24a1

fbshipit-source-id: 9c0dfee80a80eb012b142d3928204d6eb8025b0a
2021-08-13 06:45:43 -07:00
dcb5eb8d9b OpInfo for torch.nn.functional.normalize (#62635)
Summary:
See https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261

cc: mruberry zou3519 Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62635

Reviewed By: H-Huang

Differential Revision: D30136503

Pulled By: zou3519

fbshipit-source-id: 258c069f30d9c2a51ed27dadf94f3703b9432a4a
2021-08-13 06:36:50 -07:00
741accb11e Implements backward for torch.lu_solve (#61681)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61681

Reviewed By: ngimel

Differential Revision: D30063116

Pulled By: mruberry

fbshipit-source-id: e095b0cadfb7c8b37a7ef91bae5b5dc170d8ef1c
2021-08-12 21:17:11 -07:00
126ff6222e Moving getattr_from_fqn to torch.quantization.utils (#63107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63107

moving this function because the functionality would be useful outside of ns
ghstack-source-id: 135727260

Test Plan: buck test //caffe2/test:quantization_fx mode/dev-nosan --keep-going --config client.id=nuclide --show-full-output -- suite

Reviewed By: supriyar

Differential Revision: D30260735

fbshipit-source-id: 58deabdd0f3b03b0ee7ee92be0548a0945084d65
2021-08-12 20:59:01 -07:00
07b00fc324 ENH Migrate nll_loss2d from THC to ATen (#62826)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24608
Fixes https://github.com/pytorch/pytorch/issues/24607

With the following benchmark, the backward pass runs a little slower. This is strange since the implementation should be exactly the same.

<details>
 <summary>Benchmark script</summary>

```python
from itertools import product

import torch
import torch.nn as nn
import torch.nn.functional as F
import time

torch.manual_seed(0)
MS_PER_SECOND = 1000

def _time():
    torch.cuda.synchronize()
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 3
n_runs = 30
reductions = ["none", "sum", "mean"]
Ns = [128, 256, 512]
Hs = [128, 256, 512]

for reduction, N, H in product(reductions, Ns, Hs):
    total_fwd_time = 0
    total_back_time = 0
    if reduction == "none":
        grad_out = torch.randn(N, H, H, device=device)
    else:
        grad_out = torch.randn(1)[0]

    for _ in range(n_runs):
        input = torch.randn(N, C, H, H, device=device, requires_grad=True)
        target = torch.rand(N, H, H, device=device).mul(3).floor().long()

        # forward
        start = _time()
        result = F.nll_loss(input, target, reduction=reduction)
        total_fwd_time += _time() - start

    result = F.nll_loss(input, target, reduction=reduction)
    for _ in range(n_runs):
        # backward
        start = _time()
        result.backward(grad_out, retain_graph=True)
        total_back_time += _time() - start

    fwd_avg = total_fwd_time / n_runs
    bwd_avg = total_back_time / n_runs
    print(
        f"input size({N}, {C}, {H}, {H}), reduction: {reduction}, fwd: {fwd_avg:.2f} (ms), back: {bwd_avg:.2f} (ms)"
    )

```

</details>

<details>
 <summary>master results</summary>

```
input size(128, 3, 128, 128), reduction: none, fwd: 0.34 (ms), back: 0.57 (ms)
input size(128, 3, 256, 256), reduction: none, fwd: 2.56 (ms), back: 3.85 (ms)
input size(128, 3, 512, 512), reduction: none, fwd: 14.54 (ms), back: 16.62 (ms)
input size(256, 3, 128, 128), reduction: none, fwd: 1.26 (ms), back: 1.78 (ms)
input size(256, 3, 256, 256), reduction: none, fwd: 7.07 (ms), back: 8.22 (ms)
input size(256, 3, 512, 512), reduction: none, fwd: 29.38 (ms), back: 33.29 (ms)
input size(512, 3, 128, 128), reduction: none, fwd: 3.41 (ms), back: 4.05 (ms)
input size(512, 3, 256, 256), reduction: none, fwd: 14.32 (ms), back: 16.46 (ms)
input size(512, 3, 512, 512), reduction: none, fwd: 59.20 (ms), back: 66.68 (ms)
input size(128, 3, 128, 128), reduction: sum, fwd: 0.08 (ms), back: 0.21 (ms)
input size(128, 3, 256, 256), reduction: sum, fwd: 0.21 (ms), back: 0.73 (ms)
input size(128, 3, 512, 512), reduction: sum, fwd: 0.82 (ms), back: 2.86 (ms)
input size(256, 3, 128, 128), reduction: sum, fwd: 0.12 (ms), back: 0.39 (ms)
input size(256, 3, 256, 256), reduction: sum, fwd: 0.42 (ms), back: 1.45 (ms)
input size(256, 3, 512, 512), reduction: sum, fwd: 1.53 (ms), back: 5.66 (ms)
input size(512, 3, 128, 128), reduction: sum, fwd: 0.21 (ms), back: 0.74 (ms)
input size(512, 3, 256, 256), reduction: sum, fwd: 0.78 (ms), back: 2.86 (ms)
input size(512, 3, 512, 512), reduction: sum, fwd: 2.98 (ms), back: 11.23 (ms)
input size(128, 3, 128, 128), reduction: mean, fwd: 0.07 (ms), back: 0.21 (ms)
input size(128, 3, 256, 256), reduction: mean, fwd: 0.21 (ms), back: 0.73 (ms)
input size(128, 3, 512, 512), reduction: mean, fwd: 0.82 (ms), back: 2.86 (ms)
input size(256, 3, 128, 128), reduction: mean, fwd: 0.13 (ms), back: 0.39 (ms)
input size(256, 3, 256, 256), reduction: mean, fwd: 0.42 (ms), back: 1.45 (ms)
input size(256, 3, 512, 512), reduction: mean, fwd: 1.54 (ms), back: 5.65 (ms)
input size(512, 3, 128, 128), reduction: mean, fwd: 0.22 (ms), back: 0.74 (ms)
input size(512, 3, 256, 256), reduction: mean, fwd: 0.78 (ms), back: 2.87 (ms)
input size(512, 3, 512, 512), reduction: mean, fwd: 2.98 (ms), back: 11.23 (ms)
```

</details>

<details>
 <summary>PR results</summary>

```
input size(128, 3, 128, 128), reduction: none, fwd: 0.33 (ms), back: 0.59 (ms)
input size(128, 3, 256, 256), reduction: none, fwd: 2.51 (ms), back: 3.92 (ms)
input size(128, 3, 512, 512), reduction: none, fwd: 14.52 (ms), back: 17.05 (ms)
input size(256, 3, 128, 128), reduction: none, fwd: 1.23 (ms), back: 1.85 (ms)
input size(256, 3, 256, 256), reduction: none, fwd: 7.07 (ms), back: 8.45 (ms)
input size(256, 3, 512, 512), reduction: none, fwd: 29.39 (ms), back: 34.21 (ms)
input size(512, 3, 128, 128), reduction: none, fwd: 3.40 (ms), back: 4.18 (ms)
input size(512, 3, 256, 256), reduction: none, fwd: 14.33 (ms), back: 16.90 (ms)
input size(512, 3, 512, 512), reduction: none, fwd: 59.04 (ms), back: 68.36 (ms)
input size(128, 3, 128, 128), reduction: sum, fwd: 0.07 (ms), back: 0.25 (ms)
input size(128, 3, 256, 256), reduction: sum, fwd: 0.21 (ms), back: 0.86 (ms)
input size(128, 3, 512, 512), reduction: sum, fwd: 0.82 (ms), back: 3.33 (ms)
input size(256, 3, 128, 128), reduction: sum, fwd: 0.12 (ms), back: 0.46 (ms)
input size(256, 3, 256, 256), reduction: sum, fwd: 0.42 (ms), back: 1.70 (ms)
input size(256, 3, 512, 512), reduction: sum, fwd: 1.53 (ms), back: 6.58 (ms)
input size(512, 3, 128, 128), reduction: sum, fwd: 0.21 (ms), back: 0.87 (ms)
input size(512, 3, 256, 256), reduction: sum, fwd: 0.78 (ms), back: 3.34 (ms)
input size(512, 3, 512, 512), reduction: sum, fwd: 2.98 (ms), back: 13.07 (ms)
input size(128, 3, 128, 128), reduction: mean, fwd: 0.07 (ms), back: 0.26 (ms)
input size(128, 3, 256, 256), reduction: mean, fwd: 0.21 (ms), back: 0.86 (ms)
input size(128, 3, 512, 512), reduction: mean, fwd: 0.82 (ms), back: 3.34 (ms)
input size(256, 3, 128, 128), reduction: mean, fwd: 0.12 (ms), back: 0.46 (ms)
input size(256, 3, 256, 256), reduction: mean, fwd: 0.42 (ms), back: 1.72 (ms)
input size(256, 3, 512, 512), reduction: mean, fwd: 1.53 (ms), back: 6.60 (ms)
input size(512, 3, 128, 128), reduction: mean, fwd: 0.21 (ms), back: 0.87 (ms)
input size(512, 3, 256, 256), reduction: mean, fwd: 0.78 (ms), back: 3.33 (ms)
input size(512, 3, 512, 512), reduction: mean, fwd: 2.98 (ms), back: 13.07 (ms)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62826

Reviewed By: bdhirsh

Differential Revision: D30282279

Pulled By: ngimel

fbshipit-source-id: 4aa0ff3f8af0632957417931d332ec486a12b52d
2021-08-12 18:07:15 -07:00
219ba6575b add autowrap_functions kwarg to fx.Tracer (#62106)
Summary:
Implements feature request https://github.com/pytorch/pytorch/issues/62021

Test it out with

```python
from torch import fx
from torch import nn

def fx_int(x):
    return int(x)

class MyModule(nn.Module):
    def forward(self, x):
        return fx_int(x.shape[0] / 2)

tracer = fx.Tracer(autowrap_functions=(fx_int,))  # or remove kwarg to demonstrate symbolic trace error
tracer.trace(MyModule())
```

First time contributor, so please advise if I could have done anything to make lives easier for next time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62106

Reviewed By: SplitInfinity, driazati

Differential Revision: D30080834

Pulled By: jamesr66a

fbshipit-source-id: 68fadf8c881ea7930e7afd62b642874010fe4903
2021-08-12 17:38:25 -07:00
7a1ab9f5d7 [fx] store Tracer class on Graph and GraphModule for package deserialization [v2, the re-do] (#63121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63121

Re-introducing this diff with a small change to ignore setting Tracer classes on GraphModules when the Tracer class is defined not at module-level (prevents pickling).

Previous, reverted Pull Request: https://github.com/pytorch/pytorch/pull/62497

Reviewed By: houseroad

Differential Revision: D30252776

fbshipit-source-id: 42d2bc846e4b32d00563419c38c02b63cd0986e6
2021-08-12 17:28:50 -07:00
988ef190e3 Show warning in eager mode for empty containers (#62978)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62978

Reviewed By: navahgar

Differential Revision: D30278343

Pulled By: ansley

fbshipit-source-id: ebb19f7b8a10720f2612b99a2668d1ebbc1f2d16
2021-08-12 16:11:27 -07:00
e182b459d9 [iOS] Add podspec for libTorch-lite nightly build (#62691)
Summary:
The nightly pod version will be aliased with the [PyTorch nightly build version](https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_populate_env.sh#L88) and follows the [CocoaPods version specification](https://guides.cocoapods.org/using/the-podfile.html#specifying-pod-versions); the version format of the podspec is `PyTorch version + nightly build date`, e.g. `1.10.0.20210812`.

Usage:
1. Add `pod 'LibTorch-Lite-Nightly'` to `Podfile`
2. Run `pod install` to install the nightly built lib
3. Run `pod update` to update the lib to the latest version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62691

Test Plan:
* Test on [TestApp](https://github.com/pytorch/pytorch/tree/master/ios/TestApp) and [HelloWorld](https://github.com/pytorch/ios-demo-app):
Podfile: `pod 'LibTorch-Lite-Nightly'`

* Test on Private Pod:
{F642106928}

Reviewed By: xta0

Differential Revision: D30090760

Pulled By: hanton

fbshipit-source-id: 361aa2ed24a11d6aced8374cb45f70f49bd5da52
2021-08-12 15:35:14 -07:00
0b89e69e7c [BE] delete GHA generated workflow files before regen (#63148)
Summary:
Unlike CircleCI, where all workflows go into one file, legacy generated GHA workflow files stay silently in one's PR (e.g. when we change a build_environment name), which is not ideal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63148

Reviewed By: bdhirsh

Differential Revision: D30283382

Pulled By: walterddr

fbshipit-source-id: ffdd5bf9561dd38499052855a12ee5cf838a20b0
2021-08-12 14:43:00 -07:00
ba25527ffc [iOS][GPU] Fix the clamp shader function for x86_64 (#63062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63062

Previously, due to the need to support iOS 10.0, we used an fp16 version of the clamp kernel on Metal, which didn't work well on x86_64. Since we don't need to support 10.0 anymore, we can use the fp32 version, which works on both arm64 and x86_64.
ghstack-source-id: 135536785

Test Plan:
- `buck test pp-macos`
- Op tests in the playground app

{F641013793}

Reviewed By: husthyc

Differential Revision: D30239931

fbshipit-source-id: 6ad1bf71422b537e052fbd7b7465ba8deb7ca0cf
2021-08-12 13:20:27 -07:00
ed7ece389d Forbid inplace modification of a saved tensor's pack_hook input (#62717)
Summary:
When using saved tensors hooks (especially default hooks),
if the user defines a `pack_hook` that modifies its input,
it can cause some surprising behavior.

The goal of this PR is to prevent future user headache by catching
inplace modifications of the input of `pack_hook` and raising an error if
applicable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62717

Reviewed By: albanD

Differential Revision: D30255243

Pulled By: Varal7

fbshipit-source-id: 8d73f1e1b50b697a59a2849b5e21cf0aa7493b76
2021-08-12 12:40:10 -07:00
aa5141f204 Update CONTRIBUTING.md to remove ProcessGroupAgent (#63160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63160

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284439

Pulled By: H-Huang

fbshipit-source-id: 53c31b6917ef5e2125e146fb0ed73ae3d76a8cf9
2021-08-12 12:26:12 -07:00
96fb1a56ea add use_strict_trace to tensorboard add_graph method (#63120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63120

FAIM returns dictionaries as the model output, which causes an error when trying to trace using add_graph. Pass `strict` through to the tracer to make this user-configurable.
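
A minimal sketch (the model and input are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

class DictOutputModel(nn.Module):
    def forward(self, x):
        return {"out": x * 2}  # dict outputs fail under strict tracing

writer = SummaryWriter()
# use_strict_trace=False is forwarded to the tracer's `strict` flag.
writer.add_graph(DictOutputModel(), torch.randn(1, 3), use_strict_trace=False)
writer.close()
```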

User post: https://fb.workplace.com/groups/pytorchLightning/permalink/1510194972650369/?comment_id=1510252919311241&reply_comment_id=1510281112641755

Test Plan: unit test

Reviewed By: Reubend

Differential Revision: D30265890

fbshipit-source-id: 58b25d9500b875a29a664aa9ef4c1e7f13631fa1
2021-08-12 12:12:12 -07:00
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
ed0b8a3e83 LayerNorm Support in autodiff: (#50467)
Summary:
1. extend autodiff by adding an entry for layer_norm in the symbolic script; we now use native_layer_norm_backward
2. added a backward function `layernorm_double_backward` for `native_layer_norm_backward`, preserving double-backward support for LayerNorm in autodiff/ScriptModule
3. added a Python test to verify autodiff on layer_norm with various configurations of optional tensors (verifies the fix in https://github.com/pytorch/pytorch/issues/49430)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50467

Reviewed By: eellison

Differential Revision: D30232864

Pulled By: jansel

fbshipit-source-id: b9c33075386aff96afff7415df9f94388bfb474a

Co-authored-by: Ryan Spring <rspring@nvidia.com>
Co-authored-by: Jie <jiej@nvidia.com>
2021-08-12 11:05:53 -07:00
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
aac3c7bd06 [reland] OpInfo: adaptive_avg_pool2d (#62935)
Summary:
This PR is an attempt to reland https://github.com/pytorch/pytorch/pull/62704.

**What has changed?**

The op has non-deterministic behavior, hence an appropriate `gradcheck` wrapper had to be added.

cc: mruberry zou3519 heitorschueroff kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62935

Reviewed By: anjali411

Differential Revision: D30225095

Pulled By: zou3519

fbshipit-source-id: 644873cc21d44b19c8b68f9edff691913778de0e
2021-08-12 09:46:38 -07:00
daba551922 [BE] shorten CI name part2 (#63030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62357
there's no need to specify the cuDNN version since it is already implied by the CUDA version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63030

Reviewed By: zhouzhuojie, driazati

Differential Revision: D30226354

Pulled By: walterddr

fbshipit-source-id: 7e2dc577810e0ce80ee27569c25a814566250ab1
2021-08-12 08:14:22 -07:00
eea52b7d47 Skip zero test on windows (#63087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63087

The test failed unexpectedly on Windows; see
https://github.com/pytorch/pytorch/issues/63086. Skip it for now while we
investigate.
ghstack-source-id: 135631811

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D30251300

fbshipit-source-id: 8acb1ea8863c654c171fe989ac24446c321c085d
2021-08-12 00:38:42 -07:00
4d7a12f68b BatchNorm: Use resize_output and empty, instead of empty_like (#63084)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62967

This lets each of the three implementations choose which memory format
to use for the output, meaning channels_last can be used in more cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63084

Reviewed By: saketh-are

Differential Revision: D30255740

Pulled By: ngimel

fbshipit-source-id: 48d42850952ec910b29521a1c4e530eb2b29df5e
2021-08-11 23:47:24 -07:00
d5a7579597 [quant] Make version 1 the default for get_default_qat_qconfig (#63043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63043

In version 1 we use the fused module/operator during QAT. Making this the default for all QAT runs going forward.

Older models saved after prepare_qat_fx can still load their state_dict into a model prepared using version 1.
The state_dict will still have the same attribute for the observer/fake_quant modules.

There may be some numerics difference between the old observer code in observer.py and the new fused module that was
re-written in C++/CUDA to perform observe + fake_quantize.

This PR also updates the test to check for the new module instead of the default FakeQuantize module.
Note: there are also some changes to make the operator work for multi-dim per-channel quantization + updated the test for that.
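
A short sketch; the explicit `version=0` fallback is an assumption based on the versioning described above:

```python
import torch.quantization as tq

qconfig = tq.get_default_qat_qconfig("fbgemm")            # now version 1: fused observer + fake-quant
legacy = tq.get_default_qat_qconfig("fbgemm", version=0)  # assumed: pre-fusion modules
```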

Test Plan:
python test/test_quantization.py TestSerialization.test_default_qat_qconfig

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D30232222

fbshipit-source-id: f3553a1926ab7c663bbeed6d574e30a7e90dfb5b
2021-08-11 22:06:44 -07:00
91525d42d9 Fix sharded tensor tests. (#63054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63054

1) Ensure these tests are skipped in environments without any GPUs.
2) Add the test to run_test.py
ghstack-source-id: 135595698

Test Plan: waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D30239159

fbshipit-source-id: 21b543ba72e8d10182bc77e7ae1fd34fd4096509
2021-08-11 21:46:45 -07:00
bf7d03ff1f Port log_softmax_backward_data to structured kernel (#62372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62372

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30240242

Pulled By: SplitInfinity

fbshipit-source-id: 67d5e4b1543c2e43675e905ce18ca49c11e33748
2021-08-11 21:03:59 -07:00
ba603594fd Port log_softmax to structured kernel (#57374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57374

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D30240243

Pulled By: SplitInfinity

fbshipit-source-id: de6617c75d16e26d607a884c25b8752b7b561737
2021-08-11 21:02:48 -07:00
d2eda7f2f3 Add ciflow_ruleset.json generator along with gha ci (#63097)
Summary:
- Add `.github/generated-ciflow-ruleset.json` for ciflow-bot (so that we can generate better comments)
- The lint job also checks git dirty to make sure that the file is always in sync with ciflow configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63097

Reviewed By: saketh-are

Differential Revision: D30263278

Pulled By: zhouzhuojie

fbshipit-source-id: bad68105a228e892ba071b29ecfdf433e1038054
2021-08-11 17:14:40 -07:00
04caef8e1d Improve IMethod::getArgumentNames to deal with empty argument names list (#62947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62947

This diff improved IMethod::getArgumentNames to deal with empty argument names list.

Test Plan:
buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesValidationMode
buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesRealMode

Reviewed By: wconstab

Differential Revision: D30179974

fbshipit-source-id: c7aec35c360a73318867c5b77ebfec3affee47e3
2021-08-11 16:44:00 -07:00
5cf32c1d09 Fix Nnapi backend execute's dangling pointer (#63092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63092

Bug discovered while testing NNAPI Delegate on SparkAR.
Using
```
c10::IntArrayRef order = {0, 2, 3, 1};
fixed_inputs.push_back(tensorInp.get(i).permute(order).contiguous());
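// Fix (per this commit): pass the list straight to permute() so no dangling
// IntArrayRef view of the temporary initializer list is created:
//   fixed_inputs.push_back(tensorInp.get(i).permute({0, 2, 3, 1}).contiguous());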
```
results in a garbage value for `order` in `permute()`.
Moving `order` inside the call to `permute()` fixes the issue. The problem is seemingly related to https://github.com/pytorch/pytorch/issues/44409, but luckily the solution in this case is simple.

Bug wasn't caught earlier, since regular unit tests weren't affected by the dangling pointer, and address sanitizer NNAPI tests are turned off due to there being a different failure (T95764916).
ghstack-source-id: 135526129

Test Plan:
Run Unit tests: `python test/test_jit.py`

Build and run SparkAR on an Android phone at the top of this diff stack (D30173959): `buck build --show-output arstudioplayer_arm64_debug -c pt.enable_nnapi=1`

Reviewed By: raziel, iseeyuan

Differential Revision: D30237504

fbshipit-source-id: c946d81feefc453b43d9295d8d6f509cafdcec03
2021-08-11 14:26:48 -07:00
709ac6853a Fix warnings (#62930)
Summary:
Add `-Wno-writable-strings` (clang's flavor of `-Wwrite-strings`) to the list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loops.
Fix a number of signed-unsigned comparisons.

Found while building locally on M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930

Reviewed By: albanD

Differential Revision: D30171981

Pulled By: malfet

fbshipit-source-id: 25bd43dab5675f927ca707e32737ed178b04651e
2021-08-11 14:07:10 -07:00
855e8f2b17 [iOS][GPU] Consolidate array and non-array kernel for upsampling_nearest2d (#63061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63061

Cleanup the redundant shader code for the upsampling nearest kernel.
ghstack-source-id: 135524349

Test Plan:
- `buck test pp-macos`
- Op tests in PyTorchPlayground app

Reviewed By: husthyc

Differential Revision: D30236905

fbshipit-source-id: e1e001b446452b077e6db719b0519c9070f3300b
2021-08-11 13:29:39 -07:00
456364729e irange-ify 13b (#62476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62476

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D30001445

fbshipit-source-id: 6f4525338c80e9f929695f47f36ca9c72d96a75d
2021-08-11 13:13:44 -07:00
31c1983603 Add BFloat16 support for unique and unique_consecutive on CPU (#62559)
Summary:
Add BFloat16 support for unique and unique_consecutive on CPU.
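
A minimal sketch of the newly supported dtype:

```python
import torch

x = torch.tensor([3, 1, 1, 2], dtype=torch.bfloat16)
print(torch.unique(x))              # tensor([1., 2., 3.], dtype=torch.bfloat16)
print(torch.unique_consecutive(x))  # tensor([3., 1., 2.], dtype=torch.bfloat16)
```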

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62559

Reviewed By: saketh-are

Differential Revision: D30250675

Pulled By: ngimel

fbshipit-source-id: 26e48f971d87f3b86db237e8ad3a4b74eb3c1def
2021-08-11 12:54:46 -07:00
51a67d3168 Add Github action to upload full source releases (#63022)
Summary:
These release tarballs include the submodules.
The action runs on every tag and master-branch push but does not upload anything in those cases.
This makes sure nothing is broken when an actual release happens.

On created releases the action runs and uploads the tarball

Fixes https://github.com/pytorch/pytorch/issues/62708

As I don't have access rights here and testing is obviously hard (as a new release needs to be published), I set up a test at https://github.com/Flamefire/pytorch/releases/tag/testtag
See also the run(s) at https://github.com/Flamefire/pytorch/actions/workflows/create_release.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63022

Reviewed By: saketh-are

Differential Revision: D30256253

Pulled By: seemethere

fbshipit-source-id: ab5fe131452de14ae3768b91c221e68c536cb3aa
2021-08-11 12:47:17 -07:00
821c1edea9 Embedding thrust->cub: unique (#63042)
Summary:
Followup of https://github.com/pytorch/pytorch/pull/62495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63042

Reviewed By: saketh-are

Differential Revision: D30231084

Pulled By: ngimel

fbshipit-source-id: 03b0a88107e8a2aee3570881d81bf2b676f525cd
2021-08-11 12:40:36 -07:00
fa22f6303f [PyTorch] Add flop count for addmm (#61895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61895

* Add FLOP count for addmm, should be `2*m*n*k`.

Share the same code path for `addmm` and `mm`.
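
A hedged sketch of how the count surfaces through the profiler (exact table columns vary by version):

```python
import torch
from torch.profiler import profile

a, b = torch.randn(8, 16), torch.randn(16, 32)
c = torch.randn(8, 32)
with profile(with_flops=True) as prof:
    torch.addmm(c, a, b)  # 2*m*n*k = 2*8*32*16 = 8192 FLOPs
print(prof.key_averages().table())
```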

Test Plan:
Imported from OSS

`python test/test_profiler.py`
Run a sample profile and check that FLOPS for `aten::addmm` is correct.

`[chowar@devbig053.frc2 ~/local/pytorch/build] ninja bin/test_jit`
`[chowar@devbig053.frc2 ~/local/pytorch/build] ./bin/test_jit --gtest_filter='ComputeFlopsTest*'`

Reviewed By: dskhudia

Differential Revision: D29785671

fbshipit-source-id: d1512036202d7234a981bda897af1f75808ccbfe
2021-08-11 12:33:43 -07:00
fb4ba9e664 XNNPack Input Pointer Caching Comment (#62818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62818

Added a comment to explain why we no longer need to manually cache pointers/parameters for convolution, as removed in D29777605 (f5c6c3947e)

Test Plan: Sandcastle tests (no code changed)

Reviewed By: kimishpatel

Differential Revision: D30113489

fbshipit-source-id: d697f05816acbd367d59a4aced1925303c683d40
2021-08-11 11:55:42 -07:00
82123758ba _convert_coo_to_csr CPP and CUDA functionality (#61838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57381 and improves https://github.com/pytorch/pytorch/pull/61340 via dedicated `coo_to_csr` functionalities.
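
A hedged sketch; the public `to_sparse_csr` conversion shown here is an assumption about how the dedicated `coo_to_csr` kernels are surfaced:

```python
import torch

dense = torch.tensor([[0.0, 1.0], [2.0, 0.0]])
coo = dense.to_sparse()
csr = coo.to_sparse_csr()  # assumed public entry point over the new kernels
print(csr.crow_indices(), csr.col_indices(), csr.values())
```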

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61838

Reviewed By: ezyang

Differential Revision: D30132736

Pulled By: cpuhrsch

fbshipit-source-id: a1fd074c0d70366a524d219a620b94f8bed71d7c
2021-08-11 11:37:20 -07:00
b8e6144e0a Add a _RemoteDevice structure for ShardedTensor/ShardingSpec. (#62927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927

As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.

Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.

Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
ghstack-source-id: 135534086

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30170689

fbshipit-source-id: 1ac2e81c7a597dc40bf3fbf2c1168c382c66649f
2021-08-11 11:27:32 -07:00
b746fed164 [Pytorch Edge] Move RuntimeCompatibilityInfo Factory Method (#63005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63005

Realized I forgot to move the Runtime half of these functions to be within the struct.

Test Plan: ci

Reviewed By: pavithranrao

Differential Revision: D30205521

fbshipit-source-id: ccd87d7d78450dd0dd23ba493bbb9d87be4640a5
2021-08-11 11:15:57 -07:00
3d3ad0a52f [easy] add an inplace argument to MutableNetProto.to_net() and core.Net() constructor (#63068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63068

The caffe2 core.Net constructor can accept a caffe2_pb2.NetDef proto, but it always creates a copy. This is wasteful when we can prove that the proto being passed to it will not be used anywhere else. So we add an "inplace" argument to the `core.Net` constructor that allows clients to give away ownership of the passed proto without copying. We default this argument to `False`, ensuring that behavior does not change unless explicitly requested.
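
A minimal sketch of the new flag as described above:

```python
from caffe2.proto import caffe2_pb2
from caffe2.python import core

proto = caffe2_pb2.NetDef()
proto.name = "example_net"

copied = core.Net(proto)               # default: the proto is deep-copied
owned = core.Net(proto, inplace=True)  # caller gives up ownership; no copy
```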

Test Plan: Let CI run.

Differential Revision: D29976510

fbshipit-source-id: 26e13ca76f3431b8ef0de51f08bbf263491d323e
2021-08-11 11:10:52 -07:00
c090ae291e Fix gha render-test-result mixed failure passthrough (#63056)
Summary:
To fix something like https://github.com/pytorch/pytorch/actions/runs/1114555082

![image](https://user-images.githubusercontent.com/658840/128956528-86997457-5e18-4ae1-83cc-aa7d0ca03c0e.png)

Not sure why `needs.test.result` doesn't capture the `failure` case before, so changed it to `needs.test.result != 'skipped' || failure()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63056

Reviewed By: walterddr, tktrungna

Differential Revision: D30240112

Pulled By: zhouzhuojie

fbshipit-source-id: d159cc3f79ed5d604ae12583736b37ac28e8d87c
2021-08-11 09:45:31 -07:00
4ea6a3aa74 Fix issues with printing certain torch modules (#62447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54420

When I tested on master, with the testing code, there were multiple objects on the garbage collector that cannot be printed.

Testing code:
```
import torch
import gc
import os
import sys

print(torch.__version__)

a = torch.rand(10)

print(a)

objects = gc.get_objects()

for i in range(len(objects)):
   print(objects[i])
```

### 1
```
print(torch.classes)
```

Like SplitInfinity has mentioned in the GitHub issue, the solution here is to set `__file__` for `torch.classes` to something. Similar to [_ops.py](https://github.com/pytorch/pytorch/blob/master/torch/_ops.py#L69), where `__file__` is set to `_ops.py`, we could set `__file__` for torch.classes to `_classes.py`.

### 2
```
print(torch._ops.ops.quantized)
print(torch._ops.ops.atan)
```

When we try to print these two modules, it will call `_OpNamespace::__getattr__`, but the `op_name` is `__file__`. This becomes a problem when `torch._C._jit_get_operation(qualified_op_name)` [(link)](https://github.com/pytorch/pytorch/blob/master/torch/_ops.py#L60) tries to look for an actual op on the native C++ side.

Only when we get the attribute for an actual op, e.g. `print(torch._ops.ops.quantized.elu)`, the `op_name` becomes proper (e.g. `elu`).

My current solution is to return a hardcoded string (i.e. “torch.ops”) if `op_name` is `"__file__"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62447

Reviewed By: saketh-are

Differential Revision: D30234654

Pulled By: yidawang-oss

fbshipit-source-id: de43a8f599739c749fb3307eea015cc61f1da60e
2021-08-11 09:40:41 -07:00
5c00091f02 Shard python_functions.cpp (#62186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62186

This file takes 6 minutes on its own to compile and is the limiting factor for
building `libtorch_python` on a 32-core threadripper. This splits the file into
5 shards which take around 50 seconds each to compile.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962046

Pulled By: albanD

fbshipit-source-id: df13cfaebd54296f10609f67ae74a850c329bd37
2021-08-11 09:21:26 -07:00
c5de83adca Fix inconsisteny between Python and JIT power operation (#62842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62842

Test Plan:
Wrote unit test TestAtenPow to test behavior of aten::pow when:
1. base is int, exponent is int
2. base is int, exponent is float
3. base is float, exponent is int
4. base is float, exponent is float

Specifically, we test that when the base is zero and the exponent is negative, we raise an error. In all other cases, we expect behavior to be the same as the result returned by Python.

Because the C++ code relies on overloading, we need to make sure all combinations of types give us the expected result.
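
A small sketch of the intended parity (float/float case shown):

```python
import torch

@torch.jit.script
def power(base: float, exp: float) -> float:
    return base ** exp

print(power(2.0, -1.0))  # 0.5, same as Python
# power(0.0, -1.0) now raises, mirroring Python's ZeroDivisionError
# for a zero base and a negative exponent.
```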

Reviewed By: zhxchen17

Differential Revision: D30146115

Pulled By: szewaiyuen7

fbshipit-source-id: dc661897ad38da286ee454120fbe41314b7f2995
2021-08-11 08:41:46 -07:00
f446e835ee Fix CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode (#62527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62527

If NDEBUG is applied inconsistently in compilation we might get 'ambiguous declaration' error. Let's make sure that the forward declaration matches glibc including all specifiers.

Test Plan: sandcastle

Reviewed By: mdschatz

Differential Revision: D30030051

fbshipit-source-id: 9f4d5f1d4e74f0a4eaeeaaaad76b93ee485d8bcd
2021-08-11 01:10:09 -07:00
f7611b31aa [4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051

The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.

This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id: 135523770

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29854514

fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
2021-08-10 22:38:31 -07:00
847a7cfa10 Back out "[fx] store Tracer class on Graph and GraphModule for package deserialization" (#63053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63053

Original commit changeset: eca09424ad30

The original diff - D30019214 (6286d33878) breaks the publish flow in model saving.

Test Plan: ci

Differential Revision: D30236517

fbshipit-source-id: 3e05db02fc1cbbc2ed262c83bf56d555277abb34
2021-08-10 21:58:08 -07:00
324673a537 rebase for autocast updates to include device_type and dtype flags (#61002)
Summary:
Fixes #{55374}
https://github.com/pytorch/pytorch/issues/55374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002

Reviewed By: malfet, mruberry

Differential Revision: D30016812

Pulled By: ngimel

fbshipit-source-id: 6e09a29f539d28e9aea5cd9489b1e633cc588033
2021-08-10 20:03:12 -07:00
a55cae3d37 Fix missing element types and shapes when autograd.Function has multiple tensor outputs (#57966)
Summary:
When generating IR for an autograd.Function, if the function has multiple outputs, a TupleUnpack may be inserted after the original function node, and PyTorch only assigns proper information (tensor element type and shape) to the TupleUnpack, forgetting the original function node. In contrast, if the autograd.Function only produces one output, the original function node may have tensor element type and shape in its output schema.

Before this PR:
- (simplified) IR for autograd.Function with one output: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output (tensor, dtype=float32, shape=[4, 5])
- (simplified) IR for autograd.Function with multiple outputs: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output_0 **(tensor)**, output_1 **(tensor)** -> TupleUnpack output_2 (tensor, dtype=float32, shape=[4, 5]), output_3 (tensor, dtype=float32, shape=[6, 7])

After this PR:
- (simplified) IR for autograd.Function with one output: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output (tensor, dtype=float32, shape=[4, 5])
- (simplified) IR for autograd.Function with multiple outputs: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output_0 **(tensor, dtype=float32, shape=[4, 5])**, output_1 **(tensor, dtype=float32, shape=[6, 7])** -> TupleUnpack output_2 (tensor, dtype=float32, shape=[4, 5]), output_3 (tensor, dtype=float32, shape=[6, 7])
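
To make the scenario above concrete, a minimal sketch of the multi-output case:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Two tensor outputs: export inserts a TupleUnpack after the PythonOp,
        # and the PythonOp's own outputs previously lost dtype/shape info.
        return x + 1, x * 2

    @staticmethod
    def backward(ctx, grad1, grad2):
        return grad1 + 2 * grad2

a, b = TwoOutputs.apply(torch.randn(2, 3, requires_grad=True))
```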

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57966

Reviewed By: zhxchen17

Differential Revision: D30208207

Pulled By: gmagogsfm

fbshipit-source-id: 42a3d1f9c0932133112a85df0c49cf4ea0afa175
2021-08-10 19:48:11 -07:00
390c0ac403 remove dead code (#63031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63031

Reviewed By: mruberry

Differential Revision: D30225094

Pulled By: ngimel

fbshipit-source-id: 3666a0fa120bea85225cd3ee04f89d64952d2862
2021-08-10 18:41:13 -07:00
94c5309369 Revert D30199482: [pytorch][PR] Add BFloat16 support for unique and unique_consecutive on CPU
Test Plan: revert-hammer

Differential Revision:
D30199482 (fc0b8e6033)

Original commit changeset: 6f2d9cc1a528

fbshipit-source-id: 39e9f202bcbd978525f792173d4f97b5b329b5b1
2021-08-10 18:27:18 -07:00
d1f9c03cef Use const auto with irange (#62990)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62990

Test Plan: Sandcastle

Reviewed By: zhouzhuojie

Differential Revision: D30199748

fbshipit-source-id: 284b208ffa3c6c4749e5ac9b1fccb28914590f2c
2021-08-10 17:59:01 -07:00
d893b44cd8 change nccl version reporting (#62916)
Summary:
https://github.com/pytorch/pytorch/issues/62295

Previously, the packing and unpacking of the NCCL version "integer" was done to have parity with the upstream NCCL version encoding. However, there doesn't seem to be any place where this integer is directly compared with a version integer sourced from upstream NCCL, and syncing the encoding is error-prone (e.g., a recent change added a special case for minor versions >= 10: 7e51592129/src/nccl.h.in (L22)).

This patch changes the reporting to return a tuple of version numbers instead (to preserve ease-of-use for comparisons) and tweaks the passing between C/Python to avoid the digit overflow problem.
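
A short sketch of the new reporting (requires a Linux build with NCCL):

```python
import torch

if torch.cuda.is_available():
    # Now a tuple such as (2, 10, 3) rather than a packed integer.
    print(torch.cuda.nccl.version())
```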

CC ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62916

Reviewed By: anjali411

Differential Revision: D30201069

Pulled By: mrshenli

fbshipit-source-id: 2e4e7c69f001c3f22bd04aa6df6a992e538bea45
2021-08-10 17:46:27 -07:00
f307120df4 Update test_torch_deploy (#62838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62838

Fixes #62380

* update test functions to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries that are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).

### Test plan
check if all ci workflows pass

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D30193141

Pulled By: tktrungna

fbshipit-source-id: 72c2bd3a740fca0f72e4803df505240193692c44
2021-08-10 16:29:50 -07:00
af6ed084b4 update test_libtorch (#62797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62797

Fixes #62380

* update test functions to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries that are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).

### Test plan
check if all ci workflows pass

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D30193140

Pulled By: tktrungna

fbshipit-source-id: d8e54c403f42abbbbe4556abf40c22a7955df737
2021-08-10 16:29:48 -07:00
2f5ac9c0ba update test distributed (#62796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62796

Fixes #62380

* update test functions to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries that are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).

### Test plan
check if all ci workflows pass

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30193142

Pulled By: tktrungna

fbshipit-source-id: 1247f9eda1c11c763c31c7383c77545b1ead1a60
2021-08-10 16:29:47 -07:00
dfe8445cd7 update test_vulkan (#62795)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62795

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D30124421

Pulled By: tktrungna

fbshipit-source-id: 235ba166b02f7334e89cb2493024067851bf5b9b
2021-08-10 16:29:45 -07:00
25c3b9dc10 update test_rpc (#62781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62781

Test Plan: Imported from OSS

Reviewed By: walterddr, zhouzhuojie

Differential Revision: D30124391

Pulled By: tktrungna

fbshipit-source-id: 99c275d6c9f23b4f274fd0ca19a16879ed27afd5
2021-08-10 16:28:35 -07:00
f807229fd4 [ONNX] add support for prim::Unitialized in lower_tuples pass (#56912)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56911

Code from issue generates this Torchscript:
```
graph(%self : __torch__.MyModule,
      %t.1 : Tensor):
  %12 : None = prim::Constant()
  %7 : str = prim::Constant[value="Negative input"]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:11:28
  %3 : int = prim::Constant[value=0]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:15
  %9 : int = prim::Constant[value=5]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:31
  %33 : (Tensor, Tensor) = prim::Uninitialized()
  %4 : Tensor = aten::lt(%t.1, %3) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:11
  %6 : bool = aten::Bool(%4) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:11
  %34 : (Tensor, Tensor) = prim::If(%6) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:8
    block0():
       = prim::RaiseException(%7) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:11:12
      -> (%33)
    block1():
      %11 : int[] = prim::ListConstruct(%9)
      %16 : Tensor = aten::zeros(%11, %12, %12, %12, %12) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:19
      %18 : int[] = prim::ListConstruct(%9)
      %23 : Tensor = aten::zeros(%18, %12, %12, %12, %12) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:35
      %24 : (Tensor, Tensor) = prim::TupleConstruct(%16, %23)
      -> (%24)
  return (%34)
```

The problem is that the ONNX exporter's lower_tuples pass doesn't support forwarding of tuples through prim::Uninitialized.
The solution is to:
1. add prim::Uninitialized to supported_op in the lower_tuples pass
2. since prim::Uninitialized now has multiple outputs, call giveFreshAlias for every output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56912

Reviewed By: nikithamalgifb

Differential Revision: D29837200

Pulled By: SplitInfinity

fbshipit-source-id: 321fae6fe52b1523df5653dbb9ea73b998ef1cda
2021-08-10 16:21:16 -07:00
4d0497034c Remove process_group_agent and faulty_process_group_agent files (#62985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62985

Remove the process_group_agent and faulty_process_group_agent code now that PROCESS_GROUP backend has been deprecated for RPC (https://github.com/pytorch/pytorch/issues/55615). Discussed with xush6528 that it was okay to remove ProcessGroupAgentTest and ProcessGroupAgentBench which depended on process_group_agent.

Test Plan: CI tests

Reviewed By: pritamdamania87

Differential Revision: D30195576

fbshipit-source-id: 8b4381cffadb868b19d481198015d0a67b205811
2021-08-10 15:57:39 -07:00
790553811c fix sort and topk with discontiguous out (#63029)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62645 and https://github.com/pytorch/pytorch/issues/62940. The root cause of those bugs is a bad interaction between `collapseDims` and setting the size of the sorting/topK dimension to 1. If all other dimensions happen to be 1, `collapseDims` considers that size-1 dimension collapsible (even though it was specifically marked to be preserved) and loses its stride information. If the dimension really were of size 1, the stride information would be unimportant; but since in reality that dimension is not 1 and was only set to 1 for convenience, the loss of stride information results in incorrect outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63029

Reviewed By: heitorschueroff

Differential Revision: D30224925

Pulled By: ngimel

fbshipit-source-id: 269dd375c5cd57c6007fe91f729f8c60a2e7a264
2021-08-10 15:45:28 -07:00
500b24e303 [iOS] enable Metal in the nightly build (#62855)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62855

Test Plan: Test on Private Pod with the [HelloWorld](https://fburl.com/3hiwkkhm) demo

Reviewed By: xta0

Differential Revision: D30174151

Pulled By: hanton

fbshipit-source-id: 22cd8663ac239811bf8ed1c3b6301460d798dbfa
2021-08-10 15:18:58 -07:00
3beb65d45d test_cudnn_convolution_relu skipCUDAIfRocm
Summary: skip rocm test for test_cudnn_convolution_relu

Test Plan: This skips a test

Reviewed By: ngimel

Differential Revision: D30233620

fbshipit-source-id: 31eab8b03c3f15674e0d262a8f55965c1aa6b809
2021-08-10 15:15:23 -07:00
557047eb4c Add docstring for saved tensors default hooks (#62361)
Summary:
Add documentation for the saved tensors default hooks introduced in https://github.com/pytorch/pytorch/issues/61834 / https://github.com/pytorch/pytorch/issues/62563

Sister PR: https://github.com/pytorch/pytorch/issues/62362 (will add a link from autograd.rst to notes/autograd in whatever PR does not land first)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62361

Reviewed By: zou3519

Differential Revision: D30081997

Pulled By: Varal7

fbshipit-source-id: cb923e943e1d96db9669c1d863d693af30910c62
2021-08-10 14:59:38 -07:00
dbb7be2e79 [iOS][CI] Store every version of nightlies in S3 (#63039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63039

Test Plan: Imported from OSS

Reviewed By: hanton

Differential Revision: D30229385

Pulled By: xta0

fbshipit-source-id: 15b438a6326159258803ab97e67dc9ec5db50d59
2021-08-10 14:33:36 -07:00
990c2190d1 [quant][graphmode] Reference pattern support for elu (#62607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62607

Removing the quantize handler for elu since it can be covered by DefaultNodeQuantizeHandler

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30053977

fbshipit-source-id: 426789443e928bb01a88907de616cbda5866f621
2021-08-10 14:00:39 -07:00
f836c4f8bd [fix] TestMultiThreadAutograd: propagate exception from child thread to main thread (#63018)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63018

Reviewed By: anjali411

Differential Revision: D30225856

Pulled By: Varal7

fbshipit-source-id: b5dd7999de5060e06f8958ea3ce49e0b74110971
2021-08-10 13:56:49 -07:00
bfa67264d1 [1/N] Nnapi backend execute and compile (#62272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62272

Added Android NNAPI delegate implementation of runtime initialization (compilation) and execution.
The delegate's preprocess step was [previously implemented](https://github.com/pytorch/pytorch/pull/62225). Now the rest of the delegate, which implements client-side execution, is added.

**nnapi_backend_lib.cpp**:
Implementation of delegate's compile and execute.
`execute()` is essentially a C++ implementation of [`NnapiModule`](https://github.com/pytorch/pytorch/blob/master/torch/backends/_nnapi/prepare.py), which wraps an NNAPI Compilation and handles preparation of weights, inputs, and outputs.
- Any steps that can be done before execution are moved to `compile()`.
    - `init()` cannot be moved to `compile()` because it requires real inputs for dynamic shaping.
    - `shape_compute_module` cannot currently be deserialized in `compile()`, since mobile::Module has no IValue conversion.
- Processed arguments that are modified by `init()` must be kept as member variables. Any other processed arguments are passed through a dictionary, `handles`.

**nnapi_bind.cpp & nnapi_bind.h**:
Created a header file for `nnapi_bind.cpp`, so that it's NnapiCompilation class can be used by `nnapi_backend_lib.cpp`.
**test_backend_nnapi.py**:
Enabled execution testing.
ghstack-source-id: 135432844

Test Plan:
Imported from OSS

Tested on devserver.
1. Load and unpack a special devserver build of NNAPI: `jf download GICWmAAzUR0eo20TAPasVts8ObhobsIXAAAz --file "nnapi-host-linux.tar.xz"`
2. `export LIBNEURALNETWORKS_PATH=/path/to/libneuralnetworks.so`
3. Run unittests: `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py`

TODO: test with lite interpreter runtime

Reviewed By: raziel, iseeyuan

Differential Revision: D29944873

fbshipit-source-id: 48967d873e79ef2cce9bcba2aeea3c52f7a18c07
2021-08-10 13:37:39 -07:00
fc0b8e6033 Add BFloat16 support for unique and unique_consecutive on CPU (#62559)
Summary:
Add BFloat16 support for unique and unique_consecutive on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62559

Reviewed By: anjali411

Differential Revision: D30199482

Pulled By: ngimel

fbshipit-source-id: 6f2d9cc1a528bea7c723139a4f1b14e4b2213601
2021-08-10 13:22:54 -07:00
cb7f35d47a [quant][refactor] Checking activation_dtype instead of activation_post_process (#62489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62489

Addressing comment from previous PR: https://github.com/pytorch/pytorch/pull/62374#discussion_r679354145

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30053980

fbshipit-source-id: 79c216410282eccd6f0a8f24e38c55c4d18ec0d0
2021-08-10 12:17:36 -07:00
6d21e36f21 LU solve uses cuBLAS and cuSOLVER for matrices with dim > 1024 (#61815)
Summary:
This PR builds off of https://github.com/pytorch/pytorch/issues/59148 and modifies the `lu_solve` routine to avoid MAGMA for `b` or `lu_data` matrices with any dimension > 1024, since MAGMA has a bug when dealing with such matrices (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for).
Fixes https://github.com/pytorch/pytorch/issues/36921
Fixes https://github.com/pytorch/pytorch/issues/61929
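
A hedged sketch of the affected path (CUDA required; `torch.lu`/`torch.lu_solve` as they existed at this commit):

```python
import torch

if torch.cuda.is_available():
    # Any dimension > 1024 now routes to cuBLAS/cuSOLVER instead of MAGMA.
    A = torch.randn(1100, 1100, device="cuda")
    b = torch.randn(1100, 2, device="cuda")
    LU, pivots = torch.lu(A)
    x = torch.lu_solve(b, LU, pivots)
```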

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61815

Reviewed By: anjali411

Differential Revision: D30199618

Pulled By: ngimel

fbshipit-source-id: 06870793f697e9c35aaaa8254b8a8b1a38bd3aa9
2021-08-10 11:07:16 -07:00
0c39cea3d2 [sharded_tensor] add default fields to ShardedTensorMetadata (#62867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62867

This add default fields for ShardedTensorMetadata, to allow easy construction and modification afterwards.
ghstack-source-id: 135284133

Test Plan: ShardedTensorMetadata validity should be guarded with `init_from_local_shards` API and its tests.

Reviewed By: pritamdamania87

Differential Revision: D30148481

fbshipit-source-id: 0d99f41f23dbeb4201a36109556ba23b9a6c6fb1
2021-08-10 11:00:01 -07:00
5fb79f61a8 [DDP] Dont set thread local state in reducer autograd hook. (#62996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996

No need to set this because autograd engine already propagates TLS
states.
ghstack-source-id: 135438220

Test Plan: CI

Reviewed By: albanD

Differential Revision: D30202078

fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
2021-08-10 10:50:16 -07:00
6915bc0781 [typing] suppress errors in fbcode/caffe2 - batch 2
Test Plan: Sandcastle

Differential Revision: D30222378

fbshipit-source-id: 6a0a5d210266f19de63273240a080365c9143eb0
2021-08-10 10:26:52 -07:00
ea808df25d Test shape analysis with opinfos (#59814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59814

Using opinfos to test shape analysis. By default, we just check that we don't give incorrect answers; then, if `assert_jit_shape_analysis` is true, we test that we correctly propagate the full shape. And it found a couple of bugs 😃

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D30200058

Pulled By: eellison

fbshipit-source-id: 6226be87f5390277cfa5a1fffaa1b072d4bc8803
2021-08-10 09:47:33 -07:00
7312bd953c add support for a few more opinfos in jit (#59812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59812

This is sort of a half measure: we can successfully trace through opinfos which are registered as lambdas, we just can't script them. This checks whether the op is a lambda, in which case it bails. See the next PR to get `resize_` to work; maybe this should be consolidated with that.

Test Plan: Imported from OSS

Reviewed By: pbelevich, zhxchen17

Differential Revision: D30200061

Pulled By: eellison

fbshipit-source-id: 7e3c9b0be746b16f0f57ece49f6fbe20bf6535ec
2021-08-10 09:47:32 -07:00
9cbdc90d73 Don't substitute in symbolic shapes to shape compute graph (#59811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59811

We don't want to actually substitute in symbolic shapes, because it invalidates the partially evaluated graph for further use.

Test Plan: Imported from OSS

Reviewed By: pbelevich, zhxchen17

Differential Revision: D30200059

Pulled By: eellison

fbshipit-source-id: 267ed97d8421fe480dec494cdf0dec9cf9ed3ba2
2021-08-10 09:47:30 -07:00
7db0bcfb40 small cleanups (#59810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59810

Rephrasings and cleanup of dead code

Test Plan: Imported from OSS

Reviewed By: pbelevich, zhxchen17

Differential Revision: D30200062

Pulled By: eellison

fbshipit-source-id: b03e5adb928aa46bee6685667cad43333b6e6016
2021-08-10 09:47:28 -07:00
9cd990de0d Only optimize after change (redo) (#59809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59809

Somehow this didn't get landed previously due to a ghstack mixup.

Test Plan: Imported from OSS

Reviewed By: pbelevich, zhxchen17

Differential Revision: D30200060

Pulled By: eellison

fbshipit-source-id: 47f256421a1fe1a005cd11fcc4d7f023b5990834
2021-08-10 09:46:13 -07:00
4c630773e8 [jit] warn if _check_overload_body fails to find source
Summary:
Under certain conditions (particularly if a module is frozen, like with
PyInstaller or torch::deploy), we will not have source code available for
functions. `import torch` should still work in this case, but this check is
currently causing it to raise an exception.

Since this is an initial check (if an overload is actually exercised there will be a hard failure), raise a warning and move on.

Test Plan: unit tests

Reviewed By: eellison

Differential Revision: D30214271

fbshipit-source-id: eb021503e416268e8585e0708d6271c1e7b91e95
2021-08-10 09:28:50 -07:00
aa89d5f7f6 [quant] Update get_default_qat_qconfig to return the fused observer+fake_quant module (#62702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62702

Expose the qconfig to the user to speed up training by leveraging the fused module.
The module currently supports per-tensor/per-channel moving avg observer and fake-quantize.

For details on perf benefits, refer to https://github.com/pytorch/pytorch/pull/61691
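
A minimal sketch of how a user might pick up the fused module through the default QAT qconfig (the 'fbgemm' backend string is an illustrative choice, not mandated by this change):

```python
import torch

# After this change, the returned qconfig is backed by the fused
# observer + fake-quantize module described above.
qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
print(qconfig)
```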

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D30093719

fbshipit-source-id: b78deb7810f5b597474b9b9a0395d361d04eb46a
2021-08-10 09:28:49 -07:00
08d1a12d69 [quant] add reduce_range option to FusedMovingAvgFakeQuantize module (#62863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62863

To make this consistent with other observers, add a reduce_range option that can be used to update quant_min/quant_max.

Test Plan:
python test/test_quantization.py test_fused_mod_reduce_range

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D30146602

fbshipit-source-id: a2015f095766f9c884611e9ab6942528bc9bc972
2021-08-10 09:27:01 -07:00
978490d7c7 Codegen: Fix operator::name on windows (#62278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62278

In `Operators.h` we're using `str(BaseOperatorName)`, while in
`OperatorsEverything.cpp` we're using `str(OperatorName)`. e.g.
```
STATIC_CONSTEXPR_STR_INL_EXCEPT_WIN_CUDA(name, "aten::abs")
```
vs
```
STATIC_CONST_STR_OUT_OF_LINE_FOR_WIN_CUDA(abs_out, name, "aten::abs.out")
```

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962047

Pulled By: albanD

fbshipit-source-id: 5a05b898fc734a4751c2b0187e4eeea4efb0502b
2021-08-10 07:58:09 -07:00
cdf702b60c Reject kwonly arguments passed positionally in torch.ops (#62981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62981

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D30211030

Pulled By: ezyang

fbshipit-source-id: aae426592e92bf3a50076f470e153a4ae7d6f101
2021-08-10 07:16:00 -07:00
9e7b6bb69f Allow LocalResponseNorm to accept 0 dim batch sizes (#62801)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

This PR allows `LocalResponseNorm` to accept tensors with 0 dimensional batch size.
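
A minimal sketch of the newly accepted input (sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=2)
x = torch.empty(0, 3, 8, 8)   # batch dimension of size zero
print(lrn(x).shape)           # torch.Size([0, 3, 8, 8])
```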

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62801

Reviewed By: zou3519

Differential Revision: D30165282

Pulled By: jbschlosser

fbshipit-source-id: cce0b2d12dbf47dc8ed6247c267bf2f2305f858a
2021-08-10 06:54:52 -07:00
061062ae2a Update TensorPipe submodule
Test Plan: CI ran as part of https://github.com/pytorch/pytorch/pull/60938.

Reviewed By: beauby

Differential Revision: D30219343

fbshipit-source-id: 531338f912fee488d312d23da8bda63ceb862aa9
2021-08-10 05:46:12 -07:00
3df4870343 [Reland][DDP] Support not all outputs used in loss calculation (#61753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753

Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.

In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.

We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients and we don't touch them if they are globally `None`. Note that the hooks still fire with an undefined gradient, which is how we avoid the Reducer erroring out with the message that some hooks did not fire.

Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29726179

fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
2021-08-09 22:29:11 -07:00
5ed6e4429e To fix variance computation for complex Adam (#62946)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59998

As discussed in the issue, the variance term of the Adam optimizer is currently not computed correctly for the complex domain. As stated in the "Generalization to complex numbers" section of https://en.wikipedia.org/wiki/Variance, the variance of a complex random variable X is computed as E[(X - mu)(X - mu)*], where mu = E[X] and * denotes the complex conjugate.

However, the current Adam implementation computes E[(X - mu)(X - mu)], which does not return the right variance value; in particular, it returns a complex number. Variance is defined to be a real number even when the underlying random variable is complex.

We fix this issue here and test that the resulting variance is indeed a real number.
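
A minimal sketch of the distinction (not the optimizer's actual code):

```python
import torch

g = torch.randn(4, dtype=torch.complex64)

wrong = g * g          # E[(X - mu)(X - mu)]: complex-valued, incorrect
right = g * g.conj()   # E[(X - mu)(X - mu)*]: imaginary part is exactly zero

print(wrong)                    # generally has nonzero imaginary parts
print(right.imag.abs().max())   # tensor(0.)
```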

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62946

Reviewed By: albanD

Differential Revision: D30196038

Pulled By: iramazanli

fbshipit-source-id: ab0a6f31658aeb56bdcb211ff86eaa29f3f0d718
2021-08-09 17:54:43 -07:00
3c1d1170a4 [quant][graphmode][fx] Attach a weight qparam dict to linear and conv in reference quantized model (#62488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62488

Instead of attaching a weight observer/fake_quant to the float linear and conv, we can compute the quantization parameters and attach them as a dictionary to these modules, so that we can reduce the model size and make the reference module clearer.

TODO: the numerics for linear and conv in the reference quantized model are still not correct, since we did not quantize the weight; we may explore things like parameterization to implement this support.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D30053979

fbshipit-source-id: b5f8497cf6cf65eec924df2d8fb10a9e154b8cab
2021-08-09 16:55:14 -07:00
59ac451ba3 Simplify the logic of running ci workflow codegen (#62853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62853

Wanted to simplify the logic in `__post_init__` and delegate the settings back to individual workflows. This gives us more flexibility in changing individual workflows, as well as reducing the complexity of understanding the mutation conditions.

Test Plan: Imported from OSS

Reviewed By: walterddr, seemethere

Differential Revision: D30149190

Pulled By: zhouzhuojie

fbshipit-source-id: 44df5b1e14184f3a81cb8004151525d0e0fb20d9
2021-08-09 16:47:46 -07:00
8720369a48 irange-ify 12b (#62484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62484

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D30015528

fbshipit-source-id: c4e1a5425a73f100102a97dcec1579f1049c9c1d
2021-08-09 16:40:47 -07:00
93e0f3a330 Shard Operators.cpp (#62185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62185

This file can take 5 minutes on its own to compile, and is the single limiting
factor for compile time of `libtorch_cpu` on a 32-core threadripper. Instead,
sharding into 5 files that take around 1 minute each cuts a full minute off the
overall build time.

This also factors out the `.findSchemaOrThrow(...).typed` step so the code can
be shared between `call` and `redispatch`.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962049

Pulled By: albanD

fbshipit-source-id: be5df05fbea09ada0d825855f1618c25a11abbd8
2021-08-09 16:19:49 -07:00
4b9ca72c7c irange-ify 13d (#62477)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62477

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D30001499

fbshipit-source-id: 993eb2b39f332ff0ae6c663792bd04734cfc262b
2021-08-09 16:16:58 -07:00
d16587f84d Enable rebuilds for Ninja on Windows (#62948)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59859.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62948

Reviewed By: seemethere, tktrungna

Differential Revision: D30192246

Pulled By: janeyx99

fbshipit-source-id: af25cc4bf0db67a1304d9971cfa0ff6831bb3b48
2021-08-09 16:15:45 -07:00
a82b9ef1ff BFP16 quantization/dequantization (#62974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62974

Testing the functionality of the `tensor.to` approach.
Comparing the `tensor.to` and `torch.ops.fb.FloatToBfloat16Quantized` approaches and testing whether they match for 2D tensors.

Test Plan: buck test //torchrec/fb/distributed/tests:test_quantized_comms

Reviewed By: wanchaol

Differential Revision: D30079121

fbshipit-source-id: 612e92baeb2245449637faa9bc31686353d67033
2021-08-09 15:47:07 -07:00
c4aeecac75 Migrate Embedding thrust sort to cub sort (#62495)
Summary:
This PR only migrates sort. Other thrust operations will be migrated in follow-up PRs.

Benchmark `num_embeddings` pulled from https://github.com/huggingface/transformers/tree/master/examples by
```
grep -P 'vocab_size.*(=|:)\s*[0-9]+' -r transformers/examples/
grep -P 'hidden_size.*(=|:)\s*[0-9]+' -r transformers/examples/
```
to get `vocab_size = 119547, 50265, 32000, 8000, 3052` (similar size omitted) and `hidden_size = 512, 768`

Code:
```python
import torch
import itertools

num_embeddings = (119547, 50265, 32000, 8000, 3052)
num_tokens = (4096, 16384)
hidden_sizes = (512, 768)

for ne, nt, nh in itertools.product(num_embeddings, num_tokens, hidden_sizes):
    print(f"Embedding size: {ne}, Tokens: {nt}, Hidden size: {nh}")
    embedding = torch.nn.Embedding(ne, nh).cuda()
    input_ = torch.randint(ne, (nt,), device='cuda')
    out = embedding(input_)
    torch.cuda.synchronize()
    %timeit out.backward(out, retain_graph=True); torch.cuda.synchronize()
```

## On CUDA 11.3.1

Before:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.43 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.07 ms ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.61 ms ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.32 ms ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
738 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.02 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
913 µs ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.27 ms ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
559 µs ± 860 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
743 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
713 µs ± 969 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
977 µs ± 884 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
301 µs ± 8.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
383 µs ± 4.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
409 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
515 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
215 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
250 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
271 µs ± 888 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
325 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

After:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.42 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.05 ms ± 9.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.6 ms ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.3 ms ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
730 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.01 ms ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
887 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.25 ms ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
556 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
744 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
691 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
957 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
309 µs ± 2.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
376 µs ± 2.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
381 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
487 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
202 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
239 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
243 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
340 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

## On CUDA 11.1

Before:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.41 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.05 ms ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.61 ms ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.32 ms ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
743 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.02 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
912 µs ± 5.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.28 ms ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
555 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
743 µs ± 655 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
714 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
980 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
312 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
386 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
413 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
512 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
209 µs ± 585 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
271 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
297 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
377 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

After:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.46 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.09 ms ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.64 ms ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.35 ms ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
782 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.06 ms ± 596 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
945 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.31 ms ± 553 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
603 µs ± 856 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
789 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
752 µs ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
1.01 ms ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
323 µs ± 7.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
398 µs ± 765 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
412 µs ± 544 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
519 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
229 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
263 µs ± 417 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
274 µs ± 576 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
354 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62495

Reviewed By: gchanan

Differential Revision: D30176833

Pulled By: ngimel

fbshipit-source-id: 44148ebb53a0abfc1e5ab8b986865555bf326ad1
2021-08-09 15:31:55 -07:00
084e92bb76 Use output memory format based on input for cudnn_convolution_relu (#62482)
Summary:
Currently when cudnn_convolution_relu is passed a channels last Tensor it will return a contiguous Tensor. This PR changes this behavior and bases the output format on the input format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482

Reviewed By: ngimel

Differential Revision: D30049905

Pulled By: cpuhrsch

fbshipit-source-id: 98521d14ee03466e7128a1912b9f754ffe10b448
2021-08-09 15:31:53 -07:00
4fdb9579fa irange-ify 12 (#62120)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62120

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879713

fbshipit-source-id: 3084a5eacb722f7fb0a630d47bf694f4d6831136
2021-08-09 15:31:51 -07:00
da9958c899 irange-ify 1 (#62193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62193

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879504

fbshipit-source-id: adc86adcd1e7dcdfa2d7adf4d576f081430d52ec
2021-08-09 15:30:43 -07:00
161fb31893 Fix render_test_results if condition on always() (#62997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62997

Fixes #62979; changed the condition to listen on the previous job's result being either 'success' or 'failure'.

Notice that 'skipped' will also skip this job, which is what
we want.

Test Plan: Imported from OSS

Reviewed By: driazati, seemethere

Differential Revision: D30202598

Pulled By: zhouzhuojie

fbshipit-source-id: f3c0f715c39a5c8119b528b66e45f594a54b49d1
2021-08-09 15:27:40 -07:00
39ec1da935 [reland] Gate DistributedOptimizers on RPC availability (#62937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62937

Reland due to a Windows + CUDA failure; fixed by running it on Gloo on Windows even with CUDA.
ghstack-source-id: 135306176

Test Plan: ci

Reviewed By: mrshenli

Differential Revision: D30177734

fbshipit-source-id: 7625746984c8f858648c1b3632394b98bd4518d2
2021-08-09 14:41:06 -07:00
5b8389e536 irange-ify 8d (#62505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62505

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29971891

fbshipit-source-id: 7dcbe27221788695f320c7238f5fe81e32823802
2021-08-09 13:18:38 -07:00
6286d33878 [fx] store Tracer class on Graph and GraphModule for package deserialization (#62497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62497

Previously named: add support for custom tracer in __reduce_package__

Stores a Tracer class on a Graph created by Tracer, and copies the Tracer class into the GraphModule's state so that when a GraphModule is packaged by torch package, it can be reconstructed with the same Tracer and GraphModule class name.

Reviewed By: suo

Differential Revision: D30019214

fbshipit-source-id: eca09424ad30feb93524d481268b066ea55b892a
2021-08-09 13:07:30 -07:00
f82d4b8957 Mark unused functions with C10_UNUSED (#62929)
Summary:
Which fixes a number of warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62929

Reviewed By: walterddr, albanD

Differential Revision: D30171953

Pulled By: malfet

fbshipit-source-id: f82475289ff4aebb0c97794114e94a24d00d2ff4
2021-08-09 13:00:33 -07:00
08f6bc1da6 Stop exporting symbols in anonymous namespaces (#62952)
Summary:
These cases were found by compiling with clang on Windows.
Those functions would otherwise still be exported, which is a waste of space in the symbol table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62952

Reviewed By: gchanan

Differential Revision: D30191291

Pulled By: ezyang

fbshipit-source-id: 3319b0ec4f5fb02e0fe1b81dbbcedcf12a0c795e
2021-08-09 12:52:12 -07:00
3dcd785cac [Static Runtime] Add tests for all aten ops (#62347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347

This diff includes tests for all `aten` ops that did not already have test coverage.

Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29968280

fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2
2021-08-09 12:09:59 -07:00
a01f832329 handle get_attr operations in typechecker (#62682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62682

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30107789

Pulled By: migeed-z

fbshipit-source-id: 0b21b2893e2dc7cfaf5b5f5990f662e051a981b4
2021-08-09 11:49:04 -07:00
3eeaffc7c5 Linker version script to hide LLVM symbols (#62906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62906

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30193893

Pulled By: bertmaher

fbshipit-source-id: 9b189bfd8d4c52e8dc4296a4bed517ff44994ba0
2021-08-09 11:26:02 -07:00
1b1f1e36b4 Add `allow_empty_param_list` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522

Addresses https://github.com/pytorch/pytorch/issues/62481

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30072074

Pulled By: andwgu

fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3
2021-08-09 11:18:56 -07:00
710c419f11 [Vulkan] Added Hardshrink op (#62870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62870

Added Hardshrink operator for Vulkan
Added tests for Hardshrink op

Reference: [Hardshrink](https://pytorch.org/docs/stable/generated/torch.nn.Hardshrink.html#torch.nn.Hardshrink)

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D30174950

Pulled By: beback4u

fbshipit-source-id: 3e192390eb9f92abecae966e84bbfae356bfd7c8
2021-08-09 10:54:11 -07:00
922710f9b9 Change output node handling for typechecker to deal with tuples (#62582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62582

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D30050004

Pulled By: migeed-z

fbshipit-source-id: 9b81b10d24e1e8165cdc18c820ea314349b463cb
2021-08-09 10:47:12 -07:00
e55f271859 __torch_dispatch__: Populate kwargs dictionary with keyword-only arguments (#62822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62822

This is BC breaking for people who were using the old integration,
although only if you had been writing bindings for functions with
keyword-only arguments (that includes functorch).  Other than that,
the patch was pretty straightforward.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30134552

Pulled By: ezyang

fbshipit-source-id: a47f536fb030994a07c9386069b8f800ac86d731
2021-08-09 10:02:54 -07:00
2b83007ae2 Modify GHA CI to use PYTORCH_IGNORE_DISABLED_ISSUES based on PR body (#62851)
Summary:
Another step forward in fixing https://github.com/pytorch/pytorch/issues/62359

Disclaimer: this only works with GHA for now, as circleci would require changes in probot.

The test plan can be seen in a previous description, where I modified the PR description to include linked issues. I've removed them now since the actual PR doesn't fix any of them.

It works! In the [periodic 11.3 test1](https://github.com/pytorch/pytorch/pull/62851/checks?check_run_id=3263109970), we get this in the logs and we see that PYTORCH_IGNORE_DISABLED_ISSUES is properly set:
```
  test_jit_cuda_extension (__main__.TestCppExtensionJIT) ... Using /var/lib/jenkins/.cache/torch_extensions/py36_cu113 as PyTorch extensions root...
Creating extension directory /var/lib/jenkins/.cache/torch_extensions/py36_cu113/torch_test_cuda_extension...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/lib/jenkins/.cache/torch_extensions/py36_cu113/torch_test_cuda_extension/build.ninja...
Building extension module torch_test_cuda_extension...
Using envvar MAX_JOBS (30) as the number of workers...
[1/3] c++ -MMD -MF cuda_extension.o.d -DTORCH_EXTENSION_NAME=torch_test_cuda_extension -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.6/site-packages/torch/include -isystem /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.6/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -c /var/lib/jenkins/workspace/test/cpp_extensions/cuda_extension.cpp -o cuda_extension.o
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=torch_test_cuda_extension -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.6/site-packages/torch/include -isystem /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.6/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=compute_52 -gencode=arch=compute_52,code=sm_52 --compiler-options '-fPIC' -O2 -std=c++14 -c /var/lib/jenkins/workspace/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
[3/3] c++ cuda_extension.o cuda_extension.cuda.o -shared -L/opt/conda/lib/python3.6/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o torch_test_cuda_extension.so
Loading extension module torch_test_cuda_extension...
ok (26.161s)
```

whereas on the latest master periodic 11.1 windows [test](https://github.com/pytorch/pytorch/runs/3263762478?check_suite_focus=true), we see
```
test_jit_cuda_extension (__main__.TestCppExtensionJIT) ... skip (0.000s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62851

Reviewed By: walterddr, tktrungna

Differential Revision: D30192029

Pulled By: janeyx99

fbshipit-source-id: fd2ecc59d2b2bb5c31522a630dd805070d59f584
2021-08-09 09:48:56 -07:00
8b54b14f92 [Static Runtime] Added a cache for NNC generated code across different calls to the same ops (#62921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62921

Added a cache for NNC generated code across different calls to the same ops.

Before this diff:
```
ProcessedNode time 13402.9 ms
Static Module initialization took 30964.8 ms
```

After this diff:
```
ProcessedNode time 85.4195 ms
Static Module initialization took 4348.42 ms
```

There is one global cache for all the ops. It is guarded with a reader-writer lock. This is necessary because we could have multiple threads loading different models in parallel. Note that this locking does not guarantee that there will be exactly one code generated for each op. There could be more than one thread generating code for the same op simultaneously, and all of them will update the cache in some order. But that should be a small number, bounded by the number of threads. Also, there is no correctness issue, since the generated code is always the same; the one generated by the last thread is retained in the cache and reused later while running the model.

Test Plan: Tested inline_cvr model

Reviewed By: hlu1

Differential Revision: D30104017

fbshipit-source-id: 32e9af43d7e724ed54b661dfe58a73a14e443ff7
2021-08-09 09:30:07 -07:00
3782f3eced Enable upper for torch.linalg.cholesky (#62434)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61988
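
A minimal usage sketch of the new flag (matrix is an arbitrary SPD example):

```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)
A = A @ A.T + 3 * torch.eye(3, dtype=torch.float64)  # symmetric positive definite

L = torch.linalg.cholesky(A)              # lower-triangular factor (default)
U = torch.linalg.cholesky(A, upper=True)  # upper-triangular factor
assert torch.allclose(U, L.T)             # for real inputs, U == L^T
```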

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62434

Reviewed By: seemethere, tktrungna

Differential Revision: D30079806

Pulled By: walterddr

fbshipit-source-id: 044efb96525155c9bc7953ac4ad47c1b7c12fb20
2021-08-09 09:28:33 -07:00
e54ee9bac1 [nnc] Updated IR cloning to create clones of expressions in addition to statements (#62833)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62833

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D30135980

Pulled By: navahgar

fbshipit-source-id: e557eedec7ecf596a4045756276d25a485fa66fb
2021-08-09 09:13:03 -07:00
5deeaab36a minor fixes in c10d for Windows (#62953)
Summary:
Found by triggering builds with clang on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62953

Reviewed By: gchanan

Differential Revision: D30191300

Pulled By: ezyang

fbshipit-source-id: d929119768298084c41d70dbc3a78aacd64fb715
2021-08-09 09:05:09 -07:00
fff83f3f66 Add handling of list write to remove mutation (#62904)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62904

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D30168493

Pulled By: eellison

fbshipit-source-id: 3b25982b235938cc7439dd3a5236dfce68254c05
2021-08-09 08:56:06 -07:00
254148ec7d Add tensor-scalar op (#62903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62903

Test Plan: Imported from OSS

Reviewed By: pbelevich, SplitInfinity

Differential Revision: D30168338

Pulled By: eellison

fbshipit-source-id: 7dcb34ddd76c6aad4108a4073d3c8a93d974d0ef
2021-08-09 08:54:47 -07:00
4c4c5b14e4 Port sum.dim_IntList kernel to structured kernels. (#61642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61642

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29783865

Pulled By: ezyang

fbshipit-source-id: 375d4cd5f915812108367601a610a428762e606d
2021-08-09 08:46:16 -07:00
c7db642a72 Adding collective quantization API (#62142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142

Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and performs dequantization.

Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized

Reviewed By: wanchaol

Differential Revision: D29682812

fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
2021-08-09 08:11:22 -07:00
6ccedc7c1f Set mkl thread locally (#62891)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62891

Fixes #60469

We want to land this PR before the next release, so we adopted the idea from raven38 in https://github.com/pytorch/pytorch/pull/60471 and added a corresponding test to verify the result.

- Before this PR, using this test:
![image](https://user-images.githubusercontent.com/68879799/128542334-1b899be5-2b6e-4c03-8ac0-568fb15470b8.png)
- After this PR, the test passes without error.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30161483

Pulled By: ejguan

fbshipit-source-id: 800f7204e0e1a19c492b2e556c92a91115f1b69b
2021-08-09 07:37:18 -07:00
30214aef2d [BE] irangefy (#62928)
Summary:
Replace for loops with `irange` loops. Also fix some unused-variable warnings in range-loop cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62928

Reviewed By: driazati

Differential Revision: D30171904

Pulled By: malfet

fbshipit-source-id: 1b437a0f7e3515f4a2e324f3450e93312f1933ae
2021-08-07 13:34:13 -07:00
9f7aba737b Make IMethod cache mutable so getArgument works on const IMethod (#62834)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62834

Test Plan: existing unit tests

Reviewed By: alanwaketan

Differential Revision: D30135939

fbshipit-source-id: e19c0ac1af6996e065a18318351265b5c4a01e70
2021-08-06 22:58:21 -07:00
b80dffd911 [TensorExpr] Remove more 'const' from IRVisitor methods for *Imm types. (#62932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62932

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30172961

Pulled By: ZolotukhinM

fbshipit-source-id: 9b7f45880d356f823364135fe29fc08f6565f827
2021-08-06 22:44:09 -07:00
b45cf9b81b Revert D30117838: [WIP] Gate DistributedOptimizers on RPC availability
Test Plan: revert-hammer

Differential Revision:
D30117838 (3f09485d7e)

Original commit changeset: e6365a910a3d

fbshipit-source-id: f276b2b2bdf5f7bd27df473fca0eebaee9f7aef2
2021-08-06 22:10:41 -07:00
e6a3154519 Allow broadcasting along non-reduction dimension for cosine similarity (#62912)
Summary:
Checks introduced by https://github.com/pytorch/pytorch/issues/58559 are too strict and disable correctly working cases that people were relying on.
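
One example of such a case (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# (3, 1, 5) and (1, 4, 5) broadcast to (3, 4, 5) along the non-reduction
# dimensions; the similarity itself reduces over dim=-1.
a = torch.randn(3, 1, 5)
b = torch.randn(1, 4, 5)
out = F.cosine_similarity(a, b, dim=-1)
print(out.shape)  # torch.Size([3, 4])
```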

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62912

Reviewed By: jbschlosser

Differential Revision: D30165827

Pulled By: ngimel

fbshipit-source-id: f9229a9fc70142fe08a42fbf2d18dae12f679646
2021-08-06 19:17:04 -07:00
6630d98ae5 Refactor codegen file sharding (#62184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62184

File sharding is currently implemented twice, once for VariableType and once for
TraceType. This refactors the implementation into `FileManager` and also changes
it so template substitution is only done once and shared between the sharded
file and the "Everything" file.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962050

Pulled By: albanD

fbshipit-source-id: 7858c3ca9f6e674ad036febd2d1a4ed2323a2861
2021-08-06 19:13:42 -07:00
44fad84bca [DDP] Add host-side time to CUDATimer (#62770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62770

Adding timing of forward, backward comp, backward comm, etc will help
detect desynchronization issues.
ghstack-source-id: 135195680

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D30115585

fbshipit-source-id: 509bf341c5c92dcc63bdacd3c1e414da4eb4f321
2021-08-06 18:41:40 -07:00
22e3cc21e5 Back out "Enable test_api IMethodTest in OSS" (#62893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62893

Original commit changeset: 50eb3689cf84

Test Plan: Confirm pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 passes in OSS

Reviewed By: seemethere, alanwaketan

Differential Revision: D30159999

fbshipit-source-id: 74ff8975328409a3dc8222d3e2707a1bb0ab930c
2021-08-06 16:43:50 -07:00
bbe2c8e6d2 Fix reshape for the Lazy key (#62846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62846

Test Plan: CI

Reviewed By: zou3519

Differential Revision: D30162185

Pulled By: asuhan

fbshipit-source-id: d582dcef35ce7e8bebf161a5c93e470339891e29
2021-08-06 15:29:56 -07:00
6e24ce7a46 Revert D30138788: [pytorch][PR] OpInfo for adaptive_avg_pool2d
Test Plan: revert-hammer

Differential Revision:
D30138788 (5c431981b5)

Original commit changeset: 66735ceaa85b

fbshipit-source-id: 75eb241ef82d32d6480db069c035df0abc6753fe
2021-08-06 15:17:05 -07:00
d9154b9b26 [quant] Input-Weight Equalization - allow logical evaluation (#61603)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61603

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D29686878

fbshipit-source-id: 67ca4cab98b3d592ff2bb8db86499789b85bd582
2021-08-06 15:10:32 -07:00
43b087791c .github: Make sure to deep clone on windows (#62907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62907

Deep clones allow us to use git commands on historical commits so that
we can do things like collect test times correctly

Should fix empty `.pytorch-test-times.json` files that walterddr was observing

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D30166414

Pulled By: seemethere

fbshipit-source-id: 1f9904eeb5a8ebaf0a02d1aa7291fffe1aecd57b
2021-08-06 15:06:56 -07:00
e3944ab00e Revert D30038175: Improve IMethod::getArgumentNames to deal with empty argument names list
Test Plan: revert-hammer

Differential Revision:
D30038175 (64b3ab6407)

Original commit changeset: 46f08dda9418

fbshipit-source-id: 604735d2300487a0b75890b330d7ba5b3e7145b2
2021-08-06 14:58:43 -07:00
7a3f1386ae Add GradBucket::parameters() to ddp_comm_hooks.rst (#62877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62877

as title
ghstack-source-id: 135214612

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30153490

fbshipit-source-id: d4cec434a53ef6e65b60c065804884d1a114aa0d
2021-08-06 14:50:47 -07:00
6d24a075cb Check contiguous to dispatch to NHWC cuda template (#62839)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/62773.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62839

Reviewed By: H-Huang

Differential Revision: D30142906

Pulled By: ngimel

fbshipit-source-id: 600a7ad240a4a1827352eab8c8cbc98240d693f0
2021-08-06 14:11:10 -07:00
e6e579ce74 [FX] Add torch.memory_format as a BaseArgumentType (#62593)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62593

Reviewed By: H-Huang

Differential Revision: D30104091

Pulled By: cpuhrsch

fbshipit-source-id: 25b7a4b308219860c969db54d7b1867b1aa4180a
2021-08-06 14:03:41 -07:00
97dc43beeb use test environment for test phase (#62824)
Summary:
Currently all tests generated in the test matrix share the same `BUILD_ENVIRONMENT` variable. We should distinguish them because some test scripts use BUILD_ENVIRONMENT to differentiate what to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62824

Reviewed By: zhouzhuojie

Differential Revision: D30162250

Pulled By: walterddr

fbshipit-source-id: 3a99a21e91e02ed8638feed102e7966af01dd175
2021-08-06 11:52:41 -07:00
786934902c Adds JOB_BASE_NAME to steps of CircleCI mac workflows (#62892)
Summary:
Upon noticing that we had a job entry named "None" in our S3 stats, I set out to find which test-reporting job had a JOB_BASE_NAME that wasn't set.

It turns out that all workflows other than Windows and Linux did not have JOB_BASE_NAME and instead used CIRCLE_JOB. This remedies the current issue by explicitly setting JOB_BASE_NAME in Mac workflows, but doesn't touch anything else, as those other jobs (like Android) do not report test stats.

This also adds back the CIRCLE_JOB dependency in print_test_stats to be backwards compatible, but the goal is to move off of CIRCLE_JOB dependency to a more CI-platform-agnostic naming of variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62892

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

{F639556801}
The "None" entry is now reported as the macOS job!

Reviewed By: walterddr

Differential Revision: D30160234

Pulled By: janeyx99

fbshipit-source-id: df868dec5f9b289d3837e927d2bb95acb2d9185b
2021-08-06 11:34:17 -07:00
c9b5d79d40 [hotfix] fix BC checker direction (#62901)
Summary:
Fixes the https://github.com/pytorch/pytorch/issues/62687 error: the BC checker should allowlist entries whose datetime is newer than today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62901

Reviewed By: zhouzhuojie

Differential Revision: D30163202

Pulled By: walterddr

fbshipit-source-id: b882975a231249137cb2d252f41e98e133b6f337
2021-08-06 11:29:28 -07:00
59d09b148c BUG Fixes bug in no_batch_dim tests (#62726)
Summary:
The way that Python captures variables for lambdas meant that only the last `input_fn`, etc. were captured. This PR makes sure each local variable is bound by value in its lambda.

REF: https://docs.python.org/3/faq/programming.html#why-do-lambdas-defined-in-a-loop-with-different-values-all-return-the-same-result
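
A minimal sketch of the pitfall and the fix:

```python
# Every lambda created in the loop sees the *final* value of the loop variable.
fns = [lambda: i for i in range(3)]
print([f() for f in fns])  # [2, 2, 2]

# Binding the current value as a default argument captures it per iteration.
fns = [lambda i=i: i for i in range(3)]
print([f() for f in fns])  # [0, 1, 2]
```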

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62726

Reviewed By: zou3519

Differential Revision: D30159478

Pulled By: jbschlosser

fbshipit-source-id: cfef3d9776d2676b2f5bb6d39d569b8ca07b0fe5
2021-08-06 11:11:25 -07:00
a03604c610 Set JOB_BASE_NAME consistently for bazel (#62886)
Summary:
It was previously set manually (and incorrectly) to pytorch-linux-xenial-py3.6-gcc7-bazel-test-test, which is inconsistent with the rest of our naming scheme.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62886

Reviewed By: driazati

Differential Revision: D30159860

Pulled By: janeyx99

fbshipit-source-id: 4984ec04ee2bcf68b9a57e241ca9f979bfe6398a
2021-08-06 11:07:03 -07:00
3f09485d7e [WIP] Gate DistributedOptimizers on RPC availability (#62774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62774

Gates DistributedOptimizer, which relies on RRef, based on whether RPC is available. This should enable ZeRO to work with Windows, as Windows should not try to import the DistributedOptimizer. If this works as expected, we can enable the Windows tests for functional/local SGD optimizers as well.
ghstack-source-id: 135216642

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D30117838

fbshipit-source-id: e6365a910a3d1ca40d95fa6777a7019c561957db
2021-08-06 10:59:00 -07:00
1dba329d20 Enable step_param for Adam functional optimizer (#62611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62611

Enables optimizer overlap with backwards in DDP for Adam. Additional optimizers, especially Adagrad, will be done in follow-up diffs.

1. Implement a `step_param` method based on `step` in _FunctionalAdam (perf permitting, we can later dedupe `step` to call `step_param`).
2. Modify tests to test all current functional optimizers.
ghstack-source-id: 135207143

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29891783

fbshipit-source-id: 321915982afd5cb0a9c2e43d27550f433bff00d1
2021-08-06 10:53:55 -07:00
836b2431dc [quant] Input-Weight Equalization - selective equalization (#61916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61916

Functions used to run selective equalization based on the SQNR obtained from running the Numeric Suite. After running the Numeric Suite between the equalized and float model, we will get the SQNR between the two models and construct an equalization_qconfig_dict that specifies to only equalize the layers with the highest quantization errors.

How to run:
```
layer_to_sqnr_dict = get_layer_sqnr_dict(float_model, equalized_model, input)
eq_qconfig_dict = get_equalization_qconfig_dict(layer_to_sqnr_dict, equalized_model, num_layers_to_equalize)

prepared = prepare_fx(float_model, qconfig_dict, eq_qconfig_dict)
...
```

Test Plan:
`python test/test_quantization.py TestEqualizeFx.test_selective_equalization`

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29796950

fbshipit-source-id: 91f0f8427d751beaea32d8ffc2f3b8aa8ef7ea95
2021-08-06 09:29:03 -07:00
e6ef87001c [BF16] Add BF16 support to _aminmax and _aminmax_all operators (#62767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62767

Add BF16 support to _aminmax_all and _aminmax operators.

Test Plan:
Added unit test:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/2533274857208373/

Reviewed By: anjali411

Differential Revision: D30073837

fbshipit-source-id: 9cb4991e644cfdb2f0674ccaff161d223c174150
2021-08-06 08:56:12 -07:00
56ff996386 [vulkan] Add _reshape_alias (#62858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62858

D29792126 (adb73d3dcf) changed the behaviour of `reshape()` such that it calls `_reshape_alias()` instead of `view()` in order to avoid duplicating some work such as computing strides.

Vulkan has not yet implemented `_reshape_alias()` so `reshape()` would fail with

```
C++ exception with description "Could not run 'aten::_reshape_alias' with arguments from the 'Vulkan' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
```

For Vulkan there is no concept of strides so it's fine to just have `_reshape_alias()` point to `view()`.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: kimishpatel

Differential Revision: D30054706

fbshipit-source-id: 770979fa3a0f99bcc2ddaefa4674e5bd79b17c03
2021-08-06 08:44:15 -07:00
5f4207eb91 [vulkan] Throw an exception if device does not support Vulkan (#62859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62859

If the Vulkan instance cannot be initialized successfully (i.e. no `vkPhysicalDevice` could be found due to missing drivers) then Vulkan ops will not be able to execute. However, currently `api::context()` which is used to access the global Vulkan context simply returns a null pointer if there is a problem initializing the Vulkan instance.

This leads to Segmentation Faults later on because Vulkan ops assume that `api::context()` will not return a `nullptr`. For instance: [this line](https://www.internalfb.com/code/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Persistent.cpp?lines=14) will frequently cause a Segmentation Fault when drivers are not present.

Instead of having `api::context()` return a nullptr when Vulkan cannot be initialized, it should just throw an exception, since ops cannot be executed anyway. This results in a more graceful failure, as these exceptions can be caught instead of crashing the app with a Seg Fault down the line.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

On an Omni model portal, I can also remove the vulkan drivers in order to test the functionality when Vulkan is not supported.

Reviewed By: kimishpatel

Differential Revision: D30139891

fbshipit-source-id: 47fcc8dcd219cb78ab9bec0b6a85b2aa7320ab50
2021-08-06 08:42:26 -07:00
d3bdf345cb Introducing DataChunk for DataPipes batching (#62768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768

This is part of TorchArrow DF support preparation, separating it to multiple PRs to simplify review process.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30149090

Pulled By: VitalyFedyunin

fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
2021-08-06 08:38:33 -07:00
5e5de75f4d Add getPyInterpreter() API (#62659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62659

It turns out that it is occasionally useful to be able to access the
PyInterpreter object from other Python bindings (see next diff in the
stack).  Make it publicly available.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30074926

Pulled By: ezyang

fbshipit-source-id: 2f745ab7c7a672ed7215231fdf9eef6af9705511
2021-08-06 08:23:24 -07:00
27135f86fd fix docstring default value of last_epoch for SWALR in torch/optim/swa_utils (#62799)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62799

Reviewed By: zou3519

Differential Revision: D30131929

Pulled By: H-Huang

fbshipit-source-id: 741c077073bbe398492dff0761836acdbba7be78
2021-08-06 08:15:10 -07:00
9573e7a644 rename namespace f4d to velox (#61)
Summary:
Pull Request resolved: https://github.com/facebookexternal/torchdata/pull/61

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62860

Pull Request resolved: https://github.com/facebookexternal/presto_cpp/pull/453

Moving all namespace definitions, declarations, and references from 'f4d' to 'velox'.

Test Plan:
```
buck build //f4d/...
buck test //f4d/...
```
Also monitor the signals from Sandcastle.

Reviewed By: pedroerp

Differential Revision: D30140136

fbshipit-source-id: 5b53ac768bb7e5cd07c93a9b04dfd6363080eb52
2021-08-05 21:04:36 -07:00
e1f81c9321 [torchelastic][multiprocessing] Print warning message only when child processes are stuck (#62823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62823

The diff makes sure that the warning message is printed only when the child processes are stuck after the termination code has been sent.

Test Plan:
sandcastle

    buck build mode/dev-nosan //caffe2:run
    buck-out/gen/caffe2/run.par --nnodes 1 --nproc_per_node 1 main.py
P435691445

Differential Revision: D30046695

fbshipit-source-id: c59170b297f4a0e530906fa5069234303deee938
2021-08-05 19:57:31 -07:00
f6c7081a16 Allow FractionalMaxPool 2D and 3D layers to accept 0 dim batch size tensors. (#62083)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

Allow `FractionalMaxPool` 2D and 3D layers to accept 0-dim batch sizes; see the sketch below. Also make some minor corrections to error messages to make them more informative.
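
As with similar 0-dim-batch fixes, a quick sketch (sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

pool = nn.FractionalMaxPool2d(kernel_size=2, output_size=(4, 4))
x = torch.empty(0, 3, 8, 8)   # zero-sized batch
print(pool(x).shape)          # torch.Size([0, 3, 4, 4])
```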

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62083

Reviewed By: H-Huang

Differential Revision: D30134461

Pulled By: jbschlosser

fbshipit-source-id: 0ec50875d36c2083a7f06d9ca6a110fb3ec4f2e2
2021-08-05 17:40:10 -07:00
8aa12cbf86 Add tutorial link (#62785)
Summary:
Addresses: https://github.com/pytorch/pytorch/pull/62605#discussion_r681380364

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62785

Test Plan: I checked the render, and the link redirects as desired.

Reviewed By: mrshenli

Differential Revision: D30133229

Pulled By: andwgu

fbshipit-source-id: baefe0d1f1b78ece44bb42e67629bc130dbf8e9a
2021-08-05 17:28:02 -07:00
64c54f92ca [opinfo] nn.functional.unfold (#62705)
Summary:
Reference: facebookresearch/functorch#78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62705

Reviewed By: H-Huang

Differential Revision: D30138807

Pulled By: zou3519

fbshipit-source-id: 1d0b0e58feb13aec7b231c9f632a6d1694b9d272
2021-08-05 17:12:25 -07:00
9ac56ef0fc [DDP] log gradient ready order and bucket indices (#62751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62751

This will help us determine whether the gradient ready order and bucket
indices are aligned across all ranks. This should always be true for rank 0,
since the rebuilt bucket order is determined by the gradient ready order on
rank 0, but we would be interested to see this on different workloads for the other ranks.
ghstack-source-id: 135104369

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30111833

fbshipit-source-id: a0ab38413a45022d953da76384800bee53cbcf9f
2021-08-05 16:36:25 -07:00
80091cb0f7 [DDP] Allow tuning of first bucket (#62748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748

Previously, after buckets were rebuilt, the first bucket size always
defaulted to 1MB; this diff allows the first bucket to be tuned like the rest
of the bucket sizes.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
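
A minimal sketch of how the knob named above might be used. `_DEFAULT_FIRST_BUCKET_BYTES` is a leading-underscore internal, so treat this as illustrative rather than a stable API; it is assumed the value must be set before the model is wrapped:

```
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group has already been initialized via
# dist.init_process_group(...).
dist._DEFAULT_FIRST_BUCKET_BYTES = 4 * 1024 * 1024  # 4MB instead of the 1MB default
model = DDP(torch.nn.Linear(1024, 1024))            # knob applies when buckets are (re)built
```
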
ghstack-source-id: 135074696

Test Plan: CI

Reviewed By: SciPioneer, wanchaol

Differential Revision: D30110041

fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
2021-08-05 16:35:07 -07:00
5c431981b5 OpInfo for adaptive_avg_pool2d (#62704)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

Note regarding sample inputs for this function:

* Checks added for all relevant/interesting cases for `output_size`: `(None, None), (None, width), (height, None), (height, width)`.
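
A small sketch enumerating those four `output_size` cases (`F.adaptive_avg_pool2d` treats a `None` entry as "keep that input dimension"):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
for output_size in [(None, None), (None, 4), (4, None), (4, 4)]:
    out = F.adaptive_avg_pool2d(x, output_size)
    print(output_size, tuple(out.shape))
# (None, None) -> (1, 3, 8, 8); (4, 4) -> (1, 3, 4, 4); etc.
```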

cc: mruberry zou3519 Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62704

Reviewed By: H-Huang

Differential Revision: D30138788

Pulled By: zou3519

fbshipit-source-id: 66735ceaa85b9e6050d4ec27749fc3a8108cf557
2021-08-05 16:11:31 -07:00
eaaceea8d4 Bump protobuf version in CircleCI docker images (#62441)
Summary:
Needed in order to update ONNX to 1.10 (https://github.com/pytorch/pytorch/issues/62039), since that release introduces uses
of the "reserved" protobuf feature.

Also:
* Remove protobuf install code from scripts where it was unused.
* Add `-j` flag to make commands to speed things up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62441

Reviewed By: soulitzer

Differential Revision: D30072381

Pulled By: malfet

fbshipit-source-id: f55a4597baf95e3ed8ed987d6874388cab3426b0
2021-08-05 15:46:12 -07:00
e62189ad69 [jit] Better checking for overload function declarations. (#59956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59956

Issue #50175. Basically, two things need to be checked and are currently lacking:
1. Overload declarations should always have a single `pass` statement as the body.
2. There should always be an implementation provided for declarations that don't
   have the torch.jit._overload decorator. So in this case we need to check
   whether the function body we are compiling has a decorated declaration ahead of it.

Test Plan:
python test/test_jit.py TestScript.test_function_overloads

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D29106555

fbshipit-source-id: 2d9d7df2fb51ab6db0e1b726f9644e4cfbf733d6
2021-08-05 14:21:48 -07:00
63fa53d37a Add batched model to torchdeploy examples (#62836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62836

Used for upcoming diff that adds support for batching to torchdeploy

Test Plan: Models are used by later diffs, but generation script is verified by CI now and locally.

Reviewed By: gunchu

Differential Revision: D30135938

fbshipit-source-id: 566a32a3ede56833e41712025e9d47191dfc5f39
2021-08-05 14:01:40 -07:00
c8eda919a4 test, fix sparse * dense exceptions and corner case (#61723)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59916

This fixes two problems with sparse multiplication
- 0d-dense * sparse was creating a non-sparse output and failing.
- dense * sparse or sparse * dense is not supported, but would emit an unhelpful error message
<details>
<summary> unhelpful error message </summary>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: Could not run 'aten::_nnz' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_nnz' is only available for these backends: [SparseCPU, SparseCUDA, SparseCsrCPU, SparseCsrCUDA, BackendSelect, Python, Named, Conjugate, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

SparseCPU: registered at aten/src/ATen/RegisterSparseCPU.cpp:961 [kernel]
SparseCUDA: registered at aten/src/ATen/RegisterSparseCUDA.cpp:1092 [kernel]
SparseCsrCPU: registered at aten/src/ATen/RegisterSparseCsrCPU.cpp:202 [kernel]
SparseCsrCUDA: registered at aten/src/ATen/RegisterSparseCsrCUDA.cpp:229 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:38 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:118 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:60 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:11202 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:10254 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:446 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:285 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
</details>

Also added tests.
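
A minimal sketch of the fixed corner case, assuming the post-fix behavior described above:

```
import torch

s = torch.tensor([[0.0, 2.0], [3.0, 0.0]]).to_sparse()
scalar = torch.tensor(2.0)  # 0-dim dense tensor

out = s * scalar            # previously failed; now produces a sparse result
print(out.is_sparse)        # True
print(out.to_dense())       # tensor([[0., 4.], [6., 0.]])
```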

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61723

Reviewed By: ezyang

Differential Revision: D29962639

Pulled By: cpuhrsch

fbshipit-source-id: 5455680ddfa91d5cc9925174d0fd3107c40f5b06
2021-08-05 11:27:12 -07:00
8d7786ada6 Simplify hardswish ONNX export graph. (#60080)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60080

Reviewed By: suo

Differential Revision: D30002939

Pulled By: SplitInfinity

fbshipit-source-id: 8b4ca6f62d51b72e9d86534592e3c82ed6608c9d
2021-08-05 11:15:14 -07:00
7630f407cc add OpInfo for torch.nn.functional.grid_sample (#62311)
Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62311

Reviewed By: malfet

Differential Revision: D30013388

Pulled By: zou3519

fbshipit-source-id: 0887ae9935923d928bfeb59054afe1aab954b40b
2021-08-05 10:43:54 -07:00
5dbcd5638b OpInfo for nn.functional.avg_pool2d (#62455)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78

cc: mruberry zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62455

Reviewed By: soulitzer

Differential Revision: D30096146

Pulled By: heitorschueroff

fbshipit-source-id: ef09abee9baa5a9aab403201226d1d9db5af100a
2021-08-05 10:28:52 -07:00
878943c64f Preserve memory layout when aten batchnorm is used (#62773)
Summary:
https://github.com/pytorch/pytorch/issues/62594

CC cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62773

Reviewed By: H-Huang

Differential Revision: D30118658

Pulled By: cpuhrsch

fbshipit-source-id: bce9e92f5f8710c876a33cccbd1625155496ddea
2021-08-05 10:21:44 -07:00
d45291613c [pruner] generalize bias hook for conv2d (#62430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62430

The bias hook is a forward hook that is part of the pruning parametrization; it is attached after the activation reconstruction forward hook, so adding the bias occurs after zeros are reinserted into the pruned activation.

This diff/PR amends the bias hook to work for Conv2d layers, in addition to Linear layers. The reshaping of the ._bias parameter ensures that it is added to the right dimension of the output.
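
A rough sketch of the reshaping idea; names like `_bias` stand in for the parametrization's stored parameter and this is not the actual pruner code:

```
import torch

def bias_hook(module, inputs, output):
    # Re-add the detached bias after the pruned activation is reconstructed.
    # For Conv2d outputs of shape (N, C, H, W), the bias must be reshaped so
    # it broadcasts along the channel dimension rather than the last one.
    if isinstance(module, torch.nn.Conv2d):
        return output + module._bias.reshape(1, -1, 1, 1)
    return output + module._bias  # Linear: broadcasts over the last dim
```
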
ghstack-source-id: 135097700

Test Plan:
Added tests for `Conv2dB()`, a model with Conv2d layers that have `bias=True`.

`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1MfgL

Reviewed By: jerryzh168

Differential Revision: D29979571

fbshipit-source-id: c1a7e9fabc8b3c9d0050bd6b6c6a631ddfdf2a68
2021-08-05 09:27:17 -07:00
b524a1101a ns for fx: add ref_node_target_type (#62685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62685

Adds a `ref_node_target_type` field to hold the string type
of the base node. This is needed because in some cases
the previous node does not match ref_node (if we have observers,
or if we are logging inputs), and it is useful to know the type
of ref_node.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D30082947

fbshipit-source-id: 98ded7b25a5d8d5ea820e0ef62c3799b65c3fc77
2021-08-05 09:26:10 -07:00
b96acb7591 Allow disabled tests to be re-enabled with IGNORE_DISABLED_ISSUES (#62686)
Summary:
Part 1 of fixing https://github.com/pytorch/pytorch/issues/62359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62686

Test Plan:
1. Check out this PR and run `python setup.py install`.
2. The test we will be running requires CUDA. If you don't have CUDA, you can try this on another device or simply comment out the skipIf statement before the `test_jit_cuda_extension` test in `test_cpp_extensions_jit.py`
3. Run: `IN_CI=1 python test/run_test.py -i test_cpp_extensions_jit -- -k test_jit_cuda_extension` and notice that it should skip. If it doesn't skip, edit test/.pytorch-disabled-tests.json: modify the platforms list of the first issue (61655) to include whatever platform you are on (macos or linux), and just run `python test/test_cpp_extensions_jit.py -v -k test_jit_cuda_extension --import-disabled-tests` to make sure it skips.
4. Now `export PYTORCH_IGNORE_DISABLED_ISSUES=61655` or `export PYTORCH_IGNORE_DISABLED_ISSUES=34952,61655`.
5. `rm test/.pytorch-*` to clear the cached files.
6. Run the same command as in step 3 and note that it SHOULDN'T skip. It should run.

Reviewed By: walterddr, samestep

Differential Revision: D30108773

Pulled By: janeyx99

fbshipit-source-id: dbf015a266f57577dc9283b0cdff720083b5c0cb
2021-08-05 09:05:40 -07:00
24a2681358 Revert D30094460: [profiler] Re-enable test on Windows
Test Plan: revert-hammer

Differential Revision:
D30094460 (5a1017be97)

Original commit changeset: 80521f6bc136

fbshipit-source-id: 7c01493ad078be7df1bbb81c08be6364d6ffaa4d
2021-08-05 08:34:15 -07:00
0c8ed042f2 Revert D30095246: [pytorch][PR] Enable ncclAvg for reductions
Test Plan: revert-hammer

Differential Revision:
D30095246 (a749180e4e)

Original commit changeset: d3a3475345fa

fbshipit-source-id: 34b5100b925859461296cae5a717a70e5eca6af6
2021-08-05 07:56:40 -07:00
6d896cb545 Update faq.rst so OOM section mentions checkpoint (#62709)
Summary:
This FAQ has a section on CUDA OOMs with a long list of don'ts, which limits modeling solutions. Deep nets can blow up memory during training due to output caching.
It's a known problem with a known solution: trade off compute for memory via checkpointing.
The FAQ should mention it.
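
For reference, a minimal sketch of that trade-off using `torch.utils.checkpoint`:

```
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(32, 1024, requires_grad=True)

# Activations inside `block` are not stored for backward; they are recomputed
# during the backward pass, trading compute for memory.
y = checkpoint(block, x)
y.sum().backward()
```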

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62709

Reviewed By: nairbv

Differential Revision: D30103326

Pulled By: ezyang

fbshipit-source-id: 3a8b465a7fbe19aae88f83cc50fe82ebafcb56c9
2021-08-05 07:40:08 -07:00
b84885cc8b Add support for boxed functors (#62658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62658

Boxed functors, like their unboxed brethren, support operators which
aren't just a function pointer, but a function pointer with some
associated global state that is allocated at registration time.

The use case I have in mind with this implementation is "dispatcher
API from Python", where the extra state kernel registrations need is
the PyObject callable we will invoke to do the actual invocation.
See next PR in this stack.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D30074925

Pulled By: ezyang

fbshipit-source-id: ee040edbbec1e607486d338d1ea78bb5c6b2ece9
2021-08-05 07:26:09 -07:00
e6a227465b Add serialization support for slots and subclass getstate/setstate (#62745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62745

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30113112

Pulled By: albanD

fbshipit-source-id: 6c562d0c060fb0280e5e3d432bb42fb833e6d500
2021-08-05 06:49:44 -07:00
056b147e10 clean torch_function handling in serialization (#62744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62744

The `Tensor._reduce_ex_internal` function can only be called via the `Tensor.__reduce_ex__` function.
And that second function already properly handles the `__torch_function__` overwrites. So no need to handle them again in `Tensor._reduce_ex_internal`.

This PR also updates `Tensor.__reduce_ex__` to use the specialized unary API for `__torch_function__` that makes it nicer to read.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30113113

Pulled By: albanD

fbshipit-source-id: c94f5d2597ee3afe799d9de991f75615c3c172d6
2021-08-05 06:48:26 -07:00
ee82e7a14e [DDP Communication Hook] Renaming C++ calls to match python API closer (#62735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62735

Renamed the following
1. getTensor -> getBuffer
2. getTensorRef -> getBufferRef
3. setTensor -> setBuffer
and all associated private variables as well

Reviewed By: SciPioneer

Differential Revision: D30069124

fbshipit-source-id: fa8f1f8a7f3255e6242973bc37b3f7b2731af55d
2021-08-05 05:06:29 -07:00
64b3ab6407 Improve IMethod::getArgumentNames to deal with empty argument names list (#62782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62782

This diff improves IMethod::getArgumentNames to deal with an empty argument names list.

Test Plan:
buck test mode/dev caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesValidationMode
buck test mode/dev caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesRealMode

Reviewed By: wconstab

Differential Revision: D30038175

fbshipit-source-id: 46f08dda94187160b4d6ee87600d1b46fe934222
2021-08-05 01:32:00 -07:00
019048b3b6 [PyTorch Edge] Simplify Exception Handling (Take-2) (module.cpp) (#62634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62634

Apply the same set of changes as in D27688352 (d728491fc1) to `module.cpp` as instructed by xcheng16.

Basically, this simplifies exception handling and allows propagation of the original message undisturbed to the caller so that we can figure out the lineage of the exception in crash tasks such as t96812652
ghstack-source-id: 134877012

Test Plan: Build/Sandcastle

Reviewed By: raziel

Differential Revision: D30038867

fbshipit-source-id: 8dfd415c510bcd0ab49814f4eb559ec6fc8f72e5
2021-08-04 23:25:30 -07:00
4b68801c69 Enable test_api IMethodTest in OSS (#62521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62521

This diff did the following few things to enable the tests:
1. Exposed IMethod as TORCH_API.
2. Linked torch_deploy to test_api if USE_DEPLOY == 1.

Test Plan:
./build/bin/test_api --gtest_filter=IMethodTest.*

To be noted, one needs to run `python torch/csrc/deploy/example/generate_examples.py` before the above command.

Reviewed By: ezyang

Differential Revision: D30055372

Pulled By: alanwaketan

fbshipit-source-id: 50eb3689cf84ed0f48be58cd109afcf61ecca508
2021-08-04 21:14:20 -07:00
a749180e4e Enable ncclAvg for reductions (#62303)
Summary:
[ncclAvg](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html?highlight=ncclavg#c.ncclAvg) is a new `ncclRedOp_t` that fuses a div-by-world-size with ncclAllReduce, Reduce, or ReduceScatter. This PR adds support.

This PR and https://github.com/pytorch/pytorch/pull/62140 lay the foundation for DDP to allreduce+average grad tensors in place with a single nccl call, without additional memory pass(es) to flatten, average, or unflatten. I'll write the necessary DDP changes once this PR and https://github.com/pytorch/pytorch/pull/62140 land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62303

Reviewed By: soulitzer

Differential Revision: D30095246

Pulled By: rohan-varma

fbshipit-source-id: d3a3475345fafb0ab265c11d36db74d7c5613a0a
2021-08-04 19:43:50 -07:00
4bd54cebe0 Refinement types and unification for symbolic shape inference (#61776)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61776

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29772537

Pulled By: migeed-z

fbshipit-source-id: 3555d43152a213087c64faa326432f1628eb3bb1
2021-08-04 17:34:29 -07:00
a27a0b1ef5 [SR] Disable NNC temporarily (#62746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62746

Disable NNC temporarily until a code cache is implemented to reduce the compilation time.

Reviewed By: ajyu

Differential Revision: D30080326

fbshipit-source-id: ef8bb3ac3a6947614f4a03a3d52774b6933d3ea8
2021-08-04 17:33:07 -07:00
afc1d1b3d6 Fix lint errors in cuda_ReportMemoryUsage tests (#62778)
Summary:
Introduced in 8bbcef5096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62778

Reviewed By: chaekit, driazati

Differential Revision: D30120245

Pulled By: malfet

fbshipit-source-id: 2cb5755b870182dd147a6685c74f7defcc10030a
2021-08-04 17:26:23 -07:00
658540f43f remove deprecated is_deterministic and set_deterministic (#62158)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62158

Reviewed By: mruberry

Differential Revision: D29909634

Pulled By: ezyang

fbshipit-source-id: ccffbcf8f378e39bd2c7fbeace7ed1cbbe003981
2021-08-04 16:45:23 -07:00
a705b8f08f OpInfo for nn.functional.relu (#62076)
Summary:
See https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62076

Reviewed By: soulitzer

Differential Revision: D30013262

Pulled By: zou3519

fbshipit-source-id: 7df5e930d1588146e09cf58c53c8860392da7348
2021-08-04 15:50:18 -07:00
123be6b261 Port addcdiv to structured kernels. (#62319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62319

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29961996

Pulled By: bdhirsh

fbshipit-source-id: d38141476b41dbfd4bf029d631f81a32aff82a5e
2021-08-04 15:35:25 -07:00
693b0af996 Port addcmul kernels to structured kernels. (#62318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62318

Tracking issue: #55070

This PR introduces the method `TensorIteratorBase::build_ternary_op` for building a
`TensorIteratorBase` for a 3-input, 1-output kernel.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29961997

Pulled By: bdhirsh

fbshipit-source-id: 2208d24823bad6e74c8d508f363716d8125b8619
2021-08-04 15:34:01 -07:00
8bbcef5096 Report more information for memory profiling (#61282)
Summary:
Report the pointed-to memory size, total allocated memory, and total reserved size all in one report.

`ptr` and `alloc_size` will be used for associating with op trace.
`allocated_size`, `reserved_size` will be used for memory trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282

Reviewed By: ejguan

Differential Revision: D29796282

Pulled By: chaekit

fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
2021-08-04 15:03:14 -07:00
0aee9c0ef8 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D30097148

fbshipit-source-id: 514c22ea52f048bb048a53fa6b5ea57f3ac12250
2021-08-04 14:58:29 -07:00
aed01a991d Add hasattr to torch::deploy interface and hasMethod to PredictorContainer (#62669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62669

Useful to avoid having to implement null checking on the application side.

Test Plan: Add unit tests

Reviewed By: suo, houseroad

Differential Revision: D30074406

fbshipit-source-id: 881aec735953b43cb24786c1a2d79e8e724928b8
2021-08-04 14:48:34 -07:00
281737ea6f [DDP Communication Hook] Rename 4 Methods of GradBucket Class
Summary:
1. getPerParameterTensors -> getGradients
2. getModelParamsForBucket -> getParameters
3. isTheLastBucketToAllreduce -> IsLast

Test Plan:
Test results for "buck test mode/dev-nosan caffe2/test/distributed:c10d":
https://pxl.cl/1Mrq8

Test results for "buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork":
https://pxl.cl/1MrtP

Reviewed By: SciPioneer

Differential Revision: D30076436

fbshipit-source-id: 0bd1e410186a318ea6328f4c1e830ea5632f8a47
2021-08-04 14:37:23 -07:00
7f1b672b7a Revert D29952381: [Static Runtime] Ensure that unittests only use out variants or native ops
Test Plan: revert-hammer

Differential Revision:
D29952381 (8737e17af2)

Original commit changeset: e60e70b80ccf

fbshipit-source-id: 59dc2f920b7ceaf94ba8f5f36024e7cc710f6645
2021-08-04 14:25:11 -07:00
491d89da1b .github: Fix --no-build-suffix (#62739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62739

The original flag didn't initially work correctly, so this makes it actually
output the right thing.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D30107694

Pulled By: seemethere

fbshipit-source-id: 5ff28d6820b9cf7145dbb617b86a941bf7686b5c
2021-08-04 14:19:38 -07:00
de94034328 Fixes #62636 (#62670)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62636.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62670

Reviewed By: ezyang

Differential Revision: D30102179

Pulled By: soulitzer

fbshipit-source-id: 38480463ef354f2c12ed83e6678aed26b0b96efe
2021-08-04 13:58:21 -07:00
8e35df0bf3 det_backward: return svd path for double backward (so that all ci tests pass) (#62570)
Summary:
Potentially fixes https://github.com/pytorch/pytorch/issues/62327 and fixes https://github.com/pytorch/pytorch/issues/62328.
This PR replaces the double backward of det from eig to svd. The latter is slower but should be more stable.

CC anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62570

Reviewed By: pbelevich

Differential Revision: D30072876

Pulled By: anjali411

fbshipit-source-id: c91b507dbfd6a3ec47dc6d0b0dcfa5f8c8228c30
2021-08-04 13:43:51 -07:00
6f0abba04c [fix] manual_seed{_all}: mem leak (#62534)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/55768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62534

Reviewed By: nairbv

Differential Revision: D30103294

Pulled By: ezyang

fbshipit-source-id: d871ae869314dfd2d27544a51107ab752abfe452
2021-08-04 13:03:12 -07:00
89f898ebb5 Fix wrong command in README.md (#62472)
Summary:
If it is `[15^,16^)`, 16.10 is not included.
https://github.com/Microsoft/vswhere/wiki/Examples

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62472

Reviewed By: nairbv

Differential Revision: D30103199

Pulled By: ezyang

fbshipit-source-id: 82085627ca53cd5a4e666848d27d4ab062de8352
2021-08-04 12:55:18 -07:00
b454275f47 Support eager mode use of torch.jit.isinstance with multiple types (#60465)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60095
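
A minimal sketch of the eager-mode usage this enables, assuming the tuple-of-types form from the linked issue:

```
from typing import List, Tuple

import torch

def describe(x):
    # Behaves the same whether this function runs eagerly or is scripted.
    if torch.jit.isinstance(x, (List[int], Tuple[int, int])):
        return "container of ints"
    return "something else"

print(describe([1, 2, 3]))  # container of ints
print(describe("hello"))    # something else
```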

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60465

Reviewed By: soulitzer

Differential Revision: D30093110

Pulled By: ansley

fbshipit-source-id: ee9c654bdb031e9eff4837f9f1d489c81e47cc06
2021-08-04 12:45:24 -07:00
5a1017be97 [profiler] Re-enable test on Windows (#62703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62703

Re-enable test on Windows

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D30094460

Pulled By: ilia-cher

fbshipit-source-id: 80521f6bc1365d2c252f20b5d0485fc062c8d9c3
2021-08-04 12:32:24 -07:00
8737e17af2 [Static Runtime] Ensure that unittests only use out variants or native ops (#62335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335

This change ensures that unittests only use out variants or native ops.

- Our unittests currently assume that a graph fed to the static runtime correctly replaces an interpreter op for its corresponding out variant / native op, but it's not checked by the unittest. This change ensures that.

- We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that.

- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op  = true` since they are written to use interpreter ops by design.

Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.

Reviewed By: mikeiovine, hlu1

Differential Revision: D29952381

fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
2021-08-04 11:37:15 -07:00
de77c6a0eb [BE] fix bc check (#62687)
Summary:
A bug was discovered in https://github.com/pytorch/pytorch/issues/62434: for some reason, comparing just the schema name didn't match the allow_list item. So:
1. remove the duplicate regex compile
2. use the full schema string instead of just the name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62687

Reviewed By: ezyang

Differential Revision: D30102437

Pulled By: walterddr

fbshipit-source-id: 541b2ed77948f24daebb08623cadabb034a241e0
2021-08-04 11:00:22 -07:00
0a66416767 Rename master to main for test-infra references (#62728)
Summary:
Reacting to the main->master switch in test-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62728

Reviewed By: samestep

Differential Revision: D30104777

Pulled By: janeyx99

fbshipit-source-id: a7af7dfc69fd6e02c30ad6c15808a5b32a68c587
2021-08-04 10:45:47 -07:00
90ba71f841 Automated submodule update: FBGEMM (#62688)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 10ec0d3388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62688

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D30088109

fbshipit-source-id: da8a1e6232e489eac0384faadb71c2dfac5927f7
2021-08-04 10:40:50 -07:00
8bcf01631a [ROCm] update magma (#62502)
Summary:
Update magma to point to magma_ctrl_launch_bounds branch.
When the upstream magma branch is used, cholesky tests in test_ops.py and test_linalg.py
fail due to "Intel MKL ERROR: Parameter 4 was incorrect on entry to DPOTRF."
Suspect commit: [35325212b15c5baadd7493d61b19b2db2635cb68](35325212b1) in magma master.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62502

Reviewed By: malfet

Differential Revision: D30089171

Pulled By: seemethere

fbshipit-source-id: b07234ce66d48e3af113640995f923ee586b3cd9
2021-08-04 10:19:55 -07:00
dfdc3069e7 Revert D30072994: [pytorch][PR] [6/n Update test rpc path
Test Plan: revert-hammer

Differential Revision:
D30072994 (ad4e1f1132)

Original commit changeset: 3217e764bd85

fbshipit-source-id: cf89df78a4e04ef03b04ec3c253c5cbb1a1f5f63
2021-08-04 10:14:31 -07:00
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor() in the Python-exposed API over the C++ logic with buffer() and set_buffer(.) for a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
4b47ea9446 adding a skip for ROCm for a flaky test (#62664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62664

Skipping a test for ROCm because of issue #62602

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30079534

Pulled By: NivekT

fbshipit-source-id: a9cf35e5d3a8d218edc9c5a704d1f9599d2f38a6
2021-08-04 07:29:06 -07:00
d1c85d2c06 Move ASAN tests to clang-7 (#62663)
Summary:
This should avoid the following false positives:
```
[ RUN      ] ProtoTest.Basic
/var/lib/jenkins/workspace/build/third_party/onnx/onnx/onnx_onnx_torch-ml.pb.h:7060:15: runtime error: member call on address 0x7fffffffdd80 which does not point to an object of type 'google::protobuf::MessageLite'
0x7fffffffdd80: note: object is of type 'onnx_torch::ModelProto'
 00 00 00 00  b0 b9 05 ef ff 7f 00 00  00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'onnx_torch::ModelProto'
 UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/build/third_party/onnx/onnx/onnx_onnx_torch-ml.pb.h:7060:15 in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62663

Reviewed By: tktrungna

Differential Revision: D30076315

Pulled By: malfet

fbshipit-source-id: 7bfc2c4b417307195e3c3379e4874eaceb4f3134
2021-08-04 06:26:03 -07:00
773a8eede4 [profiler][refactor] Refactor the usage of legacy profiler implementation (#61931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61931

This PR consolidates the profiling code around a new C++ implementation
(profiler_kineto.h/cpp) and uses it unconditionally from
torch.autograd.profiler/torch.profiler:
1. Always use profiler_kineto.h/cpp as the C++ implementation
2. Simplify profiler.py to remove unneeded parts depending on legacy
impl
3. Move some of the legacy logic into profiler_legacy.py (to be fully
deleted later)

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v
USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py -v

Imported from OSS

Reviewed By: gdankel

Differential Revision: D29801599

fbshipit-source-id: 9794d29f2af38dddbcd90dbce4481fc8575fa29e
2021-08-03 18:51:29 -07:00
5830f122f1 Add docstrings for save_on_cpu hooks (#62410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62410

This PR adds docstrings for CPU hooks introduced in #61928.

Also uncomments the warning about pinned memory in CUDA semantics docs.

Depends on: #62361.

For now docstrings are an orphan page at https://docs-preview.pytorch.org/62410/generated/torch.autograd.graph.set_save_on_cpu_hooks.html#torch-autograd-graph-set-save-on-cpu-hooks

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29990129

Pulled By: Varal7

fbshipit-source-id: 7a98eeee6a0abb11e2c2d9169cd1aa35ad7ba3f4
2021-08-03 17:53:45 -07:00
5542d590d4 [EZ] Fix type of functional.pad default value (#62095)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62095

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29879898

Pulled By: jamesr66a

fbshipit-source-id: 903d32eca0040f176c60ace17cadd36cd710345b
2021-08-03 17:47:20 -07:00
d7d399f3df Exposes _aminmax as aminmax and makes it structured (#62401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62401

This PR exposes the `torch._aminmax` operator as `torch.aminmax`.

**TODO**

- [x] add examples to documentation
- [x] add minmax to rst docs

fixes https://github.com/pytorch/pytorch/issues/62164
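
A minimal sketch of the newly public operator, based on the documented `torch.aminmax` behavior:

```
import torch

x = torch.tensor([[1.0, -2.0], [3.0, 0.0]])

result = torch.aminmax(x)             # min and max over all elements in one pass
print(result.min, result.max)         # tensor(-2.) tensor(3.)

mins, maxs = torch.aminmax(x, dim=0)  # per-column reduction
print(mins, maxs)                     # tensor([1., -2.]) tensor([3., 0.])
```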

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D30072246

Pulled By: heitorschueroff

fbshipit-source-id: 557d30af7c28ca6c238c59122367104036429ecd
2021-08-03 16:10:43 -07:00
92f470da08 Revert D30070707: [pytorch][PR] [5/n] Update test distribute path
Test Plan: revert-hammer

Differential Revision:
D30070707 (d8849bdb03)

Original commit changeset: c45f07b7b548

fbshipit-source-id: 867019e95b2898346ba2d918fa7a7291c8125efd
2021-08-03 16:00:56 -07:00
18eeccc7e8 [mypy] Fix Optional type check (#62668)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62668

Test Plan: Imported from OSS

Reviewed By: malfet, 842974287

Differential Revision: D30077960

Pulled By: IvanKobzarev

fbshipit-source-id: 5e423bfb65a65974ed848caa177330d6e61452e6
2021-08-03 16:00:55 -07:00
5a49abfaf1 Revert "Revert D29940705: [fx2trt] Dynamic shape inference support" (#62667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62667

This reverts commit 053e11f1b39b50fcd7aa7cdd272f7775c7a5e1ba.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D30077961

Pulled By: IvanKobzarev

fbshipit-source-id: a7e76b2d2fa79e6c42a6a87f0a13f62642591fee
2021-08-03 15:59:40 -07:00
34f50c6e35 [Static Runtime] testStaticRuntime verifies that # of nodes is at least 2 (#62622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622

This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. More than 2 ops does not guarantee that the memory planner is being exercised, but fewer than 2 guarantees that it is not being used.

Reviewed By: hlu1

Differential Revision: D30058050

fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b
2021-08-03 15:55:57 -07:00
2bddaf6149 Revert D30072859: [pytorch][PR] [4/n] Update vulkan test path
Test Plan: revert-hammer

Differential Revision:
D30072859 (1630b86dd6)

Original commit changeset: bf75faabf6b6

fbshipit-source-id: 3e2672bd19544ed3f1e2a951eb02d58f5c2f9d52
2021-08-03 15:28:04 -07:00
ad4e1f1132 [6/n Update test rpc path (#62526)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62380

* update the `test_rpc` function to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add IN_WHEEL_TEST to limit the change to linux CI GHA only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62526

Test Plan: check if all ci workflows pass

Reviewed By: walterddr, seemethere

Differential Revision: D30072994

Pulled By: tktrungna

fbshipit-source-id: 3217e764bd859dc2db597d24a1abb5ec1d0e8c9e
2021-08-03 15:26:54 -07:00
c48dfe0d9f .github: Enable SSH to linux runners (#62280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62280

Enables SSH to linux GHA runners for FB employees while on the FB VPN

SSH keys will be added to runners when the label "with-ssh" is applied to
your pull request.

Depnds on https://github.com/fairinternal/pytorch-gha-infra/pull/8

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99, soulitzer

Differential Revision: D29941681

Pulled By: seemethere

fbshipit-source-id: 9d291f4291eb1d814d4a3473f7daf7f6951ad724
2021-08-03 15:15:39 -07:00
9beb279d84 Add context manager to save tensors on CPU (#61928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61928

Fix #57100.
Creates a function `torch.autograd.graph.set_save_on_cpu_hooks()` which can be used to register default hooks under which all tensors saved during the forward pass are actually copied* to cpu, then copied back to the appropriate device for the backward pass.

*If the tensor was already on cpu, the entire operation is a no op.

If the tensor is on GPU, we copy the tensor to `pin_memory` during packing so that the unpacking can be done asynchronously.

See [benchmark](https://github.com/pytorch/pytorch/pull/61928#issuecomment-885089279) and [note about training large models](https://github.com/pytorch/pytorch/pull/61928#issuecomment-887009448)
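
A minimal sketch using the function name from this PR; later releases may expose this surface differently, so treat the call as illustrative:

```
import torch

# Register default hooks so activations saved for backward live on the CPU.
torch.autograd.graph.set_save_on_cpu_hooks()

a = torch.randn(5, requires_grad=True)
b = (a * a).sum()  # a GPU tensor saved here would be offloaded to pinned CPU memory
b.backward()       # and copied back to the original device on demand
```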

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29848526

Pulled By: Varal7

fbshipit-source-id: 3d289cddd4fa377bd4884ba0d569fa47c777d9e5
2021-08-03 13:08:37 -07:00
91ef19309e [quant] Input-weight equalization - branch support (#62366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62366

In the case of models with branches, we are unable to equalize the branching part in the graph.

For example, given this graph:
```
     conv2
    /     \
x -> conv1 -> add
```

After prepare, we will ignore the branched layers (conv1 and conv2) and will not insert the equalization observers. A warning message will also be printed with the layers that are unable to be equalized.
```
                        conv2 -> out_quant_obs2
                       /                       \
x -> input_quant_obs -> conv1 -> out_quant_obs1 -> add
```

Test Plan:
`python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_prepare`

Imported from OSS

Reviewed By: malfet, supriyar

Differential Revision: D29982585

fbshipit-source-id: 706297e7f1861975998dfa83e7ca59af09d80618
2021-08-03 12:45:25 -07:00
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
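
A minimal sketch of using the now-public context manager with DDP; it assumes an initialized process group, and `net` and `loader` are placeholders:

```
import torch
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(net)           # `net` is a placeholder nn.Module
with Join([model]):        # shadows collectives for ranks that exhaust their inputs
    for inputs in loader:  # ranks may see uneven numbers of batches
        loss = model(inputs).sum()
        loss.backward()
```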

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
053e11f1b3 Revert D29940705: [fx2trt] Dynamic shape inference support
Test Plan: revert-hammer

Differential Revision:
D29940705 (6b02ad5f82)

Original commit changeset: 1eab53a8cfd5

fbshipit-source-id: 68150a193df6f11389b14a0e8224e1489b29ff0b
2021-08-03 12:03:42 -07:00
ff31389c21 Cast a few vars to void that are otherwise unused
Summary:
llvm-13 marks this as an error when a variable is set but not used.
Evidently this macro doesn't always expand to using the var.  Work around that
here with void casts.

Test Plan: nfc

Reviewed By: drodriguez

Differential Revision: D30062462

fbshipit-source-id: ff868ec74116da99afd539142996d2ffffd399fb
2021-08-03 11:57:57 -07:00
59dd12042e [nnc] Removed const from all fields in IR. (#62336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336

This PR was generated by removing `const` for all types of nodes in NNC IR, and fixing compilation errors that were the result of this change.

This is the first step in making all NNC mutations in-place.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30049829

Pulled By: navahgar

fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
2021-08-03 11:44:36 -07:00
474d7ec43b [Pytorch Edge] Black Box Compatibility API (#61477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61477

It would be nice if the compatibility api was just kinda plug and play with no care about the internals of the api at all. Thats what this diff aims to provide.

The general usage would be something like
  < On the Client >
  RuntimeCompatibilityInfo runtime_info = get_runtime_compatibility_info();

  .
  .
  .
  < On the Server >
  ModelCompatibilityInfo model_info = get_model_compatibility_info(<model_path>);
  bool compatible = is_compatible(runtime_info, model_info);

Currently RuntimeCompatibilityInfo and ModelCompatibilityInfo are exactly the same, but it seemed feasible to me that they may end up diverging as more information is added to the API (such as a min supported bytecode version being exposed from the runtime).

Test Plan: unit test and ci

Reviewed By: dhruvbird, raziel

Differential Revision: D29624080

fbshipit-source-id: 43c1ce15531f6f1a92f357f9cde4e6634e561700
2021-08-03 11:27:28 -07:00
b7391f44df cast return of cudaGetLastError() to void when discarding (#62518)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518

Reviewed By: walterddr, janeyx99

Differential Revision: D30029858

Pulled By: malfet

fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d
2021-08-03 11:17:22 -07:00
d6048ecd6b Enable bazel builds on ciflow/default (#62649)
Summary:
Add `regenerate.sh` convenience script
Remove "TODO: Reenable on PR" label from workflows which are enabled on PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62649

Reviewed By: seemethere

Differential Revision: D30071905

Pulled By: malfet

fbshipit-source-id: c82134cb676b273d23e225be21166588996a31d3
2021-08-03 11:05:41 -07:00
4d5607bb25 [Reland][DDP] log bucket sizes (#62625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62625

reland of https://github.com/pytorch/pytorch/pull/62232 which ran into a land race.

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D30058217

fbshipit-source-id: 1454dd481e630f3de9ec6111b3f2e18cd8976091
2021-08-03 10:55:46 -07:00
1630b86dd6 [4/n] Update vulkan test path (#62519)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62380

* update the `test_vulkan` function to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add `IN_WHEEL_TEST` to limit the change to `pytorch_linux_test` only
* add symbolic links for shared libraries which are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62519

Test Plan: check if all ci workflows pass

Reviewed By: walterddr

Differential Revision: D30072859

Pulled By: tktrungna

fbshipit-source-id: bf75faabf6b6070c366571a74834a1f58b2549d3
2021-08-03 10:24:47 -07:00
ddd916c210 [quant][refactor] Return the models in checkGraphModeFxOp for further checking (#62487)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62487

checkGraphModeFxOp is our utility test function that quantizes a given model with FX Graph Mode Quantization
and checks whether the resulting model contains the expected ops. Previously it only returned a result on the sample data for the
quantized model; this PR changes it to return the prepared, quantized, and quantized_reference models together with the result
for the quantized model.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30053981

fbshipit-source-id: 31fbce48d138261d0b00ba24e1427fd0c6208990
2021-08-03 10:12:16 -07:00
76c447a730 Remove CUDA10.2 + gcc 9 in CI (#62609)
Summary:
This is an invalid combination because CUDA10.2 does not support gcc>8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62609

Reviewed By: iramazanli

Differential Revision: D30057292

Pulled By: seemethere

fbshipit-source-id: 7cb0fa8401e80297846b0fcb5e0ecaa435b101be
2021-08-03 10:05:16 -07:00
d8849bdb03 [5/n] Update test distribute path (#62520)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62380

* update the `test_distributed` function to use the wheel install folder {sitepackages}/torch instead of the build/ folder
* add IN_WHEEL_TEST to limit the change to linux CI GHA only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62520

Test Plan: check if all ci workflows pass

Reviewed By: soulitzer

Differential Revision: D30070707

Pulled By: tktrungna

fbshipit-source-id: c45f07b7b54857dc8e78405714d6d5b864c30868
2021-08-03 09:52:48 -07:00
6b02ad5f82 [fx2trt] Dynamic shape inference support
Summary:
Add a field called `shape_range` to `inputTensorSpec` which allow user to indicate the range of the input shape.

Make all current converters work with dynamic shape except `layer_norm`. We need to make the layer_norm plugin an `IPluginV2Ext`.

Some ops only have limited dynamic shape support for now:
- "linear": only support at most 1 dynamic dim. We add full support but I'm thinking breaking down linear to matmul + add.
- "adaptive_avgpool`: right now we lower it to trt avgpool which means we need to know the last two dims to calculate parameters like kernel_size, strides, etc. Follow up would be make a plugin for adaptive avgpool. TRTorch already have one, we can borrow it.

Test Plan: Added unit tests for dynamic shape inference for converter tests.

Reviewed By: jackm321

Differential Revision: D29940705

fbshipit-source-id: 1eab53a8cfd5e8db0be57845062e9794578165d1
2021-08-03 09:44:26 -07:00
b7ac286d0e CMake: Add optional precompiled header support (#61940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61940

This adds a `USE_PRECOMPILED_HEADERS` option to the CMake build which
precompiles `ATen.h` and also `CUDAContext.h` for the cuda library.
After making a change in `native_functions.yaml`, this speeds up compilation
time by around 15% on my machine.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D29988775

Pulled By: malfet

fbshipit-source-id: a23c468c958a8b74ebaef052a5b2e5fa3836c64b
2021-08-03 09:13:47 -07:00
2cf4d8128d add OpInfo for torch.nn.functional.mse_loss (#62254)
Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62254

Reviewed By: malfet

Differential Revision: D30013331

Pulled By: zou3519

fbshipit-source-id: e3242cb97d1f061b932e3e0ed589f1ee6a291512
2021-08-03 09:01:09 -07:00
ab8af15545 [Static Runtime] Enabled building Static Runtime tests and benchmarks in OSS CI (#62226)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62226

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29923800

Pulled By: navahgar

fbshipit-source-id: 33cfe0e92a34c7140ea762e5715301cfbf401434
2021-08-03 08:52:36 -07:00
43327cc197 Refactor commonalities between two approaches (#62624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62624

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058543

Pulled By: andwgu

fbshipit-source-id: 73c794062b75e011868fae264f592549eed67482
2021-08-03 08:43:14 -07:00
e6a3967c2a Add invariant check (bucket indices: 0, 1, ..., k-1) (#62623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62623

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058544

Pulled By: andwgu

fbshipit-source-id: a56910f294c6a40118751eebe255b62700f42be9
2021-08-03 08:13:52 -07:00
87465a6e68 adding operator cumulative_trapezoid (#61615)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/61616
* **https://github.com/pytorch/pytorch/issues/61615**
* https://github.com/pytorch/pytorch/issues/61475
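
A minimal sketch of what the new operator computes (a running trapezoidal-rule integral along a dimension), assuming the released `torch.cumulative_trapezoid` signature:

```
import torch

y = torch.tensor([1.0, 2.0, 3.0])
print(torch.cumulative_trapezoid(y))          # tensor([1.5000, 4.0000])
print(torch.cumulative_trapezoid(y, dx=0.5))  # tensor([0.7500, 2.0000])
```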

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61615

Reviewed By: malfet, mruberry

Differential Revision: D29975064

Pulled By: NivekT

fbshipit-source-id: 4d4e98f3efb720fdc44eb238ecbf0fa157ac13d7
2021-08-03 08:04:00 -07:00
b37578b3c0 Make bazel output less verbose in CI (#62601)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62600

Adds `bazel --config=no-tty`, which is useful for less verbose output in environments that don't implement a full tty, like CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62601

Reviewed By: soulitzer

Differential Revision: D30070154

Pulled By: malfet

fbshipit-source-id: 5b89af8441c3c6c7ca7e9a0ebdfddee00c9ab576
2021-08-03 07:59:01 -07:00
3bda4ea842 Avoid unnecessary copying data in Saved Variable (#61927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61927

This is a refactor of `SavedVariable.cpp` to prevent ever defining the `data_` tensor if default hooks are set.

Before the refactor:

```c++
data_ = variable.tensor_data(); // this is wasteful if hooks are defined
register_hooks(Engine::get_default_engine().get_default_saved_variable_hooks());
```

After the refactor:
```c++
if (get_default_hooks_()) {
  save_metadata_(variable);
  register_hooks_(get_default_hooks_(), variable);
  return;
}
save_metadata_(variable);
data_ = variable.tensor_data(); // only needed if hooks are not defined
```

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29848524

Pulled By: Varal7

fbshipit-source-id: abca1eee37a17b47841e28d8a576490913fce1ce
2021-08-03 07:09:47 -07:00
7edb4f8761 Port cumprod kernel to structured kernels. (#61899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61899

Tracking issue: #55070

This PR also removes `at::_cumprod`, which was the "backend" for `at::cumprod`.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29939489

Pulled By: ezyang

fbshipit-source-id: d5e4a6dfa6c79e4b135508ea13c2d11bd0684f63
2021-08-03 06:58:13 -07:00
e52325655a Port cumprod kernel to structured kernels. (#61899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61899

Tracking issue: #55070

This PR also removes `at::_cumprod`, which was the "backend" for `at::cumprod`.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29939152

Pulled By: ezyang

fbshipit-source-id: b3379033a1ffe3c7bc8216d16d089d388ea559ba
2021-08-03 06:57:09 -07:00
c7a7c2b62f Enable Gelu fp32/bf16 in CPU path using Mkldnn implementation (#58525)
Summary:
Enable Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation. Users don't need to call to_mkldnn() explicitly. The new Gelu fp32 performs better than the original one.

Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.
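
A minimal sketch of the user-visible effect, with no explicit to_mkldnn() conversion needed (the internal MKL-DNN dispatch is as described above):

```
import torch
import torch.nn.functional as F

x = torch.randn(16, dtype=torch.bfloat16)  # plain CPU tensor
y = F.gelu(x)                              # MKL-DNN path is used internally
print(y.dtype)                             # torch.bfloat16
```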

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525

Reviewed By: ejguan

Differential Revision: D29940369

Pulled By: ezyang

fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
2021-08-03 06:52:23 -07:00
fd8004b42e add bfloat16 impl for nextafter (#61829)
Summary:
Add `BFloat16` support for `nextafter`.

* [x] Add OpInfo
* [x] Add Implementation Test (C++ tests)
* [x] Add credit
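
A minimal sketch of the newly supported dtype:

```
import torch

a = torch.tensor([1.0], dtype=torch.bfloat16)
b = torch.tensor([2.0], dtype=torch.bfloat16)
# The smallest bfloat16 value strictly greater than 1.0, toward 2.0
print(torch.nextafter(a, b))
```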

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61829

Reviewed By: ejguan

Differential Revision: D29932498

Pulled By: mruberry

fbshipit-source-id: 89524531a4800569ba1addd08a4ace330a6f72a4
2021-08-02 23:16:58 -07:00
2888b7fec5 Fix sign comparison (#62483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62483

Test Plan: Sandcastle

Reviewed By: albanD

Differential Revision: D30015385

fbshipit-source-id: eefc3208fb8c42ff46b9f4d910eb93c32595fa28
2021-08-02 22:50:39 -07:00
a77be16538 TensorAccessor::bounds_check should be a CPU-only function (#62628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62628

This fixes following errors when ROCm compiler is used
```
caffe2/aten/src/ATen/core/TensorAccessor.h:160:5: error: throw is prohibited in AMP-restricted functions
    TORCH_CHECK_INDEX(
    ^
```

Test Plan: CI

Reviewed By: zhouzhuojie, seemethere

Differential Revision: D30059737

fbshipit-source-id: d094ee608768db41fcc91d044c2c6d7d29f33fe4
2021-08-02 22:46:24 -07:00
e0364ccc33 [caffe2] break one circular dependency between Caffe2 and ATen-cpu (#62632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62632

Update the caffe2/core/context.h to directly use `at::mt19937` instead of the
`at::CPUGeneratorImpl` wrapper class from the ATen-cpu library.

Using `at::CPUGeneratorImpl` causes circular dependencies between the ATen and
caffe2 code.  In particular the `at::CPUGeneratorImpl::get_state()` logic
depends on CPU Tensor functionality that currently depends on code from
caffe2.

Test Plan:
The RNG behavior should be identical to the previous code (perhaps even
faster, since we now avoid virtual function calls).

  buck test //caffe2/caffe2:caffe2_test_cpu \
    //caffe2/caffe2/python: //caffe2/caffe2/fb/operators:

Differential Revision: D29915701

fbshipit-source-id: f9b2eab8d3b21b2224d30bcf52be9c0e7eb7cd0a
2021-08-02 22:40:56 -07:00
88af4d8441 Initialize RRefs only when explicitly asked for. (#62618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62618

ShardedTensor implicitly initialized RRefs to remote shards if the
RPC framework was initialized. However, there are use cases where the RPC
framework might be initialized for a different purpose, and users would
prefer that ShardedTensor not initialize RRefs as well.

As a result, I've made RRef initialization explicit in ShardedTensor APIs.
ghstack-source-id: 134889287

Test Plan:
1) waitforbuildbot
2) unit tests.

Reviewed By: wanchaol

Differential Revision: D30056833

fbshipit-source-id: 9b2433a38dafa1888589c5b72ed93b6f0ee51639
2021-08-02 22:17:17 -07:00
b58e04f156 Make sure FindLAPACK finds the same BLAS library (#49647)
Summary:
BLAS library is found by cmake/Dependencies.cmake and then
LAPACK library is found by FindLAPACK.cmake which in turn calls
FindBLAS.cmake. This means that we are searching for BLAS twice
and they might be different things. By setting a few variables,
this can be avoided.

cc seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49647

Reviewed By: seemethere, ejguan

Differential Revision: D29943680

Pulled By: malfet

fbshipit-source-id: 3cbc350ea645a1a28dd92c19e5ee7f9eecdeff59
2021-08-02 20:41:00 -07:00
2d038b5dc8 Cast a var to void that is unused
Summary: The comment above makes it seem intentional, so just ignore it.

Test Plan: NFC

Reviewed By: smeenai

Differential Revision: D30057632

fbshipit-source-id: 45929b4eeeefdf22f5c7c4dd603229635f9da31b
2021-08-02 19:56:41 -07:00
c4196bee93 Save some memory in scatter (#62516)
Summary:
Also removes some redundant parentheses for clarity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62516

Reviewed By: andwgu

Differential Revision: D30030546

Pulled By: SciPioneer

fbshipit-source-id: e106486f70b9590bf3dcffb76d23f5725737542f
2021-08-02 18:41:58 -07:00
10d3a2c13a [tensorexpr] Added logging info for SimplifierUnderContext (#62138)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62138

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29891257

Pulled By: huiguoo

fbshipit-source-id: c36b3d615fa2fe971d022111bef61ee843a9dbea
2021-08-02 18:38:55 -07:00
3a592730d5 [nnc] Simplify i%100 to i if i is less than 100; fixed #52580 (#60693)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60693

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29375938

Pulled By: huiguoo

fbshipit-source-id: 1388729c5b93805cb156efa53e8823d5462885bf
2021-08-02 18:38:54 -07:00
8f7ae77040 [nnc] Add context-sensitive simplification for div/mod (#60688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60688

Test Plan: Imported from OSS

Reviewed By: navahgar, ZolotukhinM

Differential Revision: D29373313

Pulled By: huiguoo

fbshipit-source-id: 90d7f2fbfce583b0ea3b0f1c7899e22b0210bd62
2021-08-02 18:37:39 -07:00
c07a123b26 Support saving and loading ShardedTensor. (#62242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62242

1) Add a state_dict hook to ensure ShardedTensors are
added to a state_dict.
2) Add a pre load state_dict hook to ensure ShardedTensor are added back to a
module at load time.
3) Add a `with_load_process_group` context manager for load time.
4) Added ser-de capability to ShardedTensor.
ghstack-source-id: 134860967

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D29927881

fbshipit-source-id: b1ef8872ed91e9cb0e2d5dd17d2764678ab89f0c
2021-08-02 18:33:19 -07:00
dd23372aa5 .circleci: Prefix intermediate build image tags (#62610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62610

Prefixes intermediate build image tags with build- so that ECR lifecycle
policies can automatically clean them up

Policy to automatically clean up images prefixed with `build-`: b02dd818f9

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D30055952

Pulled By: seemethere

fbshipit-source-id: 328b9c94ffc02877d088d0118a19c732f580838b
2021-08-02 18:17:14 -07:00
525fa2f0b6 [reland] Catch saved tensors default hooks race condition (#62564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62564

If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.

Relanding previous PR #61957

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30045406

Pulled By: Varal7

fbshipit-source-id: d04f74c99affbbf655e53cfc2acd42f7c5b4e6eb
2021-08-02 18:00:37 -07:00
f5cf24a224 Fix lint in test_deploy_from_python.py (#62626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62626

Reviewed By: walterddr, zhouzhuojie, seemethere

Differential Revision: D30059119

Pulled By: malfet

fbshipit-source-id: 2aff44c1585091d864ab7e02d69046204e5b5d17
2021-08-02 17:55:24 -07:00
615ac8e573 Added logic for notifying PTE webapp for Nightly and PR builds (#62512)
Summary:
This PR adds the logic to notify the PTE webapp for DevOps PyTorch Nightly and PR builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62512

Reviewed By: iramazanli

Differential Revision: D30046165

Pulled By: malfet

fbshipit-source-id: ef7e4848d4db9f38536a647fcd2d8e26cf64b960
2021-08-02 16:44:35 -07:00
db071ef005 [Reland][DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592

Reland #62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity (a usage sketch follows the list):
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
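A hedged sketch of a communication hook written against the renamed accessors (the logging and the delegation to the stock allreduce hook are illustrative, not part of this diff):

```
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def logging_allreduce_hook(process_group, bucket):
    # Renamed accessors: index() was get_index(), is_last() was
    # is_the_last_bucket_to_allreduce(), gradients() was
    # get_per_parameter_tensors(), parameters() was get_model_params_for_bucket().
    if bucket.is_last():
        print(f"bucket {bucket.index()}: {len(bucket.gradients())} grads, "
              f"{len(bucket.parameters())} params")
    # Delegate the actual communication to the built-in allreduce hook.
    return default_hooks.allreduce_hook(process_group, bucket)
```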
ghstack-source-id: 134848352

Test Plan: unit test

Reviewed By: andwgu

Differential Revision: D30049431

fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
2021-08-02 16:38:09 -07:00
d228a8fc94 [Vulkan] Softmax Along Channel Dim (#62239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62239

Added naive implementation of vulkan softmax (not using shared memory)

Based off of naive implementation of mean, found here:

2565a33c98/aten/src/ATen/native/vulkan/glsl/mean.glsl

Test Plan:
After building:

```
build/bin/vulkan_api_test
```

{F637001190}

```
[ RUN      ] VulkanAPITest.softmax
[       OK ] VulkanAPITest.softmax (180 ms)
```

Reviewed By: SS-JIA

Differential Revision: D29793150

fbshipit-source-id: 4f9d8e1dae8a43cbcb7063b095fa4726df06c929
2021-08-02 16:20:44 -07:00
940cbbce76 Add BFloat16 support to CPU nansum (#61083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61083

It's already supported on CUDA, so it seems reasonable to support it on CPU as
well. This also changes `test_nansum` to compare against `torch.sum`, since NumPy
doesn't support BFloat16. Note that `test_nansum_vs_numpy` still checks against
NumPy, so that path is still being tested.
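A quick illustration of the newly supported path (values illustrative):

```
import torch

x = torch.tensor([1.0, float("nan"), 2.0], dtype=torch.bfloat16)
torch.nansum(x)  # tensor(3., dtype=torch.bfloat16), now also on CPU
```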

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30006227

Pulled By: heitorschueroff

fbshipit-source-id: 1449730e1936417e7de1f8b3cf8cdcc15518873c
2021-08-02 16:03:57 -07:00
27d3d3a7d7 deploy in python fix to work in @opt mode
Summary: If we let torch_deploy get put in libomnibus, it hides the symbols we need to link against.

Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy_from_python -- --exact 'caffe2/torch/csrc/deploy:test_deploy_from_python - test_deploy_from_python (caffe2.torch.csrc.deploy.test_deploy_from_python.TestDeployFromPython)' --run-disabled

Reviewed By: wconstab

Differential Revision: D30031134

fbshipit-source-id: e5c2f740f17abafec7d01c57c664bd71a00b6f61
2021-08-02 14:47:49 -07:00
a4af91b2fe Cleanup CUDA 10.1 and 10.0 support on CI (#62597)
Summary:
10.1 is removed in https://github.com/pytorch/pytorch/pull/56056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62597

Reviewed By: walterddr

Differential Revision: D30053902

Pulled By: seemethere

fbshipit-source-id: deb148e5e44c12b08c267a36fbd4a1afa138e6e4
2021-08-02 14:42:25 -07:00
305d5fcc05 [Pytorch Edge] get_model_bytecode int -> uint (#62201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62201

Change int to uint to match the type used by the runtime's bytecode. This only affects C++, since Python doesn't have unsigned ints. Also changed the behavior of the functions from returning -1 with a warning to throwing an exception; I wasn't sure what the proper behavior here would be (returning UINT_MAX seemed gross), so feedback is appreciated.

Test Plan: ci

Reviewed By: raziel

Differential Revision: D29914072

fbshipit-source-id: 1bb08702fc301d7c7612b5ad7205a6dbe855c890
2021-08-02 14:17:44 -07:00
0c4c37b01e Disable libtorch testing on MacOS (#62599)
Summary:
Fixes regression introduced by https://github.com/pytorch/pytorch/issues/62402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62599

Reviewed By: walterddr, driazati

Differential Revision: D30051914

Pulled By: malfet

fbshipit-source-id: a07184b21cc4b2d0ae31fe385bb58a0f665595c6
2021-08-02 13:41:18 -07:00
093495d3f0 [fx] prevent implicit submodule inlining when submodule is a GraphModule (#62436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62436

## Problem

Given two modules and a tracer that indiscriminately marks all modules as a leaf:
```
import torch
import torch.fx

class InnerModule(torch.nn.Module):
    def forward(self, t):
        return t + t

class MyModule(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, t):
        x = self.inner(t)
        y = self.inner(t)
        return x + y

class MyTracer(torch.fx.Tracer):
    def is_leaf_module(self, module, name):
        return True
```

One might generally expect the following behavior (note call_module nodes):
```
print(">> Outer GraphModule (with inner module as nn.Module):")
inner = InnerModule()
m = MyModule(inner)
gm = torch.fx.GraphModule(m, MyTracer().trace(m))
print(gm.graph.print_tabular())

>> Outer GraphModule (with inner module as nn.Module):
opcode         name     target                   args              kwargs
-------------  -------  -----------------------  ----------------  --------
placeholder    t        t                        ()                {}
call_module    inner    inner                    (t,)              {}
call_module    inner_1  inner                    (t,)              {}
call_function  add      <built-in function add>  (inner, inner_1)  {}
output         output   output                   (add,)            {}
None
```

However, when the inner module is first symbolically traced, the symbolic trace of the outer module ignores `is_leaf_module` entirely, and traces through the whole module (note call_function nodes).
```
print(">> Inner module as GraphModule:")
inner = InnerModule()
inner_gm = torch.fx.GraphModule(inner, MyTracer().trace(inner))
print(inner_gm.graph.print_tabular())

print(">> Outer GraphModule (with inner module as GraphModule):")
m = MyModule(inner_gm)
gm = torch.fx.GraphModule(m, MyTracer().trace(m))
print(gm.graph.print_tabular())

>> Inner module as GraphModule:
opcode         name    target                   args    kwargs
-------------  ------  -----------------------  ------  --------
placeholder    t       t                        ()      {}
call_function  add     <built-in function add>  (t, t)  {}
output         output  output                   (add,)  {}
None

>> Outer GraphModule (with inner module as GraphModule):
opcode         name    target                   args          kwargs
-------------  ------  -----------------------  ------------  --------
placeholder    t       t                        ()            {}
call_function  add     <built-in function add>  (t, t)        {}
call_function  add_1   <built-in function add>  (t, t)        {}
call_function  add_2   <built-in function add>  (add, add_1)  {}
output         output  output                   (add_2,)      {}
None
```

This is surprising behavior and at first glance violates the tracer's intent. As I understand it, `torch.fx.symbolic_trace.Tracer.trace` intends to patch `torch.nn.Module.__call__` with a `module_call_wrapper()` that records a `call_module` node if the module is a leaf, else executes `torch.fx.symbolic_trace._orig_module_call = torch.nn.Module.__call__`, which is set at module load time.

**Every submodule should be a leaf, but no `call_module` nodes are created when that submodule is a `GraphModule`. Why?**

Upon further inspection, I found:

- The constructor for GraphModule includes a path to `GraphModule.recompile()` via the setter for a `fx.Graph`:
```
inner_gm = torch.fx.GraphModule(inner, MyTracer().trace(inner))

File "/torch/fx/graph_module.py", line 252, in __init__
self.graph = graph

File "/torch/nn/modules/module.py", line 1183, in __setattr__
object.__setattr__(self, name, value)

File "/torch/fx/graph_module.py", line 277, in graph
self.recompile()
```
- `recompile()` wraps the `__call__` method by holding a reference to the `__call__` method at the time of recompilation:
```
cls = type(self)
cls_call = cls.__call__
...
def wrapped_call(self, *args, **kwargs):
    try:
        return cls_call(self, *args, **kwargs)
    except Exception as e:
        ...
cls.__call__ = wrapped_call
```
- Recompilation of the inner GraphModule happens on initialization, before creation or tracing of the outer module. Adding some old-fashioned print debug statements gives:
```
Inner Module:
_orig_module_call: <function Module._call_impl at 0x7faaebfee8b0>
recompile: cls.__call__ now wraps _orig_module_call, <function Module._call_impl at 0x7faaebfee8b0>

Outer Module:
_orig_module_call: <function Module._call_impl at 0x7faaebfee8b0>
tracing: patching method <class 'torch.nn.modules.module.Module'>.__call__ <function Module._call_impl at 0x7faaebfee8b0> with <function Module._call_impl at 0x7fa9d42bce50>

outer module MRO before tracing:
(0) <class '__main__.MyModule'>: <function Module._call_impl at 0x7faaebfee8b0>
(1) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7faaebfee8b0>
(2) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>

outer module MRO during tracing:
(0) <class '__main__.MyModule'>: <function Module._call_impl at 0x7fa9d42bce50>
(1) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7fa9d42bce50>
(2) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>

inner module MRO before tracing:
(0) <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>: <function x.y.z.wrapped_call at 0x7fa9d42a8670>
(1) <class 'torch.fx.graph_module.GraphModule'>: <function Module._call_impl at 0x7faaebfee8b0>
(2) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7faaebfee8b0>
(3) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>

inner module MRO during tracing:
(0) <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>: <function x.y.z.wrapped_call at 0x7fa9d42a8670>
(1) <class 'torch.fx.graph_module.GraphModule'>: <function Module._call_impl at 0x7fa9d42bce50>
(2) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7fa9d42bce50>
(3) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>
```

- The outer module is patched correctly, but the inner module's first element in its MRO is the `wrapped_call` from `recompile` that still invokes `<function Module._call_impl at 0x7faaebfee8b0>` directly. Therefore, no call_module nodes are created.

## In Practice

In practice, this behavior affects the ability of `torch.package` to package `GraphModules` whose submodules are `GraphModules`. In our case, the `GraphModule` submodules are not passed through a constructor, but created separately and installed on the root `GraphModule` via `setattr`. This means that prior to packaging, there appear to be no issues with the module, since the root's graph was created before any call_module targets were replaced with `GraphModules`.

When unpackaging such a model with `torch.package`, `torch.fx.graph_module._deserialize_graph_module` uses an inline `KeepModules` tracer that sets all submodules to leaves; the unpackaged module is implicitly and surprisingly inlined in the process.

## Potential Solution

This behavior was previously not understood by us, and so the current workaround is a gnarly process of wrapping all submodules with a `nn.Module` with a manually installed forward method.

Changing `wrapped_call` to return `return super(type(self), self).__call__(*args, **kwargs)` whenever `__call__` is inherited at least appears to solve the issue. Does this seem like an acceptable approach?
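Concretely, a hedged sketch of that change against the `recompile()` fragment quoted above (the inheritance check is spelled out for clarity; the real guard may differ):

```
cls = type(self)
# Capture __call__ only if it is defined on this class itself.
cls_call = cls.__call__ if "__call__" in vars(cls) else None

def wrapped_call(self, *args, **kwargs):
    try:
        if cls_call is not None:
            return cls_call(self, *args, **kwargs)
        # __call__ is inherited: defer to the MRO so that tracer-time
        # patches of torch.nn.Module.__call__ are actually observed.
        return super(type(self), self).__call__(*args, **kwargs)
    except Exception as e:
        raise e

cls.__call__ = wrapped_call
```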

## Other Thoughts
- Repeated calls to `recompile` create nested `wrapped_calls`, all for the purpose of error handling. This seems probably unnecessary ¯\\_(ツ)\_/¯
- If a root module with an overridden `__call__` method is symbolically traced, it is ignored

Test Plan:
```
buck test:
    ✓ ListingSuccess: caffe2/test:fx - main (12.570)
    ✓ Pass: caffe2/test:fx - test_tracing_graphmodules_as_leaf_submodules (test_fx.TestFX) (11.982)
```

Reviewed By: ansley

Differential Revision: D29997935

fbshipit-source-id: 1988fbb025b14188da26a3e73e94fb789c3c1f74
2021-08-02 13:37:08 -07:00
dc1bd6acee Remove PROCESS GROUP rpc backend (#62411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62411

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29990408

Pulled By: H-Huang

fbshipit-source-id: 183d3b316767b12993cebbe32b73c2850fd1cc42
2021-08-02 12:26:22 -07:00
2ec4f69b48 [DDP Comm Hook] Do not expose hook_then_optimizer as a public method (#62532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532

This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as stable.
ghstack-source-id: 134787831

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl

Reviewed By: rohan-varma

Differential Revision: D30031222

fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
2021-08-02 12:25:01 -07:00
b161ac541d [reland] Add default Saved Variable hooks (#62563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62563

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename, then unpack simply reads the content of the file and outputs a tensor, e.g.:

```
import os
import tempfile
import uuid
import torch

tmp_dir = tempfile.mkdtemp()  # scratch directory for offloaded tensors

def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
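And a hedged end-to-end sketch using those hooks with the new functions (tensor shapes illustrative):

```
import torch

x = torch.randn(5, requires_grad=True)
torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
y = (x * x).sum()   # the tensor saved for backward is packed to disk
torch.autograd.graph.reset_saved_tensors_default_hooks()
y.backward()        # unpack() reloads it from disk on demand
```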

Relanding previous PR: https://github.com/pytorch/pytorch/pull/61834

Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc

Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98

The difference with the new version is we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30045405

Pulled By: Varal7

fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
2021-08-02 11:30:26 -07:00
6f95850127 Revert D30024161: [DDP Communication Hook] Rename 4 Methods of GradBucket Class
Test Plan: revert-hammer

Differential Revision:
D30024161 (29c8b1db57)

Original commit changeset: 07e6072a2f7b

fbshipit-source-id: d571c2caadaf7b71fe2aba3c0597bd8074d153de
2021-08-02 10:26:54 -07:00
2e4f566d30 add OpInfo for torch.nn.functional.softplus (#62317)
Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62317

Reviewed By: malfet

Differential Revision: D30013322

Pulled By: zou3519

fbshipit-source-id: e80affd10b81534234694c9e4326cc68c7efc7fe
2021-08-02 09:46:13 -07:00
cb626da145 [fix] mark non-differentiable ops (#62529)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62506
Fixes https://github.com/pytorch/pytorch/issues/62504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62529

Reviewed By: albanD

Differential Revision: D30032665

Pulled By: malfet

fbshipit-source-id: 90254c50fb4a873e3eda59c8484626137e01cb31
2021-08-02 09:40:45 -07:00
562b555a2b [CUDA] Fix typo in Normalization.cu (#62515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62515

**Summary**
This commit fixes an obvious typo in `Normalization.cu` I found while
working on #62452. Since that PR will not be landed anytime soon, I
thought it would be prudent to land this fix.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: makslevental

Differential Revision: D30027324

Pulled By: SplitInfinity

fbshipit-source-id: 9d368a54c13f8e2bf6f6956dfb2bee974226f726
2021-08-02 09:38:46 -07:00
29c8b1db57 [DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters

Test Plan:
Ran the comprehensive tests locally with the following results:
https://pxl.cl/1Ml8b
The two timeout failures are most likely environment-related and fail on my devserver.

Reviewed By: SciPioneer

Differential Revision: D30024161

fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
2021-08-02 09:33:32 -07:00
34cb2b5d04 Update SobolEngine docstring w/ correct behavior (#62548)
Summary:
Sobol was modified to not drop the first point; this update reflects that behavior in the docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62548

Reviewed By: qingfeng10

Differential Revision: D30035627

Pulled By: Balandat

fbshipit-source-id: 64c659ea30c0c929778da3b58041875834e25e9a
2021-08-02 09:04:38 -07:00
2445d5c60a Removed the hypothesis tests for adaptive_avg_pool (#60886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60886

Remove all the hypothesis tests from test_adaptive_avg_pool2d_nhwc, test_adaptive_avg_pool, and test_adaptive_avg_pool3d_ndhwc.

Test Plan: I tested it with buck test //caffe2/test:quantization and all three tests passed. The tests that failed are test_conv2d_api (test_quantized_functional.py) and test_conv3d_api (test_quantized_functional.py).

Reviewed By: wanchaol, jerryzh168

Differential Revision: D29432184

fbshipit-source-id: 2a4c540d2c169aec69cf8d143d5a155394885745
2021-08-02 08:57:14 -07:00
3dc588d577 Fix: no enough space for cu102 debug nightly build (#62465)
Summary:
Fixes #{issue number}
![image](https://user-images.githubusercontent.com/16190118/127632173-783630b7-c644-4239-b1dd-fb12b6bacf83.png)

verification:
https://app.circleci.com/pipelines/github/pytorch/pytorch/358483/workflows/a34a0123-54fe-418f-9211-4af75ee56a70/jobs/15120463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62465

Reviewed By: iramazanli

Differential Revision: D30045280

Pulled By: janeyx99

fbshipit-source-id: f40090eb02fd1d86033971611d492c7b107cc4bd
2021-08-02 08:44:16 -07:00
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
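A rough usage sketch under the description above (module paths and the warm-up loop are assumptions on my part, not taken from this diff; assumes the process group is already initialized):

```
import torch
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(torch.nn.Linear(10, 10).cuda(), device_ids=[0])
zero = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    overlap_with_ddp=True,
    lr=1e-3,
)
model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, model, zero))

for _ in range(4):  # the first two iterations are vacuous warm-up steps
    model(torch.randn(8, 10, device="cuda")).sum().backward()
    zero.step()     # optimizer work actually originates from the comm hook
```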

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
ee482edf0a Callable activation function support for Transformer modules (C++) (#62342)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60747

Enhances the C++ versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying activation function still works as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62342

Reviewed By: malfet

Differential Revision: D30022592

Pulled By: jbschlosser

fbshipit-source-id: d3c62410b84b1bd8c5ed3a1b3a3cce55608390c4
2021-08-02 08:06:39 -07:00
c9d5325c52 [BE] shorten the name part 1 (#62402)
Summary:
This should address part of https://github.com/pytorch/pytorch/issues/62357.

1. rename all files 'generated-*' to make it clear, filename will not be in CI workflow name
2. remove all 'pytorch-' in names
3. make sure the build test shell scripts are adopted to new name

Next change should reduce more device related naming

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62402

Reviewed By: malfet

Differential Revision: D30021959

Pulled By: walterddr

fbshipit-source-id: 64b21a2020e25a507101c09c010cb593d8ac4146
2021-08-02 07:56:55 -07:00
7565039ee9 Support system-provided Intel TBB (#61934)
Summary:
This PR: (1) enables the use of a system-provided Intel TBB for building PyTorch, (2) removes `tbb:task_scheduler_init` references since it has been removed from TBB a while ago (3) marks the implementation of `_internal_set_num_threads` with a TODO as it requires a revision that fixes its thread allocation logic.

Tested with `test/run_test`; no new tests are introduced since there are no behavioral changes (removal of `tbb::task_scheduler_init` has no impact on the runtime behavior).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61934

Reviewed By: malfet

Differential Revision: D29805416

Pulled By: cbalioglu

fbshipit-source-id: 22042b428b57b8fede9dfcc83878d679a19561dd
2021-08-02 07:39:00 -07:00
bbf6131159 Add factory kwargs test to test_modules (#62340)
Summary:
Adds a new `ModuleInfo`-based test to `test_modules.py`.

The test passes `device` and `dtype` to each module during instantiation, ensuring that the kwargs are applied to any newly-created parameters or buffers. Note that the `device` and `dtype` kwargs should only be present when a module creates parameters or buffers; the test uses some mock magic to identify this.
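For example, a minimal illustration of the kwargs being tested:

```
import torch

# Parameters and buffers are created directly with the requested
# device/dtype rather than being converted after construction.
linear = torch.nn.Linear(4, 2, device="cpu", dtype=torch.float64)
print(linear.weight.dtype)  # torch.float64
```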

Originally lifted from `test/test_module_init.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62340

Reviewed By: malfet

Differential Revision: D30022543

Pulled By: jbschlosser

fbshipit-source-id: 77e5d46d6b11c16dc39d19a1c650ee48c26c54c1
2021-08-02 06:53:00 -07:00
46b18aa294 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D30039182

fbshipit-source-id: 3b38fc89585853bb9a5483a0de9ebd6852154a8d
2021-08-02 04:17:10 -07:00
aa5e3ad705 [quant] Support PerChannel quantization in FusedMovingAvgObsFakeQuantize (#62346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62346

Update the operator code to resize the min/max tensors if per-channel quant is selected. We need to do this because, by default, the observer creates empty tensors for min/max and scale/zero_point values when per-channel quantization is enabled.

Test Plan:
python test/test_quantization.py test_fused_mod_per_channel

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D30003835

fbshipit-source-id: b5ec80261cb50ee543f21191a887e979dcde4667
2021-08-01 21:45:11 -07:00
7adb78017a [contbuild][xplat/caffe2] contbuild with sanitizers (#61724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61724

To improve the stability of xplat/caffe2 code, we are enabling sanitizers (asan, tsan, ubsan) on contbuild.
ghstack-source-id: 134339882

Test Plan:
```
buck test --show-output --flagfile fbsource//fbcode/mode/dev-asan --config fbsource.sanitizer=address fbsource//xplat/pytorch_models/build/pytorch_model_test/v13:body_tracking_v124_test
clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 0/7 artifacts, 0.00 bytes, 100.0% cache miss
Building: finished in 14.5 sec (100%) 4538/4538 jobs, 4 updated
  Total time: 14.5 sec
Testing: finished in 1.1 sec (1 PASS/0 FAIL)
RESULTS FOR //xplat/pytorch_models/build/pytorch_model_test/v13:body_tracking_v124_test
PASS      1.0s  1 Passed   0 Skipped   0 Failed   //xplat/pytorch_models/build/pytorch_model_test/v13:body_tracking_v124_test
TESTS PASSED
```

```
buck test --show-output --flagfile fbsource//fbcode/mode/dev-tsan --config fbsource.sanitizer=thread fbsource//xplat/pytorch_models/build/ads_mai_test_train/v4:model_test
clang-9: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]

Downloaded 3/19 artifacts, 88.30 Kbytes, 66.7% cache miss
Building: finished in 24.0 sec (100%) 4609/4609 jobs, 9 updated
  Total time: 24.9 sec
Testing: finished in 0.9 sec (1 PASS/0 FAIL)
RESULTS FOR //xplat/pytorch_models/build/ads_mai_test_train/v4:model_test
PASS     808ms  1 Passed   0 Skipped   0 Failed   //xplat/pytorch_models/build/ads_mai_test_train/v4:model_test
TESTS PASSED
````

Reviewed By: dhruvbird, albanD

Differential Revision: D29348099

fbshipit-source-id: 3d3255bff0464129745d2ed729d443f3e7470313
2021-08-01 12:02:30 -07:00
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which should now explicitly be a single tensor. The previous API took a list holding a single tensor.

Note that the typing info no longer accepts the internal type `torch._C.Future`, which does not support TorchScript and hence cannot express `Future[torch.Tensor]`.
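A hedged sketch of a hook conforming to the new typing (`bucket.buffer()` as the flattened-gradient accessor and the averaging step are assumptions for illustration):

```
import torch
import torch.distributed as dist

def allreduce_hook(process_group, bucket) -> torch.futures.Future[torch.Tensor]:
    group = process_group if process_group is not None else dist.group.WORLD
    fut = dist.all_reduce(bucket.buffer(), group=group, async_op=True).get_future()
    # Return a Future holding a single tensor, not a single-element list.
    return fut.then(lambda f: f.value()[0] / dist.get_world_size(group))
```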
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
72295da6c3 Reformat (#62456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62456

as title
ghstack-source-id: 134771417

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D30006493

fbshipit-source-id: 1d1dc9cfff69a9b4fa31470177c1f4fa206a94ef
2021-07-30 20:50:19 -07:00
c506769f19 irange-ify 8 (#62422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62422

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29879655

fbshipit-source-id: 69fdf0196091932f866bfaba707ad7643790fdd8
2021-07-30 20:31:58 -07:00
bd9f35313a Revert D29922299: [DDP] log bucket sizes
Test Plan: revert-hammer

Differential Revision:
D29922299 (5429f68f00)

Original commit changeset: 538b331c96e7

fbshipit-source-id: 3595fe04e8dea38bc9d05e8c70f2dcd2ad496ced
2021-07-30 20:27:36 -07:00
9df7ac7a94 Port nll_loss_backward to structured (#62144)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62144

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D29945279

Pulled By: SplitInfinity

fbshipit-source-id: 2fee60e8424fc590a81767c9b0a8226a0c2fd69c
2021-07-30 19:43:10 -07:00
5429f68f00 [DDP] log bucket sizes (#62232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232

Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. Will be used to verify how changing bucket sizes in DDP affects perf.

Based on the test, we can see an inconsistency in where the "first" bucket actually is (it is the last bucket before rebuild_buckets runs, and the first one after).
ghstack-source-id: 134663867

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29922299

fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
2021-07-30 18:07:04 -07:00
63d3da7961 Fix sign comparison (#62194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62194

Reviewed By: albanD

Differential Revision: D29885396

Pulled By: r-barnes

fbshipit-source-id: 8092f3002474a48fc6b349b9e369c8d59e832fcc
2021-07-30 17:18:05 -07:00
2006dc6316 [3/N] Remove unittest.skip from torch/testing/_internal distributed files. (#61991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61991

Continuation of https://github.com/pytorch/pytorch/pull/61887 and
removing unittest.skip as much as possible.
ghstack-source-id: 134759368

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29831860

fbshipit-source-id: fe57a7d56d4423924a2dec10bb670137ace0c9a4
2021-07-30 16:40:43 -07:00
7521addede [deploy] loader cleanup (#62223)
Summary:
Some refactoring of the custom loader logic:

* Make sure we unregister frames when they are deleted so that future exceptions do not attempt to read unallocated memory
* rename linker -> loader to make its name more correct
* move the build of the loader into lib deploy since it can be shared across interpreters
* unify the logic for finding the library symbol across ops and fbcode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62223

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D29922002

Pulled By: zdevito

fbshipit-source-id: b7f8ee5812e29a5d098fcf1bd9f4cea7d30ecb4c
2021-07-30 16:34:13 -07:00
174433267c [dte] fastpath implementation for broadcast utility function (4/x) (#62493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62493

This diff adds a broadcast fastpath for the caffe2 broadcast utility function, which just copies the contents of a smaller tensor into a larger one. We also update the tests to exercise the new functionality.

Test Plan: unit tests + let CI run

Differential Revision: D29938285

fbshipit-source-id: 543ecc548500380e307be91902696033454964a2
2021-07-30 16:15:10 -07:00
08539ca047 Add non-context manager usage support for profiler (#61690)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60238, https://github.com/pytorch/kineto/issues/329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61690

Reviewed By: malfet

Differential Revision: D30016561

Pulled By: ngimel

fbshipit-source-id: 93a578ffbb556f4b584213ac9cfafcc5cf0a9270
2021-07-30 15:54:36 -07:00
6441caeaa7 Use multi-dimensional cuFFT transforms to improve FFT performance (#61203)
Summary:
Benchmark and numerical accuracy tests on A100 and V100 are available at https://github.com/xwang233/code-snippet/tree/master/fft-61203.

I've checked the FFT results for different shapes/dims and different `dim` arg for `rfftn` and `irfftn` before and after this PR, and they all numerically matched.

With this PR, about 10%~15% performance gain is expected on commonly used shapes and dims.
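For reference, a minimal sketch of the kind of call that now maps to a single multi-dimensional cuFFT plan (shapes illustrative):

```
import torch

x = torch.randn(64, 64, 64, device="cuda")
freq = torch.fft.rfftn(x, dim=(-2, -1))                     # one plan for both dims
out = torch.fft.irfftn(freq, s=x.shape[-2:], dim=(-2, -1))  # numerically matches x
```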

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61203

Reviewed By: heitorschueroff

Differential Revision: D29996244

Pulled By: zou3519

fbshipit-source-id: 02c9862eaa1ad8f2ae9c7f7448aeb9e23bcda276
2021-07-30 14:54:04 -07:00
73c46092f1 [pytorch] sort the output of the model_dump util (#62485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62485

Make it easier to browse the code section by sorting the files by name.

Test Plan: Imported from OSS

Reviewed By: dhruvbird, malfet

Differential Revision: D30016245

Pulled By: ljk53

fbshipit-source-id: c9cb3c1ad9bcaa5337a6ad5c575ac0c240751f6c
2021-07-30 14:40:07 -07:00
49060aa81a Revert D29999785: Reland D29943356: .github: Migrate ecr_gc to github actions
Test Plan: revert-hammer

Differential Revision:
D29999785 (49dc827712)

Original commit changeset: bb9285076551

fbshipit-source-id: c26b39fb2d3c361015ce7f205d3f5f4232845289
2021-07-30 14:33:12 -07:00
43d4fe68cd [Foreach] support implicit broadcasting in slow path (#62167)
Summary:
This PR makes foreach functions support implicit broadcasting via the slow path.
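A hedged sketch of the newly accepted input (shapes illustrative):

```
import torch

xs = [torch.randn(3, 4), torch.randn(2, 1, 4)]
ys = [torch.randn(4), torch.randn(2, 5, 4)]   # broadcastable, not same-shape
res = torch._foreach_add(xs, ys)              # now handled via the slow path
```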

rel: https://github.com/pytorch/pytorch/issues/58833

cc: ptrblck  ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62167

Reviewed By: mruberry

Differential Revision: D30005109

Pulled By: ngimel

fbshipit-source-id: f48c0a13e304411763541ffcfcfc6154adb26bac
2021-07-30 13:29:56 -07:00
70f57bcb1e [PyTorch] Fix quantized Conv1d module parameters (#62356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62356

In `torch/nn/quantized/module/conv.py`, Conv1d is making a scalar `kernel_size` into a tuple of size 2 by repeating the `kernel_size` value. This logic breaks `Conv1d` because internally it unsqueezes the input with shape N, C, L to N, C, 1, L in [`qconv.cpp`](06dfaadfc6/aten/src/ATen/native/quantized/cpu/qconv.cpp (L841)). Applying the aforementioned kernel to this input shape will produce a negative output shape in [`ConvUtils.h`](203f7ff6e0/include/fbgemm/ConvUtils.h (L118-L119)) if kernel_size > 1.

Here I'm modifying the processing logic for `kernel_size` and a few other parameters, to follow the pattern of [`torch/nn/module/conv.py`](aae2a3c95e/torch/nn/modules/conv.py (L284-L287)).
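A small sketch of the pattern being adopted (helper names from `torch.nn.modules.utils`; the values are illustrative):

```
from torch.nn.modules.utils import _pair, _single

_pair(3)    # (3, 3) -- the old expansion, wrong once Conv1d unsqueezes N,C,L to N,C,1,L
_single(3)  # (3,)   -- the 1-D expansion used by torch/nn/modules/conv.py
```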

Test Plan: Rely on unit test

Reviewed By: kimishpatel

Differential Revision: D29957556

fbshipit-source-id: ae13f7ca892d60b82cfffdf972cce422ebfaae8e
2021-07-30 12:27:52 -07:00
eac288ea77 [Pytorch Backend Delegation] Annotate function args with type information (#62433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62433

Without type information, default type is Tensor which may conflict at runtime.

Test Plan: CI

Reviewed By: raziel

Differential Revision: D29990902

fbshipit-source-id: 0a38843d7d0612a458bb38fad7c86bad08c7197b
2021-07-30 11:34:40 -07:00
f16c73b9f3 Improve error messages of torch.testing.assert_close for sparse inputs (#61583)
Summary:
This utilizes the feature introduced in https://github.com/pytorch/pytorch/issues/60091 to modify the header of the error message.

Before:

```python
AssertionError: Tensor-likes are not equal!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 1 at index 1
Greatest relative difference: 0.3333333432674408 at index 1

The failure occurred for the values.
```

After:

```python
AssertionError: Sparse COO values of tensor-likes are not equal!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 1 at index 1
Greatest relative difference: 0.3333333432674408 at index 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61583

Reviewed By: malfet

Differential Revision: D30014797

Pulled By: cpuhrsch

fbshipit-source-id: 66e30645e94de5c8c96510822082ff9aabef5329
2021-07-30 11:23:26 -07:00
8a9dfa52e9 Delete an unused variable
Summary: This was set twice but never used. Delete it.

Test Plan: NFC

Reviewed By: smeenai

Differential Revision: D30000794

fbshipit-source-id: 084d16da914febec58c4cb5f452c37027275cd08
2021-07-30 11:10:38 -07:00
73ba166e2a fix(elastic-docs): Fix elastic launch doc (#62378)
Summary:
The documentation link should be https://pytorch.org/docs/stable/elastic/run.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62378

Reviewed By: aivanou

Differential Revision: D30002830

Pulled By: kiukchung

fbshipit-source-id: 34b434acaa10222561df43f6397a2420eef02015
2021-07-30 10:58:13 -07:00
635e63c53d irange-ify 15 (#62123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62123

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29879765

fbshipit-source-id: eda8e641e9fd06e16ad71b8144332f253537955a
2021-07-30 10:41:33 -07:00
3c0c1c4ecb Fix incorrectly sized tensors for svd when full_matrices=False (#62022)
Summary:
Before this PR, for an m x n input matrix the returned matrices were always allocated as m x m and n x n and then narrowed.
This unnecessarily requires a lot of memory that is then discarded.
With this PR, when `compute_uv=True and full_matrices=False`, correctly sized tensors are allocated. Moreover, if `compute_uv=False`, the U and V matrices are not allocated, as they are not needed. However, cusolver's gesvdj routines fail when these matrices are not allocated, which is a bug, so this allocation is done separately in the cusolver-specific code path.
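A minimal illustration of the allocation change (shapes illustrative):

```
import torch

A = torch.randn(8192, 64, device="cuda")
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
print(U.shape)  # (8192, 64) is allocated directly, not narrowed from (8192, 8192)
```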

MAGMA doesn't work for this input because it tries to allocate a large matrix internally (ROCm doesn't work as it uses MAGMA). Example error:
```
CUBLAS error: memory mapping error (11) in magma_sgelqf at /opt/conda/conda-bld/magma-cuda110_1598416697386/work/src/sgelqf.cpp:161
CUBLAS error: out of memory (3) in magma_sgeqrf2_gpu at /opt/conda/conda-bld/magma-cuda110_1598416697386/work/src/sgeqrf2_gpu.cpp:145
CUBLAS error: not initialized (1) in magma_sgeqrf2_gpu at /opt/conda/conda-bld/magma-cuda110_1598416697386/work/src/sgeqrf2_gpu.cpp:145
MAGMA error: function-specific error, see documentation (1) in magma_sgeqrf2_gpu at /opt/conda/conda-bld/magma-cuda110_1598416697386/work/src/sgeqrf2_gpu.cpp:145
MAGMA error: function-specific error, see documentation (1) in magma_sgeqrf2_gpu at /opt/conda/conda-bld/magma-cuda110_1598416697386/work/src/sgeqrf2_gpu.cpp:145
python: /opt/conda/conda-bld/magma-cuda110_1598416697386/work/interface_cuda/interface.cpp:806: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dAarray__ != __null' failed.
Aborted (core dumped)
```

Fixes https://github.com/pytorch/pytorch/issues/61949.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62022

Reviewed By: heitorschueroff

Differential Revision: D29994429

Pulled By: ngimel

fbshipit-source-id: c3f7744d7adc5fd6787f6cbb1ec41405f89a6d4c
2021-07-30 10:27:13 -07:00
26d2f4acb2 Quick fix to make torch.tensor work with functorch (#62423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62423

Fixes https://github.com/facebookresearch/functorch/issues/7.

functorch uses FuncTorchDynamicLayerBackMode as a mode key to wrap all
tensors returned from operators in special TensorWrapper tensor
extension.

The problem with this is that TensorWrapper does not have storage so
accessing the data_ptr (for recursive_store) internal asserts.

As a quick hack, the guard added prevents functorch from wrapping the
empty tensor in a TensorWrapper and instead when `tensor.to` is called later,
the tensor gets wrapped. This is effectively what Ed proposed in
https://github.com/facebookresearch/functorch/issues/7#issuecomment-847501020

In the long term we probably want some better way of extending
`internal_new_from_data` for cases like this (where there is a
mode-based dispatch key for a C++ tensor extension -- the Python case
may be different).

Test Plan: - Verified that this fixes functorch's problem

Reviewed By: malfet

Differential Revision: D29992607

Pulled By: zou3519

fbshipit-source-id: 82b713156a37d7470f8fc46e3803ee7353689a33
2021-07-30 10:15:23 -07:00
8c4d8c29e4 [2/n] add test ATen to wheel test (#62341)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62380

* This PR introduces the env variable IN_WHEEL_TEST to control the dependency on the `build/` folder
* Updates the `test_aten` function to call the wheel install folder `{sitepackages}/torch` instead of the `build/` folder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62341

Test Plan: check if all ci workflows pass

Reviewed By: walterddr

Differential Revision: D30004259

Pulled By: tktrungna

fbshipit-source-id: ccebd513a3530f1e5c8c9177d5f2daf14de3e853
2021-07-30 10:09:09 -07:00
d08165dfdf [fx2trt] Add op converters for ads 23x dense arch
Summary:
Adding 4 converters for
1. torch.addmm
2. torch.mul
3. torch.t
4. torch.sigmoid

Test Plan:
fx2trt unittests

Able to lower dense arch with fx2trt locally.

Reviewed By: ajtulloch, yinghai

Differential Revision: D29563962

fbshipit-source-id: 114c4e871efb25379043224f5f0116829cd7dc50
2021-07-30 09:26:11 -07:00
d783617216 enable warnings on cuda synchronization (#62092)
Summary:
This creates a `torch.cuda.set_warn_on_synchronization()` function that warns or errors when a synchronizing operation is performed. We could wrap it in a context manager for ease of use, but that would be a lie, because it sets global, not thread-local, state. Since it's intended for debugging, maybe that's ok though.
Like all `torch.cuda.*` functions, it goes through CPython, not pybind, so the argument is converted to long before being passed to the c10 function. I'll make the Python argument a Python enum class, but without pybind it will still have to go through the long conversion.

For a test script
```
import torch
torch.cuda.set_warn_on_synchronization(1)
x=torch.randn(10, device="cuda")
x.nonzero()
y=torch.randn((), device="cuda")

if y:
    print("something")
torch.multinomial(x.abs(), 10, replacement=False)
torch.randperm(20000, device="cuda")
ind = torch.randint(10, (3,), device="cuda")
mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool)
val = torch.randn((), device="cuda")
x[mask]=1.
x[mask] = val
torch.cuda.synchronize()
```
the output is
```
/../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x.nonzero()
/../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  if y:
something
/../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  torch.multinomial(x.abs(), 10, replacement=False)
/../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x[mask] = val
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092

Reviewed By: mruberry

Differential Revision: D29968792

Pulled By: ngimel

fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e
2021-07-30 09:13:01 -07:00
273188549f pass through *EXITCODE *EXITCODE__TRYRUN_OUTPUT variables (#49646)
Summary:
This is needed to allow cross compiling to work

There are some `try_run` statements in the CMake files used for building PyTorch and its dependencies. Since we are cross-compiling, there's no way to run the compiled executables to get the output of the `try_run` function. CMake solves this by requiring the user to manually provide the exit code and the output of the executable, given by the `*EXITCODE` and `*EXITCODE__TRYRUN_OUTPUT` variables respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49646

Reviewed By: heitorschueroff

Differential Revision: D29960301

Pulled By: malfet

fbshipit-source-id: b10ab9c182d1220f7e1911f922e7db261d521145
2021-07-30 08:22:33 -07:00
b3781f0244 Remove faulty process group agent logic (#62409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62409

This a reland of #61907 because removing process_group_agent.h / cpp broke facebook specific tests. I will remove the files and update the internal test code in a separate PR.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29990001

Pulled By: H-Huang

fbshipit-source-id: 2ee333322247d8b72691152308c3297e8c0c006d
2021-07-30 08:12:48 -07:00
ee7d19ac29 add OpInfo for torch.nn.functional.one_hot (#62253)
Summary:
Addresses facebookresearch/functorch#78.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62253

Reviewed By: heitorschueroff

Differential Revision: D29992924

Pulled By: zou3519

fbshipit-source-id: 1fc81edf3c8ca0722c5db0b32929a4cb3285f634
2021-07-30 07:05:29 -07:00
09d10c4329 OpInfo for nn.functional.softmax (#62077)
Summary:
This PR:

* Adds OpInfo for `softmax` and `nn.functional.softmax` (alias).
* Skip removal for `test_jit_alias_remapping` test of `log_softmax`.

Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

cc: mruberry zou3519 pmeier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62077

Reviewed By: heitorschueroff

Differential Revision: D29990019

Pulled By: zou3519

fbshipit-source-id: 67476990b54a5dd824eed9d10236e118564f2501
2021-07-30 06:56:03 -07:00
9fdf7ec6a2 [docs] Update sphinx to 3.5.4 (#61601)
Summary:
Sphinx 4.x is out, but it seems that requires many more changes to
adopt. So instead use the latest version of 3.x, which includes
several nice features.

* Add some noindex directives to deal with warnings that would otherwise
  be triggered by this change due to conflicts between the docstrings
  declaring a function and the autodoc extension declaring the
  same function.
* Update distributions.utils.lazy_property to make it look like a
  regular property when sphinx autodoc inspects classes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61601

Reviewed By: ejguan

Differential Revision: D29801876

Pulled By: albanD

fbshipit-source-id: 544d2434a15ceb77bff236e934dbd8e4dbd9d160
2021-07-30 06:23:10 -07:00
e352585f67 Clean up running smoke tests logic for Windows GHA (#62344)
Summary:
Followup to https://github.com/pytorch/pytorch/issues/62288

Front loads the logic and also force smoke tests to run on only one shard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62344

Test Plan: Note that for the windows cuda10 run on PR, we get only 1 shard with the smoke tests running: https://github.com/pytorch/pytorch/pull/62344/checks?check_run_id=3194294041

Reviewed By: seemethere, heitorschueroff

Differential Revision: D29991573

Pulled By: janeyx99

fbshipit-source-id: 263d7de72c7a82a7205932914c32d39892294cad
2021-07-30 05:00:56 -07:00
329426c249 Fix cppdoc example syntax (#62385)
Summary:
a simple fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62385

Reviewed By: suo

Differential Revision: D30000982

Pulled By: heitorschueroff

fbshipit-source-id: e2e1c9efba3734b58d9b5f358c01d12c2c8c91ff
2021-07-30 04:36:55 -07:00
d57ce8cf89 [Linalg] Add cusolver syevjBatched path for torch.linalg.eigh when cuda >= 11.3 U1 (#62003)
Summary:
This PR adds the `cusolverDn<T>SyevjBatched` function to the backend of `torch.linalg.eigh` (the eigenvalue solver for Hermitian matrices). Using the heuristics from https://github.com/pytorch/pytorch/pull/53040#issuecomment-788264724 and my local tests, the `syevj_batched` path is only used when `batch_size > 1` and `matrix_size <= 32`. This gives us a huge performance boost in those cases.

Since there were known numerical issues with cusolver `syevj_batched` before CUDA 11.3 update 1, this PR only enables the dispatch when the CUDA version is at least that.
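A minimal sketch of an input that should take the new path under the stated heuristics (shapes illustrative):

```
import torch

A = torch.randn(512, 16, 16, device="cuda")
A = A + A.transpose(-2, -1)    # symmetrize so each matrix is Hermitian
w, v = torch.linalg.eigh(A)    # batch_size > 1 and matrix_size <= 32
```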

See also https://github.com/pytorch/pytorch/issues/42666 #47953 https://github.com/pytorch/pytorch/issues/53040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62003

Reviewed By: heitorschueroff

Differential Revision: D30006316

Pulled By: ngimel

fbshipit-source-id: 3a65c5fc9adbbe776524f8957df5442c3d3aeb8e
2021-07-30 00:35:21 -07:00
956c22b1f9 [dte] fastpath implementations for mulgrad / divgrad (3/x) (#62437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62437

In this diff we add a broadcast fastpath for MulGradient and DivGradient ops, whose tests we update to exercise the new functionality.

Test Plan: Added test cases to elementwise ops (which will exercise the new MulGradient / DivGradient broadcast fastpath functionality) that will be run by CI. It's worth noting there's still no code (outside of the new test cases) that takes the new code paths added -- the user must explicitly request  allow_broadcast_fastpath=True, and nothing outside of the added tests currently does so.

Differential Revision: D29938273

fbshipit-source-id: 281c1a109e38c25b9bf9ff8d832de60ac3c231a9
2021-07-30 00:05:34 -07:00
607d720be1 Remove an unused variable
Summary: This is set but never used

Test Plan: NFC

Reviewed By: smeenai

Differential Revision: D30000830

fbshipit-source-id: 702d6f7b844b52bfe696725a6b0a055d494b739a
2021-07-29 23:10:03 -07:00
cfd0f5ebc9 [quant] update per-channel observer min/max_val attribute names (#62345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62345

This PR updates the attribute names from min_vals to min_val. The motivation for this is to keep the attribute name consistent with per-tensor observers so that dependencies (like FusedMovingAvgObsFakeQuantize) don't need to differentiate between the two observer types to access the attributes.

It also adds some BC tests to make sure that observers saved earlier with min_vals/max_vals can be loaded depending on the state_dict version.
Note: Scriptability of the observers isn't fully supported yet, so we aren't testing for that in this PR.

Test Plan:
python test/test_quantization.py TestSerialization

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D30003700

fbshipit-source-id: 20e673f1bb15e2b209551b6b9d5f8f3be3f85c0a
2021-07-29 22:28:53 -07:00
d92301dd02 [sharded_tensor] add new init_from_local_shards API (#60479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60479

This adds an `init_from_local_shards` API to construct a ShardedTensor from local_shards and global sharded_tensor_metadata. It also refactors the utils in ShardingSpec so they can be used by sharded_tensor for sanity-checking purposes.

Test Plan:
test_init_from_local_shards
test_init_from_local_shards_invalid_sharding

Reviewed By: pritamdamania87

Differential Revision: D29276777

fbshipit-source-id: 011c1d70426bc560a59b8d858c68f1aa12db8481
2021-07-29 22:04:13 -07:00
bc787f2402 Fix setArgumentNames and make Script/Python consistent (#62442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62442

For PythonMethodWrapper::setArgumentNames, make sure to use the correct method
specified by method_name_ rather than the parent model_ obj, which itself
_is_ callable but whose callable does not have the right signature to extract.

For Python vs Script, unify the behavior to avoid the 'self' parameter, so we only
list the argument names of the unbound arguments, which is what we need in practice.

Test Plan: update unit test and it passes

Reviewed By: alanwaketan

Differential Revision: D29965283

fbshipit-source-id: a4e6a1d0f393f2a41c3afac32285548832da3fb4
2021-07-29 21:29:06 -07:00
725d98bab6 [Prototype] [PyTorch Edge] Speed up model loading by 12% by directly calling the C file API from FileAdapter (#61997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61997

After profiling the model loading latency on AI Bench (Android Galaxy S8 US), it seems like a significant amount of time was spent reading data using FileAdapter, which internally calls IStreamAdapter. However, IStreamAdapter uses `std::istream` under the hood, which is not that efficient. This change reduces the model loading time from [~293ms](https://www.internalfb.com/intern/aibench/details/600870874797229) to [~254ms](https://www.internalfb.com/intern/aibench/details/163731416457694), which is a reduction of ~12%.
ghstack-source-id: 134634610

Test Plan: See the AI Bench links above.

Reviewed By: raziel

Differential Revision: D29812191

fbshipit-source-id: 57810fdc1ac515305f5504f88ac5e9e4319e9d28
2021-07-29 20:14:49 -07:00
693d8f2f07 [PyTorch Edge] Cache operator lambda during model loading [7% faster model loading] (#61996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61996

A recent post https://fb.workplace.com/groups/pytorch.edge.users/posts/2012215235600341/ about slow model loading with an accompanying perf report (report.html) caused me to look at the report and find hot spots during model loading. This suggested that we spend quite a bit of time looking up operators from the dispatcher. This means that we can probably just cache the operator handler functions (instead of computing them every time the operator name shows up, since it potentially shows up multiple times in a given model).

This diff results in an approx 7% speedup in model loading time (from [315ms](https://www.internalfb.com/intern/aibench/details/45077128343028) to [293ms](https://www.internalfb.com/intern/aibench/details/600870874797229)) when run against an 87MB speech model that jiatongzhou provided.

See https://fb.workplace.com/groups/pytorch.dev/posts/855724575006024/ for the previous post from jiatongzhou.
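The actual change lives in the C++ lite interpreter, but the caching idea itself is simple. Here is a minimal Python sketch of it; `resolve_from_dispatcher` is a hypothetical stand-in for the real dispatcher lookup:

```
def resolve_from_dispatcher(name):
    # hypothetical stand-in for the (relatively expensive) dispatcher lookup
    return lambda *args: print(f"running {name}")

op_cache = {}

def resolve_op(name):
    # memoize the handler so repeated occurrences of the same operator
    # name in a model do not pay the lookup cost again
    if name not in op_cache:
        op_cache[name] = resolve_from_dispatcher(name)
    return op_cache[name]
```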
ghstack-source-id: 134634612

Test Plan:
Run using AI Bench.

### Speech Transducer v25 model (87MiB)

Followed up with jiatongzhou and he gave me his speech model. For posterity, here's how to fetch it (you don't need to, since I uploaded it to NMLML and it now has a permanent Everstore Handle):

```
cd /tmp/
mkdir speech_model
cd speech_model
fbpkg fetch speech.stella.neural_transducer.on_device.en_us:25
cp pytorchmodel.pt ~/speech_transducer_v25_pytorchmodel.ptl
```

Here's how to build and run the benchmark using AI Bench:

```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/speech_transducer/v25.json --framework pytorch --platform android/arm64 --devices "S8US" --force_profile --remote
```

Reviewed By: raziel

Differential Revision: D29826210

fbshipit-source-id: 134b67eb466e73f0e43447b9b966278f13c4b56f
2021-07-29 20:14:47 -07:00
0b3f42fa4f [PyTorch Edge] Add test for lite interpreter operator caching (#62306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62306

Test to see if caching of operators works as expected. When caching operators during model load we look up using the operator name. This test ensures that even if there are multiple operators with the same name (in the same model), the caching distinguishes between the ones that have a different number of arguments specified during the call in the serialized bytecode.

In this specific test, there's a model with 3 methods, 2 of which return a `float32` tensor and one of which returns an `int64` tensor. Please see the comments in the diff for details.

ghstack-source-id: 134634613

Test Plan:
Test command:

```
cd fbsource/fbcode/
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - LiteInterpreterTest.OperatorCacheDifferentiatesDefaultArgs'
```

```
cd fbsource/
buck test xplat/caffe2:test_lite_interpreter
```

Reviewed By: raziel

Differential Revision: D29929116

fbshipit-source-id: 1d42bd3e6d33128631e970c477344564b0337325
2021-07-29 20:14:45 -07:00
0bbdf0e1e3 [PyTorch Edge] Add test_lite_interpreter to fbsource xplat BUCK files (#62305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62305

Currently, it's super time consuming to run a lite interpreter test from fbcode since it takes > 10 minutes to build. Recently, I haven't been able to do that either due to low disk space.

Having this test available in fbsource/xplat/ is a great win for productivity since I can re-run it in ~2 minutes even after significant changes!

I've had to disarm some tests that can only run in OSS or fbcode builds (since they need functionality that we don't include for on-device FB builds). They are disarmed using the macro `FB_XPLAT_BUILD`.

ghstack-source-id: 134634611

Test Plan: New test!

Reviewed By: raziel, JacobSzwejbka, cccclai

Differential Revision: D29954943

fbshipit-source-id: e55eab14309472ef6bc9b0afe0af126c561dbdb1
2021-07-29 20:13:06 -07:00
90977e10ed Remove an unused variable
Summary: This is defined and then set once but never actually used. Kill it here.

Test Plan: NFC

Reviewed By: smeenai

Differential Revision: D29994983

fbshipit-source-id: 0cb7383b3ec95f1aeed5210974bc95060cf10be5
2021-07-29 18:04:01 -07:00
74291c8347 [quant][graphmode][fx] Fix the calls to load_arg in quantization_patterns.py (#62376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62376

load_arg(quantized=...) accepts a dictionary mapping input index to dtype, not a list of dtypes; the call is just there to make sure the inputs are quantized with the correct dtype.
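As a toy illustration of the corrected contract (`load_arg` here is a self-contained stand-in for the internal FX helper, not the real implementation):

```
import torch

def load_arg(quantized):
    # the argument is a dict mapping input index -> dtype, not a list of dtypes
    assert isinstance(quantized, dict)
    def check(args):
        for idx, dtype in quantized.items():
            print(f"input {idx} is expected to be quantized as {dtype}")
        return args
    return check

load_arg(quantized={0: torch.quint8})(["conv_output"])
```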

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D29979711

fbshipit-source-id: 8499976ac5df8eb2019c3beae573dec6c9a56247
2021-07-29 17:28:07 -07:00
eef85f89b9 [dte] broadcast fastpath implementations for reduce utility functions (2/x) (#62428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62428

In this diff we add a broadcast fastpath for reduce utility functions. These functions are used by various elementwise ops, whose tests we update to exercise the new functionality.

Test Plan: Added test cases to elementwise ops (which will exercise the new reducer functionality) that will be run by CI. It's worth noting there's still no code (outside of the new test cases) that takes the new code paths added -- the user must explicitly request `allow_broadcast_fastpath=True`, and nothing outside of the added tests currently does so.

Differential Revision: D29938264

fbshipit-source-id: 5d5542bd93afb85fd9f7a4073f766adc07eb3b65
2021-07-29 17:27:39 -07:00
219917706e [quant][graphmode] Add support for reference pattern for default ops (#62375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62375

Default ops are ops that have one quantized input and one quantized output, e.g. gelu, silu, leaky_relu, etc.; for these we need to insert an observer for the output.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29979712

fbshipit-source-id: ed88210a9d6f1ab5cdb9397b4ff7f1628162ef22
2021-07-29 17:27:37 -07:00
acba9b3104 [DDP Communication Hook] Simplify the implementation of parseHookResult of PythonCommHook (#62389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389

Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.

Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280

Additionally, formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29982485

fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
2021-07-29 17:27:35 -07:00
554daef820 Reformat test_c10d_nccl.py and distributed_test.py (#62388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62388

as title
ghstack-source-id: 134626247

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29984086

fbshipit-source-id: 0960e5acc379ccdf08813387e11d3fb0a5f0e4b0
2021-07-29 17:27:33 -07:00
9fee176be3 [Model Averaging] Fix docstring of PeriodicModelAverager (#62392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392

The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29986446

fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
2021-07-29 17:26:27 -07:00
8f519c5e07 [quant][graphmode] Add support for reference pattern for torch.cat (#62374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62374

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_cat

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29979713

fbshipit-source-id: 2d38991f96fcca783169ffd306bc2b0fb7debc69
2021-07-29 16:31:09 -07:00
502823c201 Change torch::Tensor to at::Tensor to fix build failure (#62425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62425

Fixes https://github.com/pytorch/pytorch/issues/62416

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D30000948

Pulled By: heitorschueroff

fbshipit-source-id: 07dfc88a01b7718bc32be4342f43bb2cf2842b60
2021-07-29 16:31:08 -07:00
49dc827712 Reland D29943356: .github: Migrate ecr_gc to github actions (#62438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62438

Switches out BASH_ENV for GITHUB_ENV

This reverts commit 1f1d01df3ec06046880d0a92b930fbd051d60606.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D29999785

Pulled By: seemethere

fbshipit-source-id: bb92850765518005a3f530264643959e5038e681
2021-07-29 16:31:06 -07:00
dc8b5db5f8 [quant][graphmode] relax the constraint for supported_dtypes for reference option (Linear and Conv) (#62348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62348

Originally we had a supported_dtypes check for linear and conv, but it is only valid for the non-reference option. This PR removes the constraint when is_reference=True and enables producing reference patterns for dtype combinations that are not supported by fbgemm/qnnpack, for example qint8 activation dtypes.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_linear_qint8_activation

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29968675

fbshipit-source-id: 2abe37940eb62e16fcf0cbb700c174de49719223
2021-07-29 16:31:04 -07:00
9f9244aabe [dte] scaffolding for c2 operator broadcasting fastpath (1/x) (#62369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62369

This diff is a big no-op that just sets up scaffolding for passing the "allow_broadcast_fastpath" flag from caffe2 operator protos created in Python down to C++. To facilitate this, we create helper template wrappers that pass a flag for "allow_broadcast_fastpath" down to elementwise functors. This flag will determine whether to try to take the broadcast fastpath, which we will add in subsequent diffs.

Test Plan: sandcastle + let github CI run

Differential Revision: D28154475

fbshipit-source-id: 15750a0bcd2994fbc6a61fb5653d8cae6b0177dd
2021-07-29 16:31:02 -07:00
5c47038d12 Back out D29792193 "Add default Saved Variable hooks" (#62415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62415

test error

Differential Revision: D29990361

fbshipit-source-id: 99c87dec6c5be6496c9db5c9205c3cb72a953dd9
2021-07-29 16:31:00 -07:00
dcfcefcd0b Back out D29848525 "Catch saved tensors default hooks race condition" (#62414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62414

test error

Differential Revision: D29990348

fbshipit-source-id: 1a7c668153ad7ad9e847dd1a74db678e787b6b0e
2021-07-29 16:29:46 -07:00
389380ffcc [reland] Refactor Tensor::to to call a primitive that is not copy_. (#62262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62262

Context
-------
functorch is unable to vmap(grad(f)) when f contains a .to
call. This is because .to (when it is not a no-op) decomposes
to .copy_ under grad and the .copy_ is not compatible with vmap.

Fix
 ---
The fix for this is to have all Tensor::to variants call a new operator,
`_to_copy`, that always copies and is a primitive w.r.t. autograd so
that autograd decomposes Tensor::to into a call to `_to_copy`.
(This is related to https://github.com/pytorch/pytorch/issues/60956,
please let me know if you want to bikeshed the naming).

In order to get this done I had to do a bit of refactoring. All of the
`::to` implementations now call `to_impl` which may call `_to_copy`.
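A small example of the dtype-dependent differentiability that motivates the codegen changes below:

```
import torch

x = torch.randn(3, requires_grad=True)
y = x.to(torch.long)    # integral dtype: the result cannot require grad
z = x.to(torch.double)  # floating-point dtype: differentiable
print(y.requires_grad, z.requires_grad)  # False True
```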

Autograd codegen changes
------------------------

The second thing I had to do was modify the autograd codegen. Right now,
autograd assumes that every output is either statically known to be
differentiable or not differentiable at codegen time. `_to_copy` is a
little special because its differentiability depends on the output
dtype. e.g. `torch.randn(3, requires_grad=True).to(torch.long)` is non
differentiable. To get this to work:
- I changed how `output_differentiability` in derivatives.yaml works.
- output_differentiability can now accept "conditions" for each of the
output arguments. A "condition" is some C++ code.
- We currently only support `output_differentiability` with conditions
if there is a single output. This is for convenience and can be changed
in the future.
- I added a new `output_differentiability_conditions` field to
DifferentiabilityInfo. This gets populated in load_derivatives.yaml
- forward-mode and reverse-mode AD take
`output_differentiability_conditions` into account.

Here's how the generated code for `VariableType::_to_copy`
[looks
like](https://gist.github.com/zou3519/93462df4bda1837acee345205b7cc849)
No other autogenerated code gets modified by this PR.

Performance benchmarking
------------------------
- I benchmarked [three
cases that demonstrate overhead](https://gist.github.com/zou3519/5b6985e6906b80eec5a0dd94ed5b6a1a).
- Case A: No-op .to(). Instruction count went from 50223 to 25623. I
have no clue why but this is a good thing.
- Case B: not-no-op .to(). Instruction count went from 665291 to 671961.
This is expected; `_to_copy` adds an additional dispatch.
- Case C: not-no-op .to() forward pass and backward pass. Instruction count
went from 4022841 to 4030057. This PR adds
an additional dispatch to .to() (so there should be one additional
dispatch in the forward pass) so this number looks reasonable.

Test Plan
---------
- test_torch.py has a test_to
- test_cuda.py has test_to*
- test_autograd has tests (test_type_conversions) that exercise the
reverse-mode path
- test_ops.py has some tests (like log_softmax) that exercise the
reverse-mode and forward-mode AD path.
- test_quantization, test_namedtensor all exercise tensor.to as well.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29934998

Pulled By: zou3519

fbshipit-source-id: 820069acd66fd5af97b98f42edfca68572c9eb1c
2021-07-29 10:49:32 -07:00
7b6d569a2b [jit] Renamed prim::Concat as prim::VarConcat (#61983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61983

Trial #2. The previous PR (https://github.com/pytorch/pytorch/pull/61498) was reverted because this caused a failure in `pytorch_linux_backward_compatibility_check_test`. Fixed that now by adding to the exception list in `check_backward_compatibility.py`.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29828830

Pulled By: navahgar

fbshipit-source-id: 947a7b1622ff6e3e575c051b8f34a789e105bcee
2021-07-29 10:28:59 -07:00
5ede826178 Fix alpine ecr image pull (#62413)
Summary:
Fixes alpine ecr image pull in the render_test_result step

![image](https://user-images.githubusercontent.com/658840/127527503-e88f198d-a8d5-4d3b-a064-096dca07d713.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62413

Reviewed By: malfet

Differential Revision: D29990844

Pulled By: zhouzhuojie

fbshipit-source-id: ff420f57d5e4b80d0ebf73508001a127649e9eb2
2021-07-29 10:20:13 -07:00
a42345adee Support for target with class probs in CrossEntropyLoss (#61044)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11959

Alternative approach to creating a new `CrossEntropyLossWithSoftLabels` class. This PR simply adds support for "soft targets" AKA class probabilities to the existing `CrossEntropyLoss` and `NLLLoss` classes.

Implementation is dumb and simple right now, but future work can add higher performance kernels for this case.
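A short usage sketch of the new target form; the class probabilities have the same shape as the input rather than being a LongTensor of indices:

```
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)
# "soft" targets: a probability distribution over classes per sample
target = torch.softmax(torch.randn(4, 10), dim=1)
loss = F.cross_entropy(logits, target)
loss.backward()
```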

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61044

Reviewed By: zou3519

Differential Revision: D29876894

Pulled By: jbschlosser

fbshipit-source-id: 75629abd432284e10d4640173bc1b9be3c52af00
2021-07-29 10:04:41 -07:00
dd0ef23a85 Delete .clang-tidy-oss (#62373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62373

Internal clang-tidy can handle all the options after  D29863426 was deployed

Test Plan: CI

Reviewed By: 1ntEgr8

Differential Revision: D29978471

fbshipit-source-id: ea531734ab4fc3e0a26552bd24846b22c2e5c745
2021-07-29 09:30:18 -07:00
7157ad44bc Fix windows ci squid env (#62353)
Summary:
This is a re-land of https://github.com/pytorch/pytorch/pull/62244; notable changes are

- Use jinja2 variables to DRY the settings
- Added no_proxy for common destinations that shouldn't go through the proxy (e.g. the magic settings from [aws link](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/http_proxy_config.html#windows-proxy))
- Try to trigger windows GHA CI flows
- Also went through actionlint to fix GitHub Actions linting errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62353

Reviewed By: driazati

Differential Revision: D29970842

Pulled By: zhouzhuojie

fbshipit-source-id: b9c457b0005bb1a64850949a56679d68fbb281d6
2021-07-29 09:20:30 -07:00
80a662e773 ENH Updates docs and tests for classification modules that already support no batch dims (#61874)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61874

Reviewed By: heitorschueroff

Differential Revision: D29979977

Pulled By: jbschlosser

fbshipit-source-id: 82c19151aa7220e564337b05d7677d52981e0aa2
2021-07-29 09:14:52 -07:00
b9f02778b2 Forward fix mypy for #61820 (#62398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62398

Test Plan: Imported from OSS

Reviewed By: malfet, anjali411

Differential Revision: D29988610

Pulled By: ejguan

fbshipit-source-id: 700dfa5b1c415bc058390bbe5727a739c8419b0f
2021-07-29 07:43:12 -07:00
2d103025a5 Adding warning on isend about modifying after send (#61875)
Summary:
This is a standard limitation of collective communication libraries. For example:

https://www.open-mpi.org/doc/v4.0/man3/MPI_Isend.3.php
```
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.
```

http://openucx.github.io/ucx/api/latest/html/group___u_c_p___c_o_m_m.html#ga8323878b60f426c630d4ff8996ede3cc
```
The user should not modify any part of the buffer after this operation is called, until the operation completes.
```
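A minimal sketch of the safe usage pattern the warning describes, assuming a default process group has already been initialized:

```
import torch
import torch.distributed as dist

buf = torch.ones(4)
if dist.get_rank() == 0:
    work = dist.isend(buf, dst=1)
    work.wait()  # do not modify buf until the send completes
    buf += 1     # safe only after wait()
else:
    work = dist.irecv(buf, src=0)
    work.wait()
```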

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61875

Reviewed By: suo

Differential Revision: D29783720

Pulled By: mrshenli

fbshipit-source-id: 78fd047c74449f77b906f3766a6c2bc29499847d
2021-07-29 07:37:18 -07:00
945d40dca6 Also disable inplace fw AD for acos on windows (#62360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62360

Reviewed By: malfet, bdhirsh

Differential Revision: D29973310

Pulled By: albanD

fbshipit-source-id: 3b033e779f557724602c5a87f497698f2262a12e
2021-07-29 06:42:25 -07:00
1b147a52f5 Allow FX tracer to trace control flow (if/while) statements when parameter shapes are in the conditionals (#61820)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61733

Allow FX tracer to trace control flow (if/while) statements when parameter shapes are in the condition.
If the user specifies the new "param_shapes_constant" option when constructing a tracer, the model's parameter shape attributes will be evaluated and the resulting constants will be emitted into the IR during tracing.
Also added a new test:

```
python test/fx/test_fx_param_shape_control_flow.py
```

The test also performs somewhat whitebox-style testing to check the generated Python code from the IR.
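A minimal sketch of the option; the module and shapes here are made up for illustration:

```
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(4, 3))

    def forward(self, x):
        # control flow on a parameter shape, normally untraceable
        if self.w.shape[0] >= 2:
            return x + 1
        return x - 1

tracer = torch.fx.Tracer(param_shapes_constant=True)
graph = tracer.trace(M())
print(torch.fx.GraphModule(M(), graph).code)  # the taken branch is baked in
```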

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61820

Reviewed By: bdhirsh

Differential Revision: D29969299

Pulled By: jerryzhenleicai

fbshipit-source-id: 99aae824bdfec880be69258de7ead5c8cd59eddc
2021-07-28 23:48:44 -07:00
4ed8858817 Exclude time of waiting in queue from gloo communication prof… (#61342)
Summary:
Background:
The gloo communication implementation is as follows:
1. Construct communication workers and push them into a queue.
2. Initialize a thread pool; each thread runs a loop that pops a worker from the queue and executes it.

Issue:
The recorded profiling time span starts at worker construction and ends at completion, so it includes the time the worker spends waiting in the queue. This results in multiple gloo communication time spans overlapping with each other on the same thread in the timeline:
![image](https://user-images.githubusercontent.com/62738430/124867273-5bc95b80-dff0-11eb-8664-6e5d4166fc39.png)
This happens because the next work item is already waiting in the queue while the last one is still running.

Solution:
This PR delays the profiling start time of gloo communication from worker construction to when the worker is actually executed, so the profiling span no longer includes the time spent waiting in the queue. Implementation:
1. First, disable the original record function by passing 'nullptr' as the 'profilingTitle' argument of ProcessGroup::Work.
2. Construct a 'recordFunctionBeforeCallback_' and a 'recordFunctionEndCallback_' and save them as members of the worker.
3. When the worker is executed, invoke 'recordFunctionBeforeCallback_'.
4. 'recordFunctionEndCallback_' is invoked at completion, as before.

After this modification, the gloo profiling spans in the timeline no longer overlap:
![image](https://user-images.githubusercontent.com/62738430/124868716-bb286b00-dff2-11eb-9cf0-d0494a356d0c.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61342

Reviewed By: albanD

Differential Revision: D29811656

Pulled By: gdankel

fbshipit-source-id: ff07e8906d90f21a072049998400b4a48791e441
2021-07-28 22:24:26 -07:00
35307b131d Callable activation function support for Transformer modules (Python) (#61355)
Summary:
Fixes Python part of https://github.com/pytorch/pytorch/issues/60747

Enhances the Python versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying activation function still works as well.
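A short usage sketch of both forms:

```
import torch
import torch.nn as nn

# the string form still works ...
layer_str = nn.TransformerEncoderLayer(d_model=32, nhead=4, activation="gelu")
# ... and a callable can now be passed directly
layer_fn = nn.TransformerEncoderLayer(d_model=32, nhead=4,
                                      activation=torch.nn.functional.gelu)
out = layer_fn(torch.randn(10, 2, 32))  # (seq_len, batch, d_model)
```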

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61355

Reviewed By: bdhirsh

Differential Revision: D29967302

Pulled By: jbschlosser

fbshipit-source-id: 8ee6f20083d49dcd3ab432a18e6ad64fe1e05705
2021-07-28 21:42:56 -07:00
1f2b96e7c4 [DDP] Make compute_bucket_assignment_by_size return per bucket sizes (#62231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231

`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits; return this information from the function so that we are aware of the size limit for each bucket.

This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29919056

fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
2021-07-28 20:21:01 -07:00
c76daa6de3 [DDP][ez] Remove misleading comment (#62230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62230

We don't iterate over model replicas anymore.
ghstack-source-id: 134475834

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29918760

fbshipit-source-id: 84bde670b4e91667a49f94f1b597fad364498467
2021-07-28 20:20:59 -07:00
842228fd0d [DDP] Save bucket size limits (#62229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62229

First of a stack of diffs to save and log the bucket size limits to help debug/discover discrepancies and analyze impact of bucket size tuning
ghstack-source-id: 134475835

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29918629

fbshipit-source-id: b9b3f9a5658340a4c7fd72874c2254664e3c52e9
2021-07-28 20:19:56 -07:00
cac4aa71ca Provide option to pass module instance to _load_state_dict_pre_hooks. (#62070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62070

We have a custom Tensor:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/_sharded_tensor/api.py#L67,
which doesn't show up in state_dict for the module. This was resolved by
using the _register_state_dict_hook:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1196
to parse and add custom tensors to state_dict.

However, the problem is that during load time, _register_load_state_dict_pre_hook:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1272,
does not pass in the module instance, and as a result a ShardedTensor in the
state_dict cannot be appropriately added to a module at load time.

To resolve this issue, in this PR I've enhanced this hook to support two variations: one which passes in the module instance (for the problem described above), and one which keeps the previous behavior for BC reasons.
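A minimal sketch of the module-passing variant, assuming the `with_module=True` keyword form on the private registration API (the keyword name is an assumption here, taken from the upstream private API, and may differ):

```
import torch.nn as nn

def pre_hook(module, state_dict, prefix, local_metadata, strict,
             missing_keys, unexpected_keys, error_msgs):
    # with the module instance in hand, custom tensors (e.g. ShardedTensor)
    # found in state_dict can be re-attached to the module at load time
    print(type(module).__name__, list(state_dict))

m = nn.Linear(2, 2)
m._register_load_state_dict_pre_hook(pre_hook, with_module=True)
m.load_state_dict(m.state_dict())
```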
ghstack-source-id: 134541391

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: jbschlosser

Differential Revision: D29867142

fbshipit-source-id: bcb136ff51eedd0b508cfb419e8b8a6b7d95539c
2021-07-28 19:22:47 -07:00
2eaf71d749 [Model Averaging] Update model averager API to avoid the redundant params arg needed by post-localSGD optimizer (#62132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62132

as title

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134560541

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29887751

fbshipit-source-id: 60dadb04790d800fdcc7cb8a08d060e411718739
2021-07-28 18:43:09 -07:00
55bee44951 [Model Averaging] Post-localSGD optimizer (#62131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62131

Wrap `PeriodicModelAverager` as an optimizer.

Currently both the optimizer and averager require an input `params` arg, where the latter actually can read params from the optimizer wrapper. Will update averager class API in a follow-up PR.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134560248

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D29881465

fbshipit-source-id: b9634972f4d8bffd3b3eb94f5dbbb19db2bcd759
2021-07-28 18:42:06 -07:00
58d45d950b [DDP] Log unused param names under DETAIL debug mode. (#62209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62209

When `TORCH_DISTRIBUTED_DEBUG=DETAIL` is set, log names and indices of unused parameters when searching for them.

Motivation is that we have occasionally seen issues where a parameter is marked as unused when it shouldn't be; this can help narrow down the root cause by explicitly logging the param names that are marked as unused.
ghstack-source-id: 134541461

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29916085

fbshipit-source-id: d84cf637cbbd811521e6264ffd6c50ca8a79595b
2021-07-28 18:10:32 -07:00
24ed6e6b16 Add actionlint (#62364)
Summary:
This adds a linter for our GitHub actions. When a GitHub Actions workflow has an invalid definition, GitHub doesn't queue the job and doesn't report it as failed, so these can be hard to detect with the usual tools. This adds an explicit job to check if our workflow YAMLs are valid using [https://github.com/rhysd/actionlint](https://github.com/rhysd/actionlint). We deployed a similar check in pytorch/test-infra [here](https://github.com/pytorch/test-infra/pull/89).

This PR enables the linter and fixes all the issues it complained about (it did already catch one bug where we were leaving `CIRCLE_BRANCH` blank when uploading binary size)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62364

Reviewed By: zhouzhuojie

Differential Revision: D29973928

Pulled By: driazati

fbshipit-source-id: 83b365e98fd6cbdcd75eeb44daf1be1c89056f8d
2021-07-28 17:10:20 -07:00
fcc7fbe15f Split zeta_kernel out of BinaryMiscOpsKernel.cu (#62261)
Summary:
`BinaryMiscOpsKernel.cu` takes 4 m 30 s to compile on my machine, which is the second slowest after `PowKernel.cu`. Moving the zeta kernel into its own file takes 3 m 30 s, and reduces `BinaryMiscOpsKernel.cu` compile time to 1 m.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62261

Reviewed By: bdhirsh

Differential Revision: D29969350

Pulled By: ngimel

fbshipit-source-id: 37cad5775088b2f7d22948414e4bf0427f88e07d
2021-07-28 16:07:15 -07:00
f6e137598d ns for fx: fix nit in default qlinear weight extraction function (#62334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62334

Removes the assert for node type in default qlinear weight extraction
function. Without the assert, user defined functions can now use
this util function without failing this check.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs

// further tests will be in follow-up fb-only diffs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29963501

fbshipit-source-id: a634eabb5165375bde186438318ec52fa29c970f
2021-07-28 16:07:13 -07:00
72c943a2ac ns for fx: fix bug for user function in weight extraction (#62333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62333

We incorrectly ignored any custom relationships the user specified
in the `extract_weights` API.  Fixing this and adding a test case.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29963502

fbshipit-source-id: 33ce3d4df1acb6298b6c7dcb6674015c8d14bdf4
2021-07-28 16:05:51 -07:00
d98b1c400d [pruner] add cuda tests for pruner (#61993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61993

Repeating `test_pruner` unit tests for Linear and Conv2d models with device = 'cuda' to confirm pruner will work on GPU
- set device to cuda
- move model to device
- assert that module.weight.device is cuda
ghstack-source-id: 134554382

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1Md9c

Reviewed By: jerryzh168

Differential Revision: D29829293

fbshipit-source-id: 1f7250e45695d0ad634d0bb7582a34fd1324e765
2021-07-28 14:45:04 -07:00
b39b28ced3 irange-ify 10 (#62122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62122

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879694

fbshipit-source-id: 87cd8ab17061129c835d9f961b67587c84d181d1
2021-07-28 13:35:23 -07:00
88f8f2ab94 irange-ify 6 (#62115)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62115

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29879576

fbshipit-source-id: 63cbf0ab5a52325fa2c3dec0e8239e2eac1ecf72
2021-07-28 13:32:11 -07:00
9e77113e85 irange-ify 11 (#62121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62121

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29879701

fbshipit-source-id: 5c51879c88fa6a5790db241c8b33ec0dc4b177ca
2021-07-28 13:32:09 -07:00
b5867a1b34 irange-ify 7 (#62117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62117

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29879640

fbshipit-source-id: 189578a57301747a3421742e145bbcdf2ad75c49
2021-07-28 13:30:39 -07:00
59bb4f2dab Revert D29928698: [pytorch][PR] Use private squid proxy
Test Plan: revert-hammer

Differential Revision:
D29928698 (6da4a25509)

Original commit changeset: 4ee78be0abe3

fbshipit-source-id: 44679a2b247ba8163f09895d9d36ecf5df4390b8
2021-07-28 12:35:55 -07:00
3a2603bc68 Port slow_conv_transpose2d to structured (#55503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55503

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29945028

Pulled By: SplitInfinity

fbshipit-source-id: 0b696d104938287444210f1bc926afc13f899991
2021-07-28 12:03:03 -07:00
05b802d4e0 [pytorch] Bring back RemoveInplaceOps() (#62200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62200

This commit brings back the `RemoveInplaceOps` pass removed in D29523283 (dec5aa2260) that apparently had a bunch of internal users.

Test Plan: danthe3rd

Reviewed By: danthe3rd

Differential Revision: D29833316

fbshipit-source-id: 6cf13d463ab0a5e50ba3eb3243f79a9c51623809
2021-07-28 12:00:38 -07:00
b91a917616 [Static Runtime] Fixed another build failure in OSS due to test_utils.h (#62338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62338

Test Plan: Imported from OSS

Reviewed By: d1jang

Differential Revision: D29965744

Pulled By: navahgar

fbshipit-source-id: cf3e54ac13432ea8afc4b718fac6c9768743d01b
2021-07-28 11:41:33 -07:00
7c588d5d00 ENH Adds no_batch_dim support for pad 2d and 3d (#62183)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62183

Reviewed By: ejguan

Differential Revision: D29942250

Pulled By: jbschlosser

fbshipit-source-id: d1df4ddcb90969332dc1a2a7937e66ecf46f0443
2021-07-28 11:10:44 -07:00
6da4a25509 Use private squid proxy (#62244)
Summary:
This PR adds a **private** squid proxy (note that the internal ELB is only accessible from the private VPC subnets of GitHub Runners) that's deployed and dedicated to PyTorch CI on GitHub runners.

```
dig $SQUID_PROXY

10.0.x.x
10.0.x.x
```

http_proxy and https_proxy are compatible with the following http clients:

- curl
- wget
- python

Existing cache policy:

```
refresh_pattern -i .(7z|deb|rpm|exe|zip|tar|tgz|gz|ram|rar|bin|tiff|bz2|run|csv|sh)$ 1440 80% 2880
```

It uses the standard squid refresh_pattern to cache requests. In our setup, we cache for at least 1440 minutes (1 day) and at most 2880 minutes (2 days), with a last-modified factor of 80% (see the squid docs). Please refer to pytorch/test-infra for details.

Right now, it only applies to the build and test step, to limit the scope and make sure build and test are more reliable with egress cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62244

Test Plan:
```
# first time, cache miss (4min20s)
http_proxy=$SQUID_PROXY https_proxy=$SQUID_PROXY curl -v -L http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz --output /tmp/tmp_mnist.zip
100 9680k  100 9680k    0     0  37836      0  0:04:21  0:04:21 --:--:-- 29908

# second time, cache hit (0s)
http_proxy=$SQUID_PROXY https_proxy=$SQUID_PROXY curl -v -L http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz --output /tmp/tmp_mnist.zip
100 9680k  100 9680k    0     0   103M      0 --:--:-- --:--:-- --:--:--  103M
```

Load Test Plan:
```
# ab load test with `-n 100` requests
ab -X $SQUID_PROXY -n 100 http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

Concurrency Level:      1
Time taken for tests:   9.044 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      991326300 bytes
HTML transferred:       991242200 bytes
Requests per second:    11.06 [#/sec] (mean)
Time per request:       90.442 [ms] (mean)
Time per request:       90.442 [ms] (mean, across all concurrent requests)
Transfer rate:          107040.50 [Kbytes/sec] received
```

Reviewed By: malfet

Differential Revision: D29928698

Pulled By: zhouzhuojie

fbshipit-source-id: 4ee78be0abe35411666c6121991b0addded57106
2021-07-28 10:37:42 -07:00
2581dfc249 [Model Averaging] Create a base class for model averaging (#62111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111

This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29884954

fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
2021-07-28 10:15:36 -07:00
a15fff0a7f Revert D29794666: Remove faulty process group code
Test Plan: revert-hammer

Differential Revision:
D29794666 (afe3644321)

Original commit changeset: 0b35191cc072

fbshipit-source-id: 6467bc5100f4115f2fdb385e205740cd68c89743
2021-07-28 10:15:34 -07:00
71a6ef17a5 ENH Adds no_batch_dim tests/docs for Maxpool1d & MaxUnpool1d (#62206)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62206

Reviewed By: ejguan

Differential Revision: D29942341

Pulled By: jbschlosser

fbshipit-source-id: a3fad774cee30478f7d6cdd49d2eec31be3fc518
2021-07-28 10:15:32 -07:00
cdf85a82ed [quant][graphmode][fx] Add reference pattern support for BatchNorm (#62215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62215

including batchnorm2d, batchnorm3d, batchnormrelu2d and batchnormrelu3d

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29917524

fbshipit-source-id: 3a9520ff659cb21e6e2fe614973b3d08aa0af923
2021-07-28 10:14:16 -07:00
7443c90f15 optimize non lastdim softmax bf16 (#60371)
Summary:
This PR enables softmax calculation with the `bfloat16` data type when the reduction is not along the last dim (a small usage sketch follows the list).
* Use a bf16 specialization for the forward calculation to reduce bf16/fp32 casts in the vec template.
* Release the bf16 limitation for the backward calculation.
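A small usage sketch, assuming a build where this change has landed:

```
import torch

x = torch.randn(64, 128, dtype=torch.bfloat16, requires_grad=True)
y = torch.softmax(x, dim=0)  # reduction along a non-last dim, in bfloat16
y.sum().backward()           # backward no longer requires casting to float32
```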

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371

Reviewed By: ejguan

Differential Revision: D29563109

Pulled By: cpuhrsch

fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
2021-07-28 10:06:51 -07:00
68efa186cc [static runtime] Implement aten::full (#62227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62227

Test Plan: Added `StaticRuntime.IndividualOps_Full` to cover the newly added code path.

Reviewed By: hlu1

Differential Revision: D29923649

fbshipit-source-id: 722950137c35ae325590a670b97f03b395e8eac3
2021-07-28 09:50:27 -07:00
10c6811a6b [DDP] Run test_ddp_new_tensor_in_fwd with static graph (#61992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992

This test previously was not enabled for static graph but to ensure
this feature is supported with DDPSink, enable it for static graph which
currently passes outputs to DDPSink.
ghstack-source-id: 134471406

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29830887

fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
2021-07-28 09:49:12 -07:00
acf8907e94 These should be equivalent per the previous formula but breaks xla (#62329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62329

Reviewed By: ejguan

Differential Revision: D29961527

Pulled By: albanD

fbshipit-source-id: 46e46726591f4c0c8faf6ec0d7136a2d4ca976ea
2021-07-28 09:23:51 -07:00
f4baa83eae [bc-breaking] reference option for conv produce a pattern instead of reference conv module (#61942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61942

This PR changes is_reference=True for conv to produce a pattern consisting of dequant - float conv - quant instead of a reference conv module. This is useful for future transformations to custom backends and also helps simplify the implementation of convert in the future.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29810656

fbshipit-source-id: 549237a62bfda4341a2a7474c124f5e33350e267
2021-07-28 09:13:40 -07:00
52d1ffb789 Teach pytrees about namedtuple (#62292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62292

This PR adds pytree support for namedtuples (see the short sketch after this list). The challenge with namedtuples is that each namedtuple class is actually a different type. This PR does the following:
- it adds a namedtuple flatten/unflatten. The flatten function returns
a context that is the actual type of the namedtuple subclass. The
unflatten function uses that type to reconstruct the namedtuple
- Special cases all pytree logic to consider all namedtuples the same.
This is done by creating a `_get_node_type(pytree)` helper function that
returns `namedtuple` if `pytree` is any namedtuple subclass. The effect
of this is that all namedtuple subclasses will go through the namedtuple
flatten/unflatten functions
- Adds a `_namedtuple_flatten_spec` function for FX pytrees. This function
flattens the namedtuple based on the spec and is equivalent to the
`_tuple_flatten_spec`.
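A short sketch of the resulting behavior using the internal pytree module (`torch.utils._pytree` is a private API):

```
import collections
import torch.utils._pytree as pytree

Point = collections.namedtuple("Point", ["x", "y"])

leaves, spec = pytree.tree_flatten(Point(1, 2))
print(leaves)                               # [1, 2]
print(pytree.tree_unflatten(leaves, spec))  # Point(x=1, y=2), type preserved
```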

Test Plan
- new tests in test/test_pytree.py and test/test_fx.py

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29947302

Pulled By: zou3519

fbshipit-source-id: 19c00665b13546642c315df0f243ad99b8e7ff7c
2021-07-28 06:27:44 -07:00
c06b6e445f Build M1 binaries with PocketFFT (#62222)
Summary:
As MKL is only available on x86_64 platform, clone header-only PocketFFT
library and use it as FFT provider

Fixes https://github.com/pytorch/pytorch/issues/62107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62222

Reviewed By: ejguan

Differential Revision: D29938718

Pulled By: malfet

fbshipit-source-id: ac0bd98b5090d6c8a26c36c4e34a4d6e1d9f1a92
2021-07-27 22:41:29 -07:00
cb2b5f06c9 Revert D29816592: [pytorch][PR] [fix] polygamma n>=1
Test Plan: revert-hammer

Differential Revision:
D29816592 (b73d759708)

Original commit changeset: 2c020a6e4c32

fbshipit-source-id: 310c93ade300966366ef04f206a5908fb27745db
2021-07-27 22:14:10 -07:00
73f1e2d1dc [8/N] Nnapi backend delegation preprocess: New refactored design (#62225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62225

Rewrote the preprocess function for Android NNAPI delegate.
Previously, `preprocess()` called `convert_model_to_nnapi()` using Pybind and returned a NnapiModule that is serialized for mobile. Now, `preprocess()` calls a sub-function of `convert_model_to_nnapi()` and returns several preprocessed items (that were previously components of NnapiModule).

Dictionary returned contains:
   "shape_compute_module": torch::jit::Module,
   "ser_model": torch::Tensor,
   "weights": List[torch.Tensor],
   "inp_mem_fmts": List[int],
   "out_mem_fmts": List[int]

**Purpose and Future:**
The purpose of these changes is to move more implementation from bytecode and TorchScript to the delegate API, since bytecode is less efficient.
Now, only the shape computation uses bytecode. In the future, shape computation will be moved out of TorchScript as well.

**nnapi_backend_preprocess.cpp:** preprocess implementation
**prepare.py**: refactored a portion of `convert_model_to_nnapi()` to `process_for_nnapi()`, so preprocess can get components of NnapiModule

**Test:**
Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
ghstack-source-id: 134444190

Test Plan: Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully

Reviewed By: raziel

Differential Revision: D29922279

fbshipit-source-id: cadcf8908d8a745dc7abbe286e97d6ead937d4ab
2021-07-27 18:52:48 -07:00
7aabda6d5d Update nccl to v2.10.3-1 (#62276)
Summary:
Which, at the time of creating this PR, points to 7e51592129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62276

Reviewed By: ngimel

Differential Revision: D29940950

Pulled By: malfet

fbshipit-source-id: 59c6fda76a9023af3adbfb5a96b83ca50950df6c
2021-07-27 18:32:53 -07:00
1f1d01df3e Revert D29943356: .github: Migrate ecr_gc to github actions
Test Plan: revert-hammer

Differential Revision:
D29943356 (8e0622abf1)

Original commit changeset: 493592baf2f7

fbshipit-source-id: f0e604aab2b828561adc3e8fabf0f39221e15615
2021-07-27 18:14:31 -07:00
af0f083d42 [dist_optim] fix the bug of none grads on functional optimizers (#62249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62249

Parameters and grads passed to torch.optim.functional optimizers should always match; we should skip the parameters that have None gradients to avoid a size mismatch.
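A minimal sketch of the idea behind the fix (pure Python, not the actual distributed-optimizer code):

```
import torch

module = torch.nn.Linear(4, 2)
module(torch.randn(1, 4)).sum().backward()
module.bias.grad = None  # simulate a parameter with no gradient

# keep params and grads aligned by skipping None gradients, so the lists
# handed to the functional optimizer always match in size
params, grads = [], []
for p in module.parameters():
    if p.grad is not None:
        params.append(p)
        grads.append(p.grad)
```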
ghstack-source-id: 134452467

Test Plan: test_dist_optim_none_grads

Reviewed By: mrshenli

Differential Revision: D29929653

fbshipit-source-id: 4ca6167fecdfe1db422236655edee3aa59b8b044
2021-07-27 18:10:51 -07:00
c0b806694f Do not use deprecated data accessor in IndexKernel.cu (#62268)
Summary:
Fixes repeated warnings like:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cuda/IndexKernel.cu: In lambda function:
/var/lib/jenkins/workspace/aten/src/ATen/native/cuda/IndexKernel.cu:354:683: warning: 'T* at::Tensor::data() const [with T = c10::BFloat16]' is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
   AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3 (e23ddf06e9)(at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, iter.dtype(), "take_cuda", [&] {
   ^
/var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:559:1: note: declared here
   T * data() const {
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62268

Reviewed By: walterddr

Differential Revision: D29937267

Pulled By: malfet

fbshipit-source-id: 6413deb9762b973880f4a7db47652eacd013214f
2021-07-27 17:58:19 -07:00
e3be185069 [PyTorch] Add KWargs support to script module forward (#62224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62224

The underlying operator allows both args and kwargs, but we only expose args in this convenience method. This brings them in line while not changing any existing programs.

Test Plan: CI

Reviewed By: gunchu

Differential Revision: D29920830

fbshipit-source-id: f4b2aa88d4a679e33595625b7ef355e4d14e54c4
2021-07-27 17:02:57 -07:00
9776e1ff2f Migrate thnn_conv_depthwise2d from THC to ATen (#62281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281

Closes gh-24646, Closes gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master (us) | This PR (us) |
|------------------:|:-----------:|:------------:|
|           Forward |    133.5    |     133.6    |
|  Backward (input) |    1,102    |     1,119    |
| Backward (weight) |    2,220    |     2,217    |

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29943062

Pulled By: ngimel

fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
2021-07-27 16:51:23 -07:00
ba9423aa93 Fix forward ad for matrix power land race (#62291)
Summary:
Fix land race from https://github.com/pytorch/pytorch/pull/59993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62291

Reviewed By: driazati, seemethere

Differential Revision: D29946599

Pulled By: albanD

fbshipit-source-id: 16411e1a0c298fad12a6a6788ec2427923b0112a
2021-07-27 16:17:51 -07:00
171e13fde9 Rework PowKernel.cu (#62260)
Summary:
PowKernel.cu is the single slowest file to compile in all of pytorch, taking
7 m 34 s on my machine. After investigating, I discovered that the case with
complex inputs and a cpu scalar for the first argument takes more than half that
time just on its own.

Noting that [`thrust::pow`] for complex is just `exp(log(base) * exponent)`,
we can improve this kernel by precomputing `log(base)` on cpu and computing
only the `exp` on CUDA. This is faster in both runtime and compile time.
For 1 million elements, master takes 61.6 us vs 56.9 us with this PR.
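A quick check of the identity the rework exploits, shown on CPU for simplicity:

```
import torch

base = torch.tensor(2.0 + 1.0j)               # cpu scalar base
exponent = torch.randn(8, dtype=torch.cfloat)
lhs = base ** exponent
rhs = torch.exp(torch.log(base) * exponent)   # log(base) computed only once
print(torch.allclose(lhs, rhs))               # True
```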

I also noticed that the constant exponent case is implemented twice, once in
`gpu_kernel_with_scalars` and again in `pow_tensor_scalar_kernel`. Further, the
`Pow.cpp` code detects cpu-scalar exponents and redispatches to the `tensor_scalar`
overload, making the `gpu_kernel_with_scalars` version dead code. Now instead,
we unconditionally run `tensor_tensor` and it will call into `tensor_scalar` if appropriate.

With these changes, PowKernel.cu takes just 2 m 30 s to compile.

[`thrust::pow`]: 368266e80e/thrust/detail/complex/cpow.h (L33)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62260

Reviewed By: ejguan

Differential Revision: D29938789

Pulled By: ngimel

fbshipit-source-id: 7ab7d81ececc92a9e6e62e60b0a4f2e6e3146df8
2021-07-27 16:16:20 -07:00
7507aeded5 [reland][bc-breaking] reference option for linear produce a pattern instead of reference linear module (#61892) (#62277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62277

This PR changes is_reference=True for linear to produce a pattern consists of dequant - float linear - quant instead of reference linear module, this is useful for future transformations to custom backends, it is also helpful to simplify the implementation for
convert in the future.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Imported from OSS

Reviewed By: ejguan

Differential Revision: D29941079

fbshipit-source-id: 84bdfc0bb872c34fc345875e545c8b323e77c41e
2021-07-27 15:46:44 -07:00
24d94f5102 Limit smoke tests on PRs to just one config (#62288)
Summary:
When I came across the short runtime of a periodic job on this PR, I realized the current smoke-tests-on-PRs setup was flawed. Previously, in an attempt at better future compatibility, our conditional ran smoke tests only when USE_CUDA=1 on Windows.

This is BAD and has unintended consequences, such as misleading results when a ci/scheduled workflow is triggered but fails to test the full test suite. e.g., with PR https://github.com/pytorch/pytorch/issues/62266 https://github.com/pytorch/pytorch/actions/runs/1071698069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62288

Reviewed By: seemethere, ejguan

Differential Revision: D29945540

Pulled By: janeyx99

fbshipit-source-id: 3cc91511c151f7348872b039c94d7752b6ea4692
2021-07-27 15:33:37 -07:00
8e0622abf1 .github: Migrate ecr_gc to github actions (#62284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62284

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, zhouzhuojie

Differential Revision: D29943356

Pulled By: seemethere

fbshipit-source-id: 493592baf2f7abe206e1fb17438bac4e908b1251
2021-07-27 15:11:01 -07:00
d0e5ef5eba .circleci: Remove conda-package-handling pin (#62290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62290

No longer needed.

Fixes nightly failures that we're observing as well:

```
Jul 27 07:33:02 Found conflicts! Looking for incompatible packages.
Jul 27 07:33:02 This can take several minutes.  Press CTRL-C to abort.
Jul 27 07:33:02 failed
Jul 27 07:33:02
Jul 27 07:33:02 UnsatisfiableError: The following specifications were found
Jul 27 07:33:02 to be incompatible with the existing python installation in your environment:
Jul 27 07:33:02
Jul 27 07:33:02 Specifications:
Jul 27 07:33:02
Jul 27 07:33:02   - conda-package-handling=1.6.0 -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0']
Jul 27 07:33:02
Jul 27 07:33:02 Your python: python=3.9
```

From: https://app.circleci.com/pipelines/github/pytorch/pytorch/356478/workflows/2102acf1-c92a-4a59-919c-61d32d3bcd71/jobs/15027876

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29946501

Pulled By: seemethere

fbshipit-source-id: 3e9182f4cbcf2aab185dbbc21b7a6171746e2281
2021-07-27 14:59:41 -07:00
8fe32c9c13 fix test-report uploading uniqueness issue (#62217)
Summary:
Should fix: https://github.com/pytorch/pytorch/issues/61978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62217

Reviewed By: seemethere, ejguan

Differential Revision: D29944444

Pulled By: walterddr

fbshipit-source-id: 4b737d1535fd5cbfafb24245fad9ef67285f1dc0
2021-07-27 14:17:50 -07:00
190cdcb08c remove print for status on scribe sending (#62285)
Summary:
Following up on https://github.com/pytorch/pytorch/issues/61768.

Currently the printout is hugely long because each test case returns an OK status code without an exception. This should be avoided when no exception was raised from send_to_scribe.

Remove the log printing when the response has no error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62285

Reviewed By: zhouzhuojie

Differential Revision: D29944461

Pulled By: walterddr

fbshipit-source-id: fc3c2b88bba27c68521cef7079ca2b6197d2d58b
2021-07-27 14:16:32 -07:00
e1bee3eb30 [Static Runtime] Add missing unit tests for static runtime ops (#62238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62238

Added tests for the following ops:

* `aten::mul`
* `aten::nan_to_num`
* `aten::stack`
* `aten::relu`
* `aten::tanh`

Reviewed By: hlu1

Differential Revision: D29914217

fbshipit-source-id: 6a6c39629310e7131127e24fdce7253ccdf80340
2021-07-27 14:12:21 -07:00
4a15f4a902 Allow 0-dim batch sizes in Bilinear NN layer. (#47106)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013

Checks if the inputs and outputs are non-zero in order to allow the Bilinear layer to accept 0-dim batch sizes. The if-check for this checks for both input and output dim sizes since the `_trilinear` function is written to work with both forward and backward for Bilinear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106

Reviewed By: ejguan

Differential Revision: D29935589

Pulled By: jbschlosser

fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
2021-07-27 13:59:42 -07:00
ab0354b650 All remaining linear/element-wise formulas (#59993)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59993

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29914594

Pulled By: albanD

fbshipit-source-id: 2ffc5993cb66586e1458d7016774a03dfe786863
2021-07-27 13:06:46 -07:00
4c3eea26bd Fix out= variant forward grad detection (#60499)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60499

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29914595

Pulled By: albanD

fbshipit-source-id: c51bb3aed91ab1f6ebc57936143b249590a43bd5
2021-07-27 13:06:45 -07:00
4a36e2a223 Add forward AD inplace check and fix codegen (#60498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60498

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29914593

Pulled By: albanD

fbshipit-source-id: bde649d5a03639a240dfe5fe027c6a3f758428a4
2021-07-27 13:04:55 -07:00
df18d05429 Make bytes_read available for OperatorCost (#62059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62059

GetOperatorCost in Workspace exposes only flops and bytes_written. Make an additional piece, bytes_read, available from OperatorSchema::Cost.

Test Plan:
Added the two additional pieces in the unit test testGetOperatorCost in workspace_test

buck test caffe2/caffe2/python:workspace_test -- testGetOperatorCost

buck test //aml/ml_foundation/exp_platform/large_scale_training/distributed_hogwild/auto_device_placement/tests/...

buck test //aiplatform/training/autotuning/tests/...

buck test //aiplatform/training/pipelining/tests/...

buck test //deeplearning/fblsim/tests/...

Flow tests:

ADP Greedy: f288078287
ADP MILP: f288079278

Reviewed By: CrazySherman, xtaofb

Differential Revision: D29860676

fbshipit-source-id: 8b3a9f2bf17c0dae48cfe2800e8821bf441e0b03
2021-07-27 12:48:36 -07:00
bba7800933 Add logical op symbol (#62063)
Summary:
This is for the XLA-side [PR](https://github.com/pytorch/xla/pull/3054) to add logical op lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62063

Reviewed By: ejguan

Differential Revision: D29937449

Pulled By: bdhirsh

fbshipit-source-id: ba421f6c2dad67395a383b5ed0b81ad9d59abe86
2021-07-27 12:19:56 -07:00
3bdee2bbed [jit] Rewrote DFS graph iterator to remove unnecessary local state (#61326) (#61980)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61980

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29917766

Pulled By: laurencer

fbshipit-source-id: 536c4806636fe9e709e8bffdefa9320127064dea
2021-07-27 11:50:20 -07:00
fa52b4b922 .github: chown workspace for render_test_results (#62207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62207

Workspace was getting held back due to permission-denied errors; let's
ensure we have a chown'd, clean workspace for all render_test_results
runs.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr, janeyx99

Differential Revision: D29915232

Pulled By: seemethere

fbshipit-source-id: dd9fcc9c00d9665569bd8cfa57e5d2d8da965aac
2021-07-27 11:44:15 -07:00
acaac70f63 Revert D29883676: Migrate thnn_conv_depthwise2d from THC to ATen
Test Plan: revert-hammer

Differential Revision:
D29883676 (de3a4eb583)

Original commit changeset: 9b2ac62cdd8a

fbshipit-source-id: d211d3cb7723b5d2e73de6941a7e649e5f78864f
2021-07-27 11:28:52 -07:00
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests simply pass on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`

The overall goal is to avoid using skips, since Sandcastle tags these tests as
continuously skipping.
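
A minimal sketch of such a decorator (assuming an environment flag like `IS_SANDCASTLE`; the real helper lives in the internal test utilities and may differ):

```python
import unittest
from functools import wraps

IS_SANDCASTLE = False  # stand-in for however the environment is detected

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if IS_SANDCASTLE:
                    return  # report a pass rather than a skip
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```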
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
7fc96db45d fix typo errors in quantization-support.rst Line320 (#44447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44379

change
"`torch.per_channel_symmetric` — per tensor, symmetric"
to
 "`torch.per_channel_symmetric` — per channel, symmetric"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44447

Reviewed By: mruberry

Differential Revision: D29909645

Pulled By: ezyang

fbshipit-source-id: e1505d070ec2b335dd6503b528e6a2f3bda2f1e3
2021-07-27 10:42:29 -07:00
5f7f08f498 Reenable AMP on XLA (#61861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61861

Fixes https://github.com/pytorch/pytorch/issues/61804

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29881903

Pulled By: ezyang

fbshipit-source-id: 91530c10fa37715bec33f477285da119415a9da9
2021-07-27 10:32:01 -07:00
a0c1c7e5d4 Fixing the case when starter nodes depend on get_attr node (#62234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62234

There was a typo that we didn't catch until recently, hence this fix.

Reviewed By: 842974287

Differential Revision: D29924190

fbshipit-source-id: ee6259fcd41358aefe9680b419acc87c0c2821cb
2021-07-27 10:29:53 -07:00
8cdf16d1de Revert D29810657: [bc-breaking] reference option for linear produce a pattern instead of reference linear module
Test Plan: revert-hammer

Differential Revision:
D29810657 (9df605133e)

Original commit changeset: 949615bbc017

fbshipit-source-id: 54597d1f9636b0f94ae01c66018ff2592e5c39fc
2021-07-27 10:10:13 -07:00
d7ddae8e4f det_backward: correct, more robust and with complex support [clone] (#61905)
Summary:
Clone of https://github.com/pytorch/pytorch/pull/58195 to ease the import. Done by request from anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61905

Reviewed By: albanD

Differential Revision: D29937920

Pulled By: anjali411

fbshipit-source-id: 025892a8e6147790825b20458986730ad8c5bb0f
2021-07-27 10:08:26 -07:00
de3a4eb583 Migrate thnn_conv_depthwise2d from THC to ATen (#62006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006

Closes gh-24646, gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master (us) | This PR (us) |
|------------------:|:-----------:|:------------:|
|           Forward |    133.5    |     133.6    |
|  Backward (input) |    1,102    |     1,119    |
| Backward (weight) |    2,220    |     2,217    |

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29883676

Pulled By: ngimel

fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8
2021-07-27 10:00:25 -07:00
9df605133e [bc-breaking] reference option for linear produce a pattern instead of reference linear module (#61892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61892

This PR changes is_reference=True for linear to produce a pattern consisting of dequant - float linear - quant instead of a reference linear module. This is useful for future transformations to custom backends, and it also helps simplify the implementation of
convert in the future.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29810657

fbshipit-source-id: 949615bbc017bc454d81c8a6b2bdec53badaab19
2021-07-27 09:49:20 -07:00
6c6a9c73f2 [7/N] Nnapi backend delegation preprocess: compile_spec sanity check (#62213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62213

Added sanity checks in preprocess function for Android NNAPI delegate.
`preprocess()` requires some input metadata passed through its `method_compile_spec` function argument.

`preprocess()` now throws specific error messages if it cannot find the correct input arguments.
Example error message:
```
RuntimeError: method_compile_spec does not contain the "forward" key.
method_compile_spec should contain a Tensor or Tensor List which bundles input parameters: shape, dtype, quantization, and dimorder.
For input shapes, use 0 for run/load time flexible input.
method_compile_spec must use the following format: {"forward": {"inputs": at::Tensor}} OR {"forward": {"inputs": c10::List<at::Tensor>}}
```

nnapi_backend_preprocess.cpp: contains sanity check implementation
test_backend_nnapi.py: sanity check unit tests

Test: Ran `python test/test_jit.py TestNnapiBackend` in OSS successfully.

TODO: Using Tensors to pass input parameters is a temporary hack. When a dedicated object is implemented, update the sanity check error message.
ghstack-source-id: 134339282

Test Plan: Ran `python test/test_jit.py TestNnapiBackend` in OSS successfully.

Reviewed By: raziel, iseeyuan

Differential Revision: D29917004

fbshipit-source-id: 0d5c6b35889c556cda905ffc29c25c5422ae9ee4
2021-07-27 09:31:35 -07:00
2cbc0ede7d [DDP] Log if graph is static at end of training (#61871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61871

When set_static_graph=False, the only type of dynamism we really
support in DDP is a dynamic set of unused parameters, which must be explicitly
enabled with find_unused_parameters=True. However, some workflows have a static
set of unused parameters; it would be good to detect this and add it to logging to
identify workflows that are candidates for static graph optimization.
ghstack-source-id: 134371429

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29773962

fbshipit-source-id: 1f741984c6e6f8e3e55cf69ca719b1e25a485b13
2021-07-27 09:23:43 -07:00
79eb8bb299 [Static Runtime] Enforce proper output dtype for many ops (re-land) (#62267)
Summary:
Re-land of D29935444
We previously had lots of ops with implementations like this:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = create_empty_like(input_0);
}
...
auto& out = p_node->Output(0);
some_func_out(inputs, out);
```
This would make the output have the correct shape. But it would
also take the dtype of `input_0`, which is not always correct.

This change transforms these blocks to:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = some_func(inputs)
} else {
  ...
  auto& out = p_node->Output(0);
  some_func_out(inputs, out);
}
```
This gives the output the correct shape and dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62267

Reviewed By: ejguan

Differential Revision: D29937253

Pulled By: malfet

fbshipit-source-id: d91ca5d5703490d7d349a1de2ad3bb09b0c33967
2021-07-27 08:54:09 -07:00
2eef1f27f8 Disable ccache for nccl builds (#62208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62208

reverts
https://github.com/pytorch/pytorch/pull/55814
which removed a workaround for:
https://github.com/pytorch/pytorch/issues/13362

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29935472

Pulled By: nairbv

fbshipit-source-id: 7ce9cde1408f17153632036fd128814032739746
2021-07-27 08:07:26 -07:00
dc55d511d9 Forward fix mypy (#62263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62263

Fixes current HUD Error: https://github.com/pytorch/pytorch/runs/3170342799

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29935265

Pulled By: ejguan

fbshipit-source-id: 6f247833d24bff7aea42f6287493a85d62d73b96
2021-07-27 07:52:31 -07:00
3cd12448b4 Add forward mode differentiation for inverse and solve (#62160)
Summary:
This PR adds forward mode differentiation for `torch.linalg.inv`, `torch.linalg.inv_ex`, and `torch.linalg.solve` functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62160

Reviewed By: mruberry

Differential Revision: D29917213

Pulled By: albanD

fbshipit-source-id: b08bbc830f77f342cc7ca5b823d7ea4380f2aaa8
2021-07-27 07:51:22 -07:00
a0309f89f4 Initial ModuleInfo implementation (#61935)
Summary:
This PR contains the initial version of `ModuleInfo` for use in testing modules. The design philosophy taken here is to start small and simple and build out / refactor as needed when more test coverage or `ModuleInfo` entries are added. As such, it's not intended for general usage yet. The PR contains the following:

* (new file) `torch/testing/_internal/common_modules.py`
  * `ModuleInfo` definition - metadata for each module to use in testing
  * `module_db` - the actual `ModuleInfo` database; currently contains entries for two modules
  * `ModuleInput` - analogous to `SampleInput` from OpInfo; contains `FunctionInput`s for both constructor and forward pass inputs
      * Constructor and forward pass inputs are tied together within a `ModuleInput` because they are likely correlated
  * `FunctionInput` - just contains args and kwargs to pass to a function (is there a nicer way to do this?)
  * `modules` decorator - analogous to `ops`; specifies a set of modules to run a test over
  * Some constants used to keep track of all modules under torch.nn:
      * `MODULE_NAMESPACES` - list of all namespaces containing modules
      * `MODULE_CLASSES` - list of all module class objects
      * `MODULE_CLASS_NAMES` - dict from module class object to nice name (e.g. torch.nn.Linear -> "nn.Linear")
* (new file) `test/test_modules.py`
    * Uses the above to define tests over modules
    * Currently, there is one test for demonstration, `test_forward`, which instantiates a module, runs its forward pass, and compares it to a reference, if one is defined (see the sketch after this list)
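
A hedged sketch of how a test might consume these pieces; the decorator signature and attribute names are inferred from the description above and may not match the file exactly:

```python
from torch.testing._internal.common_utils import TestCase
from torch.testing._internal.common_modules import module_db, modules

class TestModules(TestCase):
    @modules(module_db)
    def test_forward(self, device, dtype, module_info):
        # hypothetical attribute names: each entry carries the module
        # class plus ModuleInputs pairing constructor and forward args
        for module_input in module_info.module_inputs(device, dtype):
            ctor = module_input.constructor_input
            m = module_info.module_cls(*ctor.args, **ctor.kwargs).to(device)
            fwd = module_input.forward_input
            out = m(*fwd.args, **fwd.kwargs)
```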

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61935

Reviewed By: mruberry

Differential Revision: D29881832

Pulled By: jbschlosser

fbshipit-source-id: cc05c7d85f190a3aa42d55d4c8b01847d1efd57f
2021-07-27 07:42:07 -07:00
afe3644321 Remove faulty process group code (#61907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61907

Removing the code for the faulty process group agent since it was replaced by the faulty TensorPipe agent.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29794666

Pulled By: H-Huang

fbshipit-source-id: 0b35191cc07220b6774ecacc8d004f25fd2e87f0
2021-07-27 07:37:40 -07:00
a3be2ecc3a Revert D29887367: [Static Runtime] Enforce proper output dtype for many ops
Test Plan: revert-hammer

Differential Revision:
D29887367 (f4136c5efc)

Original commit changeset: cef04bfa52ec

fbshipit-source-id: 32e89f2b6381930559dd746b535904c3e90fd52b
2021-07-27 07:29:09 -07:00
b599c1e794 Create linalg and parametrizations codeowners (#62086)
Summary:
Added myself, nikitaved, and IvanYashchuk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62086

Reviewed By: mruberry

Differential Revision: D29920798

Pulled By: albanD

fbshipit-source-id: dcbd57bb2a438a1f04d4651447710fced83264d3
2021-07-27 06:50:41 -07:00
228b50e053 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D29930232

fbshipit-source-id: e36dbc59a25d7f36d3bb7a02ad76696f299712cf
2021-07-27 04:13:15 -07:00
2d7c1e3fa8 [bc-breaking] Produce quantization pattern for add_scalar and mul_scalar (#61859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61859

BC-breaking note:
Previously we did not add an observer/fake_quant for the output of add/mul for tensor-scalar operations;
in this PR we add the observer/fake_quant instance (the same as the input's) to correctly model
the behavior of the quantized add_scalar and mul_scalar ops (since quantized add/mul scalar assumes the
output quantized tensor has the same quantization parameters as the input).

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_add
python test/test_quantization.py TestQuantizeFxOps.test_mul

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29770859

fbshipit-source-id: f43fcbfecd04c392467770b22c481bbbdaf43c25
2021-07-27 02:46:00 -07:00
b176feec1e Add device and key for lazy tensors (#61621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61621

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D29912934

Pulled By: asuhan

fbshipit-source-id: 493c32063a3e756d93cbf1d876563a35eaafb537
2021-07-26 23:00:22 -07:00
2945a73d90 Add option to skip GH validation for torch.hub (#62139)
Summary:
Split from https://github.com/pytorch/pytorch/pull/62072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62139

Reviewed By: mthrok

Differential Revision: D29891497

Pulled By: malfet

fbshipit-source-id: 5c0baf53a2acf8f95062bd001457e1f936011529
2021-07-26 22:44:12 -07:00
64283fe146 [DDP/Functional Optim] Support kwarg arguments (#62079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079

Adds support for kwarg arguments into functional optimizer running as
hook.
ghstack-source-id: 134330379

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29838127

fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
2021-07-26 22:12:50 -07:00
c0ebeca1a8 [Functional Optim] Test kwargs parity for SGD (#62078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62078

Ensure that kwarg arguments such as momentum and weight decay maintain
parity between optimizer.step and step_param.
ghstack-source-id: 134330377

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29837942

fbshipit-source-id: 1ae39648fc26aebd8aaef1a7ac0e03b598a8ed60
2021-07-26 22:11:40 -07:00
478098aaac Revert D29801652: Refactor Tensor::to to call a primitive that is not copy_.
Test Plan: revert-hammer

Differential Revision:
D29801652 (29bb3f4647)

Original commit changeset: bb01eb1acf3d

fbshipit-source-id: 93693bad8068d47a3a4c16f34f300e03ea573897
2021-07-26 19:37:17 -07:00
69adb21940 Parity tests for functional optimizer step_param (#61756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61756

DDP will support running the optimizer as a communication hook with
optimizers that support a per-parameter/gradient step function, `step_param`.
Add parity tests as we implement more optimizers that support step_param, to
ensure parity with regular optimizers.
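
As a hedged illustration of what a per-parameter step looks like for SGD (a minimal sketch, not the actual `_FunctionalSGD` implementation):

```python
import torch

def step_param(param, grad, lr=0.01, weight_decay=0.0):
    # apply one SGD update to a single parameter given its gradient
    with torch.no_grad():
        if weight_decay != 0.0:
            grad = grad.add(param, alpha=weight_decay)
        param.add_(grad, alpha=-lr)
```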
ghstack-source-id: 134330378

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29727549

fbshipit-source-id: 18977c896f12b8e478298488b298fd107affcf5f
2021-07-26 19:03:22 -07:00
b6d10a3a27 Fix infinite loop in _validate_not_a_forked_repo() (#62072)
Summary:
Increase `page_idx` inside the loop rather than outside of it.
Break from the loop when an empty response is received, as that means there are no more items to fetch via pagination.
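
A minimal sketch of the corrected pagination pattern (the URL and helper names are illustrative, not the actual torch.hub internals):

```python
import json
from urllib.request import Request, urlopen

def fetch_all_pages(base_url, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    page_idx = 1
    results = []
    while True:
        url = f"{base_url}?per_page=100&page={page_idx}"
        with urlopen(Request(url, headers=headers)) as resp:
            page = json.loads(resp.read().decode("utf-8"))
        if not page:      # empty response: no more pages to fetch
            break
        results.extend(page)
        page_idx += 1     # increment inside the loop
    return results
```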

Also, add an option to use a provided GitHub token (via the `GITHUB_TOKEN` environment variable)

Fixes failure with "Rate Limit Exceeded" when doing something like `torch.hub.list("pytorch/test-infra:dsf")`

Fixes https://github.com/pytorch/pytorch/issues/61755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62072

Reviewed By: jbschlosser

Differential Revision: D29868539

Pulled By: malfet

fbshipit-source-id: 206082a0ba1208e9b15ff6c9c6cb71d2da74f1c3
2021-07-26 17:54:07 -07:00
d0f430927b [PyTorch][Edge] Serializing sub modules with same names (#61933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61933

### Issue:

Submodules with the same name are not serialized correctly in bytecode format when using `_save_for_mobile`. These submodules are not distinguished as different modules, even though they have different forward, setstate, etc., if they share a name.

### Fix:
The mangler creates unique names so that modules and submodules that have the same names can be uniquely identified while saving the module. iseeyuan rightly pointed out the underlying issue: the mangler is not used in the process of saving bytecode, and hence unique references for the submodules are not created. Please refer to the notebook to repro the issue: N777224

### Diff:
The fix described above is implemented. The mangled names are used in the bytecode, so the files in the `code/` directory now have the right references to `bytecode.pkl`.

Will this be backward compatible?
iseeyuan please feel free to correct or update this.
Yes. This fix impacts only modules with same-name submodules, which were not serialized correctly before. Existing modules should have correct references, and `_load_for_mobile` must not see any change. To confirm this, the existing test cases need to pass for the diff to be approved and shipped.
ghstack-source-id: 134242696

Test Plan:
```
~/fbsource/fbcode > buck test caffe2/test/cpp/jit:jit -- BackendTest.TestCompositeWithSetStates
Downloaded 0/5 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 19.2 sec (100%) 17619/17619 jobs, 3/17619 updated
  Total time: 19.5 sec
More details at https://www.internalfb.com/intern/buck/build/91542d50-25f2-434d-9e1a-b93117f4efe1
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: de9e27cf-4c6c-4980-8bc5-b830b7c9c534
Trace available for this run at /tmp/tpx-20210719-161607.659665/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425127206388
    ✓ ListingSuccess: caffe2/test/cpp/jit:jit - main (8.140)
    ✓ Pass: caffe2/test/cpp/jit:jit - BackendTest.TestCompositeWithSetStates (0.528)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425127206388
```

```
~/fbsource/fbcode > buck test caffe2/test/cpp/jit:jit -- BackendTest.TestConsistencyOfCompositeWithSetStates
Building: finished in 4.7 sec (100%) 6787/6787 jobs, 0/6787 updated
  Total time: 5.0 sec
More details at https://www.internalfb.com/intern/buck/build/63d6d871-1dd9-4c72-a63b-ed91900c4dc9
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 81023cd2-c1a2-498b-81b8-86383d73d23b
Trace available for this run at /tmp/tpx-20210722-160818.436635/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8725724325952153
    ✓ ListingSuccess: caffe2/test/cpp/jit:jit - main (7.867)
    ✓ Pass: caffe2/test/cpp/jit:jit - BackendTest.TestConsistencyOfCompositeWithSetStates (0.607)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8725724325952153
```

To check the `bytecode.pkl` using module inspector please check:
N1007089

Reviewed By: iseeyuan

Differential Revision: D29669831

fbshipit-source-id: 504dfcb5f7446be5e1c9bd31f0bd9c986ce1a647
2021-07-26 16:31:48 -07:00
a13f714b6d DOC: remove git stamp from release documentation version (#58486)
Summary:
CI built the documentation for the recent 1.9.0rc1 tag, but left the git suffix in the `version` string, so (as of now) going to https://pytorch.org/docs/1.9.0/index.html and looking at the version in the upper-left corner shows "1.9.0a0+git5f0bbb3", not "1.9.0". This PR should change that to cut off everything after and including the "a".

It should be cherry-picked to the release/1.9 branch so that the next rc will override the current documentation with a "cleaner" version.
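
One way to do the trimming described (a sketch, not necessarily the exact conf.py change):

```python
version = "1.9.0a0+git5f0bbb3"
release_version = version.partition("a")[0]   # -> "1.9.0"
```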

brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58486

Reviewed By: zou3519

Differential Revision: D28640476

Pulled By: malfet

fbshipit-source-id: 9fd1063f4a2bc90fa8c1d12666e8c0de3d324b5c
2021-07-26 16:28:59 -07:00
60070982d2 [Static Runtime] Fixed build failure in OSS due to test_utils (#62216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62216

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29917514

Pulled By: navahgar

fbshipit-source-id: 379863e6cd0b157de3bfa1482f5519b26654b3d2
2021-07-26 16:10:10 -07:00
962841b532 Fix subnet counting and re-enable check for multiple onnxifi ops in AOT (#62033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62033

Count the number of onnxifi ops rather than just the number of subnets, since when the subnet size < min_ops, it isn't turned into an onnxifi op.

Test Plan:
Runs which ran into the "Did not find a partition with an SLS node" error now report "multiple onnxifi ops found"
From https://fb.workplace.com/groups/527892364588452/permalink/807802049930814/:
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:rerun_aot -- --manifold_url="https://manifold.facebook.net/v0/read/tree/2021-06-30/onnxifi_caffe2_net_aot_input_arguments_01-55-32_711d9476?bucketName=dper3_job_meta&apiKey=dper3_job_meta-key&timeoutMsec=5000&withPayload=1"

```
Reran some failures from last week which now pass AOT:
From https://fb.workplace.com/groups/527892364588452/permalink/807802049930814/,
https://fb.workplace.com/groups/243933520351820/permalink/572715897473579/

```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:rerun_aot -- --manifold_url="https://manifold.facebook.net/v0/read/tree/2021-07-09/onnxifi_caffe2_net_aot_input_arguments_05-31-08_ef5393a6?bucketName=dper3_job_meta&apiKey=dper3_job_meta-key&timeoutMsec=5000&withPayload=1"
```
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:rerun_aot -- --manifold_url="https://manifold.facebook.net/v0/read/tree/2021-07-12/onnxifi_caffe2_net_aot_input_arguments_14-44-34_cfdf3053?bucketName=dper3_job_meta&apiKey=dper3_job_meta-key&timeoutMsec=5000&withPayload=1"
```
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:rerun_aot -- --manifold_url="https://manifold.facebook.net/v0/read/tree/2021-07-13/onnxifi_caffe2_net_aot_input_arguments_04-03-30_162e7e53?bucketName=dper3_job_meta&apiKey=dper3_job_meta-key&timeoutMsec=5000&withPayload=1"
```

Reviewed By: khabinov

Differential Revision: D29796893

fbshipit-source-id: e9de7529ef86745207d41643d0fbe932fa166437
2021-07-26 16:08:51 -07:00
037c4aa1d1 [fx2trt] flatten converter (#62202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62202

Add acc_ops.flatten converter. Also migrate to the OSS acc tracer for the TRT interpreter.

Test Plan: unit test

Reviewed By: khabinov

Differential Revision: D29861555

fbshipit-source-id: dac88a703fdbf386f3f7fb27674a67951f3add49
2021-07-26 15:49:01 -07:00
f883ed9095 irange-ify 8b (#62195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62195

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29887946

fbshipit-source-id: e3bd44721cf06a34ced47994810212be8460a2bb
2021-07-26 15:38:54 -07:00
f7743e92bf irange-ify 9 (#62118)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62118

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879670

fbshipit-source-id: 99b86ac7d65dfa2a47d0e6b7d65433200d18081e
2021-07-26 15:13:02 -07:00
026cfe85b4 Fix InlinedCallStack annotation to account for module calling its own (#61791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61791

methods from forward

During inlining we attach an InlinedCallStack to the nodes being inlined. In
the process we attach module information as well, such that if a
CallMethod is being inlined we know which class instance and class type
the method belongs to. However, CallMethod can be calling a method of
the same object to which the graph belongs, e.g.:

```
def forward(self, input):
  x = input + 10
  return forward_impl_(x, input)
```
Here forward_impl_ is a method defined on the same class in which forward
is defined. The existing module hierarchy annotation will mislabel this as
an unknown instance since the method is not associated with the output of
a GetAttr node (it would be if we had called self.conv.forward_impl_, for
example).
The change in this PR reconciles this by creating a placeholder name "SELF"
for the module instance, indicating that you can traverse the InlinedCallStack
backwards to find the first node with name != SELF, which is the name
of the object.
e.g.:
TOP(ResNet)::forward.SELF(ResNet)::_forward_impl.layer1(Sequential)::forward.0(BasicBlock)::forward.conv1(Conv2d)::forward.SELF(Conv2d)::_conv_forward

Test Plan:
Add test

Imported from OSS

Reviewed By: larryliu0820

Differential Revision: D29745443

fbshipit-source-id: 1525e41df53913341c4c36a56772454782a0ba93
2021-07-26 15:00:57 -07:00
f16102f72a Revert D29892919: Add squid proxy as egress cache
Test Plan: revert-hammer

Differential Revision:
D29892919 (e63160d735)

Original commit changeset: ac17227f2553

fbshipit-source-id: b78313147d60f26c1df68a25293e6b571ba66919
2021-07-26 14:42:28 -07:00
cf1f59452b Hacky support for meta tensor serialization. (#62192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62192

This support is hacky because it doesn't preserve meta tensor storage
sharing (e.g., if you serialize a model with shared storage, such as a
tensor and a view on that tensor, the viewing relationship will be broken
when I deserialize, and these become just different tensors). The
hack is also durable, in the sense that we will be on the hook for
supporting `_rebuild_meta_tensor_no_storage` in perpetuity in the
future, even if we change our mind about the serialization format.
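
A minimal illustration of the limitation described (assuming this patch is applied):

```python
import torch

t = torch.empty(4, device="meta")
v = t.view(2, 2)                      # v shares (meta) storage with t
torch.save({"t": t, "v": v}, "meta.pt")
loaded = torch.load("meta.pt")
# loaded["t"] and loaded["v"] no longer share storage: the viewing
# relationship is not preserved by this serialization path
```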

This unblocks an FB production use case. I didn't add C++ support to minimize
blast area of this patch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29910535

Pulled By: ezyang

fbshipit-source-id: d98dcdd0108dfc3ae730a071d3c583b6d0281d21
2021-07-26 14:33:45 -07:00
f0140a8c5f Disable cppcoreguidelines-non-private-member-variables-in-classes (#62212)
Summary:
This PR disables the `cppcoreguidelines-non-private-member-variables-in-classes` check. PyTorch makes use of `protected` members throughout the codebase, so we do not want to perform this clang-tidy check in CI; disabling it improves the signal-to-noise ratio.

Relevant failure: https://github.com/pytorch/pytorch/pull/61871/checks?check_run_id=3146453417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62212

Reviewed By: driazati

Differential Revision: D29917882

Pulled By: 1ntEgr8

fbshipit-source-id: f607c3d050a122e95136f9915060c4cda6694c9d
2021-07-26 14:14:05 -07:00
1343eea037 Fix clang-tidy line filtering logic (#62210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62210

Fixes #62204

Test Plan: #62211 clang-tidy should only error on the added lines (and not on context/removals)

Reviewed By: driazati

Differential Revision: D29917897

Pulled By: 1ntEgr8

fbshipit-source-id: de91dbf34c1ad8405507cad91ab3dd0d6c61d82e
2021-07-26 14:12:53 -07:00
2a83f24027 Enable macos clang-tidy installs (#62214)
Summary:
This PR enables installing our custom MacOS clang-tidy binaries. It also updates related documentation.

The binaries are produced by [this CI job](https://github.com/pytorch/test-infra/blob/master/.github/workflows/clang-tidy-macos.yml), and are published to S3.

This PR does not handle versioning of the downloaded binaries as this is being worked on separately. See https://github.com/pytorch/test-infra/issues/73

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62214

Test Plan:
On a MacOS machine, run
```bash
python3 -m tools.linter.install.clang_tidy
.clang-tidy-bin/clang-tidy --checks="*" --list-checks | grep "misc-max-tokens"
```

Reviewed By: janeyx99, mruberry

Differential Revision: D29917728

Pulled By: 1ntEgr8

fbshipit-source-id: 98d0d8b7a57bdebf0ebcdc83228ef391e8c6629e
2021-07-26 13:43:29 -07:00
f4136c5efc [Static Runtime] Enforce proper output dtype for many ops
Summary:
We previously had lots of ops with implementations like this:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = create_empty_like(input_0);
}
...
auto& out = p_node->Output(0);
some_func_out(inputs, out);
```
This would make the output have the correct shape. But it would
also take the dtype of `input_0`, which is not always correct.

This change transforms these blocks to:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = some_func(inputs)
} else {
  ...
  auto& out = p_node->Output(0);
  some_func_out(inputs, out);
}
```
This gives the output the correct shape and dtype.

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29887367

fbshipit-source-id: cef04bfa52ec082ad3a9a32aa27c44e275c6b24c
2021-07-26 13:27:02 -07:00
29bb3f4647 Refactor Tensor::to to call a primitive that is not copy_. (#61458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61458

Context
-------
functorch is unable to vmap(grad(f)) when f contains a .to
call. This is because .to (when it is not a no-op) decomposes
to .copy_ under grad and the .copy_ is not compatible with vmap.

Fix
 ---
The fix for this is to have all Tensor::to variants call a new operator,
`_to_copy`, that always copies and is a primitive w.r.t. autograd so
that autograd decomposes Tensor::to into a call to `_to_copy`.
(This is related to https://github.com/pytorch/pytorch/issues/60956,
please let me know if you want to bikeshed the naming).

In order to get this done I had to do a bit of refactoring. All of the
`::to` implementations now call `to_impl` which may call `_to_copy`.
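
A small example of the dtype-dependent differentiability discussed in the next section:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.to(torch.long)      # routed through _to_copy under autograd
print(y.requires_grad)    # False: integral outputs are not differentiable
z = x.to(torch.float64)
print(z.requires_grad)    # True
```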

Autograd codegen changes
------------------------

The second thing I had to do was modify the autograd codegen. Right now,
autograd assumes that every output is either statically known to be
differentiable or not differentiable at codegen time. `_to_copy` is a
little special because its differentiability depends on the output
dtype. e.g. `torch.randn(3, requires_grad=True).to(torch.long)` is non
differentiable. To get this to work:
- I changed how `output_differentiability` in derivatives.yaml work.
- output_differentiability can now accept "conditions" for each of the
output arguments. A "condition" is some C++ code.
- We currently only support `output_differentiability` with conditions
if there is a single output. This is for convenience and can be changed
in the future.
- I added a new `output_differentiability_conditions` field to
DifferentiabilityInfo. This gets populated in load_derivatives.yaml
- forward-mode and reverse-mode AD take
`output_differentiability_conditions` into account.

Here's how the generated code for `VariableType::_to_copy`
[looks
like](https://gist.github.com/zou3519/93462df4bda1837acee345205b7cc849)
No other autogenerated code gets modified by this PR.

Performance benchmarking
------------------------
- I benchmarked [three
cases that demonstrate overhead](https://gist.github.com/zou3519/5b6985e6906b80eec5a0dd94ed5b6a1a).
- Case A: No-op .to(). Instruction count went from 50223 to 25623. I
have no clue why but this is a good thing.
- Case B: not-no-op .to(). Instruction count went from 665291 to 671961.
This is expected; `_to_copy` adds an additional dispatch.
- Case C: not-no-op .to() forward pass and backward pass. Instruction count
went from 4022841 to 4030057. This PR adds
an additional dispatch to .to() (so there should be one additional
dispatch in the forward pass) so this number looks reasonable.

Test Plan
---------
- test_torch.py has a test_to
- test_cuda.py has test_to*
- test_autograd has tests (test_type_conversions) that exercise the
reverse-mode path
- test_ops.py has some tests (like log_softmax) that exercise the
reverse-mode and forward-mode AD path.
- test_quantization, test_namedtensor all exercise tensor.to as well.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29801652

Pulled By: zou3519

fbshipit-source-id: bb01eb1acf3d79d84f284150d1be4be3b4ace351
2021-07-26 13:02:39 -07:00
e63160d735 Add squid proxy as egress cache (#62103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62103

This PR adds a squid proxy that's deployed dedicated for PyTorch CI. Initially we only roll out to GHA, and if things are ok we will extend this to circleci tests if necessary.

`http_proxy` and `https_proxy` are compatible with the following http clients:

- curl
- wget
- python

Existing cache policy:

```
refresh_pattern -i .(7z|deb|rpm|exe|zip|tar|tgz|gz|ram|rar|bin|tiff|bz2|run|csv|sh)$ 1440 80% 2880
```

It uses the standard squid refresh_pattern for cache requests. In our setup, we tried
to cache at least (1440 minutes - 1 day) and at max (2880 minutes - 2 days), with
last-modified factor 80% ([squid doc](http://www.squid-cache.org/Doc/config/refresh_pattern/)). Please refer to [pytorch/test-infra](https://github.com/pytorch/test-infra/tree/master/aws/websites/squid-proxy) for details.

Right now, it only applies to the `build` and `test` steps, to limit the scope and make sure build and test are more reliable with the egress cache.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet, seemethere, janeyx99

Differential Revision: D29892919

Pulled By: zhouzhuojie

fbshipit-source-id: ac17227f2553ca62881711b3e9943488dfd8defd
2021-07-26 13:01:34 -07:00
d2594fa538 irange-ify 3 (#62112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62112

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879513

fbshipit-source-id: c01d18d34bb19014bf28d92c4d04b07e50a2770a
2021-07-26 12:56:58 -07:00
f5c6c3947e Remove Input Pointer Caching for XNNPack (#61959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61959

We no longer need to cache the input pointer, as XNNPACK has implemented a more robust approach where the indirection buffer does not need to be recalculated even if the activation tensor pointer changes, as long as the tensor dimensions are the same.

This reverses the changes in https://github.com/pytorch/pytorch/pull/42840/files

Reviewed By: kimishpatel

Differential Revision: D29777605

fbshipit-source-id: c1750538c17bce34f885c6f1bbb1f7164ebba25b
2021-07-26 12:02:15 -07:00
7ec6d1e857 irange-ify 2 (#62113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62113

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879507

fbshipit-source-id: 1fb114e44afe8c1407f648b705db7fd4edb9d6e3
2021-07-26 12:00:52 -07:00
6dc2c07304 [Reland] [DDP] Implement a hook which performs FunctionalSGD step. (#62177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177

Reland of https://github.com/pytorch/pytorch/pull/61678
Fix CI failure by gating including torchvision model on whether torchvision is available or not.
ghstack-source-id: 134282165

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29904101

fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5
2021-07-26 11:56:56 -07:00
1dfb687f3c Fixed off-by-one bug in Adam Smart Decay (#62135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62135

The initial implementation of Adam with Smart Decay had an off-by-one error.  This was in the summation of the geometric series used to calculate how much built-up momentum would have been discharged in skipped minibatches.

The unit tests should have caught this, but the testing strategy missed it because k, the "number of skipped minibatches," was always either 0 or so high that the impact of the bug was too small. The impact of the bug was proportional to 1/k. The testing strategy has also been adjusted to cover this bug.
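
A hypothetical sketch of the kind of geometric-series summation in question (the actual Smart Decay formula is not reproduced here):

```python
def discharged_momentum(beta, k):
    # sum_{i=1}^{k} beta**i = beta * (1 - beta**k) / (1 - beta);
    # an off-by-one version would sum from i=0, adding a spurious
    # leading term of 1 to the series
    return beta * (1.0 - beta**k) / (1.0 - beta)
```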

Differential Revision: D29889309

fbshipit-source-id: b086c0efed5c27f621061e726533c73658daffc6
2021-07-26 11:55:38 -07:00
dcb3eadc1f [quant][fix] Update quantization c++ tests to not run if CPU_STATIC_DISPATCH is specified (#62197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62197

For build configs with ATEN_CPU_STATIC_DISPATCH defined, quantization tests will fail since they
require QuantizedCPU dispatch to be enabled.
This will fix some internal test failures like https://www.internalfb.com/intern/test/844424941811803?ref_report_id=0 which are run under the `caffe2_aten_cpu_inference` project

Test Plan:
buck test mode/dev //caffe2/aten:quantized_test

Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29912742

fbshipit-source-id: b117eb9f4afb51e0d0dd52fbe9d5c5be7dfafe85
2021-07-26 11:39:45 -07:00
0ca5dc7f03 irange-ify 5 (#62114)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62114

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29879534

fbshipit-source-id: 0b1d6d2c9062a2fd7a55b00cb9f3d59ec941bad3
2021-07-26 11:07:54 -07:00
8e71f48f0a Handle simple NNAPI flatten NHWC case (#61796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61796

We can easily handle NNAPI conversion for NHWC inputs
that have 1 channel, or whose H and W are both 1.

Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_flatten

Imported from OSS

Reviewed By: saketh-are

Differential Revision: D29827735

fbshipit-source-id: 65dee4b42fceef1b032bf5dd1c4cc6e020d01e14
2021-07-26 10:59:04 -07:00
b73d759708 [fix] polygamma n>=1 (#61641)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/55357

TODO:
* [x] Use proper casting to avoid confusing the compiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61641

Reviewed By: albanD

Differential Revision: D29816592

Pulled By: mruberry

fbshipit-source-id: 2c020a6e4c325c1b5d15499a77fb39f9ba93dd79
2021-07-26 10:52:20 -07:00
ef7d572afa Ensure ShardedTensor handles list/tuple appropriately as size parameter. (#62109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62109

The `size` parameter only worked correctly for *args-like invocations
(e.g., 10, 20) and not for lists ([10, 20]) or tuples ((10, 20)). This PR ensures this
works similarly to `torch.empty`.
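
A hedged sketch of the now-equivalent spellings (module paths and spec construction reflect the API at the time and are assumptions):

```python
import torch.distributed._sharded_tensor as sharded_tensor
from torch.distributed._sharding_spec import ChunkShardingSpec

spec = ChunkShardingSpec(dim=0, placements=["rank:0/cuda:0"])
st1 = sharded_tensor.empty(spec, 10, 20)     # *args style
st2 = sharded_tensor.empty(spec, [10, 20])   # list
st3 = sharded_tensor.empty(spec, (10, 20))   # tuple
```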
ghstack-source-id: 134246166

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29884768

fbshipit-source-id: 7a4a3c5ed5d7c081344f6ead3170905b97fc652d
2021-07-26 10:31:32 -07:00
f9dce598a5 Add some missing cuda guards (#62100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62100

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29880330

fbshipit-source-id: 7089000ccbcaa70a13f0ab4531b032bd5326e539
2021-07-26 10:26:22 -07:00
200b6ccdc0 Catch saved tensors default hooks race condition (#61957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61957

If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29848525

Pulled By: Varal7

fbshipit-source-id: eb9bdcfbeed857a988834651246390ea14eedd33
2021-07-26 09:48:47 -07:00
f2369f12f9 Add logging for dynamic rendezvous (#61822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822

Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py

NOTE: This diff introduces the use of python's inspect module to easily allow for obtaining the calling method name and filename when logging. This module can mess with python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
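
A minimal sketch of the careful pattern described (illustrative only):

```python
import inspect

def caller_info():
    # grab only the strings we need and drop the frame reference
    # immediately; holding on to frames from inspect can interfere
    # with Python's garbage collector
    frame = inspect.currentframe().f_back
    try:
        return frame.f_code.co_filename, frame.f_code.co_name
    finally:
        del frame
```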

Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```

Reviewed By: aivanou

Differential Revision: D29643774

fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
2021-07-26 09:39:09 -07:00
6007ad3529 [Static Runtime] Refactor fb op tests to use testStaticRuntime (#62064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064

`testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library `test_utils` to facilitate code re-use. This also lets us test dynamic shapes in `test_fb_operators`

Reviewed By: hlu1

Differential Revision: D29858928

fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c
2021-07-26 08:25:10 -07:00
be17d6eadf Add default Saved Variable hooks (#61834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61834

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename, then unpack simply reads the content of the file and outputs a tensor, e.g.:

```
def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
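
A hedged usage sketch wiring these hooks up via the API named above (`model` and `x` are placeholders):

```python
torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
out = model(x)          # tensors saved for backward are packed to disk
out.sum().backward()    # unpack() reloads them when backward needs them
torch.autograd.graph.reset_saved_tensors_default_hooks()
```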

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29792193

Pulled By: Varal7

fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c
2021-07-26 08:14:32 -07:00
89ca638c18 ENH Adds no batch dim support for AdativeMaxPool*D (#61847)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61847

Reviewed By: suo

Differential Revision: D29883887

Pulled By: jbschlosser

fbshipit-source-id: de3fcf1cc3878b138ab766d2a50cc59c52ec5a60
2021-07-26 07:35:36 -07:00
394dd391dd [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D29904940

fbshipit-source-id: 16ce87cc328f2950ed95a12710b50c444e363c79
2021-07-26 03:41:55 -07:00
e6e8745bea [nnc] Add simplifierUnderContext for simplification that needs context info: currently added for-stmt index var bounds info as context (#60687)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60687

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29373315

Pulled By: huiguoo

fbshipit-source-id: 8729af60dd6d9735187b2118e3e83c75ef21789d
2021-07-25 23:30:13 -07:00
2299d6a013 Revert D29701447: [DDP] Implement a hook which performs FunctionalSGD step.
Test Plan: revert-hammer

Differential Revision:
D29701447 (bd95cf4473)

Original commit changeset: 183954593b82

fbshipit-source-id: 714e6a2b698147db9533a67783aed2a65d9d5bfe
2021-07-25 22:23:30 -07:00
457a3fb6d1 [bc-breaking][quant][graphmode][fx] Produce dequant - fp_op - quant pattern for copy nodes (#61763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61763

This PR changes the is_reference=True option for convert_fx to produce a dequant - fp_op - quant
pattern for copy nodes like maxpool op.

Before the PR:
```
def forward(self, x):
    maxpool2d_input_scale_0 = self.maxpool2d_input_scale_0
    maxpool2d_input_zero_point_0 = self.maxpool2d_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, maxpool2d_input_scale_0, maxpool2d_input_zero_point_0, torch.quint8);  x = maxpool2d_input_scale_0 = maxpool2d_input_zero_point_0 = None
    maxpool2d = self.maxpool2d(quantize_per_tensor);  quantize_per_tensor = None
    dequantize = maxpool2d.dequantize();  maxpool2d = None
    return dequantize
```

After (we expand the maxpool2d that works with quantized input to "dequant - maxpool2d - quant" pattern
```
def forward(self, x):
    maxpool2d_input_scale_0 = self.maxpool2d_input_scale_0
    maxpool2d_input_zero_point_0 = self.maxpool2d_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, maxpool2d_input_scale_0, maxpool2d_input_zero_point_0, torch.quint8);  x = maxpool2d_input_scale_0 = maxpool2d_input_zero_point_0 = None
    dequantize = quantize_per_tensor.dequantize();  quantize_per_tensor = None
    maxpool2d = self.maxpool2d(dequantize);  dequantize = None
    maxpool2d_output_scale_0 = self.maxpool2d_output_scale_0
    maxpool2d_output_zero_point_0 = self.maxpool2d_output_zero_point_0
    quantize_per_tensor_1 = torch.quantize_per_tensor(maxpool2d, maxpool2d_output_scale_0, maxpool2d_output_zero_point_0, torch.quint8);  maxpool2d = maxpool2d_output_scale_0 = maxpool2d_output_zero_point_0 = None
    dequantize_1 = quantize_per_tensor_1.dequantize();  quantize_per_tensor_1 = None
    return dequantize_1
```

note that the call to self.maxpool2d is expanded to
```
    dequantize = quantize_per_tensor.dequantize();  quantize_per_tensor = None
    maxpool2d = self.maxpool2d(dequantize);  dequantize = None
    maxpool2d_output_scale_0 = self.maxpool2d_output_scale_0
    maxpool2d_output_zero_point_0 = self.maxpool2d_output_zero_point_0
    quantize_per_tensor_1 = torch.quantize_per_tensor(maxpool2d, maxpool2d_output_scale_0, maxpool2d_output_zero_point_0, torch.quint8);  maxpool2d = maxpool2d_output_scale_0 = maxpool2d_output_zero_point_0 = None
```

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_copy_node_has_shared_actpp_instance
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29728900

fbshipit-source-id: cf2c7f1f6659e3ba97cbb7c6204dd13983da10bd
2021-07-25 19:49:13 -07:00
76d3cdf9df [quant] Add from_blob_quantized_per_channel API (#62049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62049

Adds a new function that accepts qint data blobs as input and creates a per-channel quantized tensor using the pre-allocated data and the provided scale and zero_point inputs
Addresses issue #61777

Test Plan:
./build/bin/quantized_test --gtest_filter='TestQTensor.FromBlobQuantizedPerChannel'

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D29854136

fbshipit-source-id: da6ecd3fb59a6f40ae88430fdd5d895f93d5411c
2021-07-25 14:09:38 -07:00
7195b78a59 [quant] Add from_blob_quantized_per_tensor API (#61986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61986

Adds a new function that accepts qint data blobs as input and creates a quantized tensor using the pre-allocated data and the provided scale and zero_point inputs
Addresses issue https://github.com/pytorch/pytorch/issues/61777

Test Plan:
./build/bin/quantized_test --gtest_filter='TestQTensor.FromBlobQuantizedPerTensor'

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D29831135

fbshipit-source-id: b08299bbe9e939fedff98a585e6b12c14d31f17e
2021-07-25 14:08:25 -07:00
bd95cf4473 [DDP] Implement a hook which performs FunctionalSGD step. (#61678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678

This diff makes the following changes:
- Add a `step_param` method to the `_FunctionalSGD` class, written similarly to `step` but for a single param
- Implement a communication hook wrapper that runs a given comm. hook and then applies the functional SGD step
- Verify that this is equal to a regular allreduce + SGD optimizer

ghstack-source-id: 133567598
ghstack-source-id: 134263399

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29701447

fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e
2021-07-25 13:36:47 -07:00
8152433de2 [1/n] Update testing lib*.so path (#61960)
Summary:
### Issue

Build PyTorch wheel packages during the build stage for pull requests and install them during the test stage.

### Fix
Update all tests which call lib*.so (under the `./build` folder); change them to call lib*.so in `{ent}/pytorch/lib/python3.8/site-packages/torch`

### Diff
This diff starts by updating test_fx, test_backend and test_torchbind first, to check whether the current CI passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61960

Test Plan: check that all CI workflows pass

Reviewed By: malfet, saketh-are

Differential Revision: D29823235

Pulled By: tktrungna

fbshipit-source-id: e7f652def698e303d4843fbaedf4859f5eca2fd9
2021-07-24 05:16:35 -07:00
956f1c981e fix a typo (#61061)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61061

Reviewed By: navahgar, Gamrix

Differential Revision: D29495806

Pulled By: Krovatkin

fbshipit-source-id: 510de724e3108c52af1b25b8ab53ae3c895b55f9
2021-07-24 00:35:58 -07:00
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
d2e03dc484 [fx2trt] Add support for explicit batch dimension (#62110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62110

Add an option to opt in to explicit batch dimension. Extend unit tests to cover both scenarios (implicit and explicit). Fixed some converters that didn't work with explicit batch dimension before.

Add broadcast support and a generic function for adding elementwise binary ops.

Follow-ups:
1. Add dynamic shape support in explicit batch dimension mode, at least to allow different batch dimensions.
2. Extend the layer_norm plugin `PluginV2Ext` to make it work in explicit batch dimension mode.

Test Plan: unit tests

Reviewed By: jackm321

Differential Revision: D29798239

fbshipit-source-id: 91d47c6155d2473ed4a6f8d2816715a32c61b869
2021-07-23 22:54:07 -07:00
cc263ef795 [bc-breaking][quant][graphmode][fx] Add observer/fake_quant for copy nodes (#61687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61687

Previously we did not insert an observer/fake_quant for the output of copy nodes (e.g. maxpool).
But to produce reference patterns we need to insert an observer/fake_quant for the output and later convert that to a quantize
node.

Model:
```
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool2d = torch.nn.MaxPool2d(kernel_size=3)

    def forward(self, x):
        x = self.maxpool2d(x)
        return x
```
result of prepare:

Before:
```
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x);  x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0);  x_activation_post_process_0 = None
    return maxpool2d
```

After:
```
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x);  x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0);  x_activation_post_process_0 = None
    maxpool2d_activation_post_process_0 = self.maxpool2d_activation_post_process_0(maxpool2d);  maxpool2d = None
    return maxpool2d_activation_post_process_0
```

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29715566

fbshipit-source-id: 817df9b2933a35cad5331d8d8ce1c5ba0752e9df
2021-07-23 21:29:37 -07:00
78f7d8ccfa [Static Runtime] Remove wrappers for aten::cat (#62067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62067

The wrapper for aten::cat is no longer needed after the variadic cat change in D29565344 (ae58a4c45d).
Also added a simple test for dynamic shapes, i.e., the input tensors in args2 are larger than in args1.

Reviewed By: navahgar, mikeiovine

Differential Revision: D29864600

fbshipit-source-id: 44a712c2e776815c09e0bf5631412149b81274b2
2021-07-23 20:33:41 -07:00
7c09de8384 [torch deploy] add support for Python C extension modules (#58117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58117

Previously it was not possible to load C extension modules with deploy because extension
modules need to link against the Python.h API functions. Since
each libtorchdeploy_interpreter.so has its own copy of these functions, there is no way
to tell dlopen to resolve symbols in a loaded SO from one of these libraries without exposing
its symbols globally.

This patch adds a custom ELF loader that attaches C extension libraries
to the Python API of the interpreter that loaded the shared library. Simple use of the numpy and regex modules appears to work.

This diff has some limitations:

* 64-bit Linux only. OSX and Windows use different formats for shared libraries. 32-bit ELF files are not supported.
* Debug info is not immediately available to debuggers. A script for lldb is provided which can be loaded
  so that lldb knows about the libraries as they are loaded.
* Shared libraries can directly use the Python API, but libraries they depend on
  (via DT_NEEDED entries in their dynamic segment) may not use Python. In the future, we can
  try to detect whether a sub-library uses the Python API and load it with our custom loader.
* TLS initialization and library initialization may occur in a different order than they would with dlopen,
  potentially leading to some issues running destructors in TLS segments. Use of these C++ features is relatively rare.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28435305

Pulled By: zdevito

fbshipit-source-id: 10f046053dd1d250e3c73f2cce8eb945eeba31b6
2021-07-23 19:58:54 -07:00
e856a45283 [Model Averaging] Refactor averagers to accept parameters instead of a module (#62105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105

This is in preparation for wrapping the averager as an optimizer, which can only accept parameters rather than a module.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134213572

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters

Reviewed By: rohan-varma

Differential Revision: D29883693

fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964
2021-07-23 18:39:45 -07:00
41f7a9dac0 [profiler][refactor] Avoid using legacy event in profiler (#61721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61721

Remove dependency on LegacyEvent from the profiler

Test Plan:
python test/test_profiler.py -v

Imported from OSS

Reviewed By: kimishpatel, gdankel

Differential Revision: D29716769

fbshipit-source-id: 2c2b48f2ee096adcbde09821e0cc7c0fcb94d19f
2021-07-23 18:28:08 -07:00
06a3b23971 [android] Lite interpreter module to load from assets (#61609)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61609

Test Plan: Imported from OSS

Reviewed By: cccclai

Differential Revision: D29688641

Pulled By: IvanKobzarev

fbshipit-source-id: 7857bad51e91eae7c90a1218d463f3767f4fae15
2021-07-23 17:51:18 -07:00
643e58466e [nnc] Rename IRSimplifierBase to PolynomialBase (#60686)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60686

Test Plan: Imported from OSS

Reviewed By: navahgar, soulitzer

Differential Revision: D29373316

Pulled By: huiguoo

fbshipit-source-id: bd44bff60455076d1c5291273989e9939a428f9a
2021-07-23 17:18:41 -07:00
046272f3e5 [6/N] Nnapi Backend Delegate: Comprehensive OSS Tests (#61782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61782

This PR depends on https://github.com/pytorch/pytorch/pull/61787

### Summary:
Added more comprehensive tests for Android NNAPI delegate.
Previously, there was only one basic test for lowering a PReLU module with the NNAPI delegate. Now, more tests are inherited from `test_nnapi.py`, the file for testing NNAPI conversion and execution without the delegate.

**test_backend_nnapi.py**
Test file for Android NNAPI delegate.
- `TestNnapiBackend` class inherits tests from `test_nnapi.py` and overrides the model conversion to use the delegate API.
- Includes an extra test for passing input arguments as Tensors and Tensor Lists.
- Has extra setup for loading the NNAPI delegate library and changing the default dtype from float64 to float32 (dtype is typically float32 by default, but not in delegate backend unit tests)

**test_nnapi.py**
Test file for Android NNAPI without the delegate.
- Some code was refactored to allow override of only the NNAPI conversion call.
- An extra function was added to allow the NNAPI delegate unit test to turn off the model execution step. Once the NNAPI delegate's execution implementation is complete, this may no longer be necessary.

### Test Plan:
I ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` to run both test files.

Test Plan: Imported from OSS

Reviewed By: raziel, iseeyuan

Differential Revision: D29772005

fbshipit-source-id: 5d14067a4f6081835699b87a2ece5bd6bed00c6b
2021-07-23 17:04:07 -07:00
f03e7170f0 ENH Updates docs and tests for regression modules that already support no-batch-dims (#61461)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

This PR does not use `check_sum_reduction` because I wanted to test every reduction option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61461

Reviewed By: suo

Differential Revision: D29883744

Pulled By: jbschlosser

fbshipit-source-id: cdad0effb41f0484938caad0d4c9d6d83e2aec07
2021-07-23 16:40:17 -07:00
1ec6205bd0 ENH Adds no_batch_dim support for maxpool and unpool for 2d and 3d (#61984)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

(Interesting how the maxpool tests are currently in `test/test_nn.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61984

Reviewed By: suo

Differential Revision: D29883846

Pulled By: jbschlosser

fbshipit-source-id: 1e0637c96f8fa442b4784a9865310c164cbf61c8
2021-07-23 16:14:10 -07:00
f4ffaf0cde Fix type promotion for cosine_similarity() (#62054)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62054

Reviewed By: suo

Differential Revision: D29881755

Pulled By: jbschlosser

fbshipit-source-id: 10499766ac07b0ae3c0d2f4c426ea818d1e77db6
2021-07-23 15:20:48 -07:00
e408af083f Improve MHA docs (#61977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60831
Also clarifies the relationship between `embed_dim` and `num_heads` (see https://github.com/pytorch/pytorch/issues/60853 and https://github.com/pytorch/pytorch/issues/60445).
Formatting was overhauled to remove some redundancy between the input docs and shape docs; suggestions / comments welcome!

Link to rendered docs here: https://14912919-65600975-gh.circle-artifacts.com/0/docs/generated/torch.nn.MultiheadAttention.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61977

Reviewed By: bhosmer

Differential Revision: D29876884

Pulled By: jbschlosser

fbshipit-source-id: a3e82083219cc4f8245c021d309ad9d92bf39196
2021-07-23 15:19:34 -07:00
cf3cc01f1d [Static Runtime] Add is_frozen to StaticModule ctor (#62020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62020

Add is_frozen to StaticModule ctor so we can skip freezing in StaticModule.

Reviewed By: ajyu, mikeiovine

Differential Revision: D29807431

fbshipit-source-id: 7742e9f5c5ae9f442a9e4007c870a14fd8b4af20
2021-07-23 15:12:35 -07:00
fa11103c6a [clang-tidy] Fix unknown GNU flag error (#62128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62128

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29888297

Pulled By: 1ntEgr8

fbshipit-source-id: 0657d5baa72c014a83c9def4a39338c52f4ef8d1
2021-07-23 14:46:51 -07:00
9730d91abd MAINT Migrates multilabel_margin_loss from THC to ATen (CUDA) (#60708)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24603
Fixes https://github.com/pytorch/pytorch/issues/24602

<s>The implementation should be exactly the same, so it is strange that the benchmarks show such a significant improvement in this PR.</s>

The benchmarks are now the same.

<details>
 <summary>Benchmark script</summary>

```python
from itertools import product

import torch
import torch.nn as nn
import torch.nn.functional as F
import time

torch.manual_seed(0)
MS_PER_SECOND = 1000

def _time():
    torch.cuda.synchronize()
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
n_runs = 100
reductions = ["none", "sum", "mean"]
Ns = [1_000, 10_000, 100_000]

for reduction, N in product(reductions, Ns):
    total_fwd_time = 0
    total_back_time = 0
    grad_out = torch.randn(N, device=device)
    if reduction != "none":
        grad_out = grad_out[0]

    for _ in range(n_runs):
        input = torch.randn(N, C, device=device, requires_grad=True)
        target = torch.randint(0, C, size=input.size(), device=device)

        # forward
        start = _time()
        result = F.multilabel_margin_loss(input, target, reduction=reduction)
        total_fwd_time += _time() - start

    result = F.multilabel_margin_loss(input, target, reduction=reduction)
    for _ in range(n_runs):
        # backward
        start = _time()
        result.backward(grad_out, retain_graph=True)
        total_back_time += _time() - start

    fwd_avg = total_fwd_time / n_runs
    bwd_avg = total_back_time / n_runs
    print(
        f"input size({N}, {C}), reduction: {reduction}, fwd: {fwd_avg:.2f} (ms), back: {bwd_avg:.2f} (ms)"
    )
```

</details>

## master

```
input size(1000, 30), reduction: none, fwd: 0.14 (ms), back: 0.41 (ms)
input size(10000, 30), reduction: none, fwd: 1.26 (ms), back: 3.58 (ms)
input size(100000, 30), reduction: none, fwd: 13.15 (ms), back: 34.68 (ms)
input size(1000, 30), reduction: sum, fwd: 0.14 (ms), back: 0.38 (ms)
input size(10000, 30), reduction: sum, fwd: 1.16 (ms), back: 3.53 (ms)
input size(100000, 30), reduction: sum, fwd: 13.04 (ms), back: 34.53 (ms)
input size(1000, 30), reduction: mean, fwd: 0.14 (ms), back: 0.38 (ms)
input size(10000, 30), reduction: mean, fwd: 1.17 (ms), back: 3.52 (ms)
input size(100000, 30), reduction: mean, fwd: 13.12 (ms), back: 34.54 (ms)
```

## this PR

```
input size(1000, 30), reduction: none, fwd: 0.14 (ms), back: 0.35 (ms)
input size(10000, 30), reduction: none, fwd: 1.22 (ms), back: 2.98 (ms)
input size(100000, 30), reduction: none, fwd: 12.90 (ms), back: 29.32 (ms)
input size(1000, 30), reduction: sum, fwd: 0.14 (ms), back: 0.32 (ms)
input size(10000, 30), reduction: sum, fwd: 1.16 (ms), back: 2.97 (ms)
input size(100000, 30), reduction: sum, fwd: 13.00 (ms), back: 29.17 (ms)
input size(1000, 30), reduction: mean, fwd: 0.14 (ms), back: 0.32 (ms)
input size(10000, 30), reduction: mean, fwd: 1.17 (ms), back: 2.97 (ms)
input size(100000, 30), reduction: mean, fwd: 13.09 (ms), back: 28.91 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60708

Reviewed By: saketh-are

Differential Revision: D29856579

Pulled By: ngimel

fbshipit-source-id: b6bbf27a71e5a04f61779f6fef4ed1c98baa2607
2021-07-23 13:45:28 -07:00
a6c6fd923e [profiler] Nvtx support (#61634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61634

The legacy profiler supported Nvtx, and that was used by emit_nvtx. This PR
adds support for Nvtx in the new profiler, to prepare for the eventual
deprecation of the legacy profiler.

Test Plan:
Verified that the profiles produced with nvprof are the same:
```
import torch
import torchvision.models as models
from torch.autograd.profiler import emit_nvtx, load_nvprof

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with emit_nvtx(record_shapes=True):
  model(inputs)
```
```
/usr/local/cuda/bin/nvprof -o test_trace2.prof -f -- python test_emit_nvtx.py
```
```
evt = load_nvprof("/home/iliacher/local/pytorch/test_trace.prof")
```

Imported from OSS

Reviewed By: kimishpatel, gdankel

Differential Revision: D29691316

fbshipit-source-id: 1e186cc072368f3e3987a2da0bfd90ed328817c5
2021-07-23 13:37:09 -07:00
812bc1dde6 Smart Decay for Adam - DPER3 (#62058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62058

This is the second diff in this stack.  This diff includes the changes to DPER3; the first diff includes the changes to Caffe2.

We want to decay learning parameters properly. Previously this was not done when a parameter was absent from a minibatch. We fix this by keeping track of missed minibatches and making the decay catch up accordingly.

The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. Strictly, for the parameters that are absent, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2, respectively.

To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen.

We hope this will significantly improve the inconsistent learning parameter issue we have seen with Adam.
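
A minimal sketch of the catch-up decay described above (class and method names here are hypothetical; the real change lives in the Caffe2 Adam operators, and bias correction is omitted for brevity):

```python
import torch

class SmartDecayAdam:
    """Sketch: Adam whose moment EMAs catch up for rows missed in earlier minibatches."""

    def __init__(self, num_rows, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = torch.zeros(num_rows, dim)   # EMA of gradients (first moment)
        self.v = torch.zeros(num_rows, dim)   # EMA of squared gradients (second moment)
        self.last_seen = torch.zeros(num_rows, dtype=torch.long)
        self.step = 0

    def update(self, rows, grads, params):
        self.step += 1
        # k = minibatches since each row was last seen (k == 1 if seen every step)
        k = (self.step - self.last_seen[rows]).unsqueeze(1).float()
        # decaying by beta^k is equivalent to having added a zero gradient to the
        # EMA on every missed minibatch, without touching every row on every step
        self.m[rows] = self.beta1 ** k * self.m[rows] + (1 - self.beta1) * grads
        self.v[rows] = self.beta2 ** k * self.v[rows] + (1 - self.beta2) * grads ** 2
        self.last_seen[rows] = self.step
        params[rows] -= self.lr * self.m[rows] / (self.v[rows].sqrt() + self.eps)
```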

Differential Revision: D29638897

fbshipit-source-id: 18d8e227d72c2e23010ca81e0f6eeb78872c8d3c
2021-07-23 13:26:30 -07:00
5224490ae9 Implement NumPy-like frombuffer tensor constructor. (#59077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59077

Fixes #58549

`frombuffer` constructs a tensor object from an already-allocated buffer through CPython's buffer protocol. Besides the standard `dtype`, `count`, and `offset` parameters, this function also accepts:

- `device`: where the buffer lives
- `requires_grad`: should autograd record operations on the new tensor
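
A minimal usage sketch under those parameters (assuming the constructor lands as `torch.frombuffer`; `offset` is in bytes, `count` in elements):

```python
import array

import torch

buf = array.array('f', [1.0, 2.0, 3.0, 4.0])  # exposes CPython's buffer protocol

# share memory with buf: skip the first element (4 bytes), take 3 elements
t = torch.frombuffer(buf, dtype=torch.float32, count=3, offset=4)
print(t)      # tensor([2., 3., 4.])

buf[1] = 9.0  # the tensor is a view of the buffer, so it observes the write
print(t[0])   # tensor(9.)
```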

A new test file _test_buffer_protocol.py_ was created. Currently, only CPU tests are implemented. That's because neither PyTorch nor Numba implements CPython's buffer protocol, so there's no way to create a CUDA buffer with the existing dependencies (PyCUDA could be used for that, though).

At the moment, if `device` differs from the device the buffer actually lives, two things
may happen:

- `RuntimeError`, if `device='cuda'`
- Segmentation fault (not tested -- see above), if `device='cpu'`

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29870914

Pulled By: mruberry

fbshipit-source-id: 9fa8611aeffedfe39c9af74558178157a11326bb
2021-07-23 13:17:48 -07:00
ec4e6181e6 [Static Runtime] Fix broken test_static_runtime build (#62098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62098

The build was broken by D29821533 (1d2ea76afb). The `clamp` overloads used in `deep_wide.h`
are no longer available in the `at::native` namespace.

Use `at::cpu::clamp` and `at::cpu::clip_out` (which should be an alias for clamp) instead.

Reviewed By: hlu1

Differential Revision: D29880187

fbshipit-source-id: 210b6d2be8a8142e7af1a0ba07e55a95b1a77d25
2021-07-23 12:35:09 -07:00
b820493cf1 [skip ci] Refactor CIFlow init logic (#62102)
Summary:
This PR refactors the CIWorkflow post_init step to best account for how CIFlow interacts with everything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62102

Test Plan: This PR did NOT result in any workflow changes. I ran mypy and flake8 on the changed file locally with no issues.

Reviewed By: jbschlosser

Differential Revision: D29883275

Pulled By: janeyx99

fbshipit-source-id: 6c5c1fc1878159e0de1bf8d9bd0cb32aa47af49a
2021-07-23 12:29:04 -07:00
71cfbc45b4 Remove redundant torch.cuda.set_device(self.rank) (#62097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62097

as title
ghstack-source-id: 134196740

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_profiling_autograd_profiler

Reviewed By: rohan-varma

Differential Revision: D29880040

fbshipit-source-id: 6a06fb2d87e9a7dfa1d7c81bf0c3fe115c1a1abb
2021-07-23 11:59:16 -07:00
5ef667a8b8 Remove duplicated movedim implementation (#61939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61939

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D29850798

Pulled By: zou3519

fbshipit-source-id: e803b235d8535a204515ff9f5d46b8c4d191b73c
2021-07-23 11:52:07 -07:00
10ccc5a81c remove randn? from torch.testing namespace (#61840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61840

Redo of #60859.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29871017

Pulled By: mruberry

fbshipit-source-id: 47afed1dc6aa0bb1e826af616ef5d5aaabb8e5bb
2021-07-23 11:51:03 -07:00
cb47d1f9c8 OpInfo Ref: fmod, remainder (#61527)
Summary:
See https://github.com/pytorch/pytorch/issues/54261 for OpInfo tracker.

This PR:

* [x] Adds references to both `fmod` and `remainder` for testing.
* [x] Updates `remainder` documentation to add a note on divergence with `std::remainder`. (something similar to NumPy's note: https://numpy.org/doc/1.20/reference/generated/numpy.remainder.html), see: https://github.com/pytorch/pytorch/pull/61527#discussion_r670238788 for further discussion.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61527

Reviewed By: albanD

Differential Revision: D29841266

Pulled By: mruberry

fbshipit-source-id: be99851a94f53ea2fc07b64fd7c947775129658c
2021-07-23 11:44:32 -07:00
c9b71549f2 don't allow alias dispatch keys to go in the DispatchKeySet (#61771)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61771

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D29736432

Pulled By: bdhirsh

fbshipit-source-id: 54bb716db1e41565b00f4f01ea0096f834087577
2021-07-23 11:29:46 -07:00
143ef016ee Throw RuntimeError when numpy() is called on a tensor with conjugate or negative bit set (#61925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61925

Resolves https://github.com/pytorch/pytorch/issues/59945 and https://github.com/pytorch/pytorch/issues/59946

BC-breaking note: unlike before, complex_tensor.conj().numpy(), complex_float_tensor.conj().view(torch.float64), and complex_float_tensor.conj().imag.view(torch.int32) no longer return views but instead error out.
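
A small illustration of the new behavior (assuming `Tensor.resolve_conj()` is available to materialize the conjugation):

```python
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
zc = z.conj()                    # lazy: only sets the conjugate bit, no copy

# zc.numpy() now raises a RuntimeError instead of silently returning a view
arr = zc.resolve_conj().numpy()  # materialize the conjugation, then export
print(arr)                       # [1.-2.j 3.+4.j]
```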

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29819288

Pulled By: anjali411

fbshipit-source-id: 4bebec721eb535f44ef4b728bdc75fa444e05d16
2021-07-23 11:28:36 -07:00
943ca5f6f7 [special] alias for mvlgamma (#61633)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Have added `out` variant for consistency.
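
A quick usage sketch of the alias (assuming it lands as `torch.special.multigammaln`, per the docs link below):

```python
import torch

x = torch.tensor([2.0, 3.0])
print(torch.special.multigammaln(x, 2))  # same values as the original op
print(torch.mvlgamma(x, p=2))
```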

TODO:
* [x] Check docs https://docs-preview.pytorch.org/61633/special.html#torch.special.multigammaln

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61633

Reviewed By: albanD

Differential Revision: D29815514

Pulled By: mruberry

fbshipit-source-id: 003c7b6a5938ecc7a96727310e8a39da0b3d7aca
2021-07-23 11:24:27 -07:00
0c55f1bdec [torchelastic] Improve process termination logic (#61602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602

The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.

When any of these signals is received, the termination handler raises a `SignalException`. The exception is then processed by the main agent loop: `shutdown(signum)` is invoked, which propagates the received signal to the child processes. A default 30-second timeout is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.
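
A minimal sketch of that handler pattern (`shutdown` and the loop below are hypothetical stand-ins for the real torchelastic entry points):

```python
import signal
import time

class SignalException(Exception):
    """Raised in the agent process when SIGTERM or SIGINT arrives."""
    def __init__(self, signum):
        super().__init__(f"received signal {signum}")
        self.signum = signum

def _termination_handler(signum, frame):
    raise SignalException(signum)

def shutdown(signum):
    # hypothetical stand-in: propagate signum to the child processes, wait up
    # to the 30 second grace period, then SIGKILL anything still alive
    print(f"propagating signal {signum} to child processes")

signal.signal(signal.SIGTERM, _termination_handler)
signal.signal(signal.SIGINT, _termination_handler)

try:
    while True:        # stand-in for the main agent loop
        time.sleep(1)
except SignalException as e:
    shutdown(e.signum)
```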

Test Plan: unittests, sandcastle

Reviewed By: cbalioglu

Differential Revision: D29671783

fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
2021-07-23 11:00:15 -07:00
e42360d56f Remove default arguments before calling to __torch_dispatch__ (#61123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61123

This applies the design pattern of removing explicit arguments when they
coincide with the default arguments.  This simplifies argument patterns
that dispatch kernels receive and make it easier for us to maintain BC
(as addition of a new default argument isn't immediately BC-breaking
for dispatch implementors).

There is an important extra API which I haven't implemented here yet,
which is to take an incomplete sequence of arguments and fill out their
defaults (in case the user did want normalization).  I plan on adding
that in a future PR.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D29853616

Pulled By: ezyang

fbshipit-source-id: 71c672cb3a7d4d01f838a1c7fcdb75a8ce7d058e
2021-07-23 10:41:35 -07:00
32d0c3e8ee Support for reference convert_fx working on gpu
Summary:
This PR enables gpu only quantization, best used with is_reference since
there are not many gpu kernels for ops as of now.

This PR mainly changes how qconfigs and their observer constructors operate once they
are attached to a module's qconfig. The function add_module_to_qconfig_obs_ctr takes the observer constructors on the original
qconfig and configures them so that, when invoked, the created observers will
be on whatever device the module occupies. (Once observers are created,
module.to(device) is already set up so that it moves any observers.) To do this,
a new method and a few small changes were added to the _PartialWrapper class that
our observers already use to create constructors (without changing the
existing functionality). These changes work in
concert with changes to the prepare flow, such that when the qconfigs are
propagated to the modules (in quantize.py and qconfig_utils.py) they are configured using add_module_to_qconfig_obs_ctr.

Ideally this would work on other models, but the is_reference support for
a lot of modules isn't there yet; those tests should be added in a
future PR.
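
A usage sketch of the flow this enables (API paths as of this era of FX graph mode quantization; the particular qconfig choice is an assumption):

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import convert_fx, prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).cuda().eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}

prepared = prepare_fx(model, qconfig_dict)   # observers land on the model's device
prepared(torch.randn(2, 4, device="cuda"))   # calibrate on the GPU
quantized = convert_fx(prepared, is_reference=True)  # reference quantized model
```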

Test Plan:
python test/test_quantization.py TestQuantizeFxModels.test_static_gpu_convert_basic

python test/test_quantization.py TestQuantizeFxModels.test_switch_device_prepare_convert

python test/test_quantization.py TestQuantizeFxModels.test_prepare_serialize_switch_device_convert

python test/test_quantization.py TestQuantizeFx.test_qconfig_precedence

Reviewed By: vkuzo

Differential Revision: D29684114

fbshipit-source-id: 19fefb8e1998eaf212723e836276ccf39467f2e7
2021-07-23 10:30:38 -07:00
0df1679e5c BatchNorm: fix mixed precision usage with affine=False (#61962)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61924

The fused backward kernel was using the weight dtype to detect mixed-precision usage, but the weights can be None while the `running_mean` and `running_var` can still be in mixed precision. So, I updated the check to look at those variables as well.
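
A repro-style sketch of the affected configuration (requires a CUDA device; with `affine=False` the weights are `None` while the running stats remain float32):

```python
import torch

bn = torch.nn.BatchNorm2d(8, affine=False).cuda()  # weight and bias are None
x = torch.randn(4, 8, 16, 16, device="cuda", dtype=torch.half, requires_grad=True)

out = bn(x)           # mixed precision: half input, float32 running stats, no weights
out.sum().backward()  # previously mis-detected because only the weight dtype was checked
```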

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61962

Reviewed By: albanD

Differential Revision: D29825516

Pulled By: ngimel

fbshipit-source-id: d087fbf3bed1762770cac46c0dcec30c03a86fda
2021-07-23 09:55:52 -07:00
e318058ffe Ignore LNK4099 for debug binary libtorch builds (#62060)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62060

Test Plan:
This CI shouldn't break
and https://github.com/pytorch/pytorch/pull/62061

Reviewed By: driazati

Differential Revision: D29877487

Pulled By: janeyx99

fbshipit-source-id: 497f84caab3f9ae609644fd397ad87a6dc8a2a77
2021-07-23 09:31:41 -07:00
04c95a0638 ns for fx: expose hook to define custom weight extraction functions (#62047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62047

Adds a hook for user to define a weight extraction function for a
custom type.

Example usage:
```
op_to_type_to_weight_extraction_fn = \
    get_op_to_type_to_weight_extraction_fn()
op_to_type_to_weight_extraction_fn['call_function'][_wrapped_linear] = \
    torch.quantization.ns.weight_utils.get_linear_fun_weight

results = extract_weights_impl(
    'a', m1, 'b', m2,
    op_to_type_to_weight_extraction_fn=op_to_type_to_weight_extraction_fn)
```

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29853625

fbshipit-source-id: 183916ef54ba303bc818e0eba00b52e33c4633ad
2021-07-23 09:31:37 -07:00
07c6a12008 ns for fx: fix typing issue in weight extraction (#62041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62041

Before this PR, weights of conv and linear modules were extracted
as lists, in order to match the signature of LSTM weights.

After this PR, weight extraction preserves the type of the weights,
so extracted weights of conv and linear have a different type
from LSTM weights.  The comparison util functions are updated to
handle the LSTM weight type of `List[Tensor]`.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29853626

fbshipit-source-id: 93da5b9b0b174679c61528d02b6b902cb064444e
2021-07-23 09:31:33 -07:00
eaba16d665 ns for fx: change weight extraction to direct mapping (#62038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62038

Updates the logic to extract weights from nodes to use a
direct mapping from type to weight extraction function.

This is needed for a future PR which will allow users to
specify custom weight extraction functions for user defined
types.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29853627

fbshipit-source-id: 3ef90ef4bd7b28f6316c0af215a2bd3ff8a2aeca
2021-07-23 09:30:08 -07:00
8a2c525d3b Fix some sign comparisons (#61849)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61849

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29736180

fbshipit-source-id: 1391b11e73725ee985b9aa768566ca77f44d04ae
2021-07-23 09:03:33 -07:00
9d4056468e Migrate scheduled jobs debuggability to GHA (#62056)
Summary:
This removes the debuggable-ci workflow in Circle and enables the same idea in GHA, to allow contributors to run scheduled GHA workflows by:
1. assigning the PR to pytorchbot.
2. labeling the PR with ciflow/scheduled
3. unassigning the PR.

This PR also adds the trigger_action_only logic to windows_ci_template yaml, as it was present on the linux template and seemed to be left out by mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62056

Test Plan: Note that this periodic job https://github.com/pytorch/pytorch/pull/62056/checks?check_run_id=3138504471 ran later than other jobs (like [this one](https://github.com/pytorch/pytorch/pull/62056/checks?check_run_id=3138226668)), and its time is close to when unassigning happens.

Reviewed By: seemethere

Differential Revision: D29859079

Pulled By: janeyx99

fbshipit-source-id: cd5c6be415cfa8090e3cac90625f92b49fd453a8
2021-07-23 08:48:22 -07:00
b03b45afd9 [DDP Comm Hook] Use a single tensor instead of a tensor list as the comm hook result (#62074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62074

Since SPMD mode is retired, the comm hook result will always be a single tensor.

This can improve the comm hook developer experience, as there is no need to add an extra `[0]` to the precursor future result.

#Closes: https://github.com/pytorch/pytorch/issues/61914
ghstack-source-id: 134164593

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29864732

fbshipit-source-id: 59fe6dd78b66214b1788514ad4d236039d9bda31
2021-07-23 03:32:05 -07:00
1d2ea76afb clamp: port to structured kernel (#61361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61361

This PR ports the `clamp` kernel to the structured format. In addition, it introduces `OptionalScalarRef` as a replacement for `c10::optional<Scalar>&`. The latter, although it is a reference type, can still involve copying the contained `Scalar` (e.g. if the actual parameter is a `Scalar` or if a `c10::optional<Scalar>` is constructed just to call a kernel). `OptionalScalarRef` contains only a `const Scalar&`, and stores flag about whether the instance contains something inside the `Scalar` itself using a new tag.

For more information, see #55070.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29821533

Pulled By: SplitInfinity

fbshipit-source-id: 88d55df5a4b2c14b68a57e4905d90eea1b088d99
2021-07-23 02:02:07 -07:00
b106b958eb preserve residual in transformer norm_first (#61692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61692

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29706830

Pulled By: bhosmer

fbshipit-source-id: d9c9e88fb589d46189955a96909c6ca76d587f72
2021-07-22 23:49:08 -07:00
53222c59f0 Reformat (#62073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62073

as title
ghstack-source-id: 134159445

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29869185

fbshipit-source-id: 17a32d56860e9469bd26c4eb4ca2d483827d946e
2021-07-22 23:36:22 -07:00
3687bbb1ed [pruner] add Conv2d support (#61778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61778

Adding Conv2d as supported modules for the pruner. Previously the pruner only supported Linear layers. This addition includes:
- adding a Conv2d activation reconstruction forward hook to match Conv2d weight shapes
- in `prepare`, checking the type of the module and using the corresponding activation forward hook
ghstack-source-id: 134143557

Test Plan:
Added conv2d tests
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1LLf3

Reviewed By: jerryzh168

Differential Revision: D29719045

fbshipit-source-id: 6a9f91b96992c552fff32f0e5a6e22f16eb7077b
2021-07-22 23:00:31 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
260198d42c Disable bazel in CircleCI (#62055)
Summary:
As it runs in GHA for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62055

Reviewed By: zhouzhuojie, seemethere

Differential Revision: D29856620

Pulled By: malfet

fbshipit-source-id: 754e392442f68d4eee15811e2bd2cf147326c42a
2021-07-22 16:28:12 -07:00
a91be24e2d Modernize make pointers (#61741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61741

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717385

fbshipit-source-id: 4452b77981e49175f744bdaab12cd225bf75b90e
2021-07-22 15:54:37 -07:00
f98fa5ea13 [skip ci] minor typo link fix (#62042)
Summary:
This is not a functional change but a typo fix where I forgot to update the link to windows_smoke_tests.csv in test_python_first_shard. The windows_smoke_tests.csv is currently the same in pytorch/test-infra and my fork, janeyx99/test-infra, but that will not be the case in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62042

Reviewed By: seemethere

Differential Revision: D29851984

Pulled By: janeyx99

fbshipit-source-id: 9bafdf0ba006b9128463e3cf132fdfcddd3d10f2
2021-07-22 15:34:41 -07:00
1a64a5c0ba .github: Only run workflows on pytorch/pytorch (#62044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62044

Downstream users have reported that they're seeing github workflows pop
up in their downstream forks which is not ideal. Let's make it so that
all of these generated workflows actually get skipped.

Also includes workflows related to automating pytorch/pytorch repository
maintenance

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet, janeyx99

Differential Revision: D29852199

Pulled By: seemethere

fbshipit-source-id: bbc1684c06a50bb3597f3112cb65fe9c1a4d7c1f
2021-07-22 15:08:31 -07:00
414537ac99 DOC Fixes link in register_module_backward_hook (#61999)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61999

Reviewed By: saketh-are

Differential Revision: D29847397

Pulled By: albanD

fbshipit-source-id: 3d9e1a5abac82d658b4f1746ace73e2fecb41725
2021-07-22 14:29:40 -07:00
b522f3be4c Svd docfix (#62028)
Summary:
moving back the variable names to match the python variable and remove unicode exponents.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62028

Reviewed By: saketh-are, mruberry

Differential Revision: D29848591

Pulled By: albanD

fbshipit-source-id: f86b8666cb5f86e300e214a6d59638d069018c50
2021-07-22 14:11:52 -07:00
d6e776d961 Add build/.ninja_log to artifacts for Windows (#62035)
Summary:
Being able to download the .ninja_log allows for better debugging. There may be a follow-up PR to convert this to a better tracefile.

This PR only handles windows as it is already handled for linux here:
https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build.sh#L248-L252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62035

Test Plan: Check the artifacts for a windows job and see if we see .ninja_log

Reviewed By: malfet

Differential Revision: D29852228

Pulled By: janeyx99

fbshipit-source-id: a3a87b709cd0c84f5b3cdc274ac4a623771c2b5c
2021-07-22 13:04:29 -07:00
0309c5780d ENH Adds no batch dim support for AvgPool1d (#61860)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61860

Reviewed By: albanD

Differential Revision: D29826382

Pulled By: jbschlosser

fbshipit-source-id: 47e12073d866f0604310fc1ff270cde9907e516d
2021-07-22 12:46:48 -07:00
5a00152a3d Warn about poor performance creating Tensor from list of numpy.array's (#51680)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/13918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51680
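
A short illustration of the pattern this warns about, and the fast alternative (assuming NumPy is available):

```python
import numpy as np
import torch

arrays = [np.ones(3) for _ in range(1000)]

slow = torch.tensor(arrays)            # element-by-element copy: now emits a warning
fast = torch.tensor(np.stack(arrays))  # build one ndarray first: single bulk copy
```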

Reviewed By: saketh-are

Differential Revision: D29847229

Pulled By: ezyang

fbshipit-source-id: 0519aad27f9ca1d8c06be5b9e6de382374d8b72b
2021-07-22 12:02:50 -07:00
2b0eddb0aa [Static Runtime] Implement prim::isinstance and prim::TypeCheck (#61783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61783

Implement two new prim operators for static runtime: `isinstance` and `TypeCheck`. `isinstance` is very straightforward, but there were a few wrinkles with implementing `TypeCheck`:

1. There is no way to directly generate `TypeCheck` nodes from TorchScript, they are generated by the JIT at runtime. This makes testing a little difficult. I had to make some modifications to `testStaticRuntime` to allow for the use of IR and TorchScript tests.
2. The behavior of `prim::TypeCheck` as implemented here does not match up 1:1 with the version implemented in the interpreter! This is because grad mode is disabled in static runtime. Here's an example.

IR is the same as the one included in this test, but with `requires_grad == 1`
```
graph(%a.1 : Tensor,
      %b.1 : Tensor):
  %t0 : Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), %t1 : Float(3, 3, strides=[3, 1]), %type_matched : bool = prim::TypeCheck[types=[Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), Float(3, 3, strides=[3, 1])]](%a.1, %b.1)
  return (%t0, %t1, %type_matched)
```

And in the test setup:
```
auto a = at::zeros({2, 2}, at::kFloat);
a.to(at::kCPU);
a.set_requires_grad(true);
auto b = at::ones({3, 3}, at::kFloat);

std::vector<IValue> args_correct = {a, b};

// prim::TypeCheck should be true with args_correct,
// but we get false when using static runtime!
```

Reviewed By: hlu1

Differential Revision: D29743862

fbshipit-source-id: db1788f0f5de42bab42602e8cc24eee04cbcc280
2021-07-22 10:23:35 -07:00
e6339ee336 optimize imports (#61908)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61908

Reviewed By: suo

Differential Revision: D29800269

Pulled By: ejguan

fbshipit-source-id: 74ce4414eb6d2a5608df9ec1efdc71e2112aef70
2021-07-22 09:58:44 -07:00
554e04090f Add 11.3 conda nightly binaries (#61873)
Summary:
Adds conda 11.3 cuda binaries to our nightly matrix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61873

Test Plan:
Tested by https://github.com/pytorch/pytorch/pull/61867-->testing complete, showing all passing binaries.

THIS CAN ONLY BE MERGED _AFTER_ pytorch/builder#806 and pytorch/builder#807 are merged, which they now are.

Reviewed By: saketh-are

Differential Revision: D29848267

Pulled By: janeyx99

fbshipit-source-id: db04899418bd0b4116315fbbe36b06f772020c2e
2021-07-22 09:50:13 -07:00
e858f6eed9 torch.nn.utils.clip_grad_norm_: remove device syncs (#61042)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60691

### Changes

Per the discussion in the above issue, this PR makes 2 changes:
1. When `error_if_nonfinite=False`, the NaN/Inf checks are truly skipped, and no device synchronization occurs.
    - Additionally, when performing the checks, the 2 results are combined with `torch.logical_or` to incur only a single sync (instead of 2 in the happy/finite path).
2. The `clip_coef` conditional is removed, in favor of a call to `clamp(..., max=1.0)` and an unconditional multiplication (see the sketch after this list).
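
A simplified sketch of change 2 (variable names assumed; not the exact body of `clip_grad_norm_`):

```python
import torch

def clip_step(grads, total_norm, max_norm):
    # total_norm is the 0-dim tensor computed from the gradients (stays on device)
    clip_coef = max_norm / (total_norm + 1e-6)
    # old: `if clip_coef < 1:` reads a Python bool, forcing a device-to-host sync
    # new: clamp on-device and always multiply, so no sync is needed
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    for g in grads:
        g.detach().mul_(clip_coef_clamped.to(g.device))
```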

### Testing

- The existing unit tests for `clip_grad_norm_` pass.
- I have manually profiled the example program from https://github.com/pytorch/pytorch/issues/60691, and verified that:
    - No synchronizations occur when using `error_if_nonfinite=False`.
    - A single synchronization occurs when using `error_if_nonfinite=True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61042

Reviewed By: mrshenli

Differential Revision: D29764096

Pulled By: jbschlosser

fbshipit-source-id: db594b24608d16374b91bcbb9469046dfeeb152d
2021-07-22 08:53:40 -07:00
9e53c823b8 Add AVX512 support in ATen & remove AVX support (#61903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61903

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though, but with Intel OpenMP).

### Summary

1. This draft PR produces binaries with 3 types of ATen kernels - default, AVX2, AVX512. Using the environment variable `ATEN_AVX512_256=TRUE` also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using the AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows , ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[`test_lstm` in `quantization.bc`](https://github.com/pytorch/pytorch/issues/59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be used as `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.

Credits to ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](https://github.com/pytorch/FBGEMM/pull/209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.

### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches or in kernels that are more compute heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.
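
A rough micro-benchmark sketch of that claim (CPU timing only; sizes chosen so one tensor fits in cache and the other does not):

```python
import time

import torch

for n in (1 << 18, 1 << 24):  # ~1 MiB and ~64 MiB of float32
    a, b = torch.randn(n), torch.randn(n)
    a + b                     # warm-up
    t0 = time.perf_counter()
    for _ in range(100):
        a + b
    per_call = (time.perf_counter() - t0) / 100
    print(f"{n * 4 / 2**20:.0f} MiB: {per_call * 1e6:.1f} us per add")
```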

Original pull request: https://github.com/pytorch/pytorch/pull/56992

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
2021-07-22 08:51:49 -07:00
cyy
59d6e07ada fix forward_idx check (#59911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59911

Reviewed By: dzhulgakov

Differential Revision: D29829020

Pulled By: albanD

fbshipit-source-id: f685063061dab499368a272d6b94a44e89f9a143
2021-07-22 08:37:33 -07:00
b60d1b713e Revert D26007050: add channels last support for thnn_conv2d (non-dilated)
Test Plan: revert-hammer

Differential Revision:
D26007050 (8b88c24670)

Original commit changeset: 1289e0687c24

fbshipit-source-id: 88b679efbcae572fe604d50e2199861cadbc3d4a
2021-07-22 08:31:15 -07:00
171598f0e3 [Refactoring] Fix imports order for torch/utils/data/dataset.py (#61328)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61328

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588897

Pulled By: VitalyFedyunin

fbshipit-source-id: 63df653fb471532819c83ebcee4f9dc951500ffb
2021-07-22 08:30:08 -07:00
1b02641bb1 add BFloat16 operators on CPU: arange, acosh, asinh, atanh, exp2, digamma, trigamma, polygamma (#60444)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60444

Reviewed By: ejguan

Differential Revision: D29800899

Pulled By: ezyang

fbshipit-source-id: 26d2c2ac3e7d3a2d49679508aad8c8bf0232cad5
2021-07-22 08:13:22 -07:00
f3f7e92be5 Manually call lazyInitCUDA in structured CUDA calls (#61882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61882

If you directly call the native implementation that bypasses the
initialization, which is bad!  This probably slows things down a little
though...

Fixes problem uncovered by #61642

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D29783856

Pulled By: ezyang

fbshipit-source-id: 16857569a049e09c6ebd96ef04b0025403b254af
2021-07-22 07:50:05 -07:00
196679d3aa [Refactoring] Reordering imports in torch/utils/data/datapipes/iter/__init__.py (#61325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61325

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588896

Pulled By: VitalyFedyunin

fbshipit-source-id: 8c0f3580f82083c43a590a18ecddb3e04ae93ca9
2021-07-22 07:46:08 -07:00
25be031c6e Add missing docker build to slow gradcheck label-triggered build (#61941)
Summary:
Currently, when adding the label, it fails like: https://app.circleci.com/pipelines/github/pytorch/pytorch/352569/workflows/d213cbad-edd6-4fe0-a79c-d46f8c0aae85/jobs/14856158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61941

Reviewed By: suo

Differential Revision: D29827084

Pulled By: albanD

fbshipit-source-id: 134828d36e51324e6b6539dd4bc5f1eebfb89a03
2021-07-22 07:37:21 -07:00
5186fa2831 Fix c10d -> dist in test_ddp_hooks.py (#61864)
Summary:
**Overview:**
The existing `test_ddp_hooks.py` test file uses a prefix `c10d`, which is not defined in the file, meaning the test errors if left as is. This renames each `c10d` prefix to `dist`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61864

Test Plan:
All four tests pass when run:
```
gpurun python test/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
```

Reviewed By: ejguan

Differential Revision: D29783860

Pulled By: andwgu

fbshipit-source-id: 16bdd2dfcb76192964246148f14851a74f8907c8
2021-07-22 07:20:41 -07:00
109bd5e78a OpInfo: bitwise_and (#61349)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61349

Also adds a type promotion test for bugs found by PR #60813

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29592840

Pulled By: ezyang

fbshipit-source-id: ee013b20e31baf6c6ebf2edb881ae6d8e215c7a6
2021-07-22 07:04:17 -07:00
2f3300f25f [docs] Correct torch.permute (#61833)
Summary:
Noted while reviewing https://github.com/pytorch/pytorch/issues/61830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61833

Reviewed By: albanD

Differential Revision: D29816661

Pulled By: mruberry

fbshipit-source-id: 895607d7ddcbd4319218ab7719a2f57cbde2283c
2021-07-22 00:27:23 -07:00
5801431c9b OpInfo Ref: addbmm (#61832)
Summary:
See https://github.com/pytorch/pytorch/issues/54261. This PR:

* Adds reference wrapper using NumPy for reference function of `addbmm`
* Refines sample inputs (makes it more readable and avoids redundancy)

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61832

Reviewed By: albanD

Differential Revision: D29816024

Pulled By: mruberry

fbshipit-source-id: e0fea6dc923504169a13bfaa258c61fbbc5fa9f4
2021-07-22 00:26:10 -07:00
31beef009d Fix IMethodTest.GetArgumentNames after D29648756 (#61985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61985

Fix IMethodTest.GetArgumentNames after D29648756 (641f6ef8a7).
ghstack-source-id: 134054637

Test Plan: buck test mode/dev caffe2/test/cpp/api:imethod -- IMethodTest.GetArgumentNames

Reviewed By: suo

Differential Revision: D29828807

fbshipit-source-id: b1411745b91e1b8c0ea0fd9e9666e22125dde333
2021-07-22 00:21:59 -07:00
07a91f1cfd fix graph deepcopy to propagate output type (#61747)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61747

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29737565

Pulled By: migeed-z

fbshipit-source-id: 8583f0c87f2db27695e062f59a15de77f3b00fd6
2021-07-21 23:53:03 -07:00
8a2063e58a Foreach Test Refactor: Pointwise, Min/Max-imum (#61327)
Summary:
- rewrite pointwise unittests using `ops` decorator
- rewrite minimum&maximum unittests using `ops` decorator
- enable minimum/maximum fastpath for BFloat16
- remove _test_data method

https://github.com/pytorch/pytorch/issues/58833

cc: ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61327

Reviewed By: albanD

Differential Revision: D29830209

Pulled By: ngimel

fbshipit-source-id: fa7805262b86c40fc32750b16629d80ad48ea4b5
2021-07-21 21:59:57 -07:00
d6899fe492 [Refactoring] Reordering imports in utils/data/__init__.py (#61324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61324

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588895

Pulled By: VitalyFedyunin

fbshipit-source-id: 5e719c80f9cb5630c65187ac89773831777f368d
2021-07-21 21:38:28 -07:00
06efced177 .github: Specify directory to pull reports from (#61990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61990

This adds more specificity to where to pull test reports from since I
believe that actions/upload-artifact doesn't actually respect the
working-directory default

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD, zhouzhuojie

Differential Revision: D29831719

Pulled By: seemethere

fbshipit-source-id: cee5609f97338d44a484d85baa77f0167d81ce55
2021-07-21 20:57:07 -07:00
cc18654d66 [fx_acc] Refactoring acc_tracer (#61963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61963

Test Plan: CI

Reviewed By: jfix71

Differential Revision: D29772522

fbshipit-source-id: 4b117735147624f9428b933ea798495823423a0e
2021-07-21 20:09:15 -07:00
6284d2a82b wrap cudaStreamSynchronize calls (#61889)
Summary:
This is a first step towards creating a context manager that errors out on synchronizing calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
2021-07-21 19:30:52 -07:00
3d6aa3a2f6 Enable torch.isclose to suppport bool tensors (#61271)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61271
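
A small example of the newly supported case; for bool tensors the comparison effectively reduces to elementwise equality:

```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, True])

print(torch.isclose(a, b))  # -> tensor([True, False, True])
```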

Reviewed By: zhxchen17

Differential Revision: D29737618

Pulled By: SplitInfinity

fbshipit-source-id: 45314bc7e0b9a28c10700455b1e6267c0db3eefc
2021-07-21 18:50:14 -07:00
243c7079a1 add 3d input and output shapes to maxpool documentation (#61310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61310

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29737516

Pulled By: migeed-z

fbshipit-source-id: eb6964f6808b8ae05d4d3852a5162dc66930cd64
2021-07-21 18:27:27 -07:00
d00bb45846 [typing] suppress errors in fbcode/caffe2 - batch 2
Test Plan: Sandcastle

Differential Revision: D29827809

fbshipit-source-id: 7ca7c2a33d691ac57392945b78a320d253c84ed4
2021-07-21 17:56:26 -07:00
a0e381641b Remove relative paths for clang-tidy annotations (#62004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62004

Some of the files checked by clang tidy are compiled from a sibling directory, so the files all start with something like `../torch`. This ends up messing with `translate_annotations.py` which runs from the repo root. This fixes it by chopping off any relative paths in the clang tidy output.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29835446

Pulled By: driazati

fbshipit-source-id: 2bd279370e41ed0a321e30f88fe38434105c75e8
2021-07-21 17:52:31 -07:00
e731a63e63 Silence clang-tidy linter for TorchpyTest.FxModule test (#62001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62001

This will fix [this linter error](https://github.com/pytorch/pytorch/runs/3120335141) introduced with D29690088 (810e19979d).

Test Plan: N/A (just looked at other examples and tidy doc https://clang.llvm.org/extra/clang-tidy/)

Reviewed By: suo

Differential Revision: D29832654

fbshipit-source-id: 8cf69cb5551f3b1bd384a2553dc5c827beb0a68f
2021-07-21 17:40:46 -07:00
b6ff0fa8dd Enable dynamically ciflow/slow so that we can run GHA slow tests on PR (#61987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61987

This PR enables us to run slow GHA tests on PR.

Steps to do (~may only take effect after this PR is merged~ works on this PR)
- Add label `ciflow/slow`
- Assign/unassign pytorchbot
- The job should be running .github/workflows/pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7.yml

The above steps are manual; once probot can handle the dispatch work, the CIFlow process will be automated.

Related meta RFC issue: #61888

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29832758

Pulled By: zhouzhuojie

fbshipit-source-id: 64d31ef572502e62b80e6b7ac480ffcfa9f4e38b
2021-07-21 16:56:54 -07:00
9d6cdf34a4 Annotate generated files in .gitattributes (#61995)
Summary:
Mark CI yaml files generated from templates as linguist-generated
Fixes https://github.com/pytorch/pytorch/issues/61994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61995

Reviewed By: seemethere

Differential Revision: D29832199

Pulled By: malfet

fbshipit-source-id: 86ad3a16b4d3e4f94c35b8f766a8556a07632419
2021-07-21 16:49:07 -07:00
ae58a4c45d [Static Runtime] Added a variadic cat operator (#61302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61302

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29565344

Pulled By: navahgar

fbshipit-source-id: 96f5f4546ec0e61eb7f87e016e026e7b62576248
2021-07-21 15:58:20 -07:00
b145889192 Modernize use make_unique (#61739)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61739

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717133

fbshipit-source-id: 70e3d81a48f7ae90cca3ef3c9587174ca15d81f4
2021-07-21 15:28:26 -07:00
2c0ecfbb20 [PyTorch] Expose bias() and unpack() API of LinearPackedParamsBase to Python layer (#61855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61855

Exposing `bias()` and `unpack()` for `LinearPackedParamsBase`. This is useful for inspecting linear op attributes.

Test Plan:
See unit test passing:

```
[ (6c61a5eb4) | devvm1625 ~/fbsource/fbcode] buck test //caffe2/test:quantization -- test_linear_bias_unpack
Parsing buck files: finished in 2.8 sec
Building: finished in 9.9 sec (100%) 11973/55220 jobs, 0/55220 updated
  Total time: 12.8 sec
More details at https://www.internalfb.com/intern/buck/build/2d0ee210-c8f3-4994-ac2b-1dccf4c3ca6c
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: b7c6ea1b-8eef-430e-b83a-dad4033ecc87
Trace available for this run at /tmp/tpx-20210720-115423.031745/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5348024618459562
    ✓ ListingSuccess: caffe2/test:quantization - main (10.806)
    ✓ Pass: caffe2/test:quantization - test_linear_bias_unpack (quantization.core.test_quantized_op.TestQuantizedOps) (10.913)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5348024618459562
```

Reviewed By: kimishpatel

Differential Revision: D29767704

fbshipit-source-id: 716f43b61814b92094c0b08d4e63e1dddc352aa7
2021-07-21 15:13:40 -07:00
a02ccd6080 [ONNX] add supplement for standardOps low precision cast (#60731) (#61561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61561

Addresses Gary's reply and adds a supplement to https://github.com/pytorch/pytorch/pull/53813.

- Add more details for LowPrecisionCastNodeForStandardOps to make it more comprehensible.

- Remove the unused gemm test.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29767991

Pulled By: SplitInfinity

fbshipit-source-id: d00032e13699f5b02fc619e64aa8fdd39f3a66b8

Co-authored-by: hwangdeyu <dejack953@outlook.com>
2021-07-21 15:10:36 -07:00
6f08ddfc28 [ONNX] Enable aten:normal op and add tests for aten:uniform op. (#60441) (#61560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61560

1. Add a new symbolic function broadcast_tensors() to support exporting the torch.broadcast_tensors() function. This is required for exporting torch.distributions.Normal.
2. Add a new symbolic function normal() to support exporting sampling from torch.distributions.Normal (see the sketch below).
3. Add tests for the normal and uniform ops as well.
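
A hypothetical export sketch of what this enables (the module, shapes, and opset below are assumptions, not taken from the PR):

```python
import torch

class Sampler(torch.nn.Module):
    def forward(self, mean, std):
        # Tracing goes through the normal-sampling aten op,
        # which the new symbolic function can now handle.
        return torch.distributions.Normal(mean, std).sample()

torch.onnx.export(Sampler(), (torch.zeros(4), torch.ones(4)),
                  "normal.onnx", opset_version=11)
```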

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29767995

Pulled By: SplitInfinity

fbshipit-source-id: acfe5e7801d00c0df8ca46966bbd6015fed0045e

Co-authored-by: Jay Zhang <jiz@microsoft.com>
2021-07-21 15:10:35 -07:00
f0054e1a6e [ONNX] Update expand_as for dynamic shape (#61084) (#61559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61559

Update expand_as for dynamic shape

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29767990

Pulled By: SplitInfinity

fbshipit-source-id: 3f1e3f68fd17c5ffbd4a50fccff224fd9d6c84fb

Co-authored-by: Negin Raoof <neginmr@utexas.edu>
2021-07-21 15:10:33 -07:00
34075e2c8b [ONNX] Fix the issue of converting empty list to sequence. (#58651) (#61558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61558

When we construct an empty list via a Python list comprehension, we need to avoid converting the resulting input-less node to onnx::Concat in shape_type_inference.cpp and peephole.cpp, because that would create an invalid Concat node with no inputs.

In addition, update the code to avoid passing a Sequence input to an onnx::Cast node, which does not accept the Sequence data type as an input.

Add tests for the validation as well.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29767989

Pulled By: SplitInfinity

fbshipit-source-id: f97f172ff20eebda4c3744c7a934df36716f12a2

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-07-21 15:10:31 -07:00
22e60d77e7 [ONNX] Support tensor list as module attribute (#59685) (#61557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61557

* Support tensor list as module attribute.
* Support exporting `torch.set_`.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29767992

Pulled By: SplitInfinity

fbshipit-source-id: 5ac5a09600d4dbe86b2fe354d240e46f1d1084ef
2021-07-21 15:08:35 -07:00
a8f6b5a80a [1/N] Avoid skipping tests in sandcastle. (#61876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61876

In the sandcastle environment, avoid skipping tests and instead just
"pass" them, so that we don't create a large number of non-actionable tasks.
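
A minimal sketch of the idea, using the environment variables from the Test Plan (the decorator name is an assumption):

```python
import os
import unittest

IS_SANDCASTLE = (os.getenv("SANDCASTLE") == "1"
                 or os.getenv("TW_JOB_USER") == "sandcastle")

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        if condition and IS_SANDCASTLE:
            # Report a pass instead of a skip so that no
            # non-actionable follow-up task gets created.
            return lambda *args, **kwargs: None
        return unittest.skipIf(condition, reason)(fn)
    return decorator
```
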
ghstack-source-id: 133846232

Test Plan: Test with `SANDCASTLE=1 TW_JOB_USER=sandcastle`

Reviewed By: rohan-varma

Differential Revision: D29779699

fbshipit-source-id: add71008830dfa6f456ce2365a2d70436b7b7a31
2021-07-21 14:31:17 -07:00
adb73d3dcf Removed overhead from reshape() call if tensor doesn't need to be changed (#61466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61466

## Goal

Per #55126 the performance of `reshape` is worse than `alias` in cases where they are performing the same operation (i.e. where reshape is returning a view) because `reshape` delegates to `view` and duplicates some of the operations (specifically `infer_size_dv` and `computeStride`).

The goal of this pull-request is to reduce or remove the additional overhead that `reshape` has.

### Proposed Implementation

Instead of using `view`, we implement a private/internal operator (`_reshape_alias`) that `reshape` dispatches to, which skips the relevant checks. This is functionally equivalent to `as_strided`; however, it is a lot simpler because it's specialized to this use case, and importantly its `backward` implementation is a lot faster.

Note that we have to dispatch (`reshape` is a composite operator) because `reshape` can return either a view or a copy of the Tensor depending on the parameters, and this complicates implementing a derivative/backward for `reshape`.
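
A quick standalone illustration (not from the PR) of why `reshape` cannot always alias:

```python
import torch

a = torch.arange(6).reshape(2, 3)                 # contiguous
print(a.reshape(-1).data_ptr() == a.data_ptr())   # True: reshape returned a view

b = a.t()                                         # non-contiguous transpose
print(b.reshape(-1).data_ptr() == b.data_ptr())   # False: reshape had to copy
```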

### Why not `as_strided`?

Using `as_strided` directly slows down autograd. If we use a custom function equivalent to `_reshape_alias` but with a simpler backward function, then `view` has the same performance as `reshape`. If we delegate to `as_strided`, it is about 56% slower (and this also holds when compared against our custom function).

This is also the reason we make an internal operator named `_reshape_alias` instead of exposing a new operator since this should only be used in the `reshape` case and it is effectively a more limited version of `view`, `alias`, and `as_strided`.

## Benchmarks
In a micro-benchmark for `backward` running:

```cpp
// Setup
at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));

// Benchmark loop
// `reshape(-1)` replaced with a call to view(-1) for view baseline
x.pow(4).reshape(-1).mean().backward();
```

I also benchmarked simple operations without gradients using:

```cpp
// Setup
at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));

// Benchmark loop
x.reshape(-1) // replaced with a call to view(-1) for view baseline
```

Baselined to `view`:

* Original `reshape`: `+3.3%` (without gradients `+20.8%`)
* Using `as_strided`: `+55.1%` (without gradients `+1.0%`)
* Using custom `_reshape_alias`: `-1.0%` (without gradients `+6.2%`)

In absolute terms (note the percentages above were generated comparing between runs/tests rather than to a single baseline):

* Original `view`: `53.66 us` (without gradients `582.78 ns`)
* Original `reshape`: `55.46 us` (without gradients `704.24 ns`)
* Using `as_strided`: `83.24 us` (without gradients `576.49 ns`)
* Using custom `_reshape_alias`: `53.13 us` (without gradients `536.01 ns`)

Note that these benchmarks include a backward pass as well. When compared without any gradient computation, the performance differences are more pronounced, since the reshape overhead then accounts for a larger share of the time.

### Original performance

<details>
  <summary>Benchmark results</summary>

```
[<torch.utils.benchmark.utils.common.Measurement object at 0x7f0e4d393160>
x.pow(4).view(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 53.66 us
  IQR:    2.70 us (52.54 to 55.24)
  884 measurements, 100 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f0e2ebd4fa0>
x.pow(4).reshape(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 55.46 us
  IQR:    2.61 us (54.39 to 57.01)
  889 measurements, 100 runs per measurement, 1 thread]

2276116
2286256

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f0e5b2e3e20>
   2640  ???:at::detail::computeStride(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::SmallVector<long, 5u> const&)
   1920  ???:at::native::reshape(at::Tensor const&, c10::ArrayRef<long>)
   1520  ???:at::_ops::reshape::call(at::Tensor const&, c10::ArrayRef<long>)
   1040  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long>&&)
    980  ???:void at::infer_size_impl<c10::SmallVector<long, 5u> >(c10::ArrayRef<long>, long, c10::SmallVector<long, 5u>&)
    720  ???:__tls_get_addr
    520  ???:at::shouldRunRecordFunction(bool*)
    520  ???:__memcpy_avx_unaligned_erms
    200  ???:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10:: ... g>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    100  ???:c10::TensorImpl::strides() const
    100  ???:c10::TensorImpl::sizes() const
    100  ???:at::(anonymous namespace)::manager()
     77  /tmp/benchmark_utils_jit_build__1626465284__8a34e7ff-cd37-4a82-be28-7f19e081e771/timer_cpp_7815557938202456331/timer_src.cpp:main
     40  ???:c10::TensorImpl::numel() const
    -77  /tmp/benchmark_utils_jit_build__1626465284__8a34e7ff-cd37-4a82-be28-7f19e081e771/timer_cpp_8055217880649990171/timer_src.cpp:main
   -260  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)

Total: 10140
```

```
[<torch.utils.benchmark.utils.common.Measurement object at 0x7f850dd66c10>
x.view(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 582.78 ns
  IQR:    33.80 ns (573.80 to 607.61)
  833 measurements, 10000 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f850de31e20>
x.reshape(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 704.24 ns
  IQR:    24.42 ns (697.20 to 721.62)
  679 measurements, 10000 runs per measurement, 1 thread]

56896
67036

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f84e1930bb0>
   2640  ???:at::detail::computeStride(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::SmallVector<long, 5u> const&)
   1920  ???:at::native::reshape(at::Tensor const&, c10::ArrayRef<long>)
   1520  ???:at::_ops::reshape::call(at::Tensor const&, c10::ArrayRef<long>)
   1040  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long>&&)
    980  ???:void at::infer_size_impl<c10::SmallVector<long, 5u> >(c10::ArrayRef<long>, long, c10::SmallVector<long, 5u>&)
    720  ???:__tls_get_addr
    520  ???:at::shouldRunRecordFunction(bool*)
    520  ???:__memcpy_avx_unaligned_erms
    200  ???:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10:: ... g>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    100  ???:c10::TensorImpl::strides() const
    100  ???:c10::TensorImpl::sizes() const
    100  ???:at::(anonymous namespace)::manager()
     76  /tmp/benchmark_utils_jit_build__1626466038__15fbbac0-2072-4459-8f8e-08121a905b99/timer_cpp_547407365342278353/timer_src.cpp:main
     40  ???:c10::TensorImpl::numel() const
    -76  /tmp/benchmark_utils_jit_build__1626466038__15fbbac0-2072-4459-8f8e-08121a905b99/timer_cpp_3457873755756181226/timer_src.cpp:main
   -260  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)

Total: 10140
```

</details>

### Using `as_strided`

<details>
  <summary>Benchmark results</summary>

```
[<torch.utils.benchmark.utils.common.Measurement object at 0x7f8b13bb5b50>
x.pow(4).view(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 53.37 us
  IQR:    3.15 us (51.73 to 54.88)
  936 measurements, 100 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f8af55f8490>
x.pow(4).reshape(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 83.24 us
  IQR:    4.05 us (81.20 to 85.25)
  609 measurements, 100 runs per measurement, 1 thread]

2267916
2525061

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f8af55f8e50>
   31930  ???:_int_free
   15940  ???:malloc
   11595  ???:_int_malloc
   10100  ???:torch::autograd::generated::details::as_strided_backward(at::Tensor, at::TensorGeometry, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    9360  ???:__tls_get_addr
    8280  ???:free
    8100  ???:torch::autograd::VariableType::(anonymous namespace)::as_strided(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    4520  ???:c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_()
    4080  ???:operator new(unsigned long)
     ...
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    -920  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long> const&)
   -1220  ???:torch::autograd::generated::ViewBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
   -1520  ???:at::_ops::view::call(at::Tensor const&, c10::ArrayRef<long>)
   -1580  ???:torch::ADInplaceOrView::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -1680  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::SmallVector<long, 5u> >(at::Tensor const&, c10::SmallVector<long, 5u> const&, c10::SmallVector<long, 5u> const&)
   -2560  ???:at::detail::computeStride(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::SmallVector<long, 5u> const&)
   -2640  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)
   -4860  ???:torch::autograd::VariableType::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)

Total: 257145
```

```

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f93176a0160>
x.view(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 570.55 ns
  IQR:    32.69 ns (552.87 to 585.56)
  874 measurements, 10000 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f92f8f29490>
x.reshape(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 576.49 ns
  IQR:    37.95 ns (559.51 to 597.46)
  861 measurements, 10000 runs per measurement, 1 thread]

56896
58556

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f932556ca60>
    2140  ???:at::native::reshape(at::Tensor const&, c10::ArrayRef<long>)
    1940  ???:torch::autograd::VariableType::(anonymous namespace)::as_strided(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    1880  ???:torch::ADInplaceOrView::(anonymous namespace)::as_strided(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    1720  ???:at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    1520  ???:at::_ops::reshape::call(at::Tensor const&, c10::ArrayRef<long>)
    1400  ???:at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
    1260  ???:at::_ops::as_strided::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)'2
    1260  ???:at::_ops::as_strided::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<long>)
     980  ???:void at::infer_size_impl<c10::SmallVector<long, 5u> >(c10::ArrayRef<long>, long, c10::SmallVector<long, 5u>&)
     ...
    -620  ???:at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, c10::ArrayRef<long ... ::ArrayRef<long>)> const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>) const
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)'2
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    -920  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long> const&)
   -1520  ???:at::_ops::view::call(at::Tensor const&, c10::ArrayRef<long>)
   -1580  ???:torch::ADInplaceOrView::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -1680  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::SmallVector<long, 5u> >(at::Tensor const&, c10::SmallVector<long, 5u> const&, c10::SmallVector<long, 5u> const&)
   -1740  ???:torch::autograd::VariableType::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -2640  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)

Total: 1660

```

</details>

### Using custom function (`_reshape_alias`)

<details>
  <summary>Benchmark results</summary>

```
[<torch.utils.benchmark.utils.common.Measurement object at 0x7f16861d6b50>
x.pow(4).view(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 53.50 us
  IQR:    2.64 us (52.32 to 54.96)
  906 measurements, 100 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f1667b2ed60>
x.pow(4).reshape(-1).mean().backward();
setup: at::Tensor x=torch::empty({2,2}, torch::requires_grad(true));
  Median: 53.13 us
  IQR:    3.40 us (51.72 to 55.13)
  914 measurements, 100 runs per measurement, 1 thread]

2269736
2273236

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f1693f8dc10>
    5060  ???:torch::autograd::VariableType::(anonymous namespace)::_reshape_alias(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    2000  ???:at::native::reshape(at::Tensor const&, c10::ArrayRef<long>)
    1780  ???:torch::ADInplaceOrView::(anonymous namespace)::_reshape_alias(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1660  ???:at::_ops::_reshape_alias::call(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1600  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::ArrayRef<long> >(at::Tensor const&, c10::ArrayRef<long> const&, c10::ArrayRef<long> const&)
    1520  ???:at::_ops::reshape::call(at::Tensor const&, c10::ArrayRef<long>)
    1240  ???:at::_ops::_reshape_alias::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)'2
    1240  ???:at::_ops::_reshape_alias::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1220  ???:torch::autograd::generated::AliasToShapeBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
     ...
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)'2
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    -920  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long> const&)
   -1220  ???:torch::autograd::generated::ViewBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
   -1520  ???:at::_ops::view::call(at::Tensor const&, c10::ArrayRef<long>)
   -1580  ???:torch::ADInplaceOrView::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -1680  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::SmallVector<long, 5u> >(at::Tensor const&, c10::SmallVector<long, 5u> const&, c10::SmallVector<long, 5u> const&)
   -2640  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)
   -4860  ???:torch::autograd::VariableType::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)

Total: 3500
```

```

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f5287adfb20>
x.view(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 505.10 ns
  IQR:    20.04 ns (500.41 to 520.45)
  944 measurements, 10000 runs per measurement, 1 thread]

[<torch.utils.benchmark.utils.common.Measurement object at 0x7f526951b430>
x.reshape(-1);
setup: at::Tensor x=torch::empty({2,2});
  Median: 536.01 ns
  IQR:    17.81 ns (531.34 to 549.16)
  916 measurements, 10000 runs per measurement, 1 thread]

56896
60376

<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f5295896c10>
    2000  ???:at::native::reshape(at::Tensor const&, c10::ArrayRef<long>)
    1860  ???:torch::autograd::VariableType::(anonymous namespace)::_reshape_alias(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1780  ???:torch::ADInplaceOrView::(anonymous namespace)::_reshape_alias(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1660  ???:at::_ops::_reshape_alias::call(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
    1600  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::ArrayRef<long> >(at::Tensor const&, c10::ArrayRef<long> const&, c10::ArrayRef<long> const&)
    1520  ???:at::_ops::reshape::call(at::Tensor const&, c10::ArrayRef<long>)
    1240  ???:at::_ops::_reshape_alias::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)'2
    1240  ???:at::_ops::_reshape_alias::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>)
     980  ???:void at::infer_size_impl<c10::SmallVector<long, 5u> >(c10::ArrayRef<long>, long, c10::SmallVector<long, 5u>&)
     ...
    -620  ???:at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, c10::ArrayRef<long ... ::ArrayRef<long>)> const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>) const
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)'2
    -780  ???:at::_ops::view::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
    -920  ???:c10::SmallVectorImpl<long>::operator=(c10::SmallVectorImpl<long> const&)
   -1520  ???:at::_ops::view::call(at::Tensor const&, c10::ArrayRef<long>)
   -1580  ???:torch::ADInplaceOrView::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -1680  ???:at::Tensor at::native::alias_with_sizes_and_strides<c10::SmallVector<long, 5u> >(at::Tensor const&, c10::SmallVector<long, 5u> const&, c10::SmallVector<long, 5u> const&)
   -1740  ???:torch::autograd::VariableType::(anonymous namespace)::view(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>)
   -2640  ???:at::native::view(at::Tensor const&, c10::ArrayRef<long>)

Total: 3480

```

</details>

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29792126

Pulled By: laurencer

fbshipit-source-id: f0519b45b65f868aa3e8651679354558bd761dfd
2021-07-21 14:05:35 -07:00
a8d99a28d7 Modernize avoid a C array (#61740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61740

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717118

fbshipit-source-id: 70e73346b75deb4fe6b6399e06bd576f3b6e2b91
2021-07-21 13:52:54 -07:00
d7b31fe95d Add ciflow config and change jinja2 templates (#61886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61886

This PR is rolling out at the `1. Manual Phase`.

```
#       Rollout Strategy:
#       1. Manual Phase
#          step 1. Add 'ciflow/default' label to the PR
#          step 2. Once there's an [unassigned] event from PR, it should rerun
#          step 3. Remove 'ciflow/default' label
#          step 4. Trigger the [unassigned] event again, it should not rerun
#       2. Probot Phase 1 (manual on 1 workflow)
#          step 1. Probot automatically add labels based on the context
#          step 2. Manually let probot trigger [unassigned] event
#       3. Probot Phase 2 (auto on 1 workflow)
#          step 1. Modify the workflows so that they only listen on [unassigned] events
#          step 2. Probot automatically adds labels based on the context
#          step 3. Probot automatically triggers [unassigned] event
#       4. Probot Phase 3 (auto on many workflows)
#          step 1. Enable it for all workflows
```

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D29808366

Pulled By: zhouzhuojie

fbshipit-source-id: c7e5009d839239df58825dec093ff0f1fd281697
2021-07-21 13:32:09 -07:00
2dab368d26 Refactor generate_ci_workflows (#61879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61879

Refactor generate_ci_workflows to support the CI dispatcher. This is the first step toward refactoring each workflow into a dataclass with validation and an OOP structure.
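
A loose sketch of the dataclass direction described above (all field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class CIWorkflow:
    build_environment: str
    docker_image: str
    on_pull_request: bool = False

    def validate(self) -> None:
        # Basic sanity checks; the real class would do more.
        assert self.build_environment and self.docker_image
```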

Verified that the output is the same:

```
.github/scripts/generate_ci_workflows.py
git status
```

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D29808365

Pulled By: zhouzhuojie

fbshipit-source-id: b8c5fd43f4bd6e17e06f3925a1a509084b790d95
2021-07-21 13:30:36 -07:00
e2acce373f Run Windows smoke tests with gflags in test dir (#61967)
Summary:
Previous testing yielded the torch.version ModuleNotFound error when I ran the smoke tests from the pytorch root directory.

This PR simply reorders the commands to run the smoke tests within the test directory, which passes in this series of runs:
https://github.com/seemethere/test-repo/actions/runs/1050734298 (the failures are due to missing credentials during uploading stats, which we don't need here)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61967

Reviewed By: samestep

Differential Revision: D29820985

Pulled By: janeyx99

fbshipit-source-id: 363ef321c32cfaf4446ceeb6117ea26abc311816
2021-07-21 12:06:34 -07:00
a03466cb07 Back out "Revert D29687143: [3/N] Nnapi Backend Delegate Preprocess: Basic OSS Test" (#61878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61878

CMakeLists.txt
The Android NNAPI delegate library was moved from test/cpp/jit/CMakeLists.txt to torch/CMakeLists.txt. This resolves the issue the original PR had, where the NNAPI delegate library was added to builds without Python even though it depends on Python.
Original PR: https://github.com/pytorch/pytorch/pull/61594

There's an error where the library cannot be built on MacOS. This problem existed in the original PR as well, but now an issue has been created: https://github.com/pytorch/pytorch/issues/61930

test_backend_nnapi.py
Also cleaned up the unit tests' skip conditions: the unit tests are now skipped if the NNAPI delegate library file is not found, whereas previously the skip was based on the platform (only allowing Linux).

Test Plan:
To run NNAPI delegate unit tests: `python test/test_jit.py TestNnapiBackend`

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29799895

fbshipit-source-id: b69a767b5cde3814b0853cfbc84d61ab4155f619
2021-07-21 11:58:45 -07:00
4532b3c4a9 Fix _C public bindings test (#61088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61088

The test was previously a no-op since it was comparing the bindings with themselves. This fixes that to use the hardcoded list and adds the items that changed in the meantime.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29510525

Pulled By: driazati

fbshipit-source-id: 3497023e5c8b3cd6fdd1d07d48b4f2650b203ded
2021-07-21 11:50:37 -07:00
8880f3d450 [fx] introduce __fx_create_arg__ dunder method for controlling custom classes are handled as node args (#61780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61780

These changes allow objects to control, from within their own source, how they are handled when they appear as an argument to a torch.fx call_module node. Previously, we had been using a custom Tracer with an overridden create_arg() method, branching on class name to handle unusual args (data classes, etc.).
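
A sketch of the hook, assuming the protocol is "define `__fx_create_arg__(self, tracer)` and return something usable as a node arg" (the class and fields are placeholders):

```python
import torch.fx

class SideConfig:
    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __fx_create_arg__(self, tracer: torch.fx.Tracer):
        # Delegate each field back to the tracer so proxies/tensors
        # are converted recursively; the dict becomes the node arg.
        return {
            "scale": tracer.create_arg(self.scale),
            "shift": tracer.create_arg(self.shift),
        }
```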

Reviewed By: suo, houseroad

Differential Revision: D27976120

fbshipit-source-id: 0c5249c5f8398368ca0fbec0ad8a07ccf99b7da4
2021-07-21 11:27:09 -07:00
3c7bfa632a reland D29801875: .github: Clone pytorch to separate directory (#61972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61972

This reverts commit 716567504c8b4da8d764d9674595c2095b62080c.

Also includes change to add the TEST_CONFIG env variable so that test
reports get uploaded correctly.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29821858

Pulled By: seemethere

fbshipit-source-id: 23602706446e0a95db6bd7cedfa665e8c4145168
2021-07-21 11:15:52 -07:00
810e19979d Torch deploy for fx.graph_module with non-torch dependencies (#61680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61680

This diff enables torch deploy for fx.graph_module with non-torch dependencies. Here are the issues that currently prevent this, all fixed in this change:
- Pickle is used as an internal format to transmit objects between interpreters. It needs to serialize Python code, but to get the source code for imports from python_code.globals it needs access to the PackageImporter. Currently a regular `__reduce__` function is used, which has no notion of a custom importer.
- When deserializing pickled objects on an interpreter, empty globals are passed to exec, so it cannot resolve non-torch imports located in the package. We need to be able to point exec to our custom PackageImporter.
- Subclasses extending fx.graph_module should be able to optionally provide their own Tracer (extending fx.Tracer).

As a solution, a new reducer (`__reduce_deploy__`) is introduced for the torch deploy workflow. The reducer is registered in _deploy.py (the entry point for the C++ torch deploy API) when saving an object to transmit between interpreters. This lets us pass a proper PackageImporter to each interpreter for pickling/unpickling fx.graph_module, and it also defines an API for passing a custom fx.Tracer when needed.

Test Plan:
Added UT to cover changes.
```
buck test //caffe2/torch/csrc/deploy:test_deploy
```
```
buck test caffe2/test:fx
```

Reviewed By: suo

Differential Revision: D29690088

fbshipit-source-id: 3a8dbe02d5d7e085534aa61b7773c86f0f8c19b0
2021-07-21 10:29:48 -07:00
f41d3341b1 [pytorch] Support embedding_bag_4bit_rowwise_offsets in cuda (#61728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61728

Templatize the existing embedding_bag_byte_rowwise_offsets_kernel to support both 4 bits and 8 bits per dimension. Tested rigorously using fb-internal random testing against the CPU ops.

Reviewed By: hyuen

Differential Revision: D29706346

fbshipit-source-id: c9f4591a2cc6205e4b7e57a363ba0a6306fdddd5
2021-07-21 10:23:30 -07:00
716567504c Revert D29801875: .github: Clone pytorch to separate directory
Test Plan: revert-hammer

Differential Revision:
D29801875 (a152c12d7b)

Original commit changeset: 71a3c7c949e5

fbshipit-source-id: 85175a9933d1e33117b1461d5a760e1a79f60047
2021-07-21 10:19:28 -07:00
ea8abcf76e [quant] Remove calls to .item() for fake_quant_on (#61921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61921

For GPU training, the fake_quant_on tensors are present on the GPU and the .item() calls incur a GPU->CPU copy to access the tensor element.
These calls are expensive and hurt training performance: the `item()` and `_local_scalar_dense()` calls account for 11% of the total CPU execution time.
The solution here is to access the tensor on the GPU without a copy.
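
A toy illustration of the pattern change (a sketch, not the actual kernel code):

```python
import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"
fake_quant_on = torch.ones(1, device=dev)
x = torch.randn(4, device=dev)

# Before: .item() materializes a Python scalar, forcing a device->host sync.
if fake_quant_on.item() == 1:
    y = x * 2.0

# After (sketch): keep the decision on-device; no host round-trip.
y = torch.where(fake_quant_on.bool(), x * 2.0, x)
```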

Individual op benchmarks show a 33% speedup just by removing the `.item()` calls

Profiler Before
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fused_moving_avg_obs_fake_quant         5.61%       1.538ms       100.00%      27.421ms     548.425us     978.208us         3.42%      28.575ms     571.501us            50
                  aten::_fused_moving_avg_obs_fq_helper        27.63%       7.576ms        94.39%      25.883ms     517.668us       6.536ms        22.87%      27.597ms     551.937us            50
aten::_fake_quantize_per_tensor_affine_cachemask_ten...        11.07%       3.037ms        21.54%       5.905ms     118.103us       9.549ms        33.42%       9.549ms     190.978us            50
                                         aten::_aminmax        19.39%       5.317ms        27.44%       7.524ms     150.484us       8.683ms        30.38%       8.683ms     173.651us            50
                                             aten::item         4.49%       1.232ms        11.12%       3.051ms      61.011us       1.058ms         3.70%       2.829ms      56.579us            50
                              aten::_local_scalar_dense         6.63%       1.818ms         6.63%       1.818ms      36.363us       1.771ms         6.20%       1.771ms      35.419us            50
                                            aten::empty         5.76%       1.579ms         5.76%       1.579ms      15.792us       0.000us         0.00%       0.000us       0.000us           100
                                       aten::as_strided         2.29%     628.399us         2.29%     628.399us       6.284us       0.000us         0.00%       0.000us       0.000us           100
                                       aten::empty_like         7.56%       2.073ms        17.13%       4.696ms      31.310us       0.000us         0.00%       0.000us       0.000us           150
                                    aten::empty_strided         9.57%       2.623ms         9.57%       2.623ms      17.489us       0.000us         0.00%       0.000us       0.000us           150
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 27.421ms
Self CUDA time total: 28.575ms
```
After
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fused_moving_avg_obs_fake_quant         6.59%       1.240ms       100.00%      18.820ms     376.396us     490.272us         2.36%      20.745ms     414.901us            50
                  aten::_fused_moving_avg_obs_fq_helper        26.12%       4.916ms        93.41%      17.580ms     351.597us       2.033ms         9.80%      20.255ms     405.096us            50
aten::_fake_quantize_per_tensor_affine_cachemask_ten...        14.55%       2.738ms        31.09%       5.850ms     117.005us       9.968ms        48.05%       9.968ms     199.363us            50
                                         aten::_aminmax        25.28%       4.758ms        36.21%       6.814ms     136.278us       8.253ms        39.79%       8.253ms     165.069us            50
                                            aten::empty         7.94%       1.494ms         7.94%       1.494ms      14.944us       0.000us         0.00%       0.000us       0.000us           100
                                       aten::as_strided         2.99%     561.785us         2.99%     561.785us       5.618us       0.000us         0.00%       0.000us       0.000us           100
                                       aten::empty_like         8.36%       1.573ms        16.53%       3.112ms      31.118us       0.000us         0.00%       0.000us       0.000us           100
                                    aten::empty_strided         8.17%       1.538ms         8.17%       1.538ms      15.384us       0.000us         0.00%       0.000us       0.000us           100
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 18.820ms
Self CUDA time total: 20.745ms
```

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jingsh

Differential Revision: D29796533

fbshipit-source-id: 10abb93abd61c6ac25b8e8c114aa57b9db891918
2021-07-21 10:13:06 -07:00
b8386f5d72 [quant] Create FusedMovingAvgObsFakeQuantize for QAT (#61691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61691

Create a new module for QAT that performs a fused MovingAvgMinMaxObserver + FakeQuantize operation.
The module currently only supports per-tensor quantization (affine/symmetric). A follow-up PR will add per-channel support.
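
A minimal usage sketch, assuming the module is exposed under torch.quantization and follows the FakeQuantize constructor convention (both assumptions):

```python
import torch
from torch.quantization import (FusedMovingAvgObsFakeQuantize,
                                MovingAverageMinMaxObserver)

fq = FusedMovingAvgObsFakeQuantize(
    observer=MovingAverageMinMaxObserver,
    quant_min=0, quant_max=255, dtype=torch.quint8,
)
y = fq(torch.randn(8, 16))   # one fused observe + fake-quantize step
```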

Results on running QAT with MobileNetV2 (Obs enabled/fake_quant enabled)
Original FQ module
PyTorchObserver {"type": "_", "metric": "qnnpack_fp_latency_ms", "unit": "ms", "value": "242.80261993408203"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_latency_ms", "unit": "ms", "value": "505.7964324951172"}
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_latency_ms", "unit": "ms", "value": "235.80145835876465"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_latency_ms", "unit": "ms", "value": "543.8144207000732"}

Fused FakeQuant module (~50% improvement in latency)
PyTorchObserver {"type": "_", "metric": "qnnpack_fp_latency_ms", "unit": "ms", "value": "232.1624755859375"}
PyTorchObserver {"type": "_", "metric": "qnnpack_qat0_latency_ms", "unit": "ms", "value": "263.8866901397705"}
PyTorchObserver {"type": "_", "metric": "fbgemm_fp_latency_ms", "unit": "ms", "value": "236.9832992553711"}
PyTorchObserver {"type": "_", "metric": "fbgemm_qat0_latency_ms", "unit": "ms", "value": "292.1590805053711"}

Individual module benchmark result (>5x improvement in latency)
===> Baseline FakeQuantize module
```
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
              aten::fake_quantize_per_tensor_affine         0.77%       1.210ms         4.92%       7.730ms     154.596us     718.528us         0.45%       9.543ms     190.862us            50
    aten::fake_quantize_per_tensor_affine_cachemask         2.41%       3.792ms         4.15%       6.520ms     130.402us       8.825ms         5.58%       8.825ms     176.492us            50
                                     aten::_aminmax         3.25%       5.105ms         4.43%       6.955ms     139.102us       8.193ms         5.18%       8.193ms     163.868us            50
                                   aten::zeros_like         1.87%       2.939ms         6.95%      10.922ms     109.218us       5.992ms         3.79%      10.844ms     108.442us           100
                                        aten::zeros         0.97%       1.527ms         3.11%       4.885ms      97.702us       2.383ms         1.51%       4.800ms      96.010us            50
                                         aten::rsub         1.34%       2.106ms         2.94%       4.614ms      92.277us       2.063ms         1.30%       4.559ms      91.173us            50
                                        aten::clamp         2.79%       4.381ms         5.42%       8.519ms      85.190us       5.385ms         3.41%       8.438ms      84.381us           100
                                           aten::eq        11.70%      18.384ms        21.31%      33.479ms      83.280us      22.465ms        14.21%      33.310ms      82.861us           402
                                         aten::ones         1.05%       1.656ms         2.57%       4.038ms      80.751us       2.494ms         1.58%       3.951ms      79.028us            50
                                           aten::le         2.52%       3.955ms         4.84%       7.607ms      76.071us       4.998ms         3.16%       7.702ms      77.016us           100
                                          aten::min         0.69%       1.087ms         2.32%       3.641ms      72.827us       1.017ms         0.64%       3.603ms      72.055us            50
                                          aten::max         1.40%       2.195ms         4.62%       7.260ms      72.597us       2.008ms         1.27%       7.140ms      71.404us           100
                                   aten::is_nonzero         2.68%       4.207ms        11.35%      17.829ms      71.033us       4.062ms         2.57%      17.225ms      68.625us           251
                                       aten::detach         1.17%       1.831ms         3.65%       5.736ms      57.360us       1.680ms         1.06%       5.634ms      56.340us           100
                                          aten::mul         3.36%       5.278ms         3.36%       5.278ms      53.862us       5.215ms         3.30%       5.215ms      53.216us            98
                                          aten::div         3.42%       5.376ms         3.42%       5.376ms      53.759us       5.320ms         3.36%       5.320ms      53.196us           100
                                          aten::sub         6.79%      10.672ms         6.79%      10.672ms      53.901us      10.504ms         6.64%      10.504ms      53.050us           198
                                         aten::item         4.06%       6.380ms        12.02%      18.883ms      53.798us       6.127ms         3.87%      18.322ms      52.198us           351
                                          aten::add         3.28%       5.147ms         3.28%       5.147ms      52.518us       5.113ms         3.23%       5.113ms      52.171us            98
                                      aten::minimum         1.63%       2.555ms         1.63%       2.555ms      51.092us       2.585ms         1.64%       2.585ms      51.708us            50
                                      aten::maximum         3.22%       5.065ms         3.22%       5.065ms      50.646us       5.133ms         3.25%       5.133ms      51.329us           100
                                        aten::round         1.61%       2.529ms         1.61%       2.529ms      50.578us       2.528ms         1.60%       2.528ms      50.552us            50
                                        aten::zero_         1.99%       3.125ms         4.72%       7.422ms      49.481us       2.835ms         1.79%       7.269ms      48.462us           150
                                        aten::copy_         6.62%      10.394ms         6.62%      10.394ms      41.576us      10.252ms         6.48%      10.252ms      41.010us           250
                                             detach         2.49%       3.905ms         2.49%       3.905ms      39.049us       3.954ms         2.50%       3.954ms      39.539us           100
                                       aten::select         2.01%       3.154ms         2.47%       3.876ms      38.759us       3.866ms         2.44%       3.866ms      38.658us           100
                          aten::_local_scalar_dense         7.96%      12.503ms         7.96%      12.503ms      35.621us      12.195ms         7.71%      12.195ms      34.743us           351
                                           aten::to         2.31%       3.625ms         4.16%       6.530ms      32.650us       4.320ms         2.73%       6.270ms      31.348us           200
                                        aten::fill_         3.70%       5.808ms         3.70%       5.808ms      29.039us       5.892ms         3.73%       5.892ms      29.459us           200
                                   aten::as_strided         0.79%       1.244ms         0.79%       1.244ms       6.221us       0.000us         0.00%       0.000us       0.000us           200
                                        aten::empty         3.55%       5.579ms         3.55%       5.579ms      11.137us       0.000us         0.00%       0.000us       0.000us           501
                                      aten::resize_         2.36%       3.712ms         2.36%       3.712ms      12.332us       0.000us         0.00%       0.000us       0.000us           301
                                   aten::empty_like         1.45%       2.284ms         3.68%       5.776ms      28.878us       0.000us         0.00%       0.000us       0.000us           200
                                aten::empty_strided         2.80%       4.398ms         2.80%       4.398ms      17.592us       0.000us         0.00%       0.000us       0.000us           250
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 157.108ms
Self CUDA time total: 158.122ms
```

===> FusedFakeQuant
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                   fb::fused_fake_quant        23.42%       6.408ms       100.00%      27.361ms     547.215us       7.887ms        27.20%      28.996ms     579.925us            50
                  aten::fake_quantize_per_tensor_affine         4.25%       1.162ms        27.65%       7.565ms     151.298us     686.176us         2.37%      10.217ms     204.336us            50
aten::_fake_quantize_per_tensor_affine_cachemask_ten...        14.11%       3.860ms        23.40%       6.403ms     128.068us       9.531ms        32.87%       9.531ms     190.612us            50
                                         aten::_aminmax        20.57%       5.628ms        27.47%       7.515ms     150.305us       8.218ms        28.34%       8.218ms     164.367us            50
                                             aten::item         3.65%     999.522us        10.27%       2.810ms      56.202us     931.904us         3.21%       2.674ms      53.481us            50
                              aten::_local_scalar_dense         6.62%       1.811ms         6.62%       1.811ms      36.212us       1.742ms         6.01%       1.742ms      34.843us            50
                                            aten::empty        10.85%       2.969ms        10.85%       2.969ms      14.843us       0.000us         0.00%       0.000us       0.000us           200
                                       aten::as_strided         1.92%     524.365us         1.92%     524.365us       5.244us       0.000us         0.00%       0.000us       0.000us           100
                                       aten::empty_like         6.48%       1.774ms        14.62%       4.000ms      26.670us       0.000us         0.00%       0.000us       0.000us           150
                                    aten::empty_strided         8.14%       2.226ms         8.14%       2.226ms      14.842us       0.000us         0.00%       0.000us       0.000us           150
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 27.361ms
Self CUDA time total: 28.996ms
```

Test Plan:
python test/test_quantization.py TestFusedObsFakeQuantModule

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29706889

fbshipit-source-id: ae3f9fb1fc559920459bf6e8663e8299bf7d21e1
2021-07-21 10:13:04 -07:00
afdca41bab [quant] Add a new fused MovingAvg Obs + FakeQuant operator (GPU) (#61589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61589

Custom GPU implementation that does the observer + calculate qparams calculation on GPU.
It calls the aten fake_quant_per_tensor/channel functions to perform the fake quant step.

Test Plan:
python test/test_quantization.py TestFusedObsFakeQuant

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29682761

fbshipit-source-id: 373a50f88481b7e5b4d9e65d84a6c174bb277dd4
2021-07-21 10:13:02 -07:00
92d3391fb1 [quant] Add a new fused MovingAvg Obs + FakeQuant operator(CPU) (#61570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61570

Fused operator that computes the moving-average min/max values of the input tensor (updating them in-place) and then fake-quantizes the input.
It expects the qmin/qmax values to reflect the full range of the quantized tensor (instead of using reduce_range).

Motivation for adding this operator is for performance reasons, since moving the computation from python to C++/CUDA can increase the performance of QAT.

Test Plan:
python test/test_quantization.py TestFusedObsFakeQuant

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29682762

fbshipit-source-id: 28e4c50e77236d6976fe4b326c9a12103ed95840
2021-07-21 10:11:41 -07:00
403f59701c Changes default DDP behavior to divide sparse grad by world size before allreduce, not after (#61814)
Summary:
I appreciate https://github.com/pytorch/pytorch/pull/61379, which restores the fusion of divide-by-world-size and copy-to-allreduce-buffer for dense gradients. But I noticed that in the wake of https://github.com/pytorch/pytorch/pull/61379 there's misaligned treatment of dense and sparse gradients. Specifically, dense gradients are divided by world size before the allreduce, while sparse gradients are divided by world size after the allreduce. On paper you wouldn't expect that to matter, but for cluster-scale DDP training with amp gradient scaling and allreduces of FP16 grads, we've noticed several cases where post-dividing grads by world size caused nonconvergence while pre-dividing worked. I'm not aware of any cases where the reverse was true.
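
A toy illustration (not from the PR) of the overflow risk with FP16 grads:

```python
import torch

world_size = 64
shards = [torch.full((1,), 1024.0, dtype=torch.float16)
          for _ in range(world_size)]

post = sum(shards) / world_size              # sum overflows fp16 -> inf
pre = sum(g / world_size for g in shards)    # pre-divide stays finite -> 1024.0
print(post, pre)
```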

This PR changes the treatment of sparse gradients to match that of dense gradients (both will be divided by world size before the allreduce).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61814

Reviewed By: mrshenli

Differential Revision: D29772444

Pulled By: rohan-varma

fbshipit-source-id: 033a17d5c019511889d908876282c6624fb26a2d
2021-07-21 09:54:53 -07:00
17d743ff04 ENH Adds test and docs for dropout for no batch dims (#61911)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

I think `Dropout` is already tested in `test_Dropout` for no batch dims.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61911

Reviewed By: albanD

Differential Revision: D29810928

Pulled By: jbschlosser

fbshipit-source-id: 7716a1a808e9e34aae43573f38706212552afbb4
2021-07-21 09:07:10 -07:00
06df33857b fix adaptive_avg_pool (#61851)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61851

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29812559

Pulled By: makslevental

fbshipit-source-id: ac54166aaec63992748ea3299c3144ee107b24f4
2021-07-21 08:42:26 -07:00
33db828e52 Revert D29647586: [jit] Renamed prim::Concat as prim::VarConcat
Test Plan: revert-hammer

Differential Revision:
D29647586 (db11619901)

Original commit changeset: cdd34ea5a3c9

fbshipit-source-id: bab5ac4ed67a00ac151fe39463aa3fb56897d7f4
2021-07-21 08:28:26 -07:00
48af9de92f ENH Enables No-batch for *Pad1d Modules (#61060)
Summary:
Toward https://github.com/pytorch/pytorch/issues/60585

This PR adds a `single_batch_reference_fn` that uses the single-batch implementation to check the no-batch behavior.
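
A sketch of the idea (assuming the no-batch path this PR enables for the Pad1d modules):

```python
import torch

def single_batch_reference(module, inp):
    # Run the unbatched input as a batch of one, then strip the batch dim;
    # the unbatched call should match this.
    return module(inp.unsqueeze(0)).squeeze(0)

m = torch.nn.ReflectionPad1d(2)
x = torch.randn(3, 5)   # (C, L), no batch dim
assert torch.allclose(m(x), single_batch_reference(m, x))
```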

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61060

Reviewed By: mrshenli

Differential Revision: D29739823

Pulled By: jbschlosser

fbshipit-source-id: d90d88a3671177a647171801cc6ec7aa3df35482
2021-07-21 07:12:41 -07:00
bdf439a958 Adds _LazyInstanceNorm and LazyInstanceNormXd (#60982)
Summary:
Signed-off-by: Calvin McCarter <calvin@lightmatter.co>

Fixes https://github.com/pytorch/pytorch/issues/60981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60982

Reviewed By: albanD

Differential Revision: D29810547

Pulled By: jbschlosser

fbshipit-source-id: d933d4c7fe5cf7be9b09a5ab93f740b94cf08cc1
2021-07-21 06:45:45 -07:00
db11619901 [jit] Renamed prim::Concat as prim::VarConcat (#61498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61498

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29647586

Pulled By: navahgar

fbshipit-source-id: cdd34ea5a3c986350a813be17e7d428844ea4cbf
2021-07-20 19:30:00 -07:00
7fbdc86aec [jit] Removed a local function to check for dominators and used the one added to Node class (#60909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60909

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29441864

Pulled By: navahgar

fbshipit-source-id: 362bd462fa70256dd1f8b05756a76da0cb3d4b76
2021-07-20 19:29:58 -07:00
429908e540 [jit] Updated the concat common inputs elimination pass to use the variadic cat op instead of aten::cat (#60908)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60908

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29441865

Pulled By: navahgar

fbshipit-source-id: 2ab08168102eff1f43667ca418bdd94bb2df562a
2021-07-20 19:29:57 -07:00
53668f8bf6 [jit] Added an API to remove list mutations and replace with variadic cat until fixed point (#60776)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60776

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29406099

Pulled By: navahgar

fbshipit-source-id: e2e69eb6ebff3bc6e25d80f46ce118e52f557fb6
2021-07-20 19:29:55 -07:00
0cfcf68aa5 [jit] Added special handling for prim::ListConstruct while checking for may alias inputs (#60775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60775

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29406101

Pulled By: navahgar

fbshipit-source-id: 9b8a4050167750610400637e7de48ffa8727051a
2021-07-20 19:29:53 -07:00
4dd04a8bbe [jit] Handled cases when input list to cat is mutated after cat using AliasDb (#60774)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60774

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29406100

Pulled By: navahgar

fbshipit-source-id: af6afca65881c18c51b482eb63898a0f1c94d591
2021-07-20 19:28:42 -07:00
604f503d30 Revert D29794958 + compilation fix (#61937)
Summary:
This PR un-reverts https://github.com/pytorch/pytorch/issues/61475 and fixes compilation with MSVC, which does not recognize alternative operator spellings (e.g. `or` instead of `||`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61937

Reviewed By: albanD

Differential Revision: D29805941

Pulled By: malfet

fbshipit-source-id: 01e5963c6717c1b44b260300d87ba0bf57f26ce9
2021-07-20 18:14:45 -07:00
a152c12d7b .github: Clone pytorch to separate directory (#61932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61932

Clones pytorch to a separate directory for each run so that they do not
overlap with each other

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D29801875

Pulled By: seemethere

fbshipit-source-id: 71a3c7c949e5aeacf033ae1fc9aaef13b42833b6
2021-07-20 17:30:30 -07:00
7cbb7c6d2e [vulkan] Make vulkan ops selective (#58332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58332

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D28454976

Pulled By: IvanKobzarev

fbshipit-source-id: 445c1f326be76e3530a4884aa5fe749d636e0ae5
2021-07-20 16:26:55 -07:00
73fbf43684 [vulkan] Fix asserts (#61495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61495

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D29647357

Pulled By: IvanKobzarev

fbshipit-source-id: cb4ba15f28625ea6e667883c9a2d31eba48b6f37
2021-07-20 16:07:13 -07:00
22fff61f06 Revert D29794958: [pytorch][PR] changing trapz to trapezoid
Test Plan: revert-hammer

Differential Revision:
D29794958 (95cec8f4fa)

Original commit changeset: 60b9c07efd47

fbshipit-source-id: 2dcda2d62e01c2521a86ae5ed8246cfb686d3f64
2021-07-20 16:00:46 -07:00
e067960243 lint_setup should not require elevated privileges (#61798)
Summary:
- s/pip/pip3/ (unversioned pip can reference either pip2 or pip3, depending on setup)
- Always invoke `pip install` with the `--user` option (otherwise, unless one is using a conda environment, it will try to install into a system folder, which should not be writable by regular users)
- Do not install shellcheck in `/usr/bin`; rely on `~/.local/bin` instead and add it to the PATH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61798

Reviewed By: zhouzhuojie, seemethere

Differential Revision: D29747286

Pulled By: malfet

fbshipit-source-id: 30cb51fe60b5096b758f430d1c51465205532a19
2021-07-20 15:53:12 -07:00
994434ad16 Adding complex number support for all_to_all/scatter (#61299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299

Modifies all_to_all and scatter to support complex numbers in addition to floating-point numbers.
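
A minimal sketch of the newly supported case (assumes an already-initialized process group; illustrative only):

```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
inp = list(torch.randn(world_size, dtype=torch.cfloat).chunk(world_size))
out = [torch.empty_like(t) for t in inp]
dist.all_to_all(out, inp)  # complex tensors are now accepted, like floats
```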

Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled

Reviewed By: wanchaol

Differential Revision: D29563938

fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
2021-07-20 15:45:34 -07:00
457a0b63bf use torch.bucketize in the to_sparse_csr implementation (+ additional tests) (#61340)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61340

Reviewed By: bhosmer

Differential Revision: D29601393

Pulled By: cpuhrsch

fbshipit-source-id: 4ca1f013d96e8716f0e658e0cd685d9aa0d98a5c
2021-07-20 15:44:25 -07:00
95cec8f4fa changing trapz to trapezoid (#61475)
Summary:
This PR resolves issue https://github.com/pytorch/pytorch/issues/52606 while also adding support for complex numbers.
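
A quick illustration of the renamed op (default unit spacing):

```
import torch

y = torch.tensor([1.0, 2.0, 3.0])
print(torch.trapezoid(y))  # tensor(4.) -- same result the old torch.trapz gave
```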

Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/61616
* https://github.com/pytorch/pytorch/issues/61615
* **https://github.com/pytorch/pytorch/issues/61475**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61475

Reviewed By: mruberry

Differential Revision: D29794958

Pulled By: NivekT

fbshipit-source-id: 60b9c07efd47fd85b9c8178768fc7828d7b57d29
2021-07-20 15:25:55 -07:00
86715623dd Adding super calls to JIT test case setUp and tearDown (#61922)
Summary:
This issue surfaced when https://github.com/pytorch/pytorch/issues/61655 did not manage to skip the appropriate test case.

I investigated and realized that the setUp code responsible for disabling tests was never called, because another setUp defined in a subclass overrode the parent class's setUp.

I am not sure whether that was intentional; if so, we would have to adapt the subclasses' code to call the check_if_enable function in common_utils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61922

Reviewed By: ejguan

Differential Revision: D29798716

Pulled By: janeyx99

fbshipit-source-id: d31b664e48507d69de14574ff5e6ecf1d41ae24d
2021-07-20 15:08:44 -07:00
7acb8b71e1 Remove AVX detection code that duplicates FindAVX.cmake (#61748)
Summary:
This PR deletes some code in `MiscCheck.cmake` that duplicates the functionality of `FindAVX.cmake`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61748

Reviewed By: ejguan

Differential Revision: D29791282

Pulled By: malfet

fbshipit-source-id: 6595fd1b61c8ae12b821fad8c9a34892dd52d213
2021-07-20 14:34:36 -07:00
e8d2916b84 Add faulty tensorpipe implementation (#61421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61421

This PR adds the faulty tensorpipe agent implementation and replaces all faulty process group agent tests with it. The faulty tensorpipe agent code is very similar to that of faulty process group agent. It allows the user to fail or delay certain types of rpc messages, which is used in the faulty agent tests. These changes are needed to deprecate the process group rpc backend.

Summary of changes:
- Add faulty tensorpipe agent class
- Update the tensorpipe pipeWrite function to allow it to be overridden and to add delays
- Update test backend registry and faulty agent tests to use the FAULTY_TENSORPIPE_AGENT backend.

This affects all faulty agent tests; here are a few of them as sample commands:
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_verify_backend_options`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_no_faulty_messages`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_builtin_remote_message_dropped_timeout`

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29773739

Pulled By: H-Huang

fbshipit-source-id: 6b2bc366735d70b79943d4207f454bc9555bbf5f
2021-07-20 13:54:30 -07:00
d856914c57 Fix missing braces (#61745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61745

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717538

fbshipit-source-id: ed0ff4fb6a72b701bf6d36ebde343672356a916a
2021-07-20 13:32:38 -07:00
f78142b68d Modernize emplace (#61742)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61742

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717433

fbshipit-source-id: 93996388780862e90ab4e697508407091e8e763b
2021-07-20 13:31:19 -07:00
2c2a084012 approx 100x acceleration for parse_kineto_results (#60432)
Summary:
Fixes https://github.com/pytorch/kineto/issues/308; https://github.com/pytorch/pytorch/issues/58983 may be related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60432

Reviewed By: ilia-cher

Differential Revision: D29715257

Pulled By: gdankel

fbshipit-source-id: 7c94d1bb00b609f502db7aa9d9a447ab09645e6a
2021-07-20 13:21:49 -07:00
4567a50b2a Enable clang-tidy on master (#61689)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61689

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29767984

Pulled By: 1ntEgr8

fbshipit-source-id: 658355da274ada41e01ed2772a03a701b90fbbab
2021-07-20 12:55:12 -07:00
8b88c24670 add channels last support for thnn_conv2d (non-dilated) (#49582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49582

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007050

Pulled By: VitalyFedyunin

fbshipit-source-id: 1289e0687c2459dd4eb8e4ba2efc8266397cfe5f
2021-07-20 12:50:24 -07:00
91bc285084 Fix clang-tidy error in pre-commit script (#61918)
Summary:
Fixes a clang-tidy error in the git-pre-commit script. See log below for the error it fixes.

```
Running pre-commit flake8
Running pre-commit clang-tidy
usage: clang_tidy [-h] [-e CLANG_TIDY_EXE] [-g GLOB] [-x REGEX] [-c COMPILE_COMMANDS_DIR] [--diff-file DIFF_FILE] [-p PATHS [PATHS ...]] [-n] [-v] [-q] [--config-file CONFIG_FILE] [--print-include-paths] [-I INCLUDE_DIR] [-s]
                  [--disable-progress-bar]
                  [extra_args [extra_args ...]]
clang_tidy: error: unrecognized arguments: -j
```

It gets rid of the redundant binary check because `tools.linter.clang_tidy` already does this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61918

Test Plan: Run `tools/git-pre-commit`. It should not show a clang-tidy error.

Reviewed By: driazati

Differential Revision: D29796383

Pulled By: 1ntEgr8

fbshipit-source-id: b804b0170747f04e84d21e03d1c4985748d78cf2
2021-07-20 12:40:56 -07:00
f6446802c7 Revert D29783943: [pytorch][PR] add BFloat16 operators on CPU: arange, acosh, asinh, atanh, exp2, digamma, trigamma, polygamma
Test Plan: revert-hammer

Differential Revision:
D29783943 (513c40cb1a)

Original commit changeset: 40cebe829720

fbshipit-source-id: 5276dea572f1286dad7b7caa69ecc2f369ec13ff
2021-07-20 12:33:52 -07:00
c2cc6a9396 Add generic join unit tests (#61786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61786

This adds unit tests for the generic join context manager.

```
gpurun python test/distributed/algorithms/test_join.py
```

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746646

Pulled By: andwgu

fbshipit-source-id: 2933d85783c2225574c4b77bfb90064690c6e668
2021-07-20 12:13:05 -07:00
1c80b5220b nll_loss_forward: port to structured kernel (#61443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61443

For more information, see #55070.

This PR also adds a new type, `OptionalTensorRef`, as a replacement for `c10::optional<Tensor>&` in order to avoid the reference-count manipulations that are inevitable with the latter. I have confirmed using Godbolt/Compiler Explorer that this class does indeed avoid manipulating the reference count of the `intrusive_ptr` inside the `Tensor` it refers to:

1. [P429709479](https://www.internalfb.com/phabricator/paste/view/P429709479) - Given a `const Tensor&` in scope, an `OptionalTensorRef` can be constructed without bumping refcount.
2. [P429709883](https://www.internalfb.com/phabricator/paste/view/P429709883) - Given an `OptionalTensorRef`, a `const Tensor&` can be produced without bumping refcount.
3. [P429710335](https://www.internalfb.com/phabricator/paste/view/P429710335) - When `OptionalTensorRef` is destructed, the refcount should not be decremented.
4. [P429769525](https://www.internalfb.com/phabricator/paste/view/P429769525) - `OptionalTensorRef` can be assigned without refcount manipulation.
5. [P429769882](https://www.internalfb.com/phabricator/paste/view/P429769882) - `OptionalTensorRef` can be move assigned without refcount manipulation.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29780666

Pulled By: SplitInfinity

fbshipit-source-id: 7af157215300e9254d635433cbd583f7329fe064
2021-07-20 11:45:44 -07:00
f0df0207ec [jit] Arithmetic simplification for integers. (#61444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61444

Add a mini pass to merge arithmetic nodes like (((x - 1) + 2) * 1) - 1.
Issue #60913
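
A minimal sketch of the pattern the pass targets (`_jit_pass_peephole` is an internal binding, used here only for illustration):

```
import torch

@torch.jit.script
def f(x: int):
    return (((x - 1) + 2) * 1) - 1  # chained int arithmetic; should fold to x

graph = f.graph
torch._C._jit_pass_peephole(graph)  # run the peephole pass, which now merges these nodes
print(graph)
```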

Test Plan:
python test/test_jit.py TestPeephole.test_peephole_arith

Imported from OSS

Reviewed By: eellison

Differential Revision: D29630614

fbshipit-source-id: 08ac64cee39070401f9ff9163d309f20ff53c5ac
2021-07-20 11:35:42 -07:00
d2abfc547b Add ShardedTensorMetadata for ShardedTensor. (#61683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61683

This PR adds a consolidated metadata field (ShardedTensorMetadata)
which has all the necessary global metadata for a ShardedTensor.
ghstack-source-id: 133847517

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D29703719

fbshipit-source-id: 567279e46c787a88ef3310e4dce6fd2ad7631c62
2021-07-20 11:28:13 -07:00
87334c40a7 Remove torch._bmm and remove torch.bmm deterministic arg documentation (#61629)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61629

Reviewed By: mrshenli

Differential Revision: D29774486

Pulled By: albanD

fbshipit-source-id: bfc9119c478f0244d5be681bcf4954a3eb97e542
2021-07-20 10:55:43 -07:00
513c40cb1a add BFloat16 operators on CPU: arange, acosh, asinh, atanh, exp2, digamma, trigamma, polygamma (#60444)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60444

Reviewed By: ejguan

Differential Revision: D29783943

Pulled By: ezyang

fbshipit-source-id: 40cebe8297207669d1ca430ed1d1e81dda5a0c45
2021-07-20 10:30:04 -07:00
45751e0b34 Support integral target for the backward of nn.SmoothL1Loss (#61112)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58816

- enhance the backward of `nn.SmoothL1Loss` to allow an integral `target` (see the sketch below)
- add test cases in `test_nn.py` to check that `input.grad` matches between an integral `target` and its floating-point counterpart.
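
A minimal sketch of the newly allowed case (behavior as described above; dtype promotion in the forward is assumed):

```
import torch
import torch.nn.functional as F

inp = torch.randn(4, requires_grad=True)
target = torch.randint(0, 5, (4,))        # integral target
F.smooth_l1_loss(inp, target).backward()  # backward no longer rejects it
assert inp.grad is not None
```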

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61112

Reviewed By: mrshenli

Differential Revision: D29775660

Pulled By: albanD

fbshipit-source-id: 544eabb6ce1ea13e1e79f8f18c70f148e92be508
2021-07-20 10:24:03 -07:00
59a5312ce6 Modernize fix deprecated header (#61736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61736

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716965

fbshipit-source-id: 314c2b557c240ac16bbfab114ab764beb189e78a
2021-07-20 10:06:11 -07:00
5a04bd8723 Modernize some loops in torch (#61737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61737

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716813

fbshipit-source-id: 21f9716bead4e0e913406e681c55d1956327e6af
2021-07-20 10:04:54 -07:00
65616184bc [Docs] Bundle of errata and small corrections / improvements for torch.linalg docs (#61578)
Summary:
This PR bundles a number of errata detected in the linalg docs over the last few weeks.

- Simpler Cholesky deprecation rule
- Remove repeated consecutive words
- Correct cond with rcond in lstsq
- Correct examples of lstsq
- More concise examples
- Use the names of the inputs / outputs in the variables of the examples

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61578

Reviewed By: mrshenli

Differential Revision: D29757988

Pulled By: mruberry

fbshipit-source-id: a740a64826c065c1d7c1b8b498364d147008d76d
2021-07-20 09:58:09 -07:00
a0c9d70fba bitwise_and: Port to structured (#60813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60813

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449374

Pulled By: ezyang

fbshipit-source-id: d7e236ad841dcb9d5914352d117a34b10894bb91
2021-07-20 09:01:41 -07:00
875d63ed04 bitwise_xor: Port to structured (#60812)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60812

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449372

Pulled By: ezyang

fbshipit-source-id: 016d2012f64486c2490ff319e753b0d054dccf2c
2021-07-20 09:01:40 -07:00
ce8aeefbf4 bitwise_or: Port to structured (#60811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60811

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449370

Pulled By: ezyang

fbshipit-source-id: ac176985b0141a55807ba909d7342eb35b1dc28f
2021-07-20 09:00:20 -07:00
f59ac5abc8 Add thread local state guards in autograd engine hooks. (#60067)
Summary:
The thread-local state of the backward thread is not aligned with the GraphTask's `thread_local_` when calling the hooks in backward.

This is required for profiling the statistics of c10d operations in the `DistributedDataParallel` module.

Is there any concern about adding the thread-local state guard when calling the hooks in backward, ezyang?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60067

Reviewed By: ezyang

Differential Revision: D29654599

Pulled By: albanD

fbshipit-source-id: 656c4f91017184fd40f1a184de24757a13387e37
2021-07-20 07:41:49 -07:00
641f6ef8a7 Implement IMethod::getArgumentNames() (#61856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61856

This diff does the following:
1. It implements IMethod::getArgumentNames() for all IMethod subclasses.
2. It refactors PyTorchDeployPredictor to use IMethod for model execution.

Test Plan:
[... ~/fbsource/fbcode/caffe2] buck test mode/dev caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor
[... ~/fbsource/fbcode/caffe2] buck test mode/dev caffe2/fb/predictor:pytorch_predictor_test -- PyTorchPredictor

Reviewed By: wconstab

Differential Revision: D29648756

fbshipit-source-id: e047345f26ce495a5d74d8063f7f8edc32a1b13c
2021-07-19 23:16:48 -07:00
42d6543c7b [bc-breaking] Dispatch index_put with boolean mask argument to masked_fill (#61612)
Summary:
https://github.com/pytorch/pytorch/issues/57515

Based on ngimel's branch, with a few tweaks to determine when to copy value tensors to device memory, plus additional tests.
bc-breaking note: Previously, if in `x[index] = value` the `value` was a 0-d tensor with a device different from `x`'s device, it resulted in a RuntimeError. Now this case is handled by copying `value` to the correct device, as illustrated below.
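
A minimal sketch of the bc-breaking case (assumes a CUDA device is available):

```
import torch

x = torch.zeros(4, device="cuda")
mask = torch.tensor([True, False, True, False], device="cuda")
value = torch.tensor(1.0)  # 0-d CPU tensor; device differs from x's
x[mask] = value            # previously a RuntimeError; value is now copied to x's device
```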

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61612

Reviewed By: mrshenli

Differential Revision: D29753491

Pulled By: ngimel

fbshipit-source-id: 3fba14f4c2b9b136b50af020f9c1eda88f7373b0
2021-07-19 22:53:14 -07:00
018dc4193e Factor vector intrinsics out of SumKernel.cpp (#61483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61483

This will make it simpler to support AVX512 which is upcoming in #56992, see https://github.com/pytorch/pytorch/pull/56992#discussion_r667060280 for reference.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29753536

Pulled By: ngimel

fbshipit-source-id: 03ae66cdc01a3679c67214468e2bdf93b15c3bc2
2021-07-19 21:49:01 -07:00
c44d9d9f70 Use cascade-summation to improve nansum accuracy (#61082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61082

Fixes #59415

This implements nansum as a new `LoadPolicy` for the existing sum functions,
so it uses the more accurate cascade-sum algorithm.

I've also expanded `test_nansum` to cover the four special cases of the sum
algorithm (inner/outer reduction; vectorized or scalar).
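
The Python-level behavior of the op is unchanged; for reference:

```
import torch

a = torch.tensor([1.0, float("nan"), 2.0])
print(torch.nansum(a))  # tensor(3.) -- NaNs are treated as zero
```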

Nansum performance comparison
-----------------------------
For float sums, contiguous reductions are as much as 10x faster and discontiguous sums are ~1.8x faster (more for small shapes due to TensorIterator overheads).

|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          74.9          |           2.02          |            75.6           |            6.41            |
|              | 1   |          8.24          |           1.8           |            8.28           |            5.24            |
|    100, 1000 | 0   |           134          |           7.55          |            130            |            43.2            |
|              | 1   |          70.5          |           7.01          |            71.5           |            40.6            |
|   1000, 1000 | 0   |           726          |           69.2          |            737            |             403            |
|              | 1   |           702          |           51.0          |            709            |             404            |
|  10000, 1000 | 0   |         15,300         |          2,470          |           18,200          |           10,400           |
|              | 1   |          7,200         |          1,160          |           7,470           |            4,440           |
| 100000, 1000 | 0   |         163,000        |          28,000         |          199,000          |           131,000          |
|              | 1   |         70,700         |          13,500         |           75,700          |           44,200           |

Sum performance comparison
--------------------------

For float sums, performance is unchanged to within measurement precision:
|        Shape | Dim | Master Contiguous (us) | This PR Contiguous (us) | Master Discontiguous (us) | This PR Discontiguous (us) |
|-------------:|-----|:----------------------:|:-----------------------:|:-------------------------:|:--------------------------:|
|     10, 1000 | 0   |          1.92          |           2.01          |            4.2            |            4.49            |
|              | 1   |          1.68          |           1.68          |            2.79           |            2.75            |
|    100, 1000 | 0   |          6.52          |           7.07          |            26.9           |            27.3            |
|              | 1   |          5.91          |           5.66          |            16.8           |            16.9            |
|   1000, 1000 | 0   |          55.6          |           58.6          |            256            |             254            |
|              | 1   |          41.0          |           41.2          |            150            |             147            |
|  10000, 1000 | 0   |          1,370         |          1,650          |           8,070           |            8,020           |
|              | 1   |           908          |           845           |           3,100           |            2,980           |
| 100000, 1000 | 0   |         24,700         |          24,700         |           90,900          |           91,000           |
|              | 1   |         12,500         |          12,100         |           31,500          |           31,800           |

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29753523

Pulled By: ngimel

fbshipit-source-id: 28095ac39e4a07ff878775c98f7a7815d9a4e457
2021-07-19 21:47:43 -07:00
bf1c9aaa79 logit_backward: Port to structured (#60817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60817

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449376

Pulled By: ezyang

fbshipit-source-id: e6f793300488370f50a97db58f0400c557ee64e5
2021-07-19 21:23:05 -07:00
b8686b42d8 tanh_backward: Port to structured (#60816)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60816

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449375

Pulled By: ezyang

fbshipit-source-id: 93b70341fc6a2a42056fef74d6e5d81ec34e9da2
2021-07-19 21:23:03 -07:00
8c42d7ad07 sigmoid_backward: Port to structured (#60815)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60815

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449371

Pulled By: ezyang

fbshipit-source-id: e68c05cc90446e86d50b67d8346f145bf13ed207
2021-07-19 21:23:01 -07:00
11cc179366 xlogy: Port to structured (#60814)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60814

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449373

Pulled By: ezyang

fbshipit-source-id: a37499cd4fabff80f848627def7dd500364b8a22
2021-07-19 21:21:54 -07:00
9fb6b40f3e Makes a streaming backward test try gradient stealing more directly (#60065)
Summary:
Closes https://github.com/pytorch/pytorch/issues/59846.

https://github.com/pytorch/pytorch/issues/59846 is likely paranoia, and some of the test_streaming_backward_* tests in test_cuda.py already use gradient stealing (i.e., they start with `.grad`s as None before backward). Regardless, this PR augments one of the tests to stress gradient stealing a bit more directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60065

Reviewed By: mrshenli

Differential Revision: D29779518

Pulled By: ngimel

fbshipit-source-id: ccbf278543c3adebe5f4ba0365b1dace9a14da9b
2021-07-19 20:39:55 -07:00
873cc7a46d Support 3 argument variant of the getattr() call where the third arg is the default return value (#61599)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/56909

Note that the emitted code for such a call will be either (a) a getattr() call with the first two args, if the
attribute name (which must be a string literal) is determined to be valid based on the hasAttr() result,
or (b) just the AST node for the default value (the third arg) alone, with no getattr call at all.

Test code:

```
import torch
import numpy as np

class Shape:
    def __init__(self):
        self.center = 1.0

def f(x):
    s = Shape()
    return getattr(s, "missing", [])

y = torch.jit.script(f)
print(y.graph)
```
Output:
```
graph(%x : Tensor):
  %s.1 : __torch__.Shape = prim::CreateObject()
  %2 : NoneType = prim::CallMethod[name="__init__"](%s.1) # ts.py:10:8
  %4 : Tensor[] = prim::ListConstruct()
  return (%4)
```

Another example:
```
import torch

class Shape:
    def __init__(self):
        self.center = 1.0

def f(x):
    s = Shape()
    y = getattr(s, "center")
    w : list[float] = [1.0]
    z = getattr(s, "missing", w)
    z.append(y)
    return z

y = torch.jit.script(f)
print(y.graph)
 --- output ---

graph(%x : Tensor):
  %5 : float = prim::Constant[value=1.]() # ts.py:12:23
  %s.1 : __torch__.Shape = prim::CreateObject()
  %2 : NoneType = prim::CallMethod[name="__init__"](%s.1) # ts.py:10:8
  %center : float = prim::GetAttr[name="center"](%s.1)
  %w.1 : float[] = prim::ListConstruct(%5)
  %11 : float[] = aten::append(%w.1, %center) # ts.py:14:4
  return (%w.1)
```
Fixes #56969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61599

Reviewed By: ZolotukhinM

Differential Revision: D29776058

Pulled By: jerryzhenleicai

fbshipit-source-id: 76333bd54002e08a064677c1f287115a80cc7c8e
2021-07-19 20:04:21 -07:00
ffd2e602f4 [CUDA graphs] Make sure graph mempool cudaMalloc_count decrement pairs with cudaFree for all allocations (#61567)
Summary:
Graph mempools aren't deleted until all their allocations are cudaFreed. `PrivatePool::cudaMalloc_count` tracks the number of outstanding (not-yet-cudaFreed) allocations.

https://github.com/pytorch/pytorch/pull/44742 moves cudaFree to [release_block](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1160), while the `cudaMalloc_count` decrement (if needed) remains in a caller ([release_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1177)). But I noticed there's also a path ([release_available_cached_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1094)) that calls `release_block` without calling `release_blocks`, in other words, it calls cudaFree but dodges any potential `cudaMalloc_count` decrement.

In practice, the way the code is currently organized, I don't _think_ this second path can cause the pool to become a zombie whose `cudaMalloc_count` will never reach zero (I think this could only happen if you call `release_available_cached_blocks` on a private pool, and the only way it would be called on a private pool is if capture is underway, and if capture is underway, the cudaFree call will hard error). Regardless, I feel much more comfortable keeping the cudaMalloc_count decrement right next to the cudaFree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61567

Reviewed By: mrshenli

Differential Revision: D29765198

Pulled By: ezyang

fbshipit-source-id: bcbeed656c3e0d101112aa470d8a098c73a011b1
2021-07-19 19:22:18 -07:00
208d06ca8c Port other comparison ops: ne, lt, gt, le, ge to structured kernels. (#60942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60942

Tracking Issue: #55070

This PR applies the same transformation used for `eq` to the other comparison ops: `ne`, `lt`,
`gt`, `le`, and `ge`. Macros for creating the meta and impl functions are used (since the
checks they perform are the same).

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29509868

Pulled By: ezyang

fbshipit-source-id: 6a1ed1d93d08884c9e09d3f419037533a235d68c
2021-07-19 19:14:12 -07:00
97327137ba Port eq kernel to structured kernels. (#60177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60177

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29509871

Pulled By: ezyang

fbshipit-source-id: ad81bb49c46edc81c705d12108b98c5ffaaddf92
2021-07-19 19:13:09 -07:00
64ac428889 [vulkan] Adaptive local work group size (#61170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61170

Instead of using a fixed local work group size of {4,4,4}, adjust the size based on the global size in order to minimize the number of inactive invocations.
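
A minimal sketch of the idea (a hypothetical heuristic for illustration only, not the actual C++ implementation):

```
def adaptive_local_size(global_size, max_invocations=64):
    # greedily assign powers of two to each axis, capped by both the
    # remaining invocation budget and that axis's global extent
    local, remaining = [], max_invocations
    for g in global_size:
        s = 1
        while s * 2 <= remaining and s * 2 <= g:
            s *= 2
        local.append(s)
        remaining //= s
    return local

print(adaptive_local_size([16, 16, 4]))  # [16, 4, 1] -- no inactive invocations
```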

## Perf improvements from this change
On aloha portal devices, in conjunction with the below diff that tweaks the conv2d_pw shader to calculate a 4x4 output, benchmark latency of the xirp14b model was reduced from ~8.7 ms to ~6.6 ms.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D28724591

fbshipit-source-id: ede896300b2be1a9578e492cb870121012886aa7
2021-07-19 18:52:19 -07:00
f324421d34 [vulkan] Calculate a 4x4 output tile for each invocation in conv2d_pw (#60760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60760

A simple optimization to the `conv2d_pw` shader that makes each invocation calculate a 4x4 output tile instead of a single output texel. This results in better memory reuse and subsequently a pretty significant performance win for models similar to the MobileNets.

## Perf improvements from this change
On aloha portal devices, in conjunction with the above diff that introduces adaptive work group sizes, benchmark latency of the xirp14b model was reduced from ~8.7 ms to ~6.6 ms.

Test Plan:
Test vulkan ops:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D28724590

fbshipit-source-id: e742286b01bf566dc6378677be55409b7faa8cfb
2021-07-19 18:52:18 -07:00
a1b5025ecd [vulkan] Convolution op cleanup (#60759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60759

Remove unused convolution implementations and refactor convolution op code to make this file easier to maintain.

Test Plan:
Test vulkan ops:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D28724592

fbshipit-source-id: cb509fa1cd68089f78188bfb3c866aabc9b0cbdb
2021-07-19 18:52:16 -07:00
cacab7e9d6 [vulkan] Reduce submission rate to save CPU cycles (#60758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60758

Further tweaks the submission rate of ops. Previously, in D28293756 (bc0965ac85), the submission rate was set as high as possible in order to prioritize performance. However, in practice (i.e., when running the model in an app) the high rate of submission increases CPU usage and GPU contention, which may regress fps.

In the future it would be beneficial to devise a scheme to adaptively set the GPU submission rate.

## Perf Improvements
This change doesn't really affect benchmark latency. However, through systraces it can be observed that CPU usage is reduced without too much impact on FPS/model latency.

Test Plan:
Test vulkan ops:

```

cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D29062836

fbshipit-source-id: 1a0f42b49fecb80baee08cb3f1048bb35a1b5d5c
2021-07-19 18:51:04 -07:00
554038c2a2 [package] merge test_torchscript into test_package_script (#61807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61807

These shouldn't be separate files; they test the same thing.

Differential Revision: D29748967

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: 177f40fa460d00d064dfd1f33a0b6656b214a296
2021-07-19 18:23:45 -07:00
f02cfcc802 ban PyTorchStreamWriter from writing the same file twice (#61805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61805

Similar in spirit to https://github.com/pytorch/pytorch/pull/61371.
While writing two files with the same name is allowed by the ZIP format,
most tools (including our own) handle this poorly. Previously I banned
this within `PackageExporter`, but that doesn't cover other uses of the
zip format like TorchScript.

Given that there are no valid use cases, and debugging issues caused by
multiple file writes is fiendishly difficult, this behavior is now banned entirely.
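
A minimal sketch of the now-banned pattern (using the internal `torch._C.PyTorchFileWriter` binding purely for illustration):

```
import torch

writer = torch._C.PyTorchFileWriter("/tmp/archive.zip")
writer.write_record("data.bin", b"first", 5)
writer.write_record("data.bin", b"second", 6)  # now raises instead of silently writing a duplicate entry
```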

Differential Revision: D29748968

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: 0afee1506c59c0f283ef41e4be562f9c22f21023
2021-07-19 18:23:43 -07:00
04043d681e [package] fix storage serialization collision (#61806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61806

Currently, if you do `save_pickle` on a ScriptModule, then `save_pickle`
on a tensor, this would result in a `0.storage` tensor being written
*twice* to the zip archive. This would cause weird bugs on the
serializing side (this presented as an ASAN-detected heap buffer overflow,
because we tried to read more memory from a tensor than we actually
had).

Turns out this was because when we did:
```
self.storage_context = self.script_module_serializer.storage_context()
```
it returned a new copy of the storage context, so we weren't actually
assigning unique names to tensors!!

This PR fixes the issue by making `(De)SerializationStorageContext`
non-copyable and fixing up the parts of the bindings that returned by
copy.

Differential Revision: D29748969

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: c2f89ab270e07e7a111fb35c545b5e07b804dc3c
2021-07-19 18:22:36 -07:00
c30048fccf add BFloat16 support for topk on CPU (#59547)
Summary:
Added BFloat16 support for topk on CPU, and collected benchmark data for topk with the BFloat16 and Float32 data types using the operator_benchmark tool of PyTorch on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz

Input: 512x512, 512x1024, 1024x512, 1024x1024
K: 5
Number of cores: 1 core, 28 cores (1 socket)

For 1 core:

 ----------------------------------------
 PyTorch/Caffe2 Operator Micro-benchmarks
 ----------------------------------------
 Tag : all

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W512_k5_dtypetorch.float32_cpu
 Input: H: 512, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 911.401

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W512_k5_dtypetorch.bfloat16_cpu
 Input: H: 512, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 911.700

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W1024_k5_dtypetorch.float32_cpu
 Input: H: 512, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 1506.927

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W1024_k5_dtypetorch.bfloat16_cpu
 Input: H: 512, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 1492.036

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W512_k5_dtypetorch.float32_cpu
 Input: H: 1024, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 1825.634

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W512_k5_dtypetorch.bfloat16_cpu
 Input: H: 1024, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 1819.872

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W1024_k5_dtypetorch.float32_cpu
 Input: H: 1024, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 3001.459

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W1024_k5_dtypetorch.bfloat16_cpu
 Input: H: 1024, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 2970.718

For 28 cores (1 socket):

 ----------------------------------------
 PyTorch/Caffe2 Operator Micro-benchmarks
 ----------------------------------------
 Tag : all

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W512_k5_dtypetorch.float32_cpu
 Input: H: 512, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 146.995

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W512_k5_dtypetorch.bfloat16_cpu
 Input: H: 512, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 123.423

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W1024_k5_dtypetorch.float32_cpu
 Input: H: 512, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 105.967

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H512_W1024_k5_dtypetorch.bfloat16_cpu
 Input: H: 512, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 101.498

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W512_k5_dtypetorch.float32_cpu
 Input: H: 1024, W: 512, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 128.023

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W512_k5_dtypetorch.bfloat16_cpu
 Input: H: 1024, W: 512, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 125.172

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W1024_k5_dtypetorch.float32_cpu
 Input: H: 1024, W: 1024, k: 5, dtype: torch.float32, device: cpu
Forward Execution Time (us) : 129.855

 Benchmarking PyTorch: topk
 Mode: Eager
 Name: topk_H1024_W1024_k5_dtypetorch.bfloat16_cpu
 Input: H: 1024, W: 1024, k: 5, dtype: torch.bfloat16, device: cpu
Forward Execution Time (us) : 124.556
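
A quick illustration of the newly supported dtype:

```
import torch

x = torch.randn(512, 512).bfloat16()
values, indices = torch.topk(x, k=5)  # BFloat16 topk on CPU now works
```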

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59547

Reviewed By: mrshenli

Differential Revision: D29763916

Pulled By: ezyang

fbshipit-source-id: 706c7d4349ac9ebd5d63f4844fca70febcb67023
2021-07-19 16:06:24 -07:00
15210f3b82 ignore and clear not ready errors (#61554)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where an event or stream query might result in "not ready" errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554

Reviewed By: mrshenli

Differential Revision: D29763973

Pulled By: ezyang

fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9
2021-07-19 16:03:04 -07:00
e68c016871 Regenerate libtorch workflow files that got lost in merge conflict (#61872)
Summary:
Forward-fixes a merge conflict on master: https://github.com/pytorch/pytorch/runs/3106027618

for PR https://github.com/pytorch/pytorch/issues/61774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61872

Reviewed By: dzhulgakov

Differential Revision: D29775595

Pulled By: janeyx99

fbshipit-source-id: 8194dd123f166fd5f3fd1e77417e865c188f40c8
2021-07-19 15:30:13 -07:00
0a6d88244b Fix grammatical errors on the PyTorch Contribution Guide (#61818)
Summary:
## What does the PR do?
- Fix grammatical errors on the PyTorch Contribution Guide page.

## Changes [Screenshots]
> Note:
> 1. The changes are highlighted in each screenshot.
> 2. Could not load CSS while testing locally; hopefully that is not an issue, as all the changes are content-only.

1.
![Change1](https://user-images.githubusercontent.com/20442648/126077764-39fd8b78-524f-407d-bc39-c93167bd10a7.PNG)

2.
![Change2](https://user-images.githubusercontent.com/20442648/126077766-9dd7dc61-ef06-41d0-a7e5-cfd179ece0cd.PNG)

3.
![Change3](https://user-images.githubusercontent.com/20442648/126077767-2c2e05e4-09fc-403a-a18e-9b108651a5f8.PNG)

4.
![Change4](https://user-images.githubusercontent.com/20442648/126077769-ad755db6-3afa-457b-b95c-9f6c6281f828.PNG)

5.
![Change5](https://user-images.githubusercontent.com/20442648/126077770-a7759dee-7f90-4b9e-a07c-4dec4ca934d0.PNG)

6.
![Change6](https://user-images.githubusercontent.com/20442648/126077772-0474e58d-c0c8-4156-b56f-808d225c38e7.PNG)

7.
![Change7](https://user-images.githubusercontent.com/20442648/126077774-d48382a7-5379-49a4-a8d2-b478fabf0bf0.PNG)

8.
![Change8](https://user-images.githubusercontent.com/20442648/126077777-fd743825-8dd7-4cb9-a22c-233e5fa085a6.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61818

Reviewed By: dzhulgakov

Differential Revision: D29775606

Pulled By: mrshenli

fbshipit-source-id: 3f3bfdeede341f784b72dfe55da9ba8bdce1192a
2021-07-19 15:06:22 -07:00
43c5dc40c5 Port signbit to structured kernel (#57936)
Summary:
Port signbit to structured kernel
Related https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57936

Reviewed By: mrshenli

Differential Revision: D29764904

Pulled By: ezyang

fbshipit-source-id: 758f5f085d0cc84af612726f667cde15d615053b
2021-07-19 15:03:10 -07:00
44d3267103 Remove whitespace introduced by #61438 (#61863)
Summary:
Since it's a one-character change, it feels faster to fix than to revert.

Verified with `(! git --no-pager grep -In '[[:blank:]]$' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above lines have trailing spaces; please remove them"; false))` from the lint check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61863

Reviewed By: ZolotukhinM

Differential Revision: D29772353

Pulled By: dzhulgakov

fbshipit-source-id: 33cb887f25e344b420f645a8e4dc8d0d7462e9ef
2021-07-19 14:57:10 -07:00
26d17ddc9f Exclude wrapper tensors from functorch in the native::resize_output fastpath (#61846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61846

Related to #61485.

native::resize_output has a fast path that avoids dispatching.
Unfortunately, we have a number of CompositeImplicitAutograd operations
that directly call out= variants of operators. These
CompositeImplicitAutograd operators (e.g. torch.linalg.norm) end up
calling native::resize_output. That function, combined with how
functorch uses a mode-dispatch key to wrap tensors, causes silently
incorrect behavior in functorch (more details are available in #61485).

The very easy short-term fix is to have `native::resize_output` always
dispatch on a Tensor (and skip the fast-path) if a Tensor is a functorch
wrapped Tensor. More long-term fixes are proposed in the issue.

Test Plan:
- I checked that this change fixes torch.linalg.norm and other operators
with this problem in functorch.
- We're not testing functorch in pytorch/pytorch CI but we probably will
in the near future.
- wait for PyTorch tests.

Reviewed By: ezyang

Differential Revision: D29764293

Pulled By: zou3519

fbshipit-source-id: c7afcb0bd3bc77d2ba716d5b11f62830d8bdf0a9
2021-07-19 13:50:37 -07:00
f912889726 Remove unnecessary Ubuntu version checks (#61738)
Summary:
PR https://github.com/pytorch/pytorch/issues/5401 missed another Ubuntu version check in `cmake/MiscCheck.cmake`.

The checks for available functions added by https://github.com/pytorch/pytorch/issues/5401 are already present below the code snippet that this PR deletes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61738

Reviewed By: mrshenli

Differential Revision: D29757525

Pulled By: ezyang

fbshipit-source-id: 7f5f9312284973481a8b8a2b9c51cc09774722e9
2021-07-19 13:04:24 -07:00
1b0a7f3887 Always use fast gradcheck for LayerNorm 3d_no_affine_large_feature (#61848)
Summary:
Due to the introduction of a test from https://github.com/pytorch/pytorch/pull/59987/files, slow gradcheck has been failing intermittently (timing out/getting killed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61848

Reviewed By: mrshenli

Differential Revision: D29765773

Pulled By: soulitzer

fbshipit-source-id: d78bee758cab76f26ba9f54925c42d4825db9449
2021-07-19 12:33:55 -07:00
094abf5fd0 [BE] Include a unit test for Save Operator with db_options
Summary: A test case that triggers db_options with the save operator is missing.

Test Plan: buck test

Differential Revision: D29642719

fbshipit-source-id: 72b7374d40430398abac26dfe91538550525384d
2021-07-19 12:22:59 -07:00
e389650f10 Upgrade CPUFallback for loops (#61722)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61722

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29715862

fbshipit-source-id: 21e12c71e28e542abc649890f72938801d9d7d7a
2021-07-19 11:27:26 -07:00
04bd9d7577 [DDP] Add API to get model parameters in hook (#61637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61637

To support running an optimizer as a communication hook, add an API to
retrieve the model parameters.

The API returns a `dict[idx -> tensor]` where `idx` is the intra bucket index of gradient tensor and thus the same index of `perParameterTensors`. The API can be used as follows to retrieve the model parameters:

```
per_param_grad_tensors = bucket.get_per_parameter_tensors()
idx_to_model_params = bucket.get_grad_index_to_variable_mapping()
for grad_tensor_idx, model_param in idx_to_model_params.items():
    self.assertEqual(model_param.grad, per_param_grad_tensors[grad_tensor_idx])
```

This provides a way for comm hook developers to retrieve model parameters within a hook. In the next diffs, we will use this to run an optimizer as a DDP comm hook.
ghstack-source-id: 133768666

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29691418

fbshipit-source-id: 4bfa824768a5850f73ee330017e2bcc29ceb7edc
2021-07-19 11:24:54 -07:00
66c8d21d7b Update progress and error reporting in clang-tidy (#61672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61672

This PR adds a progress bar to clang-tidy, and updates how it threads error codes (when run in parallel). The progress bar is disabled on GHA because backspace escape codes are not supported.

It also adds a `--quiet` flag to the script.

Screenshot of progress bar:
<img width="955" alt="Screen Shot 2021-07-14 at 3 17 11 PM" src="https://user-images.githubusercontent.com/40111357/125686114-a8a7c154-3e65-43a8-aa8f-c1fb14d51d27.png">

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29763848

Pulled By: 1ntEgr8

fbshipit-source-id: cbd352593b279f279911bc3bb8d5ed54abd5f1d5
2021-07-19 11:19:06 -07:00
24a6eb3fda ENH Adds tests and docs for 2d & 3d modules that already support no batch (#61262)
Summary:
Toward https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61262

Reviewed By: mrshenli

Differential Revision: D29660554

Pulled By: jbschlosser

fbshipit-source-id: d5e3dc7096fcf8621bce4a1063d521b84092e0ca
2021-07-19 11:12:28 -07:00
4f46943e3d enable check trace when tracing a mkldnn model (#61241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43039. When tracing an MKLDNN model with **check_trace=True**, there is an error: **RuntimeError: unsupported memory format option Preserve**. This PR solves that problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61241

Reviewed By: anjali411

Differential Revision: D29737365

Pulled By: suo

fbshipit-source-id: e8f7f124bc6256f10b9d29969e0c65d332514625
2021-07-19 11:03:53 -07:00
75b68def63 fmin has been ported to the structured kernel, removing the old implementation (#60810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60810

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449377

Pulled By: ezyang

fbshipit-source-id: 0b43562d0dfe81dfa401268f1d12e0d2c3c9f420
2021-07-19 10:20:06 -07:00
b526080d89 fmod: Port to structured (#60809)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60809

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29449378

Pulled By: ezyang

fbshipit-source-id: 70f6fa95988f753eec4aefa60a60dddb7f3d744e
2021-07-19 10:18:57 -07:00
b65ddef000 for shared-memory handles, use an atomic counter, instead of potentially colliding random numbers (#60978)
Summary:
These handles, used for shared-memory tensors, can collide.

E.g. see https://github.com/pytorch/pytorch/issues/60626#issuecomment-869919018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60978

Reviewed By: mruberry

Differential Revision: D29479291

Pulled By: ezyang

fbshipit-source-id: 408ef1817768f007ad4795b286482809ea43467c
2021-07-19 09:56:43 -07:00
ac5a40e068 Fix benchmark's import module and remove its usage of tools.stats.scribe (#61808)
Summary:
There are a few convoluted pieces of logic here to fix the `benchmarks` import module for pytest.

- On one hand, if we want to use `tools.stats.scribe` from `benchmarks`, we will need to add `benchmarks/__init__.py`
- On the other hand, if we add `benchmarks/__init__.py`, it breaks how `pytest` searches for the system-built `torch` instead of the local source module `../torch`
  - That's why we are seeing errors like

```
ImportError while loading conftest '/var/lib/jenkins/workspace/benchmarks/fastrnns/conftest.py'.
benchmarks/fastrnns/__init__.py:1: in <module>
    from .cells import *  # noqa: F403
benchmarks/fastrnns/cells.py:1: in <module>
    import torch
torch/__init__.py:29: in <module>
    from .torch_version import __version__ as __version__
torch/torch_version.py:9: in <module>
    from .version import __version__ as internal_version
E   ModuleNotFoundError: No module named 'torch.version'
```

Instead, this PR changes the usage of `upload_scribe.py` back to its original form using an HTTP request, and for now only CircleCI will continue down this path using `python benchmarks/upload_scribe.py`, which is gated by `if [[ -z "${GITHUB_ACTIONS}" ]];`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61808

Reviewed By: seemethere

Differential Revision: D29750188

Pulled By: zhouzhuojie

fbshipit-source-id: 3b842b21978f2159001e9c6c1cdc96c5a0515f2e
2021-07-19 09:45:05 -07:00
9c3346c8aa reduce max_num_threads for complex double ops in reduce_kernel (#61438)
Summary:
reduce_kernel currently has an all-purpose MAX_NUM_THREADS of 512, which causes register spilling in various kernel instantiations for the various ops that use it as a template (ReduceLogicKernel, ReduceMinMaxKernel, ReduceMomentKernel, ReduceNormKernel, and ReduceSumProdKernel). This is a coarse first attempt at mitigating spillage by reducing max_num_threads to 256 for all complex double ops, which are by far the most common and egregious offenders, while keeping it at 512 for the other normal ops, the large majority of which are fine. Besides complex double ops, the remaining kernels which exhibit lmem usage are ReduceMinMax double, long, and BFloat16; ReduceMomentKernel BFloat16, Half, float, and double; and ReduceNorm double.

The proposed fix manages to eliminate lmem usage and massively improve runtime (by 3-5x) for complex double ops. All other ops are unaffected and have the same runtime; if they used lmem before, they still do now. We would still strongly recommend further testing of input shapes and ops, as well as looking into whether there's a cleaner approach to doing this.
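
A minimal sketch of the dispatch heuristic (a hypothetical Python rendering of the C++ change):

```
import torch

def max_num_threads(dtype):
    # complex double instantiations spill registers at 512 threads, so cap them at 256
    return 256 if dtype == torch.complex128 else 512
```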

We tested the following ops for both complex double instantiations, as well as torch.max and torch.argmax with doubles, to make sure they didn't break. We didn't include the double instantiations in the timing data, since they remain unchanged post-fix vs. pre-fix. Timing data for the complex double ops is below (all done on an Nvidia Titan-V GPU):

torch.mean:
![MeanTimingData](https://user-images.githubusercontent.com/22803332/125005623-0f424800-e011-11eb-864e-8419485a9c76.PNG)

torch.linalg.norm:
![NormTimingData](https://user-images.githubusercontent.com/22803332/125005649-179a8300-e011-11eb-96e1-54e18c85a336.PNG)

torch.sum:
![SumTimingData](https://user-images.githubusercontent.com/22803332/125005655-1b2e0a00-e011-11eb-928e-ee5941608fb2.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61438

Reviewed By: mrshenli

Differential Revision: D29756863

Pulled By: ngimel

fbshipit-source-id: 4c4635df58af9313966ff1df1095f7e15a39bb07
2021-07-19 09:38:22 -07:00
d565b3e9ea Migrate libtorch to GHA (#61774)
Summary:
Makes progress on https://github.com/pytorch/pytorch/issues/57686

Tested in https://github.com/pytorch/pytorch/pull/61775:

periodic 11.3 libtorch: https://github.com/pytorch/pytorch/pull/61775/checks?check_run_id=3088529584?check_suite_focus=True
10.2: https://github.com/pytorch/pytorch/pull/61775/checks?check_run_id=3089965441
11.1: https://github.com/pytorch/pytorch/pull/61775/checks?check_run_id=3089965697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61774

Reviewed By: samestep

Differential Revision: D29745793

Pulled By: janeyx99

fbshipit-source-id: a17f561051b1e5eccf4918137a4b5df19308a716
2021-07-19 09:21:52 -07:00
3e3acf8a9a Minor documentation fixes (#61785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61785

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746648

Pulled By: andwgu

fbshipit-source-id: 435bbd8894f2ae5c814b9acd562673affea1daf6
2021-07-19 09:01:29 -07:00
813b887dad Fix indent (#61784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61784

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29746647

Pulled By: andwgu

fbshipit-source-id: f42d3a0864a8291941d695a0cf575a5737cbb35c
2021-07-19 09:00:25 -07:00
a26a9f8b75 zero initialize some members and other fixes (#59915)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59915

Reviewed By: soulitzer

Differential Revision: D29106684

Pulled By: ezyang

fbshipit-source-id: 713cbdf10866017ee715ee89ec82acb592c769b6
2021-07-19 07:36:26 -07:00
0263865bfe [Docs] Fix docs for torch.chunk (#61097)
Summary:
torch.chunk may silently return fewer than the requested number of chunks if some undocumented division constraints are not met. The functionality that users expect is provided by another function: torch.tensor_split

This has led to confusion countless times, and who knows how many systems out there are fragile because of this.
My changes describe the discrepancy, show an example, and direct users to the usually preferred function.

Issues mentioning this problem:
https://github.com/pytorch/pytorch/issues/9382
https://github.com/torch/torch7/issues/617

I considered documenting the constraint for when an unexpected number of chunks may be returned (it is `chunks*chunks > input.size(dim)`), so that users could quickly tell if their code may be affected. Please let me know if you think this should be in the docs or not.
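
The discrepancy in one example:

```
import torch

x = torch.arange(6)
print(len(torch.chunk(x, 4)))         # 3 -- silently fewer chunks than requested
print(len(torch.tensor_split(x, 4)))  # 4 -- always the requested number
```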

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61097

Reviewed By: heitorschueroff

Differential Revision: D29660280

Pulled By: ezyang

fbshipit-source-id: 675086bc8a8882c1685a50a2c083ae8dd1854384
2021-07-19 06:13:04 -07:00
552eab7935 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D29758833

fbshipit-source-id: e07673bb19f15865bf5810910224f3f37a759db7
2021-07-19 04:12:20 -07:00
593e8f41ca [jit] Fixed a bug in the pass that replaces cat with the variadic op (#61795)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61795

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D29748785

Pulled By: navahgar

fbshipit-source-id: df5b84c35f007718c92a21a0b44a231e6d346918
2021-07-18 21:38:30 -07:00
ff82394fc0 Apply saved tensor hooks (#60975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60975

Fixes #58512

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29466227

fbshipit-source-id: c1498d52173aceb29638b5c4f521ac05356a5958
2021-07-18 08:42:51 -07:00
eefbff773b ns for fx: add utils for l2 error and cosine similarity (#61380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61380

Adds convenience wrappers for l2 error and cosine similarity
to NS utils.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extend_logger_results_with_comparison
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600354

fbshipit-source-id: 670c44a44df7f345884cacf26ed3c885edbe9977
2021-07-17 20:53:43 -07:00
2a2bc1fc8a ns for fx: add fqn to results, when present (#61377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61377

Both the quantization tracer and the NS tracer record
`_node_name_to_scope`, which contains the mapping from
node name to FQN.

This PR adds the FQN information to the NS results, so that it is
more convenient for users to attribute a NS result to the corresponding
module in their model.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_fqn
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_match_activations_fqn
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_shadow_activations_fqn
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29600349

fbshipit-source-id: df489e03daff97dd380f59c83ffdc2b0012a0a53
2021-07-17 20:53:41 -07:00
7449f49a4c ns for fx: return results in execution order (#61360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61360

By default, NS graph matching matches from the end of the graph
to the start.  This PR reverses the returned results so that
the outputs of the NS APIs are in the order of execution, making
it easier to analyze.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_results_order
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600348

fbshipit-source-id: c9fa4a3748db27c1788eebf803f35221e6fc8701
2021-07-17 20:53:39 -07:00
2b2928c5ca ns for fx: improve error messages for graph matching (#61359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61359

Makes the error messages produced during graph matching easier
for users to read.

Test Plan:
```
// inspect the exceptions in the following two tests and verify
// that they are easier to read than before
python test/test_quantization.py TestFXGraphMatcher.test_matching_failure_node_count
python test/test_quantization.py TestFXGraphMatcher.test_matching_failure_node_type
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600353

fbshipit-source-id: ec6640fe6cab7b62a697e4ee385be182f2918fd4
2021-07-17 20:53:38 -07:00
ddf6d6cc14 ns for fx: clean up override_qengines and copy TODO in tests (#61358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61358

1. changes override_qengines to require fbgemm instead; these tests do not
exercise any qengine-specific logic, so it is better to run them once
2. removes a TODO about copy.deepcopy, which we do not plan to address

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600352

fbshipit-source-id: 4db08f0080233ff46d7679928c83e41c5ba21ec8
2021-07-17 20:53:36 -07:00
cf6f5efb39 ns for fx: test case for comparing fp32 vs fp32_prepared shadow (#61357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61357

Adds a test case for comparing fp32 vs fp32_prepared in a shadow model.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600350

fbshipit-source-id: ff7518ce8a789ab7469cb22044f1d7c697e2cd04
2021-07-17 20:53:34 -07:00
4acd14da02 ns for fx: preserve observers and fake_quants through passes (#61323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61323

Before this PR, all observers and fake quants were silently removed
when adding loggers with NS. This is problematic for QAT models because
we need the fake quants to run in order to properly capture intermediate
outputs.

This PR fixes the issue by preserving the observers throughout
the passes which add loggers.  In detail:
* for each quantization module or fusion, add additional patterns with that fusion and an observer/fake_quant at the end
* remove the places in the logger model creation code which removed observers
* add unit testing that QAT numerics do not change after adding loggers

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_loggers_preserve_qat_numerics
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_shadow_loggers_preserve_qat_numerics
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29600351

fbshipit-source-id: 5f25118b79eb47860c49bca882de6a8eae7a4456
2021-07-17 20:53:33 -07:00
a70505cdbd ns for fx: support comparing fp32 vs fp32_prepared, except shadowed (#61129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61129

Adds support for comparing an fp32 model (without quantization) to an
fp32 model prepared with quantization. The main missing feature was
handling conv-bn fusion, since this fusion for PTQ happens outside
of the quantization patterns.

Adds testing for this case, for both comparing weights and comparing
activations.

Adds a TODO for also handling this for shadow activations; we need to
first stop removing observers in graph passes before we can add
this support, which will be in a future PR.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcherModels.test_mobilenet_v2
python test/test_quantization.py TestFXGraphMatcherModels.test_mobilenet_v2_qat
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_activations_conv
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D29520009

fbshipit-source-id: f63484a998f1424bd9cacf5d823b82b2edfea1ae
2021-07-17 20:52:23 -07:00
e117d94e21 Wrapped create_type_hint in try/except block so that NormalizeArgs doesn't fail if create_type_hint fails (#61524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61524

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D29746106

Pulled By: Chillee

fbshipit-source-id: d08c0030f40b504e8f7a61fc0ee432f1515a0e6d
2021-07-17 16:13:17 -07:00
59ca89dca8 Fix scribe logs again (#61768)
Summary:
Reverts the revert of 3624d75, with an additional fix in https://github.com/pytorch/pytorch/pull/61764.

Got the correct logs sent to the lambda:

```
...
,"21721":"OK","21722":"OK","21723":"OK","21724":"OK","21725":"OK","21726":"OK","21727":"OK","21728":"OK","21729":"OK","21730":"OK","21731":"OK","21732":"OK","21733":"OK","21734":"OK","21735":"OK","21736":"OK","21737":"OK","21738":"OK","21739":"OK","21740":"OK","21741":"OK","21742":"OK","21743":"OK","21744":"OK","21745":"OK","21746":"OK","21747":"OK","21748":"OK","21749":"OK","21750":"OK","21751":"OK","21752":"OK","21753":"OK","21754":"OK","21755":"OK","21756":"OK","21757":"OK","21758":"OK","21759":"OK","21760":"OK","21761":"OK","21762":"OK","21763":"OK","21764":"OK","21765":"OK","21766":"OK","21767":"OK","21768":"OK","21769":"OK","21770":"OK","21771":"OK","21772":"OK","21773":"OK","21774":"OK","21775":"OK","21776":"OK","21777":"OK","21778":"OK","21779":"OK","21780":"OK","21781":"OK","21782":"OK","21783":"OK","21784":"OK","21785":"OK","21786":"OK","21787":"OK","21788":"OK","21789":"OK","21790":"OK","21791":"OK","21792":"OK","21793":"OK","21794":"OK","21795":"OK","21796":"OK","21797":"OK","21798":"OK","21799":"OK","21800":"OK","21801":"OK","21802":"OK","21803":"OK","21804":"OK","21805":"OK","21806":"OK","21807":"OK","21808":"OK","21809":"OK","21810":"OK","21811":"OK","21812":"OK","21813":"OK","21814":"OK","21815":"OK","21816":"OK","21817":"OK","21818":"OK","21819":"OK","21820":"OK","21821":"OK","21822":"OK","21823":"OK","21824":"OK","21825":"OK","21826":"OK"}}

class StartProcessesTest:
    tests: 14 failed: 0 skipped: 0 errored: 0
    run_time: 4.86 seconds
    avg_time: 0.35 seconds
    median_time: 0.01 seconds
    3 longest tests:
        test_function_large_ret_val time: 1.55 seconds
        test_pcontext_wait time: 1.11 seconds
        test_void_function time: 1.03 seconds

...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61768

Reviewed By: janeyx99

Differential Revision: D29735781

Pulled By: zhouzhuojie

fbshipit-source-id: 6882e334f5108d20773ad66d5300cd37eb509ded
2021-07-16 17:56:16 -07:00
311f1f275a Update clang-tidy-linux64 (#61797)
Summary:
Update the clang-tidy linux hash to match the one built for 7ae60a49ac by https://github.com/pytorch/test-infra/runs/3090057893

Fixes `The downloaded binary is not what was expected!`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61797

Reviewed By: zhouzhuojie

Differential Revision: D29746840

Pulled By: malfet

fbshipit-source-id: a7388952b04ba12f250003c32629d57b8d5ffed8
2021-07-16 17:23:21 -07:00
4337650c91 Fixing a bug in .to for qtensors so scale/zp move too (#61576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61576

This also fixes an issue in the
empty_quantized_per_channel_affine function where specifying a device
different from the device of scale/zp would result in a
mismatched qtensor.

Test Plan:
python test/test_quantization.py
testquantizedtensor.test_per_channel_to_device

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29675461

fbshipit-source-id: 0e2ff20f0f581dae94ee01d3ceead2a620cd26b9
2021-07-16 17:16:24 -07:00
cb6841b263 Fix ConnectionError in download_mnist (#61789)
Summary:
Fixes issues like the following error (a sketch follows the log below). Note that `ConnectionResetError` is a subclass of `ConnectionError`.

```
+ python tools/download_mnist.py --quiet -d test/cpp/api/mnist
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
Traceback (most recent call last):
  File "tools/download_mnist.py", line 93, in <module>
    main()
  File "tools/download_mnist.py", line 86, in main
    download(path, resource, options.quiet)
  File "tools/download_mnist.py", line 42, in download
    urlretrieve(url, destination_path, reporthook=hook)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 277, in urlretrieve
    block = fp.read(bs)
  File "/opt/conda/lib/python3.6/http/client.py", line 463, in read
    n = self.readinto(b)
  File "/opt/conda/lib/python3.6/http/client.py", line 507, in readinto
    n = self.fp.readinto(b)
  File "/opt/conda/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
```
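
A hypothetical retry wrapper illustrating the relationship (names are my assumptions, not the PR's code); catching `ConnectionError` also covers `ConnectionResetError`, since the latter is a subclass:

```python
from urllib.request import urlretrieve

def download_with_retry(url, destination, attempts=3):
    # Retry transient connection failures; ConnectionError catches
    # ConnectionResetError as well.
    for attempt in range(attempts):
        try:
            return urlretrieve(url, destination)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```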

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61789

Reviewed By: dreiss

Differential Revision: D29745459

Pulled By: zhouzhuojie

fbshipit-source-id: 2deb668bd74478f32bd01704d4362e8a4d95087b
2021-07-16 17:02:13 -07:00
4e2fe9718d flatten operation (resnet50) (#61265)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61265

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29626383

Pulled By: migeed-z

fbshipit-source-id: 107769fc14f1fad295a93a10e84235f25ae17357
2021-07-16 16:06:10 -07:00
4479aa8838 Remove all the code that constructs metadata.pkl file (#61760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61760

Removes all code related to metadata.pkl creation, including creating metadata.pkl and converting data from extra/mobile_info.json and extra/producer_info.json to the metadata.pkl file.

Test Plan:
## Run buck commands:
  - `cd` into `fbcode` then `buck build //caffe2/caffe2/fb/init:init`
  - `cd` into `fbcode` then `buck build //caffe2/torch/fb/init:init`
  - `buck build //xplat/caffe2:torch_mobile_core`

## Export a PyTorch lite/mobile model
- Run: `flow-cli canary users.xcheng16.pytorch_trainer.TestWorkflow --run-as-secure-group ai_mobile_platform --buck-target //fblearner/flow/projects/users/xcheng16:workflow` under `fbcode` on devserver.
- Resulting model: metadata.pkl no longer exists
{F632063134}

Reviewed By: guangy10

Differential Revision: D29702943

fbshipit-source-id: ec7964f4aa3a8e09ccc20b1a7e2232f85931dd81
2021-07-16 15:39:07 -07:00
7ac8054d5a Use better defaults in the clang-tidy wrapper script (#61651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61651

This PR sets some quality-of-life defaults for the clang-tidy wrapper script and refactors how defaults are set.

- Runs in parallel
- Custom executable (prints an error message to users asking them to install our custom build)
- `generate_build_files` can now be run as a script

Test Plan: Imported from OSS

Reviewed By: malfet, zhouzhuojie

Differential Revision: D29743661

Pulled By: 1ntEgr8

fbshipit-source-id: 256617d006a03e4ab96091593f5bb80c9b31a2d1
2021-07-16 14:58:19 -07:00
dc0d1612e1 ENH Updates docs and tests for activation modules for no-batch dims (#61300)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

This PR updates docs and tests for activation modules that already support no-batch dims.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61300

Reviewed By: heitorschueroff

Differential Revision: D29660543

Pulled By: jbschlosser

fbshipit-source-id: 5edad45f7e9995aca6c3403469668e6e1cbb94b6
2021-07-16 14:42:18 -07:00
6a085648d8 add aten symbols for amin and amax (#61550)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61550

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D29668123

Pulled By: bdhirsh

fbshipit-source-id: b111e1c6c6d2beddb220cad70d95954756a3ee9d
2021-07-16 14:06:00 -07:00
4e94e84f65 Type annotate torch.nn.Module ctor (#61334)
Summary:
Annotate generic types
Fix some type violations
Override `_modules` and `_parameters` in `Sequential`, `ModuleList`, `ModuleDict`, etc

Fixes https://github.com/pytorch/pytorch/issues/45497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61334

Reviewed By: albanD

Differential Revision: D29579533

Pulled By: malfet

fbshipit-source-id: 5cd8ca918b260ca35cfdd873dee8851d39d17de2
2021-07-16 13:59:06 -07:00
ee2f2ec9a5 Revert D29687143: [3/N] Nnapi Backend Delegate Preprocess: Basic OSS Test
Test Plan: revert-hammer

Differential Revision:
D29687143 (5798a00aa4)

Original commit changeset: 9ba9e57f7f85

fbshipit-source-id: 6a672c76a04366b35c492698ae5b39fd4dd1785f
2021-07-16 13:32:51 -07:00
a07d3dc34c Pin macos mkl conda version to fix the cmake build (#61773)
Summary:
Fixes the macOS build error in master; mkl recently had an upgrade.

CircleCI error:
https://app.circleci.com/pipelines/github/pytorch/pytorch/351645/workflows/d22421c1-bb8f-48fd-9efd-7c0d77f0b083/jobs/14815607

```
Jul 16 11:43:05 CMake Error at /Users/distiller/workspace/miniconda3/lib/cmake/mkl/MKLConfig.cmake:456 (list):
Jul 16 11:43:05   list does not recognize sub-command PREPEND
Jul 16 11:43:05 Call Stack (most recent call first):
Jul 16 11:43:05   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/share/cmake/Caffe2/public/mkl.cmake:1 (find_package)
Jul 16 11:43:05   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:109 (include)
Jul 16 11:43:05   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
Jul 16 11:43:05   CMakeLists.txt:5 (find_package)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61773

Reviewed By: soulitzer

Differential Revision: D29736742

Pulled By: zhouzhuojie

fbshipit-source-id: 68c5244196f7f7562a6c202157c4ccdcfcb64337
2021-07-16 13:15:04 -07:00
8ad584823f add shortcircuit in isclose for zero tolerances (#61529)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61412.

Large integers gave false positives because the comparison always takes place in floating-point dtypes. This happens because their integer precision is lower than the range of an integer dtype with the same number of bits.

For non-extremal values, `isclose` is defined by the following equation:

```python
abs(a - b) <= atol + rtol * abs(b)
```

For `rtol == 0 and atol == 0`, this is equivalent to `a == b`. This PR goes for the low-hanging fruit and adds a shortcut for this case that falls back to an actual equality check.
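
A sketch of the false positive being fixed, using int64 values around 2**53, where float64 can no longer represent every integer exactly:

```python
import torch

a = torch.tensor(2**53 + 1)
b = torch.tensor(2**53)
# Both values round to the same float64, so the old floating-point
# comparison reported True; the shortcut falls back to an exact check.
print(torch.isclose(a, b, rtol=0, atol=0))  # tensor(False) with this PR
```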

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61529

Reviewed By: gchanan

Differential Revision: D29707534

Pulled By: mruberry

fbshipit-source-id: 71b8c4901e9cd4f366442437e52032b0d3002b4a
2021-07-16 12:48:16 -07:00
612632556d Fix torch.median crash on empty tensor (#61698)
Summary:
`torch.tensor([]).median()` returns `nan`, which mimics the behavior of `np.median`.
Adds a test to `TestReductions.test_median_corner_cases`.
Fixes https://github.com/pytorch/pytorch/issues/61656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61698

Reviewed By: heitorschueroff

Differential Revision: D29706912

Pulled By: malfet

fbshipit-source-id: ea5f58327fbff371f3fb8786b269430c7a10d05f
2021-07-16 12:36:18 -07:00
3fd9dcf934 Move non-libtorch scheduled linux CI to GHA (#61732)
Summary:
Move non-libtorch Linux 11.3 scheduled CI job to GHA.
Libtorch builds will be migrated here: https://github.com/pytorch/pytorch/pull/61774

Successful run: https://github.com/pytorch/pytorch/actions/runs/1035592487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61732

Reviewed By: seemethere

Differential Revision: D29735637

Pulled By: janeyx99

fbshipit-source-id: dce13370b218ae7833483fdaa00137db95e27c98
2021-07-16 12:16:58 -07:00
287603f51c Revert D29698486: [pytorch][PR] Remove torch._bmm and remove torch.bmm deterministic arg documentation
Test Plan: revert-hammer

Differential Revision:
D29698486 (328606699f)

Original commit changeset: 5af2d3803ab1

fbshipit-source-id: ce954c13196b1fb8277d61a686ac351d3bf13903
2021-07-16 11:02:09 -07:00
5798a00aa4 [3/N] Nnapi Backend Delegate Preprocess: Basic OSS Test (#61594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61594

### Summary:
Added a unit test for the Nnapi delegate's preprocess() function. The
function was previously tested locally, but now a basic test is
added for OSS.

See https://github.com/pytorch/pytorch/pull/61499 for preprocess
implementation. See D29647123 for local testing.

**TODO:**
Add more comprehensive tests.
Add tests for model execution, after the Nnapi delegate's initialization
and execution is implemented T91991928.

**CMakeLists.txt:**
Added a library for the Nnapi delegate
- Explicit linking of torch_python is necessary for the Nnapi delegate's use of pybind

**test_backends.py:**
Added a test for lowering to Nnapi
- Based off https://github.com/pytorch/pytorch/blob/master/test/test_nnapi.py
- Only differences are the loading of the nnapi backend library and the need to change dtype from float64 to float32

### Test Plan:
Running `python test/test_jit.py TestBackendsWithCompiler -v` succeeds. Also saved and examined the model file locally.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29687143

fbshipit-source-id: 9ba9e57f7f856e5ac15e13527f6178d613b32802
2021-07-16 11:00:38 -07:00
349f2f767c Modernize to default constructor and nullptr in torch (#61735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61735

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716659

fbshipit-source-id: ec2a0a0b7e55d2e50b1d35f0b651bd40675ae7e8
2021-07-16 10:51:13 -07:00
736bb26746 use rand over empty in flaky test (#61710)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/61694#issuecomment-880641635. cc krshrimali.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61710

Reviewed By: anjali411

Differential Revision: D29719660

Pulled By: mruberry

fbshipit-source-id: 589574a039ad431acc7d095d452f0b3e52260208
2021-07-16 10:50:05 -07:00
efeacc0779 [Static Runtime] Fixed visibility of ProcessedNode class and a newly added function (#61729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61729

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29719644

Pulled By: navahgar

fbshipit-source-id: 27a77b2a281d1a8a48e2a9df1c254f62c0e2e7ef
2021-07-16 10:42:02 -07:00
6fa80f7f9f Refactor embedded_interpreter registration to be friendly to python case (#59991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59991

adds a registration mechanism whereby, on loading the embedded interpreter library, a registration function is called that links up the symbols it provides with torch::deploy.

Test Plan: local and CI deploy tests pass

Reviewed By: suo

Differential Revision: D28764436

fbshipit-source-id: 88416bd098be306f887cc9fd2d65d29199439bc4
2021-07-16 10:33:58 -07:00
6349bde572 [4/N] Nnapi backend delegation preprocess: List Tensors & Comment Updates (#61752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61752

Updated Android NNAPI preprocess, so that it can accept both single Tensor inputs and Tensor List inputs.
- The inputs are not real data, but input parameters for shape, dtype, quantization, and dimorder that are bundled as a Tensor. Comments were updated to make this clearer.
- In the future, preprocess will also accept a dedicated NnapiArg object.

Compile_spec should have the following format:
```
{"forward": {"inputs": at::Tensor}} OR {"forward": {"inputs": c10::List<at::Tensor>}}
```
Example input Tensor:
```
torch.tensor([[1.0, -1.0, 2.0, -2.0]]).unsqueeze(-1).unsqueeze(-1)
```

### Testing
OSS testing is blocked by https://github.com/pytorch/pytorch/pull/61594. Testing was done locally in D29726948
TODO: Add OSS tests for single Tensor and Tensor List inputs.
ghstack-source-id: 133683735

Test Plan:
OSS testing is blocked by https://github.com/pytorch/pytorch/pull/61594. Testing was done locally in D29726948.
TODO: Add OSS tests for single Tensor and Tensor List inputs.

Reviewed By: iseeyuan

Differential Revision: D29726432

fbshipit-source-id: 08de70578f37681bda365f9776a1c96030257e7a
2021-07-16 10:17:56 -07:00
328606699f Remove torch._bmm and remove torch.bmm deterministic arg documentation (#61629)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61629

Reviewed By: zou3519

Differential Revision: D29698486

Pulled By: albanD

fbshipit-source-id: 5af2d3803ab1eb093616bcfc7e074d8b57ef6958
2021-07-16 09:18:34 -07:00
28150fd0c8 [static_runtime] Implement aten::linear (#61595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61595

Add out variant wrapper for `aten::linear` in the static runtime

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29684236

fbshipit-source-id: 94df6d7267b3f269b2cadf065f207648777147df
2021-07-16 08:55:43 -07:00
3624d75864 Revert D29703523: [pytorch][PR] Fix scribe logs
Test Plan: revert-hammer

Differential Revision:
D29703523 (eb5a56fb74)

Original commit changeset: 829ad3630d35

fbshipit-source-id: 2b2196d58791b995a008b6d810b3248ed27e7d94
2021-07-16 08:50:13 -07:00
b963607d50 [nnc] Insert alloc/free at global scope (#61725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725

Alloc/free inside a loop isn't really an optimization, and furthermore
it breaks some attempted optimization in the llvm backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak tons of stack.  I hit this while building an
rfactor buffer inside a very deeply nested loop.
ghstack-source-id: 133627310

Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.

Reviewed By: navahgar

Differential Revision: D29533364

fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
2021-07-16 08:42:24 -07:00
4c3d9cfe03 [BE] Fix flaky test_ddp_model_diff_across_ranks test (#61546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61546

Closes https://github.com/pytorch/pytorch/issues/60661

Fixes this flaky test by using blocking wait instead of async error handling, and performs a gloo-based barrier with higher timeout at the end of test which avoids issues with Barrier.sync. This also allows us to remove this test from the `skip_return_code_checks` list.
ghstack-source-id: 133657107

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29663884

fbshipit-source-id: 9f0df085b1968f6a7e2c7ae2f06b6dcd4838a87e
2021-07-16 08:37:02 -07:00
f1114364ad [DDP] Enhance comm hook docs (#61677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61677

1) Specify return type more clearly, 2) Misc fixes
ghstack-source-id: 133657895

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29701384

fbshipit-source-id: 7f77b99065bd2977153f397745e07b75bbdd7a94
2021-07-16 08:35:49 -07:00
39ce29efe0 Refactor metadata_map with flattened key/value pair (#61731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61731

In the previous diff, metadata_map contains mobile_info.json and producer_info.json, and we need to parse the JSON each time we log the required information. This diff flattens the content of those files into key/value pairs, which allows the logger to loop directly through metadata_map and log the information.
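
A hypothetical sketch of the flattening (the helper name and signature are my assumptions, not the diff's code):

```python
import json

def flatten_metadata(extra_files):
    # Parse each extra file once and merge its top-level fields into a
    # flat key/value map the logger can iterate directly.
    metadata_map = {}
    for fname in ("mobile_info.json", "producer_info.json"):
        for key, value in json.loads(extra_files[fname]).items():
            metadata_map[key] = str(value)
    return metadata_map
```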

Test Plan:
Since 3D Photo is disabled for current FB app, testings are only performed on CC scanner.

# Test On CC Scanner
**Test content with LOG(WARNING)**
{P429123273}

**Scuba Logger Output**

1. MOBILE_MODULE_LOAD_STATS

{F631884673}

2.  MOBILE_MODULE_STATS

{F631884787}

Reviewed By: xcheng16

Differential Revision: D29690702

fbshipit-source-id: 1db5a1f5c25e98e5b2f1cc254fd880dfdfa025e2
2021-07-16 00:37:17 -07:00
00a7f55b6e Apply for MOBILE_MODULE_STATS Logging (#61600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61600

This diff changes the module.h constructor and removes metadata_. It refactors all constructor call sites and creates a getter & setter for metadata_. MOBILE_MODULE_STATS reads the metadata from mobile::Module and passes it into the logger.

Test Plan:
Since 3D Photo is disabled for current FB app, testings are only performed on CC scanner.

# Test On CC Scanner
**Test content with LOG(WARNING)**
{P428930572}

**Scuba Logger Output**

{F631761194}

Reviewed By: xcheng16

Differential Revision: D29673184

fbshipit-source-id: 962e0d7b06a07caaa0c695a4ac58b885fd1505ea
2021-07-16 00:37:15 -07:00
fc710eecc0 Apply for MOBILE_MODULE_LOAD_STATS Logging (#61480)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61480

Append mobile_info.json and producer_info.json into extra_files and parse the jsons from “model_info.json” in onExitLoadModel.
ghstack-source-id: 133327912

Test Plan:
# Test On CC Scanner
**Test content with LOG(WARNING)**
{P428339274}

**Scuba Logger Output**
{F631024095}

# Test On 3D Photo
**Test content with LOG(WARNING)**
{P428340927}

**Scuba Logger Output**

{F631026739}

Reviewed By: xcheng16, guangy10

Differential Revision: D29608014

fbshipit-source-id: abc39c44b947632fd4349de8a432649e84284a87
2021-07-16 00:36:09 -07:00
56d562e790 [DDP] fix test_ddp_inference (#61666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61666

Closes https://github.com/pytorch/pytorch/issues/61481. Fixes this
test by removing the section that uses only torch.no_grad() without calling
model.eval(). For SyncBN, model.eval() must be called; otherwise SyncBN
assumes it is in training mode, which issues collective calls in the forward
pass and does not work for inference.
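
A minimal sketch of the correct inference usage, assuming a CUDA device is available (in eval mode SyncBN uses its running stats and issues no collectives):

```python
import torch

model = torch.nn.SyncBatchNorm(8).cuda()
model.eval()                    # required: no_grad() alone is not enough
with torch.no_grad():
    out = model(torch.randn(4, 8, device="cuda"))
```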
ghstack-source-id: 133657549

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29699444

fbshipit-source-id: 03ccb296dd9cb56729cd23e91c7f50b72fcf3adf
2021-07-16 00:25:02 -07:00
7e1f01d4c0 Alias for polygamma (#59691)
Summary:
See https://github.com/pytorch/pytorch/issues/50345

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59691

Reviewed By: gchanan

Differential Revision: D29707514

Pulled By: mruberry

fbshipit-source-id: 40c15e1fda3d9f7013977b0f36a77b228dda6aa5
2021-07-16 00:06:27 -07:00
f008e8d32d Remove test_out, test_variant_consistency_eager skips for addmv; fixed before (#61579)
Summary:
This PR:

1. Removes `test_out` skip: it's not needed anymore after it was fixed in https://github.com/pytorch/pytorch/pull/55746. This should also close https://github.com/pytorch/pytorch/issues/55589.
2. Removes `test_variant_consistency_eager` skip, it was added by mistake in https://github.com/pytorch/pytorch/issues/55771.
3. Refines `sample_inputs_addmv` function, the updated function should now be cleaner and easy to read.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61579

Reviewed By: gchanan

Differential Revision: D29709674

Pulled By: mruberry

fbshipit-source-id: 9b975c024777efdd33c6b9444b0b36e0eab85c03
2021-07-15 22:35:03 -07:00
843c42ffd8 [nnc] Refactored test macros and updated compress buffer tests to use them (#61716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61716

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29715754

Pulled By: navahgar

fbshipit-source-id: c400a58b7f393c0f93e5a25f118403124f8834b0
2021-07-15 21:17:14 -07:00
d01837081d [nnc] Cleaned up compress buffer tests to use BufHandle instead of Buf (#61715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61715

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29715755

Pulled By: navahgar

fbshipit-source-id: 453adac8f5b13263c39d96b6b4086425a01bae54
2021-07-15 21:15:23 -07:00
eb5a56fb74 Fix scribe logs (#61675)
Summary:
Related to https://github.com/pytorch/pytorch/issues/61632

This PR adds
- refactoring of scribe related code to scribe.py
- changed the `render_test_results` job to always use the `linux.2xlarge` runner
- if SCRIBE_GRAPHQL_ACCESS_TOKEN is empty, try boto3 instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61675

Reviewed By: seemethere

Differential Revision: D29703523

Pulled By: zhouzhuojie

fbshipit-source-id: 829ad3630d3500a498b41aa458ce6539aaeae938
2021-07-15 19:27:58 -07:00
127562a0ed Fix some sign comparisons (#61618)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61618

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29688193

fbshipit-source-id: ea7a6b6be8b25d4a0668e744688f96bbbb144dc7
2021-07-15 18:28:41 -07:00
e6860ba508 Fix some sign comparisons and a loop (#61663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61663

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29696766

fbshipit-source-id: eb5a77bd0cfafeb6209d274f121f10dca20d461a
2021-07-15 18:27:42 -07:00
9d955abcdb Fix test_reductions when no SciPy is installed (#61699)
Summary:
Also, use skipIfNoSciPy decorator instead of implicit `unittest.skipIf`

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/52565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61699

Reviewed By: seemethere

Differential Revision: D29706938

Pulled By: malfet

fbshipit-source-id: 0b63c3ddadfa7f68bed994b71cadf68976d3b396
2021-07-15 15:57:11 -07:00
968a01a94a [special] migrate xlogy (#60641)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60641

Reviewed By: gchanan

Differential Revision: D29709306

Pulled By: mruberry

fbshipit-source-id: e8a5f64009a895a25618637de40b55cf36b8f794
2021-07-15 15:32:09 -07:00
1ce3281a6d Revert D29361872: [pytorch][PR] det_backward: more robust and with complex support
Test Plan: revert-hammer

Differential Revision:
D29361872 (fce85480b9)

Original commit changeset: b1f0fec7e3ac

fbshipit-source-id: feffa74ad65b0b294e0a9b0ee72d245393421f70
2021-07-15 15:26:00 -07:00
3a0801f960 [skip ci] Fix "arugment" typos (#61459)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61455.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61459

Reviewed By: soulitzer

Differential Revision: D29636559

Pulled By: samestep

fbshipit-source-id: 9ad65265c0491d9e81bb303abe3a07c6843bfa4a
2021-07-15 15:20:18 -07:00
e5fcc903d6 torch: Make __version__ better with comparisons (#61556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61556

Prior to 1.10.0, `torch.__version__` was stored as a str, and many users did
comparisons against `torch.__version__` as if it were a str. In order not to
break them, we have TorchVersion, which masquerades as a str while also
having the ability to compare against both packaging.version.Version and
tuples of values, e.g. (1, 2, 1). (A simplified sketch of the idea follows
the examples below.)

Examples:
  Comparing a TorchVersion object to a Version object
```
TorchVersion('1.10.0a') > Version('1.10.0a')
```
  Comparing a TorchVersion object to a Tuple object
```
TorchVersion('1.10.0a') > (1, 2)    # 1.2
TorchVersion('1.10.0a') > (1, 2, 1) # 1.2.1
```

  Comparing a TorchVersion object against a string
```
TorchVersion('1.10.0a') > '1.2'
TorchVersion('1.10.0a') > '1.2.1'
```
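
A simplified sketch of the masquerading idea, assuming the `packaging` library is installed (not the actual implementation): subclass str and delegate comparisons to packaging.version.Version, converting tuples to dotted strings first.

```python
from packaging.version import Version

class TorchVersion(str):
    def _to_version(self, other):
        if isinstance(other, tuple):
            other = ".".join(str(v) for v in other)
        return Version(str(other))

    def __gt__(self, other):
        # Comparison happens on parsed versions, not raw strings.
        return Version(self) > self._to_version(other)

assert TorchVersion("1.10.0a0") > (1, 2)
assert TorchVersion("1.10.0a0") > "1.2.1"
```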

Resolves https://github.com/pytorch/pytorch/issues/61540

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29671234

Pulled By: seemethere

fbshipit-source-id: 6044805918723b4aca60bbec4b5aafc1189eaad7
2021-07-15 15:12:09 -07:00
0ea29a6ccb Analysing time taken by gradgrad checks for Spectral Functions (#60435)
Summary:
**Description:** `SpectralFuncInfo` defines decorator mentioning: "gradgrad is quite slow". This PR re-analyses that statement since things have changed with gradient tests.

**Test times:** https://github.com/pytorch/pytorch/pull/60435#issuecomment-865658177

**Follow-up** of https://github.com/pytorch/pytorch/pull/57802

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60435

Reviewed By: gchanan

Differential Revision: D29707444

Pulled By: mruberry

fbshipit-source-id: 444b4863bac8556c7e8fcc8ff58d81a91bd96a21
2021-07-15 14:02:03 -07:00
4ff121f58d Add complex64 dtype for OpInfo Reference testing (#61627)
Summary:
This PR adds `complex64` dtype testing, following conversation from: pytorch/xla#3019 ([comment](https://github.com/pytorch/xla/pull/3019#discussion_r666754943)). Original PR that added OpInfo reference testing: https://github.com/pytorch/pytorch/pull/59369.

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61627

Reviewed By: gchanan

Differential Revision: D29710560

Pulled By: mruberry

fbshipit-source-id: 55b2e5ff47f031069335a0c75a45d4f4885ef9ac
2021-07-15 13:40:37 -07:00
e2c3049e2a Delete stable-sort-only-works-on-cpu warning (#61685)
Summary:
stable GPU sorting is implemented by https://github.com/pytorch/pytorch/pull/56821
Fixes https://github.com/pytorch/pytorch/issues/61682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61685

Reviewed By: gchanan

Differential Revision: D29704864

Pulled By: malfet

fbshipit-source-id: 3a5aa24bf6507be63844fe6016fb9e3c682f4d84
2021-07-15 13:34:41 -07:00
e098e9000b Compare DDP static graph (C++ core) with legacy DDP forward and backward delay. (#61507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61507

Benchmark Python-only DDP vs production C++ based DistributedDataParallel.
- Implemented a pure Python DDP, PythonDDP, with support for SYNC and ASYNC reduction
- Added compare_ddp to measure the difference in the forward and backward steps

Kudos to Shen and Yi for the great idea.

Test Plan:
Test on DevGPUS with 2 CUDA devices.

$python compare_ddp.py

Python-only DDP has slightly better (-1%) forward performance and slightly slower (2%-20%) backward performance.
This suggests that we need to keep the C++ core, since the maximum latency increase can be 20%. See README.md for details.
Imported from OSS

Differential Revision:
D29685364
D29685364

Reviewed By: mrshenli

Pulled By: bowangbj

fbshipit-source-id: 429e4473fac0ec4c70d6db12d946d2636dd6477a
2021-07-15 12:52:22 -07:00
7a3b05ea6d Fix hardswish inplace op for strided tensor with skipped elements (#61622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61622

The hardswish inplace op would return incorrect results for strided tensor inputs that skip elements, such as a slice. The fix creates a contiguous tensor, runs the op, and copies the elements back to return the correct answer.
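
A sketch of the workaround as a hypothetical Python-level wrapper (the actual fix is in the ATen kernel):

```python
import torch
import torch.nn.functional as F

def hardswish_inplace(x):
    if x.is_contiguous():
        return F.hardswish(x, inplace=True)
    # For inputs that skip elements, compute on a contiguous copy and
    # copy the result back into the original (strided) tensor.
    x.copy_(F.hardswish(x.contiguous()))
    return x

strided = torch.randn(4, 4)[:, ::2]   # a slice that skips elements
hardswish_inplace(strided)
```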

Test Plan: Internal CI tests

Reviewed By: kimishpatel

Differential Revision: D29689745

fbshipit-source-id: 11618a8d865f550f6b70637345f9ebc3e5676f11
2021-07-15 11:50:27 -07:00
fce85480b9 det_backward: more robust and with complex support (#58195)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58195

Reviewed By: albanD

Differential Revision: D29361872

Pulled By: anjali411

fbshipit-source-id: b1f0fec7e3ac52acd1481bcc878cc0c1d07c1852
2021-07-15 11:04:42 -07:00
bd360ebe6f [nnc] Added a new API to distribute loop and all its parents (#61293)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61293

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29560008

Pulled By: navahgar

fbshipit-source-id: e4e459184f20b1872bc242ba8626d0a6df29e810
2021-07-15 10:28:20 -07:00
76f097466e [nnc] Added a new API to compress all buffers in a given statement (#61087)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61087

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29506677

Pulled By: navahgar

fbshipit-source-id: 63583fd5a0e42c0096ddf08d5b96bc680ea8a44e
2021-07-15 10:28:18 -07:00
2908d3eb45 [nnc] Modified the semantics of reorder in using permutation (#61085)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61085

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29506679

Pulled By: navahgar

fbshipit-source-id: f674aedff8175b9947404fd2164a0b4f57a71e93
2021-07-15 10:28:16 -07:00
7177509380 Revert [DDP] Support not all outputs used in loss calculation (#61497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497

Reverts [DDP] Support not all outputs used in loss calculation
ghstack-source-id: 133589153

Test Plan: CI, ping authors to run their workflow on this diff

Reviewed By: zhaojuanmao

Differential Revision: D29642892

fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
2021-07-15 10:28:14 -07:00
25f9c35dd7 Revert [DDP] Support for multiple backwards (#61401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401

Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.

Also moves `prepare_for_backward` back into forward pass instead of DDP Sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in DDPSink autograd function is not the best solution since other autograd threads may have been executing which can cause races.

ghstack-source-id: 133589152

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29608948

fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
2021-07-15 10:28:13 -07:00
38ac9e69aa Back out "[DDP] Disable reducer hooks from running outside of DDP backwards." (#61399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61399

Reverts https://github.com/pytorch/pytorch/pull/60921
Original commit changeset: fef76a0dd295
ghstack-source-id: 133581300

Test Plan: CI

Differential Revision: D29594262

fbshipit-source-id: a308d3f10dbbb2169d9a7f60f2f28f139185ed1f
2021-07-15 10:27:02 -07:00
a50a389ca6 Revert D29701479: [pytorch][PR] Remove _broadcast_object() from ZeroRedundancyOptimizer
Test Plan: revert-hammer

Differential Revision:
D29701479 (9b5d9b4049)

Original commit changeset: c8d5f9057b32

fbshipit-source-id: 35ab1f399513fb9d1c4e73b1fa906e559d2a6994
2021-07-15 10:03:08 -07:00
aa01a7a61c Fix for get_buffer(): check buffers by name instead of value (#61429)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61242

The previous code wrongly checked whether a tensor is a buffer in a module by comparing values; the fix compares names instead.
The docs need some updating as well; the current plan is to bump that to a separate PR, but I'm happy to do it here if preferred.
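
A simplified sketch of the name-based check (not the actual implementation):

```python
import torch

def get_buffer(module: torch.nn.Module, target: str) -> torch.Tensor:
    mod_path, _, buffer_name = target.rpartition(".")
    mod = module.get_submodule(mod_path) if mod_path else module
    # Check membership by name in _buffers rather than comparing tensor
    # values, which could falsely match equal-valued non-buffer tensors.
    if buffer_name not in mod._buffers:
        raise AttributeError(f"`{buffer_name}` is not a buffer")
    return getattr(mod, buffer_name)
```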

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61429

Reviewed By: gchanan

Differential Revision: D29712341

Pulled By: jbschlosser

fbshipit-source-id: 41f29ab746505e60f13de42a9053a6770a3aac22
2021-07-15 09:55:09 -07:00
5407108533 CopyBackward: Remove redundant src_device and unnecessary copy=True (#60025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60025

`to` already copies unconditionally if `src.device() != options.device()` so
specifying the copy argument is unnecessary.

`src.device()` is also completely equivalent to `src.options().device()` so
storing both is redundant.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29698627

Pulled By: albanD

fbshipit-source-id: eb091d39b71db688e6bcbb33a227c01b94b432bb
2021-07-15 09:48:03 -07:00
da667e2d5f Add .github for CODEOWNERS (#61598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61598

I'd like to be notified on changes to the github actions workflows, add
this so I can be notified.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99, samestep

Differential Revision: D29685783

Pulled By: seemethere

fbshipit-source-id: 865a1360a24633ef5074e43b8277838a0eef94f6
2021-07-15 09:39:12 -07:00
8afb65b6c5 changed launch bounds for upsample_linear1d fwd, bwd from 1024 to 512 (#61307)
Summary:
Changed launch bounds for upsample_linear1d_out_frame and upsample_linear1d_backward_out_frame from 1024 to 512, which shows a performance improvement (see below). This does not completely eliminate lmem usage (it goes from 40-48 bytes to 8-16 bytes); not sure why.

Timing data (using Nvidia Titan-V GPU):
![UpsampleLinear1dTimingData](https://user-images.githubusercontent.com/22803332/124677708-e20d6280-de75-11eb-8187-fb50ec89dc50.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61307

Reviewed By: heitorschueroff

Differential Revision: D29662137

Pulled By: ngimel

fbshipit-source-id: 9653672ee17f25b75a02f295f388a78327091431
2021-07-15 09:19:16 -07:00
ee5a97de11 Register Saved Tensors hooks (#60663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60663

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29466223

fbshipit-source-id: 65dc3a935c18a0e6b93a37e24543c696e6ae0321
2021-07-15 08:09:55 -07:00
94965212e5 [static runtime] Use at::allclose to test NNC sigmoid (#61566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61566

This change uses `at::allclose` to compare results from the sigmoid functions (CPU/NNC) instead of `Tensor::equals`, due to small numerical differences between them.

Test Plan:
I confirmed that the flakiness of `StaticRuntime.Sigmoid` is gone with this change:

```
[djang@devvm1999.ftw0 ~/fbsource/fbcode] buck-out/gen/caffe2/benchmarks/static_runtime/static_runtime_cpptest -v 3 --gtest_filter=StaticRuntime.Sigmoid --gtest_repeat=100 &> output.txt
[djang@devvm1999.ftw0 ~/fbsource/fbcode] grep PASSED output.txt  | wc
    100     500    2100
```

Reviewed By: bertmaher

Differential Revision: D29671203

fbshipit-source-id: 99a7b16d18ea047c9aad444f36d8368f9d0b088d
2021-07-14 19:48:00 -07:00
9b5d9b4049 Remove _broadcast_object() from ZeroRedundancyOptimizer (#61539)
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.

**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()`, respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.

The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.

In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.

Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`
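
A minimal sketch of the (de)serialization change, with simplified helper signatures (assumptions, not the PR's exact code):

```python
import io
import torch

def _object_to_tensor(obj):
    buffer = io.BytesIO()
    torch.save(obj, buffer)                 # serializes contained tensors
    data = list(buffer.getvalue())
    return torch.tensor(data, dtype=torch.uint8), len(data)

def _tensor_to_object(tensor, size, map_location=None):
    buffer = io.BytesIO(bytes(tensor[:size].tolist()))
    # map_location controls the device onto which contained tensors are
    # loaded; None preserves their original devices.
    return torch.load(buffer, map_location=map_location)
```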

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539

Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks, with `map_location` as both CPU and GPU, that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) the contents of the tensors are correct.

The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```

Reviewed By: zou3519

Differential Revision: D29701479

Pulled By: andwgu

fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
2021-07-14 17:36:30 -07:00
e3d5619ff0 [pytorch][profiler] Fix division by 0 in computeFlops (#61676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61676

Reviewed By: ilia-cher

Differential Revision: D29646067

fbshipit-source-id: d872221bbde5384a9e397e68c1e5b0664d913b42
2021-07-14 16:38:19 -07:00
70e94bb1dd Avoid redefining __BYTE_ORDER (#60346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60346

Introduction:
In order to support the Intel SGX platform, we have to avoid redefining __BYTE_ORDER.
Solution:
Check if the platform is SGX and avoid the redefinition.

Test Plan: Run the PyTorch tests.

Reviewed By: h397wang, malfet

Differential Revision: D29022626

fbshipit-source-id: 801c3a75c202d192a3808eb5d54b875094499996
2021-07-14 14:55:04 -07:00
a9c3580080 Grammatical update of tech docs (#61547)
Summary:
Added some minor grammatical updates to the 'Complex Numbers' docs.

![Screenshot (180)](https://user-images.githubusercontent.com/75036632/125342884-0b952500-e373-11eb-9e63-410ff31e6c21.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61547

Reviewed By: zou3519

Differential Revision: D29677361

Pulled By: H-Huang

fbshipit-source-id: 78222310a755911192905a8f52aa0ae325900006
2021-07-14 14:01:59 -07:00
5a5c7f563d add trainer hook functions (#60785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60785

This PR adds hook functions for the trainers.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29697299

Pulled By: gcramer23

fbshipit-source-id: cc3b991aad0d32503fbfc5acd4fca8b404e74c0f
2021-07-14 13:19:17 -07:00
304c02ee44 refactor ps benchmark (#60784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60784

This PR refactors the ps benchmark for modular trainers.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29697291

Pulled By: gcramer23

fbshipit-source-id: 64579a1f5326d3cd9f32936dcf53bc243d54b71d
2021-07-14 13:19:13 -07:00
7d2ea9a8f7 Release GIL as much as possible in dist_autograd pybind. (#61593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61593

Following the pattern in https://github.com/pytorch/pytorch/pull/61588
to avoid deadlocks as much as possible.
ghstack-source-id: 133497897

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29683451

fbshipit-source-id: 1951622eb964f57a551a9c0d46ad0ab24b66c458
2021-07-14 13:19:10 -07:00
5ebc7c9f97 Avoid holding GIL while calling retrieveContext. (#61588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61588

As part of debugging https://github.com/pytorch/pytorch/issues/60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as
part of the TensorImpl destructor, concrete_decref_fn is called, which waits
for the GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call `retrieveContext`
and re-acquire it later when needed.
ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
2021-07-14 13:17:16 -07:00
f2adbff36e [Metal] Do not use read/write textures in concat shaders (#61074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61074

`read_write` textures are not available on some devices, such as iPhone 7. This prevents the concat op from functioning on those devices.

This diff rewrites the concat shaders such that they do not depend on `read_write` textures.

Test Plan:
Test on device: run squeezenet and/or the operator tests
```
arc focus2 pp-ios
```

Test on Mac
```
buck test pp-macos
```

Test specifically on iPhone7, either device or simulator.

Reviewed By: xta0

Differential Revision: D29501656

fbshipit-source-id: de4a059953ab4b0abf38b6ecb3f665323dcdbea1
2021-07-14 13:03:48 -07:00
80bdfd64c5 Skip Bfloat16 support when building for VSX (#61630)
Summary:
Copy-paste ifdef guard from vec256/vec256.h
Probably fixes https://github.com/pytorch/pytorch/issues/61575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61630

Reviewed By: janeyx99

Differential Revision: D29690676

Pulled By: malfet

fbshipit-source-id: f6d91eadab74bcbcb1dc9854ae1b98a0dccacd14
2021-07-14 13:02:29 -07:00
43a2f7c26a [TensorExpr] Do not fuse float16 values. (#61569)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61569

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29672564

Pulled By: ZolotukhinM

fbshipit-source-id: fe64ec38209d43f8246bcb6c397b64a28cbd86fa
2021-07-14 12:53:59 -07:00
ab27399566 Make broadcast_object_list accept a device parameter. (#61305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305

Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object with the newly introduced API
	 Also include the changes to _object_to_tensor()/_tensor_to_object() with PR 60573

Context: https://github.com/pytorch/pytorch/issues/60062

Test Plan:
Run the following on DevGpus with two cuda devices

$python setup.py develop    --- run this build on DevGPU
$BACKEND='nccl' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v
$BACKEND='gloo' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v

Build with distributed on: USE_DISTRIBUTE=1 python setup.py develop
Test on CPU devvm:

$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py

Imported from OSS

Differential Revision:
D29566538
D29566538

Reviewed By: iramazanli, mrshenli

Pulled By: bowangbj

fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
2021-07-14 11:43:17 -07:00
9b3cbeaf7d [pruner] fix activation handles logic (#61592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61592

Add activation handles for each layer (stored in a list), so they can each be removed.

We don't remove them in the `convert` in eager mode because we aren't modifying output/input layer dimensions. We will need this in Fx mode though.
ghstack-source-id: 133497376

Test Plan:
Added some tests to make sure `model(x)` runs without error.

`buck test mode/dev-nosan //caffe2/test:ao --
TestBasePruner`

https://pxl.cl/1LBf4

Reviewed By: z-a-f

Differential Revision: D29682789

fbshipit-source-id: 9185702736e5f7f4320754ffef441610738ac154
2021-07-14 11:07:23 -07:00
343cb276b0 [pytorch] Add broadcasting support to add_relu kernel (#61584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61584

add_relu does not work with broadcasting. This registers a scalar version of add_relu in native_functions that casts the scalar to a tensor before calling the regular function. TensorIterator handles broadcasting analogously to the existing add.
ghstack-source-id: 133480068

Test Plan: python3 test/test_nn.py TestAddRelu

Reviewed By: kimishpatel

Differential Revision: D29641768

fbshipit-source-id: 1b0ecfdb7eaf44afed83c9e9e74160493c048cbc
2021-07-14 10:32:20 -07:00
c23db9327a Smart Decay for Adam - Caffe2 (#61548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61548

We want to decay learning parameters properly. Previously this was not done when a parameter was absent from a minibatch. We fix this by keeping track of missed minibatches and making the decay catch up accordingly.

The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. Properly, for absent parameters, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2 respectively.

To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen
* we calculate the amount of momentum that would have been discharged over the missed minibatches and update the weight accordingly.
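
A minimal sketch of the lazy catch-up described above (hypothetical helper name; the step that discharges the accumulated momentum into the weight is omitted):
```
def lazy_adam_moments(pid, grad, step, state, beta1=0.9, beta2=0.999):
    # state maps a parameter id to (m, v, last_seen_step)
    m, v, last_seen = state.get(pid, (0.0, 0.0, step - 1))
    k = step - last_seen                    # minibatches since the last update
    m = beta1 ** k * m + (1 - beta1) * grad           # k-1 zero-grad decays plus
    v = beta2 ** k * v + (1 - beta2) * grad * grad    # this step, in one multiply
    state[pid] = (m, v, step)
    return m, v
```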

Differential Revision: D29654246

fbshipit-source-id: 7a6cd7966eb1f31116d99dfce79a78b2d3ee9e3e
2021-07-14 10:22:38 -07:00
58adaaba60 Enable C2 load rate limiter [2/n] (#61551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61551

We aim to enable the rate limiter in C2 load, with a fixed bandwidth limit.
This diff updates LoadOp to pass down the manifold db options.

Test Plan:
```
buck test mode/opt caffe2/caffe2/python/operator_test:load_save_test
```

Differential Revision: D29639102

fbshipit-source-id: cf69549adadf4c7f12a8a2b7f3ca39092cab4b99
2021-07-14 08:27:05 -07:00
57feb35474 Refactor non-joined process computation (#61555)
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.

**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)

The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.

This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.

Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.

**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).

**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
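
A skeletal sketch of those steps (hook bodies are stand-ins, not the real DDP/ZeRO logic; the import path follows the later public API, whereas this PR's names carry a leading underscore):
```
from torch.distributed.algorithms.join import Join, Joinable, JoinHook

class CJoinHook(JoinHook):
    def main_hook(self):                        # shadows one training iteration
        ...

    def post_hook(self, is_last_joiner: bool):  # runs once all ranks have joined
        ...

class C(Joinable):
    def __init__(self, process_group, device):
        super().__init__()                      # reserves self._join_config
        self._pg, self._device = process_group, device

    def join_hook(self, **kwargs) -> JoinHook:
        return CJoinHook()

    @property
    def join_device(self):
        return self._device

    @property
    def join_process_group(self):
        return self._pg

    def training_step(self, batch):             # hypothetical per-iteration method
        Join.notify_join_context(self)          # the one-line addition
        # ... per-iteration collective communications ...
```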

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555

Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```

Reviewed By: zou3519

Differential Revision: D29690359

Pulled By: andwgu

fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
2021-07-14 08:20:40 -07:00
03a79f43e3 adding support for index_select on quantized tensors (#61406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61406

Only a few select functions really needed fixes so that they could work
for quantized tensors. Primarily, creation and resizing of tensors
required a branch for quantized tensors. This doesn't work for
per_channel tensors.
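
A small usage sketch (per-tensor quantization only, per the note above):
```
import torch

x = torch.randn(4, 3)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
idx = torch.tensor([0, 2])
out = torch.index_select(qx, 0, idx)  # result is still quantized
print(out.is_quantized, out.q_scale(), out.q_zero_point())
```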

Test Plan:
```python test/test_quantization.py TestQuantizedTensor.test_qtensor_index_select_cuda```

```python test/test_quantization.py TestQuantizedTensor.test_qtensor_index_select_cpu```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29654446

fbshipit-source-id: 8fde9b2dd2c3e380cc330bbad71d6c4d2aeec0ab
2021-07-14 05:38:00 -07:00
a07b08136f [Static Runtime] Check unsupported ops when enabling static runtime (#61613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61613

Reviewed By: ajyu, movefast1990

Differential Revision: D29663466

fbshipit-source-id: d819903b7227f534c0a4fffa5eeea2b5c0c04750
2021-07-14 02:13:51 -07:00
ac64a41e8a [FX][docs] Add note about python set pitfall (#61597)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61597

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D29685735

Pulled By: jamesr66a

fbshipit-source-id: b5c5b53ff94fac1022f69b7c0ad4e4055b116029
2021-07-13 20:09:13 -07:00
9ade039593 fix test file not found issue (#61610)
Summary:
it should not error out if the file is not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61610

Reviewed By: samestep

Differential Revision: D29687958

Pulled By: walterddr

fbshipit-source-id: 17cacba8daa131df9bfb37fd58d6e4870ff75198
2021-07-13 17:50:50 -07:00
2ab8126e36 Add NewLib support (#60345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60345

Add support for NewLib, an embedded libc variant, by reusing the existing Android library stubs plus a few NewLib-specific guards.

Problem:
Newlib is a C standard library intended for embedded use, similar to how Android uses bionic. This causes some incompatibility with the math functions that are present in glibc but not in Newlib (and some versions of bionic), and makes porting PyTorch to environments such as SGX hard.

Solution:
Subscribe Newlib to the same fixes present for older versions of Android and add fixes specific to Newlib.

Test Plan: Run the PyTorch tests.

Reviewed By: malfet

Differential Revision: D29022623

fbshipit-source-id: 028dd7ff9b3ee394371c275642c90c9ef108e639
2021-07-13 17:26:45 -07:00
8e6d8991b2 [torch/elastic] Fix the agent store key prefix used by workers (#61590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61590

This PR fixes the bug where the state of the first run of a failed training job leaks into secondary runs due to a constant worker key prefix.
ghstack-source-id: 133494239

Test Plan: Run the existing integ tests.

Reviewed By: SciPioneer

Differential Revision: D29682743

fbshipit-source-id: d96ecadcfe5b6563225ee19f5d0776c7f935393a
2021-07-13 14:57:27 -07:00
523d6fe27c [PyTorch] Remove unnecessary std::string in Device.cpp (#61502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61502

No reason not to use string literals here.
ghstack-source-id: 133449808

Test Plan: buildsizebot

Reviewed By: dhruvbird

Differential Revision: D29648079

fbshipit-source-id: 74ecf12283c2f196b4b3edb75c6bb1eeed51322e
2021-07-13 14:36:13 -07:00
72394aaf68 Bump addressable from 2.7.0 to 2.8.0 in /ios/TestApp (#61573)
Summary:
Bumps [addressable](https://github.com/sporkmonger/addressable) from 2.7.0 to 2.8.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md">addressable's changelog</a>.</em></p>
<blockquote>
<h1>Addressable 2.8.0</h1>
<ul>
<li>fixes ReDoS vulnerability in Addressable::Template#match</li>
<li>no longer replaces <code>+</code> with spaces in queries for non-http(s) schemes</li>
<li>fixed encoding ipv6 literals</li>
<li>the <code>:compacted</code> flag for <code>normalized_query</code> now dedupes parameters</li>
<li>fix broken <code>escape_component</code> alias</li>
<li>dropping support for Ruby 2.0 and 2.1</li>
<li>adding Ruby 3.0 compatibility for development tasks</li>
<li>drop support for <code>rack-mount</code> and remove Addressable::Template#generate</li>
<li>performance improvements</li>
<li>switch CI/CD to GitHub Actions</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="6469a232c0"><code>6469a23</code></a> Updating gemspec again</li>
<li><a href="24336385de"><code>2433638</code></a> Merge branch 'main' of github.com:sporkmonger/addressable into main</li>
<li><a href="e9c76b8897"><code>e9c76b8</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sporkmonger/addressable/issues/378">https://github.com/pytorch/pytorch/issues/378</a> from ashmaroli/flat-map</li>
<li><a href="56c5cf7ece"><code>56c5cf7</code></a> Update the gemspec</li>
<li><a href="c1fed1ca0a"><code>c1fed1c</code></a> Require a non-vulnerable rake</li>
<li><a href="0d8a3127e3"><code>0d8a312</code></a> Adding note about ReDoS vulnerability</li>
<li><a href="89c76130ce"><code>89c7613</code></a> Merge branch 'template-regexp' into main</li>
<li><a href="cf8884f815"><code>cf8884f</code></a> Note about alias fix</li>
<li><a href="bb03f7112e"><code>bb03f71</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sporkmonger/addressable/issues/371">https://github.com/pytorch/pytorch/issues/371</a> from charleystran/add_missing_encode_component_doc_entry</li>
<li><a href="6d1d8094a6"><code>6d1d809</code></a> Adding note about :compacted normalization</li>
<li>Additional commits viewable in <a href="https://github.com/sporkmonger/addressable/compare/addressable-2.7.0...addressable-2.8.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=addressable&package-manager=bundler&previous-version=2.7.0&new-version=2.8.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

 ---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61573

Reviewed By: xta0

Differential Revision: D29685329

Pulled By: seemethere

fbshipit-source-id: a43008155144a358950dc3ed1934fcc470b73c02
2021-07-13 14:30:33 -07:00
0751a41ab1 [quant] Input-Weight Equalization - ConvReLU support (#61350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61350

Applied changes in convert to allow for ConvReLU2d layers

Initial Model: `x -> conv1 -> relu`

After fusion: `x -> convRelu2d`

After prepare: `x -> input_quant_obs -> input_eq_obs1 -> convRelu2d -> output_quant_obs1`

After equalization functions: `x -> mul -> input_quant_obs (scaled) -> convRelu2d -> output_quant_obs`

After convert: `x -> mul -> quantize_per_tensor -> quantized::convRelu2d -> dequantize`

Test Plan:
`python test/test_quantization.py TestEqualizeFx`

Initial Model:
```
ConvReluModel(
  (fc): Conv2d(3, 5, kernel_size=(3, 3), stride=(1, 1))
  (relu): ReLU()
)
```

After prepare:
```
GraphModule(
  (x_activation_post_process_0): MinMaxObserver(min_val=5.960464477539063e-08, max_val=0.9999999403953552)
  (x_activation_post_process_0_equalization_process_0): _InputEqualizationObserver(
    (input_obs): PerChannelMinMaxObserver(min_val=tensor([1.1921e-07, 3.3379e-06, 5.9605e-08]), max_val=tensor([1.0000, 1.0000, 1.0000]))
  )
  (fc): ConvReLU2d(
    (0): Conv2d(3, 5, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
  )
  (fc_activation_post_process_0): MinMaxObserver(min_val=0.0, max_val=1.2341605424880981)
)

graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %fc_activation_post_process_0 : [#users=1] = call_module[target=fc_activation_post_process_0](args = (%fc,), kwargs = {})
    return fc_activation_post_process_0
```

After equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%x_activation_post_process_0,), kwargs = {})
    %fc_activation_post_process_0 : [#users=1] = call_module[target=fc_activation_post_process_0](args = (%fc,), kwargs = {})
    return fc_activation_post_process_0
```

After convert:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %fc_input_scale_0 : [#users=1] = get_attr[target=fc_input_scale_0]
    %fc_input_zero_point_0 : [#users=1] = get_attr[target=fc_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %fc_input_scale_0, %fc_input_zero_point_0, torch.quint8), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%fc,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29638275

fbshipit-source-id: 40d4666a4451e132612ea38fdfeaaec177a1defb
2021-07-13 14:00:40 -07:00
b3e4dab45a [quant] Input-Weight Equalization - Conv convert support (#61287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61287

Modifications to functions during convert() to support equalization. Note that this implementation does not work for connected F.conv2d layers yet.

Initial:
```
      w
      |
x -> conv -> y
```

After prepare:
```
                                         w
                                         |
                                  weight_quant_obs
                                         |
                                    weight_eq_obs
                                         |
x -> input_quant_obs -> input_eq_obs -> conv -> out_quant_obs -> y
```

After convert:
```
                scale, zero_point             w (scaled)
                       |                           |
x -> mul -> quantize_per_tensor (scaled) -> quantized::conv -> dequant -> y
      |
   eq_scale
```
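
The identity behind the rewrite: scaling each input channel of x by s while dividing the matching input channel of the weight by s leaves the convolution output unchanged. A quick numerical check (illustrative only; shapes and values arbitrary, not the codepath above):
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(5, 3, 3, 3)
s = torch.rand(3) + 0.5                   # per-input-channel equalization scale

y_ref = F.conv2d(x, w)
y_eq = F.conv2d(x * s.view(1, -1, 1, 1),  # the inserted mul node
                w / s.view(1, -1, 1, 1))  # the inversely scaled weight
assert torch.allclose(y_ref, y_eq, atol=1e-5)
```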

Test Plan:
`python test/test_quantization.py TestEqualizeFx`

Initial model:
```
ConvModel(
  (conv): Conv2d(3, 5, kernel_size=(3, 3), stride=(1, 1), bias=False)
)
```

After prepare:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %conv : [#users=1] = call_module[target=conv](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %conv_activation_post_process_0 : [#users=1] = call_module[target=conv_activation_post_process_0](args = (%conv,), kwargs = {})
    return conv_activation_post_process_0
```

After equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %conv : [#users=1] = call_module[target=conv](args = (%x_activation_post_process_0,), kwargs = {})
    %conv_activation_post_process_0 : [#users=1] = call_module[target=conv_activation_post_process_0](args = (%conv,), kwargs = {})
    return conv_activation_post_process_0
```

After convert:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %conv_input_scale_0 : [#users=1] = get_attr[target=conv_input_scale_0]
    %conv_input_zero_point_0 : [#users=1] = get_attr[target=conv_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %conv_input_scale_0, %conv_input_zero_point_0, torch.quint8), kwargs = {})
    %conv : [#users=1] = call_module[target=conv](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%conv,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29557055

fbshipit-source-id: dc9f44182e31fa362c43ad2dfe224e6f4e4a730e
2021-07-13 14:00:38 -07:00
77d36b657a [quant] Input-Weight Equalization - Conv prepare support (#61286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61286

Modifies the prepare step to support conv layers during input-weight equalization and adds tests to make sure that the results are as expected.

Initial:
```
      w
      |
x -> conv -> y
```

After prepare:

```
                                         w
                                         |
                                  weight_quant_obs
                                         |
                                    weight_eq_obs
                                         |
x -> input_quant_obs -> input_eq_obs -> conv -> out_quant_obs -> y
```

Test Plan:
`python test/test_quantization.py TestEqualizeFx.test_input_weight_equalization_prepare`

Initial:
```
ConvModel(
  (conv): Conv2d(3, 5, kernel_size=(3, 3), stride=(1, 1), bias=False)
)
```

After prepare:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %conv : [#users=1] = call_module[target=conv](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %conv_activation_post_process_0 : [#users=1] = call_module[target=conv_activation_post_process_0](args = (%conv,), kwargs = {})
    return conv_activation_post_process_0
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29557051

fbshipit-source-id: 25d1531645dfaf565f5c615e2ee850fcf96c7eb9
2021-07-13 14:00:36 -07:00
ce9cedd119 [quant] Input-Weight Equalization - Conv observer support (#61285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61285

Modifies observers to support conv layers and tests to make sure that the observers are returning the expected values for conv inputs.

Test Plan:
`python test/test_quantization.py TestEqualizeFx.test_input_weight_eq_observer`

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29557041

fbshipit-source-id: 5e43329f189ba352eb8b991f38bf37752eebb6e6
2021-07-13 13:59:23 -07:00
30e48bbeae Add neg bit (#56058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56058

User facing changes:
1. Adds a negative bit and corresponding new API (`is_neg()`, `resolve_neg()`)
2. `tensor.conj().imag` now returns a floating point tensor with neg bit set to 1 instead of a tensor with no notion of negative bit. Note that imag is still a view and all the view properties still hold for imag.
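
A quick illustration of the new behavior (values arbitrary):
```
import torch

x = torch.tensor([1 + 2j, 3 - 4j])
v = x.conj().imag           # floating-point view with the neg bit set
print(v.is_neg())           # True: the negation is lazy
m = v.resolve_neg()         # materializes the negation
print(m)                    # tensor([-2., 4.]); m.is_neg() is False
```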

Non user facing changes:
1. Added a new Negative dispatch key and a backend fallback to handle it
2. Updated copy kernel to handle negative bit
3. Merged conjugate and negative bit fallback kernel
4. fixed https://github.com/pytorch/pytorch/issues/60478 (caused due to https://github.com/pytorch/pytorch/pull/54987)

Testing:
1. Added a new OpInfo based test `test_neg_view` (verifies that out-of-place and in-place operations work correctly for all operations when the input is a neg view tensor by checking the result against an actually negated tensor, verifies that autograd returns the same output for both neg view and actually negated tensors as well as it works fine when grad_out is a neg view).
2. Added a new test class containing `test_conj_view`, `test_neg_view`.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29636403

fbshipit-source-id: 12214c9dc4806c51850f4a72a109db9527c0ca63
2021-07-13 13:50:42 -07:00
60382de455 [torch] Set nproc_per_node to 1 (#61552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61552

Set `nproc_per_node` to 1

Test Plan: unittests

Reviewed By: cbalioglu

Differential Revision: D29667056

fbshipit-source-id: 6601f66fec5e018c7737d909f8c71642451abb29
2021-07-13 13:35:25 -07:00
437e7d9fc9 codegen_backend_module() now passes correct type designators to isinstance in the generated script
Summary: For methods returning complex (i.e. container) types, the existing code attempted to pass type designators with unsupported syntax (e.g. `Tensor[]`) into `isinstance`. It will now use the correct syntax supported by TorchScript (i.e. `List[Tensor]`).

Test Plan:
Unfortunately, a backend supporting methods returning container types has not yet been identified, so the functionality cannot be tested end-to-end.

Adding a printout of `method_ct.format(method_te)` before https://fburl.com/code/4619d12g lets us inspect the difference in the generated method body, e.g.:

```
assert isinstance(_0, List[Tensor])
```
vs
```
assert isinstance(_0, Tensor[])
```

Reviewed By: allwu

Differential Revision: D29537358

fbshipit-source-id: 3356f3c1477aa9304e1f070711f480441579414d
2021-07-13 12:18:17 -07:00
b42cc19c88 Fix broken assertion error test in NNAPI convertor (#61586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61586

Error message was changed

Test Plan:
pytest test/test_nnapi.py:

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29682319

fbshipit-source-id: 52a96d79633ee9aae1de2056c7583311edc92353
2021-07-13 11:46:32 -07:00
2ade4d2a92 .github: Ensure clean workspaces before checkout (#61565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61565

I was noticing the checkout step failing a lot for me. This adds a
cleaning step to fully remove the GitHub workspace before attempting to
do the checkout.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D29671074

Pulled By: seemethere

fbshipit-source-id: 43a8f9a9a272c6bdbfffa9c6263443aac37f4b89
2021-07-13 11:13:48 -07:00
d5204064dc [BE] Fix flaky ProcessGroupGloo tests (#61396)
Summary:
A hypothesis as to why tests such as https://github.com/pytorch/pytorch/issues/57469 may be flaky is that `c10d = ProcessGroupGloo(...)` is not actually guaranteed to be a synchronization point, so some ranks may create the PG, run all the error checking (which does not actually call into gloo APIs and so doesn't require synchronization), and then exit, all before other ranks have created the gloo pg.

This can result in the following error:
```
File "distributed/test_c10d_gloo.py", line 1037, in test_reduce_checks
May 03 06:42:34     pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
May 03 06:42:34 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:35521
```

which indicates that the remote end has hung up. Furthermore, all the flaky tests in this file only do error checking and don't call into the gloo APIs, further indicating that this issue may be the root cause. I'm not 100% sure this PR will fix it, because I haven't been able to actually repro the issue even after 10000+ runs, but it happens regularly in CI.

To fix this, we add a `dist.barrier(group=pg)` call after creating the pg to enforce synchronization. It would be good to land this and observe whether it helps with the flakiness.
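
In sketch form, the synchronization pattern being added (constructor arguments as in the test fixture; requires a gloo build):
```
import torch.distributed as dist

def make_gloo_pg(store, rank, world_size, opts):
    pg = dist.ProcessGroupGloo(store, rank, world_size, opts)  # not a sync point
    dist.barrier(group=pg)  # no rank may proceed until all have built the pg
    return pg
```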

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61396

Reviewed By: mrshenli

Differential Revision: D29664189

Pulled By: rohan-varma

fbshipit-source-id: bc046d5d816fe6cb426522b85312383bfa3f90b7
2021-07-13 10:34:59 -07:00
3e5d2b539d Replace deprecated comment with C10_DEPRECATED in linalg.h (#60374)
Summary:
Replace // DEPRECATED comment with C10_DEPRECATED.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60374

Reviewed By: H-Huang

Differential Revision: D29661630

Pulled By: heitorschueroff

fbshipit-source-id: fc086276fd7d3ddfb8d17c67ade456377ef0e990
2021-07-13 08:21:22 -07:00
9679fa7f30 Update cpp_extension.py (#61484)
Summary:
By default, the majority of Python 3.[6789] installations come with `pkg_resources.packaging` version 16.8 (or `setuptools` older than 49.6.0), whose Version class does not have major/minor properties, as one can observe in https://github.com/pypa/setuptools/blob/v49.5.0/pkg_resources/_vendor/packaging/version.py
On the other hand, comparison operators exist, so why not use them to check for version equality?

Fixes https://github.com/pytorch/pytorch/issues/61036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61484

Reviewed By: walterddr, seemethere

Differential Revision: D29643883

Pulled By: malfet

fbshipit-source-id: 3db9168c1b009ac3a278709083ea8c5b417471b8
2021-07-13 07:11:58 -07:00
0afbb9e81e PYTHON_LIBRARY may be set to empty or NOTFOUND. (#61230)
Summary:
Not sure why (maybe from dependencies?), but it can certainly break package lookup upon re-entry into CMake.
So instead of checking whether these variables are defined, we should check whether they hold any meaningful value.

Fixes https://github.com/pytorch/pytorch/issues/59887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61230

Reviewed By: H-Huang

Differential Revision: D29668766

Pulled By: malfet

fbshipit-source-id: 79a59578740c4434327aff4f9a22eba9c4bf48d1
2021-07-13 07:09:31 -07:00
ac6ec0efa1 [ROCM] fix bug in #60313 (#61073)
Summary:
This PR fixes a bug in https://github.com/pytorch/pytorch/issues/60313 where the tensors generated by _generate_valid_rocfft_input were on the CPU instead of the GPU. This was due to using numpy to generate the tensors and converting them with torch.from_numpy, which leaves the generated tensors on the CPU. We now generate the tensors using PyTorch itself, which carries the device type of the input tensors over to the generated tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61073

Reviewed By: H-Huang

Differential Revision: D29668418

Pulled By: malfet

fbshipit-source-id: ce2025c26d079c15603a89b9bf7878f48d73155e
2021-07-13 07:08:17 -07:00
2e49c5dc37 Move GetArgumentNamesModule registration to InterpreterManager() (#61549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61549

Move GetArgumentNamesModule registration to InterpreterManager() such that the module is a permanent part of the interpreters and can be used by InterpreterSession.global() freely.

Test Plan: [... ~/fbsource/fbcode/caffe2] buck test mode/dev caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetArgumentNames

Reviewed By: wconstab

Differential Revision: D29643460

fbshipit-source-id: cf132d4795cbb334ce164ac715d590a105535508
2021-07-13 00:57:01 -07:00
5144381b1d [pytorch][JIT] Widen exception caught by ScriptList casting (#61520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61520

This commit widens the exception caught by the try-catch block that checks if
an object passed to a scripted function is a `ScriptList`. It turns out that
there are internal tests that do not throw a `py::cast_error` so catching only
that is not sufficient.

Test Plan: Ran the failing tests in T94889011.

Reviewed By: Chillee

Differential Revision: D29560815

fbshipit-source-id: 442258f8997146d833a9d5db923e1f6359f2bfdd
2021-07-12 23:20:58 -07:00
94840969e4 SGX can not read from /dev/urandom (#60368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60368

Problem:
The SGX secure enclave does not support reading from /dev/urandom, as it is isolated from the OS for greater security. The SGX API provides a way to generate random numbers as a replacement.
Solution:
Conditionally enable the SGX API for random number generation when building for SGX.

Test Plan: Run the PyTorch tests

Reviewed By: malfet, LiJihang

Differential Revision: D29022616

fbshipit-source-id: 1c7115457a2abde682df4d55fa4a8446fc5f8613
2021-07-12 20:43:23 -07:00
8a2c7d902f [static runtime] Add DCHECK to ensure that outputs do not overlap with immutable inputs (#61301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301

This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.

Test Plan:
Added unittests as follows:

- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`

Reviewed By: hlu1

Differential Revision: D29564158

fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
2021-07-12 18:04:05 -07:00
4ef640d6f6 Sort imports of test_datapipe.py (#61312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61312

Sorted according to isort output. Alphabetically ordered, one-per-line imports make merging easier.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588833

Pulled By: VitalyFedyunin

fbshipit-source-id: 4c80c3086132b50894e734ad6c5799d78d689e42
2021-07-12 15:33:20 -07:00
fd13e925ec Adding backward compatibility for sharding support in old DataLoader (#61237)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61237

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588832

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bfa4417f6a04450f656ecf28fc95322d2cf076a
2021-07-12 14:53:45 -07:00
d3cb065b2f Implement usage of is_shardable and apply_sharding (#61236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61236

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588835

Pulled By: VitalyFedyunin

fbshipit-source-id: 00c3042f96af498637b2dcf6e3f842c1fc05ddd8
2021-07-12 14:23:20 -07:00
4d842d909b Revert FC workaround for ReflectionPad3d (#61308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61308

Reviewed By: iramazanli

Differential Revision: D29566849

Pulled By: jbschlosser

fbshipit-source-id: 8ab443ffef7fd9840d64d71afc2f2d2b8a410ddb
2021-07-12 14:19:07 -07:00
2fd37a830e Revert D29642893: .github: Add force_on_cpu tests for windows
Test Plan: revert-hammer

Differential Revision:
D29642893 (a52de0dfec)

Original commit changeset: 2dd2b295c71d

fbshipit-source-id: c01c421689f6d01cdfb3fe60a8c6428253249c5f
2021-07-12 14:01:44 -07:00
7fdce39a4b Revert D29642891: .circleci: Remove force_on_cpu jobs from circleci
Test Plan: revert-hammer

Differential Revision:
D29642891 (2aedd17661)

Original commit changeset: d51bb859bc28

fbshipit-source-id: a39a2d57d6e68961d94d4137a57bdc280f9b1b5b
2021-07-12 13:59:39 -07:00
58df01c3b8 clarify default value of requires_grad for tensors (#61038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61038

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29491984

Pulled By: dagitses

fbshipit-source-id: 7e6b7f8e81d77f38c881b86a68c17d3cf5483dad
2021-07-12 12:57:37 -07:00
5897a60480 warn about SVD outputs not supporting backprop (#61037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61037

* **#61037**

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29491985

Pulled By: dagitses

fbshipit-source-id: 6322e7c86cade52671062ee97d2fcb8c15d8aa86
2021-07-12 12:55:37 -07:00
65ab861ec6 fix mm not correctly reporting TORCH_CHECK failures (#61394)
Summary:
fixes https://github.com/pytorch/pytorch/issues/61291.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61394

Reviewed By: zhouzhuojie, seemethere

Differential Revision: D29614208

Pulled By: walterddr

fbshipit-source-id: f49a15dde708e30b06059b47fae1cda7c2c3571c
2021-07-12 12:50:51 -07:00
68f9819df4 Typo fix (#41121)
Summary:
Description:
- Typo fix in the docstring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41121

Reviewed By: heitorschueroff

Differential Revision: D29660228

Pulled By: ezyang

fbshipit-source-id: fc2b55683ec5263ff55c3b6652df3e6313e02be2
2021-07-12 12:43:47 -07:00
255a324258 add nesting_level as attribute to pickle for map datapipe (#61534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61534

Currently, the attribute `nesting_level` on `MapIterDataPipe` is not pickled. This yields `AttributeError` exceptions when multiprocessing with `DataLoader`.

This diff adds it as an attribute to pickle.

Test Plan: confirmed errors go away after change

Reviewed By: ejguan

Differential Revision: D29648655

fbshipit-source-id: 943b57eaff9712eb7ce92f43cb360acdb3111f2b
2021-07-12 11:41:01 -07:00
5144cc029e Bump docker image tag for clang-tidy (#61545)
Summary:
Fixes recent `clang-diagnostic-errors` on clang-tidy runs

See https://github.com/pytorch/test-infra/pull/59

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61545

Reviewed By: malfet, seemethere

Differential Revision: D29664061

Pulled By: 1ntEgr8

fbshipit-source-id: cca482a8774e34e61919f2298846ae0b479bf224
2021-07-12 11:32:39 -07:00
a5a10fe353 Move all downloading logic out of common_utils.py (#61479)
Summary:
and into the tools/ folder.

Currently, run_tests.py invokes tools/test_selections.py:
1. download and analyze which test files to run
2. download and parse S3 stats and pass the info to local files
3. common_utils.py uses the downloaded S3 stats to determine which test cases to run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61479

Reviewed By: janeyx99

Differential Revision: D29661986

Pulled By: walterddr

fbshipit-source-id: bebd8c474bcc2444e135bfd2fa4bdd1eefafe595
2021-07-12 11:23:22 -07:00
2aedd17661 .circleci: Remove force_on_cpu jobs from circleci (#61473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61473

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29642891

Pulled By: seemethere

fbshipit-source-id: d51bb859bc28efe15618d1e65f1a1cee64d60508
2021-07-12 11:17:33 -07:00
a52de0dfec .github: Add force_on_cpu tests for windows (#61472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61472

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29642893

Pulled By: seemethere

fbshipit-source-id: 2dd2b295c71d79593ad7f71d6160de4042c08b80
2021-07-12 11:16:17 -07:00
51d18369c3 [1/N] Nnapi backend delegation preprocess (#61499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61499

Added a preprocess function for the delegate to Nnapi backend (internal and external files).

In the past we had functions and classes for converting to the Nnapi backend. Now, these functions and classes will be wrapped by the delegate API.

### nnapi_backend_preprocess.cpp:

Contains the preprocess function, which uses Pybind to call an existing python function, `convert_model_to_nnapi()`.
- The model is wrapped by a `RecursiveScriptModule`, so that `convert_model_to_nnapi()` can run correctly, since when jumping from Python to C++ to Python, the model loses its original wrapper.
- A tensor, which includes shape, data type, and quantization information, is passed through preprocess's compile_spec to `convert_model_to_nnapi()`.
- Finally, the Nnapi model is serialized for mobile and returned as a string.
### nnapi_backend_lib.cpp:
Contains stub functions for compile and execute, and is necessary for the Nnapi backend to be registered correctly. These will be implemented in a future PR.

**TODO:** implement execute and compile for the delegate API; throw exceptions for an incorrect compile_spec; add OSS tests
**Testing:** Tests were done locally (see D29647123). A simple module was lowered to Nnapi, saved locally, and examined.

ghstack-source-id: 133415234

Test Plan:
Tests were done locally (see D29647123).
TODO: add test in OSS in test_backends.py after CMake is ready.
I ran buck run caffe2:nnapi_backend_example. The model files are saved as nnapi_model.ptl and mobile_model.ptl. I checked that both zip files have expected contents.

Reviewed By: iseeyuan

Differential Revision: D29563351

fbshipit-source-id: 642e349356e38aecc1b9973c285569650c02668c
2021-07-12 11:13:05 -07:00
3faf6a715d [special] migrate log_softmax (#60512)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Rendered Docs: https://14335157-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.log_softmax
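
A short usage example; the new function mirrors the existing `torch.log_softmax`:
```
import torch

x = torch.tensor([-100.0, 0.0, 100.0])
print(torch.special.log_softmax(x, dim=0))
print(torch.log_softmax(x, dim=0))   # same values
```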

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60512

Reviewed By: iramazanli

Differential Revision: D29626262

Pulled By: mruberry

fbshipit-source-id: c42d4105531ffb004f11f1ba6ae50be19bc02c91
2021-07-12 11:01:25 -07:00
f2857883c4 Add DataPipes Graph Functions (#61235)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61235

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588834

Pulled By: VitalyFedyunin

fbshipit-source-id: e0331d6e1fc2a3f8b6211aac83965bcf13165161
2021-07-12 10:28:35 -07:00
25a705610f ENH Adds support for no-batch dim in AdaptiveAvgPool1d (#61264)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61264

Reviewed By: iramazanli

Differential Revision: D29615292

Pulled By: jbschlosser

fbshipit-source-id: 826d1c87d67261a7211270e90e3a1022bbbe37bd
2021-07-12 10:24:37 -07:00
583b045fc3 Make .contiguous(memory_format) call .clone(memory_format) (#61456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61456

functorch is unable to `vmap(grad(f))` when `f` contains a `.contiguous`
call. This is because `.contiguous` (when it is not a no-op) decomposes
to `.copy_` under grad and the `.copy_` is not compatible with vmap.

The fix for this is to have `.contiguous` call `.clone` instead of
`.copy_`. `clone` is a primitive w.r.t. to autograd, so `grad`
decomposes contiguous into clone.
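
A tiny illustration of the affected path (user-visible behavior is unchanged; only the decomposition recorded under autograd differs):
```
import torch

x = torch.randn(3, 4, requires_grad=True)
y = x.t().contiguous()      # not a no-op here, so it materializes via clone
y.sum().backward()
print(torch.equal(x.grad, torch.ones_like(x)))  # True
```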

Perf testing (forward pass)
- [script and
output](https://gist.github.com/zou3519/294f583b9c5d7bdf234d5295f97fb02e)
- The instruction count increased from 774479 to 781379. This is because
we're now calling .clone(), which does an additional dispatch. We could
optimize the implementation of clone() to not dispatch on .copy_() in
the future if we really care about this.

Perf testing (backward pass)
- [script and
output](https://gist.github.com/zou3519/6fbdb121de6342334192d55c8a72276a)
- The instruction count decreased from 5402648 to 5335977. This is
because the [backward for
.clone](9b908ab0d0/tools/autograd/derivatives.yaml (L383))
is a lot simpler than the [backward for
copy_](9b908ab0d0/torch/csrc/autograd/functions/tensor.cpp (L37-L41))
- The backward for .clone() and .copy_() end up doing the same thing for
contiguous (from reading the code above, they both do no-op copies).

Test Plan:
- wait for existing tests (test_view_ops have the tests)
- functorch isn't tested in PyTorch CI yet.
- Taking suggestions on how to write a test for this. I'm thinking we
could use LoggingTensor from #59760 (because it logs underneath
autograd) and test that clone is called instead of copy_ but I didn't
want to refactor it into a utility

Reviewed By: soulitzer

Differential Revision: D29636859

Pulled By: zou3519

fbshipit-source-id: 97eb56bfae1c4bb31612dc9d06536019f21d69a6
2021-07-12 10:19:33 -07:00
5a20c56ebc [static runtime] Remove hasOperation() check (#61496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61496

glow::FusionGroup is a JitOnlyOperator that produces an Operation when passed a Node* https://fburl.com/ybwfn3bl

hasOperation() doesn't return true in that case https://fburl.com/19wd10aw

By removing the hasOperation() check, the Operation gets successfully materialized, and static runtime enables successfully and runs OK. We will check that the outputs match the JIT interpreter.

Test Plan:
Test with 281805158_2
```
./buck-out/gen/admarket/lib/ranking/prediction_replayer/replayer --model_inference_type_target=DISAGG_ACCELERATOR --prediction_replayer_force_model_type=inline_cvr_post_imp_model --prediction_replayer_force_model=281805158_2 --prediction_replayer_target_tier=127.0.0.1:7447 --prediction_replayer_input_stream_filename=/data/users/ansha/tmp/adfinder/filter_requests_inline_cvr_post_imp_model_1000_2021_04_29 --ignore_model_id_mismatch --check_performance --fully_remote_sr_connection_options="overall_timeout:10000000,processing_timeout:10000000" --use_new_encoding_for_ads_services --use_new_encoding_from_model_id_to_shard_id --sigrid_force_model_dir=/data/users/ansha/tmp/adfinder/281805158_2/ --sigrid_predictor_model_suffix=.predictor.disagg.local --use_new_encoding_from_model_id_to_shard_id=true --prediction_replayer_force_model_kind=19 --pytorch_predictor_static_runtime_enable=true --prediction_replayer_target_qps=1
```

```
NNPI_LOG_LEVEL=0 USE_INF_API=1 ./buck-out/gen/sigrid/predictor/sigrid_remote_predictor_glow_nnpi \
  --force_models=281805158_2 \
  --sigrid_predictor_model_suffix=.predictor.disagg.remote_other \
  --gflags_config_path=sigrid/predictor/gflags/predictor_gflags_ads_perf_glow_nnpi_pyper_v1 \
  --smc_server_port=7447 \
  --sigrid_predictor_tier_name=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --predictor_storage_smc_tier=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --predictor_storage_smc_tier_v2=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --torch_glow_min_fusion_group_size=30 \
  --glow_enable_sanitize_inputs=100 \
  --sigrid_force_model_dir=/data/users/ansha/tmp/adfinder/281805158_2/ \
  --pytorch_predictor_static_runtime_enable=true \
  --pytorch_predictor_glow_enable=true \
  --pytorch_predictor_enable_loading_xl_format_on_cpu=false \
  --pytorch_disagg_acc_input_dump_path=/tmp/
```

Reviewed By: hlu1

Differential Revision: D29647043

fbshipit-source-id: 8ce6dc0f4f0464b65ca6a8c9d42e3d8bb392e66e
2021-07-12 10:09:33 -07:00
99959fe3f5 [DataLoader] Adding demux and mux DataPipe-s (#61234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61234

* **#61234 [WIP] Adding demux and mux DataPipe API examples**
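
A hedged usage sketch of the two new DataPipes via their functional forms (import path may differ across versions):
```
from torch.utils.data.datapipes.iter import IterableWrapper

source = IterableWrapper(range(6))
evens, odds = source.demux(num_instances=2, classifier_fn=lambda x: x % 2)
merged = evens.mux(odds)    # round-robin interleave: 0, 1, 2, 3, 4, 5
print(list(merged))
```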

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588836

Pulled By: VitalyFedyunin

fbshipit-source-id: 523d12ea6be7507d706b4c6d8827ec1ac4ccabc3
2021-07-12 10:04:03 -07:00
d46689a201 OpInfo reference tests for add and sub (#61169)
Summary:
This PR adds OpInfo reference checks for `add, sub`. See https://github.com/pytorch/pytorch/issues/54261

cc: mruberry pmeier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61169

Reviewed By: iramazanli

Differential Revision: D29625702

Pulled By: mruberry

fbshipit-source-id: c5e536ab52865890990353c5c862b44b5a16ed20
2021-07-12 09:27:22 -07:00
c18017190b Relax some linalg test tolerances (#61101)
Summary:
We are seeing some test failures on an A100 machine, though TF32 matmul is not involved in these cases.

I tried the `svd_lowrank` test. It passed when run by itself, but failed when I ran the whole test suite; it's probably a random seed issue. Relaxing the test tolerance is much easier to do.

Some SVD tests failed when comparing CPU float32 vs GPU float32. Since linear algebra routines are somewhat unstable at single precision, comparing two single-precision results may give false positives. So we calculate the CPU results in float64 or complex128, which is much more accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61101

Reviewed By: ngimel

Differential Revision: D29593483

Pulled By: mruberry

fbshipit-source-id: 3df651e3cca1b0effc1a4ae29d4f26b1cb4082ed
2021-07-12 09:17:59 -07:00
bacf8ecbd1 Make pin_memory/is_pinned use BackendSelect (#60547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60547

These now dispatch on the optional Device argument, which specifies
what device you want to pin for.  We now directly register pinned
memory implementations for CUDA specifically, eliminating the need
for extra virtual methods.

This makes it possible for other backends to override the behavior
of pinned memory, c.f. https://github.com/pytorch/pytorch/pull/59291
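
A short usage example (requires a CUDA build, since pinning defaults to CUDA):
```
import torch

x = torch.randn(1024)
x_pinned = x.pin_memory()    # now routed through BackendSelect
print(x_pinned.is_pinned())  # True
```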

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD, bdhirsh

Differential Revision: D29331881

Pulled By: ezyang

fbshipit-source-id: db3b4e2c872ba1caa0243fecc60a4da65179ce28
2021-07-12 09:13:14 -07:00
7136a62b56 Add expecttest to CONTRIBUTING.md (#61163)
Summary:
expecttest is now an independent library, but `CONTRIBUTING.md` and `requirements.txt` do not mention the need for it.

Related: https://github.com/pytorch/pytorch/pull/60658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61163

Reviewed By: heitorschueroff

Differential Revision: D29660296

Pulled By: ezyang

fbshipit-source-id: e2e86d42526c83bec7cdf7221e19fe83d9686103
2021-07-12 09:11:12 -07:00
8754238410 torch._utils.ExceptionWrapper: fix for Exceptions with multiple args (#58131)
Summary:
Here's an example of what this PR should fix:
```
from torch._utils import ExceptionWrapper

class TwoArgException(Exception):
    def __init__(self, msg, count): ...

# If you need a "real world" exception with two args, here's one from the stdlib:
# import asyncio
# TwoArgException = asyncio.exceptions.LimitOverrunError
# or if on Python 3.7, try:
# TwoArgException = asyncio.streams.LimitOverrunError

try:
    raise TwoArgException("oh no", 0)
except Exception as e:
    data = ExceptionWrapper(where="in a test case")

data.reraise()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58131

Reviewed By: heitorschueroff

Differential Revision: D29660248

Pulled By: ezyang

fbshipit-source-id: cbcecfee9cac183354542e147ee3d956038c8986
2021-07-12 09:04:36 -07:00
93d98ecef7 update the pytorch-gdb example so that it works on current master (#61175)
Summary:
As pointed out by https://github.com/pytorch/pytorch/pull/54339#issuecomment-872827580, the `pytorch-gdb` example is currently broken because the code has been refactored.

This PR updates the example so that it works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61175

Reviewed By: heitorschueroff

Differential Revision: D29660336

Pulled By: ezyang

fbshipit-source-id: 8bcd32fc583c0b28a705ef37203ce7ad4d636732
2021-07-12 08:57:18 -07:00
cyy
0de35fe039 fix return local reference (#59913)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59913

Reviewed By: soulitzer

Differential Revision: D29107110

Pulled By: ezyang

fbshipit-source-id: c0f9888867c7dfeb05f6a3b9d2067df35e1e3ffb
2021-07-12 08:29:32 -07:00
d4549ba5dc Add VS_VERSION to Circle (#61532)
Summary:
Fixes current HUD 10.1 failure https://app.circleci.com/pipelines/github/pytorch/pytorch/349359/workflows/ead2904b-3f37-4c9d-b271-a8e772046523/jobs/14713215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61532

Test Plan: The new 10.1 CI run: https://app.circleci.com/pipelines/github/pytorch/pytorch/349677/workflows/b7143b56-e8e7-4f85-8bdf-0ce50788f3c0/jobs/14727686

Reviewed By: walterddr

Differential Revision: D29661179

Pulled By: janeyx99

fbshipit-source-id: 5023c41fe6ddce4113116b07d8f0fd7d66c864a8
2021-07-12 08:21:02 -07:00
cyy
00c4897c51 use make_unique (#61272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61272

Reviewed By: pbelevich

Differential Revision: D29660354

Pulled By: ezyang

fbshipit-source-id: f0aba1ea6983aec415915ed9b7dbced2e2b3b171
2021-07-12 08:09:46 -07:00
ac086ca15b Update version.txt file path (#61177)
Summary:
The file version.txt is located one directory above generate_torch_version;
some platforms are unable to find this file unless given an explicit
path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61177

Reviewed By: pbelevich

Differential Revision: D29660334

Pulled By: ezyang

fbshipit-source-id: f66105f782aaff031e373f96a69baabb13c89337
2021-07-12 07:30:10 -07:00
09679af260 Delete dead code in Tensor::to implementation (#61435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61435

Deleted the following:
- I couldn't find the NOTE mentioned so I deleted the reference to it
- The memory_format check (because it always passes)
- The requires_grad check (because it always passes)

Test Plan: - run tests

Reviewed By: soulitzer

Differential Revision: D29636872

Pulled By: zou3519

fbshipit-source-id: 48a32c1821b72c512d337becf2398ce7f4cf01a2
2021-07-12 07:10:27 -07:00
60086ab39b Remove export PYTHONPATH hacks (#61487)
Summary:
Remove `export PYTHONPATH=$PWD` in favor of `-m`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61487

Test Plan: Let's see if CI passes

Reviewed By: 1ntEgr8

Differential Revision: D29645544

Pulled By: janeyx99

fbshipit-source-id: 841aea8ebed2cb1c7dbc68754b5fbdee932559c2
2021-07-12 06:59:50 -07:00
5c1505076b [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D29656934

fbshipit-source-id: c40bbc8e4512b145050ee47db2c8dc781f3c36e9
2021-07-12 04:15:21 -07:00
666dff381d add AdaptiveAvgPooling2D (#61239)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61239

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29626359

Pulled By: migeed-z

fbshipit-source-id: b7cd4ce4176e2d6e7a853974443affd23a49d3d9
2021-07-10 20:07:14 -07:00
93ef40bd83 add linear operation and modify one of the tests (#61238)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61238

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29626333

Pulled By: migeed-z

fbshipit-source-id: d4303918e380d64ba8ab678f249db6674e89357a
2021-07-10 20:07:12 -07:00
292ee65261 add maxpool2D, add more tests, handle integer parameters for maxpool2D (#61188)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61188

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29626303

Pulled By: migeed-z

fbshipit-source-id: 32309cd1eb1189beaba63017653b3aeccdf2761d
2021-07-10 20:06:07 -07:00
7a15576a65 [quant] update FakeQuant modules to use tensor qparams (#61318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318

Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator.

Calling `float()`/`int()` internally calls `item()`, which can trigger a GPU->CPU copy if the original tensors reside on GPU (see the sketch below).
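For illustration, a minimal before/after sketch of the call pattern (the values and CPU placement are illustrative; the tensor-qparam overload is the one added in D29552727 below):
```python
import torch

x = torch.randn(4, 4)
scale = torch.tensor(0.1)                        # float32 tensor qparam
zero_point = torch.tensor(0, dtype=torch.int32)  # int32 tensor qparam

# Before: float()/int() call item() under the hood, which forces a
# GPU->CPU copy whenever scale/zero_point live on a CUDA device.
y_old = torch.fake_quantize_per_tensor_affine(
    x, float(scale), int(zero_point), 0, 255)

# After: pass the tensors straight through; no host sync is needed.
y_new = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)

assert torch.equal(y_old, y_new)
```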
Local benchmark P427668213

Before this change
```
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::_aminmax         2.57%       1.507ms         3.10%       1.819ms      36.371us       2.872ms         4.81%       2.872ms      57.446us            50
              aten::fake_quantize_per_tensor_affine         1.04%     610.915us         3.60%       2.114ms      42.276us     472.896us         0.79%       2.698ms      53.962us            50
    aten::fake_quantize_per_tensor_affine_cachemask         1.69%     993.626us         2.56%       1.503ms      30.058us       2.225ms         3.73%       2.225ms      44.504us            50
                                   aten::is_nonzero         3.85%       2.258ms        19.68%      11.540ms      46.161us       2.168ms         3.63%      11.084ms      44.336us           250
                                   aten::zeros_like         1.82%       1.064ms         6.65%       3.901ms      39.007us       1.531ms         2.57%       3.905ms      39.045us           100
                                           aten::eq        13.80%       8.093ms        25.90%      15.189ms      37.972us       9.580ms        16.05%      15.566ms      38.914us           400
                                         aten::item         5.67%       3.323ms        21.50%      12.607ms      36.019us       3.233ms         5.42%      12.167ms      34.762us           350
                                        aten::zeros         0.94%     549.208us         2.93%       1.717ms      34.343us     688.928us         1.15%       1.695ms      33.894us            50
                                           aten::le         2.52%       1.478ms         4.50%       2.641ms      26.411us       1.753ms         2.94%       2.845ms      28.448us           100
                                         aten::rsub         1.04%     608.715us         2.44%       1.433ms      28.667us     532.000us         0.89%       1.418ms      28.353us            50
                                          aten::max         1.54%     905.401us         4.62%       2.711ms      27.106us     847.488us         1.42%       2.697ms      26.969us           100
                                         aten::ones         0.92%     542.159us         2.16%       1.266ms      25.324us     661.856us         1.11%       1.301ms      26.017us            50
                                          aten::min         0.82%     479.167us         2.15%       1.258ms      25.160us     407.808us         0.68%       1.276ms      25.530us            50
                          aten::_local_scalar_dense        15.83%       9.284ms        15.83%       9.284ms      26.526us       8.934ms        14.97%       8.934ms      25.524us           350
                                        aten::clamp         2.35%       1.378ms         4.21%       2.467ms      24.669us       1.546ms         2.59%       2.461ms      24.612us           100
                                        aten::zero_         2.53%       1.482ms         5.65%       3.316ms      22.108us       1.326ms         2.22%       3.380ms      22.531us           150
                                      aten::maximum         3.08%       1.805ms         3.08%       1.805ms      18.052us       1.849ms         3.10%       1.849ms      18.494us           100
                                      aten::minimum         1.33%     778.854us         1.33%     778.854us      15.577us     868.672us         1.46%     868.672us      17.373us            50
                                        aten::round         1.36%     799.910us         1.36%     799.910us      15.998us     809.568us         1.36%     809.568us      16.191us            50
                                        aten::copy_         6.61%       3.878ms         6.61%       3.878ms      15.513us       4.036ms         6.76%       4.036ms      16.143us           250
                                          aten::div         2.53%       1.483ms         2.53%       1.483ms      14.833us       1.535ms         2.57%       1.535ms      15.353us           100
                                          aten::mul         2.44%       1.431ms         2.44%       1.431ms      14.314us       1.478ms         2.48%       1.478ms      14.782us           100
                                       aten::detach         1.46%     855.670us         2.41%       1.411ms      14.110us     832.448us         1.39%       1.395ms      13.949us           100
                                          aten::add         2.22%       1.301ms         2.22%       1.301ms      13.008us       1.383ms         2.32%       1.383ms      13.828us           100
                                        aten::fill_         4.18%       2.452ms         4.18%       2.452ms      12.262us       2.693ms         4.51%       2.693ms      13.463us           200
                                          aten::sub         5.06%       2.967ms         5.06%       2.967ms      14.837us       2.675ms         4.48%       2.675ms      13.374us           200
                                           aten::to         2.10%       1.230ms         3.65%       2.140ms      10.701us       1.310ms         2.20%       2.062ms      10.310us           200
                                       aten::select         1.28%     749.144us         1.49%     874.227us       8.742us     863.232us         1.45%     863.232us       8.632us           100
                                             detach         0.95%     555.326us         0.95%     555.326us       5.553us     562.496us         0.94%     562.496us       5.625us           100
                                   aten::as_strided         0.40%     232.289us         0.40%     232.289us       1.161us       0.000us         0.00%       0.000us       0.000us           200
                                        aten::empty         2.93%       1.720ms         2.93%       1.720ms       3.439us       0.000us         0.00%       0.000us       0.000us           500
                                      aten::resize_         1.04%     611.313us         1.04%     611.313us       2.038us       0.000us         0.00%       0.000us       0.000us           300
                                   aten::empty_like         0.75%     438.585us         1.77%       1.036ms       5.180us       0.000us         0.00%       0.000us       0.000us           200
                                aten::empty_strided         1.36%     799.442us         1.36%     799.442us       3.198us       0.000us         0.00%       0.000us       0.000us           250
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 58.645ms
Self CUDA time total: 59.674ms
```

After this change
```

test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fake_quantize_per_tensor_affine         0.98%     505.210us         4.38%       2.259ms      45.187us     419.424us         0.78%       3.218ms      64.367us            50
                                         aten::_aminmax         2.78%       1.434ms         3.42%       1.766ms      35.321us       2.825ms         5.27%       2.825ms      56.505us            50
aten::fake_quantize_per_tensor_affine_cachemask_tens...         2.38%       1.229ms         3.40%       1.754ms      35.083us       2.799ms         5.22%       2.799ms      55.979us            50
                                             aten::rsub         0.94%     485.040us         5.02%       2.590ms      51.793us     458.976us         0.86%       2.587ms      51.747us            50
                                       aten::is_nonzero         3.78%       1.952ms        23.64%      12.196ms      48.786us       2.055ms         3.83%      11.986ms      47.944us           250
                                             aten::item         6.92%       3.572ms        19.86%      10.244ms      40.977us       3.670ms         6.85%       9.931ms      39.724us           250
                                       aten::zeros_like         1.65%     848.874us         6.64%       3.426ms      34.260us       1.397ms         2.61%       3.572ms      35.717us           100
                                            aten::zeros         0.85%     436.691us         3.00%       1.549ms      30.984us     551.936us         1.03%       1.576ms      31.516us            50
                                               aten::eq        10.60%       5.467ms        20.26%      10.452ms      26.130us       7.018ms        13.09%      10.832ms      27.079us           400
                                               aten::le         2.58%       1.332ms         4.67%       2.407ms      24.074us       1.580ms         2.95%       2.614ms      26.144us           100
                              aten::_local_scalar_dense        12.93%       6.673ms        12.93%       6.673ms      26.691us       6.261ms        11.68%       6.261ms      25.046us           250
                                            aten::clamp         2.43%       1.253ms         4.37%       2.256ms      22.560us       1.431ms         2.67%       2.273ms      22.725us           100
                                             aten::ones         0.89%     460.133us         2.18%       1.123ms      22.467us     570.496us         1.06%       1.128ms      22.551us            50
                                              aten::min         0.74%     383.132us         2.06%       1.065ms      21.296us     377.536us         0.70%       1.091ms      21.824us            50
                                            aten::zero_         2.36%       1.219ms         5.87%       3.029ms      20.194us       1.261ms         2.35%       3.199ms      21.327us           150
                                              aten::max         1.51%     779.081us         4.06%       2.096ms      20.960us     791.680us         1.48%       2.130ms      21.295us           100
                                              aten::sub         7.97%       4.111ms         7.97%       4.111ms      20.556us       3.847ms         7.18%       3.847ms      19.234us           200
                                              aten::div         2.94%       1.516ms         2.94%       1.516ms      15.158us       1.580ms         2.95%       1.580ms      15.798us           100
                                            aten::round         1.45%     750.445us         1.45%     750.445us      15.009us     756.064us         1.41%     756.064us      15.121us            50
                                            aten::copy_         6.88%       3.548ms         6.88%       3.548ms      14.190us       3.701ms         6.90%       3.701ms      14.803us           250
                                          aten::minimum         1.32%     681.654us         1.32%     681.654us      13.633us     713.664us         1.33%     713.664us      14.273us            50
                                          aten::maximum         2.55%       1.317ms         2.55%       1.317ms      13.169us       1.338ms         2.50%       1.338ms      13.378us           100
                                              aten::mul         2.63%       1.358ms         2.63%       1.358ms      13.581us       1.328ms         2.48%       1.328ms      13.283us           100
                                           aten::detach         1.34%     688.820us         2.35%       1.211ms      12.110us     772.800us         1.44%       1.278ms      12.779us           100
                                            aten::fill_         4.53%       2.338ms         4.53%       2.338ms      11.692us       2.495ms         4.65%       2.495ms      12.473us           200
                                              aten::add         2.32%       1.197ms         2.32%       1.197ms      11.968us       1.240ms         2.31%       1.240ms      12.405us           100
                                               aten::to         2.07%       1.069ms         3.66%       1.889ms       9.443us       1.224ms         2.28%       1.975ms       9.874us           200
                                           aten::select         1.44%     743.042us         1.64%     848.207us       8.482us     641.600us         1.20%     641.600us       6.416us           100
                                                 detach         1.01%     522.155us         1.01%     522.155us       5.222us     505.088us         0.94%     505.088us       5.051us           100
                                       aten::as_strided         0.44%     227.884us         0.44%     227.884us       1.139us       0.000us         0.00%       0.000us       0.000us           200
                                            aten::empty         3.20%       1.652ms         3.20%       1.652ms       3.304us       0.000us         0.00%       0.000us       0.000us           500
                                          aten::resize_         1.25%     646.711us         1.25%     646.711us       2.156us       0.000us         0.00%       0.000us       0.000us           300
                                       aten::empty_like         0.79%     407.768us         2.07%       1.067ms       5.334us       0.000us         0.00%       0.000us       0.000us           200
                                    aten::empty_strided         1.52%     785.788us         1.52%     785.788us       3.143us       0.000us         0.00%       0.000us       0.000us           250
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 51.590ms
Self CUDA time total: 53.609ms
ghstack-source-id: 133370215

Test Plan: buck test mode/dev-nosan caffe2/test/:quantization

Reviewed By: raghuramank100

Differential Revision: D29566512

fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c
2021-07-10 19:43:02 -07:00
99848c7269 [quant] Add tensor_qparam variant to fake_quantize_per_tensor (#61317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61317

Add an overload to fake_quantize_per_tensor that accepts scale/zero_point as input. The reasons to do this are

* required for fused observer + fake_quant operator on GPU where the scale/zero_point will be calculated by the observer on device. Passing tensor inputs enables us to directly access the scale/zero-point value in the cuda kernel to avoid extra copies/malloc
* enables us to pass in float as scale dtype and int32 as zero_point dtype (which is consistent with what the quantize call actually uses) https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/affine_quantizer_base.cpp#L52-L53
* overload consistent with `quantizer_per_tensor.tensor_qparams`
ghstack-source-id: 133370216

Test Plan:
buck test mode/dev-nosan caffe2/test/:quantization -- test_backward_per_tensor_cachemask
buck test mode/dev-nosan caffe2/test/:quantization -- test_forward_per_tensor_cachemask

Reviewed By: raghuramank100

Differential Revision: D29552727

fbshipit-source-id: cbb9af40fc575ad27a29c646b760d5ee52cc923d
2021-07-10 19:41:55 -07:00
57676ce128 Migrate multi_margin_loss to ATen (CUDA) (#61426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61426

Closes gh-24600, closes gh-24601

These operators use custom kernels that aren't well suited to `TensorIterator` style, so this is just migrating the CUDA code and cleaning up the style.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29648015

Pulled By: ngimel

fbshipit-source-id: cadf1890cdc2199d57f4533370e554613efeb54a
2021-07-10 18:48:58 -07:00
5a17cb6f44 Add channels-last support for bilinear and nearest 2d interpolation on CUDA (#56322)
Summary:
Add channels-last support for bilinear and nearest 2d interpolation on CUDA

Benchmark (on 2070 Super) is available at

- nearest 2d: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/nearest-2d
- bilinear: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/bilinear

Some regressions are seen for tensors with small channel size. We may add a heuristic to dispatch between the contiguous and channels-last paths if needed.

Close https://github.com/pytorch/pytorch/issues/60137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56322

Reviewed By: mruberry

Differential Revision: D29645980

Pulled By: ngimel

fbshipit-source-id: c36dff4ee4789bec9b01da4029f326d30067c6b7
2021-07-10 18:00:50 -07:00
df00c636d2 [Model Averaging] Skip model averaging for the first K steps (#61207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207

The model averager must now be combined with the post-localSGD DDP communication hook. It skips model averaging for the first K steps, because the post-localSGD communication hook runs global gradient averaging during that phase.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371335

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: pritamdamania87

Differential Revision: D29523738

fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b
2021-07-10 17:12:16 -07:00
0f6876d721 [Model Averaging] Create a post-localSGD communication hook (#61206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61206

Create a communication hook to run post-local SGD. This will be combined with the model averager component to better support local SGD.

In contrast to the previous approach, which ran local gradient averaging + global model averaging at each of the first K steps, we now plan to run only global gradient averaging at each of the first K steps, just like normal DDP. This gives us two advantages (see the usage sketch after this list):
1) For some optimizers, model averaging can cause a discrepancy in optimizer states. If we still do global gradient averaging for the first K steps, we can defer such discrepancy until we actually start local SGD.
2) Gradient averaging for the first K steps runs only one allreduce that overlaps with the backward pass, so it should also be more efficient.
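A usage sketch of how the hook and averager are meant to compose; the symbol names (`PostLocalSGDState`, `post_localSGD_hook`, `PeriodicModelAverager`) follow the torch.distributed modules this stack builds toward and should be read as assumptions, not the exact API landed in this diff:
```python
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState, post_localSGD_hook,
)
from torch.distributed.algorithms.model_averaging.averagers import (
    PeriodicModelAverager,
)

def configure_post_localsgd(ddp_model, K: int):
    # For the first K steps the hook runs plain global gradient averaging
    # (normal DDP); afterwards it switches to local gradient averaging,
    # and the averager performs the periodic global model averaging.
    state = PostLocalSGDState(process_group=None, subgroup=None,
                              start_localSGD_iter=K)
    ddp_model.register_comm_hook(state, post_localSGD_hook)
    return PeriodicModelAverager(period=4, warmup_steps=K)
```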

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371322

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: pritamdamania87

Differential Revision: D29523292

fbshipit-source-id: 3f215f7150f2917c2781278fad759530c685ea2c
2021-07-10 17:11:10 -07:00
a46d4212bf Allow dims=0 in torch.tensordot call (#61331)
Summary:
In one of my previous PRs that rewrote the `tensordot` implementation, I mistakenly treated empty `dims_a` and `dims_b` as illegal values. That turns out to be wrong: empty `dims_a` and `dims_b` are supported, and in fact common when `dims` is passed as an integer. This PR removes the unnecessary check.
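A small example of the now-accepted case:
```python
import torch

a = torch.randn(3, 4)
b = torch.randn(5, 6)

# dims=0 means dims_a and dims_b are both empty: nothing is contracted,
# and tensordot degenerates to an outer product of shape (3, 4, 5, 6).
out = torch.tensordot(a, b, dims=0)
assert out.shape == (3, 4, 5, 6)
```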

Fixes https://github.com/pytorch/pytorch/issues/61096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61331

Reviewed By: eellison

Differential Revision: D29578910

Pulled By: gmagogsfm

fbshipit-source-id: 96e58164491a077ddc7a1d6aa6ccef8c0c9efda2
2021-07-10 17:05:20 -07:00
7d7b7abb3b [Static Runtime] Separate function for getting always_alive values (#61506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61506

Separate out the logic of GetAlwaysAliveValues from GetLivenessMap to simplify the code structure. This also means GetLivenessMap does not need to run when optimize_memory is turned off.

Reviewed By: ajyu

Differential Revision: D29423534

fbshipit-source-id: dbdeeb10f7bcad86a24aa12f741f7c9ab946bb3b
2021-07-10 16:59:29 -07:00
7fdc5f9e08 model_dump: Fix non-counting and double-counting bugs in tensor memory (#60702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60702

- Instead of traversing and counting all tensor memory, collect a map
  from storage key to storage info while traversing. Add up sizes at
  the end to avoid double counting (see the sketch after this list).
- Count tensor memory from constants as well.
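A sketch of the dedup-by-storage idea, using current public tensor/storage accessors for illustration rather than the actual model_dump internals:
```python
import torch

def total_storage_bytes(tensors):
    # Key on the storage, not the tensor, so views sharing one storage
    # are counted exactly once.
    seen = {}
    for t in tensors:
        storage = t.untyped_storage()
        seen[storage.data_ptr()] = storage.nbytes()
    return sum(seen.values())

base = torch.zeros(1000)
views = [base, base[:10], base.view(10, 100)]  # all share one storage
assert total_storage_bytes(views) == base.untyped_storage().nbytes()
```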

Test Plan: Ran webdriver test.

Reviewed By: dhruvbird

Differential Revision: D29380396

Pulled By: dreiss

fbshipit-source-id: 6d0fd66f677fe23c851aa218387aa4dc59502b1e
2021-07-10 15:16:34 -07:00
158d351517 model_dump: Add webdriver test (#60701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60701

The unit test previously only tested that the dump could complete
successfully.  It was not able to verify that any JS worked properly.
Now we can test the JS as long as webdriver is installed.

Tweaked the implementation of Hider a bit to make it easier for tests to
find and open them.

I disabled the tests by default since I don't want to deal with
webdriver in CI.  Enable them with the environment variable
RUN_WEBDRIVER=1.

We could make the tests use headless mode, but it's kind of fun to watch
them run.

Add a test to verify that tensor memory computation is working for the
simple model.

Test Plan: Ran the test.

Reviewed By: dhruvbird

Differential Revision: D29380398

Pulled By: dreiss

fbshipit-source-id: f19d0b05d79ad5a8231e85422976f1889e021c89
2021-07-10 15:16:32 -07:00
cc78c463c0 model_dump: Render constants.pkl similar to data.pkl (#60700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60700

Test Plan:
Dumped a model with a lot of constants (qconvs produced by optimizing).
Was able to see them rendered nicely.

Reviewed By: dhruvbird

Differential Revision: D29380400

Pulled By: dreiss

fbshipit-source-id: c951508b92bb2717591dd173282157e1a40a30bd
2021-07-10 15:16:31 -07:00
e292f34def model_dump: Make stdout argument for main a keyword-only argument (#60699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60699

Also add a unit test for main, which brings the test coverage up to
~98%.  Also factor out the "needs importlib.resources" check into a
function for easier reuse.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D29380397

Pulled By: dreiss

fbshipit-source-id: bba16da85bf7bfb4370308e38c844694d01b47eb
2021-07-10 15:16:29 -07:00
2942e9aa80 model_dump: update maintainer comment (#60698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60698

... to reflect that the Python command should be re-run when changing
the model.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D29380399

Pulled By: dreiss

fbshipit-source-id: 1ec464da4ebe6ddf400eb4a3b14da683369c0039
2021-07-10 15:15:15 -07:00
f5c10fdbd3 Allow for heterogenous List and Dict values + Improve container typing algorithm (#57137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57137

This PR corrects and expands our typing algorithm for unannotated, non-empty dicts and lists. Previously, to verify type correctness for an unannotated, non-empty container, we had gotten the type of the first element in the container, then checked if each following element was a subtype of the first type. That's too restrictive--what if the first element were a subtype of the second element? Instead, we should type the container by getting the smallest common supertype of all the given elements.

We need slightly different rules for keys and values in dicts, though: because the set of key types is restricted, finding two key types that cannot be unified should cause an error. On the other hand, the set of value types is not restricted, so we should be able to use `Any` as a valid supertype. We need to keep the set of keys restricted since the keys are used to generate and match schemas.

This does not break backwards compatibility, because the default element type is the smallest supertype of all the given types. So, if someone creates an unannotated dict where the keys are all `str` and the values are all `torch.Tensor`, the dict will be inferred to `Dict[str, Tensor]` just like it was before. Empty lists are still typed as `List[torch.Tensor]`, and empty dicts are still typed as `Dict[str, Tensor]`.
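A sketch of the new inference behavior (assuming these literal forms script cleanly):
```python
import torch
from typing import Any, Dict, List, Optional

@torch.jit.script
def values_unify_to_any() -> Dict[str, Any]:
    # Values of different types unify to Any instead of erroring;
    # the keys must still unify to a real type (here, str).
    return {"a": 1, "b": "two"}

@torch.jit.script
def supertype_of_all_elements() -> List[Optional[int]]:
    # Under the old first-element rule, a leading None failed because int
    # is not a subtype of NoneType; now the elements unify to their
    # smallest common supertype, Optional[int].
    return [None, 1, 2]
```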

This PR unblocks three engineers on an FB-internal team and improves FX-TorchScript compatibility.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28231839

Pulled By: ansley

fbshipit-source-id: 7297bf239749daa54895add708185c75e6ca5999
2021-07-10 14:29:05 -07:00
ccd0977060 [Static Runtime] Support prim::GetAttr/SetAttr (#61505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505

The handling of `self` in Static Runtime was previously incorrect. This diff fixes that issue, since `self` is essential to prim::GetAttr/SetAttr. After all, most of the time we are getting and setting attributes on `self`, the TorchScript module.
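For context, a minimal TorchScript module whose graph contains both ops on `%self` (illustrative only, unrelated to the Static Runtime internals touched here):
```python
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.count = 0

    def forward(self, x):
        self.count = self.count + 1   # compiles to prim::SetAttr on %self
        return x + self.count         # compiles to prim::GetAttr on %self

m = torch.jit.script(Counter())
print(m.graph)  # both attribute ops take %self as their first input
```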

Reviewed By: ajyu

Differential Revision: D29350173

fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
2021-07-10 14:06:06 -07:00
f291b1899f Revert D27978269: Smart Decay for Adam - Caffe2
Test Plan: revert-hammer

Differential Revision:
D27978269 (aaa1e07609)

Original commit changeset: e47524101ddf

fbshipit-source-id: 334824bbf9a6ed788e75af9c292754081f70a19b
2021-07-10 13:09:58 -07:00
8bcf24b37a [TCPStore] enhance connect timeout error message (#61390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61390

Enhances this error message for better debuggability.
ghstack-source-id: 133185482

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D29601528

fbshipit-source-id: f7aaf4d67ac96e6ed0b535e0200f918dd01e42f9
2021-07-10 03:57:23 -07:00
336970c03e Add note on torch.distributed backends on ROCm (#58975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58975

Reviewed By: soulitzer

Differential Revision: D29595510

Pulled By: rohan-varma

fbshipit-source-id: 384bb67fcd003d65b76e957a474406b2a38099b9
2021-07-10 03:51:19 -07:00
73b86c9f9c Add getMethod to PytorchPredictorContainer (#61052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61052

Implement getMethod in the container in a similar way to getPredictor,
using either Deploy or Script functionality depending on how the container
was initialized and how the deploy gflag overrides are set.

Test Plan: Add new unit test

Reviewed By: houseroad

Differential Revision: D29346969

fbshipit-source-id: 08e95ee96d533f5a7cc9c8f9b1c53751715c9181
2021-07-09 22:27:40 -07:00
677313b670 ReLU (#61150)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61150

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29625826

Pulled By: migeed-z

fbshipit-source-id: 10e0662e33ccd4342cedd51579a10651755b633f
2021-07-09 19:32:08 -07:00
a556c1c4dc [profiler] Update Kineto submodule (ci-all) (#61478)
Summary:
Update Kineto submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61478

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61432

Reviewed By: gdankel

Differential Revision: D29646019

Pulled By: ilia-cher

fbshipit-source-id: 02ecb0a2a6b457f6537c7d6b3c475e1e0ace3b6f
2021-07-09 19:32:06 -07:00
06166a13e0 Remove VS install step unless necessary from GHA Windows workflows (#60791)
Summary:
~~This should only be merged after our AMI has been deployed after https://github.com/fairinternal/pytorch-gha-infra/pull/1. (And will likely fail our current windows jobs)~~

I have revised this PR to install VS only when it's not already installed.

This should save ~5min per Windows workflow.
![image](https://user-images.githubusercontent.com/31798555/125141598-7e886c80-e0e3-11eb-9fe0-bb9e6bcc14f1.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60791

Reviewed By: soulitzer

Differential Revision: D29643876

Pulled By: janeyx99

fbshipit-source-id: 4bcfaf5bcad9e5636a1624c3e799e7cc97a87660
2021-07-09 19:32:04 -07:00
9b2b45919a Revert D29639797: [package] error if we try to mock a module in 3.6
Test Plan: revert-hammer

Differential Revision:
D29639797

Original commit changeset: 775ed78638fb

fbshipit-source-id: 9d2f6dae7ee35c6b37338e36ec7ade9d9e2ccbc2
2021-07-09 19:31:04 -07:00
aaa1e07609 Smart Decay for Adam - Caffe2 (#61488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61488

We want to decay learned parameters properly. Previously this was not done when a parameter was absent from a minibatch. We fix this by keeping track of missed minibatches and making the decay catch up accordingly.

The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. In fact, for the absent parameters too, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2 respectively.

To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen (see the sketch below).
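A minimal Python sketch of the catch-up rule (illustrative names; the actual change lives in the Caffe2 Adam operator). Decaying by beta^k when a parameter reappears is equivalent to having applied k - 1 zero-gradient updates while it was absent, followed by the current update:
```python
def smart_adam_moments(m, v, grad, step, last_seen, beta1=0.9, beta2=0.999):
    # k is the number of minibatches since this parameter was last updated.
    k = step - last_seen
    m = (beta1 ** k) * m + (1.0 - beta1) * grad
    v = (beta2 ** k) * v + (1.0 - beta2) * grad * grad
    return m, v, step  # the caller records `step` as the new last_seen
```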

Differential Revision: D27978269

fbshipit-source-id: e47524101ddfcb281c46c505b9b7a8f0835bc64a
2021-07-09 18:28:21 -07:00
b52909d861 [TensorExpr] Add python bindings for ArgValue class and TensorExprKernel constructor accepting custom lowerings. (#61385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61385

The bindings coverage might be not full yet, but this already allows us
to register custom lowerings from python.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29623487

Pulled By: ZolotukhinM

fbshipit-source-id: b97ee420a57fd887e204c021b9e098764b2ee232
2021-07-09 18:27:14 -07:00
dec5aa2260 [JIT] clean up (#60390)
Summary:
* Minor: spelling, grammar.
* Add calls to `GRAPH_DUMP()` where they were missing.
* Add or expand a few comments.
* Move a few comments to seemingly more appropriate spots.
* In canonicalize_graph_fuser_ops.cpp inline `runnableInputs()` since it
  was only called in one place and had a misleading comment and
  confusing name.
* In `PeepholeOptimizeImpl::optimizeBlock()`, set `changed = true;` when
  removing `aten::is_complex`. Pretty sure its absence was a bug.
* Delete unused `_jit_pass_remove_inplace_ops` and its implementation
  `RemoveInplaceOps()`.
* In `preprocessCaffe2Ops()`, remove redundant check for nested optional
  types. It was already checked in `checkONNXCompatibility()`.
* In `EncoderBase::AddAttribute`, log the unexpected attribute kind.
  I don't remember the repro case now but I did hit this error at some
  point and this additional logging made it easier to understand.
* In `fuseConvBatchNorm()` in eval_peephole.cpp, consistently use
  camelCase instead of snake_case for local variables.
* Add curly braces around the bodies of if and loops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60390

Reviewed By: Krovatkin

Differential Revision: D29523283

Pulled By: SplitInfinity

fbshipit-source-id: 4e16c5648616f53da07d68dab7fdf252e06a0752
2021-07-09 16:28:27 -07:00
54ea7d33ba [package] error if we try to mock a module in 3.6 (#61469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61469

This feature is not supported; error out early.

Differential Revision: D29639797

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: 775ed78638fb6da8f830b632726b00c0533ed176
2021-07-09 16:26:38 -07:00
a3670ba377 Add option to specify custom NNAPI serializer (#61025)
Summary:
To add a serializer for custom ops, we can subclass the default serializer
and update its ADDER_MAP (see the sketch below).
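A hypothetical sketch of that pattern; the module path, class name, and adder signature below are assumptions for illustration, not the verified API:
```python
from torch.backends._nnapi.serializer import _NnapiSerializer  # name assumed

class MySerializer(_NnapiSerializer):
    ADDER_MAP = dict(_NnapiSerializer.ADDER_MAP)

    def add_my_custom_op(self, node):
        # Translate the custom op's inputs/outputs into NNAPI operands here.
        raise NotImplementedError

    ADDER_MAP["my_ns::my_custom_op"] = add_my_custom_op
```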

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61025

Test Plan:
* pytest test/test_nnapi.py::TestNNAPI for current serializer
* Custom serializers to be tested with custom ops

Imported from OSS

Reviewed By: anshuljain1

Differential Revision: D29480745

fbshipit-source-id: 37e3f8de3c97f6c8a486f9879ce11430ea89af34
2021-07-09 15:27:10 -07:00
cbb6ab6d88 [package] ignore dunder import errors (#61148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61148

Changes `__import__` processing to silently skip cases where the `__import__` statement cannot be parsed. Adds failed imports to a list retrievable by `PackageExporter.failed_dunder_import_list()`.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29559680

Pulled By: Lilyjjo

fbshipit-source-id: 2513d0b9ef271c85cadc3f5a013fbd8c8de80b46
2021-07-09 15:27:08 -07:00
12772c8dd8 [package] PackageExporter visualization methods (#61147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61147

Basic tooling to enable users to see what is inside of a PackageExporter (a usage sketch follows the example output). Added methods:
- `externed/interned/mocked/denied_list()`: returns list of modules which are currently in the specified category
- `relied_on_by(module_name)`: returns list of modules which rely on `module_name`
- `dependency_graph_str()`: returns string format of graph for users. Example of output:
```
digraph G {
rankdir = LR;
node [shape=box];
"<res.foo.pkl>" -> "foo";
"foo" -> "torch.package";
"foo" -> "time";
"foo" -> "sentencepiece";
"foo" -> "package_top";
}
```
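A hypothetical usage sketch of these methods (the packaged object and file names are invented):
```python
from torch.package import PackageExporter

with PackageExporter("res.pt") as exporter:
    exporter.intern("**")
    exporter.save_pickle("res", "foo.pkl", {"x": 1})
    print(exporter.interned_list())         # modules currently interned
    print(exporter.dependency_graph_str())  # digraph like the one above
```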

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29559683

Pulled By: Lilyjjo

fbshipit-source-id: 5dff4d04af911a9c9fdd0d100420f1382eaef46e
2021-07-09 15:27:06 -07:00
b5f0576278 [package] Modify Digraph to track predecessors (#61146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61146

Track predecessors of nodes in DiGraph in order to enable cleaner dependency visualization code.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29559682

Pulled By: Lilyjjo

fbshipit-source-id: 06f51b1108423aece5bdd72a7b82ab736e5e4f94
2021-07-09 15:27:04 -07:00
ae65f63971 Make nnapi flatten converter accept flex inputs (#61024)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61024

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_flatten

Reviewed By: anshuljain1

Differential Revision: D29480748

fbshipit-source-id: c334b09600a64d3e552cec843d6da3de28e7d27c
2021-07-09 15:27:02 -07:00
028e438d6c [torchelastic] Make sure rdzv_configs[timeout] is not getting overwritten (#61471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61471

Make sure `rdzv_configs[timeout]` is not getting overwritten

Test Plan: sandcastle

Differential Revision: D29638606

fbshipit-source-id: e164cdddaed77e7e35412ed58ac1ee312e9d489d
2021-07-09 15:27:00 -07:00
1f4bba77b6 [fx] fix subgraph API call_module warning about no owning module (#61463)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61463

This seems like a small oversight(?): the current test fails when warnings are recorded. I discovered this when calling `graph.call_module(existing_call_module_node.target)`, which raised a warning.

Test Plan: `buck test //caffe2/test:fx`

Reviewed By: ansley

Differential Revision: D29637799

fbshipit-source-id: 2305629863230235f76a926fe2e4de480cbf853c
2021-07-09 15:25:44 -07:00
76c0f223d3 Make nnapi cat converter accept flex inputs
Summary: As title

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_cat

Reviewed By: anshuljain1

Differential Revision: D29480747

fbshipit-source-id: 161803054ff1a4c2c750fc30a5f0fc6d8a24b2c9
2021-07-09 14:27:53 -07:00
9e81d3d869 Make NNAPI linear converter accept flex inputs (#61022)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61022

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_linear

Reviewed By: anshuljain1

Differential Revision: D29480749

fbshipit-source-id: 35975861740298c9e16f866c939e7ee3c2151710
2021-07-09 14:27:51 -07:00
35b950ea98 [package] properly handle case where we are re-packaging mocked modules (#61434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61434

Mocking is the only time we introduce a "special" module to a
torch.package of our own creation. This interacts poorly with
re-packaging, since if we treat `_mock` as a regular module and try to
package it normally we will produce a broken package.

This PR teaches PackageExporter to recognize `_mock` modules and treat
them specially during the dependency walking process, thus avoiding the
issue.

Test Plan: Imported from OSS

Reviewed By: jdonald, Lilyjjo

Differential Revision: D29638283

Pulled By: suo

fbshipit-source-id: 37a7ffa34da8bb665f679fbd72aa3d71154b2209
2021-07-09 14:27:49 -07:00
4f4beb8286 Add Model Parallel Support to ZeRO (#61370)
Summary:
**Overview:**
The existing `ZeroRedundancyOptimizer` implementation assumes that all model parameters are stored on the same device (due to the recent [refactor](https://github.com/pytorch/pytorch/pull/59834)). This change allows model parameters to be sharded across multiple devices, as in the DDP with Model Parallelism example [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

The only logic affected is the bucketing strategy used when `parameters_as_bucket_view=True`. Let `n` denote the world size and `k` denote the number of devices per process.
- Previously, `k = 1`, and `self._buckets` was a `List[torch.Tensor]`, where `self._buckets[j]` is a tensor (i.e. bucket) containing the parameters assigned to rank `j` for `j = 0, ..., n - 1`.
- Now, `self._buckets` is a `List[List[torch.Tensor]]`, where `self._buckets[i][j]` is a tensor containing the parameters stored on device `i` assigned to rank `j` for `i = 0, ..., k - 1` and `j = 0, ..., n - 1`.

This bucket construction uses an auxiliary data structure `self._device_to_per_rank_params`, which is a `Dict[torch.device, List[List[torch.Tensor]]]`. It maps:
- `dev_0` to `[rank 0's assigned parameters on dev_0, rank 1's assigned parameters on dev_0, ...]`,
- `...`
- `dev_{k-1}` to `[rank 0's assigned parameters on dev_{k-1}, rank 1's assigned parameters on dev_{k-1}, ...]`

I removed the invariant checker `_verify_same_param_device()` and its corresponding test since it is no longer an invariant.
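An illustrative sketch of the bucket construction under the new layout (the helper name and flattening details are invented, not the exact implementation):
```python
import torch
from typing import Dict, List

def build_buckets(
    device_to_per_rank_params: Dict[torch.device, List[List[torch.Tensor]]],
) -> List[List[torch.Tensor]]:
    # buckets[i][j] flattens the parameters on device i assigned to rank j.
    buckets: List[List[torch.Tensor]] = []
    for device, per_rank in device_to_per_rank_params.items():
        dev_buckets = []
        for rank_params in per_rank:
            if rank_params:
                flat = torch.cat([p.detach().reshape(-1) for p in rank_params])
            else:
                flat = torch.empty(0, device=device)
            dev_buckets.append(flat)
        buckets.append(dev_buckets)
    return buckets
```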

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61370

Test Plan: I added a new test `test_zero_model_parallel()` that checks for parity between a DDP model with model parallelism using `ZeroRedundancyOptimizer` and a local model with the same architecture using a local optimizer. I also verified that the existing tests still pass.

Reviewed By: soulitzer

Differential Revision: D29637132

Pulled By: andwgu

fbshipit-source-id: 07112959fa4e94a3f40e67e88cbb58ce3cd1e033
2021-07-09 14:27:47 -07:00
fb7ed24f6e [PyTorch] Try using ExclusivelyOwned in LinearAlgebra (#59420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59420

This is a sample of how we might use ExclusivelyOwned on an opt-in basis.
ghstack-source-id: 133089540

Test Plan:
1) CI to run regression tests
2) Spot-checked assembly for linalg_det_out. Rather than calling the intrusive_ptr dtor, we get the ExclusivelyOwned dtor inline. In particular, we do not get any atomic refcount decrement instructions emitted.
3) TODO: some kind of perf profiling; advice welcome

Reviewed By: ezyang

Differential Revision: D28885313

fbshipit-source-id: ae4b39ed738c41d0c4a4509a5199c040ba9aa63a
2021-07-09 14:27:45 -07:00
a5c5b56cf5 gen ExclusivelyOwned in structured kernels (#59827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59827

ghstack-source-id: 133089541

Test Plan: existing CI

Reviewed By: ezyang, janeyx99

Differential Revision: D28965922

fbshipit-source-id: ffbc1d43e5d3ab3abfad3b0830b4da1ce899f505
2021-07-09 14:26:37 -07:00
711ded688d Add a script to codemod max_tokens_total pragmas to C/C++ files (#61369)
Summary:
This PR adds a new script: `max_tokens_pragmas.py`

This is a utility script that can add/remove `max_tokens_total` pragmas from the codebase.

- [x] Implement script and test manually
- [x] Write test script

Examples:
First, change directories
```bash
cd tools/linter/clang_tidy
```

Then run the following:
```bash
cat << EOF > test/test1.cpp
// File without any prior pragmas

int main() {
    for (int i = 0; i < 10; i++);
    return 0;
}
EOF

cat << EOF > test/test2.cpp
// File with prior pragmas

#pragma clang max_tokens_total 1

int main() {
    for (int i = 0; i < 10; i++);
    return 0;
}
EOF

cat << EOF > test/test3.cpp
// File with multiple prior pragmas

#pragma clang max_tokens_total 1

// Different pragma; script should ignore this
#pragma clang max_tokens_here 20

int main() {
    for (int i = 0; i < 10; i++);
    return 0;
}

#pragma clang max_tokens_total 1
EOF

# Add pragmas to some files
python3 max_tokens_pragma.py --num-max-tokens 42 test/*.cpp
grep "#pragma clang max_tokens_total 42" test/*.cpp

# Remove pragmas from files
python3 max_tokens_pragma.py --strip test/*.cpp
grep "#pragma clang max_tokens_total 42" test/*.cpp # should fail

# Ignore files
python3 max_tokens_pragma.py --num-max-tokens 42 test/*.cpp --ignores test/test2.cpp
grep "#pragma clang max_tokens_total 42" test/*.cpp # should not list `test/test2.cpp`
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61369

Test Plan: `tools/linter/clang_tidy/test/test_max_tokens_pragma.py`

Reviewed By: malfet

Differential Revision: D29604291

Pulled By: 1ntEgr8

fbshipit-source-id: 3efe52573583769041a07e6776161d4d5bbf16a7
2021-07-09 13:30:52 -07:00
3b004aed3a Enable local clang-tidy lint (#61121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61121

This change enables the make target to run clang-tidy locally

Test Plan:
Run this command
```
make clang-tidy
```
This should run `clang-tidy` on the paths and filters specified in `tools/linter/clang_tidy/__main__.py`

Quicklint
```
make quicklint
```
This should report "No files detected" if no c/cpp files are altered.

Reviewed By: soulitzer

Differential Revision: D29598927

Pulled By: 1ntEgr8

fbshipit-source-id: aa443030494fed92c313da4b203a5450be09fa38
2021-07-09 13:30:50 -07:00
8296cb37c7 [torchelastic] Set the correct maximum border width
Summary: The diff sets the correct maximum width for the border delimiters between error sections

Test Plan: Example of the uncontrolled border: https://www.internalfb.com/intern/testinfra/diagnostics/7599824415964133.844424970500348.1625590344/

Reviewed By: kiukchung

Differential Revision: D29636814

fbshipit-source-id: 95465d3150066bff82dc7499bb1c63ea4f5ebc2d
2021-07-09 13:29:23 -07:00
6bb33d93ab disable the format library in C10 (#60052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60052

Introduction:
We would like to use the minimal implementation of C10 for our SGX port of PyTorch. This would include disabling signal handlers and the fmt library.

Problem :
When C10_SUPPORTS_SIGNAL_HANDLER is disabled there is no reason to have fmt enabled, as it is used only in stacktraceSignalHandler. The problem is that fmt/format.h is included regardless of whether C10_SUPPORTS_SIGNAL_HANDLER is disabled or not.

Solution :
Move the #include <fmt/format.h> inside the #ifdef section of code where  C10_SUPPORTS_SIGNAL_HANDLER is checked.

Test Plan: Run the pytorch unit tests.

Reviewed By: h397wang, LiJihang

Differential Revision: D29022628

fbshipit-source-id: 638cf98381585cd6059129d9c5a65d9e6a841575
2021-07-09 12:28:19 -07:00
b01329b164 [xplat] Update XNNPACK to github revision 79cd5f9 (#61400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61400

allow-large-files Update XNNPACK to github version 79cd5f9.

Test Plan:
Spark apps build works.

Hand tracking works:

https://pxl.cl/1L76g

Reviewed By: dreiss

Differential Revision: D29385882

fbshipit-source-id: 6be920a68b876faedf7e86e33df43f8b1db14a4d
2021-07-09 12:28:16 -07:00
86463a8d02 Save some little memory in default_collate (#61424)
Summary:
The savings can be significant when there are many workers and a large batch size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61424

Reviewed By: soulitzer

Differential Revision: D29635477

Pulled By: ejguan

fbshipit-source-id: 1fc48b5964e873bd8833ad81bed9d51b0b6d137e
2021-07-09 12:27:07 -07:00
c830db0265 Raise error in CMake for CUDA <9.2 (#61462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61462

Anything before CUDA 9.2 is not supported (see https://github.com/pytorch/pytorch/pull/36848), and perhaps not even that.
ghstack-source-id: 133312018

Test Plan: CI

Reviewed By: samestep

Differential Revision: D29637251

fbshipit-source-id: 4300169b7298274b2074649342902a34bd2220b5
2021-07-09 11:28:38 -07:00
b5c464d5ef Make Future store weak pointers to storages (#60943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60943

In https://github.com/pytorch/pytorch/pull/60470 we made Future store Storages rather than store references to their DataPtrs (because these references could go stale...). However this meant that the Future could keep the Storage alive, and thus keep its memory allocated, even after the user was done with it. We fix it here by instead storing a weak ptr to that Storage (well, in fact to the StorageImpl, but it's the same).
ghstack-source-id: 133295799

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29454104

fbshipit-source-id: d36dee00a4841c087bb7b3f5bc39e0459f209cdb
2021-07-09 11:28:36 -07:00
962c9fbf85 [pruner] add handles for hooks (#61425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61425

Adding handles for the activation reconstruction and bias forward hooks so they can be removed later
ghstack-source-id: 133244536

Test Plan:
This change should not affect behavior yet, but to double check:

`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1LpM9

Reviewed By: z-a-f

Differential Revision: D29619720

fbshipit-source-id: c7428d2d0325cd11ce7919e0b67321e8cc196041
2021-07-09 11:28:35 -07:00
682ebc1dd1 remove UsageError in favor of ValueError (#61031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61031

See https://github.com/pytorch/pytorch/pull/58916#issuecomment-868519515.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29626810

Pulled By: mruberry

fbshipit-source-id: 25ddf26815f9ef82b8234d7dac811a6a13a53c54
2021-07-09 11:28:33 -07:00
5401dd2f9a change language from array to tensor (#60639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60639

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29626812

Pulled By: mruberry

fbshipit-source-id: 1b0e78426fd08d7b72d890adc9811d31afd805fe
2021-07-09 11:28:31 -07:00
09c90b3589 relax type equality constraint (#60638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60638

Initial proposal in https://github.com/pytorch/pytorch/pull/58981#issuecomment-866690334. As opposed to the proposal, this PR only allows relaxing the type equality constraint to a common-superclass constraint, for example `torch.Tensor` vs `torch.nn.Parameter`. Inputs that do not share a common superclass will still fail.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29626811

Pulled By: mruberry

fbshipit-source-id: 1916c3b710d38889de7ce57eb0770c76cbbb8166
2021-07-09 11:27:32 -07:00
24a8915534 Relax use-count check to allow for 0 (#61414)
Summary:
Previously we required the tensor use count to be exactly 1. We should actually allow for a use count of zero as well. The use count is zero when an undefined tensor is returned, which is common in backward functions that have multiple outputs.

In this PR I also remove some entries from the skip list that should be covered by this change: they return multiple tensors AND are backward functions. Batch norm is also known to return undefined tensors when `training=False`.

Related issue: https://github.com/pytorch/pytorch/issues/60426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61414

Reviewed By: albanD

Differential Revision: D29614687

Pulled By: soulitzer

fbshipit-source-id: ab0892aed4bd1346b50b0a9552ffcc3287ac96af
2021-07-09 10:28:12 -07:00
9e533a62f6 Make conv2d nnapi converter accept flexible batch (#61021)
Summary:
Same as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61021

Test Plan: pytest test/test_nnapi.py::TestNNAPI

Reviewed By: anshuljain1

Differential Revision: D29480746

fbshipit-source-id: 7217c8f3a811db8c3c373f3e7ca31caf9502ef22
2021-07-09 10:28:10 -07:00
64d61901eb [ROCm] Skip test_masked_scatter_large_tensor_cuda (#61313)
Summary:
Refer to https://github.com/pytorch/pytorch/issues/60190. Skipping the unit test until the hipcub issue is fixed.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61313

Reviewed By: iramazanli

Differential Revision: D29626664

Pulled By: malfet

fbshipit-source-id: db2a390d2a3e28ec05a5032a50aa9a35c86b96ca
2021-07-09 10:27:08 -07:00
ee2dd35ef4 Resolving native dependency and try_run for cross compile (#59764)
Summary:
This is a PR on the build system that provides support for cross compiling on Jetson platforms.

The major change is:

1. Disable try runs for cross compiling in `COMPILER_WORKS`, `BLAS`, and `CUDA`, since try runs cannot be performed in a cross-compile setup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59764

Reviewed By: soulitzer

Differential Revision: D29524363

Pulled By: malfet

fbshipit-source-id: f06d1ad30b704c9a17d77db686c65c0754db07b8
2021-07-09 09:29:21 -07:00
8bd3e52e00 Add conv2d transpose NNAPI converter (#59529)
Summary:
* Conv2d transpose support
* Quantize WIP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59529

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_conv2d_transpose

Reviewed By: anshuljain1

Differential Revision: D28926335

fbshipit-source-id: 8f90182f96cee0a13c4f38331d421e1e8ac618de
2021-07-09 09:29:20 -07:00
c19adfff54 [DataLoader] Introduce ConcatMapDataPipe functional datapipe (#61010)
Summary:
As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the ConcatMapDataPipe functional datapipe for the MapDataPipe class.

We may need to discuss how to treat the datapipes with no valid length. For now, I just treat them as if they have infinite length, and `__getitem__` cannot go past them.
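A usage sketch, assuming the `concat` functional name and the `SequenceWrapper` map-style datapipe (exact names may differ in this version):
```python
from torch.utils.data.datapipes.map import SequenceWrapper

dp1 = SequenceWrapper(list(range(3)))     # 0, 1, 2
dp2 = SequenceWrapper(list(range(3, 5)))  # 3, 4
dp = dp1.concat(dp2)  # ConcatMapDataPipe via its functional form

assert len(dp) == 5
assert dp[4] == 4  # indexing spans both inputs in order
```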

Thank you for your time on reviewing this~

cc ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61010

Reviewed By: soulitzer

Differential Revision: D29587679

Pulled By: ejguan

fbshipit-source-id: 5eb97fa727209bec6c534520057c64a78000626e
2021-07-09 09:29:18 -07:00
2bbcc80de3 Enable disabling test cases on specific platforms (#61427)
Summary:
This adds functionality to our common_utils.py to allow disabling test cases on specific platforms (Mac, Windows, and Linux).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61427

Test Plan:
CI should not change as no issues currently have the line "Platforms:..."

I tested locally by making sure `test_async_script` is skipped while running `python test/test_jit.py -k TestAsync.test_async_script` with a cached modified `.pytorch-disabled-tests.json`:
```
{
  "total_count": 32,
  "incomplete_results": false,
  "items": [
    {
      "url": "https://api.github.com/repos/pytorch/pytorch/issues/60652",
      "repository_url": "https://api.github.com/repos/pytorch/pytorch",
      "labels_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/labels{/name}",
      "comments_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/comments",
      "events_url": "https://api.github.com/repos/pytorch/pytorch/issues/60652/events",
      "html_url": "https://github.com/pytorch/pytorch/issues/60652",
      "id": 929288995,
      "node_id": "MDU6SXNzdWU5MjkyODg5OTU=",
      "number": 60652,
      "title": "DISABLED test_async_script (jit.test_async.TestAsync)",
      "user": {
        "login": "ezyang",
        "id": 13564,
        "node_id": "MDQ6VXNlcjEzNTY0",
        "avatar_url": "https://avatars.githubusercontent.com/u/13564?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/ezyang",
        "html_url": "https://github.com/ezyang",
        "followers_url": "https://api.github.com/users/ezyang/followers",
        "following_url": "https://api.github.com/users/ezyang/following{/other_user}",
        "gists_url": "https://api.github.com/users/ezyang/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/ezyang/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/ezyang/subscriptions",
        "organizations_url": "https://api.github.com/users/ezyang/orgs",
        "repos_url": "https://api.github.com/users/ezyang/repos",
        "events_url": "https://api.github.com/users/ezyang/events{/privacy}",
        "received_events_url": "https://api.github.com/users/ezyang/received_events",
        "type": "User",
        "site_admin": false
      },
      "labels": [
        {
          "id": 1301397902,
          "node_id": "MDU6TGFiZWwxMzAxMzk3OTAy",
          "url": "https://api.github.com/repos/pytorch/pytorch/labels/module:%20flaky-tests",
          "name": "module: flaky-tests",
          "color": "f7e101",
          "default": false,
          "description": "Problem is a flaky test in CI"
        },
        {
          "id": 679953883,
          "node_id": "MDU6TGFiZWw2Nzk5NTM4ODM=",
          "url": "https://api.github.com/repos/pytorch/pytorch/labels/oncall:%20distributed",
          "name": "oncall: distributed",
          "color": "f7e101",
          "default": false,
          "description": "Add this issue/PR to distributed oncall triage queue"
        }
      ],
      "state": "open",
      "locked": false,
      "assignee": {
        "login": "rohan-varma",
        "id": 8039770,
        "node_id": "MDQ6VXNlcjgwMzk3NzA=",
        "avatar_url": "https://avatars.githubusercontent.com/u/8039770?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/rohan-varma",
        "html_url": "https://github.com/rohan-varma",
        "followers_url": "https://api.github.com/users/rohan-varma/followers",
        "following_url": "https://api.github.com/users/rohan-varma/following{/other_user}",
        "gists_url": "https://api.github.com/users/rohan-varma/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/rohan-varma/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/rohan-varma/subscriptions",
        "organizations_url": "https://api.github.com/users/rohan-varma/orgs",
        "repos_url": "https://api.github.com/users/rohan-varma/repos",
        "events_url": "https://api.github.com/users/rohan-varma/events{/privacy}",
        "received_events_url": "https://api.github.com/users/rohan-varma/received_events",
        "type": "User",
        "site_admin": false
      },
      "assignees": [
        {
          "login": "rohan-varma",
          "id": 8039770,
          "node_id": "MDQ6VXNlcjgwMzk3NzA=",
          "avatar_url": "https://avatars.githubusercontent.com/u/8039770?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/rohan-varma",
          "html_url": "https://github.com/rohan-varma",
          "followers_url": "https://api.github.com/users/rohan-varma/followers",
          "following_url": "https://api.github.com/users/rohan-varma/following{/other_user}",
          "gists_url": "https://api.github.com/users/rohan-varma/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/rohan-varma/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/rohan-varma/subscriptions",
          "organizations_url": "https://api.github.com/users/rohan-varma/orgs",
          "repos_url": "https://api.github.com/users/rohan-varma/repos",
          "events_url": "https://api.github.com/users/rohan-varma/events{/privacy}",
          "received_events_url": "https://api.github.com/users/rohan-varma/received_events",
          "type": "User",
          "site_admin": false
        }
      ],
      "milestone": null,
      "comments": 0,
      "created_at": "2021-06-24T14:28:33Z",
      "updated_at": "2021-06-24T16:40:42Z",
      "closed_at": null,
      "author_association": "CONTRIBUTOR",
      "active_lock_reason": null,
      "body": "Platforms:Mac, windows, Linux\r\n```\r\nJun 24 00:59:14 ======================================================================\r\nJun 24 00:59:14 ERROR [0.477s]: test_async_script (__main__.ProcessGroupGlooWrapperTest)\r\nJun 24 00:59:14 ----------------------------------------------------------------------\r\nJun 24 00:59:14 Traceback (most recent call last):\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 398, in wrapper\r\nJun 24 00:59:14     self._join_processes(fn)\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 590, in _join_processes\r\nJun 24 00:59:14     self._check_return_codes(elapsed_time)\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 633, in _check_return_codes\r\nJun 24 00:59:14     raise RuntimeError(error)\r\nJun 24 00:59:14 RuntimeError: Process 0 exited with error code 10 and exception:\r\nJun 24 00:59:14 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.17.0.2]:21400\r\nJun 24 00:59:14 \r\nJun 24 00:59:14 During handling of the above exception, another exception occurred:\r\nJun 24 00:59:14 \r\nJun 24 00:59:14 Traceback (most recent call last):\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 516, in run_test\r\nJun 24 00:59:14     getattr(self, test_name)()\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py\", line 400, in wrapper\r\nJun 24 00:59:14     fn()\r\nJun 24 00:59:14   File \"distributed/test_pg_wrapper.py\", line 270, in test_collective_hang\r\nJun 24 00:59:14     self._test_collective_hang(pg)\r\nJun 24 00:59:14   File \"distributed/test_pg_wrapper.py\", line 52, in _test_collective_hang\r\nJun 24 00:59:14     wrapper_pg.allreduce([tensor])\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/unittest/case.py\", line 217, in __exit__\r\nJun 24 00:59:14     expected_regex.pattern, str(exc_value)))\r\nJun 24 00:59:14   File \"/opt/conda/lib/python3.6/unittest/case.py\", line 135, in _raiseFailure\r\nJun 24 00:59:14     raise self.test_case.failureException(msg)\r\nJun 24 00:59:14 AssertionError: \"Ranks 1 failed to pass monitoredBarrier\" does not match \"[/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [172.17.0.2]:21400\"\r\n```\r\n\r\nhttps://www.internalfb.com/intern/opensource/ci/job/log/225221175921058/\n\ncc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23",
      "performed_via_github_app": null,
      "score": 0.0
    }
  ]
}
```

Reviewed By: iramazanli

Differential Revision: D29627799

Pulled By: janeyx99

fbshipit-source-id: 5ef79127cbe0055c4f41766048e66f98cf80d2c4
2021-07-09 09:29:16 -07:00
e9a40de1af Add other Linux GPU auxiliary test jobs (#61055)
Summary:
- [x] add the jobs to the matrix
  - [x] `jit_legacy`
  - [x] `nogpu_NO_AVX`
  - [x] `nogpu_NO_AVX2`
  - [x] `slow`
- [x] use the test config properly to enable the different test conditions
- [x] validate that it works
- [x] disable on pull requests before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61055

Test Plan: CI. Example run: https://github.com/pytorch/pytorch/actions/runs/1013240987

Reviewed By: walterddr

Differential Revision: D29594080

Pulled By: samestep

fbshipit-source-id: 02c531ebc42feae81ecaea0785915f95e0f53ed7
2021-07-09 09:29:15 -07:00
c966ce6933 Fix several test_ops cuda dtypes tests (#60922)
Summary:
Close https://github.com/pytorch/pytorch/issues/60443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60922

Reviewed By: jdonald, iramazanli

Differential Revision: D29630122

Pulled By: mruberry

fbshipit-source-id: 441f79828860282e5849a2565facf9e7f72912e8
2021-07-09 09:29:13 -07:00
5e9bcf9101 fix: support removing hook in the hook (#61250)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/58354

Problem:
Once a hook is called
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L51-L54)

If the hook has `handle.remove()` while executing and if there are no references to the hook function object then `python` is free to garbage collect.

At the subsequent call to
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L54)

we have `hook` pointing to invalid memory

Thus when we try to fetch the name for `hook` from `check_single_result` with
05c1e5b655/torch/csrc/autograd/python_hook.cpp (L175-L177)
we get segfault.

Solution:
Temporarily increase the life-time of hook with `Py_INCREF` till we have verified the result.
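
A minimal sketch of the pattern from the linked issue that used to segfault:

```python
import torch

t = torch.randn(3, requires_grad=True)

def hook(grad):
    handle.remove()  # remove the hook from inside the hook itself
    return grad

handle = t.register_hook(hook)
t.sum().backward()  # previously this pattern could segfault; now the hook
                    # is kept alive until its result has been checked
```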

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61250

Reviewed By: iramazanli

Differential Revision: D29623826

Pulled By: soulitzer

fbshipit-source-id: c71322311f19066cafb7203980668868c59d4e5e
2021-07-09 09:27:58 -07:00
179249084b Refactor DDP join() API, adding hooks (#60757)
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.

**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)

There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.

The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.

The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
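
A minimal sketch of a custom hook against this (private) API, using the entry points described above; the subclass name and the import path are illustrative assumptions:

```python
import torch
import torch.distributed as dist
# NOTE: the import path below is an assumption; `_Join`/`_JoinHook` are private.
from torch.distributed.algorithms.join import _Join, _JoinHook

class MyJoinHook(_JoinHook):  # illustrative name
    def __init__(self, process_group, device):
        self._process_group = process_group
        self._device = device

    def main_hook(self):
        # Called repeatedly while some process has not yet joined: shadow the
        # all-reduce that each non-joined process performs per iteration.
        dist.all_reduce(torch.zeros(1, device=self._device), group=self._process_group)

    def post_hook(self, is_last_joiner: bool):
        # Called once after all processes have joined; `is_last_joiner` can be
        # used to pick an authoritative rank to synchronize final state from.
        pass

    @property
    def process_group(self):
        return self._process_group

    @property
    def device(self):
        return self._device

# Training loop wrapped as described above (sketch):
# with _Join([MyJoinHook(pg, device)]):
#     for inputs in loader:
#         dist.all_reduce(torch.ones(1, device=device))  # "not yet joined" signal
#         train_step(inputs)
```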

**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in  the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757

Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```

Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.

Reviewed By: iramazanli, mrshenli

Differential Revision: D29624636

Pulled By: andwgu

fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
2021-07-09 08:29:20 -07:00
8423ab4f99 Fix CosineAnnealingWarmRestart annotation (#61106)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44770.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61106

Reviewed By: 1ntEgr8

Differential Revision: D29635764

Pulled By: walterddr

fbshipit-source-id: ddc45a7f04532a76d033ae7774706da1fa8608f7
2021-07-09 08:28:18 -07:00
9b908ab0d0 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D29631829

fbshipit-source-id: 6cef1a3a091bdf0e10838d05b2e82fc0760ebe48
2021-07-09 05:28:44 -07:00
819bac63ff [Codemod][FBSourceBlackLinter] Daily arc lint --take BLACK
Reviewed By: zertosh

Differential Revision: D29632524

fbshipit-source-id: 3eccc1804a7bf953480b9754f68ea56a2a8e3fd8
2021-07-09 05:27:29 -07:00
14f63763c1 Avoid using mp.Manager to report #GPUs needed in dist tests (#61409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409

We used a multiprocessing.Manager in order to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited. While originally this mapping was immutable, at some point we allowed children to modify the parent's value of that mapping so they could update the message for the `multi-gpu` error to make it reflect how many GPUs were really needed. This occurred in D23285790 (2a4d312027). Since then this Manager has proved to be quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (f0c46878c6), D23641618 (567c51cce9), D28490129, D28794321 (0128eb9a85) and D29585862). This seems like an awful lot of trouble for such a small piece of functionality. Here I propose we drop the Manager and instead get the same result by using separate error codes for each number of GPUs. It should be much simpler and thus more robust.
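
A minimal sketch of the new scheme, with illustrative exit-code values (`TestSkip` mirrors the namedtuple in `common_distributed.py`):

```python
import sys
from collections import namedtuple
import torch

TestSkip = namedtuple("TestSkip", ["exit_code", "message"])

# One unique exit code per required GPU count, so the parent can decode the
# reason directly from the child's return code, with no shared state.
TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(2, 9)
}

def skip_if_lt_x_gpu(x):
    if torch.cuda.device_count() < x:
        sys.exit(TEST_SKIPS[f"multi-gpu-{x}"].exit_code)
```
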
ghstack-source-id: 133236447

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29612614

fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
2021-07-09 01:29:35 -07:00
905cd6733e [DDP Comm Hook] Re-enable the optimization of fusing copy and division when no comm hook is specified (#61379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61379

The optimization was accidentally removed in https://github.com/pytorch/pytorch/pull/59574

This optimization can help save a scan over all the input parameters, by fusing copy and div operations.

Now the default temporary hook is allreduce by sum, and no extra division is done inside the hook.
ghstack-source-id: 133288529

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_DistributedDataParallel_non_default_stream

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_sparse_gradient

buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_once
buck test mode/dev-nosan caffe2/test/distributed:c10 -- test_ddp_checkpointing_twice

Reviewed By: rohan-varma

Differential Revision: D29597614

fbshipit-source-id: 2434e4fd4e6abad7871cfe47886fe97b6e4ba28f
2021-07-09 01:29:33 -07:00
8f61d94610 Fix a variable initialization (#60896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60896

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29431625

fbshipit-source-id: 076d5ed350507b3ab1f14c1a5c7700de0427eefc
2021-07-09 01:29:31 -07:00
15010bf223 Make some downcast issues explicit (#60412)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60412

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29243195

fbshipit-source-id: c508b729d6a0e6f8a591521bce788e6cfd8531f8
2021-07-09 01:29:29 -07:00
6a3170dba1 [package] minor cleanups to internal APIs (#61428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61428

I was reading this code again after a while and didn't understand it as
quickly as I would have liked. Some of the function names are no longer
accurate, etc.

This PR renames these functions to be in the same language of
"dependencies" that the rest of the API uses. I think the resulting
usage of the APIs is clearer than before.

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D29620946

Pulled By: suo

fbshipit-source-id: 7df640a7ffbd43998063b9ee3955c9dfcbc42cfb
2021-07-09 01:28:24 -07:00
d52ebf2b1b conv2d (#61093)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61093

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29562478

Pulled By: migeed-z

fbshipit-source-id: d41f3a9526ee52a9571cb861be03bf9ae176a373
2021-07-08 20:29:32 -07:00
5fbc853c5f [package] PackageExporter remove verbose mode (#61145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61145

Remove 'verbose' mode from PackageExporter as people have complained that it is not useful.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29559681

Pulled By: Lilyjjo

fbshipit-source-id: eadb1a3a25fadc64119334a09bf1fa4b355b1edd
2021-07-08 18:26:43 -07:00
a74516d699 [static runtime] implement aten::log (#61393)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61393

Test Plan:
Added `StaticRuntime.IndividualOps_Log`

```
...
[ RUN      ] StaticRuntime.IndividualOps_Log
V0701 12:10:50.829100 3708165 impl.cpp:455] StaticModuleOptions: cleanup_activations 1, enable_out_variant 1, optimize_memory1, optimize_graph_output_memory0
V0701 12:10:50.888468 3708165 impl.cpp:1279] Switch to out variant for node: %3 : Tensor = aten::log(%inp.1)
V0701 12:10:50.889098 3708165 impl.cpp:1279] Switch to out variant for node: %a.1 : Tensor = aten::clone(%3, %2)
```

Reviewed By: hlu1

Differential Revision: D29511622

fbshipit-source-id: 819fd7d90c084609a060efeadb3015e35acac517
2021-07-08 18:25:35 -07:00
06dfaadfc6 update internal function names that apply to both cpu and cuda (#59701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59701

These functions have been updated to work for both CPU and CUDA; their names are now changed to reflect that.

quantize_per_channel_cpu -> quantize_per_channel
dequantize_quantized_cpu -> dequantize_quantized

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_quantization.py TestQuantizedTensor

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D29018270

fbshipit-source-id: 3a0da8d5e3f357dcf19119bcdbc6172d41f2b0c1
2021-07-08 17:26:25 -07:00
8726f08e15 [ONNX] Update documentation (#58712) (#60249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60249

* Add introductory paragraph explaining what ONNX is and what the
  torch.onnx module does.
* In "Tracing vs Scripting" and doc-string for torch.onnx.export(),
  clarify that exporting always happens on ScriptModules and that
  tracing and scripting are the two ways to produce a ScriptModule.
* Remove examples of using Caffe2 to run exported models.
  Caffe2's website says it's deprecated, so it's probably best not to
  encourage people to use it by including it in examples.
* Remove a lot of content that's redundant:
  * The example of how to mix tracing and scripting, and instead
    link to Introduction to TorchScript, which includes very similar
    content.
  * "Type annotations" section. Link to TorchScript docs which explain
    that in more detail.
  * "Using dictionaries to handle Named Arguments as model inputs"
    section. It's redundant with the description of the `args` argument
    to `export()`, which appears on the same page once the HTML
    is generated.
  * Remove the list of supported Tensor indexing patterns. If it's not
    in the list of unsupported patterns, users can assume it's
    supported, so having both is redundant.
  * Remove the list of supported operators and models.
    I think the list of supported operators is not very useful.
    A list of supported model architectures may be useful, but in
    reality it's already very out of date. We should add it back if
    / when we have a system for keeping it up to date.
  * "Operator Export Type" section. It's redundant with the description
    of the `operator_export_type` arg to to `export()`, which appears on
    the same page once the HTML is generated.
  * "Use external data format" section. It's redundant with the
    description of the `use_external_data_format` arg to `export()`.
  * "Training" section.  It's redundant with the
    description of the `training` arg to `export()`.
* Move the content about different operator implementations producing
  different results from the "Limitations" section into the doc for the
  `operator_export_type` arg.
* Document "quantized" -> "caffe2" behavior of
  OperatorExportTypes.ONNX_ATEN_FALLBACK.
* Combine the text about using torch.Tensor.item() and the text about
  using NumPy types into a section titled
  "Avoid NumPy and built-in Python types", since they're both
  fundamentally about the same issue.
* Rename "Write PyTorch model in Torch way" to "Avoiding Pitfalls".
* Lots of minor fixes: spelling, grammar, brevity, fixing links, adding
  links.
* Clarify limitation on input and output types. Phrasing it in terms of
  PyTorch types is much more accessible than in terms of TorchScript
  types. Also clarify what actually happens when dict and str are used
  as inputs and outputs.
* In Supported operators, use torch function and class names and link
  to them. This is more user friendly than using the internal aten
  op names.
* Remove references to VariableType.h, which doesn't appear to contain
  the information that it once did. Instead refer to the generated
  .pyi files.
* Remove the text in the FAQ about appending to lists within loops.
  I think this limitation is no longer present
  (perhaps since https://github.com/pytorch/pytorch/pull/51577).
* Minor fixes to some code I read along the way.
* Explain the current rationale for the weird ::prim_PythonOp op name.

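For reference, the minimal export call these docs revolve around (the toy module and file name are illustrative):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 4)
# Export traces the module: it is run once with `dummy_input` to produce a
# ScriptModule, which is then converted to ONNX.
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```
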
Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494912

Pulled By: SplitInfinity

fbshipit-source-id: 7756c010b2320de0692369289604403d28877719

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-07-08 16:29:32 -07:00
00b0d826a1 [ONNX] shape type inference fixes for control flow (#59319) (#60248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60248

* ~~Allow shape inference to skip for blocks by checking unsupported cases recursively. Currently onnx::Identity would trigger a shape inference failure.~~ Fixed in onnx submodule 1.9.
* Remove previous special post process for if op, since that was for constant folding, and now it is handled elsewhere. Update new post process for control flow nodes to copy assign node shape from subblock output shape correctly.

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494913

Pulled By: SplitInfinity

fbshipit-source-id: de274a388df86e86403981e1b89b8b4a0d1e26d1

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-07-08 16:29:30 -07:00
81f95cce59 [ONNX] Extend chunk for dynamic chunk values (#59644) (#60247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60247

Related to #42785

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494914

Pulled By: SplitInfinity

fbshipit-source-id: 51ddb876d00185e59cfe54a8af5a9c8dd073a09f

Co-authored-by: Shubham Bhokare <shubhambhokare@gmail.com>
2021-07-08 16:29:28 -07:00
d9dc94406f [ONNX] Add linspace symbolic (#58854) (#60246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60246

* Adds support for linspace op
* Modifies the arange symbolic in opset 9 to replicate the same dtype-determination behavior as opset 11, following https://pytorch.org/docs/stable/generated/torch.arange.html (see the example after this list)
* Enabled some arange unit tests which were disabled for opset 9
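
A small illustration of the torch.arange dtype rules the symbolic now mirrors:

```python
import torch

# Integer arguments give an integer result; floating-point arguments give the
# default floating-point dtype. linspace always produces floating point.
torch.arange(5)                # dtype=torch.int64
torch.arange(0.0, 5.0)         # dtype=torch.float32 (default dtype)
torch.linspace(0, 1, steps=5)  # dtype=torch.float32
```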

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494911

Pulled By: SplitInfinity

fbshipit-source-id: bddff18a90f8a78121c8ecdd1dafc15c69962d66

Co-authored-by: Shubham Bhokare <shubhambhokare@gmail.com>
2021-07-08 16:29:26 -07:00
4ccfa3ffeb [ONNX] Fix sum export with attribute keepdims (#59316) (#60245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60245

Fix after b9bdb07a0261ab5a0b1038f290fa03af6ce0415f. This improves the previous fix in two aspects:
* Not only check the first dimension for 0 when detecting an empty tensor.
* Do not assume an empty tensor when the shape is not accessible.

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494917

Pulled By: SplitInfinity

fbshipit-source-id: 02587c3c3be0510312c1a1959f28cab12d81812d

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-07-08 16:29:24 -07:00
95a7f3ccfe [ONNX] Fix shape inference for large model (#59320) (#60244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60244

Do 2GB size check for protocol buffer serialization at a later time, to avoid false alarming for cases like shape inference where no serialization actually happens.

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494910

Pulled By: SplitInfinity

fbshipit-source-id: 4c36d26de9a94e5d6cf78f332d4dffc46588ebf0

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-07-08 16:29:22 -07:00
9636c077c3 [ONNX] Handle onnx::Size in ComputeConstant folding (#59122) (#60243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60243

Handle onnx::Size in ComputeConstant folding

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494915

Pulled By: SplitInfinity

fbshipit-source-id: 9782e356f5e36ae1dd2819412f970010360e9cc0

Co-authored-by: jiafatom <jiafa@microsoft.com>
2021-07-08 16:29:21 -07:00
38c48e42c6 [Reland][BE] add test wall time report (#61389)
Summary:
This is a reland of https://github.com/pytorch/pytorch/issues/61322.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61389

Reviewed By: malfet

Differential Revision: D29601573

Pulled By: walterddr

fbshipit-source-id: dfb2bdc7d72d493c01b9dbac50ef9b79c1782054
2021-07-08 16:29:19 -07:00
7481c6fc02 Bump googletest version to v1.11.0 (#61395)
Summary:
This PR bumps the `googletest` version to v1.11.0.

To facilitate this change, `CAFFE2_ASAN_FLAG` and `CAFFE2_TSAN_FLAG` are divided into corresponding compiler and linker variants. This is required because `googletest v1.11.0` sets the `-Werror` flag. The `-pie` flag is a linker flag, and passing it to a compiler invocation results in a `-Wunused-command-line-argument` warning, which in turn will cause `googletest` to fail to build with ASAN.

Fixes https://github.com/pytorch/pytorch/issues/60865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61395

Reviewed By: iramazanli

Differential Revision: D29620970

Pulled By: 1ntEgr8

fbshipit-source-id: cdb1d3d12e0fff834c2e62971e42c03f8c3fbf1b
2021-07-08 16:29:17 -07:00
13658b10bb [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove unnecessary `-use_env` warnings and move the remaining `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py ->warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

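A script adapted per that warning reads the rank from the environment instead of a CLI flag; a minimal sketch:

```python
import os

# Read the local rank the launcher exports, instead of parsing --local_rank.
local_rank = int(os.environ["LOCAL_RANK"])
```
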
New section:

{F628923078}

{F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
2021-07-08 16:28:06 -07:00
10f372601d Support RRefs that contain torch.cuda.Event (#61354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61354

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29617155

Pulled By: pbelevich

fbshipit-source-id: 6e56b3fd0a0f93ecec048b58c90f2a47b4cba688
2021-07-08 15:33:08 -07:00
8bc2ba3fe3 detect missing kernels from external backends in codegen (#60737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60737

Test Plan: Imported from OSS

Reviewed By: ezyang, jdonald

Differential Revision: D29392615

Pulled By: bdhirsh

fbshipit-source-id: d49d013243dbc8c8b55fbdb0b9b3eed38df52255
2021-07-08 15:33:04 -07:00
7318747a3b move all external kernels into a class for better compiler error messages (#59839)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59839

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29047680

Pulled By: bdhirsh

fbshipit-source-id: 18cf4124be440a0a343b5983e1a4165db808e7c1
2021-07-08 15:31:02 -07:00
86eac5b456 [caffe2] Check for number of created subnets and optionally throw an error (#57366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57366

We often get error messages such as
```
Model failed AOT (glow ahead-of-time compilation) with exception: Error during AOT optimization (non-provisioned addNetwork):
Non-recoverable device error when adding network:
Error code: PARTITIONER_ERROR
Error message: Did not find a partition with an SLS node

Error return stack:
--------------------------------------------------------------------------------
glow/glow/lib/Partitioner/Partitioner.cpp:1244
--------------------------------------------------------------------------------
glow/glow/lib/Runtime/HostManager/HostManager.cpp:375
--------------------------------------------------------------------------------
```
This makes the error message clearer by checking the number of OnnxifiOp instances created before going into Glow. The check is enabled with the `verify_only_single_subnet` flag, and is disabled by default.

Test Plan: Unit tests pass

Reviewed By: khabinov

Differential Revision: D28097674

fbshipit-source-id: 0eefd8f6ec1a82546b759be8e541256bf271a673
2021-07-08 14:29:03 -07:00
0fc110cdd1 [CUDA graphs] Don't sync between replays for cuda driver version 11.4+ (#61063)
Summary:
The bug in libcuda.so that required https://github.com/pytorch/pytorch/pull/57556 is fixed for libcuda.so versions >= 11.4.

This PR changes replay() to sync after each launch only if the process's in-use libcuda.so is < 11.4.

With all the "enhanced" and "forward" compatibility promises flying around, and the fact that "driver" sometimes means kernel-mode driver and sometimes means user-mode driver (libcuda.so), I wasn't sure if this PR's check suffices to trigger the sync iff the in-use libcuda.so is < 11.4, but Cuda people say what I wrote is reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61063

Reviewed By: mruberry

Differential Revision: D29600907

Pulled By: ngimel

fbshipit-source-id: 71bf0bcbde43091e29f3812440abeb7a95d161e2
2021-07-08 13:26:07 -07:00
80797d03e0 Simplify lambda syntax in SegmentReduce.cpp (#61416)
Summary:
Fixes the Windows build by dismantling a combination of nested lambdas + preprocessor magic into explicit templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61416

Reviewed By: pbelevich

Differential Revision: D29616449

Pulled By: malfet

fbshipit-source-id: 687ef73b8b37bc272f82d44fc690448e403e3a0c
2021-07-08 12:30:35 -07:00
cdc027679b Add compare_set in distributed docs (#61351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61351
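
For reference, a minimal sketch of the `compare_set` API being documented (the store setup and values are illustrative; exact return semantics per the docs):

```python
import torch.distributed as dist

# Single-process TCPStore: (host, port, world_size, is_master).
store = dist.TCPStore("127.0.0.1", 29500, 1, True)
store.set("key", "first")
# Replaces the value only if the current value matches the expected one.
print(store.compare_set("key", "first", "second"))  # b'second' (value now stored)
```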

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29588206

Pulled By: H-Huang

fbshipit-source-id: 9db48e7b6de29503275f10616470ad2d66b075f9
2021-07-08 12:30:32 -07:00
f01a4e3b02 .github: Ensure build-results per job is unique (#61005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61005

build-results have the potential to be tainted between jobs since runs
are not ephemeral

Signed-off-by: Eli Uriegas <seemethere101@gmail.com>

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29526747

Pulled By: seemethere

fbshipit-source-id: f8c5bc5f647b771a059cbe380d694ce6dc535ae4
2021-07-08 12:30:28 -07:00
4beb5f9ad6 [DDP Comm Hook] Fix some comments (#61376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376

Now that SPMD is retired, the API `get_tensors` becomes `get_tensor`. This fixes some comments that refer to the obsolete API.

The `allreduce` hook example does not do the division inside, which is actually incorrect.
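
A sketch of an allreduce hook that performs the division itself, per the corrected comments; the future-chaining pattern follows the default hooks of this era, so treat the details as illustrative:

```python
import torch
import torch.distributed as dist

def allreduce_hook(state, bucket):
    group = state if state is not None else dist.group.WORLD
    tensor = bucket.get_tensor().div_(group.size())  # divide before reducing
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])
```
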
ghstack-source-id: 133174272

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D29596857

fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
2021-07-08 12:30:24 -07:00
dfe25069a8 [ROCm] Skip test_*_stress_cuda test for ROCm (#60490)
Summary:
Skipping test_*_stress_cuda tests because they sometimes fail for ROCm

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60490

Reviewed By: SciPioneer

Differential Revision: D29595552

Pulled By: rohan-varma

fbshipit-source-id: fee18204775211747337985c472ab1084a71f2f1
2021-07-08 12:28:06 -07:00
9310f6bac1 Use our own statically stored vs_buildtools.exe (#61372)
Summary:
We might be getting rate-limited on our VS install requests, leading to HUD failures. This PR moves to curling the installer from our own S3 bucket, so we won't get rate-limited.

This PR also upgrades our vs_install to 16.8.6 from 16.8.5 as moving to S3 didn't help, but moving to the newer installer did.

The CI passes the VS install now, but fails on a build error that I don't think is relevant: https://github.com/pytorch/pytorch/pull/61372/checks?check_run_id=3013140957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61372

Reviewed By: iramazanli

Differential Revision: D29597204

Pulled By: janeyx99

fbshipit-source-id: 3eb52da308451271ea80120bbf2e511fb781b5dc
2021-07-08 11:27:02 -07:00
ac5b910600 clang-tidy patch (#60714)
Summary:
Three changes made here:
1. Set `LANG=C.UTF-8` for clang-tidy so we can properly decode symbols in comments;
2. In case a file is removed, `end` could be null and we should skip the chunk/file;
3. Tiny bug fix for the loop indent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60714

Reviewed By: iramazanli

Differential Revision: D29617171

Pulled By: 1ntEgr8

fbshipit-source-id: b1603929333529a174105baf51e18246d09c012e
2021-07-08 11:16:00 -07:00
074c776011 Force mypy colors in CI (#61391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61391

Both the [GitHub Actions log viewer](https://github.community/t/ansi-color-output-in-webview/17621) and the HUD PR page log viewer support ANSI color codes so turn those on via this [secret env variable](https://github.com/python/mypy/issues/7771)

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29602686

Pulled By: driazati

fbshipit-source-id: e8f4cd71572cc068927e6719534e64773cb16c7f
2021-07-08 11:08:38 -07:00
c76eba650a [bootcamp][pytorch][WIP] Support embedding_bag_byte_rowwise_offsets in cuda (#61075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61075

Completed the implementation of embedding_bag_byte_rowwise_offsets and wrote a randomized test comparing GPU and CPU kernel outputs.

Test Plan:
```
buck build mode/opt --show-full-output  //caffe2/torch/fb/sparsenn:gpu_test
/data/users/johnsonpaul/fbsource/fbcode/buck-out/gen/caffe2/torch/fb/sparsenn/gpu_test#binary.par -r test_embedding_bag_byte_rowwise_offsets
```

Reviewed By: hyuen

Differential Revision: D29218597

fbshipit-source-id: 786260466ab4e8e3d89540496bd8a38be14c5c1b
2021-07-08 10:51:50 -07:00
9ef1c64907 [PyTorch][Edge] Tests for QuantizationFx API on lite modules (#60476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60476

# Context
Add tests for Lite modules that are quantized using fx API

Read these posts for details about why we need a test bench for quantized lite modules
https://fb.workplace.com/groups/2322282031156145/permalink/4289792691071726/

https://github.com/pytorch/pytorch/pull/60226#discussion_r654615851

moved common code to `caffe2/torch/testing/_internal/common_quantization.py`

ghstack-source-id: 133144292

Test Plan:
```
~/fbsource/fbcode] buck test caffe2/test:fx_quantization_lite
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss
Building: finished in 8.3 sec (100%) 11892/11892 jobs, 2 updated
  Total time: 8.6 sec
More details at https://www.internalfb.com/intern/buck/build/ffb7d517-d85e-4c8f-9531-5e5d9ca1d34c
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: d79a5713-bd29-4bbf-ae76-33a413869a09
Trace available for this run at /tmp/tpx-20210630-105547.675980/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/3096224749578707
    ✓ ListingSuccess: caffe2/test:fx_quantization_lite - main (9.423)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_embedding (mobile.test_quantize_fx_lite_script_module.TestFuseFx) (10.630)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_submodule (mobile.test_quantize_fx_lite_script_module.TestFuseFx) (12.464)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_conv2d (mobile.test_quantize_fx_lite_script_module.TestFuseFx) (12.728)
Summary
  Pass: 3
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/3096224749578707
```

Reviewed By: iseeyuan

Differential Revision: D29306402

fbshipit-source-id: aa481e0f696b7e9b04b9dcc6516e8a390f7dc1be
2021-07-08 10:40:08 -07:00
179b3ab88c [cuDNN] Enable cudnn_batchnorm_spatial_persistent for BatchNorm3d channels_last_3d (#59129)
Summary:
This PR enables the use of cuDNN BatchNorm spatial persistent algorithm for BatchNorm3d (5-D tensor) in channels_last_3d format, aka NDHWC. Performance and numerical accuracy are tested.

- [x] Performance check for common shapes.
- [x] Numerical accuracy check for (1 million) random shapes
    https://github.com/xwang233/code-snippet/tree/master/batchnorm3d-channels-last/A100
    https://github.com/xwang233/code-snippet/tree/master/batchnorm3d-channels-last/V100
- [ ] Convergence check for common 3D models
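
Opting a 5-D input into the NDHWC layout this path covers looks roughly like this (shapes illustrative):

```python
import torch

bn = torch.nn.BatchNorm3d(8).cuda()
x = torch.randn(2, 8, 4, 16, 16, device="cuda")
x = x.to(memory_format=torch.channels_last_3d)  # NDHWC layout
out = bn(x)
```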

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59129

Reviewed By: mruberry

Differential Revision: D29593309

Pulled By: ngimel

fbshipit-source-id: 2caf282c6cf2f426aa14a24f94e6bddada68ddac
2021-07-07 21:28:29 -07:00
0222291544 Fix docs for ShardMetadata. (#61388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61388

The doc for the `placement` argument was outdated and is now fixed.
ghstack-source-id: 133184441

Test Plan: waitforbuildbot

Reviewed By: wanchaol

Differential Revision: D29601316

fbshipit-source-id: a0817f799382bf91a5192c54dfeea4d253eb0d56
2021-07-07 21:27:30 -07:00
7011513d23 Enable sparse_csr.to_dense() for bool, float16, bfloat16 and complex (#60657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60657

Fixes https://github.com/pytorch/pytorch/issues/60648
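
A small illustration of the newly supported conversion, with illustrative data:

```python
import torch

crow_indices = torch.tensor([0, 2, 4])
col_indices = torch.tensor([0, 1, 0, 1])
values = torch.tensor([True, False, True, True])  # bool values, newly supported
csr = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))
print(csr.to_dense())
```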

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29408102

Pulled By: cpuhrsch

fbshipit-source-id: 406505c1c52c0eada934833f9723f58fa67e9256
2021-07-07 19:29:19 -07:00
5054cb8934 fix torch.cat bug with boxed CPUFallback (#60993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60993

Fixes https://github.com/pytorch/pytorch/issues/60902

The boxed fallback was written to assume that there was at least one tensor argument, which it used to figure out what device to move the CPU tensors to. That fails with an op like `torch.cat()`, which doesn't have any tensor arguments, but instead has a single `TensorList` argument.

I also added handling to gracefully deal with the case where you have an empty list of tensors - in that case we don't know what device to move everything to, but that doesn't matter because an empty list of tensors implies that we have no tensors to move anyway.
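
A Python sketch of the fixed device-inference logic; the real fallback is C++, so names here are illustrative:

```python
import torch

def infer_target_device(args):
    # Plain Tensor arguments take priority, as before.
    for a in args:
        if isinstance(a, torch.Tensor):
            return a.device
    # Otherwise look inside TensorList arguments (the torch.cat case).
    for a in args:
        if isinstance(a, (list, tuple)):
            for t in a:
                if isinstance(t, torch.Tensor):
                    return t.device
    return None  # empty TensorList: nothing needs moving anyway
```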

I tested it out though and noticed that `torch.cat(())` doesn't handle empty lists well anyway (erroring out in the dispatcher). I'm not sure that it's a huge issue, and not even sure that we want to fix it (default to CPU? add an extra codegen'd check into every op that only takes TensorList args?) but I'll file a separate bug for that: https://github.com/pytorch/pytorch/issues/60997

I tested it by running the pytorch/xla suite after removing `cat` from `xla_native_functions.yaml`, and confirming that we don't segfault anymore.

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D29471577

Pulled By: bdhirsh

fbshipit-source-id: 58c96e8d48d993785b8d15cfa846ec745a34e623
2021-07-07 19:29:17 -07:00
141bfbef86 [iOS GPU] Add tanh and clamp to support GAN (#61383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61383

Since we already have support for hardtanh, it's easy to add support for clamp. GPU is ~40% faster.
ghstack-source-id: 133113272

Test Plan:
- CI
- buck test pp-macos

Reviewed By: dhruvbird

Differential Revision: D29572933

fbshipit-source-id: d22ec09e18d02456440f552067c9a8aea9a1a8ab
2021-07-07 19:29:16 -07:00
4937d9fd6f Fix Dispatching not considering List[Optional[Tensor]] for dispatch (#60787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60787

Fixes #60461.

Previously, when one calls `self.index(indices)` using a regular `self`
Tensor and `BatchedTensor` indices, the dispatcher would not dispatch
to the Batched key. This is because the dispatcher did not extract
dispatch keys from `indices`.

Similar to #58283 and #58296, this PR modifies the dispatcher to extract
dispatch keys from List[Optional[Tensor]] arguments. We do this for both
boxed and unboxed kernels.
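
For context, the call pattern affected; standard advanced indexing lowers to `aten::index.Tensor`:

```python
import torch

x = torch.randn(4, 3)
idx = torch.tensor([0, 2])
# The `indices` argument of aten::index.Tensor is a List[Optional[Tensor]];
# dispatch keys are now also extracted from that list.
out = x[idx]
```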

Test Plan:
- run the test case in
https://gist.github.com/zou3519/4421df7c5271376a0ef53ca857b18740
(requires functorch). After this PR, it raises `RuntimeError: Batching
rule not implemented for aten::index.Tensor. We could not generate a
fallback.`, which shows that dispatch happened on the Batched key.
- Taking suggestions for how to write a test for this in core

Reviewed By: jbschlosser

Differential Revision: D29438611

Pulled By: zou3519

fbshipit-source-id: 77e182f763e18aa3fa857eebafa8b7f83384db71
2021-07-07 19:28:07 -07:00
426c42ba45 [package] ensure we don't write files twice to the archive. (#61371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61371

The ZIP format allows for writing multiple files with the same name. But
this is handled poorly by most tooling (including our own), so doing so
produces weird behavior depending on the implementation of the ZIP
reader.

Since we have no valid use case for writing multiple files with the same
name to a `torch.package`, just ban it.
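
A sketch of the now-banned pattern (file and resource names illustrative):

```python
from torch.package import PackageExporter

with PackageExporter("demo.pt") as exporter:
    exporter.save_text("pkg", "notes.txt", "first")
    exporter.save_text("pkg", "notes.txt", "second")  # now raises instead of
                                                      # writing a duplicate
                                                      # ZIP entry
```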

Differential Revision:
D29595518
D29595518

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: b9f5263ab47572abde233745c102af3d6143946e
2021-07-07 18:28:42 -07:00
1d1d5acbb0 [RPC] Ensure _wait_all_workers doesn't swallow exception. (#61094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61094

`_wait_all_workers` was swallowing exceptions, and as a result, if there
were any errors, it would still continue with rpc_agent.join(), which would hang
since something had already failed before.

To fix this, I've ensured that wait_all_workers throws and in that case we just
proceed with an ungraceful shutdown without joining.
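
A simplified sketch of the resulting control flow; the helper names here are hypothetical, and the real logic lives in the RPC shutdown path:

```python
def shutdown(graceful=True):
    if graceful:
        try:
            _wait_all_workers()   # now propagates failures instead of swallowing
            _agent_join()         # graceful path: join only if the wait succeeded
        except Exception:
            _ungraceful_shutdown()  # skip join(); something already failed
            raise
    else:
        _ungraceful_shutdown()
```
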
ghstack-source-id: 133160706

Test Plan:
1) Added unit test.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D29509286

fbshipit-source-id: 7c3f1c68d712ae2f63e10e0216580db8e9bcc29d
2021-07-07 18:28:41 -07:00
7b6ddb6793 [nnapi] add log_softmax (#61378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61378

Test Plan: Imported from OSS

Reviewed By: axitkhurana

Differential Revision: D29597355

Pulled By: IvanKobzarev

fbshipit-source-id: 55124749f8eeffa2b2713f7cffd5ccf965561de1
2021-07-07 18:28:39 -07:00
eb82a88d85 Add a type for test fixture world_size (#61363)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61363

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29561360

fbshipit-source-id: 821217e33adc483b1810585a2b91a2ee416513bd
2021-07-07 18:27:37 -07:00
d51b437b74 Cuda quantized tensors, support for quantize per channel (#58245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58245

This adds support for per-channel quantization.

(Note: this ignores all push blocking failures!)
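
A small illustration of the newly supported CUDA path (values illustrative):

```python
import torch

x = torch.randn(2, 3, device="cuda")
scales = torch.tensor([0.1, 0.2, 0.3], device="cuda")
zero_points = torch.zeros(3, dtype=torch.long, device="cuda")
# Per-channel quantization along dim 1, now supported on CUDA tensors.
qx = torch.quantize_per_channel(x, scales, zero_points, axis=1, dtype=torch.qint8)
print(qx.dequantize().device)  # cuda:0
```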

Test Plan:
python test/test_quantization.py TestQuantizedTensors
python test/test_quantization.py TestQuantizedTensors.test_compare_quant_dequant_device_numerics
python test/test_quantization.py TestQuantizedTensors.test_qtensor_to_device

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D29018271

fbshipit-source-id: 4f59aed98f2f8ff607154250e4e3f85592e17854
2021-07-07 17:36:53 -07:00
b1dc9c3946 Skip _cudnn_rnn_backward in codegen check (#61386)
Summary:
Fixes internal test failure encountered internally

For context see: https://github.com/pytorch/pytorch/issues/60426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61386

Reviewed By: malfet

Differential Revision: D29601031

Pulled By: soulitzer

fbshipit-source-id: 3592ca45a01e7bbaa804ab5404338191154f0fbc
2021-07-07 17:36:51 -07:00
b25c65b4f3 Revert D29589020: [pytorch][PR] adding a build_start_time_epoch to build meta info
Test Plan: revert-hammer

Differential Revision:
D29589020 (d33066ab3f)

Original commit changeset: 309fc3b01cbc

fbshipit-source-id: 9b50c1e8dd63e59ab4e593d250dfd5eeb623f0af
2021-07-07 17:35:29 -07:00
9dd1824741 Fix dispatch keys for eigh, lu_solve (#60945)
Summary:
I added a test to `test_ops.py` that verifies that the op can run correctly from different cuda devices. This test revealed that `linalg_eigh`, `linalg_eigvalsh`, `linalg_matrix_rank`, `linalg_pinv` were failing. `matrix_rank` and `pinv` are calling `eigh` internally.
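
A sketch of the multi-device scenario the new test covers (assumes at least two GPUs):

```python
import torch

# Run from a non-default device to exercise the device-guard path.
a = torch.randn(3, 3, device="cuda:1")
a = a + a.transpose(-2, -1)   # make it symmetric
w = torch.linalg.eigvalsh(a)  # previously failed without CPU/CUDA dispatch keys
```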

`linalg_eigh` and `lu_solve` internally use dispatch stubs, so they should be registered with `CPU, CUDA` dispatch keys. The generated code includes device guards in this case and the problem is not present.

Implemented a better out variant for `eigvalsh` and registered it with `CPU, CUDA` dispatch keys.

~I added a device guard to `linalg_eigh_kernel` as a fix for `eigvalsh` function. This function needs to be registered as CompositeImplicitAutograd, because it calls `at::linalg_eigh` if `at::GradMode::is_enabled()`.~

Fixes https://github.com/pytorch/pytorch/issues/60892.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60945

Reviewed By: mruberry

Differential Revision: D29589580

Pulled By: ngimel

fbshipit-source-id: 5851605958bdfc3a1a1768263934619449957168
2021-07-07 16:28:22 -07:00
fb00194030 Fix typo in common_utils.py (#61365)
Summary:
Missed this in review of https://github.com/pytorch/pytorch/pull/57953. I don't think this has affected much, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61365

Reviewed By: walterddr

Differential Revision: D29593764

Pulled By: janeyx99

fbshipit-source-id: 2c6f6aa961eabca0d8b8a7607aaae979667cca3b
2021-07-07 16:28:20 -07:00
6107cf3750 Add --jobs 0 for git submodule update (#61311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61152

Some related docs about `submodule.fetchJobs`
https://git-scm.com/docs/git-config#Documentation/git-config.txt-submodulefetchJobs

```
time git submodule update --init --recursive
________________________________________________________
Executed in  243.20 secs    fish           external
   usr time   49.64 secs  213.00 micros   49.64 secs
   sys time   29.27 secs  795.00 micros   29.27 secs
```

```
time git submodule update --init --recursive --jobs 4
________________________________________________________
Executed in  143.04 secs    fish           external
   usr time   51.06 secs  246.00 micros   51.06 secs
   sys time   30.96 secs  742.00 micros   30.96 secs
```

```
time git submodule update --init --recursive --jobs 8
________________________________________________________
Executed in  124.64 secs    fish           external
   usr time   51.76 secs  264.00 micros   51.76 secs
   sys time   30.49 secs  739.00 micros   30.49 secs

```

```
time git submodule update --init --recursive --jobs 0 # use all online cpus
 ________________________________________________________
Executed in  129.75 secs    fish           external
   usr time   51.64 secs  181.00 micros   51.64 secs
   sys time   31.49 secs  781.00 micros   31.49 secs

```

Test Plan: Imported from OSS

Reviewed By: 1ntEgr8

Differential Revision: D29560875

Pulled By: zhouzhuojie

fbshipit-source-id: 556027dffe744c66428075a8a1bf64683930aaaf
2021-07-07 16:28:18 -07:00
d33066ab3f adding a build_start_time_epoch to build meta info (#61322)
Summary:
Adding a `build_start_time_epoch` as a normal field in scribe reporting.
This should fix https://github.com/pytorch/pytorch/issues/60591.

The decision was made because:
- we would like only one build (test CI job) start time as the partition key string
  - the alternative is to report the duration on each test case individually, which would result in duplicate numeric value uploads.
- we can then easily calculate the wall-time of a test job from `MAX('time') - build_start_time_epoch` for all reporting messages with the same normal keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61322

Test Plan:
CI should report the extra normal field.

See: https://fburl.com/scuba/pytorch_test_times/pm6chz9w

Reviewed By: driazati

Differential Revision: D29589020

Pulled By: walterddr

fbshipit-source-id: 309fc3b01cbce76cd62f8ccd2eb0ecad27782b88
2021-07-07 16:27:13 -07:00
429436edbd Avoid complex-to-real cast warning in CopyBackward (#60021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60021

Dropping the imaginary component is expected and gives the correct gradient
formula, so silencing the warning is appropriate.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29589371

Pulled By: mruberry

fbshipit-source-id: 73e1511cae69207dc9abe576e2769ee1d03f1bbd
2021-07-07 15:28:38 -07:00
10b2a24508 Migrate log_sigmoid (forward and backward) to ATen (CUDA) (#60881)
Summary:
Fixes gh-24591, fixes gh-24590, closes gh-39642

Benchmarks were run with nvprof using contiguous inputs; they show improvement across the board.

#### Forward benchmarks

| Num Elements | Master (us) | This PR (us) |
|:------------:|:-----------:|:------------:|
|     10^4     |    2.5840   |    2.5230    |
|     10^5     |    4.6410   |    3.9280    |
|     10^6     |    33.772   |    23.025    |
|     10^7     |    299.67   |    206.35    |
|     10^8     |    3001.9   |    2052.8    |

#### Backward benchmarks

| Num Elements | Master (us) | This PR (us) |
|:------------:|:-----------:|:------------:|
|     10^4     |    2.7750   |    2.7080    |
|     10^5     |    5.2430   |    3.9010    |
|     10^6     |    46.198   |    32.878    |
|     10^7     |    447.18   |    296.18    |
|     10^8     |    4393.2   |    2938.0    |
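
For reference, the op under benchmark at the 10^6 size (a sketch):

```python
import torch
import torch.nn.functional as F

x = torch.randn(10**6, device="cuda", requires_grad=True)
y = F.logsigmoid(x)   # forward now runs the ATen CUDA kernel
y.sum().backward()    # backward path migrated as well
```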

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60881

Reviewed By: mruberry

Differential Revision: D29589455

Pulled By: ngimel

fbshipit-source-id: 70cd5db244bf6292e9ca367462640530a1d85f7d
2021-07-07 15:28:36 -07:00
f86460a352 Add coverage files to .gitignore (#61144)
Summary:
Fixes failures when coverage is turned on: https://github.com/pytorch/pytorch/runs/2966295169 https://github.com/pytorch/pytorch/runs/2964409741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61144

Test Plan:
```bash
$ echo hi > test/.coverage.jit.1625168654.4504092
$ git status
$
```

Reviewed By: zhouzhuojie

Differential Revision: D29530709

Pulled By: driazati

fbshipit-source-id: 0e6a1cb217c4d48f14c0c58a546f98393d2b0392
2021-07-07 15:28:35 -07:00
5e83fefdf8 [sparsity] sparsifier step tests (#60107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60107

Unit tests for sparsifier `step`

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestWeightNormSparsifier`

https://pxl.cl/1LhQP

Reviewed By: z-a-f

Differential Revision: D29167029

fbshipit-source-id: 053027ca92701097406372ef0f81d79ef28380aa
2021-07-07 15:28:33 -07:00
8881b9d852 [sparsity] sparsifier convert tests (#60105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60105

Unit tests for sparsifier `convert`

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestWeightNormSparsifier`

https://pxl.cl/1LhQ8

Reviewed By: z-a-f

Differential Revision: D29145450

fbshipit-source-id: b87b8f0d44751a7dae19d454a11b2d207a7286e2
2021-07-07 15:28:31 -07:00
ec200a60bd [sparsity] sparsifier prepare tests (#60042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60042

Unit tests for sparsifier `prepare`

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestWeightNormSparsifier`

https://pxl.cl/1LhR1

Reviewed By: z-a-f

Differential Revision: D29140945

fbshipit-source-id: 73cbf27f278ce849e3930ba6caa82bb2f64f1321
2021-07-07 15:28:30 -07:00
21ad978d4f [sparsity] rename sparsity_pattern to sparse_block_shape (#59898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59898

In `weight_norm_sparsifier`, the name of the argument `sparsity_pattern` is not intuitive for an argument describing the shape of the sparse block. It has been changed to `sparse_block_shape`.

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestWeightNormSparsifier`
https://pxl.cl/1LhRM

Reviewed By: z-a-f

Differential Revision: D29077045

fbshipit-source-id: 0cf9c5387d41ca8e839ee050d71f4fe477374143
2021-07-07 15:27:16 -07:00
aa6a8a6d21 [nnc] Add LoopNest::unsafe_fuseLoops to let users apply fusion on stmts that may violate our correctness checks (#60601)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60601

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29346128

Pulled By: huiguoo

fbshipit-source-id: 0eb143e97dc57224adeedf99981036ad836e5a03
2021-07-07 14:27:18 -07:00
8fd90f7cfd Implementing transpose for PackedTensorAccessor (#61114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61114

Matching the functionality of THCDeviceTensor::transpose. This
is the same as PR 60968 (https://github.com/pytorch/pytorch/pull/60968)
which was already approved; the state of the PR got messed up so
creating a fresh one.
ghstack-source-id: 133050553

Test Plan:
Unit tests at aten/src/ATen/test/packedtensoraccessor_test.cpp

Imported from OSS

Reviewed By: ezyang

Differential Revision: D29516530

fbshipit-source-id: 91d5bcc38381c00420825646b1c352c0d6bc8b3f
2021-07-07 14:27:16 -07:00
39a76fe73c BatchNorm2D (#61012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61012

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29562337

Pulled By: migeed-z

fbshipit-source-id: 2b848d0af607bd4f36cea2436ab2278ac4bc28d7
2021-07-07 14:26:07 -07:00
357c4d9cc4 Add a test case for findDanglingImpls (#61104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61104

This patch added a new test case for findDanglingImpls. The test case introduces a C++ extension which has a dangling impl such that findDanglingImpls can find it and output its information.

Test Plan:
python test/test_dispatch.py TestDispatch.test_find_dangling_impls_ext

Imported from OSS

Reviewed By: ezyang

Differential Revision: D29512520

fbshipit-source-id: 6883fb8f065f2c0ae0e7a1adf6fd298591497e2b
2021-07-07 13:34:16 -07:00
4d9fd8958b Support __rand__, __ror__ and __rxor__ (#59240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58120.

This PR implements `torch.Tensor.{__rand__/__ror__/__rxor__}` for the compatibility with NumPy’s interface.
(cc: mruberry, rgommers, emcastillo, kmaehashi)
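A quick illustration of what these reflected methods enable: with a plain Python int on the left-hand side, Python falls back to the tensor's reflected operator.

```python
import torch

t = torch.tensor([0b01, 0b10, 0b11])

# A non-tensor left operand now dispatches to t.__rand__ / t.__ror__ / t.__rxor__.
print(0b11 & t)  # tensor([1, 2, 3])
print(0b01 | t)  # tensor([1, 3, 3])
print(0b01 ^ t)  # tensor([0, 3, 2])
```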

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59240

Reviewed By: ngimel

Differential Revision: D29482304

Pulled By: mruberry

fbshipit-source-id: 13789202c1d8dddf8658a45381aeedcc31e2f603
2021-07-07 13:34:14 -07:00
9547e57643 Create SECURITY.md (#61356)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61356

Reviewed By: samestep

Differential Revision: D29589904

Pulled By: malfet

fbshipit-source-id: 5d79d25e35af9cb258fd6843559955360dc0cc4e
2021-07-07 13:34:12 -07:00
f84a441718 [torch][segment_reduce] Update default values when initial value is not set (#61266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61266

Same as title.
Mainly, this concludes the initially planned features for the op. The only missing functionality is reduction along an arbitrary axis (currently only axis 0 is supported).

Test Plan: Updated unit test.

Reviewed By: ngimel

Differential Revision: D29552037

fbshipit-source-id: 023c7cbf750a0671f76082708f14c05739dda07a
2021-07-07 13:34:10 -07:00
a78ad5dc4c [torch][segment_reduce] Add support for int lengths as well (#61141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61141

Currently only long is supported. This diff adds support for other index types.

Next Steps:
- Update default, refactor unit test and test non_initial value as well
- Cleanup (more tests, benchmark, update documentation)

Test Plan: updated unit test. rely on CI.

Reviewed By: ngimel

Differential Revision: D29526308

fbshipit-source-id: b4043603483851ef7e0e93b0bb02ac7849c6449d
2021-07-07 13:34:09 -07:00
423523d8bb Alias for logsumexp to special namespace (#58838)
Summary:
See https://github.com/pytorch/pytorch/issues/50345

cc: kshitij12345 Lezcano mruberry
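A minimal sanity check of the alias, assuming the usual alias contract of matching the original op:

```python
import torch

x = torch.randn(3, 4)
# The special-namespace alias should agree with the original reduction.
assert torch.allclose(torch.special.logsumexp(x, dim=1),
                      torch.logsumexp(x, dim=1))
```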

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58838

Reviewed By: malfet

Differential Revision: D29565033

Pulled By: mruberry

fbshipit-source-id: 9b715ea00c78f47b6f183357ee3c7d4c3abe4d01
2021-07-07 13:32:15 -07:00
c03f99f3ef Remove pyproject.toml (#61367)
Summary:
This reverts https://github.com/pytorch/pytorch/issues/60408, since it doesn't really give much benefit, and it ended up breaking things:

- https://github.com/pytorch/pytorch/issues/60665
- https://github.com/pytorch/pytorch/pull/60408#issuecomment-873979383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61367

Reviewed By: malfet, janeyx99

Differential Revision: D29593886

Pulled By: samestep

fbshipit-source-id: b1ba0ac7695e3eacf66a35e293080e8a1240efca
2021-07-07 12:47:45 -07:00
994ce7dbd9 Cuda quantized tensors, support for quantize per tensor (#59700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59700

implements quantized tensors in CUDA for per-tensor
quantization, along with several necessary functions
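A minimal sketch of what this enables (requires a CUDA device):

```python
import torch

x = torch.randn(4, 4, device="cuda")
# Per-tensor quantization now works directly on a CUDA tensor.
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
print(q.device, q.dequantize().device)  # both on cuda
```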

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_quantization.py TestQuantizedTensors
python test/test_quantization.py
TestQuantizedTensors.test_compare_quant_dequant_device_numerics
python test/test_quantization.py
TestQuantizedTensors.test_qtensor_to_device

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29018272

fbshipit-source-id: e07d19d6d67729c46324c2bb5946d959e6e6db8e
2021-07-07 12:40:51 -07:00
baa518e2f6 Add Int32 support for NNAPI (#59365)
Summary:
Support Int32 tensors in NNAPI converter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59365

Test Plan: Local testing with FB prod models

Reviewed By: anshuljain1

Differential Revision: D28881040

fbshipit-source-id: 2dacceffd322a21d91bfefcf2fb2ea400d952d0d
2021-07-07 12:40:49 -07:00
cf285d8eea Add aten::slice NNAPI converter (#59364)
Summary:
Add support for aten::slice op in the NNAPI model converter

* If start = 0; end = max -> identity
* Flexible shapes can be passed through
* Flexible shapes can't be sliced over
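A hedged sketch of the kind of module whose slice this converter can lower; the shape and slice bounds here are illustrative assumptions:

```python
import torch

class Slicer(torch.nn.Module):
    def forward(self, x):
        # A static slice like this maps to an NNAPI slice op; a slice with
        # start = 0 and end = max over a dimension lowers to an identity.
        return x[:, 1:4]
```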

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59364

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_slice

Reviewed By: anshuljain1

Differential Revision: D28881039

fbshipit-source-id: 3c1c630ff27b5bba6eda403d87570c61d43ae90e
2021-07-07 12:40:47 -07:00
d26372794a Add aten::detach NNAPI converter (#58543)
Summary:
* Add support for aten::detach op in the NNAPI model converter as a no-op
* Also add flexible op support for add_pointwise_simple_unary_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58543

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_detatch

Reviewed By: anshuljain1

Differential Revision: D28531942

fbshipit-source-id: 4387dbbbadd8ce6b690841f3a903e68a380b849d
2021-07-07 12:40:46 -07:00
0be228dd5f Add aten::flatten NNAPI converter (#60885)
Summary:
Add support for the aten::flatten op in the NNAPI model converter. Startup-time
variable size support isn't included, as shapes are passed as inputs to the NNAPI op.

Runtime variable size support is planned to follow soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60885

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_flatten

Reviewed By: anshuljain1

Differential Revision: D29451725

fbshipit-source-id: 8902745f7758c8cc88ad4b4ce02b8301ff894bd4
2021-07-07 12:40:44 -07:00
b297f65b66 Add aten::div NNAPI converter (#58541)
Summary:
Add support for aten::div op in the NNAPI model converter. Add variable
size input test as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58541

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_div

Reviewed By: anshuljain1

Differential Revision: D28531943

fbshipit-source-id: e96342146f6de216f7b88443618edfc54963747c
2021-07-07 12:40:42 -07:00
eab18a9a40 Add aten::to NNAPI converter (#58540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58540

Add support for aten::to op in the NNAPI model converter for simple
cases like to("cpu"), to("gpu")

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_to

Reviewed By: anshuljain1

Differential Revision: D28531941

fbshipit-source-id: 0c934f7aceaff2669307c3426efe32046d8c44f3
2021-07-07 12:40:41 -07:00
14d604a13e Add aten::softmax NNAPI converter (#58539)
Summary:
Add support for aten::softmax op in the NNAPI model converter with
flexible size

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58539

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_softmax

Reviewed By: anshuljain1

Differential Revision: D28531946

fbshipit-source-id: 8633f3e3f7f52795f9866ff16ad0867ea36a19e8
2021-07-07 12:39:31 -07:00
45ce26c397 Port isposinf & isneginf kernel to structured kernels (#60633)
Summary:
Porting `torch.isposinf` & `torch.isneginf` to structured kernel
Related https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60633

Reviewed By: saketh-are

Differential Revision: D29517528

Pulled By: bdhirsh

fbshipit-source-id: f8f62e4c203e0c54790437b5e512024bfabdddfc
2021-07-07 12:33:41 -07:00
c2b0af2560 [static runtime] Implement aten::sign (#61154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61154

Test Plan:
Added `StaticRuntime.IndividualOps_Sign`

```
[djang@devvm861.prn0 ~/local/fbsource/fbcode/caffe2] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1
...
[ RUN      ] StaticRuntime.IndividualOps_Sign
V0701 12:05:31.836099 3679080 impl.cpp:455] StaticModuleOptions: cleanup_activations 1, enable_out_variant 1, optimize_memory1, optimize_graph_output_memory0
V0701 12:05:31.898192 3679080 impl.cpp:1279] Switch to out variant for node: %3 : Tensor = aten::sign(%input.1)
V0701 12:05:31.898849 3679080 impl.cpp:1279] Switch to out variant for node: %4 : Tensor = aten::clone(%3, %2)
```

Reviewed By: hlu1

Differential Revision: D29518603

fbshipit-source-id: e47b96d037fea639c41052f3849c82bbfa5f482a
2021-07-07 12:29:25 -07:00
1262b2c4c6 fix torch.futures docstring examples (#61029)
Summary:
Running the doctests for the complete documentation hangs when it reaches the examples of `torch.futures`. It turns out these are only syntax errors, which are normally just reported. My guess is that `doctest` probably doesn't handle failures within async code well.

Anyway, while debugging this, I fixed the syntax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61029

Reviewed By: mruberry

Differential Revision: D29571923

Pulled By: mrshenli

fbshipit-source-id: bb8112be5302c6ec43151590b438b195a8f30a06
2021-07-07 11:47:55 -07:00
376dc500a9 Minor bug fix in the warning message (#61127)
Summary:
The current example code does not work. The correct one is like this: cb7d813275/torch/distributed/run.py (L266)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61127

Reviewed By: cbalioglu

Differential Revision: D29572003

Pulled By: mrshenli

fbshipit-source-id: 05b470230f3d70f8a6164edb5f92894a1112069f
2021-07-07 11:42:51 -07:00
90241d254f Automated submodule update: FBGEMM (#59968)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a2257d9471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59968

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: r-barnes

Differential Revision: D29109045

fbshipit-source-id: 386b28b28275e728ee229d4baf1ff192635d49c3
2021-07-07 11:33:57 -07:00
29ecb9f90b Don't check stride by default (#60637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60637

We now have ~three out of three~  four out of four datapoints that `check_stride` will be `partial`'ed to `False`:

- `torch` test suite: https://github.com/pytorch/pytorch/pull/58981#discussion_r639514081
- `torchvision` test suite: https://github.com/pytorch/pytorch/issues/56544#issuecomment-845352605
- `kornia`: 9041c42b41/test/utils.py (L25)
- `torch.fft`: https://github.com/pytorch/pytorch/pull/60304#pullrequestreview-687882323

Given that strides are in most cases an implementation detail, IMO we should change the default to `False`. In cases where matching strides is a requirement for closeness / equality, it can always be set to `True` manually.
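A small sketch of the behavior change, assuming the new default of `check_stride=False`:

```python
import torch

expected = torch.tensor([[1., 2.], [3., 4.]])
actual = expected.t().contiguous().t()  # same values, different strides

torch.testing.assert_close(actual, expected)  # passes: strides ignored by default
torch.testing.assert_close(actual, expected, check_stride=True)  # raises
```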

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556355

Pulled By: mruberry

fbshipit-source-id: 0029a44280d8f4369fbdb537dce3202eeee4b1d9
2021-07-07 09:55:36 -07:00
e2a3f4b560 Use maximum of tolerances in case of mismatching dtypes (#60636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60636

See https://github.com/pytorch/pytorch/pull/58981#issuecomment-866654600.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556352

Pulled By: mruberry

fbshipit-source-id: 36e97e0f338df5d17a94af078f172c668ef51ecb
2021-07-07 09:55:34 -07:00
5f18ba7075 upcast to most precise dtype within their category before the comparison (#60536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60536

`torch.isclose` does not do this for bool tensors, which results in a test failure since subtraction (`abs(actual - expected)`) is not supported for them (see #58981). Since the `dtype` is already checked at this point, we can safely move the upcasting before `torch.isclose` is invoked.
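A minimal illustration of the failure mode being fixed; this is a hedged sketch, not the exact internal code path:

```python
import torch

a = torch.tensor([True, False])
b = torch.tensor([True, True])

# abs(a - b) is undefined for bool tensors, so the closeness check has to
# upcast first; comparing after an explicit upcast works fine:
print(torch.isclose(a.to(torch.int64), b.to(torch.int64)))  # tensor([ True, False])
```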

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556356

Pulled By: mruberry

fbshipit-source-id: 4c65fad4f06cf402d6aab9dde5b127235766d5e0
2021-07-07 09:55:32 -07:00
5ac87cde30 tests for diagnostics in callable msg in torch.testing.assert_close (#60254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60254

Before, we only tested that the correct error message is returned if `msg` is passed as a callable. This adds tests that make sure that

- the inputs passed to the callable are the same inputs passed to `torch.testing.assert_close` and
- the `diagnostics` namespace has the same attributes and types as documented.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556354

Pulled By: mruberry

fbshipit-source-id: 9793c6d86fda842b6329381fc03b945eee878464
2021-07-07 09:55:30 -07:00
76d9e680d7 update docstring examples of torch.testing.assert_close (#60163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60163

Changes to the default error message in case of mismatching values need to be reflected in the examples given in the docstring. Normally this should be enforced by a [`doctest`](https://docs.python.org/3/library/doctest.html). mruberry do you know why we don't have such a check?

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556353

Pulled By: mruberry

fbshipit-source-id: 8dbc3f566f429618811b542a059d9abde9a6530b
2021-07-07 09:55:29 -07:00
9979289037 Improve error messages of torch.testing.assert_close in case of mismatching values (#60091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60091

Closes #58383. (1) and (2) are implemented. (3) was rejected. No consensus was reached on (4) and (5).

Improvements:

- Instead of calling everything "Tensors" we now use "Scalars" and "Tensor-likes" depending on the shape. Plus, we now internally have the option to adapt this identifier for example to report "Imaginary components of complex tensor-likes", which is even more expressive.
- The reported conditions "not close" and "not equal" are now determined based on `rtol` and `atol`.
- The number of mismatched elements and the offending indices are only reported in case the inputs are not scalar
- The allowed `rtol` and `atol` are only reported if `> 0`

**Example 1**

```python
torch.testing.assert_close(1, 3, rtol=0, atol=1)
```

Before:

```
AssertionError: Tensors are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 2 at 0 (up to 1 allowed)
Greatest relative difference: 0.6666666865348816 at 0 (up to 0 allowed)
```

After:

```
AssertionError: Scalars are not close!

Absolute difference: 2 (up to 1 allowed)
Relative difference: 0.6666666865348816
```

**Example 2**

```python
torch.manual_seed(0)
t = torch.rand((2, 2), dtype=torch.complex64)
torch.testing.assert_close(t, t + complex(0, 1))
```

Before:

```
AssertionError: Tensors are not close!

Mismatched elements: 4 / 4 (100.0%)
Greatest absolute difference: 1.0000000596046448 at (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.8833684352411922 at (0, 1) (up to 1.3e-06 allowed)

The failure occurred for the imaginary part.
```

After:

```
AssertionError: Imaginary components of tensor-likes are not close!

Mismatched elements: 4 / 4 (100.0%)
Greatest absolute difference: 1.0000000596046448 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.8833684352411922 at index (0, 1) (up to 1.3e-06 allowed)
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29556357

Pulled By: mruberry

fbshipit-source-id: 559d4a19ad4fc069b2b4f8cb5fc2f6058621e33d
2021-07-07 09:54:09 -07:00
e1338016dd cuSOLVER path for LU factorization in CUDA. (#56887)
Summary:
This PR adds cuSOLVER path for `torch.lu`.

Performance comparison results: https://github.com/pytorch/pytorch/issues/53879#issuecomment-830635381

Code for reproducing performance results: https://github.com/pytorch/pytorch/pull/56887#issuecomment-843212868

The following heuristics are used for choosing cuSOLVER over MAGMA:
* If batch size == 1 OR (batch size <= 8 AND shape <= 16), choose cuSOLVER over MAGMA.
* For all other cases use MAGMA.
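A minimal Python sketch of this dispatch heuristic; the actual selection logic lives in the C++ backend:

```python
def prefer_cusolver_over_magma(batch_size: int, n: int) -> bool:
    # Heuristic from this PR: cuSOLVER wins for single matrices and for
    # small batches (<= 8) of small matrices (<= 16); MAGMA otherwise.
    return batch_size == 1 or (batch_size <= 8 and n <= 16)
```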

See also https://github.com/pytorch/pytorch/issues/47953.

Following are the performance results between the MASTER branch and the current changes:

<details>

```
[-------------------------- LU factorization (ATen) torch.float64 ---------------------------]
                                     |  lu_factorize CURRENT |  lu_factorize MASTER
1 threads: -----------------------------------------------------------------------------------
      torch.Size([1, 1, 1])          |              363.9          |             284.1
      torch.Size([2, 1, 1])          |              354.8          |             271.8
      torch.Size([4, 1, 1])          |              393.7          |             278.0
      torch.Size([8, 1, 1])          |              459.3          |             279.1
      torch.Size([16, 1, 1])         |              524.2          |             288.9
      torch.Size([32, 1, 1])         |              525.1          |             281.2
      torch.Size([64, 1, 1])         |              524.5          |             281.7
      torch.Size([128, 1, 1])        |              522.8          |             285.2
      torch.Size([1, 2, 2])          |              360.4          |             277.7
      torch.Size([2, 2, 2])          |              372.9          |             279.2
      torch.Size([4, 2, 2])          |              419.4          |             278.3
      torch.Size([8, 2, 2])          |              475.7          |             279.2
      torch.Size([16, 2, 2])         |              530.0          |             299.5
      torch.Size([32, 2, 2])         |              530.0          |             294.5
      torch.Size([64, 2, 2])         |              531.0          |             291.5
      torch.Size([128, 2, 2])        |              544.4          |             292.3
      torch.Size([1, 8, 8])          |              372.6          |             292.8
      torch.Size([2, 8, 8])          |              380.9          |             296.2
      torch.Size([4, 8, 8])          |              420.0          |             293.4
      torch.Size([8, 8, 8])          |              490.6          |             294.6
      torch.Size([16, 8, 8])         |              535.6          |             296.5
      torch.Size([32, 8, 8])         |              534.7          |             302.1
      torch.Size([64, 8, 8])         |              539.1          |             305.5
      torch.Size([128, 8, 8])        |              540.7          |             296.5
      torch.Size([1, 16, 16])        |              345.0          |             303.2
      torch.Size([2, 16, 16])        |              405.0          |             306.3
      torch.Size([4, 16, 16])        |              482.8          |             305.6
      torch.Size([8, 16, 16])        |              596.3          |             305.9
      torch.Size([16, 16, 16])       |              539.6          |             304.4
      torch.Size([32, 16, 16])       |              542.2          |             305.8
      torch.Size([64, 16, 16])       |              556.1          |             311.0
      torch.Size([128, 16, 16])      |              545.1          |             308.1
      torch.Size([1, 32, 32])        |              432.7          |             342.4
      torch.Size([2, 32, 32])        |              582.6          |             341.8
      torch.Size([4, 32, 32])        |              580.4          |             344.4
      torch.Size([8, 32, 32])        |              586.5          |             343.8
      torch.Size([16, 32, 32])       |              582.9          |             346.0
      torch.Size([32, 32, 32])       |              574.4          |             343.7
      torch.Size([64, 32, 32])       |              562.8          |             350.8
      torch.Size([128, 32, 32])      |              568.3          |             349.8
      torch.Size([1, 64, 64])        |              537.1          |             518.4
      torch.Size([2, 64, 64])        |              766.5          |             539.1
      torch.Size([4, 64, 64])        |              771.6          |             551.9
      torch.Size([8, 64, 64])        |              783.4          |             556.0
      torch.Size([16, 64, 64])       |              798.8          |             555.3
      torch.Size([32, 64, 64])       |              795.6          |             548.6
      torch.Size([64, 64, 64])       |              804.2          |             580.4
      torch.Size([128, 64, 64])      |              837.6          |             616.9
      torch.Size([1, 128, 128])      |              844.7          |             848.9
      torch.Size([2, 128, 128])      |             1096.7          |             873.3
      torch.Size([4, 128, 128])      |             1117.9          |             884.8
      torch.Size([8, 128, 128])      |             1138.1          |             903.6
      torch.Size([16, 128, 128])     |             1169.1          |             943.9
      torch.Size([32, 128, 128])     |             1204.8          |             981.4
      torch.Size([64, 128, 128])     |             1336.6          |            1105.8
      torch.Size([128, 128, 128])    |             1639.4          |            1473.3
      torch.Size([1, 512, 512])      |             3714.3          |            3928.6
      torch.Size([2, 512, 512])      |             4388.3          |            4179.7
      torch.Size([4, 512, 512])      |             4765.4          |            4536.9
      torch.Size([8, 512, 512])      |             5615.2          |            5441.1
      torch.Size([16, 512, 512])     |             7203.6          |            7130.2
      torch.Size([32, 512, 512])     |            10580.5          |           10503.9
      torch.Size([64, 512, 512])     |            17374.8          |           17349.6
      torch.Size([128, 512, 512])    |            32542.3          |           32548.8
      torch.Size([1, 1024, 1024])    |            10041.5          |           14292.3
      torch.Size([2, 1024, 1024])    |            17126.6          |           16971.0
      torch.Size([4, 1024, 1024])    |            20591.0          |           20490.8
      torch.Size([8, 1024, 1024])    |            27682.8          |           27560.7
      torch.Size([16, 1024, 1024])   |            41035.2          |           41035.8
      torch.Size([32, 1024, 1024])   |            67091.8          |           67345.9
      torch.Size([64, 1024, 1024])   |           119612.3          |          119782.3
      torch.Size([128, 1024, 1024])  |           230095.5          |          230766.2

Times are in microseconds (us).

```
</details>

The main reason why a performance regression can be seen is related to this issue (https://github.com/pytorch/pytorch/issues/55122), and there seems to be no easy way to fix this (at least in this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56887

Reviewed By: ngimel

Differential Revision: D29482342

Pulled By: mruberry

fbshipit-source-id: 4fdedf21b0d5597b289e168dff61d5f5d7727fb1
2021-07-07 09:45:23 -07:00
4a544df00d Implement and benchmark a torch.optim.multi_tensor.adagrad implementation (#59155)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59155

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D29525213

Pulled By: ramvenkat98

fbshipit-source-id: 6d7e8da91c965d1f4e955a084ed875bab641dc9a
2021-07-07 08:08:32 -07:00
8bec478a9e MaxPool2d: use channels_last format for both output and indice when input is channels_last (#61245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61245

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29557884

Pulled By: ezyang

fbshipit-source-id: 0d2b8cbaaf13411eefd7d867021bd6028d40e5cc
2021-07-07 07:50:28 -07:00
66158a6e90 Enable AutogradXPU DispatchKey for Intel heterogeneous computation platform. (#61105)
Summary:
Add a string wrapper for AutogradXPU to enable this DispatchKey.
We are going to use AutogradXPU as a custom autograd backend, which needs this DispatchKey.
This string wrapper is used to map "AutogradXPU" to the corresponding DispatchKey.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61105

Reviewed By: malfet

Differential Revision: D29557697

Pulled By: ezyang

fbshipit-source-id: f0c8155decc8e2fd90741650a05de5a8b5a70121
2021-07-07 07:47:01 -07:00
a69e947ffd avg_pool3d_backward: Port to structured (#59084)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59084

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28802619

Pulled By: ezyang

fbshipit-source-id: 89a0fcdcf8976ca7c21da7a40fd26a1cba180faa
2021-07-07 07:44:17 -07:00
e4c450a4e8 The dispatch order for custom function (#60251)
Summary:
Hi, I am working on developing some custom ops.

And I found this issue:

The cause is the logic here: https://github.com/pytorch/pytorch/compare/master...zhuhaozhe:customer-op-trace?expand=1#diff-d7ade8589773904745c0cf965a19f24c940f1d36038f4c0ce85af2f3d89991dcL173-L177.
For all custom ops, the "Tracer" dispatch key gets the highest priority.

This makes custom ops and non-custom ops behave differently during dispatch. I do not understand whether there is some special reason to let custom ops "trace" first and then begin to "dispatch".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60251

Reviewed By: malfet

Differential Revision: D29577131

Pulled By: ezyang

fbshipit-source-id: a8e824029cf934f09f29638b127961a6a5c332de
2021-07-07 06:31:43 -07:00
a6fea03a8a Skip codegen checks for dequantize_self, lu_unpack, _cudnn_rnn, and .*conv.*_backward.* (#61139)
Summary:
Temporary fix for fb-internal tests. This and similar failures are being discussed here:
https://github.com/pytorch/pytorch/issues/60426

Applies the below changes:
 - This may seem counterintuitive because the storage check comes before the tensor check, but if the TensorImpl use count is not enforced, we should also not enforce the storage use count. If an op returns one of its inputs as-is, it is possible for this input to already be aliased with another tensor, and hence it would have a StorageImpl use count greater than one.
 - Also clarify in the description that use_count is not necessarily > 1; an op may, but does not necessarily, return one of its inputs as-is.
 - Allow usage of regex in skip list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61139

Reviewed By: malfet, Varal7

Differential Revision: D29564917

Pulled By: soulitzer

fbshipit-source-id: 806b7177117a573dd12f161cc80dcadac892f9d0
2021-07-07 05:21:19 -07:00
6f1455440b task 3: typecheck (#60805)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60805

Test Plan: Imported from OSS

Reviewed By: jamesr66a, VitalyFedyunin

Differential Revision: D29522885

Pulled By: migeed-z

fbshipit-source-id: 559a8a495a16e517af77fd5a0785a82e1ebb3bd7
2021-07-06 23:51:49 -07:00
9813b9bc0d Fix mypy.ini (#61333)
Summary:
Fixes CI regression caused by https://github.com/pytorch/pytorch/issues/61119
Unlike Python, `.ini` string lists cannot end with a trailing comma.

Fixes CI on master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61333

Reviewed By: bhosmer

Differential Revision: D29578696

Pulled By: malfet

fbshipit-source-id: b81e5f4c0a553299c4d4bee0a9bb73748910795f
2021-07-06 22:46:09 -07:00
f0316ec0b6 Revert D24068202: [pytorch][PR] Add typing return value to init in nn.Module
Test Plan: revert-hammer

Differential Revision:
D24068202 (506397a809)

Original commit changeset: 4cd9b6ca12b5

fbshipit-source-id: f45fcf7ee6ee9198ed6f3f34956ce68a64378c32
2021-07-06 22:15:31 -07:00
98119bfce9 task 2: ast rewrite (#60622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60622

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29493747

Pulled By: migeed-z

fbshipit-source-id: 684fcdfd3dd441e72c77bb7a4d64c18b9849a198
2021-07-06 20:15:30 -07:00
0dc40474fe Migrate glu from the THC to ATen (CUDA) (#61153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61153

Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-39586

Benchmarks
----------

The benchmarks were run with nvprof calling the operator in a loop. They show
reliable improvements for large tensors, but the TH implementation seems to fare
better for smaller tensors. For sufficiently large tensors, though, the ATen
implementation does win.

|        Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
|    128, 1000 | 0   |        2.4770       |        2.0820        |        3.0440        |         3.4680        |
|              | 1   |        2.7060       |        4.4850        |        3.3380        |         3.6250        |
|   128, 10000 | 0   |        26.531       |        21.366        |        38.083        |         34.623        |
|              | 1   |        27.680       |        30.465        |        38.943        |         35.204        |
|  128, 100000 | 0   |        292.09       |        219.56        |        355.57        |         324.49        |
|              | 1   |        260.43       |        243.08        |        332.25        |         323.37        |
| 128, 1000000 | 0   |        2475.7       |        1874.6        |        3810.1        |         3215.7        |
|              | 1   |        2586.3       |        2380.9        |        3349.9        |         3207.8        |

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29538093

Pulled By: ngimel

fbshipit-source-id: 1f66b45ec7c46fb8e680b50110a5fde6fe7faab7
2021-07-06 19:06:51 -07:00
7a4ffbd1da [FX] s/IS_SANDCASTLE/IS_FBCODE/ in tests (#61304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61304

Previously tests were unrunnable on devserver. This fixes that
ghstack-source-id: 133051811

Test Plan: waitforsadcastle

Reviewed By: Chillee

Differential Revision: D29561806

fbshipit-source-id: 6020e5b4ba72d6de1ea2563e70fdb0e604bee1a5
2021-07-06 17:20:53 -07:00
506397a809 Add typing return value to init in nn.Module (#45654)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45654

Reviewed By: driazati

Differential Revision: D24068202

Pulled By: malfet

fbshipit-source-id: 4cd9b6ca12b531311302e3cdeeab39bc45d86c94
2021-07-06 17:09:30 -07:00
9f3167ebdf task 1: annotate (#60621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60621

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29493619

Pulled By: migeed-z

fbshipit-source-id: 1bd3fb02c90ae5b394869a474b2e6b06af0d4791
2021-07-06 16:48:11 -07:00
a1ad28da10 Refactor clang_tidy.py (#61119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61119

This change spilts the clang-tidy CI job into smaller steps and uses a
refactored version of the clang_tidy.py script.

The new folder structure is as follows:
```
tools/linter/clang_tidy
|_ __main__py
|_ requirements.txt
|_ run.py
|_ setup.sh
```

`__main__.py`

This script will run `tools/linter/clang_tidy/setup.sh` if a `build`
directory doesn't exist, mimicking what used to be done as a separate
step in the CI job.

After that, it will invoke `clang-tidy` with default arguments being
declared in the script itself (as opposed to declaring them in
lint.yml).

The reasoning behind this approach is two-fold:

- Make it easier to run `clang-tidy` locally using this script
- De-duplicate the option passing

`requirements.txt`

Contains a list of additional python dependencies needed by the
`clang-tidy` script.

`setup.sh`

If a build directory doesn't exist, this command will run the necessary
codegen and build commands for running `clang-tidy`

Example usage:
```
python3 tools/linter/clang_tidy --parallel
```
Notice that we don't have to put the `.py` at the end of `clang_tidy`.

Test Plan:
Run the following command:
```
python3 tools/linter/clang_tidy --paths torch/csrc/fx --parallel
```

Reviewed By: walterddr, janeyx99

Differential Revision: D29568582

Pulled By: 1ntEgr8

fbshipit-source-id: cd6d11c5cb8ba9f1344a87c35647a1cd8dd45b04
2021-07-06 16:02:11 -07:00
81e36d02a6 Improve error message on invalid values to Distribution methods (#61056)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61056

Reviewed By: jbschlosser

Differential Revision: D29510173

Pulled By: neerajprad

fbshipit-source-id: 205ec7de6c8576a73e77ee4bf01c30e99b38a52e
2021-07-06 15:44:55 -07:00
45cc207a88 Fix breakpad build + add test canary (#60990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60990

This makes the breakpad build more explicit in its messaging and hints to cmake where to look for the library (it wasn't able to find it without `PATHS` on CI even though that works locally). This also adds a smoke test that will fail if breakpad isn't present on a CI job where it is expected (e.g. binary builds).

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29514316

Pulled By: driazati

fbshipit-source-id: 79514363334788f311ba5d4f25deed3452f0c3eb
2021-07-06 14:15:07 -07:00
b6024b9d12 More loop transforms 2
Summary: Exact duplicate of D29410111 to fix land issues.

Test Plan: Sandcastle

Reviewed By: walterddr

Differential Revision: D29538335

fbshipit-source-id: 6a4f9ac4a505339ed242af60fe7fd4ba1fda3b32
2021-07-06 13:38:10 -07:00
c74c0c5718 add thrust/host_vector.h header for cuda 11.4 build (#61004)
Summary:
Needed for the CUDA 11.4 build.

Closes https://github.com/pytorch/pytorch/issues/61011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61004

Reviewed By: ngimel

Differential Revision: D29523896

Pulled By: malfet

fbshipit-source-id: acb11bdd19c0cc240696be21e5c492f8976fea65
2021-07-06 12:44:56 -07:00
5da507b57b Add bazel actions workflow (#61039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61039

- Added a new template for bazel GH Actions workflow
- Simplified the workflow based on malfet's suggestion by combining build and test jobs into one as we only run a small subset of tests for bazel
- Tested the run to make sure it succeeds
- Build step takes 4 minutes, test step takes 7 minutes

The downside of this approach is that I duplicated some of the jobs in a new template file. An alternative solution would be to use something like https://jinja.palletsprojects.com/en/3.0.x/templates/#template-inheritance; however, that is better done in a separate PR, as the linux and windows workflows would need to be changed. Another solution is to use a bunch of if-else statements in a linux workflow template to accommodate the bazel build as part of it, but this seems not as clean as template inheritance with jinja.

Here is a link to the latest bazel run with this change https://github.com/pytorch/pytorch/actions/runs/1004656584

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29562260

Pulled By: rsemenov

fbshipit-source-id: a7d7d3a0b8092f52929fb109820bfad4574f5602
2021-07-06 12:18:43 -07:00
fac744e116 Foreach Binary Test Refactor (#59907)
Summary:
Related: https://github.com/pytorch/pytorch/issues/58833

## Changes I'm a bit concerned about
- binary ops with one tensorlist and one scalarlist support complex dtypes. To realize this, I added a specialization of [`TensorListScalarListMetadata<c10::complex<double>, 1>` ](https://github.com/pytorch/pytorch/pull/59907/files#diff-131eb9b310905b15b3528da6a23e542a3a3aa952bc88f7423c98a23a8a28cca1R49). This might be out of the scope of this pull request.

cc ptrblck ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59907

Reviewed By: mruberry

Differential Revision: D29551001

Pulled By: ngimel

fbshipit-source-id: 46b25fdba85dd4d6332a77b27376fe96cd422384
2021-07-06 11:49:38 -07:00
5503a4ac6e DOC Improves shape documentation for *Flatten (#60980)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60980

Reviewed By: VitalyFedyunin

Differential Revision: D29526650

Pulled By: jbschlosser

fbshipit-source-id: 2b4b0b84e0652c4cf3e9a48debb3b1bfe4e04b05
2021-07-06 10:47:11 -07:00
95cada8810 Make breakpad depdendencies private (#61183)
Summary:
Otherwise, it will result in the following errors for people developing extensions
```
CMake Error in frontends/pytorch/csrc/CMakeLists.txt:
  Imported target "torch" includes non-existent path

    "/usr/local/include/breakpad"
```

Fixes different issue reported in https://github.com/pytorch/pytorch/issues/60485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61183

Reviewed By: driazati

Differential Revision: D29538332

Pulled By: malfet

fbshipit-source-id: e83cfd0b335e9b0b1ba5715789b09765db671346
2021-07-06 10:02:34 -07:00
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
718db968b8 move CI related functions out of run_test.py (#61124)
Summary:
run_test.py currently does a lot of downloading and test file/suite/case parsing, and it doesn't work well outside of the CI environment.

This change restructures run_test.py, creates tools/test/test_selections.py, and moves all test selection logic (reordering, categorizing slow tests, creating shards) into it.

Follow-up PRs should:
- refactor the file read/write logic entangled inside test_selections.py into the stats/ folder
- restructure and add network-independent test logic to test_test_selections.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61124

Test Plan:
- tools/test
- CI

Related PR:
This follows the refactoring example in: https://github.com/pytorch/pytorch/issues/60373

Reviewed By: malfet

Differential Revision: D29558981

Pulled By: walterddr

fbshipit-source-id: 7f0fd9b4720a918d82918766c002295e8df04169
2021-07-06 09:06:42 -07:00
864dcbb2cc Set sccache bucket on test runs to save some run minutes (#61140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61140

While working on the bazel port to GitHub Actions I noticed that we do not set the sccache bucket for test runs, which causes cache misses while running test jobs. For example, in https://github.com/pytorch/pytorch/runs/2965919198?check_suite_focus=true test run 1 uses the local cache and has 44 cache misses; at an average of 9 seconds of read time per miss, fixing this saves about 44*9/60 ≈ 7 minutes per run.

Here is another example
https://github.com/pytorch/pytorch/runs/2966210127?check_suite_focus=true

Open to feedback if there is a downside of using AWS cache.

Test Plan: Imported from OSS

Reviewed By: 1ntEgr8

Differential Revision: D29557292

Pulled By: rsemenov

fbshipit-source-id: e8fb000850ec4627d7cccf690e8f5743999fdf36
2021-07-06 07:29:57 -07:00
05c1e5b655 [sparsity] Lambda Scheduler (#59771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59771

Implements a specific sparsity scheduler that uses user-provided lambdas to change the sparsity levels.

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D29070604
D29070604

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: c7ccbe63fe4cd6a0c3563541b7fcf93a99d0e62f
2021-07-02 21:39:38 -07:00
37ebf2e3cd [sparsity] Base sparsity level scheduler class (#59770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59770

Implements the base scheduler class for changing the sparsity levels in the sparsifier.

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D29070603
D29070603

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: 0b160e4eb0a2a303d2d19e6a3beb4784002b2cb7
2021-07-02 21:38:24 -07:00
ed63fb5225 Fix some more loops (#60895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60895

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29431572

fbshipit-source-id: fbcf48696bf2c90cc0973a767d83bb526f6ccd7f
2021-07-02 19:17:08 -07:00
43fb39c3eb [DDP] Make uneven inputs work with comm. hook (#61020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020

Makes uneven input support with the `join` context manager work with
custom communication hooks. This will ensure that the two features can work
well together. Added relevant unit tests to test the allreduce and powerSGD hooks.

Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook` which will automatically run whatever hook is installed.
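A hedged sketch of the two features used together; the single-process gloo setup is for illustration only, since real uneven-input runs span multiple ranks:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(10, 10))
model.register_comm_hook(state=None, hook=default_hooks.allreduce_hook)

# With this change, join() drives the registered comm hook instead of a raw
# allreduce when ranks exhaust their inputs at different times.
with model.join():
    for _ in range(3):
        model(torch.randn(4, 10)).sum().backward()
```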
ghstack-source-id: 132950108

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29480028

fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
2021-07-02 18:48:21 -07:00
94b730681f [DDP] Refactor uneven inputs to take GradBucket (#61019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019

Changes the uneven input logic from running allreduce directly to using the `GradBucket` structure. This is to enable support for comm. hook with join in the next diff.
ghstack-source-id: 132950107

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29480027

fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
2021-07-02 18:47:23 -07:00
512448a425 CTCLoss: Remove dispatching in parallel region (#60599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60599

Ref #56794

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29446190

Pulled By: ngimel

fbshipit-source-id: eb01783c8c32a1405b58e1364fc3d71c0f054e0a
2021-07-02 17:55:56 -07:00
d42f1751d4 [sparsity] WeightNormSparsifier (#58955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58955

Implements the weight norm sparsifier.
This type of sparsifier computes the norms of the weights, sorts them, and zeroes out the target fraction of them.

The main implemented method is `update_mask`, which holds the main logic of changing the masks.
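A hypothetical simplification of what `update_mask` does for the unstructured (1x1 block) case; the real implementation works on blocks of `sparse_block_shape`:

```python
import torch

def update_mask_sketch(weight: torch.Tensor, sparsity_level: float) -> torch.Tensor:
    # Rank entries by magnitude and zero out the smallest target fraction.
    scores = weight.abs().flatten()
    k = int(sparsity_level * scores.numel())
    mask = torch.ones_like(scores)
    if k > 0:
        _, smallest = torch.topk(scores, k, largest=False)
        mask[smallest] = 0
    return mask.reshape(weight.shape)
```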

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D28970960
D28970960

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: 8f2a4360ad877f430cdc1065c6777106938b58d5
2021-07-02 17:35:27 -07:00
7ab2729481 [sparsity][refactor] Import factoring out (#58707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58707

Minor refactor that changes the format of the import.
This is done to avoid accidental circular dependencies.

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D28970961
D28970961

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: c312742f5e218c435a1a643532f5842116bfcfff
2021-07-02 16:32:39 -07:00
973e9266ff [sparsity] Sparsifier class (#58704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58704

Implements the base sparsifier class based on the #59835 RFC document.

This PR implements the base class for sparsification. Specifically, the `prepare` method is implemented.

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D28970958
D28970958

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: 0ef98a445c0a0aca22ce5708e34a9f94606d0e2b
2021-07-02 16:31:21 -07:00
80cab10534 [sparsity] Sparsity parametrization (#58705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58705

The basic demo for this particular implementation can be found here:
https://gist.github.com/z-a-f/1d06ae8d5a509d3c9c1596dcb924afe0

Test Plan:
```
python test/test_ao_sparsity.py
```
Imported from OSS

Differential Revision:
D28970959
D28970959

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: 2a0bea1e0a81816690e05f83051d607c90925d32
2021-07-02 11:12:31 -07:00
5d34b7955b [sparsity][refactor] Changing linear row/col control (#60850)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60850

Test Plan:
```
python test/test_ao_sparsity.py
```

```
python test/test_ao_sparsity.py
```

Differential Revision:
D29465900
D29465900

Reviewed By: raghuramank100

Pulled By: z-a-f

fbshipit-source-id: 412f50da857f377898fea79d378ae54a049b81fe
2021-07-02 11:12:30 -07:00
509b1ef9d5 [sparsity] Add sparsity tests to run_test.py (#60887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60887

Test Plan:
```
./test/run_test.py -i test_ao_sparsity
```

```
./test/run_test.py -i test_ao_sparsity
```

Differential Revision:
D29465834
D29465834

Reviewed By: mruberry

Pulled By: z-a-f

fbshipit-source-id: 144f940363a20dd65c2bbfe70924c266d8791dc7
2021-07-02 11:11:20 -07:00
54673fc944 Sparse: Remove dispatch in parallel region (#60598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60598

Ref #56794

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29446192

Pulled By: ngimel

fbshipit-source-id: 1a11f3aa847e4ce83fc6f50cee362b7d0cb61eae
2021-07-01 21:56:17 -07:00
11b722c063 [DDP] Refactor hook running logic (#61018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61018

Extracts the hook-running logic into a function `run_reduction_hook` that takes in a `GradBucket` and runs the hook/allreduce. This is mainly to prepare for join to support comm. hooks in follow-up diffs.
ghstack-source-id: 132924220

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29477143

fbshipit-source-id: 87e8e563e71821fd462d6b259c98a6a0afbcd7b4
2021-07-01 20:41:55 -07:00
b21df03f3b [DDP] Remove SPMD from get_bucket_tensors (#61017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017

Removes SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack which enable support for join with comm. hook.
ghstack-source-id: 132924223

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29477360

fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
2021-07-01 20:40:53 -07:00
4a2e8b53bb [JIT] Add `torch._C.ScriptList` (#52832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52832

**Summary**
This commit adds `torch._C.ScriptList`, a list type that has reference
semantics across the Python/TorchScript boundary. That is, modifications
made in TorchScript to instances of `torch._C.ScriptList`
are visible in Python even when it is not returned from the function.

`torch._C.ScriptList` is implemented using a modified version of pybind's
`stl_bind.h`-style bindings attached to `ScriptList` and `ScriptListIterator`,
wrapper classes around `c10::impl::GenericList` and
`c10::impl::GenericList::iterator`. These bindings allow instances of
`torch._C.ScriptList` to be used as if it were a
regular `list` in Python. Reference semantics are achieved by simply
retrieving the `IValue` contained in `ScriptList` in `toIValue` (invoked
when converting Python arguments to `IValues` before calling TorchScript
code).
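A hedged sketch of the reference semantics; the constructor spelling for building a `torch._C.ScriptList` directly is an assumption, not something this commit message confirms:

```python
from typing import List

import torch

@torch.jit.script
def append_one(xs: List[int]) -> None:
    xs.append(1)

plain = [0]
append_one(plain)     # a regular list is copied at the boundary...
print(plain)          # ...so this still prints [0]

# Hypothetical direct construction of the reference-semantics list:
ref_list = torch._C.ScriptList([0])
append_one(ref_list)  # mutates the shared c10::impl::GenericList...
print(list(ref_list)) # ...so the appended 1 is visible: [0, 1]
```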

**Test Plan**
This commit adds `TestScriptList` to `test_list_dict.py`, a set of tests
that check that all of the common list operations are supported
and that instances have reference semantics across the
Python/TorchScript boundary.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D29478121

Pulled By: SplitInfinity

fbshipit-source-id: 652cc25cfa37debe28db9527504846f22abd8b54
2021-07-01 20:28:13 -07:00
6e9e30cc1d Ignore notebooks when checking for newlines (#61156)
Summary:
Fix lint on master (these files should be considered "generated" so don't lint them)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61156

Reviewed By: malfet

Differential Revision: D29532211

Pulled By: driazati

fbshipit-source-id: a1e47f45bedf441613bdc2bd60fbf8299e5c962f
2021-07-01 18:11:43 -07:00
a4d86e0d53 [quant][fx][perf] improve runtime of prepare step for large models (#61132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61132

For large models, the `insert_observers_for_model` function was taking a long time, especially in the case where not all of the nodes are being quantized.

For example, for a model with 21000 nodes of which only ~50 are being quantized, the breakdown of prepare_fx vs convert_fx was

prepare_fx 979 seconds
convert_fx 9 seconds

The main reason was that we were doing unnecessary computation for all nodes in this function; this PR just moves that work to where it is actually used

After this PR
prepare_fx 26 seconds
convert_fx 9 seconds

Test Plan:
Existing tests

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D29522303

fbshipit-source-id: 7ce12582a859d02ff763abebf4a592d28e0764ca
2021-07-01 17:17:10 -07:00
277b310edb [DataLoader] Add notebook with DataPipes API example (#60680)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60680

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461079

Pulled By: VitalyFedyunin

fbshipit-source-id: 6532bf77113ab89a50f8bb022daf80f8477e9297
2021-07-01 16:39:28 -07:00
ca2702a776 [pruner] Make bias hook stateless (#61077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61077

Removing `BiasHook` class, using function instead.
ghstack-source-id: 132899223

Test Plan:
` buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1L7Tg

Reviewed By: z-a-f

Differential Revision: D29504119

fbshipit-source-id: 6dd9689d18b17ac64e8a461f466e2c9018bc530b
2021-07-01 14:59:00 -07:00
0a7875231b [pruner] Add bias support (#60970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60970

Support adding bias in eager mode
ghstack-source-id: 132695883

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1L3K3

Reviewed By: z-a-f

Differential Revision: D29441499

fbshipit-source-id: 47e0fff5b3014612bd021e145160ea54e2645e24
2021-07-01 14:57:09 -07:00
87dbdef65d MAINT Adds test and docs for Linear with no batch dims (#60992)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

This PR updates docs for `Linear` and adds a non-batch test case to `common_nn.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60992

Reviewed By: VitalyFedyunin

Differential Revision: D29518451

Pulled By: jbschlosser

fbshipit-source-id: 6dd79c0f21ac5b6f693e3e1ba954379d2606d4e0
2021-07-01 14:49:24 -07:00
369802a504 Add aten::avgpool2d NNAPI converter (#58538)
Summary:
Add support for the aten::avgpool2d op in the NNAPI model converter, with
variable size support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58538

Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_avgpool2d

Reviewed By: anshuljain1

Differential Revision: D28531944

fbshipit-source-id: 43ff8c9389365698c282f204042b49c7ec84d824
2021-07-01 14:07:14 -07:00
19b6ee4d4e model_dump working with delegate models (#61043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61043

Trying to make model_dump work with delegate models
ghstack-source-id: 132809755

Test Plan:
N509022.

The data.pkl in the lowered model:
```
bash-3.2$ python -m torch.utils.show_pickle /Users/myuan/models/backend/lowered_model.pt@*/data.pkl
torch.jit.backend_with_compiler_demo.LoweredModule.__torch__.___torch_mangle_5.ModuleAdd()(state=
 (torch.jit._pickle.restore_type_tag({'forward': torch.jit._pickle.restore_type_tag({'input_shapes': '((1, 1, 320, 240), (1, 3))',
                   'some_other_option': 'True'},
                  'Dict[str, str]')},
    'Dict[str, Any]'),
  torch.jit._pickle.restore_type_tag({'forward': 'prim::Constant#1<debug_handle>271,aten::add<debug_handle>272'},
    'Dict[str, str]'),
  True))
```
Comparing to data.pkl in scripted_model.pt:
```
__torch__.___torch_mangle_7.ModuleAdd()(state=
 {'_is_full_backward_hook': None, 'training': True})
```

Reviewed By: Amyh11325

Differential Revision: D29464860

fbshipit-source-id: d738e98ea518339465f8e3375207cf83e3dac532
2021-07-01 13:39:56 -07:00
374278f431 Improved sparse CSR tensor sampling method (#60283)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59379

The improved sparse CSR tensor sampling method is described in https://pearu.github.io/csr_sampling.html that features:
- for specified `nnz`, one gets a CSR sample with the same `nnz`
- variability of the number of specified columns per row is maximized
- `crow_indices` content is randomized
- a given row specific `col_indices` content is sorted and filled with unique values (see also https://github.com/pytorch/pytorch/issues/60277)
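For reference, a minimal CSR tensor showing the invariants the sampler has to respect (using the public `torch.sparse_csr_tensor` constructor):

```python
import torch

# A 3x4 CSR tensor with nnz=4. crow_indices has length nrows+1 and is
# non-decreasing; each row's slice of col_indices is sorted and unique --
# exactly the properties the improved sampler randomizes while keeping nnz fixed.
crow_indices = torch.tensor([0, 1, 3, 4])
col_indices = torch.tensor([2, 0, 3, 1])
values = torch.tensor([1., 2., 3., 4.])
t = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(3, 4))
```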

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60283

Reviewed By: bhosmer

Differential Revision: D29492605

Pulled By: cpuhrsch

fbshipit-source-id: 8d875b7c2b0573a9ab37047c6d8fe8b540295ce1
2021-07-01 13:26:19 -07:00
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
a0a9ea6598 Fix documentation preview instructions (#61080)
Summary:
People don't need to self-host these anymore since we do it automatically in PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61080

Reviewed By: VitalyFedyunin, janeyx99

Differential Revision: D29506465

Pulled By: driazati

fbshipit-source-id: 45875cb229f8cc565a9a1405f52cef198ee0e687
2021-07-01 12:17:34 -07:00
60509f8921 Update DDP documentation to mention outputs not used in loss is supported (#60275)
Summary:
We recently landed a change to ensure that when running under ``find_unused_parameters=True``, not all module outputs have to be used in loss computation and DDP will work as expected. Mention this update in the documentation and add some additional clarification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275

Reviewed By: SciPioneer

Differential Revision: D29502609

Pulled By: rohan-varma

fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
2021-07-01 11:56:53 -07:00
0128eb9a85 Fix TSAN issue in distributed tests (#59238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59238

Creating a `multiprocessing.Manager()` launches a new process using the `fork` method (because it's the default one), and then in that subprocess it launches a new thread. TSAN really doesn't like this (and rightly so!) because we already had threads in the parent process, and intermixing threads and forks is dangerous. The proper way to deal with this is to `exec` inside the child process or, in other words, use the `spawn` method.

Note that the method used to launch the Manager is entirely unrelated to the method used to launch our "own" subprocesses, hence we were using `fork` for the Manager even though we were using `spawn` for our own subprocesses.
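A minimal sketch of the fix, using an explicit `spawn` context for the Manager:

```python
import multiprocessing as mp

if __name__ == "__main__":
    # "spawn" makes the Manager's child process exec a fresh interpreter
    # instead of forking an already-multithreaded parent, which is exactly
    # what TSAN was (rightly) complaining about.
    manager = mp.get_context("spawn").Manager()
    shared = manager.dict()
    shared["ok"] = True
```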
ghstack-source-id: 130240724

Test Plan: Reverted the silencing introduced in D28490129, ran the `test_init_rpc_then_pg` test from the TensorPipe suite and saw the original TSAN failure. Then applied my fix, re-ran the test, and the failure was gone.

Reviewed By: zhaojuanmao

Differential Revision: D28794321

fbshipit-source-id: 12242e69be399a7f02a40a0ebb3d92f92e00ce73
2021-07-01 11:53:01 -07:00
5b44d817fb Expose raw saved tensors for codegen functions (#60565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60565

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29466225

fbshipit-source-id: 77eb4214a1baecc501282413d99d55f8935dc01f
2021-07-01 11:25:21 -07:00
3f0f860a1c Condense JIT/Quantization triage into one workflow (#61130)
Summary:
The `.github/workflows/{jit,quantization}_triage.yml` workflows are nearly identical, so this PR consolidates them into a single GitHub Actions workflow to reduce code duplication. It also renames the workflow so it starts with a capital letter, so that it will show up alongside all our other GitHub Actions workflows on [the HUD](https://hud.pytorch.org/build2/pytorch-master).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61130

Reviewed By: walterddr

Differential Revision: D29520022

Pulled By: samestep

fbshipit-source-id: 673789762e08c2c77d72e7c20eb16d6beec573ba
2021-07-01 10:50:26 -07:00
6f92f10c94 Use a leaky singleton for CublasHandlePool. (#60987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2, we
use a leaky singleton instead.
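A minimal sketch of the construct-on-first-use idiom applied here (illustrative only, with a stand-in pool type):

```cpp
#include <vector>

struct HandlePool {
  std::vector<int> handles;  // stand-in for the real cuBLAS handles
};

// The pool is heap-allocated and intentionally never deleted, so no
// destructor runs in atexit handlers where it could deadlock against
// CUDA driver shutdown. Initialization of the local static is thread-safe.
HandlePool& getHandlePool() {
  static HandlePool* pool = new HandlePool();  // leaked on purpose
  return *pool;
}
```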
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
2021-07-01 10:23:07 -07:00
d2fef350f2 add embedding bag skeleton take 2 (#61126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61126

adding skeleton implementations of quantized embedding tables with zeroes

Test Plan:
compilation, farm test, and ran test_find_dangling_impls and passed

did a manual negative test and verified the message is printed properly
```
======================================================================
FAIL: test_find_dangling_impls (test_dispatch.TestPythonDispatcher)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/hyz/fbsource/fbcode/buck-out/opt/gen/caffe2/test/others#binary,link-tree/test_dispatch.py", line 892, in test_find_dangling_impls
    self.assertEqual(
  File "/data/users/hyz/fbsource/fbcode/buck-out/opt/gen/caffe2/test/others#binary,link-tree/torch/testing/_internal/common_utils.py", line 1498, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Scalars failed to compare as equal! 0 != 1
Expect zero dangling impls, but found: ['name: quantized::qembedding_bag_4bit_unpack\nschema: (none)\nCUDA: registered at caffe2/aten/src/ATen/native/quantized/cuda/embedding_bag.cu:394 :: (Tensor _0) -> (Tensor _0) [ boxed unboxed ]\n']
```

Reviewed By: walterddr

Differential Revision: D29518274

fbshipit-source-id: d0cb81c8bf51cdc4b83038758131ccf61e4360f5
2021-07-01 10:11:45 -07:00
e5ae0e652d [jit] Allow instance overrides of ignored methods (#61076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61076

Previously we would always retrieve ignored methods from the
type, which doesn't work when the user has overridden the ignored method
for a specific instance.

This PR changes things up so we retrieve the ignored method as a bound
method from the object being scripted, unwrap it, then re-bind it to the
scriptmodule.

Test Plan: Imported from OSS

Differential Revision: D29504421

Pulled By: suo

fbshipit-source-id: 14649863ea69a8d2180dd2c4341ec9a826039de1
2021-07-01 09:26:30 -07:00
ccfdb30644 Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision:
D29413019 (4e181dfc35)

Original commit changeset: 323bfbad9d0e

fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2
2021-07-01 08:44:51 -07:00
48bfc0e51c [DataLoader] Add Example Only fork DataPipe (#60679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60679

This is an example-only DataPipe, not intended to be used in production. It is used for tutorials, tests, and documentation.
It has to be replaced by a real `fork` upon the DataLoader update.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461084

Pulled By: VitalyFedyunin

fbshipit-source-id: a7e435f055f040e358f5465092b8daa07f8e29b7
2021-07-01 08:41:26 -07:00
62b2dc2059 [DataLoader] Decorate ZipDataPipe as zip (#60678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60678

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461085

Pulled By: VitalyFedyunin

fbshipit-source-id: f2037fbc67369aae10b07ef80a19e2a0ea7bf530
2021-07-01 08:41:25 -07:00
8e21ff91e2 [DataLoader] Add simple groupby DataPipe (#60675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60675

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461082

Pulled By: VitalyFedyunin

fbshipit-source-id: ded5a3a1555bfd8457d64b7e61ab6729fff9cb75
2021-07-01 08:40:20 -07:00
cb7d813275 Revert D28836794: SumKernel (BFloat16): use float as accumulation type
Test Plan: revert-hammer

Differential Revision:
D28836794 (4f5c68857f)

Original commit changeset: 46ed3a862c2b

fbshipit-source-id: 3b586eeb752b7cdee909fa97a4c78876a6014770
2021-07-01 08:12:31 -07:00
11dca2e5f3 Fix some integer comparisons (#60894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60894

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29431512

fbshipit-source-id: b0ef7656806f378ad823e503e7c27cc563d3dc7d
2021-07-01 08:08:39 -07:00
7017dc101f Revert D29313058: add an embedding bag skeleton operators
Test Plan: revert-hammer

Differential Revision:
D29313058 (ae21357ada)

Original commit changeset: b05df6ff9a7c

fbshipit-source-id: ef422aedad71dee6cb2824c58aceb66104376a65
2021-07-01 07:37:02 -07:00
d6521c2249 [pyper][emb][quantization] Support emb trained in FP16 (#60736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60736

Add support of embedding with input data type as float16, utilize new kernel functions added in fbgemm https://github.com/pytorch/FBGEMM/pull/616

Test Plan: `buck test caffe2/test/:quantization -- test_embedding_bag`

Reviewed By: supriyar

Differential Revision: D29392320

fbshipit-source-id: 0a120b3a58b6cf1d84961831097e9581ffd2b591
2021-07-01 07:35:59 -07:00
d42aa176e4 Bump docker image tag for clang-tidy (#61115)
Summary:
The new tag should fix the "Missing <omp.h>" error message on clang-tidy runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61115

Test Plan:
Ran the clang-tidy job using the diff from https://github.com/pytorch/pytorch/issues/60976.

Expected Output:
There should be no clang diagnostic errors.

Reviewed By: walterddr

Differential Revision: D29516845

Pulled By: 1ntEgr8

fbshipit-source-id: 554229904db67eb7a7b93b3def434b30de6a43b0
2021-07-01 07:30:28 -07:00
46595a9623 [Static Runtime] Add gflag to disable nnc and caffe2 math library (#61090)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61090

Reviewed By: ajyu

Differential Revision: D29479860

fbshipit-source-id: 2b53405f41d319f074c75d8923d97fd6a45fee4b
2021-07-01 00:01:37 -07:00
c1499a9933 Enable jit tracing to parametrization and add jit tests (#60969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60969

This PR fixes the tracing in the parametrizations.
The current resolution is that when tracing is performed while caching is enabled, we throw an error.
Without caching, the tracing should work properly (tests added).

Currently, the parametrizations don't support scripting.
This PR introduces the same logic as with the tracing (throw an error if caching is enabled).
However, scripting itself cannot be enabled due to the use of generator expressions in the parametrizations.
Added a TODO to fix it.
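A minimal sketch of the behavior described above (assumes the `torch.nn.utils.parametrize` API; the exact error type is not shown):

```python
import torch
from torch import nn
from torch.nn.utils import parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        return X + X.t()

m = nn.Linear(3, 3)
parametrize.register_parametrization(m, "weight", Symmetric())

# Without caching, tracing works:
traced = torch.jit.trace(m, torch.randn(2, 3))

# Under the caching context manager, tracing is expected to raise:
# with parametrize.cached():
#     torch.jit.trace(m, torch.randn(2, 3))  # error
```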

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29462887

Pulled By: z-a-f

fbshipit-source-id: 49721d3059be58f36055d1c374080df41a748d66
2021-06-30 23:54:02 -07:00
4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning and move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make the default log level WARNING
* Add a new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error propagation
* Set the default events handler to `null` so that events are not printed to the console
* Add a reference from `torch.distributed.launch` to `torch.distributed.run`
* Set the correct preexec function that sends SIGTERM to child processes when the parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py ->warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
ae21357ada add embedding bag skeleton operators (#60491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60491

Basic reference embedding bag operators; these are not going to be performant, but can be used for functionality enablement.

These operators output the right shape, but the implementation is empty.

Test Plan: tbd

Reviewed By: vkuzo

Differential Revision: D29313058

fbshipit-source-id: b05df6ff9a7c0c6ac46ef64a42464988453bd460
2021-06-30 23:09:11 -07:00
db1dd9e7e0 add support for quantized tensors in torch.testing.assert_close (#58926)
Summary:
This adds support for quantized tensors the same way torch.testing._internal.common_utils.TestCase.assertEqual does:

bf269fdc98/torch/testing/_internal/common_utils.py (L1314-L1341)

- `.qscheme()` is checked for equality
- `.q_scale` and `.q_zero_point` are checked for equality (see comment below) for `.qscheme() == torch.per_tensor_affine`
- `.q_per_channel_scales`, `.q_per_channel_zero_points`, and `.q_per_channel_axis` are checked for equality (see comment below) for `.qscheme() == torch.per_channel_affine`
- values are checked with the default checks after a `.int_repr().to(torch.int32)` call
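
A minimal usage sketch of the new support (comparing a quantized tensor against itself; a mismatched scale is sketched in the comment):

```python
import torch

x = torch.quantize_per_tensor(
    torch.randn(4), scale=0.1, zero_point=0, dtype=torch.qint8)
torch.testing.assert_close(x, x)  # passes: same qscheme, qparams, values

y = torch.quantize_per_tensor(
    x.dequantize(), scale=0.2, zero_point=0, dtype=torch.qint8)
# torch.testing.assert_close(x, y)  # would raise: the scales differ
```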

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58926

Reviewed By: jerryzh168

Differential Revision: D29483532

Pulled By: mruberry

fbshipit-source-id: 003fde7e21cf844778a879c3de0a7c84d13877bd
2021-06-30 21:43:02 -07:00
06fc637b41 Check native_function's outputs' TensorImpl and StorageImpl (#60286)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25927

Performs some of the checks described in https://github.com/pytorch/pytorch/issues/25927#issuecomment-589354373:
If the function does not modify its inputs (non-inplace and has no out arg):
- Check that the TensorImpl has a use_count of 1. (This should make us aware of functions that return self.)
- If the function is a view function, check that the StorageImpl is the same as that of the aliased input; otherwise, check that the StorageImpl's use_count is 1.

Detected a couple of functions that failed the check that the returned TensorImpl should have a use_count of 1: 'native_batch_norm', 'native_batch_norm_backward', '_embedding_bag'. (Filing issues).

Examples of generated code:
We did not update checks for in-place ops (this includes in-place views).

Example of a view:
- Check that outputs StorageImpl of `result` is the same as that of `self`.
- Check TensorImpl has use_count of 1
```cpp
at::Tensor as_strided(c10::DispatchKeySet ks, const at::Tensor & self, at::IntArrayRef size, at::IntArrayRef stride, c10::optional<int64_t> storage_offset) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  (void)_any_requires_grad;
  std::shared_ptr<AsStridedBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AsStridedBackward>(new AsStridedBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_geometry = TensorGeometry(self);
    grad_fn->size = size.vec();
    grad_fn->stride = stride.vec();
    grad_fn->storage_offset = storage_offset;
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowAutograd guard;
    return at::redispatch::as_strided(ks & c10::after_autograd_keyset, self_, size, stride, storage_offset);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(result.storage())); <<<<<<<<<<<<<<<<<<<<<<<<
  AT_ASSERT(result.use_count() == 1); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  TORCH_CHECK_NOT_IMPLEMENTED(!(isFwGradDefined(self)), "Trying to use forward AD with as_strided that does not support it.");
  return result;
}
```
Example of non-view:
- Check that output's StorageImpl has use_count of 1.
- Check that output's TensorImpl has use_count of 1.
```cpp
at::Tensor asin(c10::DispatchKeySet ks, const at::Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  (void)_any_requires_grad;
  std::shared_ptr<AsinBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AsinBackward>(new AsinBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::asin(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage()) AT_ASSERT(result.storage().use_count() == 1); <<<<<<<<<<<<<<<<<<<<<<<<<<
  AT_ASSERT(result.use_count() == 1); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  if (isFwGradDefined(self)) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
      auto self_p = toNonOptPrimal(self);
      auto result_new_fw_grad = (self_t.conj() * (-self_p * self_p + 1).rsqrt().conj()).conj();
      if (result_new_fw_grad.defined()) {
        // The hardcoded 0 here will need to be updated once we support multiple levels.
        result._set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  return result;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60286

Reviewed By: jbschlosser

Differential Revision: D29402253

Pulled By: soulitzer

fbshipit-source-id: b90f34c455b8767f95a52c329db351dbbb495397
2021-06-30 19:19:01 -07:00
03b5a225a7 Test parametrization for instantiated device-specific tests (#60233)
Summary:
The `ops` decorator provides a way to parameterize a test across a given list of ops. This would be useful for modules as well (e.g. a `modules` decorator), but the mechanism by which this is accomplished is specific to ops. In the details, the `ops` decorator tags a test function with the metadata needed (list of ops, `dtypes`) and the actual tests are generated according to this metadata during the call to `instantiate_device_type_tests()`.

This PR makes this mechanism more generic, allowing for test parameterization across arbitrary dimensions. This makes a `modules` decorator (or any similar type of decorator) straightforward to implement without changes to the device-specific test instantiation logic.

One caveat is that, since this is implemented where the old `ops` decorator was (within `instantiate_device_type_tests()`), this only works for tests instantiated using the device-specific instantiation logic. Longer term, even device-specific test instantiation could be treated as an optional parameterization across device types, but this PR takes a low-risk approach for now. In practice, this just means that a `device` kwarg is required for all test signatures used with the mechanism.

The `ops` decorator has been refactored to use the generic mechanism and works the same as before, with one difference: when `OpDTypes.none` is specified, the test signature no longer needs an unused `dtype` kwarg. This is a nice bonus that demonstrates the added flexibility of a generic parameterization mechanism. The refactored form also has the bonus that all op-specific test generation logic is contained within the `ops` decorator class, improving readability.

Behind the scenes, the generic mechanism is a base decorator class (`_TestParameterizer`) from which `ops` derives. The core functionality is in the `_parameterize_test()` method, which takes in a test function and returns a generator that produces parameterized tests, including names and parameter kwargs to pass to them. Using the `ops` decorator results in a set of op-specific tests from a given generic test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60233

Reviewed By: iramazanli

Differential Revision: D29494995

Pulled By: jbschlosser

fbshipit-source-id: a14446488c106094fafcaa75ccf8e9e3faf33bfc
2021-06-30 18:50:22 -07:00
6643df2680 [jit] Use computed loop to dispatch to next instruction in interpreter. (#60211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60211

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D29211283

fbshipit-source-id: 2f87b5a78d4fc00ce11ed509fc15db35332690b6
2021-06-30 17:44:26 -07:00
357a21bc92 Fix numerical issue of rowwise normalization in Caffe2 and internal tests. (#60880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60880

Fix numerical issue of rowwise normalization in Caffe2 and internal tests.

Test Plan: buck test mode/opt //dper3/dper3/modules/tests:xdeepint_test -- --exact 'dper3/dper3/modules/tests:xdeepint_test - test_xdeepint_with_full_features_with_interactions_3 (dper3.dper3.modules.tests.xdeepint_test.XdeepInt_Test)'

Reviewed By: esqu1

Differential Revision: D29431597

fbshipit-source-id: 72df52fdcbb29ad3de7b9472f25fde26cf804a76
2021-06-30 17:31:04 -07:00
0824b919ec [BE] move general script out of .circleci/ into tools/ (#60973)
Summary:
Second step in https://github.com/pytorch/pytorch/issues/60373.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60973

Reviewed By: samestep

Differential Revision: D29499385

Pulled By: walterddr

fbshipit-source-id: 22df22f78f6b9af6221917a10188218773245009
2021-06-30 17:20:05 -07:00
4036820506 Add PocketFFT support (#60976)
Summary:
Needed on platforms that do not have MKL, such as aarch64 and M1:
- Add `AT_POCKETFFT_ENABLED()` to Config.h.in
- Introduce torch._C.has_spectral that is true if PyTorch was compiled with either MKL or PocketFFT
- Modify spectral test to use skipCPUIfNoFFT instead of skipCPUIfNoMKL

Shares the implementation of the `_out` functions, as well as `fft_fill_with_conjugate_symmetry_stub`, between the MKL and PocketFFT implementations.
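
A minimal sketch of the new capability flag (the flag name comes from the bullet above):

```python
import torch

# Guard FFT usage on builds that have neither MKL nor PocketFFT.
if torch._C.has_spectral:
    freqs = torch.fft.fft(torch.randn(8))
else:
    print("PyTorch was built without a spectral backend; skipping FFT")
```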

Fixes https://github.com/pytorch/pytorch/issues/41592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60976

Reviewed By: walterddr, driazati, janeyx99, samestep

Differential Revision: D29466530

Pulled By: malfet

fbshipit-source-id: ac5edb3d40e7c413267825f92a5e8bc4bb249caf
2021-06-30 16:28:20 -07:00
2d0c6e60a7 going back to use packaging.version.parse instead (#61053)
Summary:
I think this may be related to https://app.circleci.com/pipelines/github/pytorch/vision/9352/workflows/9c8afb1c-6157-4c82-a5c8-105c5adac57d/jobs/687003

Apparently `pkg_resources.parse_version` returns a `pkg_resources.extern.packaging.version.Version` instead of a `packaging.version.Version`, and it seems that on some older versions of setuptools it doesn't support the `.major`/`.minor` attributes. Changing it back to using `packaging.version.parse`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61053

Test Plan: CI

Reviewed By: samestep

Differential Revision: D29494322

Pulled By: walterddr

fbshipit-source-id: 294572a10b167677440d7404e5ebe007ab59d299
2021-06-30 16:23:59 -07:00
a2ad84afbb Send test reports to S3 (#61071)
Summary:
This sends the test reports zip to S3 in addition to the GitHub artifact store. This makes it easier to query in the PR HUD since we don't have to deal with the GitHub API's rate limits / download speeds. The impact on S3 storage should be minimal since it's only 500 KB or so per run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61071

Reviewed By: nikithamalgifb

Differential Revision: D29498941

Pulled By: driazati

fbshipit-source-id: 74bfbe7fa7d1d97fd8a6938c98dfe0caff0ab6eb
2021-06-30 16:00:01 -07:00
812ed47caa [Static runtime] Add unit tests to ops bmm and addmm (#61000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61000

Add unit tests to bmm and addmm operators in static runtime.

Test Plan:
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest

{F628935117}

Reviewed By: hlu1

Differential Revision: D29459679

fbshipit-source-id: 5c7fa5c9b0675c1c84f3ae3110204d663255009c
2021-06-30 15:55:58 -07:00
4ff81ab112 escape backward slash in stack trace in Windows to slash (#60842)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60842

Reviewed By: gdankel

Differential Revision: D29498498

Pulled By: malfet

fbshipit-source-id: 78e1b25a2e6bdfd3ba0c988d023c7a7f79a22cf4
2021-06-30 15:32:03 -07:00
6c1c1111de [JIT] Add reference semantics to TorchScript classes (#44324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44324

**Summary**
This commit adds reference semantics to TorchScript class types;
modifications made to them within TorchScript will be visible in Python.

**Test Plan**
This commit adds a unit test to `TestClassType` that checks that
modifications made to a class type instance passed into TorchScript are
visible in Python after executing the scripted function or module.

**Fixes**
This commit closes #41421.
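
A minimal sketch of the new behavior (patterned on the description above; not the exact unit test):

```python
import torch

@torch.jit.script
class Counter(object):
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1

@torch.jit.script
def bump(c: Counter):
    c.increment()

c = Counter()
bump(c)
print(c.count)  # 1 -- the mutation made inside TorchScript is now visible
```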

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24912807

Pulled By: SplitInfinity

fbshipit-source-id: d64ac6211012425b040b987e3358253016e84ca0
2021-06-30 14:27:17 -07:00
aa728dc335 Fix fx patch module name (#61062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61062

Instead of being 'patch', this should be the importable name of the module (it's defined as `_fx` on the `torch._C` module, so the full name should be `torch._C._fx`). This now works correctly:

```python
>>> import torch._C._fx
>>> dir(torch._C._fx)
['__doc__', '__loader__', '__name__', '__package__', '__spec__', 'patch_function']
```

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D29497018

Pulled By: driazati

fbshipit-source-id: 093aa0552b48feb0aabe47bdf72776dddd5a3b8f
2021-06-30 14:23:35 -07:00
dabadd7e20 [quant] Added reset_min_max_vals() function to observers (#60883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60883

As per this [comment](https://github.com/pytorch/pytorch/pull/59964#discussion_r659064270), I created a `reset_min_max_vals()` function inside the observers which will be called during input-weight equalization. This is so that we will not expose the implementation of the observers in the equalization code.
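
A minimal usage sketch of the new method, assuming the default `MinMaxObserver`:

```python
import torch
from torch.quantization.observer import MinMaxObserver

obs = MinMaxObserver()
obs(torch.randn(4, 4))      # observe a batch; updates min_val/max_val
obs.reset_min_max_vals()    # new API: clear the recorded statistics
obs(torch.randn(4, 4))      # stats now reflect only this batch
```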

Test Plan:
`python test/test_quantization.py TestEqualizeFx`

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29491848

fbshipit-source-id: 00e91959ceb3b4f3688175a1a7ba11823e929b2f
2021-06-30 14:22:08 -07:00
1a0195db49 [quant] Input-Weight Equalization - support for LinearReLU layers (#60653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60653

Special casing was needed to get the weight attribute in the linear layers of fused LinearReLU layers.

Initial Model: `x -> linear1 -> relu`

After fusion: `x -> linearRelu`

After prepare: `x -> input_quant_obs -> input_eq_obs1 -> linearRelu -> output_quant_obs1`

After equalization functions: `x -> mul -> input_quant_obs (scaled) -> linearRelu -> output_quant_obs`

After convert: `x -> mul -> quantize_per_tensor -> quantized::linearRelu -> dequantize`

More step-throughs here: https://fb.quip.com/A9J3AsBxkykR

Test Plan:
`python test/test_quantization.py TestEqualizeFx`

Original model:
```
LinearReluModel(
  (fc): Linear(in_features=5, out_features=5, bias=True)
  (relu): ReLU()
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %fc_activation_post_process_0 : [#users=1] = call_module[target=fc_activation_post_process_0](args = (%fc,), kwargs = {})
    return fc_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%x_activation_post_process_0,), kwargs = {})
    %fc_activation_post_process_0 : [#users=1] = call_module[target=fc_activation_post_process_0](args = (%fc,), kwargs = {})
    return fc_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %fc_input_scale_0 : [#users=1] = get_attr[target=fc_input_scale_0]
    %fc_input_zero_point_0 : [#users=1] = get_attr[target=fc_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %fc_input_scale_0, %fc_input_zero_point_0, torch.quint8), kwargs = {})
    %fc : [#users=1] = call_module[target=fc](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%fc,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29406999

fbshipit-source-id: add38e8e7fb84a241c3b10bfb8451b50103effd4
2021-06-30 14:22:06 -07:00
546102e161 Fix overflow in quantize_val_arm (#60079)
Summary:
By using `__builtin_add_overflow` to detect integer overflows when `zero_point` is added to the rounded integral value.
Also fixes a small typo.

After this PR, `python3 -c "import torch;print(torch.torch.quantize_per_tensor(torch.ones(10) * 2**32, 0.5, 1, torch.quint8))"` returns the same vector of `127`s on both x86_64 and aarch64 platforms.

This change merely mitigates the overflow bug; a more proper (and perhaps performance-impacting) fix would be to add `zero_point` to the floating-point values both in the serial and the vectorized code. Filed https://github.com/pytorch/pytorch/issues/61047 to track this one.
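
A minimal sketch of the detection pattern (hypothetical helper, not the exact PyTorch code):

```cpp
#include <climits>
#include <cstdint>

// Detect overflow when adding zero_point to a rounded value and saturate
// instead of wrapping around (signed overflow is undefined behavior).
int32_t add_zero_point_checked(int32_t rounded, int32_t zero_point) {
  int32_t result;
  if (__builtin_add_overflow(rounded, zero_point, &result)) {
    return rounded > 0 ? INT32_MAX : INT32_MIN;  // saturate on overflow
  }
  return result;
}
```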

Also filed https://github.com/pytorch/pytorch/issues/61046 to clarify intended use of `__ARM_NEON__` define

Fixes https://github.com/pytorch/pytorch/issues/60077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60079

Reviewed By: kimishpatel

Differential Revision: D29157883

Pulled By: malfet

fbshipit-source-id: 6f75d93e6d3d4d0d5a5eab545cb27773086b9768
2021-06-30 14:20:56 -07:00
cef0851223 Make torch.utils.benchmark numpy free (#60564)
Summary:
PyTorch core does not depend on numpy, so the benchmarks should not depend on it either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60564

Reviewed By: robieta

Differential Revision: D29497375

Pulled By: malfet

fbshipit-source-id: d9566e5b2e48868cef5568cd62f691af19ccf1f1
2021-06-30 14:17:32 -07:00
d1a4c9e682 [ROCm] allow user to override PYTORCH_ROCM_ARCH (#60602)
Summary:
Restores the ability of a user to call .jenkins/pytorch/build.sh while
also setting PYTORCH_ROCM_ARCH. Otherwise, with IN_CI=1 as the new
default, it will forcibly ignore user settings when build.sh is used
outside of CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60602

Reviewed By: samestep

Differential Revision: D29490791

Pulled By: janeyx99

fbshipit-source-id: b5e8a529b8e0b5020b260b4bf027a37e0c1df8d5
2021-06-30 13:35:11 -07:00
14cc234a8a Fix some comparison warnings (#60875)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60875

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29406593

fbshipit-source-id: 0eb070ef05c1cd343c9e835786b42014d0553aa5
2021-06-30 13:09:41 -07:00
74692f3ada Loop transformation (#60874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60874

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29406474

fbshipit-source-id: c994361e9fdafb7c4519ce2f1c40288a9ef025be
2021-06-30 13:09:39 -07:00
a8b56ea58b Remove another for-loop in SoftMax (#60873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60873

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29406429

fbshipit-source-id: 3b5710ed9e5d1d14379f64670638ab119d0d78e3
2021-06-30 13:09:38 -07:00
850ff82edc Remove for-loop for getting number of elements in favour of abstraction (#60872)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60872

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29406199

fbshipit-source-id: ae49672cf1bb370d574d0c21231477bb17dea0ca
2021-06-30 13:08:25 -07:00
95e77e0af2 [Delegate] A more specific prefix for lowered module name. (#61007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61007

Test Plan: Imported from OSS

Reviewed By: kimishpatel, raziel

Differential Revision: D29477733

Pulled By: iseeyuan

fbshipit-source-id: 94a7a784d98a41ff7ba255955acf74bd26297c9f
2021-06-30 12:37:09 -07:00
f32f85e6da Implemented torch.corrcoef (#60420)
Summary:
Implements `torch.corrcoef` similar to [`np.corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) using `torch.cov` implemented in https://github.com/pytorch/pytorch/pull/58311.

closes https://github.com/pytorch/pytorch/issues/1254
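
A quick usage sketch (two perfectly anti-correlated rows):

```python
import torch

x = torch.tensor([[0., 1., 2.],
                  [2., 1., 0.]])
print(torch.corrcoef(x))
# tensor([[ 1., -1.],
#         [-1.,  1.]])
```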

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60420

Reviewed By: mruberry

Differential Revision: D29474687

Pulled By: heitorschueroff

fbshipit-source-id: f3c7c5610363aebd88274a51fc77e3cf879cb611
2021-06-30 12:36:02 -07:00
d5be67a338 Expose findDanglingImpls to Python (#60827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60827

This diff exposed Dispatcher.findDanglingImpls to Python as _C._dispatch_find_dangling_impls.
ghstack-source-id: 132799970

Test Plan: buck test mode/dev //caffe2/test:others -- test_find_dangling_impls

Reviewed By: ezyang

Differential Revision: D29416330

fbshipit-source-id: d2f26054b6e247be1bb9e818eaa7cb9e68a4a913
2021-06-30 12:31:19 -07:00
3cf267bfa6 Embedding: Remove dispatch in parallel region (#60597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60597

Ref #56794

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29446191

Pulled By: ngimel

fbshipit-source-id: d6ff010104ae621d5e3d9c269ed2b48407e71d67
2021-06-30 12:30:15 -07:00
4f5c68857f SumKernel (BFloat16): use float as accumulation type (#55217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55217

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28836794

Pulled By: VitalyFedyunin

fbshipit-source-id: 46ed3a862c2bb4c6325c78ecfc5d01761f7a113a
2021-06-30 12:27:42 -07:00
4d5edef8d4 Python composite module execution unit tests on delegation of backend_with_compiler_demo (#60801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60801

Added unit tests for the execution of a simple composite module with a
compiler (`backend_with_compiler_demo`).

Test Plan:
Running `python test/test_jit.py TestBackendsWithCompiler -v` succeeds.

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29409958

fbshipit-source-id: b02e58bdcc25a2997b70ecae41a019b8596323c1
2021-06-30 12:23:32 -07:00
3957ed41a9 [DDP] Disable reducer hooks from running outside of DDP backwards. (#60921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60921

Sometimes local modules can fire hooks (such as when the user calls
backward after using `ddp_module.module` explicitly). This isn't supported
behavior and can cause issues with the various state and gradient reduction logic we run
in DDP, so it's best to disable this entirely.
ghstack-source-id: 132739311

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29435737

fbshipit-source-id: fef76a0dd2955c432131632fb81dde4a4982ad91
2021-06-30 12:19:18 -07:00
5a4282d06b fix typo in binary_build_script (#61016)
Summary:
resolve comments in https://github.com/pytorch/pytorch/issues/60849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61016

Reviewed By: samestep

Differential Revision: D29487908

Pulled By: janeyx99

fbshipit-source-id: 32feb6c6e1009324201e3d2c6fcd9a7388791401
2021-06-30 11:52:38 -07:00
d44515c418 Fix lint (#61058)
Summary:
https://github.com/pytorch/pytorch/issues/61003 broke Lint / shellcheck because of a race condition with https://github.com/pytorch/pytorch/issues/60221. This PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61058

Test Plan: CI.

Reviewed By: walterddr

Differential Revision: D29494727

Pulled By: samestep

fbshipit-source-id: e6c5ea6daa47db13eb6a42cc2b5bf9c938c1839d
2021-06-30 11:45:23 -07:00
a25e6370e5 Add IMethod interface
Summary:
Expose IMethod interface, which provides a unified interface to either script or python methods backed by torchscript or torchdeploy.

IMethod provides a way to depend on a torch method without depending on a particular runtime implementation such as torchscript or python/deploy.

Test Plan: add unit tests.

Reviewed By: suo

Differential Revision: D29463455

fbshipit-source-id: 903391d9af9fbdd8fcdb096c1a136ec6ac153b7c
2021-06-30 11:28:24 -07:00
dace860008 Migrate pytorch-linux-bionic-py3.8-gcc9-coverage to GHA (#61050)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59166

`pytorch-linux-bionic-py3.8-gcc9-coverage` build & tests can be run on `linux.2xlarge` instances on GHA,
which have AVX512 support.

Thanks

cc malfet seemethere samestep zhouzhuojie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61050

Reviewed By: walterddr, 1ntEgr8

Differential Revision: D29493335

Pulled By: samestep

fbshipit-source-id: de79e61f13c537ef7ff30a1e04d1bbc625a06dd1
2021-06-30 11:02:57 -07:00
b4496df7d3 mkl_scsrmm needs to be disabled when MKL is not used (#60051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60051

Introduction:
We want to minimize the number of dependencies for the SGX port. Therefore we need the ability to disable MKL when it is not used.

Problem:
There is a call to mkl_scsrmm that is enabled when CAFFE2_USE_MKL is not defined. This causes a compile error.

Solution:
Surround the call with preprocessor checks to CAFFE2_USE_MKL

Test Plan: Run the pytorch tests.

Reviewed By: LiJihang

Differential Revision: D29022635

fbshipit-source-id: 94ae9fdfe53399b64d8c2d4089eebe93d1d260e8
2021-06-30 10:40:18 -07:00
5644c31ec0 Move windows periodic jobs to GHA (#61003)
Summary:
Moves periodic 11.3 windows jobs to GHA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61003

Test Plan:
https://github.com/pytorch/pytorch/pull/61003/checks?check_run_id=2947910829

Does NOT yet move the debuggable CI part yet

Reviewed By: malfet

Differential Revision: D29488761

Pulled By: janeyx99

fbshipit-source-id: b16b23b40fe1f6ae189292c6f2c561e5e70f122b
2021-06-30 10:25:10 -07:00
9b5e1e0734 [DataLoader] Make batch DataPipe sensitive to unbatch_level argument (#60672)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60672

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461086

Pulled By: VitalyFedyunin

fbshipit-source-id: efc6b3b567323defe64d3f1b30a5708107e62dd4
2021-06-30 10:04:32 -07:00
66de50cc11 [DataLoader] Make shuffle DataPipe sensitive to unbatch_level argument (#60671)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60671

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461083

Pulled By: VitalyFedyunin

fbshipit-source-id: 3d371017d5ce948a1e5b8182ae91033190f64da7
2021-06-30 10:03:29 -07:00
a652398465 [DataLoader] Rename transform DataPipe to legacy_transform (#60670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60670

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461081

Pulled By: VitalyFedyunin

fbshipit-source-id: 57f53a91db9032a6126e86243ddea9149c473060
2021-06-30 09:49:14 -07:00
abb4ed7412 Move clang-format to lint.yml (#60918)
Summary:
Refactor and consolidate the location of lint related workflows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60918

Reviewed By: mruberry

Differential Revision: D29459605

Pulled By: zhouzhuojie

fbshipit-source-id: c2993cfd037a03b733a414897bd53cf407c7c268
2021-06-30 09:45:35 -07:00
0b8a7daa2a Enable multigpu_test in GHA (#60221)
Summary:
- [x] add to test matrix
- [x] enable on PRs for testing
- [x] modify the scripts so it actually runs the multigpu tests
- [x] put `num_shards` after `shard` number
- [x] use a separate test-reports artifact
- [x] run on `linux.16xlarge.nvidia.gpu`
- [x] validate that it works
- [x] disable on PRs before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60221

Test Plan: CI. Example run: https://github.com/pytorch/pytorch/actions/runs/984347177

Reviewed By: malfet

Differential Revision: D29430567

Pulled By: samestep

fbshipit-source-id: 09f8e208e524579b603611479ca00515c8a1b5aa
2021-06-30 08:52:38 -07:00
5576c7bdd1 ns for fx: initial support for int8 shadows fp32 (#60419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60419

Adds support for NS for FX shadowed activations pass to handle int8
modules shadowing fp32 modules. The difficulty here is that in order
to insert the dtype cast, we need the qparams of the input.

For the current PR, we only handle the easy cases where the previous
node is either a `quantize_per_tensor` or an OSS quantized module.
A future PR can handle more complicated cases such as various functions.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_int8_shadows_fp32_simple
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29280050

fbshipit-source-id: 465257c9f82a34fa91b48ae8887355c68e00edc6
2021-06-30 08:08:46 -07:00
a5e2ea4345 Add noop register hook (#60685)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60685

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29466224

fbshipit-source-id: 68c8aa022ccffeefd45062f1443d15c9a6824f3d
2021-06-30 07:46:34 -07:00
1fd65967e5 Revert D29312809: add quantized_resize and dequantize for some cuda backends
Test Plan: revert-hammer

Differential Revision:
D29312809 (c4cc26f26a)

Original commit changeset: c5c5eabb98bc

fbshipit-source-id: 565e215513b68eae0dacdd1660b1a01759215511
2021-06-30 07:37:09 -07:00
bfe03120ee [PyPer] Fix schema of fb::equally_split (#60852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60852

Reviewed By: ajyu

Differential Revision: D29423425

fbshipit-source-id: 4525db1f268ca65d6851a5ec846a6ae2f710ec6b
2021-06-30 03:18:15 -07:00
af5a0df1d0 Prefer linalg::qr over qr in the C++ API (#60529)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60060

Also adds `torch::linalg::qr` to the C++ API, as it was missing.
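
A minimal usage sketch of the added C++ API (assumes the default "reduced" mode and a (Q, R) tuple return):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor A = torch::randn({4, 3});
  torch::Tensor Q, R;
  std::tie(Q, R) = torch::linalg::qr(A);
  // Q has orthonormal columns and Q @ R reconstructs A.
  std::cout << torch::allclose(torch::matmul(Q, R), A, 1e-4, 1e-6) << "\n";
}
```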

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60529

Reviewed By: ngimel

Differential Revision: D29353133

Pulled By: mruberry

fbshipit-source-id: e18feaffca91c13940ad3d6bd1f40bb57dc101ae
2021-06-30 02:48:04 -07:00
b39770c461 Fix degenerate shape behavior for ord=+/-2 (#60273)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60273

Reviewed By: jbschlosser

Differential Revision: D29422907

Pulled By: mruberry

fbshipit-source-id: 609cd640b0477f90bebca20865e34cbe182d3909
2021-06-30 02:17:26 -07:00
10fc58620e [PyTorch][NASProfiler] Add moduleHierarchy Python API to print out hierarchical information about a Node (#60384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60384

Currently, inlining the module graph drops the module hierarchy info on the Python side. Here we retrieve the module hierarchy from the C++ side and expose it through a new Python API on Node called `moduleHierarchy()`.

Test Plan:
Usage:
```
torch._C._jit_pass_inline(module.graph)
torch._C._jit_pass_propagate_shapes_on_graph(module.graph)
node = module.graph.findNode("quantized::conv2d_relu")
'top(' + module.original_name + ').' + node.moduleHierarchy() + '.' + node.kind()
```
Output:
```
'top(QuantWrapper).module(FBNetHR).0(Sequential).xif0_0(ConvBNRelu).conv(ConvReLU2d).quantized::conv2d_relu'
```

Reviewed By: kimishpatel

Differential Revision: D29252169

fbshipit-source-id: 74163a87f919e061e5e75dfebc4c5cdbe8489d93
2021-06-30 01:32:31 -07:00
44b3dc4eac resolve conjugate bit in torch.testing.assert_close (#60522)
Summary:
We need to resolve the conjugate bit for complex tensors, because otherwise we may not be able to access the imaginary component:

```python
>>> torch.tensor(complex(1, 1)).conj().imag
RuntimeError: view_as_real doesn't work on unresolved conjugated tensors.  To resolve the conjugate tensor so you can view it as real, use self.resolve_conj(); however, be warned that the resulting tensor will NOT alias the original.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60522

Reviewed By: ngimel

Differential Revision: D29353095

Pulled By: mruberry

fbshipit-source-id: c36eaf883dd55041166f692f7b1d35cd2a34acfb
2021-06-30 01:31:30 -07:00
c4cc26f26a add quantized_resize and dequantize for some cuda backends (#60489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60489

adding entries into native_functions.yaml to enable these functions
since the code is common between cuda and cpu

Test Plan: tested with a full model, unit tests on the way

Reviewed By: ezyang

Differential Revision: D29312809

fbshipit-source-id: c5c5eabb98bc192343ec78980dc4e3fc3f41d3db
2021-06-30 00:33:12 -07:00
4adc5eb6c5 [Caffe2][Testing] Check for equality first in assertTensorEqualsWithType<float> (#61006)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61006

Test Plan: Modified existing unit test to test for eps = 0. It would fail without the equality test first.

Reviewed By: ajyu

Differential Revision: D29423770

fbshipit-source-id: 168e7de00d8522c4b646a8335d0120700915f260
2021-06-29 23:31:37 -07:00
287c0ab170 [FX] Add requires_grad to TensorMetadata (#60972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60972

For PyTorch model memory requirement calculation, requires_grad is needed. Output tensors with requires_grad are saved in the module context and increase memory during the forward pass.

Test Plan: Existing test cases

Reviewed By: jamesr66a

Differential Revision: D29024932

fbshipit-source-id: def990f8c6ff6fa4537bfc377c646b9d44464ebd
2021-06-29 23:07:27 -07:00
ce232e7847 [ROCM] enable fft tests (#60313)
Summary:
This PR enables fft tests on ROCm. It contains a helper function that generates a valid input for fft tests that call hipfftExecC2R or hipfftExecZ2D. With this helper function we are able to fix a number of fft tests. This brings to a close the series of fft PRs enabling fft tests on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60313

Reviewed By: mruberry

Differential Revision: D29463487

Pulled By: malfet

fbshipit-source-id: d0903fbf12d24ba95a42c8b7589714fdb63353ed
2021-06-29 22:43:29 -07:00
e2b42c6f52 [ROCm] Update the magma build to new commit (#60900)
Summary:
Magma's master branch is updated with all the fixes required for ROCm, so this updates the magma build to the new commit for ROCm PyTorch builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60900

Reviewed By: jbschlosser

Differential Revision: D29440587

Pulled By: malfet

fbshipit-source-id: 2ccdf48441dfff3d19c4a478e03ac11a843f8419
2021-06-29 22:38:58 -07:00
93772792e3 [nnc] Get rid of fuser trigger counters (#57334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334

Here's a possibly controversial PR.  These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value.  While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29471484

Pulled By: bertmaher

fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
2021-06-29 22:22:15 -07:00
c4f718cb72 [nnc] Serialize initialization of LLVM targets (#60996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60996

We've had a bug report of weird LLVM initialization errors, e.g.,
```
Unexpected failure in LLVM JIT: Cannot choose between targets "x86-64" and "x86-64"
```

While I haven't repro'ed that exact message, I did run a stress-test that
compiles on many threads simultaneously, and it deadlocks in
TargetRegistry::lookupTarget.  And in fact I remember debugging this before in
a different system, and finding "Clients are responsible for avoid race
conditions in registration" in
https://llvm.org/doxygen/TargetRegistry_8cpp_source.html.

So yeah, let's lock this thing.
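A minimal sketch of the locking pattern (hypothetical wrapper; the actual LLVM init calls are elided):

```cpp
#include <mutex>

// Funnel the non-thread-safe LLVM target registration through
// std::call_once so concurrent JIT compilations cannot race inside
// TargetRegistry.
void initializeLLVMTargetsOnce() {
  static std::once_flag init_flag;
  std::call_once(init_flag, [] {
    // llvm::InitializeAllTargets();
    // llvm::InitializeAllTargetMCs();
    // llvm::InitializeAllAsmPrinters();
  });
}
```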
ghstack-source-id: 132719018

Test Plan: Heavy multithreaded compilation.  Not sure if it's suitable for landing.

Reviewed By: ZolotukhinM

Differential Revision: D29471343

fbshipit-source-id: b495e468b57e77796a08b627884d3efeca2d1f7c
2021-06-29 22:21:00 -07:00
5bc28c897e fixed launch bounds for gamma_cuda_kernel (#60393)
Summary:
Changed launch bounds for gamma_cuda_kernel from 512 to 256.

Timing data (using Nvidia Titan-V):
![GammaTimingData](https://user-images.githubusercontent.com/22803332/122821464-bc873300-d291-11eb-9be6-2fb690f0d5c7.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60393

Reviewed By: jbschlosser

Differential Revision: D29447926

Pulled By: ngimel

fbshipit-source-id: c2112f9be8ede3bb07cb72f301393f24d17e0c01
2021-06-29 19:22:07 -07:00
b3ec92cf66 BatchNorm: Remove dispatch in parallel region (#60596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60596

Ref #56794

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29446193

Pulled By: ngimel

fbshipit-source-id: 3ebf44a5f1e001e7dc42cd5963752b7e5b9bcbd9
2021-06-29 18:28:46 -07:00
28dc02fe9f Accumulate 16-bit float sums in 32-bit accumulators (#60387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60387

Fixes gh-59489

Using 32-bit accumulators is a win-win: improved precision and improved
performance since the half precision types needed to be converted back and forth
to 32-bit float to do the arithmetic anyway.

Note that on multi-threaded or discontiguous sums, there can be partial sums
stored in the output, so they are necessarily truncated to 16 bits. Fixing this
would require a rework of TensorIterator reductions.
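
An illustration of why the 32-bit accumulator matters (bfloat16 has only 8 mantissa bits, so a bfloat16 accumulator of ones would stall at 256):

```python
import torch

x = torch.ones(4096, dtype=torch.bfloat16)
print(x.sum())                    # tensor(4096., dtype=torch.bfloat16)
print(x.to(torch.float32).sum())  # tensor(4096.) -- float32 reference
```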

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29447187

Pulled By: ngimel

fbshipit-source-id: d0619e0ca2fe116d101460142b79ca56fd6d0840
2021-06-29 17:52:30 -07:00
f54290fd72 Expose raw saved tensors for custom functions (#60551)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60551

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29466228

fbshipit-source-id: 7565f6cc3f2488c7e444cf81c7eb37a60c75b0e8
2021-06-29 17:21:52 -07:00
a469298707 Free space in windows libtorch build (#60849)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60856
Remove more unneeded pre-installed software in the CI image.

verification links
https://app.circleci.com/pipelines/github/pytorch/pytorch/342992/workflows/3f52cacc-ba1c-4093-804f-d4c1b1c0b806/jobs/14436533
https://app.circleci.com/pipelines/github/pytorch/pytorch/342992/workflows/3f52cacc-ba1c-4093-804f-d4c1b1c0b806/jobs/14437351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60849

Reviewed By: mruberry

Differential Revision: D29473637

Pulled By: seemethere

fbshipit-source-id: f33dd98de32a79ba1195481f1bd9f2d5362fe16e
2021-06-29 16:53:10 -07:00
af66356d47 [skip-ci] Bump docker image tag (#60988)
Summary:
This PR bumps the docker image tag for clang-tidy. The new image runs ubuntu-20.04 (and therefore has python3.8 by default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60988

Reviewed By: malfet

Differential Revision: D29469941

Pulled By: 1ntEgr8

fbshipit-source-id: 7268bdb23edff0bc26f275689bf4b1f1ca129df7
2021-06-29 15:23:06 -07:00
8780f8fc3c Remove extraneous process group agent test code (#60903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60903

RPC tests using the process group backend were disabled for CI both internally and externally. This removes the code for process-group-only tests. Faulty agent tests, which also use process group, will be handled in a later PR.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, mrshenli

Differential Revision: D29440674

Pulled By: H-Huang

fbshipit-source-id: 4724c189a110ac821c3f4f6f1f8a5c98e057a2a4
2021-06-29 14:21:56 -07:00
d3de37609f Support fused_dropout with XPU backend (#60231)
Summary:
## Motivation
Enable the fused dropout optimization on XPU devices.

## Solution
Add the XPU device to the fused dropout eligibility check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60231

Reviewed By: jbschlosser

Differential Revision: D29437659

Pulled By: ezyang

fbshipit-source-id: b77245bb53d3ac93ab30a2a85994376ae5928c34
2021-06-29 14:20:17 -07:00
b4a4a8434d [1/n]support double for Caffe2 ScatterWeightedSum (#60402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60402

Add float64 data type support for ScatterWeightedSum, for cases where float32 precision (roughly 7 significant digits, ~10^7) is not sufficient.

Test Plan: buck test caffe2/caffe2/python/operator_test:sparse_ops_test -- testScatterWeightedSum

Reviewed By: jianyuh

Differential Revision: D29190324

fbshipit-source-id: 871a60744694e901a2c7685a67350860745d6729
2021-06-29 14:17:04 -07:00
5f51406a51 Modify error message when atol=0 and rtol=0 (#60897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60897

Fixes #56377
Example output: #60898

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461107

Pulled By: 1ntEgr8

fbshipit-source-id: c6e15b299290aab6f8d5a19011c1d39279673f74
2021-06-29 14:17:02 -07:00
6d952dbaf0 [nnc] Fixed checking for loop carried dependence while fusing 2D reduction loops (#60609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60609

Fixes #60310

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D29386144

Pulled By: navahgar

fbshipit-source-id: 230df4f59d6196a250ea57ff649b117d096fcdbc
2021-06-29 14:17:01 -07:00
b099f5429c Port argmin kernel to structured kernels. (#60364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60364

Tracking issue: #55070

This PR was opened to solve the CI failures on main when merging: #59371 #59372 #59373 #59937 #59938.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29265855

Pulled By: ezyang

fbshipit-source-id: ccee3810940542f8b370596105826c96b32231ec
2021-06-29 14:16:59 -07:00
3e2233841f Port argmax to structured kernels. (#60363)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60363

Tracking issue: #55070

This PR was opened to solve the CI failures on main when merging: #59371 #59372 #59373 #59937 #59938.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29265857

Pulled By: ezyang

fbshipit-source-id: 586914d2aa79028c56988896093945755a2b9781
2021-06-29 14:16:57 -07:00
df47fa5bdc Using meta checks for unary torch.all and torch.any. (#60362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60362

This PR makes use of the newly implemented unified `at::meta::check_reduction` for
validating the inputs and configuring its `TensorIterator`.

This PR was opened to solve the CI failures on main when merging: #59371 #59372 #59373 #59937 #59938.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29265858

Pulled By: ezyang

fbshipit-source-id: e8961b7da65a31acfed5ac3f5c1f5985ae81ec37
2021-06-29 14:16:56 -07:00
0dd90cceaf [package] track storages across lifetime of PackageExporter (#59735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59735

1. Fixes the ABA storage identity problem during serialization for `torch.package` by keeping references to serialized storages for the lifetime of `PackageExporter`, preventing reuse of a memory address. Achieved by extending the logic used to solve the same issue on mobile.
2. Adds determinism to the naming scheme of serialized storages in export code paths that utilize `tensor_cdata_naming_scheme` (introduced a 2nd mapping in `StorageContext`, which now maps `storage cdata ptr` -> `unique id` and `unique id` -> `c10::Storage`)
3. Additionally uses the presence of a storage in the `StorageContext` instance as a marker for whether a storage has been serialized, removing the need to scan the `PythonStreamWriter` for the presence of the storage's serialization file

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29075276

Pulled By: Lilyjjo

fbshipit-source-id: 15a5c30b1de99c5bd7079388f2db9b6ece2eca12
2021-06-29 14:16:54 -07:00
eb2f535689 c10::Storage python to cpp converter and typecast (#59734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59734

Adds typecast logic to allow for c10::Storages to cross the Python/C++ barrier with pyBind

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29075279

Pulled By: Lilyjjo

fbshipit-source-id: 3e67b8525d308c5bccc64438ebac82b4d17ba462
2021-06-29 14:16:52 -07:00
93eba7471b Remove fetch in clang-tidy setup (#60974)
Summary:
This was necessary previously since we'd have to diff against upstream in order to figure out what to run in clang-tidy, but now we pull this from GitHub (https://github.com/pytorch/pytorch/issues/60045), so we can delete this part of the workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60974

Reviewed By: mruberry

Differential Revision: D29466036

Pulled By: driazati

fbshipit-source-id: a9d619ab731e77bc69ab32b37cfb2c249e22a477
2021-06-29 14:15:34 -07:00
91c076eadc Add TorchVitals for DataLoader (#60959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60959

Add TorchVitals for DataLoader; this indicates that the data loader was enabled.

This is a no-op if TORCH_VITALS environment variable is not set.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex vitals

Reviewed By: VitalyFedyunin

Differential Revision: D29445146

fbshipit-source-id: d5778fff3dafb3c0463fec7a498bff4905597518
2021-06-29 14:08:32 -07:00
652d911f81 add BFloat16 support for LayerNorm CPU (#55210)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55210

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836793

Pulled By: VitalyFedyunin

fbshipit-source-id: 998298deedd7a18e45fb761a0a4e0d88b65f2e0c
2021-06-29 14:08:30 -07:00
89d0e31fe5 [torch][repeat_interleave] Remove stream sync when output_size is given for scalar repeats (#60965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60965

Same as title. Simple change to tensor creation.
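
A minimal sketch of the call this change speeds up, assuming a CUDA device is available (example values are illustrative):

```python
import torch

x = torch.arange(4, device='cuda')
# with a scalar repeat count, passing output_size lets the op skip the
# device-to-host sync otherwise needed to compute the output length
y = torch.repeat_interleave(x, 3, output_size=12)
```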

Test Plan: Rely on existing signals and verify manually that sync is not happening.

Reviewed By: ngimel

Differential Revision: D29461773

fbshipit-source-id: 21d6ebfba08449da39fc7f109958f6c6978a4f32
2021-06-29 14:08:28 -07:00
086f6e557e Fix divide by zero error in the ASAN test (#60723)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60723

Reviewed By: jbschlosser

Differential Revision: D29432147

Pulled By: albanD

fbshipit-source-id: c82cd0df8e4a04ee561ca26ae821a8b61c13a698
2021-06-29 14:07:26 -07:00
ec9c03c234 Implemented torch.cov (#58311)
Summary:
Based from https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov`, similar to `numpy.cov`. For simplicity, we removed support for several `numpy.cov` parameters that are either redundant (such as `bias`) or have simple workarounds (such as `y` and `rowvar`).
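
A minimal usage sketch (my own example; the `rowvar` workaround mirrors the reasoning above):

```python
import torch

obs = torch.randn(100, 3)  # 100 observations of 3 variables (columns)
# workaround for the removed rowvar=False: transpose so rows are variables
c = torch.cov(obs.t())     # 3x3 covariance, like np.cov(obs, rowvar=False)
```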

cc PandaBoi

closes https://github.com/pytorch/pytorch/issues/19037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: jbschlosser

Differential Revision: D29431651

Pulled By: heitorschueroff

fbshipit-source-id: 167dea880f534934b145ba94291a9d634c25b01b
2021-06-29 14:02:39 -07:00
8f658d537d Improved JIT support for torch.einsum (#59265)
Summary:
Added JIT support for the vararg version of `torch.einsum`. Note that JIT does not support Python's Ellipsis object (`...`).
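
A sketch of what now scripts, assuming the vararg form refers to passing operands directly rather than in a list (example shapes are illustrative):

```python
import torch

@torch.jit.script
def contract(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # operands passed as varargs; '...' would not script, per the note above
    return torch.einsum('ij,jk->ik', a, b)

print(contract(torch.randn(2, 3), torch.randn(3, 4)).shape)  # torch.Size([2, 4])
```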

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59265

Reviewed By: VitalyFedyunin

Differential Revision: D29328469

Pulled By: heitorschueroff

fbshipit-source-id: 5e4b177fda93255251f45d735b00c08220f0f124
2021-06-29 14:01:21 -07:00
d46eb77b04 Improve CUDA extension building error/warning messages (#59665)
Summary:
See https://github.com/pytorch/pytorch/issues/55267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59665

Reviewed By: mruberry

Differential Revision: D29462248

Pulled By: ezyang

fbshipit-source-id: 9de13a284a14a7cd24200b9684151ce652e1eb1e
2021-06-29 13:03:30 -07:00
12b63f4046 [DDP] Fix case where new tensors with no grad_fn are returned in DDP forward. (#60882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882

Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR that resulted in DDP no longer
supporting cases where newly created tensors that don't have a grad_fn are
returned. The result is that the grad_fn gets set to that of the `DDPSink`
custom backward, which causes errors during the backward pass.

This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
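
A sketch of the forward pattern this re-enables (hypothetical module, not taken from the PR):

```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        out = self.linear(x)    # has a grad_fn as usual
        flag = torch.zeros(1)   # freshly created, grad_fn is None
        return out, flag        # DDP now leaves flag's grad_fn untouched
```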
ghstack-source-id: 132632515

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29423822

fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
2021-06-29 12:50:48 -07:00
1db2d9b0a8 [ProcessGroupNCCL] change WARNING to INFO (#60901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60901

Short-term fix to address
https://github.com/pytorch/pytorch/issues/60752 . Longer-term fix is tracked here:
https://github.com/pytorch/pytorch/issues/53658 and will involve detecting
whether the user has called `torch.cuda.set_device` in their script and
respecting that device if so, otherwise falling back to our current approach.
ghstack-source-id: 132637336

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29439322

fbshipit-source-id: 92a18fadbb514b1c029332b60fd48075874906ff
2021-06-29 12:46:47 -07:00
150c828803 Add lint rule to keep collect_env.py python2 compliant (#60946)
Summary:
Fixes T94400857

- [x] Add lint rule
- [x] Verify lint rule works
- [x] Fix torch/utils/collect_env.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60946

Reviewed By: malfet, mruberry

Differential Revision: D29457294

Pulled By: rsemenov

fbshipit-source-id: 3c0670408d7aee1479e1de335291deb13a04ace9
2021-06-29 11:57:53 -07:00
808d0e3353 [caffe2] update make_mnist_db and make_image_db to move strings into DB::Put() (#60919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60919

Update make_mnist_db.cc and make_image_db.cc to work with the DB API changes
in D29204425 (00896cb9ed).  This is similar to the changes to make_cifar_db.cc landed in
D29374754 (394f60b0fc).
ghstack-source-id: 132621346

Test Plan: buck build caffe2/binaries/...

Reviewed By: valmikir

Differential Revision: D29447314

fbshipit-source-id: 33aff85c24d8b785211287de23d46704c7eb0726
2021-06-29 11:52:43 -07:00
fab1b6cc70 .github: Increase test shards for linux GPU (#60914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60914

Linux GPU tests are taking almost 4 hours to execute, so let's increase the
test shards for these jobs so they finish in a more timely fashion.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29461968

Pulled By: seemethere

fbshipit-source-id: a1eab08f9cd3abd8ceca48871fe702d0bccd8a3f
2021-06-29 10:44:01 -07:00
5fbca0d281 Use cpu docker image for cpu builds (#60920)
Summary:
This was set to use the [CUDA 10.0 image](https://hub.docker.com/r/pytorch/manylinux-cuda100) which hasn't been updated in quite a while, so fix it to use the up-to-date [cpu image](https://hub.docker.com/r/pytorch/manylinux-cpu) instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60920

Reviewed By: janeyx99

Differential Revision: D29447897

Pulled By: driazati

fbshipit-source-id: 6e89091110361d0ddda859bb266e229c6cf83c2d
2021-06-29 10:11:55 -07:00
10b929bbfb Make Jeff and Jithun .circleci/docker code owners (#60958)
Summary:
Following up on https://github.com/pytorch/pytorch/pull/60658#issuecomment-870681027.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60958

Reviewed By: 1ntEgr8

Differential Revision: D29460721

Pulled By: samestep

fbshipit-source-id: 74badff6c4a17b3ff48dc2fc27d1faa9edeae097
2021-06-29 09:47:58 -07:00
53489bc385 fix for #60319 , forcing to use fork as start method in test/test_dat… (#60868)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/60319, forcing use of fork as the start method in test/test_dataloader.py.

Fixes #60319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60868

Reviewed By: mruberry

Differential Revision: D29432876

Pulled By: ejguan

fbshipit-source-id: 5da25f7cfaf8ea0803c0b1aacf2badd656799e16
2021-06-29 09:30:37 -07:00
4310044fec update unsafe flag documentation (#60899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60899

modify documentation for `unsafe` flag in `parametrize.py`
ghstack-source-id: 132591862

Test Plan:
shouldn't modify code behavior but as a double check,
`buck test mode/dev-nosan //caffe2/test:nn -- --exact 'caffe2/test:nn - test_register_and_remove_parametrization (test_nn.TestNN)'`

https://pxl.cl/1L1fw

Reviewed By: albanD

Differential Revision: D29436688

fbshipit-source-id: 85499ad22b49ad992507b9ed5e7def8231cbfeba
2021-06-29 09:25:37 -07:00
5b6818f08a [Model Averaging] Enforce a synchronization before allreduce parameters (#60891)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891

This fix is particularly useful for local SGD when the averaging period is very small, which may cause a conflict between the gradient allreduce within the per-machine subgroup and the global parameter allreduce by the communication world.
ghstack-source-id: 132564252

Test Plan:
f281873295 (#Try1) failed due to the conflict between global process group and subgroup.
```
<Thread(configerator-monitor-singleton, started 139839806633728)>
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop
    self._parent_thread.join(self._interval_ms / 1000)
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```

Fixed after adding an explicit sync: f282044866, f282241800

Reviewed By: rohan-varma

Differential Revision: D29434597

fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33
2021-06-29 01:39:40 -07:00
fbd4cb1cd7 Fix error logging in common_distributed. (#60917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60917

The second line of the error log didn't handle the f-string properly.

Before fix:
```
exiting process with exit code: {MultiProcessTestCase.TEST_ERROR_EXIT_CODE}
```

After fix:
```
exiting process 3 with exit code: 10
```
ghstack-source-id: 132618199

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D29446574

fbshipit-source-id: f806ef0470cb6aa86fe3c404e1c895514abb6488
2021-06-28 19:32:17 -07:00
d71e7ae740 [PyTorch][vulkan] Unify vtensor_from_vulkan to always return non-const ref (#59996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59996

Just like D28811477 (dce8697aea), there's no reason we can't give it this signature.
ghstack-source-id: 132566618

Test Plan: CI

Reviewed By: AshkanAliabadi

Differential Revision: D29119070

fbshipit-source-id: d049d49c38099eef6c96e8f69909827e64376097
2021-06-28 19:25:13 -07:00
7eef78597e fixed launch bounds for grid sampler 3d (#60385)
Summary:
Changed launch bounds for grid_sampler_3d from 1024 to 512 and grid_sampler_3d_backward from 1024 to 256.

Timing data (using Nvidia Titan-V):
![GridSampler3dTimingData](https://user-images.githubusercontent.com/22803332/122813457-d3c12300-d287-11eb-99c1-6572f539660f.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60385

Reviewed By: jbschlosser

Differential Revision: D29433741

Pulled By: ngimel

fbshipit-source-id: 7f475d0c2e854ae65dd0f1fb0167dfae7e506ec9
2021-06-28 19:01:38 -07:00
d36ce61a5e use explicitly non-returning GPU atomics (#60607)
Summary:
Enables an important performance optimization for ROCm, in light of the discussion in https://github.com/pytorch/pytorch/issues/41028.

CC jithunnair-amd sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60607

Reviewed By: jbschlosser

Differential Revision: D29409894

Pulled By: ngimel

fbshipit-source-id: effca258a0f37eaefa35674a7fd19459ca7dc95b
2021-06-28 18:17:29 -07:00
d62c3ea354 [skip ci] Add GitHub Actions label for g3.16xlarge (#60888)
Summary:
Prerequisite for https://github.com/pytorch/pytorch/issues/60221.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60888

Reviewed By: seemethere

Differential Revision: D29436592

Pulled By: samestep

fbshipit-source-id: b3254139ec9c46c533f8f951a9ede3b372a65536
2021-06-28 15:49:52 -07:00
d5a44f9f12 Use expecttest from PyPI (#60658)
Summary:
This PR removes `torch/testing/_internal/expecttest.py` in favor of https://github.com/ezyang/expecttest. See also https://github.com/ezyang/ghstack/pull/71.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60658

Test Plan: CI.

Reviewed By: ezyang

Differential Revision: D29430763

Pulled By: samestep

fbshipit-source-id: b7cdc7ba37330176149fd465312118e2254ae92e
2021-06-28 15:43:34 -07:00
ddb1f293b6 Fix the NNC-disabled path in static runtime for perf comparisons
Summary:
The path which has NNC/LLVM disabled still constructs a tensor
expression, even though `supports()` will always return false, so a
`KernelScope` is necessary to manage those memory allocations.

I guess we could avoid building the TEs at all in this case, but it's pretty
clean this way.

Test Plan:
```
scripts/bertrand/static_runtime/run.sh
```

Reviewed By: hlu1

Differential Revision: D29415909

fbshipit-source-id: dde43de8516b9a2cf9f5f7f3699962bf9ccd8c30
2021-06-28 15:39:07 -07:00
9b94aa5356 [quant][fx][fix] Fused modules with object_type in qconfig (#60779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60779

When we do fusion, we replace certain modules (such as Linear + ReLU) with fused versions (such as LinearReLU) by calling `_fuse_fx` in prepare_fx. However, when we try to look up the fused module type in qconfig_dict, we cannot find a match anymore, since the qconfig dict contains the original module types. An example is here [N882873](https://fburl.com/anp/azenjx3v).

So we now update the qconfig_dict so that fused modules map to the qconfigs used for the modules that make them up. If those modules are not mapped to the same qconfig, we raise an error.
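
An illustrative qconfig_dict for this situation (a sketch assuming the fbgemm default qconfig; not taken from the PR):

```python
import torch
from torch.quantization import get_default_qconfig

qconfig = get_default_qconfig("fbgemm")
# Linear + ReLU fuse into LinearReLU; both entries must map to the same
# qconfig, otherwise an error is now raised
qconfig_dict = {"object_type": [
    (torch.nn.Linear, qconfig),
    (torch.nn.ReLU, qconfig),
]}
```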

Test Plan:
`python test/test_quantization.py TestFuseFx.test_qconfig_fused_module`

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29406941

fbshipit-source-id: 74b5db89f4998aeb02b2bf7c37bf97326580c654
2021-06-28 15:22:22 -07:00
cyy
cadce14e02 don't return in __init__ functions (#60830)
Summary:
Fix some warnings from a code analyzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60830

Reviewed By: jbschlosser

Differential Revision: D29433638

Pulled By: albanD

fbshipit-source-id: 148df1d8a0a79778f18e8b6abffbddef36c5031c
2021-06-28 14:56:13 -07:00
9af8aecd00 [caffe2/libtorch] Remove already-owned source
Summary:
This source is already owned by a more fine-grained rule, so avoid a
package boundary violation by having it also be owned by an outer
rule.

Test Plan: CI

Reviewed By: aniketmathur

Differential Revision: D29422794

fbshipit-source-id: 432accc969abcb4d56bd97341a07029926939ea0
2021-06-28 14:45:34 -07:00
eeea696c02 [caffe2] Fix include of corresponding header
Summary:
AFAICT, this include was a typo, and meant to be the corresponding
header for this .cpp, but instead pulled in an unrelated header.

Test Plan: CI

Reviewed By: igorsugak

Differential Revision: D29422993

fbshipit-source-id: cc9bb29ee1f1007b68c6666ea8e389f6f39928af
2021-06-28 14:45:32 -07:00
c3977bf3da [caffe2/utils] Add some fine-grained rules to avoid package boundary violations
Test Plan: CI

Reviewed By: igorsugak

Differential Revision: D29401295

fbshipit-source-id: e921e5578c1fcc8df6bd670ae9f95722b8e32d85
2021-06-28 14:45:30 -07:00
03de807d81 [caffe2/utils] Add explicit rule to avoid package boundary violation (#60677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60677

Add a rule to wrap conversions.h and depend on that, rather than
relying on a glob which violates package boundaries.

Test Plan: `buck2 build fbcode//caffe2/caffe2:caffe2_core`

Reviewed By: mzlee

Differential Revision: D29370841

fbshipit-source-id: d4dd383eb8457d4f5118574e34e6f17c32fde647
2021-06-28 14:43:30 -07:00
41c380e649 Enable bionic-cuda10.2-cudnn7-py3.9-gcc7 in GHA (#60204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60204

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29430679

Pulled By: samestep

fbshipit-source-id: 9380f5535cd370ec7aabf609a6170c8cb4df505d
2021-06-28 13:08:36 -07:00
971cdafd15 Upgrade benchmark to v1.5.5 (#60750)
Summary:
This fixes the build for gcc 11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60750

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D29394541

Pulled By: dreiss

fbshipit-source-id: 61557431b52a3e898ffcc32f97133b3ea94a838f
2021-06-28 13:03:03 -07:00
007ba37c9a [pruning] Speedup activation reconstruction (#60683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60683

Vectorized reconstruction without for loops

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1KSJQ

Reviewed By: z-a-f

Differential Revision: D29370805

fbshipit-source-id: 75402437654a0b6f6391c8590bbe3f6fe3f43d8f
2021-06-28 12:58:21 -07:00
f302e0c781 [pruning] Additional pruning tests (#60681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60681

Adding additional pruning tests for more complex models and more pruned rows

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1KQ2Z

Reviewed By: z-a-f

Differential Revision: D29347546

fbshipit-source-id: cb65e564dd46d24f4aca1b00dd915ee8d64f8318
2021-06-28 12:58:20 -07:00
8d4a6ef962 [pruning] Activation reconstruction (#60292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60292

Added activation reconstruction in the `reconstruct` method

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1KLl1

Reviewed By: z-a-f

Differential Revision: D29236569

fbshipit-source-id: 1ad085f4143eb9fa3efca51e00d810e0fdb7e9b1
2021-06-28 12:58:18 -07:00
965dad25a5 Allow resizing of parametrized tensors (#60418)
Summary:
Modify `parametrize.py` to allow resizing of parametrized tensors
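
For context, a minimal sketch of the parametrization API whose tensors can now be resized (the `Symmetric` module is my own example):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)

layer = nn.Linear(4, 4)
parametrize.register_parametrization(layer, "weight", Symmetric())
print(layer.weight)  # recomputed through the parametrization on access
```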

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60418

Test Plan:
`buck test mode/dev-nosan //caffe2/test:nn -- --exact 'caffe2/test:nn - test_register_and_remove_parametrization (test_nn.TestNN)'`

https://pxl.cl/1L0wh

Reviewed By: z-a-f

Differential Revision: D29279442

Pulled By: kazhou

fbshipit-source-id: 4d94915748f896e7761a40ad18f4c6444f505c3a
2021-06-28 12:57:11 -07:00
956faea585 [fix] cauchy sampling inf on cuda (#60186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59144

As pointed out by ngimel, the issue is indeed with calling `tan`.

However, the C++ `std::tan` [documentation](https://en.cppreference.com/w/cpp/numeric/math/tan) states that

```
The function has mathematical poles at π(1/2 + n); however no common floating-point representation
is able to represent π/2 exactly, thus there is no value of the argument for which a pole error occurs.
```

All of `torch.tan`, `numpy.tan` and `math.tan` are compliant with the above statement.

<details>

```python
import torch
import math
import numpy as np

# Single Precision
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.float32) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float32) * 0.5))

# Double Precision
print(math.tan(math.pi * 0.5))
print(torch.tan(torch.tensor(math.pi, device='cuda', dtype=torch.double) * 0.5))
print(np.tan(np.array(np.pi, dtype=np.float64) * 0.5))
```

Output
```
tensor(-22877334., device='cuda:0')
-22877332.42885646
1.633123935319537e+16
tensor(1.6331e+16, device='cuda:0', dtype=torch.float64)
1.633123935319537e+16
```

</details>

So this issue stems from the use of `__tanf`, a faster approximation of tan from the CUDA library (for float16, bfloat16 and float).

8a839c5478/aten/src/ATen/NumericUtils.h (L91-L100)

The fix in the PR is to use the **slower** but more correct version.

Benchmark:
```
[ cauchy : input dtype torch.float16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.2
      (2, 512, 256)          |    3.8   |    4.2
      (2, 64, 256, 128)      |   22.8   |   29.6
      (4, 2, 512, 256, 128)  |  649.6   |  869.3

Times are in microseconds (us).

[ cauchy : input dtype torch.bfloat16 device cuda ]
                             |  Before  |  After
1 threads: -------------------------------------
      (128,)                 |    3.8   |    4.3
      (256, 128)             |    3.8   |    4.3
      (2, 512, 256)          |    3.8   |    4.3
      (2, 64, 256, 128)      |   23.8   |   30.8
      (4, 2, 512, 256, 128)  |  682.5   |  904.2

Times are in microseconds (us).

[ cauchy : input dtype torch.float32 device cuda ]
                             |  Before  |  After
1 threads: --------------------------------------
      (128,)                 |     3.8  |     4.2
      (256, 128)             |     3.7  |     4.2
      (2, 512, 256)          |     3.7  |     4.2
      (2, 64, 256, 128)      |    35.3  |    37.1
      (4, 2, 512, 256, 128)  |  1020.0  |  1058.3

Times are in microseconds (us).

[ cauchy : input dtype torch.float64 device cuda ]
                             |   Before  |   After
1 threads: ----------------------------------------
      (128,)                 |      3.8  |      4.2
      (256, 128)             |      8.0  |      8.0
      (2, 512, 256)          |     46.0  |     46.0
      (2, 64, 256, 128)      |    669.2  |    669.4
      (4, 2, 512, 256, 128)  |  21255.0  |  21262.1

Times are in microseconds (us).
```

<details>

Benchmark Script:
```python
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

print('Using pytorch %s' % (torch.__version__))

cuda_shapes = [(128,), (256, 128), (2, 512, 256), (2, 64, 256, 128), (4, 2, 512, 256, 128)]
cuda_dtypes = [torch.half, torch.bfloat16, torch.float, torch.double]
results = []
repeats = 10

for device in ['cuda']:
    dtypes = cuda_dtypes
    shapes = cuda_shapes

    for dtype in dtypes:
        for shape in shapes:
            t = torch.randn(shape, device=device, dtype=dtype) * 10

            tasks = [("t.cauchy_()", "After", "")]
            timers = [Timer(stmt=stmt, label=f"cauchy : input dtype {dtype} device {device}", sub_label=f"{(shape)}", description=desc, globals=globals()) for stmt, desc, label in tasks]

            for i, timer in enumerate(timers * repeats):
                results.append(
                    timer.blocked_autorange()
                )
                print(f"\r{i + 1} / {len(timers) * repeats}", end="")
                sys.stdout.flush()

with open('after-pr.pkl', 'wb') as f:
    pickle.dump(results, f)

comparison = Compare(results)
comparison.print()
```

Compare Script:
```
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle

with open('before-pr.pkl', 'rb') as f:
    before_results = pickle.load(f)

with open('after-pr.pkl', 'rb') as f:
    after_results = pickle.load(f)

comparison = Compare(after_results + before_results)
comparison.print()
```

</details>

TODO:
* [x] Add comment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60186

Reviewed By: jbschlosser

Differential Revision: D29433897

Pulled By: ngimel

fbshipit-source-id: 9c5f14b83e3372bed72369f70eed9256c04385c6
2021-06-28 12:49:30 -07:00
70e205a2ab Use the new URL for docs preview link (#60893)
Summary:
This is all set up on CloudFront now with a custom domain, so we don't need the long default cloudfront domain anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60893

Reviewed By: malfet

Differential Revision: D29437300

Pulled By: driazati

fbshipit-source-id: 6f5ffd1b10c5167b0022b7e64b2164508624ca91
2021-06-28 12:45:04 -07:00
f5e5ced202 Enable parallel clang-tidy on ec2 runner (#60870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60870

This PR makes `clang-tidy` run on our self-hosted runner in a parallel fashion.

Fixes #60867

Test Plan: #60871

Reviewed By: jbschlosser

Differential Revision: D29434240

Pulled By: 1ntEgr8

fbshipit-source-id: cead30ed718ddf5e14b13afe70cb209aa16b44a0
2021-06-28 11:45:44 -07:00
c8fb785857 Print stdout and stderr to console on parallel runs (#60869)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60869

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29434155

Pulled By: 1ntEgr8

fbshipit-source-id: 925c9d832775dbb710af9367c07962f3367fda38
2021-06-28 11:29:12 -07:00
a8057e7ef1 docs: add permute in torch docs (#60821)
Summary:
fix https://github.com/pytorch/pytorch/issues/60181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60821

Reviewed By: VitalyFedyunin

Differential Revision: D29431949

Pulled By: jbschlosser

fbshipit-source-id: 2353afceaa188315cde1f0c955897c4750809c8e
2021-06-28 11:20:35 -07:00
d7c58e5a04 [vulkan] Implement tanh activation function (#60695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60695

As title. Implement tanh in Vulkan.

Test Plan:
Build Pytorch repository with the build command in P425131222.

Run test command `pytorch/build/bin/vulkan_api_test`

Output:

{F627752306}

Reviewed By: SS-JIA

Differential Revision: D29375071

fbshipit-source-id: 2d613a9542774719dd78524757a677e3b2450c74
2021-06-28 10:58:44 -07:00
da70dd199d [quant] Input-Weight Equalization - tests (#60378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60378

Created the following unit tests to check that our equalization algorithm behaves as expected:
- Check that the equalization scales calculated and stored in the graph are as expected
- Check that the scaled weights and biases are as expected
- Check that the min/max values in the quantization observers are as expected
- Check that the graphs with equalization are structured in the same way as graphs without equalization (except that equalized graphs have additional equalization scale and mul nodes) before and after quantization

Test Plan:
`python test/test_quantization TestEqualizeFx.test_input_weight_equalization_equalization_scales`
`python test/test_quantization TestEqualizeFx.test_input_weight_equalization_weights_bias`
`python test/test_quantization TestEqualizeFx.test_input_activation_values`
`python test/test_quantization TestEqualizeFx.test_input_weight_equalization_graphs`

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29406942

fbshipit-source-id: 518208546ae5835c1ebb2af217507e90af66fbe4
2021-06-28 10:44:29 -07:00
dfb9c0bae8 [quant] Input-Weight Equalization - support for connected F.linear layer (#60272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60272

Test Plan:
`python test/test_quantization.py TestEqualizeFx`

Original model:
```
FunctionalLinear2Module(
  (linear1): Linear()
  (linear2): Linear()
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_w_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_0_equalization_process_0](args = (%linear1_w_activation_post_process_0,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0_equalization_process_0, %linear1_w_activation_post_process_0_equalization_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    %linear_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0_equalization_process_0](args = (%linear_activation_post_process_0,), kwargs = {})
    %linear2_w : [#users=1] = get_attr[target=linear2.w]
    %linear2_w_activation_post_process_0 : [#users=1] = call_module[target=linear2_w_activation_post_process_0](args = (%linear2_w,), kwargs = {})
    %linear2_w_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear2_w_activation_post_process_0_equalization_process_0](args = (%linear2_w_activation_post_process_0,), kwargs = {})
    %linear2_b : [#users=1] = get_attr[target=linear2.b]
    %linear_1 : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%linear_activation_post_process_0_equalization_process_0, %linear2_w_activation_post_process_0_equalization_process_0), kwargs = {bias: %linear2_b})
    %linear_1_activation_post_process_0 : [#users=1] = call_module[target=linear_1_activation_post_process_0](args = (%linear_1,), kwargs = {})
    return linear_1_activation_post_process_0
```

Graph after equalization steps:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0, %linear1_w_activation_post_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    %linear2_w : [#users=1] = get_attr[target=linear2.w]
    %linear2_w_activation_post_process_0 : [#users=1] = call_module[target=linear2_w_activation_post_process_0](args = (%linear2_w,), kwargs = {})
    %linear2_b : [#users=1] = get_attr[target=linear2.b]
    %linear_1 : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%linear_activation_post_process_0, %linear2_w_activation_post_process_0), kwargs = {bias: %linear2_b})
    %linear_1_activation_post_process_0 : [#users=1] = call_module[target=linear_1_activation_post_process_0](args = (%linear_1,), kwargs = {})
    return linear_1_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_scale0 : [#users=1] = get_attr[target=x_equalization_scale0]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_scale0), kwargs = {})
    %linear1_input_scale_0 : [#users=1] = get_attr[target=linear1_input_scale_0]
    %linear1_input_zero_point_0 : [#users=1] = get_attr[target=linear1_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear1_input_scale_0, %linear1_input_zero_point_0, torch.quint8), kwargs = {})
    %linear1_packed_weight_0 : [#users=1] = get_attr[target=linear1_packed_weight_0]
    %linear1_scale_0 : [#users=1] = get_attr[target=linear1_scale_0]
    %linear1_zero_point_0 : [#users=1] = get_attr[target=linear1_zero_point_0]
    %linear : [#users=1] = call_function[target=torch.ops.quantized.linear](args = (%quantize_per_tensor, %linear1_packed_weight_0, %linear1_scale_0, %linear1_zero_point_0), kwargs = {})
    %linear2_packed_weight_0 : [#users=1] = get_attr[target=linear2_packed_weight_0]
    %linear2_scale_0 : [#users=1] = get_attr[target=linear2_scale_0]
    %linear2_zero_point_0 : [#users=1] = get_attr[target=linear2_zero_point_0]
    %linear_1 : [#users=1] = call_function[target=torch.ops.quantized.linear](args = (%linear, %linear2_packed_weight_0, %linear2_scale_0, %linear2_zero_point_0), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear_1,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29267218

fbshipit-source-id: 6b97bed1a307f1d0b1f5efcbecf41f35418242f7
2021-06-28 10:44:27 -07:00
ddf2ce03bb [quant] Input-Weight Equalization - support for connected linear layers (#60034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60034

Added support for equalizing models with connected linear
layers. To account for connected linear layers, we additionally
multiply the previous weight values (row-wise) by the next equalization
scale, and remove the input equalization observer between the two linear
layers. We also scale the bias by the next equalization scale.
The math is shown here: https://fb.quip.com/fK8rA9aRM4ca
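
A numeric sketch of the rescaling described above (simplified; the per-output-channel shape of the scale is my assumption):

```python
import torch

W1 = torch.randn(4, 8)    # linear1 weight (out_features x in_features)
b1 = torch.randn(4)       # linear1 bias
s = torch.rand(4) + 0.5   # next layer's equalization scale, one per channel

W1_scaled = W1 * s.unsqueeze(1)  # multiply previous weights row-wise
b1_scaled = b1 * s               # scale the bias by the same factor
```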

Original Model: `x -> linear1 -> linear2`
After `prepare_fx`: `x -> InpEqObs -> InpQuantObs -> linear1 ->
OutQuantObs -> InpEqObs -> linear2`
After equalization: `x -> mul -> InpQuantObs -> linear1 -> OutQuantObs
-> linear2`

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
Linear2Module(
  (linear1): Linear(in_features=2, out_features=2, bias=True)
  (linear2): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=x_activation_post_process_0_equalization_process_0](args = (%x_activation_post_process_0,), kwargs = {})
    %linear1 : [#users=1] = call_module[target=linear1](args = (%x_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear1_activation_post_process_0 : [#users=1] = call_module[target=linear1_activation_post_process_0](args = (%linear1,), kwargs = {})
    %linear1_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear1_activation_post_process_0_equalization_process_0](args = (%linear1_activation_post_process_0,), kwargs = {})
    %linear2 : [#users=1] = call_module[target=linear2](args = (%linear1_activation_post_process_0_equalization_process_0,), kwargs = {})
    %linear2_activation_post_process_0 : [#users=1] = call_module[target=linear2_activation_post_process_0](args = (%linear2,), kwargs = {})
    return linear2_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0_equalization_process_0_scale : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_activation_post_process_0_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_0](args = (%mul,), kwargs = {})
    %linear1 : [#users=1] = call_module[target=linear1](args = (%x_activation_post_process_0,), kwargs = {})
    %linear1_activation_post_process_0 : [#users=1] = call_module[target=linear1_activation_post_process_0](args = (%linear1,), kwargs = {})
    %linear2 : [#users=1] = call_module[target=linear2](args = (%linear1_activation_post_process_0,), kwargs = {})
    %linear2_activation_post_process_0 : [#users=1] = call_module[target=linear2_activation_post_process_0](args = (%linear2,), kwargs = {})
    return linear2_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_activation_post_process_0_equalization_process_0_scale : [#users=1] = get_attr[target=x_activation_post_process_0_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_activation_post_process_0_equalization_process_0_scale), kwargs = {})
    %linear1_input_scale_0 : [#users=1] = get_attr[target=linear1_input_scale_0]
    %linear1_input_zero_point_0 : [#users=1] = get_attr[target=linear1_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear1_input_scale_0, %linear1_input_zero_point_0, torch.quint8), kwargs = {})
    %linear1 : [#users=1] = call_module[target=linear1](args = (%quantize_per_tensor,), kwargs = {})
    %linear2 : [#users=1] = call_module[target=linear2](args = (%linear1,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear2,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29204347

fbshipit-source-id: 6bb9e25e2468f50df523885ded2edc731f002ac1
2021-06-28 10:44:25 -07:00
7917318917 [quant] Input-Weight Equalization - support for F.linear layers (#59964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59964

Input-Weight Equalization support for functional layers

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original model:
```
FunctionalLinearModule(
  (linear1): Linear()
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_equalization_process_0 : [#users=1] = call_module[target=linear1_w_equalization_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_00](args = (%linear1_w_equalization_process_0,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0, %linear1_w_activation_post_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_equalization_process_0 : [#users=1] = call_module[target=linear1_w_equalization_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_00](args = (%linear1_w_equalization_process_0,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0, %linear1_w_activation_post_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear1_input_scale_0 : [#users=1] = get_attr[target=linear1_input_scale_0]
    %linear1_input_zero_point_0 : [#users=1] = get_attr[target=linear1_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear1_input_scale_0, %linear1_input_zero_point_0, torch.quint8), kwargs = {})
    %linear1_packed_weight_0 : [#users=1] = get_attr[target=linear1_packed_weight_0]
    %linear1_scale_0 : [#users=1] = get_attr[target=linear1_scale_0]
    %linear1_zero_point_0 : [#users=1] = get_attr[target=linear1_zero_point_0]
    %linear : [#users=1] = call_function[target=torch.ops.quantized.linear](args = (%quantize_per_tensor, %linear1_packed_weight_0, %linear1_scale_0, %linear1_zero_point_0), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29135459

fbshipit-source-id: 1e69bfbb82a0c89538e55b64968effd0b11b2fde
2021-06-28 10:44:24 -07:00
387289d4a5 support non-contiguous tensor in bilinear (#38409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38409

Reviewed By: anjali411

Differential Revision: D29361043

Pulled By: albanD

fbshipit-source-id: 05147a9b0f7a47204bcd5ff70e281a464e8de1e6
2021-06-28 10:43:21 -07:00
f118d20bea Make requires grad check run only when grad mode is enabled (#60740)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60740

Reviewed By: ngimel

Differential Revision: D29405934

Pulled By: albanD

fbshipit-source-id: 35c537939a3871f5a0d2146543506e4d07465724
2021-06-28 10:40:30 -07:00
3ad3f20bff Add an optional Device parameter to pin_memory/is_pinned that does nothing (#60201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60201

This is to flush out BC/FC problems with adding this parameter. A later
PR will actually add the desired functionality.
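
A sketch of the new, currently no-op argument (assuming a CUDA build, since pinning requires one):

```python
import torch

t = torch.randn(4)
p1 = t.pin_memory()                      # existing call
p2 = t.pin_memory(torch.device('cuda'))  # new optional arg, ignored for now
print(p2.is_pinned(torch.device('cuda')))  # same result as p2.is_pinned()
```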

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29331880

Pulled By: ezyang

fbshipit-source-id: 6036716d6ae55e6ea7ef2348b6c34a39613c8dd5
2021-06-28 10:38:52 -07:00
85af24f52b Remove some unnecessary functions from CUDAHooks (#59655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59655

CUDAHooks is to be used solely when you need to call into CUDA
functionality from a context where you cannot directly link to
CUDA libraries. Neither hasPrimaryContext nor
getDevceIndexWithPrimaryContext (sic) needs to be used in such
contexts. By moving them out of CUDAHooks and calling them
directly, a dynamic dispatch can be skipped.

I also fixed the typo in getDev(i)ceIndexWithPrimaryContext

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28972946

Pulled By: ezyang

fbshipit-source-id: edcd7a7b62aec97928f07fbf3bf413b9fb027517
2021-06-28 10:38:51 -07:00
b52849b589 Port silu_backward to structured (#58661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58661

I removed dispatch: CompositeImplicitAutograd: math_silu_backward.
This is definitely not right, but I don't know how it works with structured kernels;
keeping it would trigger an assertion failure:

```
assert dispatch.keys() != {DispatchKey.CompositeImplicitAutograd}, \
    f"unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected {cpp.name(func)} " \
    f"but got {dispatch[DispatchKey.CompositeImplicitAutograd]}.  Rename your implementation to the expected " \
    "name, then delete the dispatch table"
```

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572530

Pulled By: ezyang

fbshipit-source-id: 410f03bddf79cda7c9f0fd66f697383ee2925d32
2021-06-28 10:37:45 -07:00
66f01db36c Make some comparisons explicit (#60505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60505

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29313240

fbshipit-source-id: 3f558e68cbb0328326d7540e2b3bd0c2e12ba3e2
2021-06-28 10:33:59 -07:00
f5341bd5e6 Enhance ProcessGroupWrapper with additional checks + refactor (#60237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237

Closes https://github.com/pytorch/pytorch/issues/58711

This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a function `serialize_fingerprint` to serialize all this data into a single tensor and only verify that.

This has the benefit of being a lot more extensible: the developer does not need to add separate `all_gather` calls in order to verify additional data in the future. We could also provide a mechanism where data that needs to be verified is "registered" in the `CollectiveFingerPrint` struct, making it even easier to add additional data; we can consider doing this if there are significant additions to `ProcessGroupWrapper`.

We now also begin to check tensor `dtypes` and device types for consistency as well. Tests are refactored/added accordingly.
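
An illustrative Python sketch of the idea; the real implementation is C++ inside `ProcessGroupWrapper`, and the names and encoding here are made up:

```python
import torch

# Sketch only: assume op_type, dtypes and device_types are integer codes.
def serialize_fingerprint(op_type, dtypes, device_types, shapes):
    data = [op_type]
    data += list(dtypes) + list(device_types)
    for shape in shapes:
        data += [len(shape), *shape]
    # one tensor -> one all_gather suffices to compare collectives across ranks
    return torch.tensor(data, dtype=torch.long)
```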
ghstack-source-id: 132520261

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28597287

fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
2021-06-28 10:24:11 -07:00
aaea81e3fb [torch/distributed] remove outdated FutureWarning in distributed/elastic/util/store.py (#60807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60807

Addresses: https://github.com/pytorch/pytorch/issues/60717

This warning should have been removed since this code is no longer in "experimental" mode.

Test Plan: N/A - just removing experimental warning that should've been removed.

Reviewed By: H-Huang, aivanou

Differential Revision: D29412972

fbshipit-source-id: 16a8a98abde70a4ae0c1ac1b14bda339cb44863a
2021-06-28 10:22:16 -07:00
94cdbbf48d Paren-matching kernel launch check without external deps (#60778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60778

Matches parens and the opening `<<<` to make a more accurate kernel launch check.

Test Plan:
```
buck test //caffe2/test:kernel_launch_checks
```

Reviewed By: ngimel

Differential Revision: D29401624

fbshipit-source-id: 8649af7c33e67dbb24044af0134b1cea6f2e5dc3
2021-06-28 10:18:04 -07:00
88b0518a83 Python error unit tests on delegation of backend_with_compiler_demo (#60689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60689

Added a test for errors that occur with a compiler, specifically when an
operator is not supported by the backend.
ghstack-source-id: 132485207

Test Plan:
Running python test/test_jit.py TestBackendsWithCompiler -v returns a
success.

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29374513

fbshipit-source-id: ac52b315a01719eaa4985680939239ae058d277b
2021-06-28 09:33:03 -07:00
e63db3ae46 ENH Adds byte support for nll_loss (CUDA) (#60650)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on CUDA.
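
A minimal sketch of the newly supported call (assuming a CUDA device; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5, device='cuda')
target = torch.randint(0, 5, (8,), dtype=torch.uint8, device='cuda')  # byte targets
loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
```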

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60650

Reviewed By: albanD

Differential Revision: D29429456

Pulled By: jbschlosser

fbshipit-source-id: 894c969ed6bfc6117dee8e844a7cb5b99977247c
2021-06-28 08:20:13 -07:00
7f6b2bc2d0 Add -I<directory> option to tools/linter/clang_tidy.py (#60745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60745

Fixes #60739

Test Plan:
Run this command:
```
python3 tools/linter/clang_tidy.py --paths torch/csrc/fx -I/usr/include/path -I/usr/include/another/path --print-include-paths
```

Output:

If the paths don't exist, you should see this:
```
ignoring nonexistent directory "/usr/include/path"
ignoring nonexistent directory "/usr/include/another/path"
```

If the paths exist, you should see them listed.

Reviewed By: ngimel

Differential Revision: D29395227

Pulled By: 1ntEgr8

fbshipit-source-id: c89650546d45887cd39e574da07f08bcfec686e0
2021-06-28 06:56:02 -07:00
5b118a7f23 Don't reference reflection_pad3d in functional.py (#60837)
Summary:
To work around an FC issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60837

Reviewed By: jbschlosser

Differential Revision: D29421142

Pulled By: ngimel

fbshipit-source-id: f5c1d9c324173b628e286f9005edf7109162066f
2021-06-27 20:54:32 -07:00
f0e972a481 To add Nesterov Adam algorithm for multi-tensor optimizers API (#59165)
Summary:
Previously, in PR https://github.com/pytorch/pytorch/issues/59009, we added NAdam to the optimizers. In this PR we propose a multi-tensor version of NAdam for PyTorch.

NAdam was proposed in the paper https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ and the report http://cs229.stanford.edu/proj2015/054_report.pdf by Timothy Dozat.

It has become one of the most widely used algorithms in the deep learning community.

It is worth noting that the implementation of NAdam is inspired by the Keras implementation:
f9d3868495/tensorflow/python/keras/optimizer_v2/nadam.py
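
A minimal usage sketch of the optimizer interface (my own example; the multi-tensor variant is expected to expose the same interface):

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.NAdam(model.parameters(), lr=2e-3)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```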

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59165

Reviewed By: vincentqb

Differential Revision: D29360577

Pulled By: iramazanli

fbshipit-source-id: 0fe14016303b2df2cb8cc31912a2674acf63d1e5
2021-06-27 17:00:41 -07:00
3bfe15085d [TensorExpr] Add a mechanism to register custom TS->NNC lowerings in TensorExprKernel. (#60804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60804

The lowerings are stored as a map c10::Symbol -> std::function, and the
signature of those functions matches the signature of
`computeOperandValue`. Custom lowerings take priority over the
standard ones, i.e. we can redefine how a given op is lowered.

In general this feature is aimed at unblocking users whose models
contain ops that are not yet supported by NNC - it allows quickly adding
a custom lowering for a given op.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D29409580

Pulled By: ZolotukhinM

fbshipit-source-id: e8e8dc9d3cb9155cfbf5c08a4216ba1b5b791a60
2021-06-27 15:27:22 -07:00
5563f4bda0 To add Rectified Adam algorithm for multi-tensor optimizers API (#59161)
Summary:
Previously, in PR https://github.com/pytorch/pytorch/issues/58968, we added RAdam to the optimizers. Here in this PR we propose a multi-tensor version of RAdam for PyTorch.

RAdam was proposed in the paper https://arxiv.org/pdf/1908.03265.pdf by Liyuan Liu et al.

It has been one of the most widely used algorithms in the deep learning community.

Differing from the paper, we selected a variance tractability cutoff of 5 instead of 4, as is common practice.

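For reference, a minimal sketch of the variance-rectification gate with the cutoff of 5 mentioned above (formula from the paper; an illustration, not this PR's exact code):

```python
# RAdam's variance-rectification term; `beta2` and step `t` as in Adam
def rectification(beta2: float, t: int):
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 5.0:
        return None  # variance not yet tractable: take an unadapted step
    return (((rho_t - 4) * (rho_t - 2) * rho_inf)
            / ((rho_inf - 4) * (rho_inf - 2) * rho_t)) ** 0.5
```
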
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59161

Reviewed By: vincentqb

Differential Revision: D29360576

Pulled By: iramazanli

fbshipit-source-id: 7ccdbf12b1ee7f12e66f7d7992123a70cc818b6b
2021-06-27 13:01:20 -07:00
0fbc471d10 Support default values on NamedTuple fields (#54682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54682

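A minimal sketch of the feature (hypothetical field names):

```python
import torch
from typing import NamedTuple

class Point(NamedTuple):
    x: float
    y: float = 0.0  # default value on a NamedTuple field

@torch.jit.script
def make_point(x: float) -> Point:
    return Point(x)  # y falls back to its default inside TorchScript

print(make_point(1.0))  # Point(x=1.0, y=0.0)
```
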
Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27327241

Pulled By: ansley

fbshipit-source-id: 76546f1770d50ebc3435bba3b74540e3c6be8a1c
2021-06-26 15:18:21 -07:00
6b53792f18 fix cuda mem leak check not properly run on master_builds (#60742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60742

Improved the CI_MASTER flag check logic, since the flag can be unset, true, or false.

Test Plan:
search for `PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK` in logs below:

- Before adding ci/master:
  - build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=1`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14394913/output/107/0?file=true&allocation-id=60d5fd2fa55ae50282aec997-0-build%2F10295B30
- After adding ci/master label:
  - build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14398213/output/107/0?file=true&allocation-id=60d61cf8bb9d097afc7a11aa-0-build%2F400138F1
  - master build workflow (`PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0`): https://circleci.com/api/v1.1/project/github/pytorch/pytorch/14398198/output/107/0?file=true&allocation-id=60d61ca3467438480c963290-0-build%2F2999C909

Reviewed By: ngimel

Differential Revision: D29405732

Pulled By: walterddr

fbshipit-source-id: 09dd653cbb47ca61b1f8872851bda6db8db671b9
2021-06-26 07:05:32 -07:00
e3abccec8a [Static Runtime] Remove output type constraints (#60669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669

Test Plan: Added unit test to check for nested outputs.

Reviewed By: ajyu

Differential Revision: D29322025

fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
2021-06-26 02:36:27 -07:00
dae25c2002 Fix missing spaces in error of constant_pad_nd (#60729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60729

Reviewed By: ZolotukhinM

Differential Revision: D29404422

Pulled By: ngimel

fbshipit-source-id: c40458c7a6ae33f61c680bff8de778a80658c250
2021-06-25 19:20:03 -07:00
9a08e87d8b Modernize for-loops in aten (#59598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59598

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28946826

fbshipit-source-id: 9f3f7e38833c2bc33d27243cef16ab0118c65f3a
2021-06-25 19:02:00 -07:00
7e3a694b23 supports non-leaf inputs for autograd.backward() function (#60521)
Summary:
Close https://github.com/pytorch/pytorch/issues/60268

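A minimal sketch of the behavior this enables (illustrative values):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2            # y is a non-leaf tensor
z = (y * 3).sum()

# non-leaf tensors may now be passed as `inputs`,
# and their .grad fields are populated directly
torch.autograd.backward(z, inputs=[y])
print(y.grad)        # tensor([3., 3., 3.])
```
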
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60521

Reviewed By: ngimel

Differential Revision: D29393586

Pulled By: albanD

fbshipit-source-id: 2dd2de427ecfecca8d544237bacf690e0b7c918c
2021-06-25 18:57:26 -07:00
056a8e0d5c Remove un-used parameter in _trilinear backward (#60673)
Summary:
This argument is only important for speed and memory usage, so it is OK to ignore it during the backward.
As discussed, we might want to change this to speed up backward in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60673

Reviewed By: soulitzer

Differential Revision: D29370125

Pulled By: albanD

fbshipit-source-id: ad50b3ea530aeb194f5a51845523b517a50f2c71
2021-06-25 17:47:10 -07:00
f262217101 [Model Averaging] Move step out of model averaging API (#60632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60632

Address the comment https://github.com/pytorch/pytorch/pull/60320#discussion_r654845062
ghstack-source-id: 132340278

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29355609

fbshipit-source-id: 50a6f13ed70b5a5b5b92ead2f3d7082c11277af5
2021-06-25 17:20:52 -07:00
c5f0692b6e Sparse CSR: increase dtype test coverage (#60656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60656

This PR uses `torch.testing.get_all_dtypes()` for dtype parametrisation
of tests in `test_sparse_csr.py`. It adds bool, half, bfloat16, and
complex dtypes that were previously excluded from tests. `torch.complex32` is omitted due
to lack of coverage and lack of a specialized `AT_DISPATCH...`.
The process of adding more dtypes to tests revealed that `.to_dense()`
doesn't work for all dtypes.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29408058

Pulled By: cpuhrsch

fbshipit-source-id: 319b6f51b9786d6957d508f51657657a6d00267a
2021-06-25 17:11:21 -07:00
dd045ab540 add channels last for AdaptiveMaxPool2d (#48920)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48920

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25399467

Pulled By: VitalyFedyunin

fbshipit-source-id: d9d2cc728cc7a18a26983e96d3c3e81a23659e89
2021-06-25 16:36:20 -07:00
367aff91d8 Fix missing #pragma once in jit/method.h
Summary: it seems to be accidentally missing

Test Plan: run CI

Reviewed By: suo

Differential Revision: D29335990

fbshipit-source-id: 2790bc10d141f9484a0807ff7800024a02fd9cfa
2021-06-25 16:32:54 -07:00
8b6487c650 Add CUDA Vital (#58059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58059

Add a CUDA.used vital sign, which is true only if CUDA was "used", which technically means the CUDA context was created.

Also adds the following features:
- Force vitals to be written even if vitals are disabled, to enable testing when the env variable is not set from the start of execution
- Add a read_vitals call for python to read existing vital signs.

Test Plan: buck test mode/dbg caffe2/test:torch -- --regex basic_vitals

Reviewed By: xuzhao9

Differential Revision: D28357615

fbshipit-source-id: 681bf9ef63cb1458df9f1c241d301a3ddf1e5252
2021-06-25 16:31:11 -07:00
9134b0e42f add a boxed CPU fallback kernel (#58065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58065

This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.

### Design

To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I listed below. Actually, after writing this description, I'm leaning toward the following changes:
* Confirm whether or not we can remove all C++ logging info directly in the yaml.

**Current Design**

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1).

There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```
```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:
```
#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}
```

That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

**Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?**
We could change the api to use `at::redispatch`, which would make it look something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}
```
Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}
```

Writing that out, I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out.

**More alternatives**
The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:
* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

### Performance impact

While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.

I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

`at::add`:
before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
delta: ~15.5% increase

`at::add_out`:
before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
delta: ~14.5% increase

High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](https://github.com/pytorch/pytorch/issues/55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28833085

Pulled By: bdhirsh

fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
2021-06-25 16:26:50 -07:00
ad69e2fd11 [torch] Module fix on the support of LazyModule on bug #60132 (#60517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517

This fixes the module support for LazyModuleMixin, per bug issue #60132.
Check the link: https://github.com/pytorch/pytorch/issues/60132

We will have to update lazy_extension, given the dependency on module.py, and update the unit test as well.

Test Plan:
Unit test passes

torchrec test passes

Reviewed By: albanD

Differential Revision: D29274068

fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
2021-06-25 16:20:19 -07:00
cab926b2c0 faster generate_square_subsequent_mask in nn.Transformer (#60631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631

Per #48360, speed up `Transformer.generate_square_subsequent_mask`. The new impl is informally ~5x faster, though the absolute difference is probably small.

PR includes Python and C++ versions as well as a couple of places where the previous impl had been copied around.

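A plausible single-call formulation of the mask (a sketch, not necessarily the exact code landed here):

```python
import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # -inf above the diagonal (future positions masked), 0.0 on and below it
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(generate_square_subsequent_mask(3))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])
```
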
Test Plan: Imported from OSS

Reviewed By: jbschlosser, albanD

Differential Revision: D29356673

Pulled By: bhosmer

fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e
2021-06-25 16:07:01 -07:00
7585783b8d Remove Optional[None] annotations (#60704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60704

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29380281

Pulled By: ansley

fbshipit-source-id: 055c17329a35375de4ebd058ee6d127475aad373
2021-06-25 15:53:58 -07:00
5ed7400b75 Fix doc preview source directory (#60792)
Summary:
`merge` is the directory with the actual changes, not `master`. Verified by downloading artifacts from https://github.com/pytorch/pytorch/pull/60777/checks and searching through the result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60792

Reviewed By: walterddr

Differential Revision: D29405288

Pulled By: driazati

fbshipit-source-id: 419c943727c00429945c1f116645bfa22fb12456
2021-06-25 15:46:30 -07:00
7b933cd9ea configurable pre/post LayerNorm in nn.Transformer (#60593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60593

Per #55270, this PR makes it configurable whether to run LayerNorm before or after other operations in Transformer layers.

However, it leaves for a separate PR the removal of the LayerNorm performed after the final encoder/decoder layer has run, which is redundant when LayerNorm has already been run after the other in-layer operations (problem described in #24930 #50086 #51447).

Note: this means that transformers built with `nn.Transformer()` are now configurable, but will still contain a redundant LayerNorm when configured as before. However, callers of the `TransformerEncoder` and `TransformerDecoder` classes have always been able to avoid this redundancy.

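A usage sketch, assuming the switch is exposed as a `norm_first` constructor flag (flag name assumed):

```python
import torch.nn as nn

# pre-LN: LayerNorm runs before attention/feedforward inside each layer
pre_ln = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True)

# post-LN: the previous (and still default) behavior
post_ln = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=False)
```
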
Reviewer notes:
1. Ran across this during other work, don't know if anybody's working on it already (most recent conversation in issues seems to be from early April). Happy to abandon if so.
2. Was looking for a quick way to add tests but it looks like the existing ones in test_nn just compare against snapshots. I could add something similar, but curious if there's any prepackaged way to add a test that LayerNorm-first (the new option) yields model that trains properly, etc.
3. New code in the `forward`s was written to minimize diff churn rather than maximize beauty :P happy to pretty it up if desired.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29356590

Pulled By: bhosmer

fbshipit-source-id: 308669326990b8923aab5fcd96e03b582fb21f24
2021-06-25 15:43:35 -07:00
e13a9587b4 Revert "Revert D29135358: [quant] Input-Weight Equaliaztion - convert modifications" (#60646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60646

This reverts commit e60f9cfc58fb2fe3e2e7f65fcdbbf350e5b55a75.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D29361191

Pulled By: angelayi

fbshipit-source-id: 275d8691d8e47da4ab80bb21b51d77ec25a0f714
2021-06-25 15:37:05 -07:00
7188d84ccf [Tools] Update path in clang_format_utils after #60473 (#60782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60782

PR #60473 introduced a new folder nesting level; this change updates
clang_format_utils.py to adjust the way it sets up the root path
accordingly.

Test Plan: Imported from OSS

Reviewed By: zhxchen17

Differential Revision: D29403622

Pulled By: ZolotukhinM

fbshipit-source-id: 6404271615c2d263834cf538ab0153c4d41cc5c3
2021-06-25 14:30:45 -07:00
394f60b0fc [caffe2] update make_cifar_db to move the string into DB::Put() (#60692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60692

Update make_cifar_db.cc to work with the DB API changes in D29204425 (00896cb9ed).

Test Plan: buck build caffe2/binaries:make_cifar_db

Differential Revision: D29374754

fbshipit-source-id: 23d2acd24031d11071791e398433b537215ffd38
2021-06-25 14:02:24 -07:00
e1bd4963e2 To introduce Functional API for multi-tensor (#60735)
Summary:
In this PR we change the multi-tensor optimizers to a functional API.

In the file https://github.com/pytorch/pytorch/blob/master/torch/optim/_functional.py, a functional API is defined for most optimizers. However, we do not have a similar file / functionality for the multi-tensor optimizers:
https://github.com/pytorch/pytorch/tree/master/torch/optim/_multi_tensor

Therefore we are adding it in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60735

Reviewed By: vincentqb

Differential Revision: D29392253

Pulled By: iramazanli

fbshipit-source-id: cebc8e7b07ab11156370f5297cfb419cd9f20b46
2021-06-25 13:09:26 -07:00
8f16a38067 Add missing kernel checks (#60635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60635

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29355747

fbshipit-source-id: 20bae292703a54b2895a33c11e6f1b8b9a9d8195
2021-06-25 12:54:40 -07:00
dfc8247d33 Faster cumsum and cumprod backwards (#60642)
Summary:
Piggybacking on https://github.com/pytorch/pytorch/pull/58747, now we can implement the backwards of `cumsum` and `cumprod` without tricks. This minimises the number of kernels that are launched on the GPU, so we see a reasonable speed-up there. We should also get better stability for ill-conditioned inputs, as we do not perform any numerical tricks to get the result.

Note that the benchmarks test forward + backward, so the true speed-up of the backward should be even larger. Even more so for `cumsum`, as it requires fewer operations than the backward of `cumprod`.

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(ndims, prod_dim, dim, num_threads, device):
    size = [500]*ndims
    size[dim] = prod_dim

    x = torch.rand(*size, device=device, requires_grad=True)
    # Make sure there are no zeros as the formula for the backward
    # that we are testing is for when the backward has no zeros
    with torch.no_grad():
        x.add_(1e-3)
    grad = torch.ones_like(x)

    timer = Timer(
        "torch.autograd.grad([x.cumprod(dim)], [x], grad_outputs=[grad])",
        globals={"x": x, "dim": dim, "grad": grad},
        label=f"Cumprod + Backwards {device}",
        description=f"dim: {dim}",
        sub_label=f"prod_dim: {prod_dim}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    ndims = 3
    dims = range(ndims)
    prod_dims = [10, 100, 500]
    for dim, prod_dim, device in product(dims, prod_dims, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        for num_threads in threads:
            yield ndims, prod_dim, dim, num_threads, device

compare = Compare([get_timer(*params) for params in get_params()])
compare.trim_significant_figures()
compare.print()
```

</details>

<details>
<summary>
Benchmark PR
</summary>

```
[------------ Cumprod + Backwards cpu -------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |     11   |     14   |     12
      prod_dim: 100  |    260   |    270   |    260
      prod_dim: 500  |   1400   |   1550   |   1360
2 threads: -----------------------------------------
      prod_dim: 10   |      6   |      6   |      6
      prod_dim: 100  |    170   |    166   |    167
      prod_dim: 500  |    902   |    950   |    858
4 threads: -----------------------------------------
      prod_dim: 10   |      4   |      3   |      3
      prod_dim: 100  |    110   |    108   |    106
      prod_dim: 500  |    576   |    590   |    547

Times are in milliseconds (ms).

[------------ Cumprod + Backwards cuda ------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |    562   |    566   |   1075
      prod_dim: 100  |   5388   |   5394   |   6697
      prod_dim: 500  |  28170   |  27580   |  30740

Times are in microseconds (us).
```

</details>

<details>
<summary>
Benchmark master
</summary>

```
[------------ Cumprod + Backwards cpu -------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |     11   |     13   |     12
      prod_dim: 100  |    270   |    270   |    256
      prod_dim: 500  |   1500   |   1590   |   1300
2 threads: -----------------------------------------
      prod_dim: 10   |      6   |      6   |      6
      prod_dim: 100  |    170   |    170   |    164
      prod_dim: 500  |    911   |    940   |    840
4 threads: -----------------------------------------
      prod_dim: 10   |      4   |      4   |      4
      prod_dim: 100  |    111   |    109   |    105
      prod_dim: 500  |    570   |    590   |    536

Times are in milliseconds (ms).

[------------ Cumprod + Backwards cuda ------------]
                     |  dim: 0  |  dim: 1  |  dim: 2
1 threads: -----------------------------------------
      prod_dim: 10   |    616   |    597   |   1109
      prod_dim: 100  |   5976   |   5723   |   7017
      prod_dim: 500  |  31110   |  29160   |  32320

Times are in microseconds (us).
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60642

Reviewed By: ngimel

Differential Revision: D29366368

Pulled By: albanD

fbshipit-source-id: b0d692ce030352965c2f152e0f92fbb61fc5ebde
2021-06-25 12:44:12 -07:00
d3bec9f4d2 Use S3 for documentation previews (#60711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60711

We already build the docs on each PR; this adds a step to push the relevant folder of the docs. (We build the entire website for pytorch.github.io, which clocks in at around 500 MB, but we really only need the "master" docs, not every version; the master docs by themselves are around 50 MB, which is more reasonable.) It uses the same S3 bucket as the artifacts but places the items at the `pytorch/pytorch/pr-previews/<pr number>` prefix. The bucket has a rule to expire resources in that prefix after 1 month.

On the AWS side the bucket has static hosting enabled with CloudFront directing to the docs preview prefix, so you can see the output at `https://d28slxzaq48q8t.cloudfront.net/<pr number>/`, e.g. https://d28slxzaq48q8t.cloudfront.net/60711/. For advertising we could link this on the HUD PR page as well as in the Dr. CI comment. We could add a CNAME on CloudFront to make this be `pr-preview.pytorch.org/<pr number>` or something, but having random PRs be able to host content on the pytorch.org domain seems sketchy.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29398818

Pulled By: driazati

fbshipit-source-id: 24032854d83815853b3650d8e54f60b684707f76
2021-06-25 12:12:26 -07:00
aacc722aec Dispatch to Python via __torch_dispatch__ (#59760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59760

See https://github.com/pytorch/pytorch/issues/59049

There are some moving parts to this PR, I'll structure this explanation so the straightforward parts go first, and then the less straightforward parts.

**The actual dispatch to Python.** The core logic of dispatch to Python lives in `concrete_dispatch_fn` in `torch/csrc/autograd/python_variable.cpp`. It takes the input IValue stack, scans all the arguments for Tensor arguments, and defers most of the heavy lifting to `handle_torch_function_no_python_arg_parser` which actually does all of the logic for calling out to torch dispatch (in particular, this function handles multiple dispatch situations for you). Because we have a different function name than regular `__torch_function__` handling, `handle_torch_function_no_python_arg_parser` is generalized to accept a magic method name to look for when testing if Tensors have custom handling or not. Unlike `__torch_function__`, by default there is no `__torch_dispatch__` on Tensor classes.

**Maintaining the Python dispatch key.** In order to get to the dispatch to Python logic, we must tag Tensors with the `__torch_dispatch__` magic method with the newly added Python dispatch key (separated from PythonFuncTorch to allow for a transitional period while they migrate to this mechanism). We expose a new private property `_is_python_dispatch` that assists in debugging if a Tensor is participating in Python dispatch or not. We apply the Python dispatch key the first time a PyObject for a Tensor is constructed (THPVariable_NewWithVar), testing if `__torch_dispatch__` exists with the newly added `check_has_torch_dispatch`.

**Shallow copy and detach.** For the simple examples tested in this PR, most creations of Tensor route through the dispatcher. The exception to this is `shallow_copy_and_detach`, which bypasses the dispatcher and is used when saving tensors for backwards. When a Tensor participates in Python dispatch, we override the behavior of `shallow_copy_and_detach` to instead directly call into `__torch_dispatch__` to perform a `detach` operation (in the same way it would be invoked if you called `detach` directly). Because this Python call is triggered directly from c10::TensorImpl, it must be indirected through `PyInterpreter::detach`, which is the general mechanism for dynamic dispatching to the Python interpreter associated with a TensorImpl.

**torchdeploy compatibility.** The dispatch to Python logic cannot be directly registered to the dispatcher as it is compiled in the Python library, which will get loaded multiple times per torchdeploy interpreter. Thus, we must employ a two phase process. First, we register a fallback inside a non-Python library (aten/src/ATen/core/PythonFallbackKernel.cpp). Its job is to determine the appropriate PyInterpreter to handle the Python dispatch by going through all of the arguments and finding the first argument that has a PyObject/PyInterpreter. With this PyInterpreter, it makes another dynamic dispatch via "dispatch" which will go to the correct torchdeploy interpreter to handle dispatching to actual Python.

**Testing.** We provide a simple example of a LoggingTensor for testing, which can be used to generate TorchScript-like traces to observe what operations are being called when a Tensor is invoked. Although a LoggingTensor would be better implemented via an is-a relationship rather than a has-a relationship (as is done in the test), we've done it this way to show that arbitrarily complex compositions of tensors inside a tensor work properly.

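For orientation, a minimal has-a sketch in the spirit of that LoggingTensor test; the helpers used here (`_make_wrapper_subclass`, `torch.utils._pytree.tree_map`) are assumptions based on later versions of this API, not necessarily what this PR landed:

```python
import torch
from torch.utils._pytree import tree_map

class LoggingTensor(torch.Tensor):
    # per the known limitations below, default __torch_function__ handling is disabled
    __torch_function__ = torch._C._disabled_torch_function_impl

    @staticmethod
    def __new__(cls, elem):
        # has-a relationship: the real data lives in `elem`
        r = torch.Tensor._make_wrapper_subclass(cls, elem.size(), dtype=elem.dtype)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        unwrap = lambda t: t.elem if isinstance(t, LoggingTensor) else t
        wrap = lambda t: LoggingTensor(t) if isinstance(t, torch.Tensor) else t
        print(f"op: {func}")  # log every aten op that reaches dispatch
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
        return tree_map(wrap, out)

x = LoggingTensor(torch.randn(2))
y = x + x  # prints the aten op(s) invoked
```
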
**Known limitations.**

* We haven't adjusted any operator code, so some patterns may not work (as they lose the Python subclass in an unrecoverable way)
* `__torch_function__` must be explicitly disabled with `_disabled_torch_function_impl` otherwise things don't work quite correctly (in particular, what is being disabled is default subclass preservation behavior.)
* We don't ever populate kwargs, even when an argument is kwarg-only

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision:
D29017912
D29017912

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Pulled By: ezyang

fbshipit-source-id: a67714d9e541d09203a8cfc85345b8967db86238
2021-06-25 11:50:32 -07:00
a53d7f8f7c Remove test linalg test skips from MAGMA integration (#58232)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55552; majority of cases in https://github.com/pytorch/pytorch/issues/51303

Tests in torch/testing/_internal/common_methods_invocations.py (tested through test_ops) cannot be fully removed, since the machines seem to be running out of GPU memory during the test; this needs further analysis

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58232

Reviewed By: ngimel

Differential Revision: D29394021

Pulled By: malfet

fbshipit-source-id: f108a70af33beec908ac1c0b58467f8744e6fe87
2021-06-25 11:44:49 -07:00
8216da1f23 Use python3.6 compatible APIs in clang_tidy.py (#60659)
Summary:
This PR makes `tools/clang_tidy.py` use Python 3.6 APIs for `asyncio` and `shlex`.

I ran into some issues when running this script with the `-j` flag inside of the clang-tidy docker image (which uses Python 3.6). Specifically, the functions `asyncio.run` and `shlex.join` are not available in Python 3.6.

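A sketch of the kind of substitution involved (illustrative coroutine and argument list):

```python
import asyncio
import shlex

async def main() -> None:
    await asyncio.sleep(0)

# instead of asyncio.run(main()), which is unavailable on Python 3.6:
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

# instead of shlex.join(cmd), also unavailable on Python 3.6:
cmd = ["clang-tidy", "-p", "build dir"]
print(" ".join(shlex.quote(c) for c in cmd))
```
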
This change does not affect CI because we do not run the clang-tidy job in parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60659

Reviewed By: albanD

Differential Revision: D29377851

Pulled By: 1ntEgr8

fbshipit-source-id: 92ab7ee6782b78d40ffccd03f1718ede4204d948
2021-06-25 10:35:03 -07:00
6322f66878 Add python version and cuda-specific folder to store extensions (#60592)
Summary:
See https://github.com/pytorch/pytorch/issues/55267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60592

Reviewed By: albanD

Differential Revision: D29353368

Pulled By: ezyang

fbshipit-source-id: 1fbcd021f1030132c0f950f33ce4a3a2fef351e0
2021-06-25 10:27:04 -07:00
a404cc9a7b CUDA addcmul and addcdiv do math in float for 16 bits I/O (#60715)
Summary:
Currently, foreach `addcmul` and `addcdiv` cast the scalar to float so that the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.

### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'

In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)

In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)

In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```

### How does foreach cast?
Foreach addcmul and addcdiv cast scalar to `opmath_t` (almost equivalent to acc_type) here: 42c8439b6e/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu (L30) and cast inputs and results here:
42c8439b6e/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L133-L135)

Related to https://github.com/pytorch/pytorch/issues/58833 #60227 https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715

Reviewed By: albanD

Differential Revision: D29385715

Pulled By: ngimel

fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
2021-06-25 10:21:35 -07:00
0be65cd52a [c10d] Fix test_collective_hang flakiness (#60662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60662

Fixes this flaky test. Basically, sometimes a rank can exit the test
early before rank 0 calls into allreduce. In this case Gloo will throw
connection reset error on all other ranks.
ghstack-source-id: 132363151

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29364806

fbshipit-source-id: ce0c292a2166edad57ea0dbb76df12cfd560a10d
2021-06-25 10:15:18 -07:00
474bdaf54d Add --print-include-paths option to tools/linter/clang_tidy.py (#60744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60744

Fixes #60739

Test Plan:
Run this command:
```
python3 tools/linter/clang_tidy.py --paths torch/csrc/fx --print-include-paths
```

Output (varies from machine to machine):
```
(clang-tidy output)
.
.
.

clang -cc1 version 11.0.0 based upon LLVM 11.0.0 default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "nccl/include"
ignoring nonexistent directory "/include"
ignoring duplicate directory ".."
ignoring duplicate directory "../aten/src"
ignoring duplicate directory "../third_party/onnx"
ignoring duplicate directory ".."
ignoring duplicate directory ".."
ignoring duplicate directory "../torch/lib"
ignoring duplicate directory "../torch/../third_party/gloo"
  as it is a non-system directory that duplicates a system directory
ignoring duplicate directory "../third_party/ideep/mkl-dnn/src/../include"
  as it is a non-system directory that duplicates a system directory
#include "..." search starts here:
#include <...> search starts here:
 aten/src
 ../aten/src
 .
 ..
 ../cmake/../third_party/benchmark/include
 caffe2/contrib/aten
 ../third_party/onnx
 third_party/onnx
 ../third_party/foxi
 third_party/foxi
 ../torch/../aten/src/TH
 caffe2/aten/src
 third_party
 ../torch/../third_party/valgrind-headers
 ../torch/csrc
 ../torch/csrc/api/include
 ../torch/lib
 ../torch/lib/libshm
 ../torch/csrc/api
 third_party/ideep/mkl-dnn/include
 ../third_party/fmt/include
 third_party/gloo
 ../torch/../third_party/gloo
 ../cmake/../third_party/googletest/googlemock/include
 ../cmake/../third_party/googletest/googletest/include
 ../third_party/protobuf/src
 /data/users/eltonpinto/miniconda3/envs/pytorch/include
 ../third_party/gemmlowp
 ../third_party/neon2sse
 ../third_party/XNNPACK/include
 ../third_party
 ../cmake/../third_party/eigen
 /home/eltonpinto/local/miniconda3/envs/pytorch/include/python3.8
 /home/eltonpinto/local/miniconda3/envs/pytorch/lib/python3.8/site-packages/numpy/core/include
 ../cmake/../third_party/pybind11/include
 /usr/local/cuda-11.3/include
 ../third_party/ideep/mkl-dnn/src/../include
 ../third_party/ideep/include
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/x86_64-redhat-linux
 /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/backward
 /usr/local/include
 /usr/lib64/clang/11.0.0/include
 /usr/include

.
.
.
(more clang-tidy output)
```

Imported from OSS

Reviewed By: ngimel

Differential Revision: D29395398

fbshipit-source-id: e92077a9c4e9dee7f9d7e05df180d552e3763540
2021-06-25 10:12:15 -07:00
608f12b818 Fix --dry-run option in tools/linter/clang_tidy.py (#60744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60744

Fixes #60741

Test Plan:
Run this command:
```
python3 tools/linter/clang_tidy.py --paths torch/csrc/fx --dry-run
```
Output:
```
clang-tidy -p build -config '{"InheritParentConfig": true, "Checks": " bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, -bugprone-reserved-identifier, cppcoreguidelines-*, -cppcoreguidelines-avoid-magic-numbers, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-macro-usage, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, -facebook-hte-RelativeInclude, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-concat-nested-namespaces, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, -performance-unnecessary-value-param, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}' torch/csrc/fx/fx_init.cpp
```

Reviewed By: ngimel

Differential Revision: D29394538

Pulled By: 1ntEgr8

fbshipit-source-id: b824bc2aa63631f074e9ad17092e4e063d347395
2021-06-25 09:53:29 -07:00
3a838e4ce3 Parametrizations depending on several inputs (#60530)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/58488

There was a line that had been changed in `test_nn.py` as caught in https://github.com/pytorch/pytorch/pull/58488#discussion_r651267668

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
2021-06-25 09:16:57 -07:00
8cba365378 Fix incorrect doc about the dtype for torch.randint described in issue #56347 (#60507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60507

Fix incorrect documentation about the dtype for `torch.randint` described in issue #56347

Test Plan: Review documentation to make sure formatting is right

Reviewed By: bdhirsh

Differential Revision: D29321181

fbshipit-source-id: caae69a9bbb30052da518a3f5d22a7ed3504cdd2
2021-06-25 07:51:36 -07:00
d8c3d555e4 [Delegate] Support composite of lowered sub modules of the same backend (#59921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59921

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D29091143

Pulled By: iseeyuan

fbshipit-source-id: 9ffcd18681917ece8ec73a34866c53701bdee1bc
2021-06-25 07:18:32 -07:00
7c2938bf67 To refactor Sparse Adam algorithm for functional form (#59171)
Summary:
Adds Functional Interface for Sparse Adam Optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59171

Reviewed By: vincentqb

Differential Revision: D29360582

Pulled By: iramazanli

fbshipit-source-id: 5ceffd7f4b7abd1e0b758a5b8445abdf5555eba0
2021-06-25 06:35:39 -07:00
963c983366 Improve numerical stability of LayerNorm (#59987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987

Similar to GroupNorm, improve the numerical stability of LayerNorm via Welford's algorithm and pairwise summation.

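For reference, a minimal scalar sketch of Welford's online algorithm (illustration only; the actual kernel operates on tensors):

```python
# Welford's online mean/variance: numerically stabler than
# accumulating sum and sum-of-squares separately
def welford(xs):
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n  # mean and (biased) variance

print(welford([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```
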
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: ngimel

Differential Revision: D29115235

fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
2021-06-25 02:22:42 -07:00
5b1f5c8f17 When creating a single partition, skip the output nodes, but process possible nodes after them. (#60370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60370

When creating a single partition, skip the output nodes, but process possible nodes after them.

Test Plan: Run all CI tests.

Reviewed By: jfix71

Differential Revision: D29265278

fbshipit-source-id: 2242009973a54498d8027cce5a294558a1206fdf
2021-06-24 23:50:30 -07:00
2b51a8a935 [BackwardCompatibility] Remove aten::to from allow_list (#60147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60147

Remove aten::to from allow_list now that the aten::to schema change has landed (D29121620 (eda2ddb5b0)).

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D29187314

fbshipit-source-id: abdb5a560287a861f3858732f7b3da342ee4aa55
2021-06-24 22:57:57 -07:00
3ca28656fa [special] erfcx cuda support (#60519)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60519

Reviewed By: ngimel

Differential Revision: D29353105

Pulled By: mruberry

fbshipit-source-id: 2f525a347a22f96411739a16e354c7291e863f95
2021-06-24 21:50:37 -07:00
46d27a53fe cuda rpc backward sparse tensor fix (#59609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59609

quick fix for https://github.com/pytorch/pytorch/issues/58755

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29335722

Pulled By: gcramer23

fbshipit-source-id: 0de7e0399b30f0934320f1e9abb1b92a45bcf929
2021-06-24 21:40:43 -07:00
561132f902 Revert D29330585: [pytorch][PR] add BFloat16 support for arange on CPU
Test Plan: revert-hammer

Differential Revision:
D29330585 (375d201086)

Original commit changeset: b8a04cee0c3f

fbshipit-source-id: dc138f9613becd083848e82d15c138d3883493c8
2021-06-24 20:57:43 -07:00
d63c236fb3 Introduce quantized convolution serialization format 3 (#60241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60241

We're going to make a forward-incompatible change to this serialization
format soon, so I'm taking the opportunity to do a little cleanup.

- Use int for version.  This was apparently not possible when V2
  was introduced, but it works fine now as long as we use int64_t.
  (Note that the 64-bits are only used in memory.  The serializer will
  use 1 byte for small non-negative ints.)
- Remove the "packed params" tensor and replace it with a list of ints.
- Replace the "transpose" field with "flags" to allow more binary flags
  to be packed in.
- Unify required and optional tensors.  I just made them all optional
  and added an explicit assertion for the one we require.

A bit of a hack: I added an always-absent tensor to the front of the
tensor list.  Without this, when passing unpacked params from Python to
the ONNX JIT pass, the type would be inferred as `List[Tensor]` if all
tensors were present, making it impossible to cast to
`std::vector<c10::optional<at::Tensor>>` without jumping through hoops.

The plan is to ship this, along with another diff that adds a flag to
indicate numerical requirements, wait a few weeks for an FC grace
period, then flip the serialization version.

Test Plan: CI.  BC tests.

Reviewed By: vkuzo, dhruvbird

Differential Revision: D29349782

Pulled By: dreiss

fbshipit-source-id: cfef5d006e940ac1b8e09dc5b4c5ecf906de8716
2021-06-24 20:52:43 -07:00
42c8439b6e TH: Clean up dead code (#60655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60655

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371717

Pulled By: ngimel

fbshipit-source-id: faa71b1d4a15450c78e12aa917daec853057bce9
2021-06-24 19:42:16 -07:00
4a7d281119 Migrate THAllocator to ATen (#60325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60325

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371715

Pulled By: ngimel

fbshipit-source-id: 78ec8368a48e1a4690d0664a0b02d2a235af98ff
2021-06-24 19:42:14 -07:00
d586248544 Migrate THStorage_resizeBytes to ATen (CPU) (#60324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60324

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371716

Pulled By: ngimel

fbshipit-source-id: 056aee0ec87722090c133777b6948c28b03b37e4
2021-06-24 19:41:02 -07:00
ddec2e0ef4 tentative fix for adaptiveavgpool gradient computation (#60630)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60524

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60630

Reviewed By: jbschlosser

Differential Revision: D29374257

Pulled By: ngimel

fbshipit-source-id: be05f0ceb53e6f0f0a59a83b710dafde469d4e8a
2021-06-24 19:02:32 -07:00
40a7c317bc Run BLAS F2C checks on host architecture (#60703)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60703

Reviewed By: driazati

Differential Revision: D29379727

Pulled By: malfet

fbshipit-source-id: dadbb1d39373887f07d59d0a05e093a5d070b016
2021-06-24 18:44:41 -07:00
7bc86458e1 Revert "Revert D28833086: beef up at::_ops API" (#60214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60214

Relanding this PR, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871)

This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_EXCEPT_WIN_CUDA`

This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: https://github.com/pytorch/pytorch/pull/40675

This reverts commit 6d0fb85a623f5ef3f3f1a2afc3660cb71fa70511.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29213932

Pulled By: bdhirsh

fbshipit-source-id: b90c7c10e5a51f8d6173ddca673b418e5774c248
2021-06-24 18:08:54 -07:00
9c4eec2a2d Adjust path to distributed cpp tests (#60705)
Summary:
After https://github.com/pytorch/pytorch/issues/60543, they are installed in the same folder as the rest of the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60705

Reviewed By: driazati

Differential Revision: D29380670

Pulled By: malfet

fbshipit-source-id: a432d26c731e9220e00d8c800b1429b37d51655b
2021-06-24 17:42:36 -07:00
8395fdde46 Increase tolerance for some distributed tests to 5e-5 (#60462)
Summary:
On A100 GPUs 10 tests fail due to slightly higher deviations.
This fixes those.

Note that rtol is still the default and atol was increased by a factor of 5 (from 1e-5)

The failing tests were:

- test_accumulate_gradients_module
- test_accumulate_gradients_module_with_grad_is_view
- test_ddp_checkpointing_once
- test_ddp_checkpointing_twice
- test_ddp_checkpointing_unused_params
- test_ddp_checkpointing_weight_sharing
- test_nccl_backend_1gpu_module_device_ids_integer_list
- test_nccl_backend_1gpu_module_device_ids_torch_device_list
- test_nccl_backend_single_device_module_device_ids_None
- test_nccl_backend_single_device_module_empty_device_id

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60462

Reviewed By: albanD

Differential Revision: D29366145

Pulled By: zhaojuanmao

fbshipit-source-id: c3e34c007363dfebf75ccb82004a67e4d2e6f3cd
2021-06-24 17:38:54 -07:00
2fa6c7627e [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: albanD

Differential Revision: D29370344

Pulled By: ngimel

fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
2021-06-24 17:34:02 -07:00
d90aefe380 Improve error message for non-differentiable inputs (#60610)
Summary:
Improve the error message for inputs that should not have requires_grad=True.

For example, we now get
```
RuntimeError: The function 'binary_cross_entropy' is not differentiable with respect to argument 'weight'. This input cannot have requires_grad True.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60610

Reviewed By: anjali411

Differential Revision: D29361424

Pulled By: albanD

fbshipit-source-id: 38163ce11ae1b8df326424e95ca20e55fea2a99a
2021-06-24 17:29:16 -07:00
4ed2d5d9bb ps sparse rpc (#58003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58003

adds trainer class DdpTrainer
adds trainer class DdpSparseRpcTrainer
adds server class ParameterServerBase
adds server class AverageParameterServer
adds experiment ddp_cpu_sparse_rpc_nccl_allreduce
adds experiment ddp_cuda_sparse_rpc_nccl_allreduce

quip document https://fb.quip.com/iQUtAeKIxWpF

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29379696

Pulled By: gcramer23

fbshipit-source-id: 9cf5fb7398ba2fa3eb694afbddc4ed00d97f205f
2021-06-24 17:21:49 -07:00
fadaa52f64 [caffe2] add an EstimateAllBlobSizes operator (#59775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59775

This operator is similar to `GetAllBlobNames` but also returns the estimated
size required to serialize each node.

One goal of this operator is to allow checkpoint saving logic to estimate the
amount of space/bandwidth required to save a checkpoint when first starting
training, without actually serializing any blobs yet.  Currently the
checkpointing logic uses `GetAllBlobNames` to determine the blobs to
checkpoint.  It can instead be updated to use `EstimateAllBlobSizes` to also
get an estimate for how much space will be required for the checkpoint.
ghstack-source-id: 132275153

Test Plan: Included a new unit test.

Reviewed By: mraway

Differential Revision: D29020227

fbshipit-source-id: 811e5d86c4b59183e84e6424c48c97739be09043
2021-06-24 16:55:22 -07:00
fe4ded01f7 [package] typing.io/re edge case hack (#60666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60666

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29367847

Pulled By: Lilyjjo

fbshipit-source-id: 2c38140fbb3eab61ae3de60ab475243f0338c547
2021-06-24 14:53:46 -07:00
375d201086 add BFloat16 support for arange on CPU (#60444)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60444

Reviewed By: VitalyFedyunin

Differential Revision: D29330585

Pulled By: ezyang

fbshipit-source-id: b8a04cee0c3f2ff5544e2b821324ce8fc4e9d0f2
2021-06-24 14:38:47 -07:00
7fc4e67771 ns for fx: fix shadow logger error for resnet18 (#60559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60559

Adds `resnet18` to integration test, and fixes the error to
make creating the shadow model work.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_resnet18
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29336236

fbshipit-source-id: 9425aa096162d80ef3a7c98144b2301cfbccc1ea
2021-06-24 13:42:18 -07:00
4ddb2b43b7 ns for fx: expose function to add comparisons between logged values (#60311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60311

Adds a user-facing utility function to the FX Numeric Suite Core APIs
for comparing the values extracted by the loggers to each other.
This is needed for any kind of analysis, so it would be great to
provide an example implementation.

Example:

```
// code

m = nn.Sequential(nn.Conv2d(1, 1, 1), nn.Conv2d(1, 1, 1)).eval()
qconfig_dict = {'': torch.quantization.default_qconfig}
mp = torch.quantization.quantize_fx.prepare_fx(m, qconfig_dict)
mq = torch.quantization.quantize_fx.convert_fx(copy.deepcopy(mp))
results = extract_weights('fp32', mp, 'int8', mq)
extend_logger_results_with_comparison(
    results, 'fp32', 'int8', compute_sqnr, 'sqnr_int8_vs_fp32')

print(results)

// results

{
  '_1': {'weight': {
    'fp32': [
      {'type': 'weight', 'values': [tensor([[[[-0.3284]]]])], 'prev_node_name': '_1', 'prev_node_target_type': "<class 'torch.nn.modules.conv.Conv2d'>", 'ref_node_name': '_1', 'index_within_arg': 0, 'index_of_arg': 0}
    ],
    'int8': [
      {'type': 'weight', 'values': [tensor([[[[-0.3297]]]], size=(1, 1, 1, 1), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.002575645223259926,
       zero_point=0)], 'prev_node_name': '_1', 'prev_node_target_type': "<class 'torch.nn.quantized.modules.conv.Conv2d'>", 'ref_node_name': '_1', 'index_within_arg': 0, 'index_of_arg': 0, 'sqnr_int8_vs_fp32': [tensor(48.1308)]}
    ]
  }},
  '_0': {'weight': {
    'fp32': [{'type': 'weight', 'values': [tensor([[[[0.5205]]]])], 'prev_node_name': '_0', 'prev_node_target_type': "<class 'torch.nn.modules.conv.Conv2d'>", 'ref_node_name': '_0', 'index_within_arg': 0, 'index_of_arg': 0}],
    'int8': [{'type': 'weight', 'values': [tensor([[[[0.5184]]]], size=(1, 1, 1, 1), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.004082232713699341,
       zero_point=0)], 'prev_node_name': '_0', 'prev_node_target_type': "<class 'torch.nn.quantized.modules.conv.Conv2d'>", 'ref_node_name': '_0', 'index_within_arg': 0, 'index_of_arg': 0, 'sqnr_int8_vs_fp32': [tensor(48.1309)]}]
  }}
}

```

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extend_logger_results_with_comparison
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29244715

fbshipit-source-id: a5547b449ea54e046c752119559be49bd738beea
2021-06-24 13:42:16 -07:00
31fe1c1323 ns for fx: rekey results by model node names (#60305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60305

Adjusts the NS for FX weight and activation extraction APIs
to require a model name, and rekeys the results of these APIs
to use the node names of the specified model as layer keys.

For example, before

```
# API call
results = ns.extract_logger_info(
  model_a, model_b, ns.OutputLogger)

# results
{'base_op_1_0': {'node_output':
  {'model_a': [{'ref_node_name': 'linear1', ...}]}}}
```

and after

```
# API call
results = ns.extract_logger_info(
  model_a, model_b, ns.OutputLogger, 'model_b_name')

# results
# note: instead of `base_op_1_0`, the layer is named `linear1`
{'linear1': {'node_output':
  {'model_a': [{'ref_node_name': 'linear1', ...}]}}}
```

Note: we cannot use these names while collecting data because
node names are not guaranteed to be consistent across graphs.
This is why we only rekey as the very last step.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_layer_names
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D29243045

fbshipit-source-id: d39ecdfdd18b07291e3ecefed2ede287b100b7d0
2021-06-24 13:41:01 -07:00
0ba4044b9d Increase some tolerances for tf32 for Conv3d tests (#60451)
Summary:
Allow those tests to pass on A100 GPUs which support tf32

This is basically a follow-up to https://github.com/pytorch/pytorch/pull/52871, which also increased some precisions to 0.05

For reference, these are the failures I see (the only ones in test_nn with 1.9.0):
```
FAIL: test_Conv3d_pad_same_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 161 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.032408137116391345 (-33.45570601919647 vs. -33.42329788208008), which occurred at index (2, 0, 0, 1, 0).

======================================================================
FAIL: test_Conv3d_pad_same_dilated_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 111 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan compariso
ns). The greatest difference was 0.024654212557543076 (35.104286017977465 vs. 35.07963180541992), which occurred at index (3, 0, 0, 0, 2).

======================================================================
FAIL: test_Conv3d_pad_valid_cuda_tf32 (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1033, in wrapper
    method(*args, **kwargs)
  File "test_nn.py", line 11296, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py", line 5103, in test_cuda
    test_case.assertEqualIgnoreType(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1254, in assertEqualIgnoreType
    return self.assertEqual(*args, exact_dtype=False, **kwargs)
  File "/tmp/easybuild-tmp/eb-ED4 (1f47a80e88)M3d/tmpqOhUjN/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=0.005, found 41 element(s) (out of 288) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.010903167642320355 (8.074376869119371 vs. 8.06347370147705), which occurred at index (0, 0, 1, 0, 0).

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60451

Reviewed By: albanD

Differential Revision: D29353255

Pulled By: ngimel

fbshipit-source-id: 155a02242be5a11dcbd9dd40ab63f15c6757ae1b
2021-06-24 13:36:27 -07:00
a3ebc40bab Update intro doc for derivatives.yaml (#60614)
Summary:
Clarify some phrasing and document the findings on the different non-differentiable states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60614

Reviewed By: anjali411

Differential Revision: D29362740

Pulled By: albanD

fbshipit-source-id: 5bc2e8b8dde57ba5a9247d7c28b83c793703e35f
2021-06-24 13:20:40 -07:00
48509b1a9b Add exclusion list to _check_kernel_launches.py (#60562)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60562

Test Plan:
```
buck test //caffe2/test:kernel_launch_checks
```

Reviewed By: ngimel

Differential Revision: D29336561

fbshipit-source-id: 0cc101143d24e887e852bd6a9ab34ac43155eb63
2021-06-24 13:18:07 -07:00
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00
b8d7db3b31 Turn default kernels into Meyer singletons (#60568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60568

https://github.com/pytorch/pytorch/pull/58661 induced a static
initialization order fiasco as flagged by ASAN strict_init_order=true.
On further inspection, it became clear that it was not necessary for
these to actually be globals initialized at module load time, so
I converted them into Meyer singletons, which ensures they are
initialized on first use, when another compilation unit requests them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29338019

Pulled By: ezyang

fbshipit-source-id: 282846118df6867277404a1830d0ce39fccaa769
2021-06-24 12:30:26 -07:00
4c00df12ec Include full Python version in collect_env.py output (#59632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59632

Before:

```
Python version: 3.7 (64-bit runtime)
```

After:

```
Python version: 3.7.7 (default, Mar 23 2020, 17:31:31)  [Clang 4.0.1 (tags/RELEASE_401/final)] (64-bit runtime)
```
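
For reference, a minimal sketch of where the richer string comes from, assuming the script reports `sys.version` (an illustration, not the exact collect_env.py code):

```
import sys
import platform

# Only the dotted release, e.g. "3.7.7"
print(platform.python_version())

# The full build string with compiler info, matching the new output above
print(sys.version.replace("\n", " "))
```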

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28961500

Pulled By: ezyang

fbshipit-source-id: 0f95a49abf6977941f09a64243916576a820679f
2021-06-24 12:11:01 -07:00
d52ef2497a Python basic module execution unit test on delegation of backend_with_compiler_demo (#60468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60468

Added a unit test for the execution of a basic module with a compiler
ghstack-source-id: 132307488

Test Plan:
Running `python test/test_jit.py TestBackendsWithCompiler -v` passes

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29306225

fbshipit-source-id: bf1ff075ebc63acbbe46d6ea030086405e29d7d3
2021-06-24 11:43:45 -07:00
b7298f499d Annotate NoneType as Optional[type] (#60383)
Summary:
------------
Infer NoneType as Optional[torch.Tensor] for monkeytype type inference
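
As a minimal illustration of the target annotation (a hypothetical example, not the inference machinery itself):

```
from typing import Optional

import torch

# A parameter defaulting to None is inferred as Optional[torch.Tensor]
# rather than NoneType, so callers may pass a Tensor or nothing at all.
def scale(x: torch.Tensor, bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    if bias is None:
        return x
    return x + bias

print(scale(torch.ones(2)))
print(scale(torch.ones(2), torch.full((2,), 0.5)))
```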

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60383

Test Plan:
------
python test/test_jit.py -k TestPDT.test_nonetype_as_optional_of_type

Reviewed By: gmagogsfm

Differential Revision: D29341513

Pulled By: nikithamalgifb

fbshipit-source-id: 9a96670cb5cf2560cd4e19962faef5fecea8b24a
2021-06-24 11:00:26 -07:00
5a077bb10b Optimize some redunction operators on CPU BFloat16 (#55202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55202

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836790

Pulled By: VitalyFedyunin

fbshipit-source-id: f3a29633d85eb5a614652e568140e9b19509f959
2021-06-24 10:50:24 -07:00
4aff267072 Fix Windows error in distributed (#60167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167

We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
  test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
    if pipe.poll(None):
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
    return self._poll(timeout)
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
    return bool(wait([self], timeout))
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
    ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)

Thread 0x00001f68 (most recent call first):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap

Current thread 0x00000f94 (most recent call first):
<no Python frame>
FAIL (5.009s)
```
And the process would then exit with error code 3221226505.
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps

By looking at [the code of `_event_listener` in `common_distributed.py`](eb36f67dcc/torch/testing/_internal/common_distributed.py (L467-L489)) I think that the first exception (the one about the handle being invalid) is "expected" as it results from another thread purposely closing the pipe while that thread is polling it.

The relevant part of the problem seems to be the "could not acquire lock" one. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter will not wait for that thread to complete before shutting down. When the interpreter shuts down it instantly aborts all other threads. If the event listener thread was aborted _while_ it was logging to stdout then that thread was holding the lock but never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.

The solution thus is simple: don't make that thread a daemon thread and explicitly wait for it to terminate before shutting down.
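
A minimal sketch of the fix's pattern with a simplified listener loop (the names here are illustrative, not the actual test code):

```
import threading
import time

stop = threading.Event()

def event_listener() -> None:
    # Stand-in for the pipe-polling loop in the real listener.
    while not stop.is_set():
        time.sleep(0.01)
    print("listener exiting cleanly")

# daemon=False: interpreter shutdown waits for this thread instead of
# aborting it mid-logging.
t = threading.Thread(target=event_listener, daemon=False)
t.start()
stop.set()   # signal the listener to stop...
t.join()     # ...and wait for it before shutting down
```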
ghstack-source-id: 132293710

Test Plan: Will see...

Reviewed By: pritamdamania87

Differential Revision: D29193014

fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780
2021-06-24 10:35:38 -07:00
f2f2f5bf20 .github: Zip test reports before uploading (#60475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60475

Uploading many artifacts can cause issues with GHA backend leading to
errors on our side. To be safe let's zip our artifacts into one archive
so that we avoid uploading too many files at once.

See: https://github.com/actions/upload-artifact#too-many-uploads-resulting-in-429-responses

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29307205

Pulled By: seemethere

fbshipit-source-id: da8c9957f88bdcc758969157ee696205db5d4dff
2021-06-24 10:30:51 -07:00
7e619b9588 First step to rearrange files in tools folder (#60473)
Summary:
Changes including:
- introduced `linter/`, `testing/`, `stats/` folders in `tools/`
- move appropriate scripts into these folders
- change grepped references in the pytorch/pytorch repo

Next step
- introduce `build/` folder for build scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60473

Test Plan:
- CI (this is important b/c pytorch/test-infra also rely on some script reference.
- tools/tests/

Reviewed By: albanD

Differential Revision: D29352716

Pulled By: walterddr

fbshipit-source-id: bad40b5ce130b35dfd9e59b8af34f9025f3285fd
2021-06-24 10:13:58 -07:00
40d2fe1053 correct filename issue for test_cpp_extensions_aot (#60604)
Summary:
Uses a file copy to create the ninja vs. no_ninja suffixed Python test files.
This tricks xmlrunner into reporting test cases in the correct folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60604

Test Plan:
- CI reports correctly into the corresponding folders
- If download the test statistics, calculate shards now doesn't need custom logic to handle `test_cpp_extensions_aot`

CI result shown it is working properly:
https://github.com/pytorch/pytorch/pull/60604/checks?check_run_id=2900038654 vs
https://github.com/pytorch/pytorch/pull/60604/checks?check_run_id=2900038673

Reviewed By: albanD

Differential Revision: D29349562

Pulled By: walterddr

fbshipit-source-id: e86e6bc0db288a2a57bea3c5f8edf03be1773944
2021-06-24 09:20:19 -07:00
9cab894367 Fix build_only for libtorch (#60615)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60605

We have `build_only` defined, but config.yml doesn't have the parameter; this PR fixes that. As a result, the docker image push will be skipped

```
// in config.yml

if [ -z "${BUILD_ONLY}" ]; then
```

```
            ("11.1", [
                ("3.8", [
                    ("shard_test", [XImportant(True)]),
                    ("libtorch", [
                        (True, [
                            ('build_only', [X(True)]),
                        ]),
                    ]),
                ]),
            ]),
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60615

Reviewed By: albanD

Differential Revision: D29351567

Pulled By: zhouzhuojie

fbshipit-source-id: dab78bb91f62e8bed47739377987167fea1602cb
2021-06-24 09:11:54 -07:00
eddc5f40f9 Added GLU and FeatureAlphaDropout to nn docs (#60590)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60563 and https://github.com/pytorch/pytorch/issues/60570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60590

Reviewed By: albanD

Differential Revision: D29352372

Pulled By: jbschlosser

fbshipit-source-id: f81dd65deab1848a68dc202df252c416ce5214d0
2021-06-24 08:00:18 -07:00
204da12592 Reduce number of CEX when passing Tensors to Python (#60546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60546

Before, we assume conservatively that any Tensor passed to
THPVariable_Wrap could be aliased in another thread and therefore race.
However, THPVariable_Wrap takes in Variable by value; and so if
use_count() <= 1, it is impossible for another thread to have a
reference to it.  So we can conclude that it is definitely uninitialized
if the quick test fails!

Thanks bdhirsh for pointing out the optimization opportunity here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29331718

Pulled By: ezyang

fbshipit-source-id: e100796fbc55a0af2c6565c6fbc9ddc8ae7ceb42
2021-06-24 07:40:39 -07:00
bdb964f89f Support RRefs that contain threading.Locks (#57943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57943

This is a common scenario (our own tutorials propose it), hence we should ensure it works.

A more generic solution is desirable, but this should fix the immediate concern.
ghstack-source-id: 132289683

Test Plan: Added a test

Reviewed By: mrshenli

Differential Revision: D28316076

fbshipit-source-id: 64e9766189f40474298876227ea247ce5b699d97
2021-06-24 06:36:09 -07:00
4e347f1242 [docs] Fix backticks in docs (#60474)
Summary:
There is a very common error when writing docs: One forgets to write a matching `` ` ``, and something like ``:attr:`x`` is rendered in the docs. This PR fixes most (all?) of these errors (and a few others).

I found these running ``grep -r ">[^#<][^<]*\`"`` on the `docs/build/html/generated` folder. The regex finds an HTML tag that does not start with `#` (as python comments in example code may contain backticks) and that contains a backtick in the rendered HTML.

This regex has not given any false positive in the current codebase, so I am inclined to suggest that we should add this check to the CI. Would this be possible / reasonable / easy to do malfet ?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60474

Reviewed By: mrshenli

Differential Revision: D29309633

Pulled By: albanD

fbshipit-source-id: 9621e0e9f87590cea060dd084fa367442b6bd046
2021-06-24 06:27:41 -07:00
bb9e1150ea Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream
Test Plan: revert-hammer

Differential Revision:
D29342234 (675cea1adb)

Original commit changeset: 98e6be7fdd85

fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6
2021-06-24 04:49:28 -07:00
2b72068a68 Make Future store Storages instead of references to DataPtrs (#60470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60470

A Future needs to know what DataPtrs are used by its value, but it isn't always able to extract them (and even when it is, that's expensive), so they're cached. DataPtrs are kind of like unique_ptrs (movable only, cannot be copied), hence the Future can only hold _references_ to them. The Future's value, however, is unfortunately mutable (we wish that weren't the case, but we don't think we can prevent that), which means the tensor/storage that owned that DataPtr might be deleted and thus the DataPtr could be freed. This means our cached reference becomes stale, which leads to all kinds of disasters, like reading garbage data or segfaulting.

Luckily all the DataPtrs we were dealing with were held inside Storages, which have shared_ptr semantics, allowing us to hold a strong pointer to them that ensures they're kept alive.

ghstack-source-id: 132177396

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29303570

fbshipit-source-id: d814754806fa58b24e45269e97d768485ef972ba
2021-06-24 03:56:04 -07:00
06e6d63187 Use a no-warning registry for TensorPipe backends (#60457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60457

The "without warning" variants of the registry were introduced in https://github.com/pytorch/pytorch/pull/31126 to be used in Gloo for the exact same reason: we use a registry precisely so that backends can be overridden, no need to scare users with a warning.
ghstack-source-id: 132051268

Test Plan: Rebuilt and re-run

Reviewed By: mrshenli

Differential Revision: D29293840

fbshipit-source-id: 3450e547056b2c534166972e8266dab5479d5e43
2021-06-24 03:27:04 -07:00
d3a8505ee1 [jit] Added a pass to transform aten::cat ops to prim::Concat op with variable number of inputs (#59881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59881

This pass is not included in the JIT flow or anywhere else at this point. The idea is, once this lands, everyone can use this to test their workflow with this transformation and once we are convinced this is useful and/or improves performance, we can include it in the appropriate workflow.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29277876

Pulled By: navahgar

fbshipit-source-id: b5be7bdcc98dced59295bd7b8f6627619cb58d41
2021-06-24 01:27:41 -07:00
c35a3dd6f2 [jit] Added a new operator for concat that takes in variadic parameters (#59880)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59880

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29277877

Pulled By: navahgar

fbshipit-source-id: 6db24e7432f683a1d1466f9778201e0aa5d3b1ad
2021-06-24 01:26:22 -07:00
dfd2edc025 [special] add zeta (#59623)
Summary:
Reference https://github.com/pytorch/pytorch/issues/50345

`zeta` was already present in the codebase to support computation of `polygamma`.

However, `zeta` only had a `double(double, double)` signature **for CPU** before this PR (which meant that `polygamma` computations were always upcast to `double` for the zeta part).

With this PR, float computations will take place in float and double in double.

Have also refactored the code and moved the duplicate code from `Math.cuh` to `Math.h`

**Note**: For scipy, q is optional, and if it is `None`, it defaults to `1`, which corresponds to the Riemann zeta function. However, for `torch.special.zeta`, I made it mandatory because it feels odd that without `q` this is the Riemann zeta and with `q` it is the general Hurwitz zeta. I think sticking to just the general form makes more sense, as passing `1` for q is trivial.

Verify:
* [x] Docs https://14234587-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.zeta
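
A short usage sketch (`q=1` recovers the Riemann zeta function):

```
import torch

x = torch.tensor([2.0, 4.0])
q = torch.tensor(1.0)

# Hurwitz zeta: zeta(x, q) = sum over k >= 0 of 1 / (k + q)**x
print(torch.special.zeta(x, q))
# tensor([1.6449, 1.0823])  # i.e. pi**2 / 6 and pi**4 / 90
```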

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59623

Reviewed By: ngimel

Differential Revision: D29348269

Pulled By: mruberry

fbshipit-source-id: a3f9ebe1f7724dbe66de2b391afb9da1cfc3e4bb
2021-06-24 00:00:12 -07:00
26cdec6ce4 Support torch.bitwise_{left/right}_shift and __rlshift__, __rrshift__ (#59544)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58121

This PR implements `torch.bitwise_left_shift`, `torch.bitwise_right_shift`, and `torch.Tensor.{__rlshift__, __rrshift__}` for compatibility with the Python array API standard.
(cc: mruberry, rgommers, emcastillo, kmaehashi)
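
A short usage sketch:

```
import torch

a = torch.tensor([1, 2, 8], dtype=torch.int32)

print(torch.bitwise_left_shift(a, 2))    # tensor([ 4,  8, 32], dtype=torch.int32)
print(torch.bitwise_right_shift(a, 1))   # tensor([0, 1, 4], dtype=torch.int32)

# __rlshift__ / __rrshift__ make the reflected operator forms work as well:
print(2 << torch.tensor([1, 2, 3]))      # tensor([ 4,  8, 16])
```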

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59544

Reviewed By: ngimel

Differential Revision: D29348869

Pulled By: mruberry

fbshipit-source-id: 329aee296cf890735e8a9f858bccfe87c03d06ca
2021-06-23 23:57:16 -07:00
b82453cbd4 Run dist_autograd backward RPCs on appropriate CUDA streams. (#60606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60606

TensorPipe receives tensors over the wire on custom streams and these
streams are passed to some RPC callbacks but not to `BACKWARD_AUTOGRAD_REQ`. As a
result, `BACKWARD_AUTOGRAD_REQ` ran on the default stream while still using
tensors from the custom stream. This resulted in downstream autograd operations
running on the incorrect stream.

To fix this, I've passed the streams to `BACKWARD_AUTOGRAD_REQ` as well and
added an appropriate guard.

#Closes: https://github.com/pytorch/pytorch/issues/59793
ghstack-source-id: 132252069

Test Plan: Test https://github.com/pytorch/pytorch/issues/59793

Reviewed By: mrshenli

Differential Revision: D29347244

fbshipit-source-id: 8ff8b150763c970ab15c2cac8dccf56e66e9ef5d
2021-06-23 23:52:22 -07:00
675cea1adb [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not the default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.
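
A sketch of the now-safe benign-looking pattern under the new semantics (requires a CUDA device; illustrative only):

```
import torch

if torch.cuda.is_available():
    s = torch.cuda.Stream()
    x = torch.randn(8, device="cuda", requires_grad=True)
    with torch.cuda.stream(s):
        loss = (x * x).sum()
        loss.backward()
        # backward() now syncs the *current* stream (s) with the leaf
        # streams, so consuming grads here, inside the same stream
        # context, is safe.
        update = x.grad * 0.1
```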

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D29342234

Pulled By: ngimel

fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
2021-06-23 23:35:24 -07:00
00896cb9ed [caffe2] update db::Transaction::Put() to accept the value by rvalue reference (#60208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60208

Update the DB APIs so that `db::Transaction::Put()` accepts the value by
rvalue reference.  This allows DB implementations to write data asynchronously
without being forced to make an additional copy of the data in memory.
`Put()` implementations can now use the string move constructor or assignment
operator to get the string data and continue performing the write
asynchronously after returning from `Put()`.

Note that I chose to entirely replace the existing `Put()`, removing the
ability for callers to call `Put()` with a `const std::string&` argument for
the value, rather than simply adding another overloaded version of `Put()`.

This was done because in practice there were no call sites using `Put()` that
cannot move their data in.  Eliminating the `const std::string&` API entirely
simplifies the DB implementations: DBs that wish to support move semantics do
not have to implement both the move and the copy versions of `Put()`.

Test Plan:
Searched through fbcode to try and make sure I found all `db::Transaction`
subclasses, and will check sandcastle results to help confirm.

Ran the modelstore checkpointing unit tests.

Differential Revision: D29204425

fbshipit-source-id: 28be6646e92e5df71954d4bb3dc0c8add30ed041
2021-06-23 22:12:53 -07:00
b09c0b6550 [caffe2] update the BlobSerializer acceptor to allow moving in the data (#60207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60207

Update the `BlobSerializerBase` API so that the serialized blob data is
passed as a `std::string&&` rather than `const std::string&`.  This allows the
acceptor to take ownership of the string data and to do
things like queue it for storing asynchronously, rather than having to make a
copy of the data if it needs to remain valid after returning.

All existing `BlobSerializerBase` implementations already pass in a valid
rvalue reference to the data, so this change did not require updating any of
the existing serializer implementations.
ghstack-source-id: 132216750

Test Plan:
Examined all ~46 `BlobSerializerBase` subclasses in fbsource to confirm they
already pass in an rvalue reference for this argument.  Also searched for
`BlobSerializerBase` on google and did not find any external references to
this class in other open source projects that might be affected.

Differential Revision: D29204426

fbshipit-source-id: b1d567e52a5c17a01d651c70bbfa2fddbaea6cd9
2021-06-23 22:11:42 -07:00
6ea22672c4 add support for sparse tensors in torch.testing.assert_close (#58844)
Summary:
This adds support for sparse tensors the same way `torch.testing._internal.common_utils.TestCase.assertEqual` does:

5c7dace309/torch/testing/_internal/common_utils.py (L1287-L1313)

- Tensors are coalesced before comparison.
- Indices and values are compared individually.
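
A short sketch of what this enables:

```
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])
a = torch.sparse_coo_tensor(i, v, (2, 2))
# Same logical tensor with the entries stored in the opposite order:
b = torch.sparse_coo_tensor(i.flip(1), v.flip(0), (2, 2))

# Both inputs are coalesced first, then indices and values are compared.
torch.testing.assert_close(a, b)
```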

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58844

Reviewed By: zou3519

Differential Revision: D29160250

Pulled By: mruberry

fbshipit-source-id: b0955656c2c7ff3db37a1367427ca54ca14f2e87
2021-06-23 21:59:01 -07:00
80f40b172f [Model Averaging] Periodic model averager (#60320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60320

This averager can be used for post-local SGD.
ghstack-source-id: 131908011

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager

Reviewed By: rohan-varma

Differential Revision: D29249850

fbshipit-source-id: 09675d6bb1edfb8ffbeb94510d91962532d8ca3e
2021-06-23 20:23:04 -07:00
4e51503b1f DOC Improves input and target docstring for loss functions (#60553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60553

Reviewed By: VitalyFedyunin

Differential Revision: D29343797

Pulled By: jbschlosser

fbshipit-source-id: cafc29d60a204a21deff56dd4900157d2adbd91e
2021-06-23 20:20:29 -07:00
6d1b4642f0 DOC Describes parameters/buffers registered as None in load_state_dict (#60549)
Summary:
Related to https://github.com/pytorch/pytorch/issues/8104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60549

Reviewed By: VitalyFedyunin

Differential Revision: D29343732

Pulled By: jbschlosser

fbshipit-source-id: ef5ba3094c8eaf2f9c8efeba6a9d9ab52ebf8b2c
2021-06-23 20:15:22 -07:00
1e31d26b1d [Static Runtime] Fix bugs in static_runtime::to_copy (#60503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503

Fixed a few issues in the static_runtime::to_copy impl:
- fixed a bug with memory_format
- copy strides when appropriate. This is necessary to make sure that the fbgemm path in the copy kernel gets hit.
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`

Add more unit tests:
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`

Reviewed By: ajyu

Differential Revision: D26838933

fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
2021-06-23 19:57:17 -07:00
d200e9de26 [Static Runtime] Test for dynamic shapes in SR unit tests (#60579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60579

- Modify testStaticRuntime to take two sets of inputs so if the second set of inputs have bigger shapes, it would trigger memory allocations in resize_ calls.
- Modify test scripts so that the output of the test op is managed by the memory planner, as explained in comments.

Reviewed By: ajyu

Differential Revision: D29221452

fbshipit-source-id: 09f0f7eb384dc8ca67594f1fa76e1e31392ee6ca
2021-06-23 19:56:05 -07:00
99b641169b Migrates nll_loss_forward from TH to Aten (CUDA) (#60097)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24610
Aten Umbrella issue https://github.com/pytorch/pytorch/issues/24507
Related to https://github.com/pytorch/pytorch/issues/59765

The performance does not change between this PR and master with the following benchmark script:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)

        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```

</details>

## master

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

## this PR

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60097

Reviewed By: mrshenli

Differential Revision: D29303099

Pulled By: ngimel

fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
2021-06-23 19:47:01 -07:00
ef84bcfee6 Convert floating-point constants to T in Bessel functions (#59416)
Summary:
If T is float, many of the computations are more expensive than
expected. Compilers may be reluctant to optimize because such transformations
often lead to a different outcome. Converting the constants to T before using
them clears any doubt.

Benchmark: (Debian 11, no turbo, Release build, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, gcc 10.2.1)

```python
import timeit
for dtype in ('torch.float',):
    for func in ('i0', 'i0e', 'i1', 'i1e'):
        for n, t in [(10_000, 10000),
                    (100_000, 1000)]:
            print(f'torch.special.{func}(torch.arange(n, dtype=torch.float32)), n = {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'torch.special.{func}(a)', setup=f'import torch; a = torch.arange({n}, dtype=torch.float32)', number=t))
```

Before:

```
torch.special.i0(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
1.539132010017056
torch.special.i0(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.9613071230123751
torch.special.i0e(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
4.32450835997588
torch.special.i0e(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
1.5751779029960744
torch.special.i1(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
1.0810036820184905
torch.special.i1(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.5314770240220241
torch.special.i1e(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
0.41711462699458934
torch.special.i1e(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.1759720179834403
```

After:

```
torch.special.i0(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
1.337154256994836
torch.special.i0(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.8640981369826477
torch.special.i0e(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
4.308618158014724
torch.special.i0e(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
1.5217605629877653
torch.special.i1(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
0.9398589830088895
torch.special.i1(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.4667845010117162
torch.special.i1e(torch.arange(n, dtype=torch.float32)), n = 10000 for 10000 times, dtype=torch.float
0.3658539849857334
torch.special.i1e(torch.arange(n, dtype=torch.float32)), n = 100000 for 1000 times, dtype=torch.float
0.15680673700990155
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59416

Reviewed By: anjali411

Differential Revision: D29249897

Pulled By: mruberry

fbshipit-source-id: c170e78f2ab47176ea95b8442c6279d7ec1d75c2
2021-06-23 19:43:27 -07:00
08020220f3 [Testing] Adding reference tests to OpInfo class (#59369)
Summary:
This PR adds a `ref` argument to the `OpInfo` base class. The idea is to add reference checks for all the _eligible_ ops. For more discussion, please check https://github.com/pytorch/pytorch/issues/58294

* [x] Migrate (but not remove yet) and modify helper functions from the `UnaryUfuncOpInfo` class to the `OpInfo` base class.
* [x] Test the reference checks for multiple ops. (also decide a list of different and eligible ops for this)
* [x] Handle possible edge cases (for example: `uint64` isn't implemented in PyTorch but is there in NumPy, and this needs to be handled -- more on this later) -- _Update_: We decided that these reference tests should only test for values and not types.
* [x] Create a sample PR for a single (of all different categories?) on adding reference functions to the eligible ops. -- _Update_: This is being done in this PR only.
* [x] ~Remove reference tests from `test_unary_ufuncs.py` and test to make sure that nothing breaks.~ (*Update*: We won't be touching Unary Ufunc reference tests in this PR)
* [x] Add comments, remove unnecessary prints/comments (added for debugging).

Note: To keep the PR description short, examples of edge cases encountered have been mentioned in the comments below.

cc: mruberry pmeier kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59369

Reviewed By: ngimel

Differential Revision: D29347252

Pulled By: mruberry

fbshipit-source-id: 69719deddb1d23c53db45287a7e66c1bfe7e65bb
2021-06-23 19:26:08 -07:00
236d3afd82 manual revert of 57575 (#60572)
Summary:
Manually reverting 57575 while keeping 57574, since the latter fixes a bug: https://github.com/pytorch/pytorch/issues/55609.
Sandcastle couldn't do it automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60572

Reviewed By: driazati

Differential Revision: D29342473

Pulled By: Krovatkin

fbshipit-source-id: 66ad7d316984a13d203158ceba9706a5f451f9b2
2021-06-23 19:21:48 -07:00
9e773ea7d5 Use accscalar_t for CUDA add/sub with Tensor and Scalar (#60454)
Summary:
Follow up of https://github.com/pytorch/pytorch/issues/60227, related to https://github.com/pytorch/pytorch/issues/59907 & https://github.com/pytorch/pytorch/issues/58833

With this pull request, `torch.add` & `torch.sub` use `acc_type` for the `Scalar` if either of the two arguments is a `Scalar`.
This mimics the behavior of [`torch.mul`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu#L18), `torch._foreach_(add|sub).Scalar` and `torch._foreach_(add|sub).ScalarList`.
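
A small illustration of the user-visible effect (requires CUDA; exact numerics are device-dependent):

```
import torch

if torch.cuda.is_available():
    t = torch.full((4,), 2.0, dtype=torch.half, device="cuda")
    # The Python float 0.1 is now handled in the accumulate type (float)
    # before the result is cast back to half, matching torch.mul and the
    # _foreach_ variants.
    print(torch.add(t, 0.1))
    print(torch.sub(t, 0.1))
```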

 ---

**reference**
- torch.mul CUDA kernel: b0c9762e2d/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu (L17-L25)
- `torch._foreach_(add|sub).Scalar`: cast scalar b0c9762e2d/aten/src/ATen/native/cuda/ForeachBinaryOpScalar.cu (L27)
- `torch._foreach_(add|sub).ScalarList`: `BinaryOpScalarListFunctor` b0c9762e2d/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L180-L182) and multi_tensor_apply handles `scalar_t` and computes `opmath_t` (almost equivalent `accscalar_t`)  b0c9762e2d/aten/src/ATen/native/cuda/MultiTensorApply.cuh (L60-L68). BinaryOpScalarListFunctor
is used b0c9762e2d/aten/src/ATen/native/cuda/ForeachBinaryOpScalarList.cu (L24)

cc ngimel ptrblck mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60454

Reviewed By: VitalyFedyunin

Differential Revision: D29345035

Pulled By: ngimel

fbshipit-source-id: 5dbafbdfe029a9544ec2e58f17d547928e017a04
2021-06-23 18:59:22 -07:00
af66824c1f [torch][segment_reduce] Add support for sum and min reductions (#60379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60379

This concludes the support for all reduction types initially planned (min, max, mean, sum).

Next Steps:
- Cleanups
  - update default values when length is 0 and `initial` is not given
  - templatize the code to avoid branching on every item (and other known improvements)
- more unit tests, verification
- benchmarking

Test Plan: updated unit tests.

Reviewed By: ngimel

Differential Revision: D29268218

fbshipit-source-id: c77d91671e01dcf96c18c758fa3ea522b2e13db9
2021-06-23 18:51:44 -07:00
63219f1f9f To add Rectified Adam Algorithm to Optimizers (#58968)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/24892

In the paper https://arxiv.org/pdf/1908.03265.pdf, Liyuan Liu et al. proposed a new optimization algorithm that is similar in essence to the Adam algorithm.

The paper discusses how, without a warmup heuristic, the early stages of adaptive optimization algorithms can exhibit an undesirably large variance in the adaptive learning rate, which can slow the overall convergence process.

The authors propose rectifying the variance of the adaptive learning rate when it is expected to be high.

Differing from the paper, we set the variance tractability cut-off to 5 instead of 4. This adjustment is common practice and can be found in the reference code repository as well as in the TensorFlow Swift optimizer library:

2f03dd1970/radam/radam.py (L156)

f51ee4618d/Sources/TensorFlow/Optimizers/MomentumBased.swift (L638)
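
A short usage sketch, assuming the optimizer lands as `torch.optim.RAdam`:

```
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()   # variance rectification is applied automatically in early steps
opt.zero_grad()
```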

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58968

Reviewed By: vincentqb

Differential Revision: D29310601

Pulled By: iramazanli

fbshipit-source-id: b7bd487f72f1074f266687fd9c0c6be264a748a9
2021-06-23 18:27:57 -07:00
5a2f41a2db [torch/distributed.elastic] Fix utils.distributed_test.test_create_store_timeout_on_server to be dual-stack ip compatible (#60558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558

Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260

`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`. Prior to this change the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, then try creating the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual stack ip setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an ipv6 port, while `TCPStore` will try binding to an ipv4 port, so an `IOError` is not observed.

Changing the logic of the test to create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`) while the second tries to create a `TCPStore` server on the port that the first store is already running on. This would induce an `IOError` on the second store's constructor.

NOTE: this change does not solve another broader issue with `TCPStore` where the server and workers can listen and connect on ipv4 vs ipv6 when they are running on dual-stack ip hosts without an ipv4 DNS entry and/or a `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124

Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```

Reviewed By: cbalioglu

Differential Revision: D29334947

fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
2021-06-23 17:12:18 -07:00
1a0058f593 [nnc] Merge inconsistent profiling information (#60510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60510

We encountered a situation where loop unrolling caused us to duplicate
profiled tensor types in a manner that wasn't logically consistent (see the
attached test case).  When applying this profiling information, we need to
merge the profiled types so that we use a conservative (unspecialized) type.
ghstack-source-id: 132160002

Test Plan: new unit test, plus local predictor using P424983338

Reviewed By: Krovatkin

Differential Revision: D29322487

fbshipit-source-id: 4c18ee69c71bb0622c2e6f6aa361ab5613cbaca4
2021-06-23 17:05:32 -07:00
b5b42d4ce2 [iOS GPU] Add tests for RoIAlign (#60595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60595

ghstack-source-id: 132245331

Test Plan: CI

Reviewed By: husthyc

Differential Revision: D29345400

fbshipit-source-id: 7406edee232a0ab7b40a4820e3ff9ac07871cdd4
2021-06-23 16:26:53 -07:00
1120a1b92e [quant][fx][fix] QAT with object_type in qconfig (#60555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60555

When we do QAT, we swap the FP32 modules with their corresponding QAT module counterparts by calling `qat_swap_modules` in prepare.
However, when we try to look up the swapped module type in qconfig_dict, we can no longer find a match, since the qconfig dict contains the original
module type.

In this PR we update the qconfig_dict to include the modules swapped for QAT.
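
For context, a minimal sketch of the qconfig_dict shape this fixes (the module and backend choices are illustrative):

```
import torch
from torch.quantization import get_default_qat_qconfig
from torch.quantization.quantize_fx import prepare_qat_fx

qconfig_dict = {
    # object_type maps a module class to a qconfig; the lookup now also
    # succeeds after the FP32 module is swapped for its QAT counterpart.
    "object_type": [(torch.nn.Conv2d, get_default_qat_qconfig("fbgemm"))],
}
model = torch.nn.Sequential(torch.nn.Conv2d(1, 1, 1)).train()
prepared = prepare_qat_fx(model, qconfig_dict)
```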

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qconfig_qat_module_type

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29337036

fbshipit-source-id: 60212eec3ee252a2445c1b58874cb36048c9f7dd
2021-06-23 15:55:25 -07:00
d867340c7b [nnc] Add LoopNest::getLoopAt to retrieve a specified inner For-stmt (#60569)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60569

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29337767

Pulled By: huiguoo

fbshipit-source-id: e3ae23c1b290739c03d1fa5d7da25de878eb1d4c
2021-06-23 15:53:29 -07:00
c0d08dc10f [NNC] Add tile transformation in loopnest (fixed #52785) (#57758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57758

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28260744

Pulled By: huiguoo

fbshipit-source-id: 6b5591850aaf46455bf3c2d776fa930654839a63
2021-06-23 15:52:19 -07:00
aeea5bf4a1 [Model Averaging] Provide a util function for model averaging (#60303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303

The util function can be used for averaging parameters.

More optimizations can be done in the future.
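
A minimal sketch of what parameter averaging amounts to, assuming an already-initialized process group (not the actual util's code):

```
import torch
import torch.distributed as dist

def average_parameters(params) -> None:
    # All-reduce each parameter, then divide by the world size so every
    # rank ends up with the element-wise mean across ranks.
    world_size = dist.get_world_size()
    for p in params:
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data.div_(world_size)
```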
ghstack-source-id: 132214212

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_average_parameters

Reviewed By: rohan-varma

Differential Revision: D29242806

fbshipit-source-id: 76fb5a92adb4bdc6151a9f411e366a0ed2a31f47
2021-06-23 15:41:15 -07:00
b770c4b61a Fix ZeRO sort to be by numel (#60556)
Summary:
**Overview:**
This is a follow-up to [this PR](https://github.com/pytorch/pytorch/pull/59586) and corrects the ZeRO partitioning algorithm to sort by the number of elements in the tensor rather than the size of the first dimension. As context, that PR was meant to migrate from using a _naive greedy_ algorithm to a _sorted-greedy_ algorithm when partitioning parameters in ZeRO.
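
A minimal sketch of the sorted-greedy assignment, independent of the actual `ZeroRedundancyOptimizer` code:

```
from typing import List

def partition_by_numel(numels: List[int], world_size: int) -> List[List[int]]:
    # Sorted-greedy: visit params by numel, largest first, and give each
    # one to the rank with the smallest running total.
    totals = [0] * world_size
    buckets: List[List[int]] = [[] for _ in range(world_size)]
    for idx in sorted(range(len(numels)), key=lambda i: numels[i], reverse=True):
        rank = totals.index(min(totals))
        buckets[rank].append(idx)
        totals[rank] += numels[idx]
    return buckets

print(partition_by_numel([100, 90, 40, 40, 30], 2))  # [[0, 3], [1, 2, 4]]
```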

**Updated Results:**
The updated table for the partitions can be found [here](https://github.com/pytorch/pytorch/pull/59410#issuecomment-865203219). There, I also considered a third algorithm (sometimes known as multifit), which is more computationally expensive than the greedy and sorted-greedy algorithms but cannot perform worse. However, because of its increased complexity and lack of improved results, I chose to settle with the simpler sorted-greedy algorithm.

The `step()` latencies show slight improvements, but the improvements may be in the noise. The values below are in seconds and were generated using NCCL backend (unlike in the previous PR which used Gloo):

Two processes:
| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Sorted-Greedy (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.047 (0.00142) | **0.044 (0.00025)** |
| ResNet-152 | 0.057 (0.00034) | **0.054 (0.00022)** |
| BERT | 0.021 (0.00008) | **0.020 (0.00008)** |

Four processes:
| Model | Max `optimizer.step()` Time - Greedy | Max `optimizer.step()` Time - Sorted-Greedy (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.019 (0.00065) | **0.013 (0.00040)** |
| ResNet-152 | 0.045 (0.00024) | 0.045 (0.00025) |
| BERT | 0.019 (0.00022) | **0.018 (0.00016)** |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60556

Test Plan:
I verified that the ZeRO tests pass (via the AI AWS cluster):
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

Reviewed By: VitalyFedyunin

Differential Revision: D29335260

Pulled By: andwgu

fbshipit-source-id: 469d1c6e029b77c1b300a94cd1fd94b633cd28dd
2021-06-23 15:22:36 -07:00
1054ad5af3 Add back smoke tests for windows shard 1 for CircleCI (#60571)
Summary:
The reason I removed the smoke tests here was that we didn't have gflags on our GHA runners and we wanted to get sharding done sooner rather than later.

However, we shouldn't remove these tests for Windows, as they are important for debugging linker issues with torch. Thus, this is step 1 in adding the tests back.

Next step:
- add gflags to base ami
- remove the exist check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60571

Test Plan: CI shouldn't break

Reviewed By: walterddr

Differential Revision: D29341850

Pulled By: janeyx99

fbshipit-source-id: 7e0c98887534d096f867e28a5482b32aa493b132
2021-06-23 14:52:14 -07:00
555c154df5 Use asyncio in tools/clang_tidy.py (#60495)
Summary:
This replaces Ninja-based parallel execution with asyncio, which is more idiomatic Python and easier to debug when things go wrong, since the data never leaves Python.
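
A minimal sketch of the asyncio pattern, assuming a bounded-concurrency runner (the commands are illustrative, not the clang-tidy invocation):

```
import asyncio

async def run_one(sem: asyncio.Semaphore, cmd: list) -> int:
    async with sem:  # cap concurrency, like a parallel build would
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        out, _ = await proc.communicate()
        print(out.decode().strip())
        return proc.returncode

async def main() -> None:
    sem = asyncio.Semaphore(4)
    cmds = [["echo", f"job {i}"] for i in range(8)]
    codes = await asyncio.gather(*(run_one(sem, c) for c in cmds))
    assert all(rc == 0 for rc in codes)

asyncio.run(main())
```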

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60495

Reviewed By: bhosmer

Differential Revision: D29315526

Pulled By: driazati

fbshipit-source-id: 196b1807fe4ee6db432d5fef146e52f96939b44d
2021-06-23 14:18:03 -07:00
2dedd96dd2 cmake: Prefer CMAKE_CURRENT_SOURCE_DIR to TORCH_SRC_DIR (#60493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60493

TORCH_SRC_DIR appears to be a bit bugged when it comes to identifying
include directories, so let's try using CMAKE_CURRENT_SOURCE_DIR
instead

<details>
<summary>Logs for builds with torchaudio</summary>

```
-- Building version 0.10.0a0+9e36281
running bdist_wheel
running build
running build_py
copying torchaudio/version.py -> build/lib.linux-x86_64-3.6/torchaudio
running build_ext
-- Configuring done
-- Generating done
-- Build files have been written to: /home/eliuriegas/work/audio/build/temp.linux-x86_64-3.6
[1/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o -c ../../third_party/kaldi/submodule/src/base/kaldi-error.cc
[2/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o -c ../../third_party/kaldi/submodule/src/base/kaldi-math.cc
[3/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/feature-functions.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/feature-functions.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/feature-functions.cc.o -c ../../third_party/kaldi/submodule/src/feat/feature-functions.cc
[4/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-matrix.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-matrix.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-matrix.cc.o -c ../../third_party/kaldi/src/matrix/kaldi-matrix.cc
[5/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -c ../../third_party/kaldi/submodule/src/feat/resample.cc
[6/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-vector.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-vector.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-vector.cc.o -c ../../third_party/kaldi/src/matrix/kaldi-vector.cc
[7/11] /usr/lib64/ccache/c++ -DINCLUDE_KALDI -DTORCH_API_INCLUDE_EXTENSION_H -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -D_torchaudio_EXPORTS -I../../ -I/tmp/tmp.GKeM3KKcFi/include/python3.6m -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT torchaudio/csrc/CMakeFiles/_torchaudio.dir/kaldi.cpp.o -MF torchaudio/csrc/CMakeFiles/_torchaudio.dir/kaldi.cpp.o.d -o torchaudio/csrc/CMakeFiles/_torchaudio.dir/kaldi.cpp.o -c ../../torchaudio/csrc/kaldi.cpp
[8/11] /usr/lib64/ccache/c++ -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include -isystem /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/include/breakpad -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -c ../../third_party/kaldi/submodule/src/feat/pitch-functions.cc
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc: In member function ‘void kaldi::OnlinePitchFeatureImpl::UpdateRemainder(const kaldi::VectorBase<float>&)’:
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc:814:11: warning: unused variable ‘full_frame_length’ [-Wunused-variable]
  814 |     int32 full_frame_length = opts_.NccfWindowSize() + nccf_last_lag_;
      |           ^~~~~~~~~~~~~~~~~
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc: In member function ‘void kaldi::OnlineProcessPitch::UpdateNormalizationStats(kaldi::int32)’:
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc:1504:35: warning: comparison of integer expressions of different signedness: ‘std::vector<kaldi::OnlineProcessPitch::NormalizationStats>::size_type’ {aka ‘long unsigned int’} and ‘kaldi::int32’ {aka ‘int’} [-Wsign-compare]
 1504 |   if (normalization_stats_.size() <= frame)
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
[9/11] : && /usr/bin/cmake -E rm -f third_party/kaldi/libkaldi.a && /usr/bin/ar qc third_party/kaldi/libkaldi.a  third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-vector.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/src/matrix/kaldi-matrix.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/feature-functions.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o && /usr/bin/ranlib third_party/kaldi/libkaldi.a && :
[10/11] : && /usr/lib64/ccache/c++ -fPIC -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG   -shared -Wl,-soname,_torchaudio.so -o torchaudio/csrc/_torchaudio.so torchaudio/csrc/CMakeFiles/_torchaudio.dir/pybind.cpp.o torchaudio/csrc/CMakeFiles/_torchaudio.dir/lfilter.cpp.o torchaudio/csrc/CMakeFiles/_torchaudio.dir/overdrive.cpp.o torchaudio/csrc/CMakeFiles/_torchaudio.dir/utils.cpp.o torchaudio/csrc/CMakeFiles/_torchaudio.dir/kaldi.cpp.o  -Wl,-rpath,/tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib:  /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libc10.so  /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libtorch_python.so  third_party/kaldi/libkaldi.a  /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libtorch.so  -Wl,--no-as-needed,"/tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed  /usr/local/lib/libbreakpad_client.a  /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libc10.so  -lpthread  -Wl,--no-as-needed,"/tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libtorch.so" -Wl,--as-needed  /tmp/tmp.GKeM3KKcFi/lib/python3.6/site-packages/torch/lib/libc10.so && :
[10/11] cd /home/eliuriegas/work/audio/build/temp.linux-x86_64-3.6 && /usr/bin/cmake -P cmake_install.cmake
-- Install configuration: "Release"
-- Installing: /home/eliuriegas/work/audio/build/lib.linux-x86_64-3.6/torchaudio/./_torchaudio.so
-- Set runtime path of "/home/eliuriegas/work/audio/build/lib.linux-x86_64-3.6/torchaudio/./_torchaudio.so" to ""
installing to build/bdist.linux-x86_64/wheel
running install
running install_lib
creating build/bdist.linux-x86_64/wheel
creating build/bdist.linux-x86_64/wheel/torchaudio
copying build/lib.linux-x86_64-3.6/torchaudio/kaldi_io.py -> build/bdist.linux-x86_64/wheel/torchaudio
copying build/lib.linux-x86_64-3.6/torchaudio/transforms.py -> build/bdist.linux-x86_64/wheel/torchaudio
copying build/lib.linux-x86_64-3.6/torchaudio/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio
creating build/bdist.linux-x86_64/wheel/torchaudio/compliance
copying build/lib.linux-x86_64-3.6/torchaudio/compliance/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/compliance
copying build/lib.linux-x86_64-3.6/torchaudio/compliance/kaldi.py -> build/bdist.linux-x86_64/wheel/torchaudio/compliance
creating build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/cmuarctic.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/librispeech.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/libritts.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/vctk.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/commonvoice.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/gtzan.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/ljspeech.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/speechcommands.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/tedlium.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/utils.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
copying build/lib.linux-x86_64-3.6/torchaudio/datasets/yesno.py -> build/bdist.linux-x86_64/wheel/torchaudio/datasets
creating build/bdist.linux-x86_64/wheel/torchaudio/_internal
copying build/lib.linux-x86_64-3.6/torchaudio/_internal/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/_internal
copying build/lib.linux-x86_64-3.6/torchaudio/_internal/fft.py -> build/bdist.linux-x86_64/wheel/torchaudio/_internal
copying build/lib.linux-x86_64-3.6/torchaudio/_internal/module_utils.py -> build/bdist.linux-x86_64/wheel/torchaudio/_internal
creating build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/common.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/no_backend.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/soundfile_backend.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/sox_io_backend.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
copying build/lib.linux-x86_64-3.6/torchaudio/backend/utils.py -> build/bdist.linux-x86_64/wheel/torchaudio/backend
creating build/bdist.linux-x86_64/wheel/torchaudio/extension
copying build/lib.linux-x86_64-3.6/torchaudio/extension/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/extension
copying build/lib.linux-x86_64-3.6/torchaudio/extension/extension.py -> build/bdist.linux-x86_64/wheel/torchaudio/extension
creating build/bdist.linux-x86_64/wheel/torchaudio/models
copying build/lib.linux-x86_64-3.6/torchaudio/models/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/models
copying build/lib.linux-x86_64-3.6/torchaudio/models/conv_tasnet.py -> build/bdist.linux-x86_64/wheel/torchaudio/models
copying build/lib.linux-x86_64-3.6/torchaudio/models/deepspeech.py -> build/bdist.linux-x86_64/wheel/torchaudio/models
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2letter.py -> build/bdist.linux-x86_64/wheel/torchaudio/models
copying build/lib.linux-x86_64-3.6/torchaudio/models/wavernn.py -> build/bdist.linux-x86_64/wheel/torchaudio/models
creating build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/components.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/model.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2
creating build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2/utils
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/utils/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2/utils
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/utils/import_fairseq.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2/utils
copying build/lib.linux-x86_64-3.6/torchaudio/models/wav2vec2/utils/import_huggingface.py -> build/bdist.linux-x86_64/wheel/torchaudio/models/wav2vec2/utils
creating build/bdist.linux-x86_64/wheel/torchaudio/sox_effects
copying build/lib.linux-x86_64-3.6/torchaudio/sox_effects/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/sox_effects
copying build/lib.linux-x86_64-3.6/torchaudio/sox_effects/sox_effects.py -> build/bdist.linux-x86_64/wheel/torchaudio/sox_effects
creating build/bdist.linux-x86_64/wheel/torchaudio/utils
copying build/lib.linux-x86_64-3.6/torchaudio/utils/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/utils
copying build/lib.linux-x86_64-3.6/torchaudio/utils/sox_utils.py -> build/bdist.linux-x86_64/wheel/torchaudio/utils
creating build/bdist.linux-x86_64/wheel/torchaudio/functional
copying build/lib.linux-x86_64-3.6/torchaudio/functional/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/functional
copying build/lib.linux-x86_64-3.6/torchaudio/functional/filtering.py -> build/bdist.linux-x86_64/wheel/torchaudio/functional
copying build/lib.linux-x86_64-3.6/torchaudio/functional/functional.py -> build/bdist.linux-x86_64/wheel/torchaudio/functional
creating build/bdist.linux-x86_64/wheel/torchaudio/prototype
copying build/lib.linux-x86_64-3.6/torchaudio/prototype/__init__.py -> build/bdist.linux-x86_64/wheel/torchaudio/prototype
copying build/lib.linux-x86_64-3.6/torchaudio/prototype/rnnt_loss.py -> build/bdist.linux-x86_64/wheel/torchaudio/prototype
copying build/lib.linux-x86_64-3.6/torchaudio/version.py -> build/bdist.linux-x86_64/wheel/torchaudio
copying build/lib.linux-x86_64-3.6/torchaudio/_torchaudio.so -> build/bdist.linux-x86_64/wheel/torchaudio
running install_egg_info
running egg_info
writing torchaudio.egg-info/PKG-INFO
writing dependency_links to torchaudio.egg-info/dependency_links.txt
writing requirements to torchaudio.egg-info/requires.txt
writing top-level names to torchaudio.egg-info/top_level.txt
reading manifest file 'torchaudio.egg-info/SOURCES.txt'
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
Copying torchaudio.egg-info to build/bdist.linux-x86_64/wheel/torchaudio-0.10.0a0+9e36281-py3.6.egg-info
running install_scripts
adding license file "LICENSE" (matched pattern "LICEN[CS]E*")
creating build/bdist.linux-x86_64/wheel/torchaudio-0.10.0a0+9e36281.dist-info/WHEEL
creating 'dist/torchaudio-0.10.0a0+9e36281-cp36-cp36m-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it
adding 'torchaudio/__init__.py'
adding 'torchaudio/_torchaudio.so'
adding 'torchaudio/kaldi_io.py'
adding 'torchaudio/transforms.py'
adding 'torchaudio/version.py'
adding 'torchaudio/_internal/__init__.py'
adding 'torchaudio/_internal/fft.py'
adding 'torchaudio/_internal/module_utils.py'
adding 'torchaudio/backend/__init__.py'
adding 'torchaudio/backend/common.py'
adding 'torchaudio/backend/no_backend.py'
adding 'torchaudio/backend/soundfile_backend.py'
adding 'torchaudio/backend/sox_io_backend.py'
adding 'torchaudio/backend/utils.py'
adding 'torchaudio/compliance/__init__.py'
adding 'torchaudio/compliance/kaldi.py'
adding 'torchaudio/datasets/__init__.py'
adding 'torchaudio/datasets/cmuarctic.py'
adding 'torchaudio/datasets/commonvoice.py'
adding 'torchaudio/datasets/gtzan.py'
adding 'torchaudio/datasets/librispeech.py'
adding 'torchaudio/datasets/libritts.py'
adding 'torchaudio/datasets/ljspeech.py'
adding 'torchaudio/datasets/speechcommands.py'
adding 'torchaudio/datasets/tedlium.py'
adding 'torchaudio/datasets/utils.py'
adding 'torchaudio/datasets/vctk.py'
adding 'torchaudio/datasets/yesno.py'
adding 'torchaudio/extension/__init__.py'
adding 'torchaudio/extension/extension.py'
adding 'torchaudio/functional/__init__.py'
adding 'torchaudio/functional/filtering.py'
adding 'torchaudio/functional/functional.py'
adding 'torchaudio/models/__init__.py'
adding 'torchaudio/models/conv_tasnet.py'
adding 'torchaudio/models/deepspeech.py'
adding 'torchaudio/models/wav2letter.py'
adding 'torchaudio/models/wavernn.py'
adding 'torchaudio/models/wav2vec2/__init__.py'
adding 'torchaudio/models/wav2vec2/components.py'
adding 'torchaudio/models/wav2vec2/model.py'
adding 'torchaudio/models/wav2vec2/utils/__init__.py'
adding 'torchaudio/models/wav2vec2/utils/import_fairseq.py'
adding 'torchaudio/models/wav2vec2/utils/import_huggingface.py'
adding 'torchaudio/prototype/__init__.py'
adding 'torchaudio/prototype/rnnt_loss.py'
adding 'torchaudio/sox_effects/__init__.py'
adding 'torchaudio/sox_effects/sox_effects.py'
adding 'torchaudio/utils/__init__.py'
adding 'torchaudio/utils/sox_utils.py'
adding 'torchaudio-0.10.0a0+9e36281.dist-info/LICENSE'
adding 'torchaudio-0.10.0a0+9e36281.dist-info/METADATA'
adding 'torchaudio-0.10.0a0+9e36281.dist-info/WHEEL'
adding 'torchaudio-0.10.0a0+9e36281.dist-info/top_level.txt'
adding 'torchaudio-0.10.0a0+9e36281.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel

```

</details>

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29316372

Pulled By: seemethere

fbshipit-source-id: 02be64df6197c0d4bad5a5bfb3cef336c11f53ed
2021-06-23 14:08:19 -07:00
ad1041576a Fix loop types (#60504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60504

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29313197

fbshipit-source-id: bc86622b587e4fdb73431c2ff27300404c9693ae
2021-06-23 13:26:22 -07:00
da030c59e7 ENH Adds Byte support for nll_loss (CPU) (#60308)
Summary:
Addresses a part of https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on the CPU for `input.dim() == 2`.

CUDA support will be implemented once the `nll_loss` migration to CUDA is completed in https://github.com/pytorch/pytorch/pull/60299 and https://github.com/pytorch/pytorch/pull/60097.
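A minimal sketch of what this enables, assuming "byte support" refers to `torch.uint8` target tensors:

```
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                    # input.dim() == 2 (batch x classes)
log_probs = F.log_softmax(logits, dim=1)
target = torch.tensor([0, 2, 1, 0], dtype=torch.uint8)  # byte-typed class indices
loss = F.nll_loss(log_probs, target)          # previously required an int64 target
print(loss)
```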

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60308

Reviewed By: VitalyFedyunin

Differential Revision: D29329458

Pulled By: jbschlosser

fbshipit-source-id: d3585c4966030bc61e451f8aa817406a8a3acf47
2021-06-23 12:16:45 -07:00
7bf195f360 fix kernel launch check in cross kernel
Summary: per title

Test Plan: buck test mode/opt //caffe2/test:kernel_launch_checks -- --exact 'caffe2/test:kernel_launch_checks - test_check_cuda_launches (test_kernel_launch_checks.AlwaysCheckCudaLaunchTest)' --run-disabled

Reviewed By: r-barnes

Differential Revision: D29335739

fbshipit-source-id: 385c66b1806886deba35f7fd83e29e0885999119
2021-06-23 11:47:50 -07:00
308d238377 add SequenceMask op (#60235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60235

This diff
- added SequenceMask op in Dper3 (caffe2 & pytorch)
- added shape inference functions for SequenceMask op

Test Plan:
```
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_sequence_mask
```

Differential Revision: D29210097

fbshipit-source-id: cab3460e0fd6c49bec6d0c5c624bd4652de7604b
2021-06-23 11:33:00 -07:00
e60f9cfc58 Revert D29135358: [quant] Input-Weight Equalization - convert modifications
Test Plan: revert-hammer

Differential Revision:
D29135358 (3de79b7757)

Original commit changeset: 2d0005672904

fbshipit-source-id: cac30c1202ebbce4f22e50ed920340c7b4c6849f
2021-06-23 11:23:24 -07:00
03ab5b72c9 Fix parallel tbb build (#60532)
Summary:
Small typo in https://github.com/pytorch/pytorch/issues/60183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60532

Reviewed By: walterddr

Differential Revision: D29336173

Pulled By: ngimel

fbshipit-source-id: 57d753f21d484bbae26a23cb3eb35e497e25118a
2021-06-23 11:16:36 -07:00
bea83e2e46 Add NoChunk wrapper for pipeline args. (#57325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57325

As per the design outlined in
https://github.com/pytorch/pytorch/issues/53952, adding a `NoChunk` wrapper for
pipeline parallelism inputs.

If a Tensor is wrapped with this wrapper, the pipeline implementation does not
split it across micro-batches and instead replicates it as-is, similar to
non-tensors.
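A minimal usage sketch (the import path is assumed; see the PR's tests for the authoritative one):

```
import torch
from torch.distributed.pipeline.sync import Pipe, NoChunk  # path assumed

# model = Pipe(sequential_model, chunks=4)  # requires RPC to be initialized
x = torch.randn(32, 16)      # split into 4 micro-batches of 8 rows each
mask = torch.ones(32, 16)    # NoChunk: replicated as-is to every micro-batch
# out = model(x, NoChunk(mask))
```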
ghstack-source-id: 132009305

Test Plan:
1) unit tests.
2) waitforbuildbot.

Reviewed By: SciPioneer

Differential Revision: D28109277

fbshipit-source-id: ee78c814c715d207d2796aba40b756a8e1834898
2021-06-23 11:13:14 -07:00
6385621003 Use JOB_BASE_NAME throughout code--consolidate CIRCLE_JOB (#60425)
Summary:
This PR is a first step in unifying our environment variables across CI (so that we don't have `CIRCLE_BLAH` in our GHA workflows, for example), though I'd also like this PR to serve as a discussion about how best to consolidate these variables.

This small change only changes most CIRCLE_JOB references in our code to be JOB_BASE_NAME, as that seems the closest GHA (and ROCm) equivalent. Currently, JOB_BASE_NAME is defined as:
- in Circle: CIRCLE_JOB (name of the job, like `pytorch_linux_bionic_py3_8_gcc9_coverage_test1`)
- in GHA: the build_environment with a `-build` or `-test` tacked onto the end, e.g., `pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7-test`
- in ROCm: I don't actually know, but it's important for ROCm test sharding as shown in https://github.com/pytorch/pytorch/pull/60409

I am not sure if this is the intention for JOB_BASE_NAME so it is open to discussion what variable we should use if not JOB_BASE_NAME. I also don't know if it's worth the effort consolidating all these variables, so discussion is also highly encouraged there!

Next steps:
- Consolidate more CIRCLE_* references, maybe into CI_* equivalents?
- We use BUILD_ENVIRONMENT everywhere in Circle though the variable is inconsistent across binary vs CI jobs and across platforms. For example, for linux tests and builds, BUILD_ENVIRONMENT contains the `_test` and `_build` suffixes, but the windows jobs don't. In GHA, BUILD_ENVIRONMENT is similar to how it's defined in windows jobs on Circle. This inconsistency is confusing, and we can probably do something about it. I'm thinking of switching out BUILD_ENVIRONMENT for JOB_BASE_NAME in our test scripts where we actually mean JOB_BASE_NAME.
- We should probably document the meaning of the variables we consolidate somewhere, preferably in a README in some unified `ci/` folder. For example, it seems BUILD_ENVIRONMENT is supposed to capture the build environment, whereas JOB_BASE_NAME is supposed to capture the environment _and_ whether we're building or testing.

Notes:
- I did not replace CIRCLE_JOB references in third_party directories
- Previously, print_test_stats reported CIRCLE_JOB as only the build environment for GHA workflows, and I think tacking on the `build` or `test` will not harm anything, though I may be wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60425

Reviewed By: seemethere, samestep

Differential Revision: D29333882

Pulled By: janeyx99

fbshipit-source-id: a82080e6205a03a1183035011ce59698eca06748
2021-06-23 11:11:21 -07:00
ff3678eec2 Disable group backend rpc tests from running on CI (#60407)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60407

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29278179

Pulled By: H-Huang

fbshipit-source-id: ee78085eeb04d81842c95236b8c3a33de7142a3a
2021-06-23 10:58:31 -07:00
109f831409 Support non-Tensor args in the Pipe API (#57226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57226

As per the design outlined in
https://github.com/pytorch/pytorch/issues/53952, this PR adds support for
non-Tensor args in the pipeline.

The `NoChunk` wrapper hasn't been implemented yet and will be implemented in a
follow up PR.
ghstack-source-id: 132008356

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28083564

fbshipit-source-id: 5f09da238eec0167feff76fe98916dedb0a9ae4e
2021-06-23 10:53:37 -07:00
10e11dbdcd Reland D29190420: [nnc][tests] Tests and benchmarks for computeSum (#60550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550

Original commit changeset: ed655497a981

Whatever gcc version OSS Bazel uses wasn't happy move-constructing the
SimpleIREvaluator, so use a unique_ptr instead.

Test Plan:
CI.  Hope that the gcc version used by OSS Bazel build is
happier with this (it should be), since actually testing it locally is
an intractable pain.

Reviewed By: navahgar

Differential Revision: D29333116

fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
2021-06-23 10:50:03 -07:00
5fd45b8089 Port any kernel to structured kernels. (#60361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60361

Tracking issue: #55070

This PR was opened to resolve the CI failures on main when merging: #59371 #59372 #59373 #59937 #59938.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29265859

Pulled By: ezyang

fbshipit-source-id: 0cca0431569f38a168473b5cc572ced473799961
2021-06-23 10:44:24 -07:00
a5aa940f5e Port all kernel to structured kernels. (#60360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60360

Tracking issue: #55070

This PR was opened to resolve the CI failures on main when merging: #59371 #59372 #59373 #59937 #59938.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29265856

Pulled By: ezyang

fbshipit-source-id: 6e9b45ad3fc3852bb142ae2e3d58fc5d0a911aed
2021-06-23 10:43:25 -07:00
7b2d375148 Fix convolution_depthwise3x3_winograd for multichannel output (#60460)
Summary:
Before this change, it was implemented with the assumption that the number of groups, input channels, and output channels are all the same, which is not always the case.
Extend the implementation to support any number of output channels as long as the number of groups equals the number of input channels (i.e. kernel.size(1) == 1).
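A small sketch of the newly supported case:

```
import torch

# groups == in_channels (depthwise), but out_channels is a multiple of groups,
# i.e. weight.size(1) == 1 while weight.size(0) != groups -- the case fixed here.
conv = torch.nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3,
                       padding=1, groups=8)
y = conv(torch.randn(1, 8, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```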

Fixes https://github.com/pytorch/pytorch/issues/60176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60460

Reviewed By: albanD

Differential Revision: D29299693

Pulled By: malfet

fbshipit-source-id: 31130c71ce86535ccfba2f4929eee3e2e287b2f0
2021-06-23 10:38:14 -07:00
c63a0d0cfe Adding windows CUDA smoke tests on PRs (#59686)
Summary:
Adding windows CUDA smoke tests on PRs (master should run the full suite).

Next step:
- Automate data update so we get a new smoke test list without manual effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59686

Test Plan: https://github.com/pytorch/pytorch/actions/runs/958296267 The sharded smoke tests still take long because of dependency installation

Reviewed By: walterddr

Differential Revision: D29243533

Pulled By: janeyx99

fbshipit-source-id: dde7ba127fa15c95bda0e833cc5311598fb85e2b
2021-06-23 10:13:50 -07:00
8162439cbd [DDP] Remove python GradBucket construction (#60301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60301

`GradBucket` is not meant to be constructed by Python users, only
consumed as part of a communication hook
ghstack-source-id: 131860243

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29239320

fbshipit-source-id: f1631a16e7d66b7e4a9b4a44698e2319005d10b2
2021-06-23 10:05:34 -07:00
e8690dacb2 To add Nesterov Adam Algorithm to Optimizers (#59009)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/5804

In the paper https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ, Timothy Dozat proposed a new optimization algorithm that combines the essence of the NAG and Adam algorithms.

It is well known that momentum can be improved with Nesterov acceleration, and Dozat investigates applying this idea to the momentum component of Adam. The author provides experimental evidence of the idea's effectiveness.

In this PR we implement the NAdam algorithm proposed in that paper. In preliminary work (http://cs229.stanford.edu/proj2015/054_report.pdf) the author shows that the decay base constant should be taken as 0.96; we follow the same convention here, as Keras does. Implementation and coding practices also follow Keras in some other places:

f9d3868495/tensorflow/python/keras/optimizer_v2/nadam.py
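A minimal usage sketch of the new optimizer (the keyword arguments shown are assumptions; see the implementation for the exact signature):

```
import torch

model = torch.nn.Linear(10, 1)
# momentum_decay drives the 0.96-base decay schedule mentioned above.
opt = torch.optim.NAdam(model.parameters(), lr=2e-3)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
opt.step()
```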

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59009

Reviewed By: gchanan, vincentqb

Differential Revision: D29220375

Pulled By: iramazanli

fbshipit-source-id: 4b4bb4b15f7e16f7527f368bbf4207ed345751aa
2021-06-23 08:21:43 -07:00
a2525b035c Remove unused sample input argument from functions to resolve issue #55737 (#60486)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60486

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29311875

Pulled By: NivekT

fbshipit-source-id: 4bf451c4f8e78290398e0514860a14a335a51fa7
2021-06-23 08:02:04 -07:00
265f0e5321 Add device runtime API for the plug-in to register platform python module into torch (#59857)
Summary:
## Motivation
Allow out-of-tree PyTorch plug-ins, for device types other than CUDA, to add their runtime interface to the `torch` module. The runtime interface of a device can then be referred to by its device type name in the `torch` module, e.g., `torch.cuda` or `torch.xpu`.

## Solution
- Add a registration interface for a plug-in to add its platform Python module into the `torch` module under the device type name. E.g., `torch.xpu` can be used to refer to the XPU runtime interface after the XPU runtime module is registered with `torch._register_device_module('xpu', xpu_module)` in Intel's XPU plug-in. A minimal sketch follows below.
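```
import types
import torch

# Stub runtime module for a hypothetical out-of-tree device type;
# a real plug-in would provide its actual runtime bindings here.
xpu = types.ModuleType("torch.xpu")
xpu.is_available = lambda: False  # illustrative runtime hook

torch._register_device_module("xpu", xpu)
print(torch.xpu.is_available())  # the plug-in runtime is now reachable as torch.xpu
```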

## Additional Context
More details about runtime has been discussed in https://github.com/pytorch/pytorch/issues/53707.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59857

Reviewed By: mrshenli

Differential Revision: D29309320

Pulled By: ezyang

fbshipit-source-id: b9802a5f937ddef9e0bdaf2f7692dfe463912fbe
2021-06-23 07:54:45 -07:00
c97d4d5a34 Fix test failures with some glibc libraries (#60450)
Summary:
Large complex values lead to nan/inf results when using some glibc
implementations of atanh/acos
- Skip test_reference_numerics_hard instead of "normal"
- Test the edge values only for cdouble where the stdlib/glibc implementations support those large values

Fixes https://github.com/pytorch/pytorch/issues/60259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60450

Reviewed By: mrshenli

Differential Revision: D29304834

Pulled By: ezyang

fbshipit-source-id: d6b97456847c5573b9d2cb447bfc62abba73cb2a
2021-06-23 07:49:27 -07:00
f0e4e4be72 Clean Up ZeRO (#60285)
Summary:
**Overview:**
Being relatively new to PyTorch and ZeRO, I found parts of the code slightly hard to follow. This change strives to clean up the `ZeroRedundancyOptimizer` code in `zero_redundancy_optimizer.py` by reorganizing some computations, making variable names more explicit and consistent, and unifying terminology in the documentation. The goal is for the code to be easier to extend afterwards.

**Changes:**
1) `state_dict()`: The [logic](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L510)) for updating the global `state_dict` with each rank's local `state_dict` is simplified and made more explicit. Notably, the `dict` [`local_index_to_param_id`](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L513)) is unneeded. It maps `local_pg["params"][i]` to `id(global_pg["params"][i])`, so it is equivalent to making a single pass over both lists in tandem, effectively iterating over `i`, with no need for the explicit `dict`.
2) `_update_trainable()`: The function [initializes](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L597)) the local optimizer if it does not exist. I am unaware of any reason for the local optimizer to be destroyed after initialization, so I moved that logic to its own function `_init_local_optimizer()`, which is called once in the constructor.
After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654706728), I removed the function `_update_trainable()` itself in favor of adding a check for `parameters_as_bucket_view` in `build_param_buckets()` directly.
3) `rank_local_state_dict()`: This [function](85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L528)) is currently broken. It appears to be legacy and relies on the input `state_dict` to have the key `"partitions"`. For now, I have removed it and added an [issue](https://github.com/pytorch/pytorch/issues/60284). Is it a notable use case to want to access another rank's `state_dict` in particular (as opposed to consolidating the entire state and then accessing)?
4) `local_state_dict():` After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r655571043), I removed the function.
5) `partition_parameters()`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654708183), I renamed the function to `_partition_parameters()` to mark it as private.
6) `_param_to_index`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654828100), I changed the key to be the parameter itself rather than its integer ID.
7) `buckets`: I renamed the data structure to `_buckets` to mark it as private.
8) Terminology: I tried to reduce the set of terms being used instead of juggling a number of synonyms. In particular, I made an effort to distinguish between "local" and "global" and to make names more indicative of typing.
9) Style: Per the [PyTorch contributing guide](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation), I made all docstrings abide by the 80 character limit, except for the one [line](554891f6fa/torch/distributed/optim/zero_redundancy_optimizer.py (L142)) showing the example ZeRO usage. Some code lines violate the limit for readability. Also, I unified some of the minor stylistic usages out of habit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60285

Test Plan:
The test suite passes as expected (on the AI AWS cluster):
```
gpurun python test/distributed/optim/test_zero_redundancy_optimizer.py
```
I visually inspected the generated HTML doc (as generated following [this](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation)).

Reviewed By: mrshenli

Differential Revision: D29320726

Pulled By: andwgu

fbshipit-source-id: 23f69a19ecc5e877a38fe1df0da11329428311dd
2021-06-23 07:21:40 -07:00
56481f9762 Ensure proper syncs for out-of-place grad creation (torch.autograd.grad) when backward ops run on side streams (#60127)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59844.

Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s).

But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1]

The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`.

This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place.

I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs.

I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged.

[1] example:
```python
leaf = torch.tensor(..., requires_grad=True)
tmp = leaf * 2
loss = tmp.sum()
torch.autograd.grad(loss, inputs=(tmp, leaf))
```
Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`?

[2] the way the test currently fails is fun: it reports
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0).
```
I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure, and prints the tensors post-sync, at which point it IS safe to access the values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127

Reviewed By: mrshenli

Differential Revision: D29276581

Pulled By: albanD

fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee
2021-06-23 07:14:01 -07:00
b14f19b6fe Revert D29190420: [nnc][tests] Tests and benchmarks for computeSum
Test Plan: revert-hammer

Differential Revision:
D29190420 (21479ad20c)

Original commit changeset: 86246df82098

fbshipit-source-id: ed655497a981783da4c8f13e2d7fec104e3cb184
2021-06-23 06:59:37 -07:00
90cd57ee16 To add edge_order=2 and documentation for gradient operator (#58165)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56036
Fixes https://github.com/pytorch/pytorch/issues/56130

* All the interior points are computed using the second-order accurate central differences method for the gradient operator. However, we currently only have a first-order method for the edge points. In this PR we add second-order methods for the edge points as well (see the small example after this list).

* Currently, there is no detailed description of how the gradient operator is computed using the second-order method, or of how to use its parameters correctly. We add a detailed explanation of the meaning of each parameter and of the gradient operator's return value, along with a description of the second-order computation.
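A small sketch of the second-order edges (names per `torch.gradient`):

```
import torch

t = torch.tensor([1., 4., 9., 16., 25.])        # f(x) = x^2 sampled at x = 1..5
(dfdx,) = torch.gradient(t, spacing=1.0, edge_order=2)
print(dfdx)  # tensor([ 2.,  4.,  6.,  8., 10.]) -- exact for a quadratic, edges included
```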

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58165

Reviewed By: mruberry

Differential Revision: D29305321

Pulled By: iramazanli

fbshipit-source-id: 0e0e418eed801c8510b8babe2ad3d064479fb4d6
2021-06-23 03:35:15 -07:00
7ed07e2a7d [NormalizeArgs] Retain node.meta (#60449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60449

After normalizing args, still retain each node's `meta`

Test Plan: Added unit test.

Reviewed By: gcatron

Differential Revision: D29293179

fbshipit-source-id: 432b409790041fa4d6e759f7b46a8bee363497b0
2021-06-23 03:31:53 -07:00
66452e0a8c Ensure num_threads is initialized before calling omp_get_max_threads (#60185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60185

`get_num_threads` is usually called before `parallel_for`, so there's no
guarantee we've initialized `num_threads` properly.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29287814

Pulled By: ngimel

fbshipit-source-id: 7e9c86fc32d63889a57a9b1d2b7d8f3863481dce
2021-06-23 01:18:24 -07:00
19553438ed OpenMP: Refactor parallel_reduce to share code with parallel_for (#60184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60184

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29287817

Pulled By: ngimel

fbshipit-source-id: 734a33a8d965208662989e2497b345b68c132498
2021-06-23 01:18:22 -07:00
c75714e594 Ensure thread id is valid in nested parallel regions (#60183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60183

Fixes https://github.com/pytorch/pytorch/pull/59149#issuecomment-863287331

`parallel_for` will call the function directly if it would have run on only a
single thread anyway. This is great for performance, but causes an issue in
nested parallel regions because `get_thread_num` will reflect the parent
parallel region instead of the current `parallel_for` call.

I fix this by using a `thread_local` variable for the current thread id and
manually setting it before each call to the user-provided function.
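A Python analogue of the idea (the actual fix is in ATen's C++ `parallel_for`; all names here are illustrative):

```
import threading

_tls = threading.local()

def get_thread_num() -> int:
    return getattr(_tls, "thread_num", 0)

def run_inline(fn):
    # Single-thread fast path: run the function directly, but pin the
    # thread id to 0 first so a nested region reports *its own* id
    # rather than the parent region's.
    prev = get_thread_num()
    _tls.thread_num = 0
    try:
        fn(0, 1)  # (begin, end) of the single chunk, illustrative
    finally:
        _tls.thread_num = prev
```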

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29287816

Pulled By: ngimel

fbshipit-source-id: 777f771a0900750c7f22eb1dd185d84d19282108
2021-06-23 01:17:09 -07:00
3f3fd57044 Migrate crossKernel from THC to ATen (CUDA) (#60039)
Summary:
Ref  https://github.com/pytorch/pytorch/issues/24507 (There doesn't seem to be an actual issue for cross)

This also moves the remaining operator functors in `THCTensorMathPointwise.cuh`  to `SparseCUDATensorMath.cu` which is the only file using them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60039

Reviewed By: mrshenli

Differential Revision: D29314638

Pulled By: ngimel

fbshipit-source-id: aa7b57f6e11a933fb44f044e26945bb4a9e3de5f
2021-06-23 00:37:55 -07:00
f590cceacb [BE] Fix Convolution.cpp build warnings (#60463)
Summary:
Use `c10::irange` and `auto` to get rid of narrowing cast and signed-unsigned compilation warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60463

Reviewed By: samestep

Differential Revision: D29300415

Pulled By: malfet

fbshipit-source-id: 4d7f519e2e3ebaa754364f60af762658c1b4a62e
2021-06-23 00:02:33 -07:00
3846cef2d7 Increase tolerance for test_grad_scaling_clipping (#60458)
Summary:
This makes the test pass on A100, and also with, e.g., `torch.manual_seed(6)` called before running it.

Fixes https://github.com/pytorch/pytorch/issues/60455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60458

Reviewed By: mrshenli

Differential Revision: D29309618

Pulled By: ngimel

fbshipit-source-id: 72584087bcc949f7bc96b0644b701e69ae1fa025
2021-06-22 23:43:25 -07:00
40de03fc55 topk on CUDA supports bfloat16 (#59977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56176 via https://github.com/pytorch/pytorch/issues/58196

CC zasdfgbnm ngimel ptrblck
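A one-line check of the new dtype support (requires a CUDA device):

```
import torch

x = torch.randn(8, device="cuda", dtype=torch.bfloat16)
values, indices = torch.topk(x, k=3)
print(values.dtype)  # torch.bfloat16
```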

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59977

Reviewed By: mrshenli

Differential Revision: D29315018

Pulled By: ngimel

fbshipit-source-id: 0a87e7f155a97225fc6b2ec5dc0dc38a23156b41
2021-06-22 23:39:24 -07:00
21479ad20c [nnc][tests] Tests and benchmarks for computeSum (#60160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60160

Adds a few simple tests and benchmarks for the `computeSum` op
(equivalent to `at::sum`).

The benchmarks test 1D reduction and 2D row and column reduction.  Performance
is in the ballpark of aten (14-15 GB/s) on my skylake devserver for all cases,
and occasionally better (e.g. 256k * 64 row reduction goes from 9 GB/s to 13).

Results (on my skylake-avx512, with turbo disabled):
```
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
Reduce1D/Torch/16777216               4746995 ns    4746722 ns        150 BYTES=14.1379G/s
Reduce1D/Naive/16777216              34063215 ns   34061388 ns         21 BYTES=1.97023G/s
Reduce1D/NativeRfactor/16777216       5057175 ns    5057167 ns        139 BYTES=13.2701G/s
Reduce1D/TeNaive/16777216            33868945 ns   33868851 ns         21 BYTES=1.98143G/s
Reduce1D/TeSplitTail/16777216        33902786 ns   33900436 ns         21 BYTES=1.97959G/s
Reduce1D/TeSplitMask/16777216        33922509 ns   33920604 ns         21 BYTES=1.97841G/s
Reduce1D/TeRfactorV1/16777216         5141150 ns    5141002 ns        135 BYTES=13.0537G/s
Reduce1D/Op/16777216                  5140390 ns    5140091 ns        135 BYTES=13.056G/s
Reduce2DCol/Torch/8/2097152          12824403 ns   12823563 ns         55 BYTES=5.8874G/s
Reduce2DCol/Torch/64/262144           8306873 ns    8306743 ns         83 BYTES=8.20507G/s
Reduce2DCol/Torch/4096/4096           7992364 ns    7992239 ns         87 BYTES=8.3988G/s
Reduce2DCol/OpSchedule/8/2097152/0    4866144 ns    4865766 ns        138 BYTES=15.5161G/s
Reduce2DCol/OpSchedule/64/262144/0   36668978 ns   36666415 ns         19 BYTES=1.85885G/s
Reduce2DCol/OpSchedule/4096/4096/0  155862459 ns  155801266 ns          4 BYTES=430.839M/s
Reduce2DCol/OpSchedule/8/2097152/1    8067683 ns    8061117 ns         85 BYTES=9.36563G/s
Reduce2DCol/OpSchedule/64/262144/1    7496686 ns    7496562 ns         93 BYTES=9.09183G/s
Reduce2DCol/OpSchedule/4096/4096/1    5262821 ns    5262186 ns        131 BYTES=12.7562G/s
Reduce2DCol/OpSchedule/8/2097152/2    6237899 ns    6237210 ns        109 BYTES=12.1044G/s
Reduce2DCol/OpSchedule/64/262144/2    5258012 ns    5257655 ns        127 BYTES=12.9635G/s
Reduce2DCol/OpSchedule/4096/4096/2    5231686 ns    5228241 ns        132 BYTES=12.839G/s
Reduce2DCol/OpSchedule/8/2097152/3   11088573 ns   11087557 ns         62 BYTES=6.80921G/s
Reduce2DCol/OpSchedule/64/262144/3    5338843 ns    5338326 ns        127 BYTES=12.7676G/s
Reduce2DCol/OpSchedule/4096/4096/3    4311617 ns    4308102 ns        162 BYTES=15.5812G/s
Reduce2DRow/Torch/8/2097152           4642244 ns    4641794 ns        151 BYTES=14.4575G/s
Reduce2DRow/Torch/64/262144           4628311 ns    4628245 ns        151 BYTES=14.4999G/s
Reduce2DRow/Torch/4096/4096           4894012 ns    4893316 ns        143 BYTES=13.7177G/s
Reduce2DRow/Torch/262144/64          10469098 ns   10468027 ns         68 BYTES=6.51101G/s
Reduce2DRow/Hand/262144/64            5554380 ns    5554059 ns        126 BYTES=12.2716G/s
Reduce2DRow/OpSchedule/8/2097152/0   33890363 ns   33888931 ns         21 BYTES=1.98026G/s
Reduce2DRow/OpSchedule/64/262144/0   33901317 ns   33899436 ns         21 BYTES=1.97965G/s
Reduce2DRow/OpSchedule/4096/4096/0   33500358 ns   33498815 ns         21 BYTES=2.00381G/s
Reduce2DRow/OpSchedule/262144/64/0   13132231 ns   13131049 ns         53 BYTES=5.19056G/s
Reduce2DRow/OpSchedule/8/2097152/1    5200423 ns    5200025 ns        134 BYTES=12.9055G/s
Reduce2DRow/OpSchedule/64/262144/1    5204428 ns    5204327 ns        133 BYTES=12.8949G/s
Reduce2DRow/OpSchedule/4096/4096/1    8724355 ns    8723370 ns         80 BYTES=7.69488G/s
Reduce2DRow/OpSchedule/262144/64/1 1811861280 ns 1811352083 ns          1 BYTES=37.6279M/s
Reduce2DRow/OpSchedule/8/2097152/2    9169829 ns    9168946 ns         76 BYTES=7.31915G/s
Reduce2DRow/OpSchedule/64/262144/2    9159901 ns    9158560 ns         76 BYTES=7.32747G/s
Reduce2DRow/OpSchedule/4096/4096/2    9217398 ns    9215557 ns         76 BYTES=7.28391G/s
Reduce2DRow/OpSchedule/262144/64/2   10820450 ns   10818998 ns         66 BYTES=6.29979G/s
Reduce2DRow/OpSchedule/8/2097152/3    5227921 ns    5226544 ns        133 BYTES=12.84G/s
Reduce2DRow/OpSchedule/64/262144/3    5194362 ns    5194082 ns        133 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/4096/4096/3    5196080 ns    5195349 ns        134 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/262144/64/3    5235189 ns    5234728 ns        133 BYTES=13.0202G/s
```

ghstack-source-id: 131753875

Test Plan: these tests

Reviewed By: navahgar

Differential Revision: D29190420

fbshipit-source-id: 86246df82098da4f5493d6c4f34a40016d95a9f0
2021-06-22 23:04:09 -07:00
fbeb8b4992 [nnc] Speed up batchnorm benchmark
Summary:
Use better scheduling: fuse and parallelize NC, fuse and
vectorize HW.

```
-----------------------------------------------
 N/C/H/W               ATen               NNC
-----------------------------------------------
1/64/112/112          45449 ns         36672 ns
1/256/14/14           15555 ns          7116 ns
1/128/28/28           15737 ns          8560 ns
1/64/56/56            20766 ns         12153 ns
1/512/7/7             16985 ns          8182 ns

5/64/112/112        2532475 ns       2069668 ns
5/256/14/14           24507 ns         12228 ns
5/128/28/28           29352 ns         20146 ns
5/64/56/56            44786 ns         38784 ns
5/512/7/7             22307 ns         20505 ns
```

Test Plan: benchmark results above

Reviewed By: navahgar

Differential Revision: D29288658

fbshipit-source-id: dd05efa4b7d26b6ad94f54a9ef6c8c47adb160b5
2021-06-22 22:57:43 -07:00
b0c9762e2d [pytorch][nnc] external function call to xnnpack ops (#59525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59525

This PR added NNC external function call binding for two XNNPack ops:
- prepacked::linear_clamp_run
- prepacked::conv2d_clamp_run

Both ops take two arguments: a regular input tensor and a prepacked context
object that contains other parameters like weights/bias/etc. The prepacked
context object's type is a custom class.

NNC doesn't generate assembly code that reads the content of the prepacked
object directly. It simply passes it into the XNNPack ops wrapper, so both
NNC and the generated assembly code don't need to know the custom class type.

At compilation time, we use a size-1 dummy tensor as the placeholder for the
prepacked XNNPack context object.

At runtime, we pass in the raw pointer of the XNNPack context object as if it
were a regular tensor storage data pointer.

Inside the external function call wrapper, we reinterpret_cast the raw pointer
back to the custom class type before dispatching to the XNNPack ops.
ghstack-source-id: 132135512

Test Plan: unit test

Reviewed By: bertmaher

Differential Revision: D28924934

fbshipit-source-id: 15326b35dc6c022f4c3f247a2037c361e06e80b4
2021-06-22 21:29:31 -07:00
79dc500a99 Add error message for sequence length to be equal to 0 case for RNNs (#60269)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50192

As discussed in the issue, the RNN APIs currently do not support inputs with `seq_len=0`, and the error message does not reflect this clearly. This PR addresses that by adding a clearer error message stating that none of the RNN APIs (nn.RNN, nn.GRU, and nn.LSTM) support `seq_len=0`, for either one-directional or bi-directional layers.

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.GRU(input_size, hidden_size)

for seq_len in reversed(range(4)):
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
    print('{}, {}'.format(output.shape, h_n.shape))
```

Previously, this gave the following output:

```
torch.Size([3, 10, 6]), torch.Size([1, 10, 6])
torch.Size([2, 10, 6]), torch.Size([1, 10, 6])
torch.Size([1, 10, 6]), torch.Size([1, 10, 6])
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 739, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList
```

However, with this PR the error message changes for any combination of
[RNN, GRU, LSTM] x [one-directional, bi-directional].

Let's illustrate the change with the following code snippet:

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
output, h_n = rnn(torch.zeros(0, 10, input_size))
```

gives the following output:

```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/module.py", line 1054, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/rnn.py", line 837, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Expected sequence length to be larger than 0 in RNN
```

***********************************

A change for PackedSequence did not seem necessary, because the error message from the following code snippet is already clear about the issue:

```
import torch
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn
packed = rnn_utils.pack_sequence([])
```

returns:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 398, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 363, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60269

Reviewed By: mrshenli

Differential Revision: D29299914

Pulled By: iramazanli

fbshipit-source-id: 5ca98faa28d4e6a5a2f7600a30049de384a3b132
2021-06-22 21:25:05 -07:00
dc9aa7b960 Add custom code filter for TS (#60309)
Summary:
-----------

Adds a custom code filter for TorchScript to include tracing of forward calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60309

Reviewed By: zhxchen17

Differential Revision: D29317150

Pulled By: nikithamalgifb

fbshipit-source-id: d49e4dc74a2b8cc98b0d4967980d819908b7ea7b
2021-06-22 20:55:57 -07:00
3de79b7757 [quant] Input-Weight Equalization - convert modifications (#59963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59963

When converting, before quantizing the nodes, we call
`update_obs_for_equalization()` and `convert_eq_obs()`.

`update_obs_for_equalization`:
1. For each InputEqualizationObserver, we find the corresponding
WeightEqualizationObserver.
2. For nn.Linear layers, we will create an instance of the
WeightEqualizationObserver, run forward on the observer with the given
weights.
3. Calculate the equalization scale between the
InputEqualizationObserver and WeightEqualizationObserver.

`convert_eq_obs`:
For every InputEqualizationObserver, we will do the following:
1. Create a node (ex. `x0_activation_post_process_scale`) containing the
equalization scale constant.
2. Create another node containing a `mul` operator multiplying the
equalization scale and the input.
3. Remove the current InputEqualizationObserver node, and replace it
with the `mul` node.

For every WeightEqualizationObserver, we will do the following:
1. Get the next equalization scale (we may need this for equalizing
connected linear layers).
2. Scale the weights by multiplying it with the reciprocal of the
current equalization scale and the next equalization scale

Currently, this supports models with `nn.Linear` layers, but does not
support connecting linear layers.
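A hedged numerical sketch of the scale computation (the exact formula lives in the equalization observers; this assumes the standard per-column input-weight equalization rule):

```
import torch

# Observed per-column ranges from the input and weight observers (made-up values).
x_min, x_max = torch.tensor([-1.0, -2.0]), torch.tensor([1.0, 4.0])
w_min, w_max = torch.tensor([-0.5, -8.0]), torch.tensor([0.5, 8.0])

scale = torch.sqrt((w_max - w_min) / (x_max - x_min))
# Multiplying input column j by scale[j] and dividing weight column j by it
# leaves x @ W.t() unchanged while balancing the two dynamic ranges.
print(scale)
```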

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_equalization_convert`

Original Model:
```
.LinearModule(
  (linear): Linear(in_features=2, out_features=2, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after equalization functions:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%mul,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```

Graph after `convert_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0_scale : [#users=1] = get_attr[target=x_equalization_process_0_scale]
    %mul : [#users=1] = call_function[target=torch.mul](args = (%x, %x_equalization_process_0_scale), kwargs = {})
    %linear_input_scale_0 : [#users=1] = get_attr[target=linear_input_scale_0]
    %linear_input_zero_point_0 : [#users=1] = get_attr[target=linear_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%mul, %linear_input_scale_0, %linear_input_zero_point_0, torch.quint8), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%quantize_per_tensor,), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%linear,), kwargs = {})
    return dequantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29135358

fbshipit-source-id: 2d00056729041318463de61841483490b6bfeee5
2021-06-22 20:43:30 -07:00
7589d9c58b Enable rcb lookup for typing (#60413)
Summary:
-----------

For FX-traced models, types from the typing module are not available during the lookup for the function to be traced, so resolving the type yields a None type object. By enabling lookup for the `typing` module in `_jit_internal.py`, we mitigate this issue for FX tracing and scripting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60413

Test Plan:
--------
with-proxy python test/test_jit.py -k TestPDT.test_fx_tracing_with_typing

Reviewed By: bhosmer

Differential Revision: D29314531

Pulled By: nikithamalgifb

fbshipit-source-id: 1aa651430b1074c7e6fa74ba02bbcc4e1b00b01b
2021-06-22 18:53:19 -07:00
135e203e5e avoid unnecessary copies in MultiDispatchKeySet (#60093)
Summary:
The code would previously pass Generator & optional<Tensor> by value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60093

Reviewed By: swolchok

Differential Revision: D29310624

Pulled By: bhosmer

fbshipit-source-id: fb4a9740a57ef319aaf7c778d51430907a7c0cc5
2021-06-22 18:44:06 -07:00
4887c6e401 [quant] avoid resize calls in observer/fake_quant (#60386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60386

During QAT we sometimes encounter errors with scripted models
`RuntimeError: cannot resize variables that require grad`

For per-tensor cases we don't need to resize some buffers, so this PR removes the extra resize ops where applicable.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29271905

fbshipit-source-id: 01a484a9559a3a4180490f9476d0cd3044ba0d1b
2021-06-22 17:41:43 -07:00
d3ae3e07aa parse_reports() should include hidden files (#60404)
Summary:
Not sure why there are report files starting with `.`, but in that case
`glob('**/*.xml')` should not be used, as it will skip those files.
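
A minimal sketch of an alternative that also picks up hidden reports (the helper name is illustrative, not the actual function in this diff):

```python
import os

def find_reports(root):
    # os.walk sees dot-files, unlike glob's wildcard matching.
    reports = []
    for dirpath, _dirs, files in os.walk(root):
        reports.extend(
            os.path.join(dirpath, f) for f in files if f.endswith(".xml")
        )
    return reports
```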

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60404

Reviewed By: samestep

Differential Revision: D29276459

Pulled By: malfet

fbshipit-source-id: 8e131c38013425ad786e0a9ca0c0a43e57b1679a
2021-06-22 15:53:00 -07:00
986a88056c Remove some unused variables (#60411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60411

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29221207

fbshipit-source-id: da6ad44036291a98f0b36b260062d077a7c2691b
2021-06-22 15:44:33 -07:00
36d4062a62 Fix some variable types (#60414)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60414

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29221183

fbshipit-source-id: f855efca2fd08844de65d0f9ef73bcceffee657e
2021-06-22 15:44:31 -07:00
7d779f84a3 Fix some loop types (#60415)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60415

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29221174

fbshipit-source-id: 9bc56655f198f6eb95e6b2e7a4f0573a2cd2f9a1
2021-06-22 15:43:10 -07:00
6e926f1303 Fix lint (#60472)
Summary:
This PR fixes the `mypy` failure introduced by [`numpy` 1.21.0](https://github.com/numpy/numpy/releases/tag/v1.21.0) (by pinning `numpy` to 1.20, at least for now) and the `quick-checks` failure introduced by https://github.com/pytorch/pytorch/issues/60405.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60472

Test Plan: The Lint workflow in GitHub Actions.

Reviewed By: walterddr

Differential Revision: D29313009

Pulled By: driazati

fbshipit-source-id: 53fd0e0549c26be5fc5d3c502c5891c56c83a32c
2021-06-22 14:48:07 -07:00
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
82c52fd417 Do not wrap Tensor.{grad,_base} by default (#60464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60464

Fixes https://github.com/szagoruyko/pytorchviz/issues/65

An alternate implementation of this PR would be to remove the
__torch_function__ interposition points for these accessors entirely.
In the end, I decided to opt for extra expressivity.  See
torch.overrides for the criterion on how I decided which accessors
should get the nowrap treatment.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29302835

Pulled By: ezyang

fbshipit-source-id: fbe0ac4530a6cc9d6759a3fdf5514d4d7b1f7690
2021-06-22 12:49:23 -07:00
f42140cb8a Disable warn_unused_ignores again (#60480)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/60006#issuecomment-866130657.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60480

Test Plan: Run `mypy --config mypy-strict.ini` with [`ruamel.yaml`](https://pypi.org/project/ruamel.yaml/) installed.

Reviewed By: zhouzhuojie

Differential Revision: D29307823

Pulled By: samestep

fbshipit-source-id: 97fa4b7dad0465c269411c48142b22ce751bf830
2021-06-22 12:42:37 -07:00
6a87e8d087 Implement erfcx() (#58194)
Summary:
Implement erfcx() https://github.com/pytorch/pytorch/issues/31945

Reference: https://github.com/pytorch/pytorch/issues/50345
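
A quick illustration of why a dedicated kernel helps: erfcx(x) = exp(x^2) * erfc(x), and the naive formula overflows/underflows for large x while the fused op stays finite (assuming the operator is exposed under `torch.special`):

```python
import torch

x = torch.tensor([1.0, 30.0], dtype=torch.float64)
naive = torch.exp(x ** 2) * torch.erfc(x)  # -> [0.4276, nan]: inf * 0
fused = torch.special.erfcx(x)             # -> [0.4276, 0.0188]
print(naive, fused)
```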

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58194

Reviewed By: ngimel

Differential Revision: D29285979

Pulled By: mruberry

fbshipit-source-id: 5bcfe77fddfabbeb8c8068658ba6d9fec6430399
2021-06-22 12:38:38 -07:00
b34965435d Improve testing of inplace views (#59891)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/49825 by improving the testing
 - Rename some of the old tests that had "inplace_view" in their names, but actually mean "inplace_[update_]on_view" so there is no confusion with the naming
 - Adds some tests in test_view_ops that verify basic behavior
 - Add tests that creation meta is properly handled for no-grad, multi-output, and custom function cases
 - Add a test that verifies that in the cross-dtype view case, the inplace views won't be accounted for in the backward graph on rebase, as mentioned in the issue.
 - Update inference mode tests to also check in-place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59891

Reviewed By: albanD

Differential Revision: D29272546

Pulled By: soulitzer

fbshipit-source-id: b12acf5f0e3f788167ebe268423cdb58481b56f6
2021-06-22 12:28:09 -07:00
20bda0057e [caffe2/utils] Add explicit rule to avoid package boundary violation
Summary:
Add a rule to wrap proto_utils.h and depend on that, rather than
relying on a glob which violates package boundaries.

Reviewed By: igorsugak

Differential Revision: D29273453

fbshipit-source-id: 08f198a03d06ee2fdf61f5dbe1d0087db22aec8b
2021-06-22 12:22:24 -07:00
7c1bca9e94 [caffe2/utils] Add explicit rule to avoid package boundary violation
Summary:
Add a rule to wrap simple_queue.h and depend on that, rather than
relying on a glob which violates package boundaries.

Test Plan: `buck2 build fbcode//caffe2/caffe2:caffe2_core`

Reviewed By: igorsugak

Differential Revision: D29273415

fbshipit-source-id: f2b62a82cd6478bd71a8194d661d1c8b023c0953
2021-06-22 12:21:08 -07:00
7f2592195d Adds stream recording for cross-stream uses of gradients in streaming backward (#60230)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33909.

I _think_ the two recordDataPtrOnStreams I added are necessary and sufficient. They're the ones that worked for dmitrivainbrand's intricate multistream pipelining in https://github.com/pytorch/pytorch/issues/33909, and I can more or less convince myself they're enough, but it's hard to be sure (and hard to test).

PRing without a test now for visibility. I'll try to come up with something.

input_buffer.cpp needs to compile in CUDA and CPU-only builds, so I can't call `c10::cuda::CUDACachingAllocator::recordStream` directly. I planned to work around this by adding a binding in VirtualGuardImpl, but https://github.com/pytorch/pytorch/pull/57047 spared me the trouble, thanks lw.

Recording a usage stream on a generic tensor was uglier than I expected, see https://github.com/pytorch/pytorch/issues/60306. Up to you guys if adding a unified way to record streams on a tensor backed by any TensorImpl should block this PR (and if so, whether it should happen in a separate PR or as part of this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60230

Reviewed By: mrshenli

Differential Revision: D29289392

Pulled By: albanD

fbshipit-source-id: 1339d382b7d238a461b082597b3962847b5201fe
2021-06-22 12:16:07 -07:00
c7d0e9da0a Add pyproject.toml (#60408)
Summary:
This makes PyTorch conform to [PEP 517](https://www.python.org/dev/peps/pep-0517/) and [PEP 518](https://www.python.org/dev/peps/pep-0518/) by explicitly stating that we use [`setuptools`](https://setuptools.readthedocs.io/). It also follows up on https://github.com/pytorch/pytorch/pull/60119#pullrequestreview-685791812 by moving our [`isort`](https://pycqa.github.io/isort/) config into the new `pyproject.toml` file. I didn't move any of our other tool configs into `pyproject.toml` in this PR because:

- `.flake8` is assumed to exist in its current format for `tools/actions_local_runner.py` to work
- `mypy.ini` is not our only `mypy` config
- `pytest.ini` has detailed comments on `addopts` which [would have to be removed](https://github.com/toml-lang/toml/issues/340#issuecomment-122164501) in TOML because that setting is [a string, not an array](https://docs.pytest.org/en/6.2.x/customize.html#pyproject-toml)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60408

Reviewed By: 1ntEgr8

Differential Revision: D29277327

Pulled By: samestep

fbshipit-source-id: 3f2e63f6cf9024f8c534cb13a0d854a75609c5ba
2021-06-22 12:12:36 -07:00
1abf45e37f Revert D29241736: [pytorch][PR] To add Rectified Adam Algorithm to Optimizers
Test Plan: revert-hammer

Differential Revision:
D29241736 (0d2a936176)

Original commit changeset: 288b9b1f3125

fbshipit-source-id: 56c4ec98647c6f1822b130726741a1c9ca193670
2021-06-22 12:08:31 -07:00
99ca2c5b4b Migrates nll_loss_backward from TH to Aten (CUDA) (#60299)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24609
Aten Umbrella issue https://github.com/pytorch/pytorch/issues/24507
Related to https://github.com/pytorch/pytorch/issues/59765

There are no performance differences when running the following benchmark:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        elapsed = 0
        for i in range(n_runs):
            data = torch.randn(N, C, device=device, requires_grad=True)
            target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
            loss = nn.NLLLoss(reduction=reduction)
            input = softmax(data)
            result = loss(input, target)

            if reduction == "none":
                gradient = torch.randn(N, device=device)
            else:
                gradient = torch.randn(1, device=device).squeeze()

            t1 = _time()
            result.backward(gradient)
            t2 = _time()
            elapsed = elapsed + (t2 - t1)
        elapsed_avg = elapsed / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"elapsed time is {elapsed_avg:.2f} (ms)"
        )
    print()

```

</details>

## master

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.50 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.19 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.35 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.17 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.21 (ms)
```

## this PR

```
input size(100000, 30), reduction: none elapsed time is 0.19 (ms)
input size(500000, 30), reduction: none elapsed time is 0.83 (ms)
input size(1000000, 30), reduction: none elapsed time is 1.66 (ms)

input size(100000, 30), reduction: mean elapsed time is 1.48 (ms)
input size(500000, 30), reduction: mean elapsed time is 7.16 (ms)
input size(1000000, 30), reduction: mean elapsed time is 14.29 (ms)

input size(100000, 30), reduction: sum elapsed time is 1.49 (ms)
input size(500000, 30), reduction: sum elapsed time is 7.15 (ms)
input size(1000000, 30), reduction: sum elapsed time is 14.18 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60299

Reviewed By: albanD

Differential Revision: D29287613

Pulled By: ngimel

fbshipit-source-id: 21e15f2c518087e9fb797a379e1e0a3508c98509
2021-06-22 12:04:07 -07:00
fca931d181 List striding with arbitrary step size (#58537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58537

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28531721

Pulled By: tugsbayasgalan

fbshipit-source-id: 8c8ed32ca00366603bfb5086e87dfa62736ff4b2
2021-06-22 11:25:23 -07:00
df8a8fbc1b Improve code and documentation clarity for DataPipes APIs (#60423)
Summary:
Fixes issues that are discussed with ezyang in the comments of PR https://github.com/pytorch/pytorch/issues/59498

Improved code and documentation clarity, and refactored `.filter` to use `nesting_level` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60423

Reviewed By: ezyang

Differential Revision: D29281599

Pulled By: NivekT

fbshipit-source-id: a9bbaf52f492db0741c00f3ceb4022b08ddb1506
2021-06-22 11:19:08 -07:00
71b83c27e2 [pruning] Move pruning directory into experimental folder (#60395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60395

Experimental folder so other developers know this is work in progress

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1KGJD

Reviewed By: z-a-f

Differential Revision: D29272319

fbshipit-source-id: 93eeeceba0376753efc9a5bb69a155278ceb2fca
2021-06-22 11:08:48 -07:00
f75ea51e67 [pruning] Move pruning files to their own directory (#60293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60293

Move pruning files to their own directory

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`
https://pxl.cl/1KCfz

Reviewed By: z-a-f

Differential Revision: D29238159

fbshipit-source-id: 0173a278b39ff5ee4cbd54f333f558b6fe412be5
2021-06-22 11:08:47 -07:00
b25db5251a [pruning] Base pruner class (#60278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60278

Implemented `PruningParametrization`, which removes pruned rows, and `BasePruner`, which is the base class for structured pruning.
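
A minimal, hypothetical sketch of the row-pruning idea (class and attribute names are illustrative, not the exact ones from this diff):

```python
import torch
from torch import nn

class PruningParametrization(nn.Module):
    def __init__(self, original_rows: int):
        super().__init__()
        self.original_rows = original_rows
        self.pruned_rows = set()

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # Drop rows that have been marked as pruned.
        keep = [r for r in range(self.original_rows) if r not in self.pruned_rows]
        return weight[keep]

weight = torch.randn(4, 8)
p = PruningParametrization(weight.shape[0])
p.pruned_rows.add(2)
print(p(weight).shape)  # torch.Size([3, 8])
```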

Test Plan:
`buck test mode/dev-nosan //caffe2/test:ao -- TestBasePruner`

https://pxl.cl/1KC2n

Reviewed By: z-a-f

Differential Revision: D29208349

fbshipit-source-id: f34e8e258bf13fa80292c2bd64d56f5ad1e72b6a
2021-06-22 11:07:31 -07:00
31a884987d Remove some TH includes from ATen (#60323)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60323

Test Plan: Imported from OSS

Reviewed By: malfet, anjali411

Differential Revision: D29252862

Pulled By: ngimel

fbshipit-source-id: 9ea13495d382c04dfd52b8dd63314f53b7e83936
2021-06-22 10:55:17 -07:00
0d2a936176 To add Rectified Adam Algorithm to Optimizers (#58968)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/24892

In the paper https://arxiv.org/pdf/1908.03265.pdf, Liyuan Liu et al. suggested a new optimization algorithm similar in essence to the Adam algorithm.

The paper discusses how, without a warmup heuristic, the early stages of adaptive optimization algorithms can exhibit undesirably large variance, which can slow the overall convergence process.

The authors proposed rectifying the variance of the adaptive learning rate when it is expected to be high.

Differing from the paper, we selected a variance tractability cut-off of 5 instead of 4. This adjustment is common practice and can be found in the reference code repository as well as the TensorFlow Swift optimizer library:

2f03dd1970/radam/radam.py (L156)

f51ee4618d/Sources/TensorFlow/Optimizers/MomentumBased.swift (L638)
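
A hedged sketch of the rectification step from the paper, with the cut-off of 5 mentioned above (variable names are illustrative, not the exact ones in torch.optim):

```python
import math

def radam_step_size(step, beta2, lr):
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t > 5.0:  # variance is tractable: apply the rectification term
        rect = math.sqrt(
            ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return lr * rect  # used with the adaptive (second-moment) update
    return lr             # fall back to an unadapted (SGD-style) step
```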

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58968

Reviewed By: gchanan

Differential Revision: D29241736

Pulled By: iramazanli

fbshipit-source-id: 288b9b1f3125fdc6c7a7bb23fde1ea5c201c0448
2021-06-22 10:38:41 -07:00
0126f42841 [complex] torch.sigmoid: CUDA support and complex autograd support (#48647)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48552

**Changes**

* Complex support for `torch.sigmoid` CUDA (CPU support already exists)
* Complex autograd support for `torch.sigmoid` (CUDA and CPU)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48647

Reviewed By: H-Huang

Differential Revision: D29163012

Pulled By: anjali411

fbshipit-source-id: 0cac0412355312675bee1cc46e090be7351d5dac
2021-06-22 10:35:00 -07:00
567e6d3a87 Remove Caffe2 thread-pool leak warning (#60318)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57273.

Some users reported that they dislike the Caffe2 thread-pool leak warning, as it floods their logs, and have requested disabling it, or have asked for a way to filter it.

It seems caffe2 pthreadpool already exists because of some dependency in the binary distribution, so `torch.set_num_threads()` invocation isn't required to reproduce the issue (as is otherwise the case when building from the master branch).

The test script in https://github.com/pytorch/pytorch/issues/60171 does have a `set_num_threads` invocation, which is why I was able to reproduce the issue after building from the master branch's source code.

cc malfet & ejguan, who have the authority to make a decision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60318

Reviewed By: albanD

Differential Revision: D29265771

Pulled By: ezyang

fbshipit-source-id: 26f678af2fec45ef8f7e1d39a57559790eb9e94b
2021-06-22 10:26:55 -07:00
91451369ed require non-empty inputs to grad() calls in the API (#52016)
Summary:
The grad() function needs to return the updated values, and hence
needs a non-empty `inputs` argument to populate.
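
A small illustration of the new requirement (a sketch; the exact error type is whatever the check raises):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

# inputs must be a non-empty sequence of tensors to populate:
(gx,) = torch.autograd.grad(y, inputs=(x,))
print(gx)  # tensor(4.)

# torch.autograd.grad(y, inputs=())  # now raises instead of silently returning
```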

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52016

Test Plan:
Passes Python and C++ unit tests, and added new tests to catch this behavior.

Fixes https://github.com/pytorch/pytorch/issues/47061

Reviewed By: albanD

Differential Revision: D26406444

Pulled By: dagitses

fbshipit-source-id: 023aeca9a40cd765c5bad6a1a2f8767a33b75a1a
2021-06-22 10:10:58 -07:00
729f7cd52f Implement histogram operator on CPU (#58780)
Summary:
The existing [torch.histc](https://pytorch.org/docs/stable/generated/torch.histc.html) operator is limited in comparison to [numpy.histogram](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html). This PR adds torch.histogram on CPU. The new operator replicates numpy.histogram's behavior, including support for caller-specified bin edges and weights. It was motivated by previous community requests for histogram.

The implementation was [benchmarked](https://docs.google.com/spreadsheets/d/1xCR0jODchVvwdVSAjiLsNCkmyictA6j1LNfDpWOafjw/edit?usp=sharing) against numpy.histogram as well as torch.histc. This implementation is weakly faster than numpy.histogram across all types of inputs tested, and performs in line with torch.histc for the limited inputs histc supports.
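
A short usage example mirroring numpy.histogram, with caller-specified bin edges and weights:

```python
import torch

data = torch.tensor([0.2, 0.4, 0.4, 0.9, 1.3])
hist, bin_edges = torch.histogram(
    data,
    bins=torch.tensor([0.0, 0.5, 1.0, 1.5]),  # explicit edges
    weight=torch.ones_like(data),
)
print(hist)       # tensor([3., 1., 1.])
print(bin_edges)  # tensor([0.0000, 0.5000, 1.0000, 1.5000])
```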

mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58780

Test Plan:
Added unit tests, OpInfo for the new torch.histogram operator.

Tested execution time on a variety of input sizes and compared to numpy.histogram performance: https://docs.google.com/spreadsheets/d/1xCR0jODchVvwdVSAjiLsNCkmyictA6j1LNfDpWOafjw/edit?usp=sharing

Reviewed By: ezyang

Differential Revision: D29134626

Pulled By: saketh-are

fbshipit-source-id: f2773085de1697f6bc6ffdeffe9a81267f51bdfc
2021-06-22 10:06:04 -07:00
3a56758e1f changed launch bound to fix col2im kernel (#60315)
Summary:
Changed launch bound for col2im kernel from 1024 to 512 to fix register spilling into local memory.

Perf comparison (using Nvidia Titan-V):

![Col2ImTimingData](https://user-images.githubusercontent.com/22803332/122627527-e0b1fc80-d064-11eb-83df-f2a1165cefcc.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60315

Reviewed By: albanD

Differential Revision: D29288113

Pulled By: ngimel

fbshipit-source-id: f78eb90941835700a1aef8e08fac6aff86dedfe9
2021-06-22 09:29:34 -07:00
926bb5d6be changed launch bounds, unrolled for loop for grid sampler 2d fwd and bwd (#60405)
Summary:
Changed launch bounds for grid sampler 2d fwd and bwd from 1024 to 256, added loop unrolling to fix register spilling into local memory.

Timing Data: (using Nvidia Titan-V)
Interpolation mode 2, padding 0, align corners False

![GridSampler2dTimingData](https://user-images.githubusercontent.com/22803332/122830305-01fd2d80-d29d-11eb-9cd3-7da533a03f33.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60405

Reviewed By: albanD

Differential Revision: D29288075

Pulled By: ngimel

fbshipit-source-id: 5e060f0c2d1cc0a3086718e6be263413dfa29689
2021-06-22 09:22:41 -07:00
23bb2ed00a Improve documentation for torch.set_rng_state (#60422)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59974 by improving documentation for the function torch.set_rng_state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60422

Test Plan: Only a comment is being changed.

Reviewed By: bdhirsh

Differential Revision: D29281578

Pulled By: NivekT

fbshipit-source-id: 2c160f782438b7f91f16c44f06c342e8b8b8437b
2021-06-22 07:10:50 -07:00
700df82881 [PyTorch Edge] Update iOS readme to use lite interpreter (#59841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59841

As lite interpreter moves to beta, it's recommended to let users start using it.
ghstack-source-id: 131766778

Test Plan: CI

Reviewed By: husthyc

Differential Revision: D29048350

fbshipit-source-id: 54d2ad09b4e9475304522c80b358647bcea79b14
2021-06-22 02:17:04 -07:00
15dc320cae Fix lint build (#60438)
Summary:
per title

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60438

Reviewed By: ngimel

Differential Revision: D29288175

Pulled By: mruberry

fbshipit-source-id: f59b579b1793fdb1d298109c2bef0a70badb37b4
2021-06-22 00:11:55 -07:00
0585daae83 fixed launch bounds for gathertopk kernel (#60314)
Summary:
Changed launch bounds for gatherTopK kernel to fix register spilling into local memory.

Comparison (Nvidia Titan-V GPU):

Args: Input size as below, k=32, dim=None

![TopKTimingData](https://user-images.githubusercontent.com/22803332/122624922-46978780-d057-11eb-9b52-d5786da432c0.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60314

Reviewed By: mruberry

Differential Revision: D29267789

Pulled By: ngimel

fbshipit-source-id: 4056efb2e44e5527786167af66a127504980a3af
2021-06-21 22:24:44 -07:00
45ae2e7863 Set TORCH_WARN_ONCE to always warn inside of assertNotWarn (#60020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60020

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29249909

Pulled By: mruberry

fbshipit-source-id: 10a8d5c05bd8d4aec345f70b132efd3623601f6a
2021-06-21 21:35:54 -07:00
5d476f5b95 Fix FFT documentation examples and run doctests in the test suite (#60304)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60304

Reviewed By: anjali411

Differential Revision: D29253980

Pulled By: mruberry

fbshipit-source-id: 0654f00197e5fae338aa8edf0b61ef5692cdaa7e
2021-06-21 20:47:25 -07:00
5921b5480a ensure xml report path are relative to */pytorch/test (#60380)
Summary:
Changes the approach.

The root cause: `inspect.getfile` returns an absolute path instead of a path relative to `os.getcwd()` in newer Python versions. We sanitize this by removing the CI prefix where it applies.
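
A minimal sketch of the sanitization described above (CI_PREFIX and the helper name are illustrative):

```python
import inspect
import os

CI_PREFIX = os.getcwd()  # assumption: tests are launched from the checkout

def sanitized_test_file(obj):
    path = inspect.getfile(obj)  # absolute on newer Python versions
    if path.startswith(CI_PREFIX):
        path = os.path.relpath(path, CI_PREFIX)
    return path
```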

See:
https://app.circleci.com/pipelines/github/pytorch/pytorch/339568/workflows/43cac71c-759e-471f-83c2-d59c152dcd8a/jobs/14278585 vs. https://app.circleci.com/pipelines/github/pytorch/pytorch/339568/workflows/43cac71c-759e-471f-83c2-d59c152dcd8a/jobs/14278285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60380

Test Plan:
CI

Plot twist:

windows tests are actually launched via
```
pushd test
python run_test.py
```
while linux/macos tests are
```
python test/run_test.py
```
This might cause problems when using `os.getcwd()`; we will see from the PR CI results.

Reviewed By: malfet

Differential Revision: D29276969

Pulled By: walterddr

fbshipit-source-id: 336c2805d0c92733e0ff4c309ff2044dc2ed4e21
2021-06-21 20:47:23 -07:00
9b30fb8528 add support for constant (#60166)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58739. Adds support for constants according to the Python array API specification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60166

Reviewed By: anjali411

Differential Revision: D29253958

Pulled By: mruberry

fbshipit-source-id: 0bc86b74d3a4eb3ec4a65c941ec2710747402db1
2021-06-21 20:47:21 -07:00
1764aa79b9 restore JOB_BASE_NAME for test1 and test2 in test.sh (#60409)
Summary:
JOB_BASE_NAME for test1 and test2 was removed by https://github.com/pytorch/pytorch/issues/60124. This caused the ROCm CI to run all tests for both test1 and test2. Restore the use of JOB_BASE_NAME.

Fixes https://github.com/pytorch/pytorch/issues/60377.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60409

Reviewed By: anjali411

Differential Revision: D29277560

Pulled By: walterddr

fbshipit-source-id: ddf01466492a9a626ce1b6adf87cd102d8f1fe35
2021-06-21 20:46:17 -07:00
7d39608a29 split TestAsserts by functionality (#58919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58919

Instead of having one large TestAsserts test case, we split off tests for
self-contained functionality like container or complex checking into
separate test cases. That makes it a lot easier to keep an overview of
what is tested.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259407

Pulled By: mruberry

fbshipit-source-id: 9769cb6d56c1a3790280542db398cb247986b09a
2021-06-21 20:44:23 -07:00
14b0191d1f make assert_equal an example how to partial torch.testing.assert_close (#58918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58918

~Instead of a distinct `torch.testing.assert_close` and `torch.testing.assert_equal`, this makes `torch.testing.assert_equal` a special case of `torch.testing.assert_close` for `rtol=atol=0`. In this case the closeness definition `abs(actual - expected) <= atol + rtol * abs(expected)` boils down to `abs(actual - expected) <= 0`. Since `abs(x)` can never be `<0`, this is equivalent to `abs(a - b) == 0` and this again boils down to `a == b`.~

Following https://github.com/pytorch/pytorch/pull/58918#issuecomment-860642057 and some offline discussions, we opted to use `assert_equal` as an example how to `partial` it.

This makes maintaining the module a lot easier, because we don't need to keep two functions in sync.
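
The idea in one line (a sketch; the actual definition lives in torch.testing):

```python
import functools
import torch.testing

assert_equal = functools.partial(torch.testing.assert_close, rtol=0, atol=0)
```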

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259404

Pulled By: mruberry

fbshipit-source-id: fa1a1fa93672a7ed1c5f0e4beb0dcd45b5c14fce
2021-06-21 20:44:21 -07:00
583f072778 introduce TestingErrorMeta for internal use (#58917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58917

In #54780 we opted to return `Optional[Exception]` from all internal
helper functions. Since then multiple PRs added functionality that needs
to amend the error message. For this we recreate the error

09a1b1cf87/torch/testing/_asserts.py (L417-L430)

To untangle this a little, this PR introduces the `_TestingErrorMeta`,
which carries the exception type and the message. The idiom

```python
exc = check_foo()
if exc:
    return exc
```

is still valid although `exc` should be renamed to `error_meta` to
reflect the new nature. In the top-level functions
`assert_(equal|close)`

```python
exc = check_foo()
if exc:
    raise exc
```

changes to

```python
error_meta = check_foo()
if error_meta:
    raise error_meta.to_error()
```

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259405

Pulled By: mruberry

fbshipit-source-id: 9078fe326283d5aa3d0cf256bf007887df9bfbfb
2021-06-21 20:44:20 -07:00
cf789b9941 remove pytest.UsageError (#58916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58916

Using `pytest.UsageError` in case `pytest` is available adds almost
nothing as observed in
https://github.com/pytorch/pytorch/pull/53820#discussion_r593868752, but
makes it harder to maintain: due to the conditional import, `mypy` is
not able to handle `UsageError` in a type annotation.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259409

Pulled By: mruberry

fbshipit-source-id: 82b00d13fa47db77383996d0caa69177804a48b6
2021-06-21 20:44:18 -07:00
9fffd05e54 hide top-level test functions from pytest's traceback (#58915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58915

History:

- It was included for internal helper functions in the initial proposal
  in #53820
- It was removed in #54780, since it is not honored when used with
  `pytest`'s `--tb=native`, which is the default for PyTorch

Since PyTorch shouldn't be the only user of `assert_(equal|close)` we
add it here to the top-level functions `assert_(equal|close)`. If
`pytest` is used without `--tb=native`, the traceback for

```python
assert torch.eq(actual, expected), "Tensors are not equal!"
torch.testing.assert_equal(actual, expected)
```

looks the same, making it more concise.
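
A minimal sketch of the mechanism: setting `__tracebackhide__` in a function tells pytest to skip that frame when rendering a failure (and, as noted above, it is not honored with `--tb=native`):

```python
import torch

def assert_equal(actual, expected):
    __tracebackhide__ = True  # pytest: hide this frame from tracebacks
    if not torch.equal(actual, expected):
        raise AssertionError("Tensors are not equal!")
```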

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259406

Pulled By: mruberry

fbshipit-source-id: acee47b30b7f14def27433f7d56a4b19d77393c0
2021-06-21 20:44:16 -07:00
18d45b960b remove rogue raise in helper function (#58914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58914

Only the top-level functions `assert_(equal|close)` should raise the
exception to keep the traceback manageable.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D29259408

Pulled By: mruberry

fbshipit-source-id: 40dd52eec6f9e8166b3b239d5172ee44b749e8dc
2021-06-21 20:43:06 -07:00
dca97b4394 Weighted decay with frequency (count-based) (#60382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60382

Instead of setting weight_decay w uniformly for all ids, for each row i in the sparse embedding table, the actual weight_decay `w_i` becomes `w*freq_i` where `freq_i = halflife/counter_i \in [\log(2), halflife]`. Counter is from `rowwise_counter` with definition `counter_i = 1 + \exp(-iter_{\delta}*\rho)*counter_i`.

Test Plan:
buck test //caffe2/caffe2/python/operator_test:adagrad_test -- test_row_wise_sparse_adagrad

buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay

Reviewed By: 0x10cxR1

Differential Revision: D25581030

fbshipit-source-id: 54b3831b20516c76c559b13d8deb809e2ee3b446
2021-06-21 18:46:35 -07:00
8f03018980 [pytorch] Move signal handler test to internal codebase (#60394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60394

Move signal handler test to internal codebase

Github issue: https://github.com/pytorch/pytorch/issues/60260

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing:api_test

    buck test mode/dev-nosan //caffe2/torch/distributed/elastic/multiprocessing/fb/test:api_test

Reviewed By: cbalioglu

Differential Revision: D29273160

fbshipit-source-id: e4ae72f7f6d54cbba324119fce7446a30a6c37c9
2021-06-21 18:26:41 -07:00
af3f7a210a add BFloat16 support for kthvalue and median on CPU (#60074)
Summary:
Add BFloat16 support for kthvalue and median on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60074

Reviewed By: gchanan

Differential Revision: D29230348

Pulled By: heitorschueroff

fbshipit-source-id: fa9c086758d51069acf270faa526e4b141b0ef68
2021-06-21 17:52:18 -07:00
2606022d01 [package] fix for edge case os and os.path importing (#60276)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60276

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29234143

Pulled By: Lilyjjo

fbshipit-source-id: 4d96dde4ef1d84f9966f9f58c883ab9bb92fe728
2021-06-21 16:54:02 -07:00
25e077bce1 [Issue 59296] added VE device (#59620)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59620

Reviewed By: zou3519

Differential Revision: D29196830

Pulled By: ezyang

fbshipit-source-id: 7bb49f776dc755804a0ba0bc3a7dbdab9c93914e
2021-06-21 16:44:52 -07:00
9d1d799034 Added API to change logging levels for JIT (#58821)
Summary:
Description:
- Before this, the logging level could only be changed via the env
variable "PYTORCH_JIT_LOG_LEVEL"
    - The level can now be changed from Python
- Have not added stream configuration for now
- Configuration is stored in a singleton class managing the options

Issue Link: https://github.com/pytorch/pytorch/issues/54188

Gotchas:
- Created separate functions
`::torch::jit::get_jit_logging_levels/set_jit_logging_levels` instead of
using the singleton class's method directly
    - This is because when running test cases, two different instances
    of the singleton are created for the test suite and the actual code
    (`jit_log.cpp`)
    - On using these methods directly, `is_enabled` calls the singleton
    in `jit_log.cpp` while we are setting the config using another
    singleton
    - See: https://stackoverflow.com/questions/55467246/my-singleton-can-be-called-multiple-times

API:
- To set the level: `torch._C._jit_set_logging_option("level")`
- To get the level: `torch._C._jit_get_logging_option()`
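
Example usage of the API above (the option string uses the same syntax as PYTORCH_JIT_LOG_LEVEL):

```python
import torch

torch._C._jit_set_logging_option(">dead_code_elimination")
print(torch._C._jit_get_logging_option())  # '>dead_code_elimination'
```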

Testing:
- UTs were added for C++
- A very simple UT was added for python to just check if the API is
being called correctly
- The API was checked by running trace in a sample python file
    - Set env variable to "" and used `_jit_set_logging_option` in python to set the variable to `>dead_code_elimination`
    - The error output had logs of form [DUMP..] [UPDATE...] etc

Fixes https://github.com/pytorch/pytorch/issues/54188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58821

Reviewed By: soulitzer

Differential Revision: D29116712

Pulled By: ZolotukhinM

fbshipit-source-id: 8f2861ee2bd567fb63b405953d035ca657a3200f
2021-06-21 16:10:49 -07:00
82a6574d89 cmake: Use BUILD_INTERFACE with TORCH_SRC_DIR (#60403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60403

TORCH_SRC_DIR has the potential to be hardcoded, thus breaking downstream
cmake extensions. Prefer CMAKE_CURRENT_SOURCE_DIR with BUILD_INTERFACE
to make it magically work together.

See https://cmake.org/cmake/help/latest/command/target_include_directories.html

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29276503

Pulled By: seemethere

fbshipit-source-id: 6ec0754de6a02cdc35a4a453d6271ac4fdfc5ee3
2021-06-21 15:37:27 -07:00
8dd1dc89cb [PyTorch][Edge] Adding tests for lite quantized models (#60226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60226

# Context
Read this post for details about why we need a test bench for quantized lite modules
https://fb.workplace.com/groups/2322282031156145/permalink/4289792691071726/

# This Diff
Adds test cases for Quantized Lite modules
ghstack-source-id: 131859101

Test Plan:
```
[ ~/fbsource/fbcode] buck test caffe2/test:mobile -- mobile.test_lite_script_module.TestLiteScriptQuantizedModule
Unable to connect to Buck daemon, restarting it...

Running with tpx session id: 44cf0b2f-0905-444a-95df-4a2eec774163
Trace available for this run at /tmp/tpx-20210618-093849.343917/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7036874461151326
    ✓ ListingSuccess: caffe2/test:mobile - main (16.736)
    ✓ Pass: caffe2/test:mobile - test_two_layer (mobile.test_lite_script_module.TestLiteScriptQuantizedModule) (14.836)
    ✓ Pass: caffe2/test:mobile - test_annotated_nested (mobile.test_lite_script_module.TestLiteScriptQuantizedModule) (15.073)
    ✓ Pass: caffe2/test:mobile - test_quantization_example (mobile.test_lite_script_module.TestLiteScriptQuantizedModule) (16.286)
    ✓ Pass: caffe2/test:mobile - test_single_layer (mobile.test_lite_script_module.TestLiteScriptQuantizedModule) (18.360)
Summary
  Pass: 4
  ListingSuccess: 1
```

https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874461151326/

Reviewed By: iseeyuan

Differential Revision: D29212232

fbshipit-source-id: 8d0b61b3f414e31720f1e3ce681ec8fa716555c1
2021-06-21 15:09:42 -07:00
5bd49c3396 fix workflow id usage in GHA (#60376)
Summary:
This fixes: https://github.com/pytorch/pytorch/issues/60139

The GHA workflow ID was previously set to `run_id`, which doesn't change across re-runs;
see: https://docs.github.com/en/actions/reference/environment-variables#default-environment-variables

We now use GITHUB_RUN_NUMBER to report the workflow ID instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60376

Test Plan:
CI
see the [with rerun](https://github.com/pytorch/pytorch/actions/runs/952508536) and [without rerun](https://github.com/pytorch/pytorch/actions/runs/955665324) examples: they reported everything under the same run ID, but in fact the first one ran twice as many test cases, as reported in Scuba. This shouldn't occur after this PR.

Reviewed By: samestep

Differential Revision: D29267455

Pulled By: walterddr

fbshipit-source-id: 00fc6b75b84861e2f7d3e21698a5f840c3c21dcd
2021-06-21 14:54:49 -07:00
1f50dc6e46 Fix ignoring Tensor properties in torch.overrides (#60050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60050

It doesn't work to put torch.Tensor.prop.__get__ in the ignored
list.  Now it does.  (Not exercised here, see next diff in stack).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29171464

Pulled By: ezyang

fbshipit-source-id: e7354668b481f9275f2eb5bb3a6228d1815fecea
2021-06-21 14:49:51 -07:00
65f33ec85c Follow-up fix for compilation error on CUDA92 (#60287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60287

Follow up of #60017

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D29236208

Pulled By: ejguan

fbshipit-source-id: f1acf9630b45fea8cbdf7d64e47661643d0a52b8
2021-06-21 13:29:11 -07:00
01e0296eb7 [special] migrate log1p, sinc, round to special namespace (#55878)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/50345
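
The migrated entry points in use (assuming they mirror the existing ops, as the other `torch.special` migrations do):

```python
import torch

x = torch.tensor([0.0, 0.5, 1.5])
print(torch.special.log1p(x))  # same as torch.log1p
print(torch.special.sinc(x))   # normalized sinc: sin(pi*x) / (pi*x)
print(torch.special.round(x))  # same as torch.round
```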

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55878

Reviewed By: zou3519, janeyx99

Differential Revision: D29160593

Pulled By: mruberry

fbshipit-source-id: f3ca9c541382bab33fb85d7817ce8ddc117c6826
2021-06-21 12:34:29 -07:00
769c299dcf [caffe2] add tests for inplace elementwise ops (#60106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60106

In Caffe2, some elementwise in-place compatible ops lack coverage for the in-place case. We add tests for a subset of them here and thereby increase coverage.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test
```
Let CI run.

Reviewed By: clrfb

Differential Revision: D29143189

fbshipit-source-id: 83138ad8eff8fe95c40aece53714da3577396a23
2021-06-21 12:04:18 -07:00
f66b53e8b2 Ignore unsupported attribute checker pass for torch.jit.trace (#60200)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60200

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29207583

Pulled By: tugsbayasgalan

fbshipit-source-id: 241620209dbafc94ebdb83d99257e341b11e999b
2021-06-21 11:55:12 -07:00
b505adbb09 Fix typo in ChainDataset docs (#60336)
Summary:
* chainning -> chaining

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60336

Reviewed By: bdhirsh

Differential Revision: D29265236

Pulled By: anjali411

fbshipit-source-id: 17a9b73af9e094550bd1ee25bc9439fb8d455e2b
2021-06-21 11:47:21 -07:00
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely all pieces will be 'free' at the same time, so the original allocation can never be returned.  Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block); see the sketch after this list
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering
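
A hedged sketch of the matching rule from the list above; the constants and names are illustrative, not the allocator's actual C++ code:

```python
OVERSIZE = 200 * 1024 * 1024  # blocks above this are never split

def block_matches(request_size, block_size, tolerance=20 * 1024 * 1024):
    if block_size < OVERSIZE:
        return block_size >= request_size  # smaller blocks may be split
    # Oversize blocks must closely match the request, e.g. a 200 MB
    # request can reuse a 205 MB block but not a 300 MB one.
    return request_size <= block_size <= request_size + tolerance
```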

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
eaa36ee679 Enable sharding for Windows GHA CI (#59970)
Summary:
Enables sharding for Windows on CI. To make that possible, we currently remove the smoke tests tested in shard 1, which don't seem all that important as they are
1. tested on nightlies
2. seemingly tested anyway by running the test suite

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59970

Reviewed By: seemethere

Differential Revision: D29268484

Pulled By: janeyx99

fbshipit-source-id: 7f90d73037cfeb2c267b28714550316eb471b4dd
2021-06-21 11:42:22 -07:00
023907a6fe Allow Docker build on macOS (#60375)
Summary:
This PR allows developers using macOS to build Docker images locally. The `basename $(mktemp -u)` part was suggested by seemethere; I modified it slightly to appease ShellCheck and because [Docker doesn't allow uppercase characters in tags](https://stackoverflow.com/a/54291205).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60375

Test Plan:
On a Mac:
```
cd .circleci/docker
./build.sh pytorch-linux-xenial-py3.6-gcc5.4
```

Reviewed By: driazati

Differential Revision: D29267025

Pulled By: samestep

fbshipit-source-id: ba27d2fb108f573a50db069cf9ddea0414ed6074
2021-06-21 11:27:49 -07:00
27e34f731a Re-enable clang-tidy on PRs (#60297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60297

This switches clang-tidy to the fresh tag from https://github.com/pytorch/test-infra/runs/2860763986 which has a fix for the missing OMP headers we were seeing. Along with #60225 this should restore clang-tidy to normal functionality and we shouldn't see any spurious warnings.

Test Plan: Imported from OSS

Reviewed By: seemethere, 1ntEgr8

Differential Revision: D29239783

Pulled By: driazati

fbshipit-source-id: b1893256fdb27436af03d6c5279e81f64b47fe6b
2021-06-21 11:04:09 -07:00
c16f87949f ENH Adds nn.ReflectionPad3d (#59791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655

This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code from the backward and forward pass.
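
Basic usage of the new module:

```python
import torch
from torch import nn

pad = nn.ReflectionPad3d(1)                   # pad all six sides by 1
x = torch.arange(8.0).reshape(1, 1, 2, 2, 2)  # (N, C, D, H, W)
print(pad(x).shape)  # torch.Size([1, 1, 4, 4, 4])
```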

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791

Reviewed By: gchanan

Differential Revision: D29242015

Pulled By: jbschlosser

fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
2021-06-21 10:53:14 -07:00
f89ae9cb8d Moves grid_sampler to autocast promote list (#58618)
Summary:
Should close https://github.com/pytorch/pytorch/issues/42218

Numerically, `grid_sampler` is fine in fp16 or fp32, but takes several inputs and expects their dtypes to match, so it belongs on the autocast promote list.

`grid_sampler` currently uses `gpuAtomicAdd`, notoriously slow in fp16 because it calls cuda's atomicAdd __half overload which uses a software compare-and-swap loop internally. To allow good performance if both inputs happen to be FP16, the PR also modifies `grid_sampler_[2,3]d_backward_kernel`s to use `fastAtomicAdd` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58618

Reviewed By: mruberry

Differential Revision: D29257199

Pulled By: ngimel

fbshipit-source-id: 3cc7505945b480427f2fc1beb36bee80bf3853b3
2021-06-21 10:22:36 -07:00
61e0bc1955 [nnc] Remove check on initializer in compressBuffer (#60194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60194

Test Plan: Imported from OSS

Reviewed By: bertmaher, huiguoo

Differential Revision: D29206255

Pulled By: navahgar

fbshipit-source-id: 0a68ec4067c37f06ca1ea9ddeeb5ad5e0dcb0639
2021-06-21 09:57:37 -07:00
f2bb0932da [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D29259226

fbshipit-source-id: 15fd79f6fed38d6ed2d84018852806683d5a09fa
2021-06-21 03:57:10 -07:00
5ff407df67 Skips failing MacOS tests (#60348)
Summary:
Mitigates, but does not fix https://github.com/pytorch/pytorch/issues/60347.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60348

Reviewed By: ngimel

Differential Revision: D29257917

Pulled By: mruberry

fbshipit-source-id: de9be93ddeda1ca27ea2ff4650162f886d10f1e2
2021-06-21 01:35:36 -07:00
1dee99c973 LU Solve using cublas and cusolver (#59148)
Summary:
This PR introduces cuSOLVER and cuBLAS for the `lu_solve` routine. Solves a part of https://github.com/pytorch/pytorch/issues/47953.

Since usage of cuSOLVER with MAGMA introduces performance regressions in MAGMA (https://github.com/pytorch/pytorch/issues/56590), we use heuristics for determining when to call cuSOLVER, cuBLAS or MAGMA depending on the batch and matrix sizes. The 64-bit cuSOLVER API is not introduced in this PR since there are several problems with the LU factorization using cusolver (https://github.com/pytorch/pytorch/pull/59148).

The following are performance benchmarks using various configurations:

<details>

```
[--------------------------------------------------------- LU solve CUDA torch.float64 ----------------------------------------------------------]
                                     |  lu_solve CUSOLVER  |  lu_solve MAGMA  |  lu_solve CUBLAS  |  lu_solve cuSOLVER/MAGMA  |  lu_solve TEST ALL
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------
      torch.Size([1, 1, 1])          |          703.4      |        489.8     |         511.8     |             710.1         |          487.1
      torch.Size([2, 1, 1])          |          738.9      |        504.1     |         513.0     |             958.2         |          494.4
      torch.Size([4, 1, 1])          |          790.7      |        514.7     |         506.8     |             983.9         |          540.2
      torch.Size([8, 1, 1])          |          865.3      |        496.4     |         514.7     |             975.2         |          520.0
      torch.Size([16, 1, 1])         |          955.5      |        483.9     |         508.3     |             937.6         |          526.5
      torch.Size([32, 1, 1])         |         1167.7      |        495.2     |         511.2     |             934.0         |          528.7
      torch.Size([64, 1, 1])         |         1730.0      |        492.1     |         537.8     |             936.4         |          533.2
      torch.Size([128, 1, 1])        |         2748.4      |        499.7     |         526.5     |             982.9         |          540.8
      torch.Size([1, 2, 2])          |          724.6      |        498.2     |         541.7     |             715.0         |          504.7
      torch.Size([2, 2, 2])          |          737.0      |        514.3     |         527.6     |             934.5         |          524.5
      torch.Size([4, 2, 2])          |          750.5      |        524.1     |         537.4     |             935.5         |          543.0
      torch.Size([8, 2, 2])          |          844.8      |        513.7     |         538.9     |             953.3         |          534.4
      torch.Size([16, 2, 2])         |         1013.1      |        521.9     |         530.0     |             932.2         |          537.9
      torch.Size([32, 2, 2])         |         1335.8      |        515.1     |         544.4     |             939.9         |          559.5
      torch.Size([64, 2, 2])         |         1819.6      |        511.8     |         534.1     |             973.9         |          540.0
      torch.Size([128, 2, 2])        |         3018.7      |        526.3     |         546.1     |             979.3         |          543.5
      torch.Size([1, 8, 8])          |          732.5      |        524.9     |         532.9     |             762.4         |          516.8
      torch.Size([2, 8, 8])          |          771.2      |        514.9     |         538.7     |            1007.5         |          531.1
      torch.Size([4, 8, 8])          |          811.3      |        507.7     |         534.6     |            1002.2         |          548.5
      torch.Size([8, 8, 8])          |          866.6      |        530.0     |         532.0     |            1016.1         |          562.9
      torch.Size([16, 8, 8])         |          991.8      |        533.6     |         548.0     |            1022.6         |          548.5
      torch.Size([32, 8, 8])         |         1271.7      |        541.2     |         534.7     |            1013.8         |          545.6
      torch.Size([64, 8, 8])         |         1817.2      |        530.2     |         520.6     |            1008.7         |          566.3
      torch.Size([128, 8, 8])        |         2678.7      |        531.6     |         552.2     |            1006.2         |          555.0
      torch.Size([1, 16, 16])        |          738.2      |        546.1     |         536.6     |             775.6         |          540.1
      torch.Size([2, 16, 16])        |          782.6      |        543.5     |         539.6     |            1010.9         |          541.1
      torch.Size([4, 16, 16])        |          815.2      |        546.1     |         560.9     |            1012.5         |          553.1
      torch.Size([8, 16, 16])        |          877.7      |        543.0     |         547.9     |            1012.8         |          551.5
      torch.Size([16, 16, 16])       |         1008.7      |        549.2     |         562.7     |            1016.6         |          546.8
      torch.Size([32, 16, 16])       |         1291.9      |        540.8     |         560.3     |            1055.8         |          539.3
      torch.Size([64, 16, 16])       |         1846.3      |        553.5     |         556.0     |            1010.8         |          551.9
      torch.Size([128, 16, 16])      |         2953.8      |        562.7     |         547.5     |            1026.2         |          555.8
      torch.Size([1, 32, 32])        |          789.1      |        590.6     |         590.9     |             790.5         |          579.0
      torch.Size([2, 32, 32])        |          806.9      |        596.6     |         600.2     |            1085.6         |          573.8
      torch.Size([4, 32, 32])        |          852.0      |        597.9     |         588.2     |            1098.9         |          574.7
      torch.Size([8, 32, 32])        |          914.2      |        597.8     |         591.4     |            1090.3         |          585.7
      torch.Size([16, 32, 32])       |         1063.0      |        604.6     |         597.3     |            1094.0         |          580.5
      torch.Size([32, 32, 32])       |         1302.0      |        602.0     |         598.9     |            1090.3         |          583.6
      torch.Size([64, 32, 32])       |         1861.7      |        601.1     |         599.8     |            1113.4         |          588.6
      torch.Size([128, 32, 32])      |         3251.0      |        619.6     |         595.3     |            1106.8         |          608.9
      torch.Size([1, 64, 64])        |          978.6      |        842.7     |         778.6     |            1071.4         |          825.8
      torch.Size([2, 64, 64])        |         1072.3      |        845.7     |         785.4     |            1400.6         |          829.0
      torch.Size([4, 64, 64])        |         1051.9      |        842.9     |         796.1     |            1352.2         |          788.2
      torch.Size([8, 64, 64])        |         1090.3      |        834.1     |         805.2     |            1382.6         |          804.7
      torch.Size([16, 64, 64])       |         1206.9      |        835.7     |         802.2     |            1365.6         |          801.2
      torch.Size([32, 64, 64])       |         1671.2      |        846.5     |         794.5     |            1345.1         |          814.2
      torch.Size([64, 64, 64])       |         2759.3      |        848.5     |         795.4     |            1409.7         |          832.9
      torch.Size([128, 64, 64])      |         4928.6      |        877.4     |         848.3     |            1439.0         |          883.9
      torch.Size([1, 128, 128])      |         1315.6      |       1158.4     |        1130.0     |            1301.3         |         1177.1
      torch.Size([2, 128, 128])      |         1334.7      |       1198.2     |        1186.6     |            1703.9         |         1209.5
      torch.Size([4, 128, 128])      |         1374.6      |       1200.7     |        1266.2     |            1640.6         |         1272.3
      torch.Size([8, 128, 128])      |         1453.6      |       1215.9     |        1287.3     |            1669.1         |         1288.7
      torch.Size([16, 128, 128])     |         1882.1      |       1244.9     |        1337.6     |            1698.8         |         1347.1
      torch.Size([32, 128, 128])     |         2789.0      |       1284.5     |        1398.6     |            1747.6         |         1396.3
      torch.Size([64, 128, 128])     |         4763.0      |       1425.2     |        1581.7     |            1921.0         |         1584.1
      torch.Size([128, 128, 128])    |         8835.9      |       1808.9     |        1968.7     |            2197.6         |         1961.8
      torch.Size([1, 512, 512])      |         4369.9      |       4577.6     |        4804.0     |            4331.4         |         4599.0
      torch.Size([2, 512, 512])      |         4635.9      |       4850.1     |        5159.1     |            5315.4         |         4845.5
      torch.Size([4, 512, 512])      |         5367.5      |       5261.6     |        6134.7     |            5807.8         |         5345.2
      torch.Size([8, 512, 512])      |         7025.2      |       6184.5     |        7065.6     |            6711.6         |         6303.9
      torch.Size([16, 512, 512])     |        10221.3      |       7849.7     |        8820.1     |            8323.6         |         7992.1
      torch.Size([32, 512, 512])     |        16574.8      |      11208.4     |       12284.3     |           11704.7         |        11394.4
      torch.Size([64, 512, 512])     |        29500.1      |      18043.1     |       19249.3     |           18744.0         |        18242.1
      torch.Size([128, 512, 512])    |        56783.3      |      33903.9     |       34713.5     |           33893.8         |        34041.8
      torch.Size([1, 1024, 1024])    |        14864.5      |      15714.6     |       16128.1     |           14726.7         |        14992.6
      torch.Size([2, 1024, 1024])    |        17891.0      |      18553.3     |       19111.6     |           19271.5         |        19283.0
      torch.Size([4, 1024, 1024])    |        22143.4      |      21909.2     |       23667.1     |           22698.9         |        22713.8
      torch.Size([8, 1024, 1024])    |        30621.1      |      28669.9     |       30822.9     |           29725.0         |        29760.8
      torch.Size([16, 1024, 1024])   |        47045.9      |      41900.0     |       44353.8     |           43215.6         |        43237.5
      torch.Size([32, 1024, 1024])   |        79245.5      |      68316.9     |       70959.0     |           69506.4         |        69876.7
      torch.Size([64, 1024, 1024])   |       147973.9      |     121120.6     |      124601.1     |          122084.4         |       122578.7
      torch.Size([128, 1024, 1024])  |       295586.2      |     232871.8     |      237421.8     |          233765.3         |       234704.6

Times are in microseconds (us).
```

</details>

Here are the details of how the tests were performed:
* CUSOLVER - Only call `cusolver` for all problem sizes.
* MAGMA - Only call `magma` for all problem sizes (this is the current master branch).
* CUBLAS - Only call `cublas` for all problem sizes.
* cuSOLVER / MAGMA - Use cusolver for `batch_size == 1` and magma for all others.
* TEST ALL - Employ heuristics to switch between cublas/cusolver/magma; this yields the best overall results (this PR). A sketch of such a heuristic follows this list.
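
For illustration, a dispatch heuristic of this kind might look like the sketch below; the thresholds and backend choices here are hypothetical, not the ones used in this PR:

```python
# Hypothetical illustration only: the thresholds are made up.
def pick_lu_solve_backend(batch_size, n):
    if batch_size == 1:
        return "cusolver"  # single large problems tend to favor cuSOLVER
    if n <= 64:
        return "cublas"    # many small problems tend to favor batched cuBLAS
    return "magma"         # large batched problems tend to favor MAGMA
```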

Script for reproducing the results:

<details>

``` python

import torch
import pickle
import itertools
from torch.utils.benchmark import Timer
import sys

shapes = [1, 2, 8, 16, 32, 64, 128, 512, 1024]
batches = [(1,), (2,), (4,), (8,), (16,), (32,), (64,), (128,)]
results = []
num_threads = 1
dtype = torch.float64
repeats = 2

def lu_factorize_solve(mat, b):
    # Factorize once, then reuse the factorization for the solve.
    lu_data = torch.lu(mat)
    x = torch.lu_solve(b, *lu_data)

for shape, batch in itertools.product(shapes, batches):
    mat = torch.randn(*batch, shape, shape, dtype=dtype, device='cuda')
    b = torch.randn(*batch, shape, 1, dtype=dtype, device='cuda')

    tasks = [("lu_factorize_solve(mat, b)", "lu_solve CUSOLVER")]

    print("shape: ", shape, " batch: ", batch)

    timers = [Timer(stmt=stmt, num_threads=num_threads, label=f"LU solve CUDA {dtype}",
                    sub_label=f"{mat.shape}", description=label, globals=globals()) for stmt, label in tasks]
    for i, timer in enumerate(timers * repeats):
        results.append(
            pickle.dumps(timer.blocked_autorange())
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()

# Serialize the collected measurements for later comparison.
with open("cusolver_lu_solve.pickle", "wb") as f:
    pickle.dump(results, f)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59148

Reviewed By: H-Huang

Differential Revision: D29160609

Pulled By: mruberry

fbshipit-source-id: 7280f25db1e66aa650ea15608a6dc5d688fb4db2
2021-06-20 21:27:35 -07:00
4a3eea9a6a [quant][graphmode][fx] Produce reference linear module in convert (#60152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60152

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29188263

fbshipit-source-id: f7bbbef5d4d747eadf7a627a4e77a5ec9bb0bc94
2021-06-20 20:08:12 -07:00
510334f34b [BE] clean up IS_PYTORCH_CI and IN_CI (#60279)
Summary:
`IS_PYTORCH_CI` and `IN_CI` are used interchangeably; however, in some cases `IN_CI` is not currently set because it only exists in .circleci/scripts/setup_ci_environment.sh. This cleans up the two flags and uses only `IN_CI`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60279

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D29239545

Pulled By: walterddr

fbshipit-source-id: a069424a2bb8790a3adfdaf0dc460301026bf8c7
2021-06-20 19:45:07 -07:00
2293ab4e53 [quant][graphmode][fx] Refactor convert for linear to use get_static_module_mapping and get_dynamic_module_mapping (#60151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60151

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29188264

fbshipit-source-id: d2b77ffcf4b7446fc6c43248e43218092d2a6aea
2021-06-20 19:41:16 -07:00
a516424a70 Update internal code for torch.linalg.solve (#56613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56613

Replace linalg_solve_helper with `lu_stub` + `lu_solve_stub`.
Once `lu_stub` and `lu_solve_stub` have a cuSOLVER-based codepath,
`torch.linalg.solve` will have it as well.
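
As a sanity check of this relationship, a minimal sketch (not from the PR) comparing the two paths:

```python
import torch

A = torch.randn(4, 3, 3, dtype=torch.float64)
b = torch.randn(4, 3, 1, dtype=torch.float64)

# linalg.solve is equivalent to an LU factorization followed by an LU solve.
LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)

assert torch.allclose(x, torch.linalg.solve(A, b))
```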

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28627408

Pulled By: mruberry

fbshipit-source-id: b95bbdf35f845a56a1489c04b53742a01b36e789
2021-06-20 19:37:12 -07:00
47d727fe1b [quant][graphmode][fx] Produce conv reference static quant modules (#60138)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60138

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29184791

fbshipit-source-id: 971a40012dbba0cf687c62a3a4af9358513c253b
2021-06-20 19:25:45 -07:00
b298013cd5 [add/sub] Cast alpha to acc_type (#60227)
Summary:
This PR lets `torch.add` & `torch.sub` CUDA kernels cast `alpha` to `acc_type`, not `scalar_t`.
I do not remove `cast`s from `test/test_foreach.py` because I'll do this in https://github.com/pytorch/pytorch/issues/59907 or follow-up for it.

Currently, upstream `torch._foreach_add` & `torch._foreach_sub` upcast the `alpha` parameter to `acc_type<scalar_t>`, while `torch.add` & `torch.sub` do not. This is problematic because the outputs of `torch.add` and `torch.sub` differ from those of `torch._foreach_add` and `torch._foreach_sub`, respectively, when the dtype of the input tensors is either `torch.half` or `torch.bfloat16`. The discrepancy is roughly proportional to `abs(alpha)`, except when `alpha` is exactly representable in 16 bits.
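
A minimal sketch for observing the pre-fix discrepancy on CUDA (the magnitude depends on the inputs):

```python
import torch

x = [torch.rand(1024, dtype=torch.half, device='cuda')]
y = [torch.rand(1024, dtype=torch.half, device='cuda')]
alpha = 0.1  # not exactly representable in 16 bits

eager = torch.add(x[0], y[0], alpha=alpha)          # pre-fix: accumulates alpha as half
foreach = torch._foreach_add(x, y, alpha=alpha)[0]  # upcasts alpha to acc_type

print((eager - foreach).abs().max())  # nonzero before this PR
```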

ref:
- `torch._foreach_add` & `torch._foreach_sub` cast `alpha`: 6d0fb85a62/aten/src/ATen/native/cuda/ForeachBinaryOpList.cu (L21-L28), `BinaryOpListAlphaFunctor` is defined here: 6d0fb85a62/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L202)

related: https://github.com/pytorch/pytorch/issues/58833, https://github.com/pytorch/pytorch/pull/59907

cc ngimel ptrblck mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60227

Reviewed By: mruberry

Differential Revision: D29252759

Pulled By: ngimel

fbshipit-source-id: 847f3b9493ae30a900f7445af00aef1abcc1ab21
2021-06-20 19:05:22 -07:00
0131a5972d [DDP] Test inference works with eval() and no_grad() (#59666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59666

Tests that inference with a DDP model won't hang when the user sets eval()
or no_grad(). Note that if the model has a SyncBN layer, both eval() and
no_grad() are needed, since eval() makes SyncBN work like a regular BN layer.
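
A minimal sketch of the usage under test (process-group and model setup assumed; `ddp_model` wraps a model containing a SyncBatchNorm layer):

```python
ddp_model.eval()         # eval() makes SyncBN behave like a regular BN layer
with torch.no_grad():    # no_grad() skips gradient synchronization entirely
    out = ddp_model(inputs)
```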
ghstack-source-id: 131906625

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28974146

fbshipit-source-id: 137f8245b1c303beb2416518476e70fe67c73376
2021-06-20 12:02:43 -07:00
69b2bf70f9 [pytorch] fix tools/code_analyzer for llvm 11 (#60322)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60322

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29250420

Pulled By: ljk53

fbshipit-source-id: ff7f9cbacd1d9518ed81c06fc843a90d6948f760
2021-06-20 00:39:11 -07:00
c19acf816f Replace TensorRT's deprecated API in caffe2/python/trt/test_pt_onnx_trt.py (#60236)
Summary:
TensorRT v8 is going to remove some functions/methods that are used in this test.

ref:
- getMaxWorkspaceSize deprecation: b2d60b6e10/include/NvInfer.h (L6984-L6993)
- buildCudaEngine deprecation: b2d60b6e10/include/NvInfer.h (L7079-L7087)

cc ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60236

Reviewed By: gchanan

Differential Revision: D29232376

Pulled By: ngimel

fbshipit-source-id: 2b8a48787bf61c68a81568b6026d6afd5a83e751
2021-06-19 19:56:30 -07:00
5ec4ad7f54 [special] Add special.ndtri (#58650)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

TODO
* [x] Add docs https://13865352-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.ndtri
* [x] Add comments on implementation
* [x] Clean-up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58650

Reviewed By: H-Huang

Differential Revision: D29160170

Pulled By: mruberry

fbshipit-source-id: 50e4ea663920e97b8437d03d5b52bcd9dedc1a8d
2021-06-19 18:36:54 -07:00
5824a866b7 [pytorch][nnc] support custom class parameters (#59466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59466

Change saved parameter type from at::Tensor to at::IValue to support custom
class parameters, e.g. `__torch__.torch.classes.xnnpack.Conv2dOpContext`.

The NNC produced kernels won't deal with custom class parameters directly.
They simply pass through to the external operators that take these custom
class parameters, e.g. `prepacked::conv2d_clamp_run`.

It will reuse the `__getstate__` and `__setstate__` methods on the custom class
to persist and restore the state of the parameters.

When calling into the kernel, it will pass in the untyped raw pointer of the custom
class objects to the kernel as `void*`. It's similar to the regular tensor parameters,
for which it will pass in the raw data pointer of the tensor storage. The generated
kernel needs to hardcode the expected type for each parameter and cast before
calling the external ops.
ghstack-source-id: 131897904

Test Plan: - unit tests

Reviewed By: kimishpatel

Differential Revision: D28902496

fbshipit-source-id: 4b2c0895dd28f0b7d344aa08183d42ad6a355dae
2021-06-19 06:11:01 -07:00
cac9ae1506 [iOS GPU][BE][3/n] Give MPSImage objects a label for better debugging experience (#60282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60282

1. Add a label to the MPSImage objects; the label describes the size of the image.
2. Remove `[image markRead]`.
3. Rename two APIs to follow better naming conventions.
ghstack-source-id: 131839557

Test Plan:
1. CircleCI
2. buck test pp-mac

Reviewed By: SS-JIA

Differential Revision: D29232975

fbshipit-source-id: 075175c4b5a1c5b79e795f4860e1694d7c06d4f2
2021-06-18 18:47:05 -07:00
b9cd97c94b [iOS GPU][BE][2/n] Remove unused APIs (#60281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60281

1. Remove unused APIs from MPSImageUtils.
2. Move tensor-related APIs from MetalUtils to MetalTensorUtils; delete MetalUtils.h/mm.
3. Move Metal buffer-related APIs to MetalContext.
ghstack-source-id: 131839559

Test Plan:
1. CircleCI
2. buck test pp-mac

Reviewed By: SS-JIA

Differential Revision: D29232973

fbshipit-source-id: a4c0c848883b8ef615eeb2936c1f3d18cddcb318
2021-06-18 18:47:04 -07:00
80e6e3f1da [iOS GPU][BE][1/n] Rename MPSCNNContext to MetalContext (#60280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60280

No significant changes besides renaming the class. In the future, we'll convert this objc class to c++.
ghstack-source-id: 131827490

Test Plan:
- CircleCI
- buck test pp-mac

Reviewed By: SS-JIA

Differential Revision: D29231824

fbshipit-source-id: a0d1327a55a0414011c78a7144d3b05f1579cf42
2021-06-18 18:45:24 -07:00
319890b1b2 Support *args in Pipe.forward API. (#55441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55441

This is the first step towards supporting the proposal outlined in
https://github.com/pytorch/pytorch/issues/53952.

In this PR, `Pipe.forward()` accepts a `*inputs` argument instead of just a
single input as before. This lays the groundwork for supporting non-Tensor and
generic arguments in the Pipe API. In this PR we still only support Tensors;
non-Tensor support will come in future PRs.

For backward compatibility, a single `Tuple[Tensor]` input still
works as it did previously.
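
Illustrating the calling convention, as a sketch only; this assumes `pipe` is an initialized `torch.distributed.pipeline.sync.Pipe` and `x`, `y` are appropriately shaped tensors:

```python
# Pipe.forward returns an RRef to the output.
out = pipe(x).local_value()       # single Tensor input, as before
out = pipe(x, y).local_value()    # multiple Tensor inputs via *inputs (this PR)
out = pipe((x, y)).local_value()  # a single Tuple[Tensor] still works (BC)
```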
ghstack-source-id: 130767499

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27613887

fbshipit-source-id: 05e19e537e6d7fe4999745fc4ba9941ac54906de
2021-06-18 17:53:32 -07:00
a8430f1076 Remove PlacementSpec from ShardingSpecs. (#59990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59990

ShardingSpecs accepted a Device/PlacementSpec and were initially
written this way for flexibility. However, this is slightly confusing given
that there is no general use case for it. As a result, to keep things simple,
I've ensured that both specs only accept devices for now.

We can always extend this to include a general PlacementSpec later on.
ghstack-source-id: 131842525

Test Plan: waitforbuildbot

Reviewed By: SciPioneer, rohan-varma

Differential Revision: D29116463

fbshipit-source-id: a6f2b3f1346ac6afab91c9595d4cae4f4da04fda
2021-06-18 17:37:43 -07:00
1c97c3e3a4 DOC Adds LSTM docs for defined variables when bidirectional=True (#60120)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60120

Reviewed By: gchanan

Differential Revision: D29240245

Pulled By: jbschlosser

fbshipit-source-id: acad9c24f41f7253a7d42cd940e54bb66e083ecf
2021-06-18 17:28:44 -07:00
aae2a3c95e Clarify ConvTransposeNd + reference links (#60291)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60291

Reviewed By: gchanan

Differential Revision: D29239199

Pulled By: jbschlosser

fbshipit-source-id: 9b2de1a8b1a7444797f82c73195c5efc929562eb
2021-06-18 17:18:11 -07:00
e8e3394ea8 Recognize transposed dense tensors as a form of partial overlap (#59014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59014

Fixes #48401

`assert_no_overlap` currently has a false negative where it treats the
transpose of a contiguous tensor as fully overlapping. This happens because
the memory regions do fully overlap, but the strides are different, so
the actual elements don't all overlap.

This goes slightly in the other direction: by requiring strides to match
exactly, we get false positives for some unusual situations, e.g.
```
torch.add(a, a, out=a.view([1, *a.shape]))
```
or views that replace the strides of length-1 dimensions, etc. However, I think these are
sufficiently obscure that it's okay to error, and the common cases like
inplace operations still work as before.
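
For instance, a sketch of the behavior after this change (the exact error text may differ):

```python
import torch

a = torch.ones(3, 3)

# a.t() shares a's memory but with different strides: partial overlap,
# so this is now flagged instead of being treated as a full overlap.
# torch.add(a, a, out=a.t())  # raises a RuntimeError after this PR

# Identical strides (a true full overlap) still work, e.g. in-place add:
torch.add(a, a, out=a)
```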

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D29040928

Pulled By: ngimel

fbshipit-source-id: 5a636c67536a3809c83f0d3117d2fdf49c0a45e6
2021-06-18 16:29:25 -07:00
47bbc01e0b [nnc] Added micro-benchmark to show perf improvement with cat subgraph optimization (#59581)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59581

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28955317

Pulled By: navahgar

fbshipit-source-id: 53bb3dbfafbd3b146063f305523c2e6ec96cf6b8
2021-06-18 14:32:09 -07:00
d0c4ace00f [jit] Added a tranformation to move consumers of aten::cat to its inputs, in the fused subgraphs (#59580)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59580

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28955318

Pulled By: navahgar

fbshipit-source-id: 7504d5aea441920f4eb9234cdfa17077161ab13c
2021-06-18 14:32:07 -07:00
d4c626a346 [jit] Exported a method to get the supported list of elementwise ops (#60162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60162

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D29190841

Pulled By: navahgar

fbshipit-source-id: bb786a653441c5b586509e25cc80d357d2223af3
2021-06-18 14:32:05 -07:00
55755edc60 [jit] Made a list for element-wise ops. (#59579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59579

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D28955319

Pulled By: navahgar

fbshipit-source-id: 605531aedf9250a226b0401d55fda3427bdc6f33
2021-06-18 14:30:47 -07:00
a029422cae [quant][graphmode][fx][refactor] Change the env map to add dtype as a key (#60054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60054

Previously, env in convert was Dict[str, Tuple[Node, torch.dtype]]; that is, at a given time each node can only have one dtype. This causes a problem for the following case:
```
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 1)

    def forward(self, x):
        x = self.conv(x)
        x1 = x.expand_as(x)
        x2 = torch.add(x, x1)
        return x2

def forward(self, x):
    x = self.activation_post_process_0(x)
    x = self.conv(x)
    x = self.activation_post_process_1(x)
    x1 = x.expand_as(x)
    x1 = self.activation_post_process_2(x1)
    x2 = torch.add(x, x1)
    x2 = self.activation_post_process_3(x2)
    return x2

def forward(self, x):
    x = torch.quantize_per_tensor(x, ...)
    x = self.conv(x)  # quantized conv
    x = torch.dequantize(x)
    x1 = x.expand_as(x)
    x1 = torch.quantize_per_tensor(x1, ...)
    # Error: x is dequantized
    x2 = torch.ops.quantized.add(x, x1)
    return x2
```

Currently we have an env that maps a node name in the observed graph to the Node in the quantized graph. The problem is that following a quantized operator (conv) we have two operators: one expecting float input (expand_as) and one expecting quantized input (quantized add). In the quantized graph, ideally, expand_as should consume the dequantized output and quantized add should consume the quantized output:

```
quantized_conv - dequantize - expand_as
  \ ------- quantized_add
```

But currently in env, each node must either be quantized or not quantized. Therefore we need to change env to include dtype as a key: `env: Dict[str, Dict[dtype, Node]]`, e.g. `{'x': {torch.float: dequantized_node, torch.quint8: quantized_node}}`. When we load from env, we will also need to provide the dtype of the Node that we want to load. A separate pass can figure out this information for each node.
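
A minimal sketch of the new mapping (the helper names are illustrative, not the actual convert code; real values would be torch.fx Nodes):

```python
import torch

env = {}  # Dict[str, Dict[torch.dtype, Node]]; strings stand in for Nodes here

def store(name, dtype, node):
    env.setdefault(name, {})[dtype] = node

def load(name, dtype):
    # Callers now specify which flavor of the node they need.
    return env[name][dtype]

store("x", torch.float, "dequantized_node")
store("x", torch.quint8, "quantized_node")
assert load("x", torch.quint8) == "quantized_node"
```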

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29149408

fbshipit-source-id: c9e4b7d65444ab6a6f573929bae1db5037629892
2021-06-18 13:31:43 -07:00
c0f8cad0f0 Be fix shard inbalance (#60206)
Summary:
First step to address https://github.com/pytorch/pytorch/issues/60136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60206

Reviewed By: janeyx99

Differential Revision: D29215237

Pulled By: walterddr

fbshipit-source-id: ec25beb57366ef2eaf37878cdea391b245de9bef
2021-06-18 12:49:30 -07:00
d9e7df707b [TensorExpr] Add NNC lowerings for aten::mean, aten::addmm, and aten::adaptive_avg_pool2d. (#59347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59347

We had external call wrappers for them, but they were not used in NNC.
This PR adds lowerings using these ext calls and fixes some bugs in
them.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28853832

Pulled By: ZolotukhinM

fbshipit-source-id: 1718400368e1a9cf3f19180ee2290a4ed9c99d41
2021-06-18 11:56:32 -07:00
c6bb9409b8 [TensorExpr] Handle not-specified dtypes and strides. (#59346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59346

Currently JIT has a pass to propagate shapes, but doesn't have the
capability to fill in strides and dtypes. This PR works around that by
assuming the default dtype to be Float and the strides to correspond to a
contiguous layout, unless otherwise specified. Ideally we won't need
this; it is done simply as a workaround until the corresponding
features are implemented on the JIT side.

This is required for AOT compilation of mobilenet v3 with NNC.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28853831

Pulled By: ZolotukhinM

fbshipit-source-id: 81adb59409684f39b444909ab8ec58ee4a39d496
2021-06-18 11:56:30 -07:00
f042455a8d [JIT] ShapeProp: add missing ops from mobilenet v3. (#59163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59163

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28853833

Pulled By: ZolotukhinM

fbshipit-source-id: 451fb9ee848968049d26fb5623a904d8fa7bd6fc
2021-06-18 11:55:00 -07:00
3870e68644 TF32 threshold twiddling for tests (#60209)
Summary:
Following https://github.com/pytorch/pytorch/issues/59624, I observed some straggler test failures on Ampere due to TF32 thresholds. This PR just twiddles a few more thresholds to fix the six failing tests I saw on A100.

CC Flamefire ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60209

Reviewed By: gchanan

Differential Revision: D29220508

Pulled By: ngimel

fbshipit-source-id: 7c83187a246e1b3a24b181334117c0ccf2baf311
2021-06-18 11:41:33 -07:00
5f010c066f [package] Bring back save_source_file (#59962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59962

This reverts commit 44b021d21b5681c105529881bdbaefb6d3e335f6.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D29113224

Pulled By: zhxchen17

fbshipit-source-id: 55d42acc421c5f4abbbad9d9ed4d32b615939463
2021-06-18 11:13:35 -07:00
5a45103139 ns for fx: add API usage logging (#60103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60103

Adds internal logging for NS for FX API usage.

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D29166710

fbshipit-source-id: 2a1bf2f6038b0c6c5945b57b2db2de25c585a04a
2021-06-18 10:25:59 -07:00
0baad214b0 [static runtime][fix] resize to the input tensor size for full_like (#60229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60229

Fix a bug where we did not resize the output to the input tensor size,
causing the output to be incorrect.

Test Plan:
Test on replayer, rebased on D29217781, with model 278203319_26.

Verify with jit outputs (D28583950)

`./buck-out/gen/admarket/lib/ranking/prediction_replayer/replayer --model_inference_type_target=DISAGG_ACCELERATOR --prediction_replayer_force_model_type=inline_cvr_post_imp_model --prediction_replayer_force_model=278203319_26 --prediction_replayer_target_tier=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test --prediction_replayer_input_stream_filename=/data/users/ansha/tmp/adfinder/filtered_requests_inline_cvr_100 --ignore_model_id_mismatch --check_performance --fully_remote_sr_connection_options="overall_timeout:10000000,processing_timeout:10000000" --use_new_encoding_for_ads_services --use_new_encoding_from_model_id_to_shard_id --sigrid_force_model_dir=/data/users/ansha/tmp/adfinder/278203319_26/ --sigrid_predictor_model_suffix=.predictor.disagg.local —use_new_encoding_from_model_id_to_shard_id=true --prediction_replayer_force_model_kind=19 --pytorch_predictor_static_runtime_enable=true --prediction_replayer_target_qps=1`

Reviewed By: hlu1, movefast1990

Differential Revision: D29218918

fbshipit-source-id: dab4bbbabeaa8367174ed90edca43d6204c65409
2021-06-18 09:56:25 -07:00
d5df274ea5 [DDP] Support for multiple backwards (#59359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359

Move `prepare_for_backward` into `_DDPSink` backward instead of calling it in DDP forward pass so that we can run multiple backwards in DDP with `retain_graph=True`.

ghstack-source-id: 131774159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28855226

fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
2021-06-18 09:23:57 -07:00
3815a013ed Enable xenial-cuda11.1-cudnn8-py3.6-gcc7 in GHA (#60196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60196

Test Plan:
https://github.com/pytorch/pytorch/issues/60198: https://github.com/pytorch/pytorch/actions/runs/947796763

I should have used `ghstack` but I forgot; will do that in the future.

Reviewed By: walterddr

Differential Revision: D29231161

Pulled By: samestep

fbshipit-source-id: 8299a248ca9c1d36c3845d1c8a10ca9bf7101124
2021-06-18 09:18:53 -07:00
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` nevertheless flags. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern in question.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
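
The relevant switch is a one-line configuration change (shown here in mypy.ini syntax):

```ini
[mypy]
warn_unused_ignores = True
```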

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
7c29ca7f2b Fix Subset of a Subset not sliceable issue (#59513)
Summary:
A Dataset can be indexed by a list, but a list cannot be indexed by a list. This raises an error when slicing a Subset that was initialised with another Subset instead of a dataset.

Fixed the issue by changing the indices to a Tensor, which can be indexed by a list.

Fixes https://github.com/pytorch/pytorch/issues/59512
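
A repro sketch of the failure mode (the dataset construction here is illustrative):

```python
import torch
from torch.utils.data import Subset, TensorDataset

ds = TensorDataset(torch.arange(10))
sub = Subset(ds, [0, 2, 4, 6, 8])
subsub = Subset(sub, [1, 3])

sub[0:2]     # ok: the underlying tensor accepts the resulting list of indices
subsub[0:2]  # TypeError before this fix: `sub` itself gets indexed by a list
```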

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59513

Reviewed By: zou3519

Differential Revision: D29196891

Pulled By: ejguan

fbshipit-source-id: ccde6e474fbcbddd2e9c7c107bc8b5de1307cdb9
2021-06-18 07:07:34 -07:00
08ce5eedf5 [reland] Move RPC agents to libtorch (#60170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60170

Reland of #59939.

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29193234

fbshipit-source-id: ee2a90d6be961c10f91361512bdd4cadca43dd60
2021-06-18 05:15:09 -07:00
958b881d70 [reland] Add some TORCH_API annotations to RPC (#60169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60169

Reland of #59939.
ghstack-source-id: 131706861

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29193233

fbshipit-source-id: 91d3ef9003b9da7b99e1b9310b7f5a6c505d3b99
2021-06-18 05:15:07 -07:00
83fde5d981 [reland] Pass RequestCallback to FaultyPG RPC agent (#60168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60168

Reland of #59939.
ghstack-source-id: 131706860

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29193235

fbshipit-source-id: 170108956a041f6a91b2b21c76ab1a0e0cdd34a2
2021-06-18 05:13:57 -07:00
8a839c5478 Fix saved variable unpacking version counter (#60195)
Summary:
We only set the value and not the actual version counter (VC).
This means that, in the context of double backward, if that saved tensor is saved again and the original Tensor is modified in-place, we would not detect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60195

Reviewed By: Varal7

Differential Revision: D29208766

Pulled By: albanD

fbshipit-source-id: 81175f8e3f111f89524f8e46f47577b2ea4fc945
2021-06-18 04:36:46 -07:00
5609c2e59c Adds an OpInfo note (#57428)
Summary:
Like the title says. The OpInfo pattern can be confusing when first encountered, so this note links the Developer Wiki and tracking issue, plus elaborates on the goals and structure of the OpInfo pattern.

cc imaginary-person, who I can't add as a reviewer, unfortunately

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57428

Reviewed By: SplitInfinity

Differential Revision: D29221874

Pulled By: mruberry

fbshipit-source-id: aa73228748c9c96eadf2b2397a8b2ec31383971e
2021-06-18 03:40:42 -07:00
ecc37184a5 Fix clang-tidy path filtering (#60225)
Summary:
PR https://github.com/pytorch/pytorch/issues/60048 neglected to include the `--paths` option for file filtering, so it ended up passing every changed file in the diff to clang-tidy (cpp files outside `torch/csrc/`, yaml/sh files, etc.). This adds that back in to make the filtering work properly again.

Tested it manually by printing out the files to lint and running

```bash
curl -L https://github.com/pytorch/pytorch/pull/60018.diff > diff
python tools/clang_tidy.py --diff-file diff --paths torch/csrc/

curl -L https://github.com/pytorch/pytorch/pull/60222.diff > diff
python tools/clang_tidy.py --diff-file diff --paths torch/csrc/
```

Should fix https://github.com/pytorch/pytorch/issues/60192 and fix https://github.com/pytorch/pytorch/issues/60193, the files tripping errors there shouldn't have been passed to clang-tidy in the first place (supporting aten/ for clang-tidy is a separate task)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60225

Reviewed By: zhouzhuojie

Differential Revision: D29216251

Pulled By: driazati

fbshipit-source-id: b5d7fb7161d33eb7958a6f1ccc25809942045209
2021-06-17 23:03:59 -07:00
38c3116813 [hierarchical sharding 5/n] enable table-wise -> col-wise sharding in embedding table lookup
Summary:
This diff adds table-wise -> col-wise sharding support in GroupedShardedEmbeddingBag. Changes include:
1. Add the necessary member-variable setup.
2. Create a new fast kernel and add fast-kernel lookup support.
3. Add intra-host all2all and cross-host all2all logic.

Test Plan:
UT
```
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_spawn
```
```
buck test caffe2/torch/fb/hpc/tests:model_sharder_test
```
QPS check:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 10000 --num-dpp-worker-threads 16 --num-readers 100 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "["table_based", "column_based"]" --flow-entitlement ads_global_qps
```
with diff:
dec inline_cvr:
table-wise -> table-wise (82K):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_d0a0cba5?version=0&tab=status&env=PRODUCTION

table-wise -> column-wise (80k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_b1ac5873

column-wise:
dec inline_cvr:
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623827677%2F127.0.0.1%2Flibkineto_activities_4550.json.gz&bucket=gpu_traces

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_a79e1522 (81k)

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_2dacc13e (88k)

row-wise(62k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_4e349cab

table-wise(90k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_5d51b608

10x ctr_mbl_feed:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 128 --use-shrunk-model false --model-version=ctr_mbl_oct_2020_10x_3tb --num-dpp-worker-threads 16 --num-readers 200 --fast-kernel table_batched --max-batches 5000000 --hpc-identity ads_model_platform --table-partition column_based --flow-entitlement ads_global_tc_mimo
```
column-wise:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_f05fb306?version=0&tab=status&env=PRODUCTION (290k)

w/o diff:
dec inline_cvr:
column-wise (87K):
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623864444%2F127.0.0.1%2Flibkineto_activities_4451.json.gz&bucket=gpu_traces
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_e1315f14

row-wise (60k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_8fcc0adf

table-wise (91k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_cb94ff41

10x ctr_mbl_feed:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_203ef35b?version=0&tab=status&env=PRODUCTION (281k)

NE check (use deterministic reading, D28711400)
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 100000 --num-dpp-worker-threads 16 --num-readers 64 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "[table_based, column_based]" --flow-entitlement ads_global_qps --use-deterministic-model --use-deterministic-reading --model-entity-id 995557193
```
w/o this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|window_qps 491.5199890136719
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

w this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

Reviewed By: JadeNie

Differential Revision: D28689126

fbshipit-source-id: 1c7879d4e3ee2b90aaf2a89e87f7b827d54173b3
2021-06-17 22:25:25 -07:00
8b55e9feaf removed cat, equal, and stack from autocast promote list (#59497)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59497

Reviewed By: zou3519

Differential Revision: D29185909

Pulled By: ngimel

fbshipit-source-id: db96239106d9e46a2704b8f457fd0463dacc1f5c
2021-06-17 21:13:22 -07:00
faf459f13e [Profiler] Fix memory profiler merge issue (#60037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60037

The memory profiler was broken due to a mis-merge during a rebase. Add the lost line back.

Reviewed By: ezyang

Differential Revision: D29143469

fbshipit-source-id: c3bf0088ca12e7535eeddbede24e28201eccd5f4
2021-06-17 21:05:23 -07:00
bcf8752fb2 updated launch bounds for trilinear 3d (#59999)
Summary:
Updates launch bounds for the upsample_trilinear_3d forward and backward kernels to remove register spilling into local memory. This improves the forward-pass runtime by a 3-4x factor; the backward pass has the same runtime as before (probably a different bottleneck).

Timing data: (Using Nvidia Titan-V GPU)
![TrilinearTimingData](https://user-images.githubusercontent.com/22803332/121979658-72f19200-cd3f-11eb-9363-c00e2c4eea6d.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59999

Reviewed By: zou3519

Differential Revision: D29185976

Pulled By: ngimel

fbshipit-source-id: 0b2313e70e45c53938cd7262464d3aa4fab8da4a
2021-06-17 21:02:12 -07:00
7e032f18cf DOC Describes behavior for None in module.register_* (#60125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60125

Reviewed By: zou3519

Differential Revision: D29196138

Pulled By: jbschlosser

fbshipit-source-id: af736c0d66005ec33412860f00b233a5d2922137
2021-06-17 19:18:23 -07:00
047925dac1 .github: Run Windows CUDA build on pull requests (#60215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60215

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29214519

Pulled By: seemethere

fbshipit-source-id: 58df5ee49cc5cd46f48938f023f87a6da958f3b6
2021-06-17 16:30:31 -07:00
6af5d00e4b [torch][segment_reduce] Add support for multi-dimensional input (cuda) (#60018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60018

Same as title. This diff finishes cuda support for currently implemented reductions and input parameters.

Next Steps:
- Add support for sum/min
- More testing and benchmarking
- Cleanup
    - Update default values when length is 0
    - Use TensorIterator
    - Update documentation

Test Plan: Unit test to cover cuda forward path.

Reviewed By: ngimel

Differential Revision: D29135373

fbshipit-source-id: d070727eeb660f56782e7ac8a5b0798be688480a
2021-06-17 16:30:30 -07:00
a727f655c8 [torch][segment_reduce] Support for multi dimension (cpu only) (#59951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59951

Add support for multi-d input for cpu forward/backward implementation.

Next step: Adding cuda support for multi-d input.

Test Plan: Added unit tests.

Reviewed By: ngimel

Differential Revision: D29105457

fbshipit-source-id: a389ba4cc10f02434a336b8e7d36259f32552e11
2021-06-17 16:29:14 -07:00
8e67981995 .github: Disable clang-tidy for now (#60219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60219

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29214928

Pulled By: seemethere

fbshipit-source-id: 20cf38ebfe77ed646e25293c577937c56bd930d3
2021-06-17 16:26:31 -07:00
acf04cdedf Fix default DEFAULT_FILE_PATTERN in clang-tidy (#60212)
Summary:
Without the change, clang-tidy also checks folders like `.circleci/...`

Example of a clang-tidy run that looked into `.circleci` changes:
https://github.com/pytorch/pytorch/runs/2844682644?check_suite_focus=true

[skip ci]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60212

Reviewed By: seemethere

Differential Revision: D29214728

Pulled By: zhouzhuojie

fbshipit-source-id: fd53f7b2f7d88936264db1effdc06cc4fc271ca4
2021-06-17 16:25:18 -07:00
9c03de1dde Use mirrors for ubuntu apt source (#60216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60135

Experimented on circleci
https://app.circleci.com/pipelines/github/zhouzhuojie/gha-ci-playground/7/workflows/965c95b8-2186-434a-92ca-9cd9c8aaafdc/jobs/7

Sample logs
```
Need to get 1,389 kB of archives.
After this operation, 5,495 kB of additional disk space will be used.
Get:1 http://mirrors.ubuntu.com/mirrors.txt Mirrorlist [3,270 B]
Get:2 http://mirror.lstn.net/ubuntu focal/main amd64 libtcl8.6 amd64 8.6.10+dfsg-1 [902 kB]
Get:7 http://ubuntu.securedservers.com focal/main amd64 libipc-run-perl all 20180523.0-2 [89.7 kB]
Get:5 http://mirrors.edge.kernel.org/ubuntu focal/universe amd64 expect amd64 5.45.4-2build1 [137 kB]
Get:4 http://mirror.pnl.gov/ubuntu focal/universe amd64 tcl-expect amd64 5.45.4-2build1 [105 kB]
Get:6 http://mirror.lstn.net/ubuntu focal/main amd64 libio-pty-perl amd64 1:1.12-1 [32.4 kB]
Get:9 https://mirrors.bloomu.edu/ubuntu focal/main amd64 libtimedate-perl all 2.3200-1 [34.0 kB]
Get:8 http://la-mirrors.evowise.com/ubuntu focal/universe amd64 libtime-duration-perl all 1.21-1 [13.1 kB]
Get:3 http://mirrors.ocf.berkeley.edu/ubuntu focal/main amd64 tcl8.6 amd64 8.6.10+dfsg-1 [14.8 kB]
Get:10 http://mirrors.ocf.berkeley.edu/ubuntu focal/universe amd64 moreutils amd64 0.63-1 [60.5 kB]
Fetched 1,392 kB in 3s (464 kB/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60216

Reviewed By: seemethere

Differential Revision: D29214661

Pulled By: zhouzhuojie

fbshipit-source-id: ed2d85f8c0c23af4bcf33558c57472fcf9d913e8
2021-06-17 16:19:27 -07:00
3995fb1840 Add new_ones symbolic (#59255) (#59539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59539

Add new_ones symbolic in PT-ONNX exporter

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D29046603

Pulled By: SplitInfinity

fbshipit-source-id: e7420c7b543c33e3640e62461d08ff4d5843eda7

Co-authored-by: Shubham Bhokare <shubhambhokare@gmail.com>
2021-06-17 15:49:24 -07:00
ef1c107be5 [vulkan] Do not use memcmp to compare structs (#60199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60199

It isn't safe to use `memcmp` to determine the equality of structs, because padding bytes between fields of a struct have indeterminate contents. This can cause equality operators implemented via `memcmp` to return false when comparing structs with equivalent fields.

This bug appears to be responsible for the Vulkan backend crashing on WorkVC release builds.

Test Plan:
Run Vulkan unit tests:

```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Test on workvc rdk build, first ensure you are receiving the Vulkan models.
```
buck install fbsource//fbandroid/mode/opt fbsource//fbandroid/mode/aloha_build_rdk fbsource//fbandroid/mode/no_obfuscation fbandroid/buck-configs/buckconfig.caffe2_pkg_snpe_libs_android aloha_workvc_rdk --deep --show-full-output
```

Reviewed By: IvanKobzarev

Differential Revision: D29203177

fbshipit-source-id: e0ee79d4e635174e165b250f2cee842a09092df9
2021-06-17 15:20:30 -07:00
6d0fb85a62 Revert D28833086: beef up at::_ops API
Test Plan: revert-hammer

Differential Revision:
D28833086 (e2129d1c06)

Original commit changeset: 55f322a8378c

fbshipit-source-id: e55bf812ec411bb6bee87654f1d65ff10c046106
2021-06-17 14:28:32 -07:00
0cbb5e15d7 Correct backend in pipe_with_ddp_test (#60123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60123

All of the tests would run with gloo, but some tests specify a
different backend param which we should respect.
ghstack-source-id: 131688188

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D29171549

fbshipit-source-id: 3e306060df189c0e38d5ca6dd34f4b4fbca052b9
2021-06-17 13:43:01 -07:00
acd914f039 Fix Pipe + DDP for unused parameters, static graph (#60118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118

Pipe + DDP has a few issues:

1) with static graph, gradients are not synchronized on the first backward pass (i.e. the delay allreduce is not run). This has not worked since https://github.com/pytorch/pytorch/pull/55248
2) with find_unused_parameters=True, gradient synchronization also does not occur. This has not worked since https://github.com/pytorch/pytorch/pull/57081

The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.

To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908.

to test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is because all ranks received the same model inputs. So if gradient synchronization did not occur, then grads would still be the same because the model is the same on all ranks (guaranteed by ddp). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29167283

fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
2021-06-17 13:41:51 -07:00
2062cafaa5 [iOS GPU][MaskRCNN] Implement RoIAlign in Metal shaders using Sampler (#56075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56075

Inspired by the CUDA implementation - https://fburl.com/diffusion/e90tabkj. The main difference is the way we implement bilinear interpolation: CUDA does this manually by iterating over every point in each bin box, whereas Metal does it by calling the sampler's sample function, which is a bit easier and faster. The result is almost identical to the result from CPU - P365102522.

We'll do another round of refactor once we have figured out how to support custom ops on GPU.
ghstack-source-id: 131720620

Test Plan:
1. Circle CI
2. Sandcastle

Reviewed By: ajtulloch

Differential Revision: D27485068

fbshipit-source-id: 31e831aead9d3799a3fde96e99dd677d96bd3da1
2021-06-17 13:29:42 -07:00
e2129d1c06 beef up at::_ops API (#59115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59115

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator.

### Changes
For every op defined in native_functions.yaml, `at::_ops` previously defined an unambiguous function, e.g. `at::_ops::add_Tensor`: effectively an unambiguously named version of the C++ API that you could decltype() successfully because it has no overloads, exposed through a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`.

Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`):
```
struct TORCH_API add_Tensor {
  using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &);
  using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &);
  static constexpr const char* name = "aten::add";
  static constexpr const char* overload_name = "Tensor";
  static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor";
  static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
  static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
};
```

What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`.

### Motivation

There were two motivations for this change:

**Codegen refactor**
The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`.

This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files.

**Extra compile-time metadata**
In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload name of each op as arguments; we can just infer them.

The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples)
```
return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0);
```

**Characteristics of the `at::_ops` API**
(I also commented this in the codegen)

(1) It follows the Dispatcher API.

This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf if we want `at::_ops` to serve as the main implementation of the existing C++ APIs. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again.

(2) Overload names are disambiguated.

This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call)

(3) No argument defaulting is allowed.

This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though!

(4) manual_cpp_bindings and faithful names are not included in the API.

I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API).

**Details**

Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway.

The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to:
- Not allow defaulting in the `at::_ops` API
- Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`)

It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere.

Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`.

It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header.

**Perf**
I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined.

There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inlining means we aren't generating separate symbols for `at::add` and `at::_ops::add_Tensor::call`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D28833086

Pulled By: bdhirsh

fbshipit-source-id: 55f322a8378cb9a3cb6642f72aa291be381dd95b
2021-06-17 13:09:46 -07:00
462448f07a Enable GHA sharding on linux (#60124)
Summary:
This is branched off of https://github.com/pytorch/pytorch/issues/59970 to shard only on Linux so far (we're running into issues with Windows gflags).

This would enable sharding of tests on a few Linux jobs on GHA, allowing TTS (time to signal) to be essentially halved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60124

Reviewed By: zou3519

Differential Revision: D29204211

Pulled By: janeyx99

fbshipit-source-id: 1cc31d1eccd564d96e2aef14c0acae96a3f0fcd0
2021-06-17 13:00:23 -07:00
bbedfd913d Run an dummy rpc._all_gather in init_rpc to avoid shutdown timeout (#59801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59801

Fixes https://github.com/pytorch/pytorch/issues/59795.

The RPC calls in shutdown are no longer able to finish within 5s if
there are no other RPCs before `rpc.shutdown()` in that process,
because agent initialization can take longer than 5s. We didn't
have this problem previously, because TensorPipe's backend
registry used to use RPC to communicate CUDA devices in `init_rpc`.
However, after #58753, `init_rpc` uses ProcessGroup to communicate
devices, and hence the channels/transport could be uninitialized
after `init_rpc`.

Differential Revision: D29039238

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Pulled By: mrshenli

fbshipit-source-id: 46f89b01a058a51d271ddef9084a67b220a067b7
2021-06-17 11:47:54 -07:00
ebafd2aadf Stop warning on .names() access in max_pool2d and max_pool2d_backward (#60059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60059

Fixes #60053.

The problem is that `.names()` always triggers the named tensor warning.
To not trigger it, one has to guard it with has_names:
`x.has_names() ? x.names() : DimnameList{}`

This is not the first time this has happened; we should probably
make it so that .names() doesn't raise a warning unless it is actually
populated with names. That's a little tricky to implement so I'm leaving
it for the future.

Test Plan:
- New test, also run `python test/test_nn.py -v -k "max_pool"` and
confirm there are no warnings.

Reviewed By: gchanan

Differential Revision: D29152737

Pulled By: zou3519

fbshipit-source-id: 89a2fdbe6a6064a7044b5b75f7d0c58e51e57509
2021-06-17 10:34:41 -07:00
ef09428804 Revert D29104399: Port all kernel to structured kernels.
Test Plan: revert-hammer

Differential Revision:
D29104399 (7809494c68)

Original commit changeset: 18bb747b7a19

fbshipit-source-id: f57043df5646f1e675e8a555cb4fa0e436953751
2021-06-17 10:32:23 -07:00
3ff5507fb0 Revert D29104395: Port any kernel to structured kernels.
Test Plan: revert-hammer

Differential Revision:
D29104395 (519698362d)

Original commit changeset: 0cfde57c22ba

fbshipit-source-id: ac5ebdc4b9d3aeb4c5eeab55c92ac931599d39d1
2021-06-17 10:32:21 -07:00
81baa7fb0d Revert D29104398: Using meta checks for unary torch.all and torch.any.
Test Plan: revert-hammer

Differential Revision:
D29104398 (c078cefa7d)

Original commit changeset: 6771b80130c9

fbshipit-source-id: 10e5a34370113fcd2f87aea2c2e76108fa9328d8
2021-06-17 10:32:20 -07:00
873dac4b5a Revert D29104397: Port argmax to structured kernels.
Test Plan: revert-hammer

Differential Revision:
D29104397 (6f3da4f4bf)

Original commit changeset: 580355cf3b4e

fbshipit-source-id: e51fb79329066bc1a6364cfa44a8732908a684ed
2021-06-17 10:32:18 -07:00
6b5e77904f Revert D29104396: Port argmin kernel to structured kernels.
Test Plan: revert-hammer

Differential Revision:
D29104396 (226d745a0b)

Original commit changeset: 39c59bcc0446

fbshipit-source-id: 82de26f925a885f65572a785fa45a9980d3a974b
2021-06-17 10:31:06 -07:00
3dc8112187 [NNC] Handle int64 indices and loop bounds (#59769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59769

Allow loop bounds and tensor indices to be either int32 or int64, and avoid unnecessary cast ops.

Test Plan:
```
build/bin/test_tensorexpr
```

Reviewed By: H-Huang

Differential Revision: D29173970

Pulled By: desertfire

fbshipit-source-id: 859a876ddb1b41535b2266089aa1222884295c78
2021-06-17 09:35:59 -07:00
96b3537e71 [NNC] Add a dtypeToCppString virtual method in IRPrinter (#59449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449

Make dtypeToCppString a virtual method so that a child
class can easily override the dtype string generation rule. This is
needed as preparation for making loop and tensor indices int64_t.

Test Plan:
```
build/bin/test_tensorexpr
```

Reviewed By: H-Huang

Differential Revision: D29173969

Pulled By: desertfire

fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776
2021-06-17 09:34:58 -07:00
ed1da5be21 PG NCCL cleanup: remove usage of completed_ in WorkNCCL copies (#59899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59899

Test Plan: Imported from OSS

Reviewed By: cbalioglu, osalpekar

Differential Revision: D29080299

Pulled By: agolynski

fbshipit-source-id: 9ae368f91e81f19471e0a20fc913d8e9df1b9dec
2021-06-17 09:05:35 -07:00
010f4b6f2d Add .isort.cfg (#60119)
Summary:
This adds the `.isort.cfg` file from https://github.com/pytorch/pytorch/issues/55928, but doesn't try to enforce it in CI because, as that PR showed, that is currently difficult to do. We could use this to gradually sort the codebase according to this configuration (enforcing bits and pieces in CI), but I don't do that here.

The advantage of including this file (even if we don't enforce it) is that it affects how certain tools work, thus encouraging a specific import style for people who happen to use those tools.
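
For illustration, a hypothetical `.isort.cfg` of the kind described here (these keys are standard isort settings; the actual file's contents may differ):

```
[settings]
line_length = 88
multi_line_output = 3
include_trailing_comma = True
known_first_party = torch
```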

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60119

Test Plan: Open `test/run_test.py` in VS Code and run the **Python Refactor: Sort Imports** command. Compare with and without this PR.

Reviewed By: 1ntEgr8

Differential Revision: D29199504

Pulled By: samestep

fbshipit-source-id: 83e937b0f517c60e3e7dedb6c0306173908fbbb0
2021-06-17 09:04:25 -07:00
226d745a0b Port argmin kernel to structured kernels. (#59938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59938

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29104396

Pulled By: ezyang

fbshipit-source-id: 39c59bcc044649c1ec9c9685366c4dda87f76aa7
2021-06-17 08:18:13 -07:00
6f3da4f4bf Port argmax to structured kernels. (#59937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59937

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29104397

Pulled By: ezyang

fbshipit-source-id: 580355cf3b4e9e5c934b4e51a16196087bcb3459
2021-06-17 08:18:12 -07:00
c078cefa7d Using meta checks for unary torch.all and torch.any. (#59373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59373

This PR makes use of the newly implemented unified `at::meta::check_reduction` for
validating the inputs and configuring its `TensorIterator`.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29104398

Pulled By: ezyang

fbshipit-source-id: 6771b80130c91c2f1360853127de0acebcfff183
2021-06-17 08:18:10 -07:00
519698362d Port any kernel to structured kernels. (#59372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59372

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29104395

Pulled By: ezyang

fbshipit-source-id: 0cfde57c22ba88607945c98f28b18df7709becd0
2021-06-17 08:18:08 -07:00
7809494c68 Port all kernel to structured kernels. (#59371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59371

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29104399

Pulled By: ezyang

fbshipit-source-id: 18bb747b7a19d873427d52c1145ef7cede333a0e
2021-06-17 08:16:41 -07:00
b8ab98626b only runs mem leak check on master (#60023)
Summary:
Sets an environment variable to only do the CUDA mem leak check on master CI jobs.

See discussion in https://github.com/pytorch/pytorch/pull/59402#issuecomment-860773034

See stats before/after disabling mem leak check: https://github.com/pytorch/pytorch/pull/59942#issuecomment-860947095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60023

Test Plan:
https://github.com/pytorch/pytorch/issues/60108
https://github.com/pytorch/pytorch/issues/60116

Reviewed By: janeyx99

Differential Revision: D29164182

Pulled By: walterddr

fbshipit-source-id: dfe88c2c1275b6eb35f18b58aacdc220f34ccb59
2021-06-17 07:56:26 -07:00
59b10036d5 Unifies OpInfo dtype tests (#60157)
Summary:
Simplifies the OpInfo dtype tests and produces nicer error messages, like:

```
AssertionError: Items in the first set but not the second:
torch.bfloat16
Items in the second set but not the first:
torch.int64 : Attempted to compare [set] types: Expected: {torch.float64, torch.float32, torch.float16, torch.bfloat16}; Actual: {torch.float64, torch.float32, torch.float16, torch.int64}.
The supported dtypes for logcumsumexp on cuda according to its OpInfo are
        {torch.float64, torch.float32, torch.float16, torch.int64}, but the detected supported dtypes are {torch.float64, torch.float32, torch.float16, torch.bfloat16}.
        The following dtypes should be added to the OpInfo: {torch.bfloat16}. The following dtypes should be removed from the OpInfo: {torch.int64}.
```
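
The two trailing sets in that message are just set differences between what the OpInfo claims and what the test detects (a sketch, not the test's actual code):

```python
import torch

opinfo_dtypes = {torch.float64, torch.float32, torch.float16, torch.int64}
detected      = {torch.float64, torch.float32, torch.float16, torch.bfloat16}

to_add    = detected - opinfo_dtypes   # {torch.bfloat16}: add to the OpInfo
to_remove = opinfo_dtypes - detected   # {torch.int64}: remove from the OpInfo
```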

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60157

Reviewed By: ngimel

Differential Revision: D29188665

Pulled By: mruberry

fbshipit-source-id: e84c9892c6040ea47adb027cfef3a6c0fd2f9f3c
2021-06-17 06:34:54 -07:00
4caca7a15b Improved torch.einsum testing and fixed bug (#59731)
Summary:
Improved torch.einsum testing and fixed a bug where lower case letters appeared before upper case letters in the sorted order, which is inconsistent with NumPy.
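
Assuming the ASCII reading of that ordering (uppercase sorts before lowercase), a minimal NumPy illustration of the behavior being matched:

```python
import numpy as np

a = np.ones((2, 3))
# Implicit mode reorders output axes by sorted subscript labels; 'B'
# sorts before 'a' in ASCII, so the result is the transpose:
print(np.einsum('aB', a).shape)  # (3, 2)
```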

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59731

Reviewed By: SplitInfinity, ansley

Differential Revision: D29183078

Pulled By: heitorschueroff

fbshipit-source-id: a33980d273707da2d60a387a2af2fa41527ddb68
2021-06-17 04:48:47 -07:00
eb36f67dcc [TensorExpr] Minor cleanup in TensorExprKernel::computeValue (#60041)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60041

Differential Revision: D29146709

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 49ac919c18f669d7fda1a26c5a74e62ea752df4f
2021-06-17 01:23:24 -07:00
6b1712019a Revert D29132955: Pass RequestCallback to FaultyPG RPC agent
Test Plan: revert-hammer

Differential Revision:
D29132955 (cbbb7e145e)

Original commit changeset: bb7554b84bcb

fbshipit-source-id: 4dfa2fbe7b8f58c951991c79aa9e2aa819793013
2021-06-17 00:50:32 -07:00
3c3bb91103 Revert D29132956: Add some TORCH_API annotations to RPC
Test Plan: revert-hammer

Differential Revision:
D29132956 (04ec122868)

Original commit changeset: 8637640d56a1

fbshipit-source-id: f497adcbfd5a6b5a46b8689b1943ae2687ea737b
2021-06-17 00:50:30 -07:00
f233274f30 Revert D28875276: Move RPC agents to libtorch
Test Plan: revert-hammer

Differential Revision:
D28875276 (fc50f91929)

Original commit changeset: f2f6970fd74d

fbshipit-source-id: 3c52af652579733ebea8ddfb06576a0ce262bf78
2021-06-17 00:48:58 -07:00
e5c99d9908 Revert D29147009: [pytorch][PR] refine disabled test
Test Plan: revert-hammer

Differential Revision:
D29147009 (5fd6ead097)

Original commit changeset: 37e01ac6e8d6

fbshipit-source-id: e9cd819fd819e3d653deda3b7a981c39ec0452f4
2021-06-17 00:45:21 -07:00
a0ad4c24d1 MAINT Migrates rrelu_with_noise from THC to ATen on Cuda (#57864)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24618
Related to https://github.com/pytorch/pytorch/issues/24507

<details><summary>Benchmark script:</summary>

```py
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```

</details>

### Results from benchmark:

#### This PR

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)
```

#### On master

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57864

Reviewed By: H-Huang

Differential Revision: D29177169

Pulled By: ngimel

fbshipit-source-id: 4572133db06f143d27e70a91ade977ea962c8f77
2021-06-17 00:35:16 -07:00
9e79a8a54f [iOS GPU][MaskRCNN] Force the temporaryImage to become static when doing synchronization (#60155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60155

For intermediate tensors, we need to convert them to static images when doing GPU -> CPU synchronization.
ghstack-source-id: 131540760

Test Plan:
- CI
- buck test pp-macos

Reviewed By: SS-JIA

Differential Revision: D29126278

fbshipit-source-id: cd50b5f104e0161ec7fcfcc2c51785f241e48704
2021-06-17 00:25:14 -07:00
0e7b5ea6c0 nonzero: Default to transposed output strides (#59370)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46224

cc ailzhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59370

Reviewed By: ezyang

Differential Revision: D29143842

Pulled By: ngimel

fbshipit-source-id: 5aa7a247b4a70cd816d0eed368ab4c445568c986
2021-06-16 22:50:38 -07:00
c0b7c59e55 [quant] Equalization Observer modifications (#59953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59953

The following modifications were made to the equalization
observers due to design changes:
- [InputEqualizationObserver] Replaced `calculate_qparams()` with
`calculate_scaled_minmax()` since we will need to return the scaled
min/max values to update the following input quantization observer
- [WeightEqualizationObserver] We no longer need a row observer since
this will be taken care of by the following weight quantization observer
- [WeightEqualizationObserver] Following the previous comment, we no
longer need to calculate the scaled qparam values. Instead, we will use
the equalization scale to later scale the weights (see the sketch
below), and the qparams will be taken care of by the weight
quantization observer.
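
As a sketch of the equalization-scale idea (generic cross-layer equalization math, not this PR's code), scaling the input down and the weights up by a per-channel factor leaves the linear output unchanged while balancing the two ranges:

```python
import torch

x = torch.tensor([1.0, 2.0, 4.0])      # activations, 3 input channels
w = torch.tensor([[0.5, -1.0, 0.25]])  # weights, shape (out=1, in=3)

# One common per-channel choice: s_j = sqrt(range_x_j / range_w_j)
s = torch.sqrt(x.abs() / w.abs().amax(dim=0))

# (x / s) @ (s * w)^T == x @ w^T, so equalization preserves the output
print(torch.allclose((x / s) @ (s * w).t(), x @ w.t()))  # True
```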

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_eq_observer`

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29135332

fbshipit-source-id: be7e468273c8b62fc183b1e1ec50f6bd6d8cf831
2021-06-16 22:32:30 -07:00
45c31cabb5 [quant] Input Weight Equalization - prepare modifications (#59747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59747

Modifies prepare_fx for input-weight equalization. If a current
node is being equalized (there exists an EqualizationQConfig), then the
EqualizationObserver will be inserted before its quantization observer.

For a singular linear layer, the general flow looks like:
Original graph: `x0 -> linear -> x1`, `w -> linear`
After prepare: `x0 -> InpEqObs -> MinMaxObs -> linear1 -> MinMaxObs -> x1`
  `w -> WeightEqObs -> MinMaxObs -> linear1`

For two connected linear layers, the general flow looks like:
Original graph: `x0 -> linear1 -> linear2 -> x1`,
  `w1 -> linear1`, `w2 -> linear2`
After prepare: `x0 -> InpEqObs -> MinMaxObs -> linear1 -> MinMaxObs -> InpEqObs -> linear2 -> MinMaxObs -> x1`
  `w1 -> WeightEqObs -> MinMaxObs -> linear1`, `w2 -> WeightEqObs -> MinMaxObs -> linear2

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_equalization_prepare`

Original model with one `nn.Linear` layer
```
LinearModule(
  (linear): Linear(in_features=1, out_features=1, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```
--------------------------------------

Original model with two connected functional linear layers
```
FunctionalLinearModule(
  (linear1): Linear()
  (linear2): Linear()
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_equalization_process_0 : [#users=1] = call_module[target=linear1_w_equalization_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_00](args = (%linear1_w_equalization_process_0,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0, %linear1_w_activation_post_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    %linear_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0_equalization_process_0](args = (%linear_activation_post_process_0,), kwargs = {})
    %linear2_w : [#users=1] = get_attr[target=linear2.w]
    %linear2_w_equalization_process_0 : [#users=1] = call_module[target=linear2_w_equalization_process_0](args = (%linear2_w,), kwargs = {})
    %linear2_w_activation_post_process_0 : [#users=1] = call_module[target=linear2_w_activation_post_process_00](args = (%linear2_w_equalization_process_0,), kwargs = {})
    %linear2_b : [#users=1] = get_attr[target=linear2.b]
    %linear_1 : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%linear_activation_post_process_0_equalization_process_0, %linear2_w_activation_post_process_0), kwargs = {bias: %linear2_b})
    %linear_1_activation_post_process_0 : [#users=1] = call_module[target=linear_1_activation_post_process_0](args = (%linear_1,), kwargs = {})
    return linear_1_activation_post_process_0
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29135316

fbshipit-source-id: 91697e805ede254dbb2a42ee4c23eb1c1c64590e
2021-06-16 22:32:28 -07:00
7ce74f3339 [quant] EqualizationQConfig to distinguish input/output activations (#59739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59739

Created an EqualizationQConfig specifically for equalization.
This inherits from QConfig and is used to distinguish between inserting
an input observer with an output observer. Since the output observer
field is included in the EqualizationQConfig, we no longer need an
output observer field in the _InputEqualizationObserver

Test Plan:
compiles

Imported from OSS

Reviewed By: ezyang

Differential Revision: D29135298

fbshipit-source-id: 3dde9c029c291467ff0a0845f0fc9c44573fc6f6
2021-06-16 22:31:18 -07:00
c6cdb4f113 Refactor ZeroRedundancyOptimizer Assuming SPSD (#59834)
Summary:
**Overview:**
This refactors the `ZeroRedundancyOptimizer` implementation to assume single-process single-device (SPSD) instead of accommodating single-process multiple-device (SPMD). `DistributedDataParallel` [retired SPMD recently](https://github.com/pytorch/pytorch/issues/47012), so this change follows the same spirit.

**Changes:**
The parent-class `Optimizer` constructor permits the input argument `params` to be either an `iterable` of `torch.Tensor` or an `iterable` of `dict`. The latter usage is for initializing the optimizer with multiple `param_group`s to start. However, currently, `ZeroRedundancyOptimizer` only supports the former usage, requiring explicit calls to `add_param_group()` for multiple `param_group`s. Given the existing implementation, the type error would be silent and not manifest until much later (e.g. since `super().__init__()` would have no issue). Hence, I added a series of checks at the beginning of the `__init__()` function (encapsulated in `_verify_and_init_params()`). A postcondition of this validation is that `self._all_params` is a non-empty list of all model parameters.

Additionally, I added a check for SPSD usage assuming that all model parameters exist on the same device. This logic is included in `_verify_same_param_device()` and is called immediately after the `params` type-checking.  Support for SPSD with model parameters sharded across devices may be added in the future.

Related to that aforementioned post-condition on `self._all_params`, previously there was undefined behavior resulting from different typing of the passed in `params` input argument. If `params` was a `List`, then the usage of `self._reference_is_trainable_mask` was as expected. However, if `params` was a generator (e.g. as in the canonical usage of passing `model.parameters()`), then the ensuing behavior was divergent. This is because after a generator is iterated over, it is empty. As a result, when we set `self._all_params = params` [in the old code](68d690ffbd/torch/distributed/optim/zero_redundancy_optimizer.py (L165)), `self._all_params` is empty, reducing `training_mask` to always be the empty list. This causes missed calls to `_update_trainable()` in `step()`. (A consequence of this is that `test_pytorch_parity()`, which is renamed to `test_local_optimizer_parity()`, now outputs warnings about the trainable parameters changing.)
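
The generator pitfall described above is easy to reproduce in isolation (a minimal sketch):

```python
import torch

model = torch.nn.Linear(2, 2)
params = model.parameters()   # a generator, as in the canonical usage
first = list(params)          # consumes it: [weight, bias]
second = list(params)         # [] -- the generator is empty once iterated
```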

The existing implementation assumes that all parameters share the same dense type when allocating the bucket buffers. This change preserves this assumption, which may be removed in the future. I added a check for this in `_verify_same_dense_param_type()` to avoid erroring silently later on. Note that it is insufficient to simply check for the same `dtype` since dense and sparse tensors may share the same `dtype` but require differing storage sizes. One solution is to use `torch.typename()` as the means for comparison.

 ---

The primary change in this refactor is with respect to `self._per_device_params` and `self.buckets`. `self._per_device_params` mapped `torch.device` to `List[List[Parameter]]`. The keys were the devices that the model parameters exist on, and the values designated which ranks are assigned to updating those parameters. `self.buckets` mapped `torch.device` to `List[torch.Tensor]`. The keys were the same as `self._per_device_params`, and the values were the buckets for that device. The usage of these two data structures were confined to each other only. Hence, because the notions of device and rank are now in 1:1 correspondence, we can eliminate the former completely and only use rank. As such, I removed `self._per_device_params` and made `self.buckets` directly a list of buckets (i.e. `torch.Tensor`s).

Iteration over the parameters of a rank for a given device could be simplified to just iteration over the parameters of a rank. Hence, I relied on `self.partition_parameters()` now for that iteration. Refer to `_setup_flat_buffers()` and `step()` for these changes.

One convenient side effect of removing `self._per_device_params` is that there is no longer the re-computation of the parameter partitions mentioned at the end of this [PR](https://github.com/pytorch/pytorch/pull/59410).

 ---

I changed the data structure `self._index_to_param_cache` from a `dict` to a `List` because the domain is `0`, `1`, ..., `k-1` where `k` is the number of parameters. This should yield marginal improvements in memory usage and access speed.

`_sync_param_groups()` is a static method, meaning it can be called either via `self._sync_param_groups()` or `ZeroRedundancyOptimizer._sync_param_groups()` when inside the class. I made the usage consistently `self._sync_param_groups()` rather than have instances of both.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59834

Test Plan:
I ran through the existing test suite on an AI AWS cluster:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
Note: The only test where `parameters_as_bucket_view` is `True` is `test_step_with_closure()`, meaning that that is the test that exercises the core changes of removing `self._per_device_params` and changing `self.buckets`.

Also, I added tests for the `ZeroRedundancyOptimizer` constructor changes and the assumption checks.

Reviewed By: mrshenli

Differential Revision: D29177065

Pulled By: andwgu

fbshipit-source-id: 0ff004ae3959d6d3b521024028c7156bfddc93d8
2021-06-16 20:52:13 -07:00
85517a2b70 [TensorExpr] More python binding cleanups (#60058)
Summary:
A few more quality of life improvements for NNC's python bindings:
- Use standard `torch.dtype`s (rather than `te.Dtype`)
- Make names optional (they don't seem to matter)
- Make shapes optional
- A few implicit conversions to make code cleaner

Followup to https://github.com/pytorch/pytorch/issues/59920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60058

Reviewed By: bertmaher

Differential Revision: D29151953

Pulled By: jansel

fbshipit-source-id: c8286e329eb4ee3921ca0786e17248cf6a898bd8
2021-06-16 20:06:08 -07:00
c01939a9b1 [JIT] Handle modules that already have __constants__ (#60003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60003

**Summary**
`infer_concrete_type_builder` in `_recursive.py` assumes `__constants__`
is a `set` if it exists as an attribute on the module being scripted.
Instead, it should create a set out of whatever `__constants__` is.

**Test Plan**
Ran code from the issue.

**Fixes**
This commit fixes #59947.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D29174243

Pulled By: SplitInfinity

fbshipit-source-id: aeb8bded80038da35478714b6a697a766ac447f5
2021-06-16 20:01:18 -07:00
d99a8a31b1 Fix version comparison for defining CUDA11OrLater (#60010)
Summary:
Before this PR `CUDA11OrLater` was incorrectly set to `False` when `torch.version.cuda == "11.0"`.
`torch.version.cuda` returns major and minor CUDA versions, it doesn't return patch info.
LooseVersion comparison was calling `[11, 0] >= [11, 0, 0]` which evaluates to `False`.
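
The comparison pitfall is easy to see in isolation, along with one possible fix (a sketch, not the PR's exact change):

```python
from distutils.version import LooseVersion

# Lexicographic list comparison: with an equal prefix, the shorter list
# is smaller, so [11, 0] >= [11, 0, 0] is False.
print(LooseVersion("11.0") >= LooseVersion("11.0.0"))  # False

# A sketch of a check that uses only major/minor components:
def cuda11_or_later(version_str):
    major, minor = (int(v) for v in version_str.split(".")[:2])
    return (major, minor) >= (11, 0)

print(cuda11_or_later("11.0"))  # True
```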

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60010

Reviewed By: mruberry

Differential Revision: D29147107

Pulled By: ezyang

fbshipit-source-id: bd9ed076337b4d32bf1c3376b8f7ae15dbc4d08d
2021-06-16 18:04:29 -07:00
c458bb985e make it easier to grep for unary/binary op kernels (#60128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60128

Test Plan: Imported from OSS

Reviewed By: wenleix

Differential Revision: D29175499

Pulled By: bdhirsh

fbshipit-source-id: 1838900276e0b956edf25cdddcff438ff685a50e
2021-06-16 17:49:21 -07:00
3288c9d304 [numpy] mvlgamma: int -> float promotion (#59934)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Last int->float promotion as per the tracker!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59934

Reviewed By: H-Huang

Differential Revision: D29160008

Pulled By: mruberry

fbshipit-source-id: 389a5a7683e0c00d474da913012768bf2a212ef0
2021-06-16 17:44:20 -07:00
f65793507d [fx][Transformer] Add override for call_function (#60057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60057

This ensures that if a function was `wrap`'d before symbolic tracing and then passed into the transformer, it will still be wrapped.
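
A minimal sketch of the behavior being preserved (the helper name is hypothetical; `torch.fx.wrap` keeps the call as a `call_function` node):

```python
import torch
import torch.fx

def double_len(x):
    # hypothetical helper we don't want traced through
    return 2 * len(x)

torch.fx.wrap("double_len")  # registered at module level

class M(torch.nn.Module):
    def forward(self, x):
        return x + double_len(x.shape)

traced = torch.fx.symbolic_trace(M())
# This change makes sure the wrapped call also survives a Transformer pass:
transformed = torch.fx.Transformer(traced).transform()
print(transformed.graph)
```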

Test Plan: Added test to `test_fx.py`

Reviewed By: jamesr66a

Differential Revision: D29151191

fbshipit-source-id: 93560be59505bdcfe8d4f013e21d4719788afd59
2021-06-16 17:25:55 -07:00
cyy
5f017e91b8 don't use moved field in the second lambda (#59914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59914

Reviewed By: H-Huang

Differential Revision: D29147018

Pulled By: ezyang

fbshipit-source-id: 04fe52fb8cf3cc8f3a538a2dddb13c52cf558549
2021-06-16 17:22:15 -07:00
64aec8d2ca [testing] OpInfoHelper tool (#58698)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/57577

Usage:
Add an OpInfo entry to `common_methods_invocations` with `dtypes=_DYNAMIC_DTYPES`.
E.g.
```
OpInfo('atan2',
        dtypes=_DYNAMIC_DTYPES,
        sample_inputs_func=sample_inputs_atan2,)
```

Run the helper with `python -m torch.testing._internal.opinfo_helper`

Output
```
OpInfo(atan2,
       # hint: all_types + (torch.bool,),
       dtypes=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool],
       # hint: all_types + (torch.bool, torch.bfloat16, torch.float16),
       dtypesIfCUDA=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool, torch.bfloat16, torch.float16],
       sample_inputs_func=sample_inputs_atan2)
```

Output without CUDA (run with `$ CUDA_VISIBLE_DEVICES=-1 python -m torch.testing._internal.opinfo_helper`)
```
UserWarning: WARNING: CUDA is not available, information pertaining to CUDA could be wrong
  warnings.warn("WARNING: CUDA is not available, information pertaining to CUDA could be wrong")
OpInfo(atan2,
       # hint: all_types + (torch.bool,),
       dtypes=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool],
       sample_inputs_func=sample_inputs_atan2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58698

Reviewed By: H-Huang

Differential Revision: D29160668

Pulled By: mruberry

fbshipit-source-id: 707370a83b451b02ad2fe539775c8c50ecf90be8
2021-06-16 17:17:03 -07:00
0bf1260795 Fix Python 3.8 expecttest machinery again, this time for good. (#60044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60044

In #59709 I attempted to fix the expecttest machinery to work in Python
3.8.  However, I noticed that it would fail to do substitutions in this
case:

```
    self.assertExpectedInline(
        foo(),
        """bar"""
    )
```

This is because the triple quoted string is not on the same line as the
backtrace line number (at the very beginning), and for safety reasons
the preexisting regex refused to search beyond the first line.  This
wasn't a big deal prior to Python 3.8 because the flipped version of
the regex simply required the triple quoted string to be flush with
the end of the statement (which it typically was!)  But it is a big deal
now that we only have the start of the statement.

I couldn't think of a way to fix this in the current model, so I decided
to call in the big guns.  Instead of trying to do the regex with only
the start xor end line number, I now require you provide BOTH line numbers,
and we will only regex within this range.  The way we compute these line
numbers is by parsing the Python test file with ast, and then searching
through statements until we find one that is consistent with the line
number reported by the backtrace.  If we don't find anything, we
conservatively assume that the string lies exactly in the backtrace
(and you'll probably fail the substitution in that case.)

The resulting code is quite a lot simpler (no more reversed regex) and
hopefully more robust, although I suppose we are going to have to do
some field testing.
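
A minimal sketch of the line-range idea (not the actual expecttest code; `end_lineno` is available on AST nodes from Python 3.8):

```python
import ast

def statement_bounds(source, lineno):
    """Find the first statement whose line range contains `lineno`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt) and node.lineno <= lineno <= node.end_lineno:
            return node.lineno, node.end_lineno
    # Conservative fallback: assume the string sits exactly on that line.
    return lineno, lineno
```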

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29146943

Pulled By: ezyang

fbshipit-source-id: 2c24abc3acd4275c5b3a8f222d2a60cbad5e8c78
2021-06-16 17:10:16 -07:00
dab1e59652 Remove dead code in SavedVariable (#59838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59838

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29069214

fbshipit-source-id: 5debf93a6c3d1c3d585efbe54438e8df92646d62
2021-06-16 16:44:16 -07:00
1efa863837 Avoid un-necessary unwrapping of Tensor in SavedVariable (#59837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59837

Fixes #58500

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29069215

fbshipit-source-id: 603db3c8a64b729e86385ed774825f01c6ce0f20
2021-06-16 16:43:04 -07:00
5948e6f653 removed gelu from autocast fp32 list (#59639)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59639

Reviewed By: H-Huang

Differential Revision: D29155914

Pulled By: ezyang

fbshipit-source-id: feb117181894c2355768d5b1189b3d5f1649fc0b
2021-06-16 16:29:57 -07:00
a95207dad4 [quant] Add a quantize_per_tensor overload that takes Tensor quantization parameters (#59773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59773

The current quantize_per_tensor takes a float scale and an int zero_point, which does not work with Proxy,
so this PR adds a quantize_per_tensor overload that takes a Tensor scale and zero_point instead.
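
A minimal sketch of calling the new overload directly (assuming the Tensor-qparams overload described above):

```python
import torch

x = torch.randn(4)
scale = torch.tensor(0.1)
zero_point = torch.tensor(0)
q = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)
print(q.q_scale(), q.q_zero_point())
```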

Test Plan:
Tested locally that the following runs without errors:

```python
import torch
from torch.quantization.quantize_fx import prepare_fx, convert_fx
from torch.fx.experimental import normalize

class TestModule(torch.nn.Module):
    def forward(self, x):
        return x + x

mod = TestModule()
mod.eval()
config = {"": torch.quantization.get_default_qconfig("fbgemm")}
mod = prepare_fx(mod, config)
mod = convert_fx(mod)
mod = torch.fx.Transformer(mod).transform()
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29019862

fbshipit-source-id: c0176040f3b73f0a30516ed17d261b44cc658407
2021-06-16 16:07:20 -07:00
5686fe5817 Revert D29154971: Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Test Plan: revert-hammer

Differential Revision:
D29154971 (9f68f93aca)

Original commit changeset: d534d830020f

fbshipit-source-id: a3d16acc8e6b66a6010b501c28dbe295f573bc86
2021-06-16 15:33:14 -07:00
4c8c61f200 Some fixes to vec256_bfloat16.h (#59957)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59957

Test Plan: Sandcastle

Reviewed By: VitalyFedyunin

Differential Revision: D29073913

fbshipit-source-id: dc01a2015e4ff42daa1d69443460182744c06e90
2021-06-16 15:17:15 -07:00
8ce6d0c42f [torch deploy] add register_module_source (#58290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58290

This is a helper function to get some Python source code loaded
on each interpreter without having to use the standard import system
or packages. It is useful for debugging or for writing wrapper classes
for handling loaded modules.

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D28435306

Pulled By: zdevito

fbshipit-source-id: b85c16346b9001cd7350d65879cb990098060813
2021-06-16 14:41:13 -07:00
fd1e9253ff [Profiler] Fix timestamp discrepancy in profiler_kineto.cpp (#60070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60070

PyTorch pull request https://github.com/pytorch/pytorch/pull/57333 changed high_resolution_clock to system_clock but missed one location in profiler_kineto.cpp.

On some platforms (e.g. Windows), high_resolution_clock and system_clock do not map to the same underlying clock and therefore we get mixed timestamps on some platforms.

Reviewed By: wesolwsk

Differential Revision: D29155809

fbshipit-source-id: a6de6b4d550613f26f5577487c3c53716896e219
2021-06-16 14:25:24 -07:00
9d7764642b Use GitHub's diff directly in clang-tidy (#60048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60048

This changes clang-tidy in lint.yml to pull the raw diff from GitHub and parse that rather than use the PR's base revision. The base revision can cause the spurious inclusion of files not changed in the PR, as in https://github.com/pytorch/pytorch/pull/59967/checks?check_run_id=2832565901. We could be smarter about how we query git, but this approach ends up being simpler since we just need to search for the diff headers in the .diff file.
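
A sketch of the header-scanning idea (hypothetical helper, not the lint.yml code):

```python
import re
import urllib.request

def changed_files(diff_url):
    """Download a raw .diff and return the paths named in its headers,
    e.g. lines like 'diff --git a/tools/foo.py b/tools/foo.py'."""
    diff = urllib.request.urlopen(diff_url).read().decode("utf-8", "replace")
    return sorted(set(re.findall(r"^diff --git a/(\S+) b/", diff, re.MULTILINE)))
```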

See https://github.com/pytorch/pytorch/pull/60049/checks?check_run_id=2834140350 for an example CI run with this on

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29148886

Pulled By: driazati

fbshipit-source-id: ca23446d5cc8938d1345f272afe77b9ee8898b74
2021-06-16 13:40:09 -07:00
b2fc6de2c4 support parsing of PR stats in run_test.py (#60026)
Summary:
Currently, S3 test stats don't support PR stats parsing.

Changes to s3_stats_parser:
1. They are uploaded to `test_times/{sha1}/{job}` and `pr_test_times/{pr}/{sha1}/{job}` separately, so we need parsing logic for both.
2. We need to attach a time when parsing PR stats for ordering, since PR commits can be force-pushed.

Changes to run_test.py:
1. Reordering based on previous PR stats if available (see the sketch below).
2. Falling back to the file-change option if not enabled.
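
A sketch of the reordering idea (the data shape is hypothetical, not run_test.py's actual code):

```python
def reorder_by_pr_history(tests, pr_stats):
    """Run tests that failed on an earlier commit of this PR first,
    keeping the original order otherwise (sorted() is stable)."""
    failed_before = {t for t, stats in pr_stats.items() if stats.get("failures")}
    return sorted(tests, key=lambda t: t not in failed_before)

print(reorder_by_pr_history(
    ["test_a", "test_b", "test_c"],
    {"test_c": {"failures": 2}},
))  # ['test_c', 'test_a', 'test_b']
```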

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60026

Test Plan:
- CI.
- local repro: please run:
```
CIRCLE_JOB="pytorch_linux_bionic_py3_6_clang9_noarch_test" CIRCLE_PR_NUMBER=60057 IN_CI=1 ENABLE_PR_HISTORY_REORDERING=1 python test/run_test.py
```

Reviewed By: samestep

Differential Revision: D29164754

Pulled By: walterddr

fbshipit-source-id: 206688e0fb0b78d1c9042c07243da1fbf88a924b
2021-06-16 13:32:31 -07:00
691183bb74 Fix compile failure on CUDA92 (#60017)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60016

For CUDA 9.2:
- OptionalBase was not checking if `is_arrayref`
- constexpr functions don't seem to be able to raise an exception on CUDA 9.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60017

Reviewed By: malfet

Differential Revision: D29139515

Pulled By: ejguan

fbshipit-source-id: 4f4f6d9fe6a5f2eadf913de0a9781cc9f2e6ac6f
2021-06-16 12:23:08 -07:00
15dbc566c5 [torch][segment_reduce] Add missing cuda kernel launch check (#60114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60114

Same as title.

Test Plan: Unit test (test_kernel_launch_checks.py) is passing.

Reviewed By: ngimel

Differential Revision: D29169538

fbshipit-source-id: ba4518dcb1a4713144d92faec2bb5bdf656ff7c5
2021-06-16 12:19:12 -07:00
2c5db9a40a Add c10d filestore functionality to the current c10d_rendezvous_backend (#59719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719

Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates there is not a fixed number of workers; in this case, no attempt is made to clean up the file at the end.

Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.

Reviewed By: cbalioglu, H-Huang

Differential Revision: D28997436

fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
2021-06-16 12:13:36 -07:00
84688b0c40 ci: Add note about file_diff_from_base for GHA (#60110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60110

file_diff_from_base is currently bugged for ghstack PRs since it fails
to find a merge base.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29168767

Pulled By: seemethere

fbshipit-source-id: 580a909aa392541769cbbfdc6acce1e6c5d1c341
2021-06-16 11:31:02 -07:00
15f236f3e3 [package] fix tutorial link (#60113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60113

The tutorial link in the docs was to an fb-only colab.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D29169818

Pulled By: suo

fbshipit-source-id: 374807c234a185bd515b8ffe1300e6cf8d821636
2021-06-16 11:27:25 -07:00
9f68f93aca Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Summary: We updated the training scripts and re-trained the Resnext model with msuru_suru_union and ig_msuru_suru_union datasets

Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*

Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".

Experiments:
- msuru_suru_union f279939874
    - Train/test split
        - msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
        - msuru_suru_union_dataset_test_w_shard: 1,831,236  rows
    - Results
       {F625232741}
       {F625232819}
- ig_msuru_suru_union f279964200
    - Train/test split
        - ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
        - ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
    - Results
{F625234126}
{F625234457}

Differential Revision: D29154971

fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
2021-06-16 11:22:50 -07:00
8c4e78129e .circleci: Disable Windows GPU jobs (#60024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60024

Disables windows GPU jobs on CircleCI since they have been migrated to
GHA

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29137287

Pulled By: seemethere

fbshipit-source-id: 204e0c9232201a36a557cd0843e31d34269cc722
2021-06-16 10:45:14 -07:00
74ea1f23b4 Revert D29148233: [pytorch][PR] Add GITHUB_HEAD_REF in check for IN_PULL_REQUEST
Test Plan: revert-hammer

Differential Revision:
D29148233 (241aac3ef8)

Original commit changeset: 7c8c1866f39c

fbshipit-source-id: f32c6c6decd737ef290d3e83c9d021475aabaab0
2021-06-16 10:41:30 -07:00
bac6bcd6d8 Update call site for FBGemm quantization util functions. (#624)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59637

Replace FloatToFusedNBitRowwiseQuantizedSBHalf, FusedNBitRowwiseQuantizedSBHalfToFloat, FloatToFused8BitRowwiseQuantizedSBFloat, and Fused8BitRowwiseQuantizedSBFloatToFloat with newer version.

Test Plan: CI tests.

Reviewed By: dskhudia

Differential Revision: D28918581

fbshipit-source-id: a21274add71439c5e51287a0e2ec918a8d8e5392
2021-06-16 10:15:34 -07:00
d88fbf0fbc fix minor typo in run_test.py (#60055)
Summary:
Fixes typo in run_test.py for option use_specified_test_cases_by

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60055

Reviewed By: walterddr

Differential Revision: D29150156

Pulled By: janeyx99

fbshipit-source-id: 375e594d09c83188bfa80762c8b833a0b7c5cca4
2021-06-16 09:30:45 -07:00
241aac3ef8 Add GITHUB_HEAD_REF in check for IN_PULL_REQUEST (#60047)
Summary:
I believe IN_PULL_REQUEST is unset for some GHA test runs because we don't also check GITHUB_HEAD_REF. This PR is a small fix for that.

Example: https://github.com/pytorch/pytorch/pull/60023/checks?check_run_id=2831813860 doesn't set it properly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60047

Reviewed By: walterddr

Differential Revision: D29148233

Pulled By: janeyx99

fbshipit-source-id: 7c8c1866f39ce8af8d13c34ddc0c5786a829321e
2021-06-16 08:57:49 -07:00
a6ecfb3296 Update lint.yml to use custom clang-tidy build (#59967)
Summary:
Related: https://github.com/pytorch/pytorch/issues/59815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59967

Reviewed By: samestep

Differential Revision: D29164686

Pulled By: 1ntEgr8

fbshipit-source-id: b6f9fb6fa4280f757a54a37b30b027b7504bef63
2021-06-16 08:45:24 -07:00
842a831f53 [nnc] Move batchnorm to operators library (#59992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59992

Wrapped batch norm in function `computeBatchNorm`.
ghstack-source-id: 131407851

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D29116661

fbshipit-source-id: 2873a9a3e70f31db1988787160fc96c388ea3d4a
2021-06-16 05:09:59 -07:00
bda40639c5 [nnc] Move operator implementations into a subdirectory (#59988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59988

As we broaden operator support, putting all the implementations into
kernel.cpp is getting unwieldy.  Let's factor them out into the "operators"
subdirectory.

This diff is big but it's entirely code movement; I didn't change anything,
other than to expose a few utilities in kernel.h.
ghstack-source-id: 131405139

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D29115916

fbshipit-source-id: ba0df1d8dd4a108b584da3baf168407e966b2c78
2021-06-16 05:08:50 -07:00
f43ff754ca [docs] Correct errata in linalg.eigh and add a bit more information (#59784)
Summary:
Add extra information about the returned elements of the spectral
decompositions

Resolves https://github.com/pytorch/pytorch/issues/59718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59784

Reviewed By: soulitzer

Differential Revision: D29088998

Pulled By: mruberry

fbshipit-source-id: 58a191c41ff5e4c9d9675e5b3d7cbbcf16be4da1
2021-06-16 01:21:09 -07:00
36a5647e30 Handle exceptions from THPModule_setQEngine (#60073)
Summary:
Prevents Python runtime crashes when `torch._C._set_qengine(2**65)` is called or `torch.backends.quantized.engine="fbgemm"` is set while PyTorch was compiled without fbgemm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60073

Reviewed By: supriyar

Differential Revision: D29156430

Pulled By: malfet

fbshipit-source-id: 95b97352a52a262f1634b72da64a0c950eaf2373
2021-06-16 00:40:59 -07:00
9fbbab88da [fx-acc] Saturate host by replicating partitions onto idle devices (#60064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60064

This implements a host saturation optimization to maximize the utilization of the available devices.
It uses a greedy heuristic to replicate all partitions on the used devices to another set of idle devices with enough memory.

The added unittest shows an example as follows:

```
partition_0: 192 bytes; partition_1: 48 bytes
dev_0: 200 bytes, [partition_0]
dev_1: 200 bytes, [partition_1]
dev_2: 100 bytes,
dev_3: 100 bytes,
dev_4: 200 bytes,
dev_5: 100 bytes
```

Before host saturation, `partition_0` is assigned to dev_0 and `partition_1` is assigned to dev_1.
After host saturation, `partition_0` is replicated to dev_4 simply because it's the only device that can hold all partitions on dev_0. `partition_1` is replicated to dev_2 because it has minimal but large enough memory to hold all partitions on dev_1.
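
A greedy sketch of the idea with the example above (hypothetical names, not the actual partitioner API):

```python
from dataclasses import dataclass

@dataclass
class Device:  # hypothetical stand-in for the partitioner's device type
    name: str
    mem: int

def saturate_host(devices, placement, sizes):
    """For each device holding partitions, replicate them onto the
    smallest idle device that still fits all of them."""
    used = [d for d in devices if placement.get(d.name)]
    idle = sorted((d for d in devices if not placement.get(d.name)),
                  key=lambda d: d.mem)
    for dev in used:
        need = sum(sizes[p] for p in placement[dev.name])
        fit = next((c for c in idle if c.mem >= need), None)
        if fit is not None:
            placement[fit.name] = list(placement[dev.name])
            idle.remove(fit)
    return placement

devices = [Device(f"dev_{i}", m)
           for i, m in enumerate([200, 200, 100, 100, 200, 100])]
placement = {"dev_0": ["partition_0"], "dev_1": ["partition_1"]}
sizes = {"partition_0": 192, "partition_1": 48}
print(saturate_host(devices, placement, sizes))
# partition_0 -> dev_4 (the only idle device fitting 192 bytes);
# partition_1 -> dev_2 (the smallest idle device with >= 48 bytes)
```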

Test Plan:
```
buck test mode/opt //caffe2/test:test_fx_experimental -- --exact 'caffe2/test:test_fx_experimental - test_saturate_host (test_fx_experimental.TestFXExperimental)'

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8444249343103429
    ✓ ListingSuccess: caffe2/test:test_fx_experimental - main (1.322)
    ✓ Pass: caffe2/test:test_fx_experimental - test_saturate_host (test_fx_experimental.TestFXExperimental) (1.322)
Summary
  Pass: 1
  ListingSuccess: 1
```

An e2e test will be added to `test_fx_glow.py` in a followup diff.

Reviewed By: gcatron

Differential Revision: D29039998

fbshipit-source-id: 57518aadf668f7f05abd6ff73224c16b5d2a12ac
2021-06-15 23:04:46 -07:00
a344b09db2 [quant][fx][graphmode] Remove Quantizer class (#59606)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59606

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28951432

fbshipit-source-id: 3301f7200a4c7166673c27f9ac7ff559f1e6935d
2021-06-15 21:54:57 -07:00
78011bc0ce typofix (torch.zero to torch.zeros) in docstring (#59703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59703

Reviewed By: ezyang

Differential Revision: D29145998

Pulled By: H-Huang

fbshipit-source-id: f2670502170aa100fb02408046b7f6850f9379cf
2021-06-15 21:12:42 -07:00
e50f264b51 [caffe2] make MulGradient implementation in-place compatible (#60035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60035

In Caffe2, the operator schema for the MulGradient op indicates that MulGradient may be performed in-place, overwriting one of its inputs as the output. The implementation is not safe to perform in-place, however, due to an accidentally-introduced write-read dependency on the overwritten input in the in-place case. We fix it here.
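
The hazard is easy to demonstrate with a toy elementwise multiply (an illustration in NumPy, not the Caffe2 kernel):

```python
import numpy as np

G = np.array([1.0, 2.0])   # upstream gradient dY
A = np.array([3.0, 4.0])
B = np.array([5.0, 6.0])   # Y = A * B, so dA = G * B and dB = G * A

np.multiply(G, B, out=A)   # in-place: dA overwrites A's buffer
print(G * A)               # [ 5. 24.] -- wrong dB; should be [3. 8.]
```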

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test
```

Note that the newly added test fails without this change, but passes with this change:

```
    ✓ ListingSuccess: caffe2/caffe2/python/operator_test:elementwise_ops_test - main (24.992)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_exp (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_log1p (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_abs (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_and (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_reciprocal (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sqr (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_rsqrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_mul (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sqrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_add (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_swish_gradient_inplace (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sigmoid (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_or (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cbrt_grad (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_not (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sub (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_div (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_eq (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_softsign (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_eq_bcast (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_powt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
*************************************************************************************************************************************************************************************
***********************************<NEW_TEST_YAY>************************************************************************************************************************************
*************************************************************************************************************************************************************************************

   ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_mul_gradient_inplace (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)

*************************************************************************************************************************************************************************************
***********************************</NEW_TEST_YAY>***********************************************************************************************************************************
*************************************************************************************************************************************************************************************
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_hard_sigmoid (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_xor (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_log (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cube (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_swish (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cbrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_div_legacy_grad (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - main (125.898)
Summary
  Pass: 30
  ListingSuccess: 1
```

Reviewed By: clrfb

Differential Revision: D29034265

fbshipit-source-id: 98550e1d5976398e45d37ff2120591af1439c42a
2021-06-15 20:26:04 -07:00
eda2ddb5b0 [ATen] Fix aten::to schema (#60001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60001

Fix the aten::to schema to reflect that the output may alias input.

Test Plan: Added new unit tests.

Reviewed By: ezyang

Differential Revision: D29121620

fbshipit-source-id: c29b6aa22d367ffedf06e47116bc46b3e188c39c
2021-06-15 20:04:20 -07:00
95257e8a62 [fx-acc] Fix wrong device assignment in find_single_partition (#60056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60056

Previously we put the whole graph as a single partition onto a device with maximum memory if possible, but the code assumed that the first logical device always has the maximum memory.

This diff fixes this issue and updates the unittest to reflect such a corner case.
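
A sketch of the corrected device choice (the field name is hypothetical):

```python
from collections import namedtuple

Dev = namedtuple("Dev", "name available_mem_bytes")
devices = [Dev("dev_0", 100), Dev("dev_1", 300), Dev("dev_2", 200)]

# Old assumption: devices[0] has the most memory. Corrected: take the max.
best = max(devices, key=lambda d: d.available_mem_bytes)
print(best.name)  # dev_1
```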

Test Plan:
```
buck test mode/opt //caffe2/test:test_fx_experimental -- --exact 'caffe2/test:test_fx_experimental - test_find_single_partition (test_fx_experimental.TestFXExperimental)'

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/6473924507772744
    ✓ ListingSuccess: caffe2/test:test_fx_experimental - main (1.357)
    ✓ Pass: caffe2/test:test_fx_experimental - test_find_single_partition (test_fx_experimental.TestFXExperimental) (1.206)
Summary
  Pass: 1
  ListingSuccess: 1

```

Reviewed By: gcatron

Differential Revision: D29118715

fbshipit-source-id: cac6a1f0d2f47717446dcc80093bbcf362663859
2021-06-15 19:36:38 -07:00
469f0e42d6 [nnc] Handle more cases of excessive # of cat args (#60043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60043

And add a unit test

Test Plan: new unit test

Reviewed By: navahgar

Differential Revision: D29146547

fbshipit-source-id: 31532926032dbef70d163930f3d8be160f5eacc3
2021-06-15 18:19:52 -07:00
1207745e98 fixing illegal memory access on NHWC BN kernel (#59981)
Summary:
Adds an early exit in the kernel to avoid reading out of bounds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59981

Reviewed By: ezyang

Differential Revision: D29147349

Pulled By: ngimel

fbshipit-source-id: b36a6a9e2526c609ff98fb5a44468f3257e0af67
2021-06-15 16:57:41 -07:00
27a3204982 generate C++ API for meta functions using at::meta:: (#58570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58570

**What the PR does**
Generate a fast-path `at::meta::{op}` API for calling meta functions without having to go through the dispatcher. This will be important for perf for external backends that want to use meta functions for shape checking (which seems likely to be what we end up doing for LazyTensorCore).

**Details**
In order to avoid naming collisions I had to make two small changes:
- rename `MetaFunctions.h` template -> `NativeMetaFunctions.h` (this is the file that declares the impl() function for every structured operator).
- rename the meta class: `at::meta::{op}::meta()` -> `at::meta::structured_{op}::meta()`

I also deleted a few unnecessary includes, since any file that includes NativeFunctions.h will automatically include NativeMetaFunctions.h.

**Why I made the change**
This change isn't actually immediately used anywhere; I already started writing it because I thought it would be useful for structured composite ops, but that isn't actually true (see [comment](https://github.com/pytorch/pytorch/pull/58266#issuecomment-843213147)). The change feels useful and unambiguous though so I think it's safe to add. I added explicit tests for C++ meta function calls just to ensure that I wrote it correctly - which is actually how I hit the internal linkage issue in the PR below this in the stack.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711299

Pulled By: bdhirsh

fbshipit-source-id: d410d17358c2b406f0191398093f17308b3c6b9e
2021-06-15 16:54:46 -07:00
e341bab8ae bugfix: ensure that at::{dispatch_key}:: API gets external linkage (#58569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58569

This should allow external C++ files that aren't compiled into `libtorch.so`/`libtorch_cpu.so` (including all of fbcode) to use fast path functions like `at::cpu::add()`, which skip the dispatcher.

So, after spending way too much time trying to figure out why I was getting linker errors when calling `at::meta::{op}` and `at::cpu::{op}` from C++ test files, I realized that the C++ files with the namespaced operator definitions weren't including their own headers. I.e., `RegisterCPU.cpp`, which provides definitions for the `at::cpu::{op}` fast-path functions, wasn't including the `CPUFunctions.h` header.

Why that breaks stuff: the `CPUFunctions.h` header file is what marks each function with the `TORCH_API` macro, so without including it, when we build `libtorch.so` and `libtorch_cpu.so`, the compiler will look at the definition in `RegisterCPU.cpp`, not see a `TORCH_API`, and decide that the function should get internal linkage.
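
A self-contained sketch of that failure mode (stand-in types; in PyTorch, `TORCH_API` expands to the export/visibility attribute when building the library):
```
#define TORCH_API __attribute__((visibility("default")))
struct Tensor {};

// --- CPUFunctions.h (sketch): the declaration carries TORCH_API. ---
TORCH_API Tensor add(const Tensor& a, const Tensor& b);

// --- RegisterCPU.cpp (sketch): if this file did NOT include the header
// above, the compiler would never see TORCH_API for the symbol and, per
// the summary, decide the definition gets internal linkage, so external
// callers would fail to link. Including the header fixes it. ---
Tensor add(const Tensor& a, const Tensor& b) {
  (void)b;
  return a;  // placeholder body
}
```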

An alternative would be to directly mark the function definitions in `RegisterCPU.cpp` with `TORCH_API`, but this seemed cleaner.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711300

Pulled By: bdhirsh

fbshipit-source-id: 535f245c20e977ff566d6da0757b3cefa137040b
2021-06-15 16:53:22 -07:00
5fd6ead097 refine disabled test (#60040)
Summary:
This is to refine:
https://github.com/pytorch/pytorch/pull/60029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60040

Reviewed By: ezyang

Differential Revision: D29147009

Pulled By: Krovatkin

fbshipit-source-id: 37e01ac6e8d6f7e6b5c517f7804704f9136a56f5
2021-06-15 16:22:29 -07:00
fc50f91929 Move RPC agents to libtorch (#59939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59939

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28875276

fbshipit-source-id: f2f6970fd74de5f112636e78edaa4410c61d8c45
2021-06-15 16:20:53 -07:00
04ec122868 Add some TORCH_API annotations to RPC
Summary: They will be needed when RPC gets merged into libtorch

Test Plan: CI later in the stack

Reviewed By: mrshenli

Differential Revision: D29132956

fbshipit-source-id: 8637640d56a1744a5dca5eb7d4b8ad0860c6b67c
2021-06-15 16:20:51 -07:00
cbbb7e145e Pass RequestCallback to FaultyPG RPC agent
Summary: This is needed to prevent FaultyPG from including and depending on RequestCallbackImpl, which is Python-only. The other RPC agents accept an explicit (upcast) pointer as an argument, and we can do the same for FaultyPG.

Test Plan: Later in the stack.

Reviewed By: mrshenli

Differential Revision: D29132955

fbshipit-source-id: bb7554b84bcbf39750af637e6480515ac8b92b86
2021-06-15 16:19:50 -07:00
f232b052a6 [fx-acc][easy] Format FX experimental partitioner code (#60030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60030

As titled. Non-functional re-format.

Test Plan: NA

Reviewed By: gcatron

Differential Revision: D29038449

fbshipit-source-id: a7c94eaab86850ef57b51ec66bfe8ea0e68d2dc8
2021-06-15 16:14:33 -07:00
50229b5250 Fix some typing issues (#59952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59952

Test Plan: Sandcastle

Reviewed By: swolchok

Differential Revision: D29083423

fbshipit-source-id: 7a13d6ba60808bcf88d809db194d0f873605172c
2021-06-15 14:11:06 -07:00
1d5a577f04 Fix some items identified as problematic by Wextra and other clean-up (#59909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59909

Test Plan: Sandcastle

Reviewed By: vkuzo

Differential Revision: D29073150

fbshipit-source-id: 500a92ccb57b0e40277863a3b235099fd66ab8ad
2021-06-15 13:42:32 -07:00
dc1f60a9a2 [sparsity][refactor] Restructure the tests folders (#60032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60032

There will be more sparse tests coming. This PR creates a separate folder for the sparse tests

Test Plan: `python test/test_ao.py`

Reviewed By: raghuramank100

Differential Revision: D29139265

fbshipit-source-id: d0db915f00e6bc8d89a5651f08f72e362a912a6b
2021-06-15 13:37:19 -07:00
8dd0570b34 Reuse build_torch_xla from pytorch/xla repo. (#59989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59989

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29138211

Pulled By: ailzhang

fbshipit-source-id: 349d307c510e7fad266822e320f0d6904fa00239
2021-06-15 13:19:54 -07:00
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
a0e62c4da4 Reuse run_torch_xla_tests from pytorch/xla (#59888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59888

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29114274

Pulled By: ailzhang

fbshipit-source-id: d2845c7fc95d038cd68c10e22b68be8ad3cae736
2021-06-15 13:00:09 -07:00
c23624351a disable test_sparse_allreduce_basics (#60029)
Summary:
This test will be disabled due to intermittent failures in https://circleci.com/gh/pytorch/pytorch/14155828?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
as per https://hud.pytorch.org/build2/pytorch-master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60029

Reviewed By: seemethere

Differential Revision: D29139042

Pulled By: Krovatkin

fbshipit-source-id: 105000e8636f17846be31f517abdf56ea0a994e9
2021-06-15 12:35:11 -07:00
044b519a80 Symbolic for ReLu6 (#58560) (#59538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59538

Four mealv2 models could be exported with torch 1.8.1, but export fails on torch master, which introduced relu6 a few months back.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046607

Pulled By: SplitInfinity

fbshipit-source-id: d9cf7050e4ac0dad892441305ffebc19ba84e2be

Co-authored-by: David <jiafa@microsoft.com>
2021-06-15 12:24:17 -07:00
5d00c374dd [ONNX] Sum empty tensor could not be exported to ONNX successfully. (#58141) (#59537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59537

PyTorch's sum over an empty tensor gives 0, while ONNX produces an error.

torch.sum is translated into the onnx::ReduceSum op. Per the definition of ReduceSum, we update the keepdims attribute for this scenario.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046604

Pulled By: SplitInfinity

fbshipit-source-id: 6f5f3a66cb8eda8b5114b8474dda6fcdbae73469

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-06-15 12:24:16 -07:00
83450aa11d [ONNX] Add support for torch.bernoulli() export (#57003) (#59536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59536

Supports exporting the HuggingFace DeBERTa model for training.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046609

Pulled By: SplitInfinity

fbshipit-source-id: df87e0c6ed0f13463297bdeba73967fcf2aa37ca

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-06-15 12:24:14 -07:00
cd5f142af4 fix error message for type_as (#57948) (#59535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59535

Improve error message for type_as and add unit test.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046605

Pulled By: SplitInfinity

fbshipit-source-id: 978bceeb62e4d3c68815cd5fdf160909a99d00f2

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-06-15 12:24:12 -07:00
55530e2276 Update Autograd Export Docs (#56594) (#59534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59534

Update autograd export docs

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046606

Pulled By: SplitInfinity

fbshipit-source-id: 36057f6bdfd3e5c071dbca05d327de7952904120

Co-authored-by: neginraoof <neginmr@utexas.edu>
2021-06-15 12:23:00 -07:00
a120a12ab4 [Bootcamp][pytorch]Add WebIterDataPipe and ToBytesIterDataPipe to the datapipes. (#59816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816

Add two new DataPipes: one that takes web file URLs and yields streams, and one that takes streams and yields bytes.

Test Plan:
Add test_web_iterable_datapipe in test/test_datapipes.py. The test starts a local HTTP server to serve test files. The following passed locally:
1. create and load 16M localhost file URLs (each 10 bytes)
2. create and load a 64GB localhost file
For the sake of testing time, both the stress test and the large-file test are disabled in the unit test.

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29051186

fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396
2021-06-15 11:43:26 -07:00
79d7c15dc5 [PyTorch] Add ExclusivelyOwned (#59419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59419

This introduces ExclusivelyOwned, which allows isolated
pieces of code that can make ownership guarantees to opt out of
reference counting operations on `intrusive_ptr` and `Tensor`
entirely. To elaborate, if you know you are the exclusive owner of an
`intrusive_ptr` or `Tensor`, moving it into an `ExclusivelyOwned` will
avoid performing atomic reference counting operations at destruction
time. The documentation comment should provide sufficient explanation; please request changes if not.
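
A hedged usage sketch (see `ExclusivelyOwned_test.cpp` for the authoritative usage; the header path is assumed):
```
#include <utility>
#include <ATen/ATen.h>
#include <c10/util/ExclusivelyOwned.h>

void consume(at::Tensor t) {
  // We are provably the sole owner of `t` from here on, so wrap it:
  c10::ExclusivelyOwned<at::Tensor> owned(std::move(t));
  // Use it like a normal tensor through operator* / operator-> ...
  auto n = owned->numel();
  (void)n;
  // ... and at scope exit the destructor releases resources without the
  // atomic refcount decrement a regular Tensor destructor would perform.
}
```
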
ghstack-source-id: 131376658

Test Plan:
Added `ExclusivelyOwned_test.cpp`. It passes. When I ran it
under valgrind, valgrind reported no leaks.

Inspected assembly from `inspect` functions in
`ExclusivelyOwned_test.cpp` in an optimized (opt-clang) build. As
expected, `ExclusivelyOwned` calls `release_resources()` and the
`TensorImpl` virtual destructor without including any atomic reference
counting operations.

Reviewed By: ezyang

Differential Revision: D28885314

fbshipit-source-id: 20bf6c82b0966aaa635ab0233974781ed15f93c1
2021-06-15 11:26:25 -07:00
d7eb5836bb Add RRef support to ShardedTensor. (#59776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59776

Overall design: https://github.com/pytorch/pytorch/issues/55207.

In this PR, I've added support to ShardedTensor such that it also creates RRefs
pointing to the remote shards if the RPC framework is initialized.

As a result, this provides more flexibility for ShardedTensor such that users
can use collectives with local shards or use the RPC framework to interact with
remote shards.
ghstack-source-id: 131381914

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29020844

fbshipit-source-id: acb308d0029a5e486c464d93189b5de1ba680c85
2021-06-15 10:49:31 -07:00
20460b0c05 [nnc] Removed setBufferMap method from LoopNest (#59496)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59496

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915958

Pulled By: navahgar

fbshipit-source-id: 71e649c93fc67b36c37373f043c729aa835968a0
2021-06-15 10:37:48 -07:00
b822928e33 [nnc] Removed setGPUBlockIndex and setGPUThreadIndex methods from LoopNest (#59495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59495

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915960

Pulled By: navahgar

fbshipit-source-id: 20a4032b031aba6e43d85433ade5f0680c65fbc0
2021-06-15 10:37:46 -07:00
aa163aeff5 [nnc] Made several LoopNest APIs static (#59494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59494

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915959

Pulled By: navahgar

fbshipit-source-id: bf52e30d893f4d86812219b538a14307f347f10b
2021-06-15 10:36:31 -07:00
4afd0b7952 .github: Add Windows CUDA 11.1 workflow (#59960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59960

Adds the CUDA 11.1 workflow to GHA

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29116814

Pulled By: seemethere

fbshipit-source-id: 90601610e481e1f70a60eaa1b640373ecb89bdb9
2021-06-15 10:22:30 -07:00
1c502d1f8e Don't run_build when run_binary_tests (#59982)
Summary:
https://github.com/pytorch/pytorch/issues/59889 wasn't a proper revert of https://github.com/pytorch/pytorch/issues/58778. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59982

Reviewed By: seemethere

Differential Revision: D29114129

Pulled By: samestep

fbshipit-source-id: b40563db6ff1153a5f759639978279f5fcbccaa9
2021-06-15 07:39:38 -07:00
90cf76dde5 Support torch.nn.parameter type for PDT (#59249)
Summary:
=========

Support torch.nn.parameter type for PDT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59249

Test Plan:
====
with-proxy python test/test_jit.py -k TestPDT

Reviewed By: ZolotukhinM

Differential Revision: D29124413

Pulled By: nikithamalgifb

fbshipit-source-id: b486b82c897dbc2b55fbacd5d610bdb700ddc9fa
2021-06-15 07:22:33 -07:00
f9445c8a6b [torch][segment_reduce] Add cuda support for mean reduction (#59543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59543

Building on top of previous PR: https://github.com/pytorch/pytorch/pull/59521

This diff adds support for mean reduction on CUDA (forward only currently).
The CUDA backward implementation will be added in a subsequent PR.
Next steps:
- CUDA backward support for mean
- 2d data input support
- more testing
- benchmarking

Test Plan: update unit test to cover this part as well.

Reviewed By: ngimel

Differential Revision: D28922838

fbshipit-source-id: 72b7e5e79db967116b96ad010f290c9f057232d4
2021-06-15 07:00:45 -07:00
f4f7950812 Prepare for TensorPipe separating its CUDA-specific headers (#59788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59788

This one line is all we need to "migrate" PyTorch to the "new API" of TensorPipe that splits the CUDA-specific stuff into a separate top-level header. (The motivation is that it will allow us to "stack" the CUDA code on top of the CPU code.)
ghstack-source-id: 131326166

Test Plan: None yet

Reviewed By: beauby

Differential Revision: D28875277

fbshipit-source-id: ecfd0b7fc0218ab7899bfe64ffe73c1417b897db
2021-06-15 03:28:39 -07:00
5e5ca0682b Move CUDA-related stuff of TP agent to separate file (#59377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59377

This PR demonstrates that now the CUDA parts of the TensorPipe agent just "plug on top" of the CPU-only parts. Thus ideally the CPU-only parts could go in libtorch while the CUDA-only parts could go in libtorch_cuda. Unfortunately we can't do that just yet, because the TensorPipe agent depends on c10d (for its Store and its ProcessGroup), which lives in libtorch_python.
ghstack-source-id: 131326168

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28796429

fbshipit-source-id: 41b2eb8400c0da282f3750a4eea21ad83ee4a175
2021-06-15 03:28:38 -07:00
83ba71aa0e Make CUDA serde support for TP agent pluggable (#59376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59376

This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow moving the TP agent to libtorch (because libtorch is split into a CPU and a CUDA part; now it's in libtorch_python), although unfortunately other conditions also need to be met for this to happen.

The only instance where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to "decentralize it". It's not a c10::Registry, because that's overkill (it uses an unordered_map, with strings as keys): here we can just use an array with integers as "keys".
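
A rough sketch of that "array as registry" idea (all names hypothetical):
```
#include <array>
#include <cstddef>
#include <cstdint>

struct TensorSerDe;  // would expose the serialize/deserialize hooks

// One slot per device type, indexed by a small integer, instead of a
// string-keyed unordered_map as in c10::Registry.
constexpr size_t kMaxDeviceTypes = 16;
std::array<const TensorSerDe*, kMaxDeviceTypes> serde_registry{};

void register_serde(uint8_t device_type, const TensorSerDe* impl) {
  serde_registry[device_type] = impl;  // e.g. the CUDA module fills its slot
}

const TensorSerDe* lookup_serde(uint8_t device_type) {
  return serde_registry[device_type];  // nullptr when nothing registered
}
```
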
ghstack-source-id: 131326167

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28796428

fbshipit-source-id: b52df832e0c0abf489a9e418353103496382ea41
2021-06-15 03:27:40 -07:00
cf63893211 Enable implicit operator versioning via number of arguments (#58852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58852

Enable implicit operator versioning via number of arguments from Mobile.
1. By default, TS doesn't emit instructions for trailing default args, and the provided number of specified args is serialized to bytecode. From the interpreter, the default values are fetched from the operator schema. The implementation landed in #56845; please refer to #56845 for details.
2. Since there is a bytecode schema change, the bytecode version is bumped from 5 to 6.
3. The corresponding backport function is provided, for forward compatibility use. Note that because there is an instruction change, a global flag is used as the switch between the two versions.

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D28789746

Pulled By: iseeyuan

fbshipit-source-id: 6e5f16460c79b2bd3312de02d0f57b79f50bf66b
2021-06-15 02:07:40 -07:00
a1780432fa Move c10d to libtorch(_cuda) (#59563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59563

ghstack-source-id: 131331264

Test Plan: CI

Reviewed By: malfet

Differential Revision: D28932239

fbshipit-source-id: 5df6cdfa5253b15cbbc97039fe672d6d97321e34
2021-06-15 02:01:31 -07:00
8d50a4e326 Add support for embeddingBagBytewise in FXGlow
Summary: This adds support for embeddingBagBytewise with fp32 scale/bias to FXGlow.

Test Plan: buck run  //glow/fb/fx/fx_glow:test_fx_glow

Reviewed By: jfix71

Differential Revision: D29075288

fbshipit-source-id: 4145486505a903129678216b133bbb8ad71f4fef
2021-06-14 23:31:29 -07:00
cbd1e8c335 [Static Runtime] Fix bug in aten::to (#59995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59995

Reviewed By: ajyu

Differential Revision: D29083106

fbshipit-source-id: 687ffb121af2716d606c145474942650a2d9ac7e
2021-06-14 22:54:43 -07:00
087ac75b26 Fix quantized mean operator in QNNPACK backend (#59761)
Summary:
cc: kimishpatel

Fixes https://github.com/pytorch/pytorch/issues/58668

Test it with `pytest -k test_quantized_mean test/test_quantization.py` or `buck test //caffe2/test:quantization -- test_quantized_mean`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59761

Reviewed By: bdhirsh

Differential Revision: D29013271

Pulled By: kimishpatel

fbshipit-source-id: 020956fb63bd5078856ca17b137be016d3fc29b8
2021-06-14 17:30:21 -07:00
5b9fced70a add output_process_fn_grad before sum().backward() (#59971)
Summary:
This should fix the `to_sparse` test issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59971

Test Plan:
CI

Also: directly examine the RuntimeError thrown from test_unsupported_backward
- Before:
```
NotImplementedError: Could not run 'aten::sum' with arguments from the 'SparseCPU' backend.
```
- After:
```
to_dense() not supported for float16 on CPU
```

Reviewed By: soulitzer

Differential Revision: D29112558

Pulled By: walterddr

fbshipit-source-id: c2acd22cd18d5b34d25209b8415feb3ba28fa104
2021-06-14 16:20:03 -07:00
117b7ae38a Remove update-disabled-tests workflow as it is migrated to test-infra (#59986)
Summary:
Will be replaced by https://github.com/pytorch/test-infra/pull/37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59986

Reviewed By: seemethere, soulitzer

Differential Revision: D29115397

Pulled By: janeyx99

fbshipit-source-id: 2c1a88d6a3fec8cef57818a360884644ec2c7b79
2021-06-14 15:25:34 -07:00
c2098487e8 [c10d] Move pg wrapper tests to their own file. (#59840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840

moving these tests to their own standalone file. No meaningful code changes.
ghstack-source-id: 131359162

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D29012664

fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674
2021-06-14 15:05:55 -07:00
5c1d17e697 Revert D29100708: [pytorch][PR] Parametrizations depending on several inputs
Test Plan: revert-hammer

Differential Revision:
D29100708 (061e71b199)

Original commit changeset: b9e91f439cf6

fbshipit-source-id: bff6d8a3d7b24f4beb976383912033c250d91a53
2021-06-14 14:08:50 -07:00
5e993e6c81 [fx2trt] Make TRTInterpreter don't need concrete tensor as arg (#59948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59948

1. We have two Interpreters: one for vanilla ops and one for acc ops. Some of the logic between them is similar, so in this diff we extract the shared logic into a Base Interpreter. This way, any future general feature change can benefit both Interpreters.

2. Make TRTInterpreter not depend on a concrete tensor arg. We will use `InputTensorSpec` to create the necessary inputs for the acc tracer.

3. Add unittests for acc op converter.

Test Plan:
```
buck test mode/opt caffe2/torch/fb/fx2trt:test_linear
buck test mode/opt caffe2/torch/fb/fx2trt:test_batchnorm
buck test mode/opt caffe2/torch/fb/fx2trt:test_convolution
buck test mode/opt caffe2/torch/fb/fx2trt:test_reshape
buck test mode/opt caffe2/torch/fb/fx2trt:test_relu
buck test mode/opt caffe2/torch/fb/fx2trt:test_add
buck test mode/opt caffe2/torch/fb/fx2trt:test_maxpool
```

Reviewed By: jackm321

Differential Revision: D28749682

fbshipit-source-id: 830d845aede7203f6e56eb1c4e6776af197a0fc3
2021-06-14 14:03:26 -07:00
c645d39a77 Implementation of torch.isin() (#53125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/3025

## Background

This PR implements a function similar to numpy's [`isin()`](https://numpy.org/doc/stable/reference/generated/numpy.isin.html#numpy.isin).

The op supports integral and floating point types on CPU and CUDA (+ half & bfloat16 for CUDA). Inputs can be one of:
* (Tensor, Tensor)
* (Tensor, Scalar)
* (Scalar, Tensor)

Internally, one of two algorithms is selected based on the number of elements vs. test elements. The heuristic for deciding which algorithm to use is taken from [numpy's implementation](fb215c7696/numpy/lib/arraysetops.py (L575)): if `len(test_elements) < 10 * len(elements) ** 0.145`, then a naive brute-force checking algorithm is used. Otherwise, a stablesort-based algorithm is used.

I've done some preliminary benchmarking to verify this heuristic on a devgpu, and determined for a limited set of tests that a power value of `0.407` instead of `0.145` is a better inflection point. For now, the heuristic has been left to match numpy's, but input is welcome for the best way to select it or whether it should be left the same as numpy's.
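
Written out, the selection rule from the summary is (sketch; the element counts are the function inputs):
```
#include <cmath>
#include <cstdint>

// true  -> naive brute-force pass over test_elements for each element
// false -> stablesort-based algorithm
// 0.145 is numpy's exponent; per the benchmarking above, ~0.407 may be a
// better inflection point.
bool use_brute_force(int64_t num_elements, int64_t num_test_elements) {
  return static_cast<double>(num_test_elements) <
         10.0 * std::pow(static_cast<double>(num_elements), 0.145);
}
```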

Tests are adapted from numpy's [isin and in1d tests](7dcd29aaaf/numpy/lib/tests/test_arraysetops.py).

Note: my locally generated docs look terrible for some reason, so I'm not including the screenshot for them until I figure out why.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53125

Test Plan:
```
python test/test_ops.py   # Ex: python test/test_ops.py TestOpInfoCPU.test_supported_dtypes_isin_cpu_int32
python test/test_sort_and_select.py   # Ex: python test/test_sort_and_select.py TestSortAndSelectCPU.test_isin_cpu_int32
```

Reviewed By: soulitzer

Differential Revision: D29101165

Pulled By: jbschlosser

fbshipit-source-id: 2dcc38d497b1e843f73f332d837081e819454b4e
2021-06-14 13:50:53 -07:00
f9ec86a6c6 External stream (#59527)
Summary:
Previous is https://github.com/pytorch/pytorch/issues/57781

We now add two CUDA bindings to avoid using ctypes, fixing a Windows issue.
However, we still use ctypes to allocate the stream and create its pointer
(we could do this with a 0-dim tensor too if that feels better).

CC. ezyang rgommers ngimel mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527

Reviewed By: albanD

Differential Revision: D29053062

Pulled By: ezyang

fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f
2021-06-14 13:46:11 -07:00
8e92a3a8b0 [docs] Add pickle security warning to package docs (#59959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59959

**Summary**
This commit replaces the warning on the `torch.package` documentation
page about the module not being publicly released (which will no longer
be true as of 1.9) with one that warns about security issues caused by
the use of the `pickle` module.

**Test Plan**
1) Built the docs locally.
2) Continuous integration.

<img width="877" alt="Captura de Pantalla 2021-06-14 a la(s) 11 22 05 a  m" src="https://user-images.githubusercontent.com/4392003/121940300-c98cab00-cd02-11eb-99dc-08e29632079a.png">

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29108429

Pulled By: SplitInfinity

fbshipit-source-id: 3a0aeac0dc804a31203bc5071efb1c5bd6ef9725
2021-06-14 13:03:05 -07:00
ef13341a8d upgrade onednn to v2.2.3 (#57928)
Summary:
This PR is to upgrade onednn to v2.2.3 (including v2.2 and v2.2.3 changes) which has the following main changes about CPU:

v2.2 changes:
- Improved performance of compute functionality for future Intel Core processors with Intel AVX2 and Intel DL Boost instruction support (code name Alder Lake).
- Improved fp32 inner product forward propagation performance for processors with Intel AVX-512 support.
- Improved dnnl_gemm performance for cases with n=1 on all supported processors.

v2.2.3 changes:
- Fixed a bug in the int8 depthwise convolution primitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support
- Fixed a correctness issue for the PReLU primitive on Intel Processor Graphics
- Fixed a correctness issue in reorder for blocked layouts with zero padding
- Improved performance of weight reorders used by the BRGEMM-based convolution primitive for processors with Intel AVX-512 support

More changes can be found in https://github.com/oneapi-src/oneDNN/releases.

The ideep version used is pytorch-rls-v2.2.3.
The oneDNN version used is v2.2.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57928

Reviewed By: bdhirsh

Differential Revision: D29037857

Pulled By: VitalyFedyunin

fbshipit-source-id: db74534858bdcf5d6c7dcf58e224fc756188bc31
2021-06-14 11:57:45 -07:00
061e71b199 Parametrizations depending on several inputs (#58488)
Summary:
Makes it possible for the first registered parametrization to depend on a number of parameters rather than just one. Examples of these types of parametrizations are `torch.nn.utils.weight_norm` and low-rank parametrizations via the multiplication of an `n x k` tensor by a `k x m` tensor with `k <= m, n`.

Follows the plan outlined in https://github.com/pytorch/pytorch/pull/33344#issuecomment-768574924. A short summary of the idea is: we call `right_inverse` when registering a parametrization to generate the tensors that we are going to save. If `right_inverse` returns a sequence of tensors, then we save them as `original0`, `original1`...  If it returns a `Tensor` or a sequence of length 1, we save it as `original`.

We only allow many-to-one parametrizations as the first parametrization registered. Any subsequent parametrizations need to be one-to-one.

There were a number of choices in the implementation:

If `right_inverse` returns a sequence of parameters, then we unpack it in the forward. This is to allow writing code like:
```python
class Sum(nn.Module):
  def forward(self, X, Y):
    return X + Y
  def right_inverse(Z):
    return Z, torch.zeros_like(Z)
```
rather than having to manually unpack a list or a tuple within the `forward` function.

At the moment the errors are a bit all over the place. This is to avoid having to check some properties of `forward` and `right_inverse` when they are registered. I left this like this for now, but I believe it'd be better to call these functions when they are registered to make sure the invariants hold and throw errors as soon as possible.

The invariants are the following:
1. The following code should be well-formed
```python
X = module.weight
Y = param.right_inverse(X)
assert isinstance(Y, Tensor) or isinstance(Y, collections.Sequence)
Z = param(Y) if isinstance(Y, Tensor) else param(*Y)
```
in other words, if `Y` is a `Sequence` of `Tensor`s (we also check that the elements of the sequence are Tensors), then it has the same length as the number of parameters `param.forward` accepts.

2. Always: `X.dtype == Z.dtype and X.shape == Z.shape`. This is to protect the user from shooting themselves in the foot, as it's too odd for a parametrization to change the metadata of a tensor.
3. If it's one-to-one: `X.dtype == Y.dtype`. This is to be able to do `X.set_(Y)` so that if a user first instantiates the optimiser and then puts the parametrisation, then we reuse `X` and the user does not need to add a new parameter to the optimiser. Alas, this is not possible when the parametrisation is many-to-one. The current implementation of `spectral_norm` and `weight_norm` does not seem to care about this, so this would not be a regression. I left a warning in the documentation though, as this case is a bit tricky.

I still need to go over the formatting of the documentation; I'll do that tomorrow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58488

Reviewed By: soulitzer

Differential Revision: D29100708

Pulled By: albanD

fbshipit-source-id: b9e91f439cf6b5b54d5fa210ec97c889efb9da38
2021-06-14 11:11:47 -07:00
ab70e1e984 [TensorExpr] Add error checking in mem_arena (#59922)
Summary:
Gives an error message (rather than a segfault) if you forget `KernelScope()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59922

Reviewed By: bertmaher

Differential Revision: D29091303

Pulled By: jansel

fbshipit-source-id: a24ee2385cae1f210b0cbc3f8860948fc052b655
2021-06-14 10:37:32 -07:00
9ad0de3c6f Rework requires_grad on DifferentiableGraphOp (#57575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57575

This PR does two things:

1. reverts "Manual revert of D27369251 (f88a3fff65) (#56080)" in commit
   92a09fb87a567100122b872613344d3a422abc9f.

2. fixing DifferentiableGraph output with wrong requires_grad flag

This fixes requires_grad on outputs from DifferentiableGraph; the proper flag is
retrieved from profiling information. We previously only retrieved the profiling
information from the first profile node among all its uses. However, when
control flow is present, we need to iteratively search for a profile node with
profiling information available, in case the first use is in an inactive code
path.

e.g.
```
  graph(%0 : Tensor,
        %1 : Bool):
  ..., %2 : Tensor = prim::DifferentiableGraph_0(%0)
  %3 : Tensor = prim::If(%1)
    block0():
      %4 : Tensor = prim::DifferentiableGraph_1(%2)
      -> (%4)
    block1():
      %5 : Tensor = prim::DifferentiableGraph_2(%2)
      -> (%5)
  -> (%3)
with prim::DifferentiableGraph_0 = graph(%0 : Tensor):
  ...
  %out : Tensor = aten::operation(...)
  ...
  return (..., %out)
with prim::DifferentiableGraph_1 = graph(%0 : Tensor):
  %temp : Tensor = prim::profile[profiled_type=Tensor](%0)
  ...
with prim::DifferentiableGraph_2 = graph(%0 : Tensor):
  %temp : Tensor = prim::profile[profiled_type=Float(...)](%0)
  ...
```

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29038773

Pulled By: Krovatkin

fbshipit-source-id: 6c0a851119f6b8f2f1afae5c74532407aae238fe
2021-06-14 10:37:31 -07:00
1f7251df90 fixing DifferentiableGraphOp updating requires_grad on input tensor list; python test added to verify the test (#57574)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57574

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29038774

Pulled By: Krovatkin

fbshipit-source-id: cb342c1b04fa3713a8166b39213437bc9f2d8606
2021-06-14 10:36:26 -07:00
cyy c50c77b444 remove unused variables (#59912)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59912

Reviewed By: soulitzer

Differential Revision: D29100518

Pulled By: albanD

fbshipit-source-id: b86a4aa9050e4fa70a0872c1d8799e5953cd2bc8
2021-06-14 10:33:48 -07:00
580a20f33b [reland] torch/lib/c10d: Use torch_check instead of throwing runtime_error (#59918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59918

Reland of https://github.com/pytorch/pytorch/pull/59684
ghstack-source-id: 131303057

Test Plan: ci

Reviewed By: cbalioglu

Differential Revision: D29081452

fbshipit-source-id: 419df79341f702e796f7adf5f1071a6cd1dcd8d1
2021-06-14 09:52:54 -07:00
3d90c82a5c [TensorExpr] Python binding improvements (#59920)
Summary:
Some minor quality of life improvements for the NNC python bindings:
- expose `call_raw()`
- support passing integers to `call()` (for dynamic shapes)
- implicit conversions to cleanup `[BufferArg(x) for x in [A, B, C]]` into just `[A, B, C]`
- don't silently default to "ir_eval" for unknown mode (e.g. "LLVM")

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59920

Reviewed By: ZolotukhinM

Differential Revision: D29090904

Pulled By: jansel

fbshipit-source-id: 154ace82725ae2046cfe2e6eb324fd37f5d209a7
2021-06-14 09:31:40 -07:00
68d690ffbd Vectorize the softmax calculation when not along the last dim (#59195)
Summary:
Currently, a softmax that is not along the last dim falls back to a [scalar version](d417a094f3/aten/src/ATen/native/SoftMax.cpp (L14-L64)). We actually have the chance to vectorize the calculation along the inner_size dim (see the sketch after the numbers below).

Changes we made:

- Use vectorized softmax_kernel instead of host_softmax when not along the last dim.

Performance data on a 28-core Intel 8280 CPU with input size [32, 81, 15130], doing softmax along the second dim (81).

- FP32 Baseline: 24.67 ms
- FP32 optimized: 9.2 ms
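
A hedged sketch of the idea (not the actual ATen kernel): with the input laid out as [outer_size][dim_size][inner_size] and the reduction over the middle dim, the running statistics can be kept per inner index so the hot loop walks contiguous memory:
```
#include <algorithm>
#include <cstdint>
#include <limits>

// Running max over the softmax dim, one slot per inner index. The inner
// loop is a contiguous walk the compiler can vectorize, unlike the strided
// scalar walk of host_softmax.
void running_max_sketch(const float* in, float* max_buf, int64_t outer,
                        int64_t dim_size, int64_t inner_size) {
  const float* base = in + outer * dim_size * inner_size;
  std::fill(max_buf, max_buf + inner_size,
            -std::numeric_limits<float>::infinity());
  for (int64_t k = 0; k < dim_size; ++k) {
    const float* row = base + k * inner_size;
    for (int64_t i = 0; i < inner_size; ++i) {  // contiguous, vectorizable
      max_buf[i] = std::max(max_buf[i], row[i]);
    }
  }
}
```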

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59195

Reviewed By: ailzhang

Differential Revision: D28854796

Pulled By: cpuhrsch

fbshipit-source-id: 18477acc3963754c59009b1794f080496ae16c3d
2021-06-14 07:54:11 -07:00
d60d81b5a7 Make PyObject_FastGetAttrString accept const char* (#59758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59758

The underlying call to tp_getattr is const safe but CPython
has not fixed it due to BC problems.  No reason not to advertise
the better type here though!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29017911

Pulled By: ezyang

fbshipit-source-id: 8d55983fe6416c03eb69c6367bcc431c30000133
2021-06-14 07:24:16 -07:00
700add0737 Fix expecttest accept on Python 3.8 and later (#59709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59709

Fixes #59705.

Python 3.8 fixed tracebacks to report the beginning of the line
that raised an error, rather than the end. This makes for a simpler
implementation (no more string reversing), but we needed to actually
implement it. This wasn't caught by tests because we hard-coded line
numbers to do substitutions, so I also added a little smoketest to
detect future changes to traceback line number behavior.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28994919

Pulled By: ezyang

fbshipit-source-id: 1fb0a782e17c55c13d668fabd04766d2b3811962
2021-06-14 07:23:12 -07:00
cf38b20c61 Alias for digamma as psi to special namespace (#59143)
Summary:
See https://github.com/pytorch/pytorch/issues/50345

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59143

Reviewed By: jbschlosser

Differential Revision: D28986909

Pulled By: mruberry

fbshipit-source-id: bc8ff0375de968f3662b224689fa0a6b117f9c4e
2021-06-14 03:05:14 -07:00
ff15d93b88 Improve numerical stability of GroupNorm (#54921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54921

Improve numerical stability of GroupNorm

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Reviewed By: ngimel

Differential Revision: D27414438

fbshipit-source-id: 815517240ca5ea3e2beb77ced3bd862e9c83d445
2021-06-13 16:13:32 -07:00
095cd6a0da MemoryOverlap: Avoid has_storage calls (#59013)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59013

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29040929

Pulled By: ngimel

fbshipit-source-id: 69745e7abbaf523795a90f68cf01d3d94508210e
2021-06-13 12:31:22 -07:00
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
92513038e8 Revert D28994140: [pytorch][PR] Implemented torch.cov
Test Plan: revert-hammer

Differential Revision:
D28994140 (23c232554b)

Original commit changeset: 1890166c0a9c

fbshipit-source-id: 73dfe1b00464e38f004f99960cdeeb604ed4b20a
2021-06-13 02:33:37 -07:00
0ceea7faf4 Refactor SavedVariable (#59836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59836

Preparing for #58500

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29069159

fbshipit-source-id: dd4d870c8ae10a4bd7f12be127e093f60fa072fa
2021-06-12 23:21:36 -07:00
d03ff1a17d pre compute regex and match simple signature autograd codegen 15s -> 12s (#59852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59852

This whole stack does not change anything in the codegened code

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29063814

Pulled By: albanD

fbshipit-source-id: a751047526f8d58f4760ee6f9ae906675bed5d75
2021-06-12 06:58:36 -07:00
30a18fe318 refactor yaml loader import, no runtime change (#59850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59850

This whole stack does not change anything in the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063816

Pulled By: albanD

fbshipit-source-id: ca3067443d8e6282c1077d3dafa3b4f330d43b28
2021-06-12 06:58:34 -07:00
c60d1ac9cf Use C dumper if possible aten codegen 23s -> 13s (#59849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59849

This whole stack does not change anything in the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063815

Pulled By: albanD

fbshipit-source-id: c4baa72594bd2fe50ac67f513916f2b2ccb7488c
2021-06-12 06:58:32 -07:00
504ec30109 avoid error string formatting aten codegen 28s -> 23s (#59848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59848

This whole stack does not change anything in the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063818

Pulled By: albanD

fbshipit-source-id: c68734672eeacd212d7bd9bebe3d53aaa20c3c24
2021-06-12 06:58:31 -07:00
7143a6a189 Avoid unnecessary re-computation autograd codegen 21s -> 15s (#59847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59847

This whole stack does not change anything in the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063817

Pulled By: albanD

fbshipit-source-id: 284c3e057029b7a67f43a1b034bb30863bd68c71
2021-06-12 06:57:19 -07:00
1f6e39336f Simplify parametrizations.SpectralNorm and improve its initialization (#59564)
Summary:
Implements a number of changes discussed with soulitzer offline.
In particular:
- Initialise `u`, `v` in `__init__` rather than in `_update_vectors`
- Initialise `u`, `v` to some reasonable vectors by doing 15 power iterations at the start
- Simplify the code of `_reshape_weight_to_matrix` (and make it faster) by using `flatten`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59564

Reviewed By: ailzhang

Differential Revision: D29066238

Pulled By: soulitzer

fbshipit-source-id: 6a58e39ddc7f2bf989ff44fb387ab408d4a1ce3d
2021-06-11 19:52:44 -07:00
10a3a3d363 Fix bad change in a CUDACachingAllocator loop (#59903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59903

D29034650 (cf0c4ac258) probably breaks something because it changes a `for` loop on ~Line 1200 from `[size,max)` to `[0,max)`. This fixes that.
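
Illustrative only (the real loop lives in CUDACachingAllocator; names are placeholders):
```
#include <cstddef>

// Restore the intended half-open range [size, max) instead of the
// accidental [0, max): only entries beyond `size` should be processed.
void process_tail(size_t size, size_t max) {
  for (size_t i = size; i < max; ++i) {
    // ... per-entry work ...
  }
}
```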

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29081688

fbshipit-source-id: 21f08e3f244fc02cf97d137b3cc80d4378d17185
2021-06-11 18:20:07 -07:00
e49f0f4ffd Automated submodule update: FBGEMM (#59874)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ae8ad8fd04

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59874

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D29064980

fbshipit-source-id: 593f08361817fb771afcf2732f0f647d7c2c72c3
2021-06-11 17:50:40 -07:00
3529a48ebb Revert D28981326: torch/lib/c10d: Use torch_check instead of throwing runtime_error
Test Plan: revert-hammer

Differential Revision:
D28981326 (6ea6075002)

Original commit changeset: 264a7f787ea8

fbshipit-source-id: 75625b76dfbd0cbaf59705d621ef9e2d1677c482
2021-06-11 17:17:10 -07:00
f3218568ad optimize channels last for BatchNorm2d on CPU (#59286)
Summary:
Replacement of https://github.com/pytorch/pytorch/issues/48919.
Optimizes channels-last performance for BatchNorm2d on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59286

Reviewed By: bdhirsh

Differential Revision: D29008198

Pulled By: VitalyFedyunin

fbshipit-source-id: 8a7d020bd6a42ab5c21ffe788b79a22f4ec82ac0
2021-06-11 16:30:16 -07:00
864d129bae [quant][fx] Remove extra q-dq for weight bias in normalization ops (#59882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59882

Currently, for normalization ops, the weight and bias arguments are treated as activation inputs, which require observers.
This results in adding extra quant-dequant ops for the weight and bias inputs.

This PR adds support to skip observing weight/bias inputs of norm operators, thus removing the redundant q-dq ops

Quantized graph with F.layer_norm
Before this PR
```
def forward(self, x):
    _input_scale_0 = self._input_scale_0
    _input_zero_point_0 = self._input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, _input_scale_0, _input_zero_point_0, torch.quint8);  x = _input_scale_0 = _input_zero_point_0 = None
    scale = self.scale
    _input_scale_1 = self._input_scale_1
    _input_zero_point_1 = self._input_zero_point_1
    quantize_per_tensor_1 = torch.quantize_per_tensor(scale, _input_scale_1, _input_zero_point_1, torch.quint8);  scale = _input_scale_1 = _input_zero_point_1 = None
    bias = self.bias
    _input_scale_2 = self._input_scale_2
    _input_zero_point_2 = self._input_zero_point_2
    quantize_per_tensor_2 = torch.quantize_per_tensor(bias, _input_scale_2, _input_zero_point_2, torch.quint8);  bias = _input_scale_2 = _input_zero_point_2 = None
    _scale_0 = self._scale_0
    _zero_point_0 = self._zero_point_0
    dequantize = quantize_per_tensor_1.dequantize();  quantize_per_tensor_1 = None
    dequantize_1 = quantize_per_tensor_2.dequantize();  quantize_per_tensor_2 = None
    layer_norm = torch.ops.quantized.layer_norm(quantize_per_tensor, [2, 5, 5], weight = dequantize, bias = dequantize_1, eps = 1e-05, output_scale = _scale_0, output_zero_point = _zero_point_0);  quantize_per_tensor = dequantize = dequantize_1 = _scale_0 = _zero_point_0 = None
    dequantize_2 = layer_norm.dequantize();  layer_norm = None
    return dequantize_2
```
After
```
def forward(self, x):
    _input_scale_0 = self._input_scale_0
    _input_zero_point_0 = self._input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, _input_scale_0, _input_zero_point_0, torch.quint8);  x = _input_scale_0 = _input_zero_point_0 = None
    scale = self.scale
    bias = self.bias
    _scale_0 = self._scale_0
    _zero_point_0 = self._zero_point_0
    layer_norm = torch.ops.quantized.layer_norm(quantize_per_tensor, [2, 5, 5], weight = scale, bias = bias, eps = 1e-05, output_scale = _scale_0, output_zero_point = _zero_point_0);  quantize_per_tensor = scale = bias = _scale_0 = _zero_point_0 = None
    dequantize = layer_norm.dequantize();  layer_norm = None
    return dequantize
```

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_norm_weight_bias

Imported from OSS

Reviewed By: HDCharles, ailzhang

Differential Revision: D29068203

fbshipit-source-id: 24b5c38bbea5fd355d34522bfa654c9db18607da
2021-06-11 16:22:36 -07:00
60eb22e45e Build an -Wextra around c10 (#59853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59853

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29016682

fbshipit-source-id: f6c5f32464d57dbd60b59b5f9e2234ef2c39f1c1
2021-06-11 16:12:21 -07:00
e41bc31eb2 make --run-specified-test-case use --include (#59704)
Summary:
Instead of having specific logic to handle run-specified-test-case, we provide the flag to override include or bring-to-front with the SPECIFIED_TEST_CASES_FILE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59704

Reviewed By: janeyx99

Differential Revision: D29038425

Pulled By: walterddr

fbshipit-source-id: 803d3555813437c7f287a22f7704106b0c609919
2021-06-11 13:57:13 -07:00
cf0c4ac258 Fix some issues in CUDACachingAllocator (#59819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59819

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29034650

fbshipit-source-id: 7e9689fc1ae121432e9421fa4a9ae00f7f78caca
2021-06-11 13:15:27 -07:00
b83ac0cc4e [nnc] Added a check to vectorize only those loops that are normalized. (#59423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59423

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886979

Pulled By: navahgar

fbshipit-source-id: edfc61feaf5efe22d4f367ac718b83b3d0f47cb3
2021-06-11 12:03:34 -07:00
30e24b2d2b [nnc] Modified vectorize API to return bool (#59422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59422

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886980

Pulled By: navahgar

fbshipit-source-id: 58cc3ecd86564a312a132f8260d836b096505095
2021-06-11 12:02:19 -07:00
a9e136a61e Remove ci/no-build (#59889)
Summary:
This reverts https://github.com/pytorch/pytorch/issues/58778, since triggering our primary CircleCI workflow only via pytorch-probot has been causing more problems than it's worth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59889

Reviewed By: walterddr, seemethere

Differential Revision: D29070418

Pulled By: samestep

fbshipit-source-id: 0b47121b190c2e9efa27f38000ca362e634876dc
2021-06-11 11:55:56 -07:00
f4fdc49957 [NNC] Add python bindings for loopnest.compress_buffer (#59681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59681

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28981573

Pulled By: huiguoo

fbshipit-source-id: 003d66df576903c71bf46c95851fe6ccbba76f29
2021-06-11 11:28:39 -07:00
ee3025f734 Give clearer lint error messages (#59876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59876

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29067747

Pulled By: samestep

fbshipit-source-id: cce7195467b5f9286d55a9d0c1655b4f92d4fbaf
2021-06-11 11:25:42 -07:00
6ea6075002 torch/lib/c10d: Use torch_check instead of throwing runtime_error (#59684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59684

Same reasoning as in the below diff.
ghstack-source-id: 131167212

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981326

fbshipit-source-id: 264a7f787ea8be76f743a2eaca67ae1d3bd8073a
2021-06-11 11:16:58 -07:00
d433a55c94 Replace throw std::runtime_error with torch_check in torch/csrc/distributed (#59683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683

Replaces usages of throw std::runtime_error("foo") with the better
torch_check(false, "foo") which allows C++ stacktraces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This will hopefully provide much better debugging
information when debugging crashes/flaky tests.
ghstack-source-id: 131167210

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981327

fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
2021-06-11 11:15:49 -07:00
9cdbddb3f7 Fix Vectorize<float>::trunc on ARM platform (#59858)
Summary:
Use `vrndq_f32`, which corresponds to the `VRINTZ` instruction that rounds a floating-point value towards zero, matching `std::trunc` behaviour.
This makes the trunc implementation correct even for values that fit into float32 but cannot be converted to int32, for example `-1.0e+20`; see the following [gist](https://gist.github.com/malfet/c612c9f4b3b5681ca1b2a69930825871):
```
inp= 3.1 2.7 -2.9 -1e+20
old_trunc= 3 2 -2 -2.14748e+09
new_trunc= 3 2 -2 -1e+20
```
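
A small self-contained sketch of the fix (ARMv8 NEON):
```
#include <arm_neon.h>

// vrndq_f32 lowers to the round-toward-zero instruction on each lane,
// matching std::trunc. Unlike the old int32 round-trip, lanes whose
// magnitude does not fit in int32 (e.g. -1.0e+20f) pass through unchanged.
float32x4_t trunc4(float32x4_t v) {
  return vrndq_f32(v);
}
```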

Fixes `test_reference_numerics_hard_trunc_cpu_float32` on M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59858

Reviewed By: kimishpatel

Differential Revision: D29052008

Pulled By: malfet

fbshipit-source-id: 6b567f39151538be1aa3890e3b4e1e978e598657
2021-06-11 10:55:45 -07:00
2ce21b2e61 [Pytorch backend delegation] Preprocess to accept (#58873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58873

BackendDebugInfoRecorder

Prior to this PR:
In order to generate debug handles corresponding to the graph being lowered, the backend's preprocess will call generate_debug_handles and get a map of Node* to debug handles. To facilitate this, to_backend will own a BackendDebugInfoRecorder and initialize a thread-local pointer to it. The generate_debug_handles function will query the thread-local pointer to see if there is a valid BackendDebugInfoRecorder for the context; if there is, it will generate debug handles.

After this PR:
The signature of preprocess is changed such that backends have to register a preprocess that accepts an instance of BackendDebugInfoRecorder by reference. generate_debug_handles is no longer a free function but becomes part of the API of BackendDebugInfoRecorder. Now the backend's preprocess function will call generate_debug_handles on the BackendDebugInfoRecorder instead of the free function.

Reason for this change:
RAII that initializes a thread-local pointer results in a loose contract with backends, which may result in backends not storing debug information. Making it part of the API means backends have to be aware of BackendDebugInfoRecorder and must explicitly choose not to generate/store debug information if they decide to do so.
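
A hedged sketch of the post-change contract; other than the recorder-by-reference idea and the generate_debug_handles call named in the summary, every type, header, and argument here is an assumption:
```
#include <torch/csrc/jit/api/module.h>

// Hypothetical backend preprocess after this PR: the recorder is passed in
// explicitly, and debug handles are requested from it rather than from a
// free function backed by a thread-local pointer.
c10::IValue my_preprocess(
    const torch::jit::Module& mod,
    const c10::Dict<c10::IValue, c10::IValue>& method_compile_spec,
    torch::jit::BackendDebugInfoRecorder& recorder) {  // type path assumed
  auto graph = mod.get_method("forward").graph();
  // Map of Node* -> debug handle, per the summary; exact return type assumed.
  auto handles = recorder.generate_debug_handles(graph);
  (void)method_compile_spec;
  (void)handles;
  // ... lower `graph` for the backend, tagging ops with their handles ...
  return mod._ivalue();
}
```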

Test Plan:
backend tests

Imported from OSS

Reviewed By: jbschlosser, raziel

Differential Revision: D28648613

fbshipit-source-id: c9b7e7bf0f78e87023ea7bc08612cf893b08cb98
2021-06-11 10:16:00 -07:00
23c232554b Implemented torch.cov (#58311)
Summary:
Based from https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov` similar to `numpy.cov`. For simplicity, we removed support for many parameters in `numpy.cov` that are either redundant such as `bias`, or have simple workarounds such as `y` and `rowvar`.

cc PandaBoi

TODO

- [x] Improve documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: mruberry

Differential Revision: D28994140

Pulled By: heitorschueroff

fbshipit-source-id: 1890166c0a9c01e0a536acd91571cd704d632f44
2021-06-11 09:40:50 -07:00
ba09355b12 Upgrade Windows CI Python to 3.8 (#59729)
Summary:
Python 3.6 reaches EOL at the end of this year; we should use a newer Python in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59729

Reviewed By: bdhirsh

Differential Revision: D29006807

Pulled By: janeyx99

fbshipit-source-id: c79214b02a72656058ba5d199141f8838212b3b6
2021-06-11 09:09:24 -07:00
d75e99b709 fx quant: enable qconfig_dict to target function invocations by order (#59605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59605

Enables targeting of individual function invocations by execution order.
For example, given a module such as

```
class M1(torch.nn.Module):
  def forward(self, x):
    x = torch.add(x, x)
    x = torch.add(x, x)
    return x

class M2(torch.nn.Module):
  def __init__(self):
    self.m1 = M1()

  def forward(self, x):
    x = self.m1(x)
    return x
```

We can now target the first add of `m1` with

```
qconfig_dict = {
  "module_name_function_order": ("m1", torch.add, 0, custom_qconfig),
}
```

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qconfig_module_name_function_order
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D28951077

fbshipit-source-id: 311d423724a31193d4fa4bbf3a712b46464b5a29
2021-06-11 08:53:40 -07:00
e6110d4d5d Fix input_buffer check if inplace update is valid (#59817)
Summary:
Fixes an issue introduced in  https://github.com/pytorch/pytorch/issues/17182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59817

Reviewed By: bdhirsh

Differential Revision: D29040738

Pulled By: albanD

fbshipit-source-id: 67fd4e9fa0dadf507ddd954d20e119d8781c4de0
2021-06-11 07:29:03 -07:00
c9e4d1372f Add guards for USE_C10D_FOO in relevant c10d files (#59697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59697

The c10d build process selectively adds files based on the `USE_C10D_FOO` flags (where `FOO` is one of `GLOO`, `NCCL` or `MPI`). Replicating this logic inside libtorch will be harder, since libtorch uses a simpler approach (i.e., it lists the files in `build_variables.bzl`). So instead we could always include all files, and "disable" each file as needed using `#ifdef`s. Note that this is not a new approach: we already do the same for all the files of the TensorPipe agent based on the flag `USE_TENSORPIPE`.
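
The pattern in question, sketched (file and flag pairing illustrative):
```
// ProcessGroupGloo.cpp (sketch): the translation unit is always listed in
// build_variables.bzl and always compiled, but its body compiles away unless
// the flag is set, mirroring what USE_TENSORPIPE already does.
#ifdef USE_C10D_GLOO

// ... entire implementation of the Gloo process group ...

#endif  // USE_C10D_GLOO
```
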
ghstack-source-id: 131169540

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28987577

fbshipit-source-id: 4c6195de4e9a58101dad9379537e8d055dfd38af
2021-06-11 05:06:42 -07:00
773b56e719 Fix Windows guards in c10d (#59696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59696

Some files in c10d refer to dist autograd. However, on Windows, dist autograd isn't built. Hence we need to "mask out" those references under Windows. This was already partly done, but when moving c10d to libtorch some issues came up, possibly due to the different way in which linking happens. Hence I masked out the remaining references.
ghstack-source-id: 131169541

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28987579

fbshipit-source-id: c29c5330f8429d699554972d30f99a89b2e3971d
2021-06-11 05:06:40 -07:00
cbcae46fa5 Remove USE_CUDA from c10d reducer/logger (#59562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59562

Needed to merge c10d into libtorch(_cuda).

ghstack-source-id: 131169542

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28931378

fbshipit-source-id: 71376b862ff6ef7dbfa7331ec8d269bd3fcc7e0d
2021-06-11 05:06:39 -07:00
b4c35d7ae7 Remove USE_CUDA from ProcessGroupGloo (#59561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59561

Needed to merge c10d into libtorch(_cuda).

ghstack-source-id: 131169544

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28931379

fbshipit-source-id: 9bd68477ae7bb870b6737a555edd5696149ff5d6
2021-06-11 05:05:31 -07:00
b5e832111e [nnc] Limit the number of inputs to a fusion group.
Summary:
nvrtc has a hard limit on the size of kernel parameters, and llvm has
a tendency to OOM with huge parameter lists, so let's limit the number of
inputs to something sensible.

Test Plan:
tested on pyper OOM test case:
```
flow-cli test-locally --mode=opt-split-dwarf f278102738 --name "PyPer OOM repro f277966799 f63b1f9c5c0c" --run-as-secure-group oncall_pytorch_jit --entitlement default
```

Reviewed By: ZolotukhinM

Differential Revision: D29019751

fbshipit-source-id: b27f2bb5000e31a7b49ea86a6928faa0ae2ead24
2021-06-11 02:25:16 -07:00
df759a3d9e [nnc] Do not fuse matmul/conv2d if inputs are discontiguous. (#59754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59754

Also, if inputs are contiguous, use their Placeholders
directly rather than generating contiguous Tensors from them.

The rationale for this change is that aten::matmul and aten::conv2d
support transposed inputs; if NNC generates a physical transpose to
perform an external call, performance will be strictly worse than not
fusing (sometimes dramatically so, as in the attached benchmark).
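For illustration, a minimal Python sketch of the discontiguous case in question (exposition only, not code from this PR): a transposed view is non-contiguous, and matmul consumes it directly.

```
import torch

a = torch.randn(64, 32)
b = torch.randn(48, 32).t()  # transposed view, shape (32, 48); not contiguous
print(b.is_contiguous())     # False
c = torch.matmul(a, b)       # matmul accepts the transposed input without a physical copy
print(c.shape)               # torch.Size([64, 48])
```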

Test Plan: benchmark

Reviewed By: ZolotukhinM

Differential Revision: D29010209

fbshipit-source-id: da6d71b155c83e8d6e306089042b6b0af8f80900
2021-06-11 02:23:47 -07:00
4b91355232 [ONNX] remove raw export type (#59160)
Summary:
[ONNX] remove raw export type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59160

Reviewed By: tugsbayasgalan

Differential Revision: D28937039

Pulled By: SplitInfinity

fbshipit-source-id: 79bf91605526aa32a7304e75f50fe55d872bd4e8
2021-06-11 00:08:06 -07:00
2112074f25 [Static Runtime] Add schema check to several aten ops (#59603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59603

D28698997 (10345010f7) was reverted because I forgot to replace the
```
  VLOG(1) << "Found schema mismatch";
  n->schema().dump();
```
block in `aten::clamp_min` with `LogAndDumpSchema(n)`, and that led the bazel build to fail. I don't know why it breaks the bazel build, though.

Test Plan: OSS CI.

Reviewed By: ajyu

Differential Revision: D28950177

fbshipit-source-id: 9bb1c6619e6b68415a3349f04933c2fcd24cc9a2
2021-06-10 23:39:00 -07:00
6eabbea47c Disable cuDNN persistent RNN on A30 (#59830)
Summary:
https://github.com/pytorch/pytorch/issues/59829

cherry-picked from ptrblck 's change CC ngimel xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59830

Reviewed By: bdhirsh

Differential Revision: D29046145

Pulled By: ngimel

fbshipit-source-id: 270ab3bb6c1c7c759497a15eb38b20a177c94adb
2021-06-10 22:07:28 -07:00
455afdf974 Automated submodule update: FBGEMM (#59715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59715

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 0520ad5f95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59687

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D28986238

Pulled By: jspark1105

fbshipit-source-id: 12f68830b5b7a858fbc301af50593281852af51f
2021-06-10 21:53:30 -07:00
c7890b4a8e [package] doc string cleanup extravaganza (#59843)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59843

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D29049342

Pulled By: Lilyjjo

fbshipit-source-id: 3330fb439f28dda0cafef5797ff61311f4afbf76
2021-06-10 21:21:48 -07:00
54bfd41a2e Fix torch.angle on aarch64 (#59832)
Summary:
angle should return 0 for positive values, pi for negative values, and keep NaNs in place, which can be accomplished using two blendv functions.

Fixes a number of unary test failures on M1/aarch64.
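A quick Python check of the behavior described above (a minimal sketch, not part of this PR):

```
import torch

x = torch.tensor([2.0, -3.0, float("nan")])
# 0 for positive, pi for negative, NaN stays in place:
print(torch.angle(x))  # tensor([0.0000, 3.1416, nan])
```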

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59832

Reviewed By: kimishpatel

Differential Revision: D29046402

Pulled By: malfet

fbshipit-source-id: cb93ad2de140f7a54796387fc11053c507a1d4e9
2021-06-10 20:48:41 -07:00
4025f95a20 [docs] Add table of contents to torch.package docs (#59842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59842

Test Plan:
Continuous integration.

<img width="544" alt="Captura de Pantalla 2021-06-10 a la(s) 5 13 07 p  m" src="https://user-images.githubusercontent.com/4392003/121612390-2ccec280-ca0f-11eb-87ad-fef632ba05ca.png">

Reviewed By: Lilyjjo

Differential Revision: D29050627

Pulled By: SplitInfinity

fbshipit-source-id: 76c25ed4002cbaf072036e2e14e7857c15077df7
2021-06-10 19:52:50 -07:00
0e222db087 [docs] Add explanation section to torch.package docs (#59833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59833

**Summary**
This commit adds an explanation section to the `torch.package`
documentation. This section clarifies and illuminates various aspects of
the internals of `torch.package` that might be of interest to users.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050626

Pulled By: SplitInfinity

fbshipit-source-id: 78e0cda00f69506ef2dfc52d6df63694b502269e
2021-06-10 19:52:48 -07:00
062dde7285 [docs] Add "how do I" section to torch.package docs (#59503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59503

**Summary**
This commit adds a "how do I..." section to the `torch.package`
documentation. This section contains short guides about how to solve
real-world problems that frequently recur while using `torch.package`.

**Test Plan**
Continuous integration.

<img width="877" alt="Captura de Pantalla 2021-06-04 a la(s) 9 19 54 p  m" src="https://user-images.githubusercontent.com/4392003/120879911-98321380-c57b-11eb-8664-c582c92b7837.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050629

Pulled By: SplitInfinity

fbshipit-source-id: 2b7800732e0a3c1c947f110c05562aed5174a87f
2021-06-10 19:52:47 -07:00
6a18ca7a07 [docs] Add tutorials section to torch.package docs (#59499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59499

**Summary**
This commit adds a tutorials section to the torch.package docs.

**Test Plan**
Continuous integration.

<img width="870" alt="Captura de Pantalla 2021-06-04 a la(s) 5 10 31 p  m" src="https://user-images.githubusercontent.com/4392003/120874257-b9ced300-c55a-11eb-84dd-721cb7ac73ab.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050628

Pulled By: SplitInfinity

fbshipit-source-id: c17ab0100a9d63e7af8da7a618143cedbd0a5872
2021-06-10 19:52:45 -07:00
a3db8e0a26 [docs] Add torch.package documentation preamble (#59491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59491

**Summary**
This commit adds a preamble to the `torch.package` documentation page
that explains briefly what `torch.package` is.

**Test Plan**
Continous integration.

<img width="881" alt="Captura de Pantalla 2021-06-04 a la(s) 3 57 01 p  m" src="https://user-images.githubusercontent.com/4392003/120872203-d535e000-c552-11eb-841d-b38df19bc992.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050630

Pulled By: SplitInfinity

fbshipit-source-id: 70a3fd43f076751c6ea83be3ead291686c641158
2021-06-10 19:51:37 -07:00
a524ee00ca Forward AD formulas batch 3 (#59711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59711

This is the exact same PR as before.
This was reverted before the PR below was faulty.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28995762

Pulled By: albanD

fbshipit-source-id: 65940ad93bced9b5f97106709d603d1cd7260812
2021-06-10 19:30:02 -07:00
8a7c0d082f ger is an alias to outer, not the other way around (#59710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59710

This is the exact same PR as before.
The version that landed was actually outdated compared to the github PR and that's why it failed on master... Sorry for the noise.
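For reference, a minimal check of the alias relationship (illustrative only):

```
import torch

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0, 5.0])
# outer is the canonical op; ger is documented as its alias.
print(torch.equal(torch.ger(x, y), torch.outer(x, y)))  # True
```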

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28995764

Pulled By: albanD

fbshipit-source-id: 8f7ae3356a886d45787c5e6ca53a4e7b033e306e
2021-06-10 19:28:53 -07:00
c2c35c0170 [Binary] Link whole CuDNN for CUDA-11.1 (#59802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59802

Reviewed By: driazati, seemethere

Differential Revision: D29033537

Pulled By: malfet

fbshipit-source-id: e816fc71f273ae0b4ba8a0621d5368a2078561a1
2021-06-10 16:54:53 -07:00
60ba451731 [torch] Remove using directive from header (#59728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59728

I noticed Sandcastle jobs failing with:

```
fbcode/caffe2/torch/csrc/api/include/torch/nn/modules/rnn.h:19:35: error: using namespace directive in global context in header [-Werror,-Wheader-hygiene]
using namespace torch::nn::utils::rnn;
```

(cf. V3 of D28939167 or https://www.internalfb.com/intern/sandcastle/job/36028797455955174/).

Removing `using namespace ...` fixes the problem.

~~... also applied code formatting ...~~

Test Plan: Sandcastle

Reviewed By: jbschlosser

Differential Revision: D29000888

fbshipit-source-id: 10917426828fc0c82b982da435ce891dc2bb6eec
2021-06-10 15:13:07 -07:00
e9e9291dc1 [After fix] Reuse constant and bump bytecode to v5 (#59722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59722

Reintroduce sharing constant between bytecode and torchscript (same as #58629) after the fix #59642

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29002345

Pulled By: cccclai

fbshipit-source-id: d9c8e474ff57d0509580183206df038a24ad27e3
2021-06-10 15:03:16 -07:00
ac6b5beade [torch][segment_reduce] Add support for mean reduction (cpu) (#59521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59521

This diff adds support for mean reduction on CPU (fwd + bwd).

Will add the CUDA implementation in a subsequent PR. We are using "cub::DeviceSegmentedReduce" for the other aggregations and are exploring how to support mean with it, or we will write a custom kernel for it.
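A minimal sketch of the segmented-mean semantics being added (plain Python for exposition; not the actual kernel or API):

```
def segment_mean(data, lengths):
    # Reduce each contiguous segment of `data` (sized by `lengths`) to its mean.
    out, i = [], 0
    for n in lengths:
        out.append(sum(data[i:i + n]) / n)
        i += n
    return out

print(segment_mean([1.0, 2.0, 3.0, 4.0, 6.0], [2, 3]))  # [1.5, 4.333...]
```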

Next Steps:
- cuda support for mean
- 2d data input support
- more testing
- benchmarking

Test Plan: Updated unit test. Still relying on manual data for ease of debugging. Will add more tests that cover edge cases once major features are complete.

Reviewed By: ngimel

Differential Revision: D28922547

fbshipit-source-id: 2fad53bbad2cce714808ff95759cbdbd45bb4ce6
2021-06-10 14:21:31 -07:00
e71db0bb82 .jenkins: Ignore exit code of nvidia-smi (#59826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59826

It's only informational and will run on Windows CPU executors as well

Fixes issues found in https://github.com/pytorch/pytorch/runs/2797531966

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29042951

Pulled By: seemethere

fbshipit-source-id: 862094e53417c0a59d7728bf680be37b806b5a6f
2021-06-10 14:16:32 -07:00
e7ad82eb2f [DataLoader] Add option to refine type during runtime validation for DP instance (#56066)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56066

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27776646

Pulled By: ejguan

fbshipit-source-id: 695ff7775177653d809c5917d938c706281e1298
2021-06-10 14:04:24 -07:00
e2c784d940 [reland] .github: Add Windows GPU workflow (#58782) (#59752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59752

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29009775

Pulled By: seemethere

fbshipit-source-id: 5be1b818b5653a4fdbfe4a79731317068dc1a5d1
2021-06-10 13:38:32 -07:00
54cc477ea3 .github: Ensure cleaner windows workspace (#59742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59742

It looks like Windows workers were failing due to some leftovers
from previous builds; this should hopefully remedy some of those errors.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29009076

Pulled By: seemethere

fbshipit-source-id: 426d54df14ec580cb24b818c48e2f4bd36159181
2021-06-10 13:37:22 -07:00
0099c25b85 fx quant: remove some dead code in observer insertion (redo) (#59799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59799

This is a redo of #58574, easier to create a new PR than to fix rebase
conflicts, as there have been a large number of refactors to the
underlying code.

Removes some code which was incorrectly added by #57519 but never
actually used for anything.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29031955

fbshipit-source-id: f407d181070cb283382965952821e3647c705544
2021-06-10 12:57:09 -07:00
fb620a27d0 [WIP] Add slow gradcheck build for the ci/slow-gradcheck label (#59020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59020

Reviewed By: bdhirsh

Differential Revision: D29036891

Pulled By: albanD

fbshipit-source-id: b1f87b2cb38642097ad4079d1e818fa5997bedb4
2021-06-10 12:29:57 -07:00
cc32dcadd9 Fix Error when run python setup.py install again on Windows (#59689)
Summary:
Fix https://github.com/pytorch/pytorch/issues/59688

For now, `build.ninja` should be removed before building the source code on Windows each time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59689

Reviewed By: bdhirsh

Differential Revision: D29032960

Pulled By: walterddr

fbshipit-source-id: 2b8162cd119820d3b6d8715745ec29b9c381e01f
2021-06-10 12:22:21 -07:00
1fc3576d97 Fixing and enabling tests that check fake_quant matches quant+dequant (#59095)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59095

These tests were disabled; I'm unsure as to why. I've
re-enabled them and reworked them to expand testing to different devices
and dtypes.

Test Plan:
python test/test_quantization.py TestFakeQuantizeOps.test_numerical_consistency

Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29018745

fbshipit-source-id: 28188f32bafd1f1704c00ba49d09ed719dd1aeb2
2021-06-10 12:16:54 -07:00
c90260905f [fix] torch.{lin, log}space(): properly examine passed dtype (#53685)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53685

Reviewed By: jbschlosser

Differential Revision: D28331863

Pulled By: anjali411

fbshipit-source-id: e89359b607d058158cfa1c9a82389d9a4a71185b
2021-06-10 11:59:54 -07:00
9bcef86d18 Split slow gradcheck periodic CI job so that it does not time out (#59736)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59736

Reviewed By: albanD

Differential Revision: D29008100

Pulled By: soulitzer

fbshipit-source-id: 76da971356fd985dfbfa56d3573f31ef04701773
2021-06-10 11:32:36 -07:00
f240624080 displays graph node's info (#59679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59679

Displays info about graph's nodes

Test Plan:
Expected view:

%wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	i2: int {1}
	o0: Tensor CPUFloatType {32, 50}
%wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	o0: Tensor CPUFloatType {32, 50}
%wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
	i0: Tensor CPUFloatType {32, 50}
	i1: double {0}
	i2: double {10}
	o0: Tensor CPUFloatType {32, 50}
%user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: int {1}
	i2: int {2}
	o0: Tensor CPUFloatType {32, 32, 1}
%dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: Tensor CPUFloatType {32, 32, 1}
	o0: Tensor CPUFloatType {32, 1, 1}
%31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
	i0: Tensor CPUFloatType {32, 1, 1}
	i1: int {1}
	i2: int {-1}
	o0: Tensor CPUFloatType {32, 1}
%19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
	i0: Tensor CPUFloatType {32, 1}
	i1: Tensor CPUFloatType {32, 50}
	o0: TensorList {2}
%input.1 : Tensor = aten::cat(%19, %4)
	i0: TensorList {2}
	i1: int {1}
	o0: Tensor CPUFloatType {32, 51}
%fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
	i0: Tensor CPUFloatType {1}
	i1: Tensor CPUFloatType {32, 51}
	i2: Tensor CPUFloatType {51, 1}
	i3: int {1}
	i4: int {1}
	o0: Tensor CPUFloatType {32, 1}
%23 : Tensor = aten::sigmoid(%fc1.1)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tensor CPUFloatType {32, 1}
%24 : (Tensor) = prim::TupleConstruct(%23)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tuple {1}

Reviewed By: hlu1

Differential Revision: D28592852

fbshipit-source-id: 09174014f7d0ce25c511025d2b376f14e16c8a4a
2021-06-10 10:33:30 -07:00
7af9252ed7 [skip ci] export_slow_tests.py - Add option to ignore small differences (#59759)
Summary:
This lowers the number of unnecessary commits to pytorch/test-infra by only exporting a different stats file when the stats vary enough. This way, if the slow test cases we gather from S3 are the same and their times differ only trivially, we do not bother exporting a different stats file when the --ignore-small-diffs option is enabled.

Instead, we export the stats already in test-infra, so that when it tries to commit, it sees that the commit would be empty and does not add to the git history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59759

Test Plan: Run `python tools/export_slow_tests.py --ignore-small-diffs <threshold>`.

Reviewed By: walterddr

Differential Revision: D29032712

Pulled By: janeyx99

fbshipit-source-id: 41d522a4c5f710e776acd1512d41be9791d0cf63
2021-06-10 09:44:33 -07:00
51d954e8e4 Link ATEN tests with OpenMP runtime (#59733)
Summary:
Even if OpenMP extensions are supported by the compiler, the OpenMP runtime library is not always implicitly added as a dependency by the linker.
This fixes linker problems on Apple M1 when libomp.dylib is installed via conda and tests that directly use OpenMP pragmas fail to link with the following errors:
```
/Library/Developer/CommandLineTools/usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -Wno-deprecated-declarations -DUSE_PTHREADPOOL -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -rdynamic caffe2/CMakeFiles/test_parallel.dir/__/aten/src/ATen/test/test_parallel.cpp.o -o bin/test_parallel  -Wl,-rpath,/Users/nshulga/git/pytorch/build/lib  lib/libgtest_main.a  lib/libtorch.dylib  lib/libtorch_cpu.dylib  lib/libprotobuf.a  lib/libc10.dylib  lib/libgtest.a && :
Undefined symbols for architecture arm64:
  "___kmpc_fork_call", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
  "_omp_get_max_threads", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
  "_omp_get_num_threads", referenced from:
      _.omp_outlined. in test_parallel.cpp.o
      _.omp_outlined..31 in test_parallel.cpp.o
  "_omp_get_thread_num", referenced from:
      _.omp_outlined. in test_parallel.cpp.o
      _.omp_outlined..31 in test_parallel.cpp.o
  "_omp_in_parallel", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59733

Reviewed By: walterddr, seemethere

Differential Revision: D29005511

Pulled By: malfet

fbshipit-source-id: daab5e1b0a58d9b60a8992ef40c743e4b619dac7
2021-06-10 09:41:24 -07:00
4f79270b89 [PyTorch ] Thread parallel bmm across batch dim (#59596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59596

Parallelize batch matmul across the batch dim. This was found to improve perf for
some use cases on mobile.
ghstack-source-id: 130989569

Test Plan: CI unit tests

Reviewed By: albanD

Differential Revision: D26833417

fbshipit-source-id: 9b84d89d29883a6c9d992d993844dd31a25f76b1
2021-06-10 08:25:40 -07:00
3176f16691 [Pytorch benchmark] Add BMM benchmark (#59595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595

ghstack-source-id: 130946743

Test Plan: bmm_test

Reviewed By: mingzhe09088

Differential Revision: D28873228

fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c
2021-06-10 08:24:29 -07:00
58412740ae Added doc for torch.einsum sublist format (#57038)
Summary:
Adds documentation for the new sublist format for `torch.einsum`
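For example, the sublist format interleaves operands with lists of integer subscript labels, optionally followed by an output sublist (equivalent to the string form shown in the comment):

```
import torch

a, b = torch.randn(2, 3), torch.randn(3, 4)
# Operands interleaved with subscript lists; the final list gives the output subscripts.
c = torch.einsum(a, [0, 1], b, [1, 2], [0, 2])  # same as torch.einsum("ij,jk->ik", a, b)
print(c.shape)  # torch.Size([2, 4])
```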

closes https://github.com/pytorch/pytorch/issues/21412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57038

Reviewed By: mruberry

Differential Revision: D28994431

Pulled By: heitorschueroff

fbshipit-source-id: 3dfb154fe6e4c440ac67c2dd92727bb5ecfe289e
2021-06-10 08:01:56 -07:00
5e3e504728 Update TensorPipe submodule (#59789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59789

The bot messed up in D28867855 (96651458eb) so I've got to do it manually.

Test Plan: CI

Reviewed By: beauby

Differential Revision: D29027901

fbshipit-source-id: 9438e0cfbe932fbbd1e252ab57e2b1b23f9e44cf
2021-06-10 06:36:46 -07:00
96651458eb Automated submodule update: tensorpipe (#59374)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: e942ea1513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59374

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D28867855

fbshipit-source-id: e1325046003f5c546f02024ff4c427c91721cd7e
2021-06-10 04:41:02 -07:00
0d7d316dc1 [fx ir] Support lists and dicts in FX IR GraphDrawer (#58775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58775

Reviewed By: RoshanPAN

Differential Revision: D28613939

fbshipit-source-id: 4164e2dd772b59240ea3907001fe4ebddb003060
2021-06-10 01:55:53 -07:00
e7cccc23b9 Add query and synchronize to c10::Stream (#59560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560

`at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency).
ghstack-source-id: 130932249

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28931377

fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248
2021-06-10 01:42:40 -07:00
f11120967e Support EnumerableShardingSpec in ShardedTensor. (#59061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59061

Overall Design: https://github.com/pytorch/pytorch/issues/55207

This PR builds upon https://github.com/pytorch/pytorch/pull/58517 and
https://github.com/pytorch/pytorch/pull/57409 to support creating a
ShardedTensor using EnumerableShardingSpec.
ghstack-source-id: 130780376

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28734551

fbshipit-source-id: 656f5f2b22041dae071bc475f19fe94c969716e8
2021-06-09 23:21:14 -07:00
48ea7c808d [C10d] Support subgroups (#59111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111

Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.

Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.

Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.
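Based on the test names above, usage looks roughly like this (a hedged sketch; it assumes an already-initialized default process group):

```
import torch.distributed as dist

# Default: one subgroup per machine, covering all local ranks.
cur_subgroup, subgroups = dist.new_subgroups()

# Or enumerate subgroup ranks explicitly, e.g. 4 ranks split into pairs:
cur_subgroup, subgroups = dist.new_subgroups_by_enumeration([[0, 1], [2, 3]])
```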

#Closes: https://github.com/pytorch/pytorch/issues/53962
ghstack-source-id: 130975027

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed

Reviewed By: rohan-varma

Differential Revision: D28495672

fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
2021-06-09 22:35:11 -07:00
fc0582ee95 [c10d] Use TORCH_CHECK for monitored barrier error (#59667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59667

Use TORCH_CHECK instead of throwing std::runtime_error in monitored barrier so that it works with TORCH_SHOW_CPP_STACKTRACES to reveal the entire call stack where the monitored barrier failed, which can help determine where the particular rank encountered an issue.
ghstack-source-id: 130993689

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28974510

fbshipit-source-id: 6a6958995c1066cddcd647ca88c74473079b69fc
2021-06-09 22:31:33 -07:00
12b9e99e0d Bump the bytecode reading version kMaxSupportedBytecodeVersion to 6 (#59714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59714

Bytecode v6 relies on implicit operator versioning through the number of specified arguments. Both the read and write code paths are available. This PR enables reading v6 models. The default writing format is not changed yet and will be bumped in a later PR.

Test: CI.
Local: change the writing version to 6 temporarily and run the unit tests in LiteInterpreterTest. There are a number of end-to-end tests that write v6 bytecode, then read and run it.

Test Plan: Imported from OSS

Reviewed By: raziel, cccclai

Differential Revision: D29007538

Pulled By: iseeyuan

fbshipit-source-id: cb089d5d4c5b26c5b5cd3a5e0954e8c7c4c69aac
2021-06-09 21:58:31 -07:00
3c6ae6f181 [OSS CI][iOS] Use LibTorch-Lite.h for nightly builds (#59762)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59762

Test Plan: Imported from OSS

Reviewed By: cccclai

Differential Revision: D29018267

Pulled By: xta0

fbshipit-source-id: 10213a6811b4e6b33bd13355a7a7af85d21d48d4
2021-06-09 21:38:32 -07:00
a62f6b6d04 ci: Add skipIfOnGHA util (#59748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59748

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29008217

Pulled By: seemethere

fbshipit-source-id: ffc2f7935df722f26c1252e3833085430ada7433
2021-06-09 21:19:26 -07:00
1ea5c19c19 Add USE_WHOLE_CUDNN option (#59744)
Summary:
It is only enabled if USE_STATIC_CUDNN is enabled

Next step after https://github.com/pytorch/pytorch/pull/59721 towards resolving fast kernels stripping reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59744

Reviewed By: seemethere, ngimel

Differential Revision: D29007314

Pulled By: malfet

fbshipit-source-id: 7091e299c0c6cc2a8aa82fbf49312cecf3bb861a
2021-06-09 21:12:42 -07:00
bb19dc14cc add channels last support for AvgPool2d on CPU (#58725)
Summary:
Replacement of https://github.com/pytorch/pytorch/pull/48918.

Enables the test case for AvgPool2d channels-last on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58725

Reviewed By: ngimel

Differential Revision: D28593169

Pulled By: VitalyFedyunin

fbshipit-source-id: 5de870fe1d9dd961fb0dab5f9d531ab14614a160
2021-06-09 21:06:45 -07:00
52b2ed65c0 Revert D29007258: Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D29007258

Original commit changeset: c15f51661641

fbshipit-source-id: 98236153136a5c6b6c2911079b7bd214da6cb424
2021-06-09 21:02:56 -07:00
827e00c914 Update Kineto to fix fd leak (#59755)
Summary:
Update to commit containing pytorch/kineto#281
Fixes https://github.com/pytorch/pytorch/issues/59746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59755

Reviewed By: seemethere, ngimel

Differential Revision: D29011069

Pulled By: malfet

fbshipit-source-id: 4c7b09ce5d497634f9927c330713c7404d892912
2021-06-09 20:47:04 -07:00
a4e0368c99 Comment on tests reliance on ZeRO's partitioning algo (#59713)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/59548

**Overview:**
Recently, we changed ZeRO's partitioning algorithm to first sort the parameters by decreasing size and then greedily allocate to shards. See [here](ea1de87f4b).

The current tests `test_sharding()` and `test_add_param_group()` check for a uniform partitioning, which is not achieved with the old naive greedy partitioning algorithm for general world sizes but is achieved with the new sorted-greedy algorithm. This reliance is not ideal, but for now, we opt to simply add comments to document the dependency.
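A minimal sketch of the sorted-greedy allocation described above (illustrative only; names are hypothetical, not ZeRO's actual implementation):

```
def partition(sizes, world_size):
    shards = [[] for _ in range(world_size)]
    totals = [0] * world_size
    # Sort parameters by decreasing size, then greedily assign each one
    # to the shard with the smallest running total.
    for size in sorted(sizes, reverse=True):
        i = totals.index(min(totals))
        shards[i].append(size)
        totals[i] += size
    return shards

# With the sizes from the tests, the partition comes out uniform:
print(partition([9, 9, 7, 7, 5, 5, 3, 3], 2))  # [[9, 7, 5, 3], [9, 7, 5, 3]]
```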

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59713

Test Plan:
I tested for world sizes of 1, 2, 3, and 4 via the AI AWS cluster:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding

srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
```
However, because the train queue (which offers instances with 8 GPUs) is not working at the moment, I was unable to test for world sizes of 5+. Nonetheless, I believe that they should still work.

First, consider `test_sharding()`. Given the sorted-greedy algorithm, each shard will be assigned one of the parameters with size `9`, then one of the parameters with size `7`, then `5`, and finally `3`. Hence, each will have a uniform partition. Now, consider `test_add_param_group()`. Similarly, the same allocation behavior occurs, only the last shard is not assigned the final parameter with size `3` to begin. However, after adding the new `param_group` with the parameter with size `3`, a re-partitioning occurs. The first `param_group` is partitioned as before, and the parameter with size `3` in the new `param_group` is assigned to the last shard since it has the minimal total size. Thus, in the end, all shards have a uniform partition.

Reviewed By: mrshenli

Differential Revision: D28996460

Pulled By: andwgu

fbshipit-source-id: 22bdc638d8569ed9a20836812eac046d628d6df2
2021-06-09 19:56:28 -07:00
25179ecb63 [caffe2] Fix verbose templated signed/unsigned comparison warning (#59578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59578

This is verbose warning formed from one `CAFFE_ENFORCE_GT()` check:
```
third-party\toolchains\vs2017_15.9\buildtools\vc\tools\msvc\14.16.27023\include\xstddef(271): warning C4018: '>': signed/unsigned mismatch
xplat\caffe2\c10\util\logging.h(208): note: see reference to function template instantiation 'bool std::greater<void>::operator ()<const T1&,const T2&>(_Ty1,_Ty2) const' being compiled
        with
        [
            T1=int,
            T2=unsigned int,
            _Ty1=const int &,
            _Ty2=const unsigned int &
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(539): note: see reference to function template instantiation 'void c10::enforce_detail::enforceThatImpl<std::greater<void>,int,unsigned int,>(Pred,const T1 &,const T2 &,const char *,int,const char *,const void *)' being compiled
        with
        [
            Pred=std::greater<void>,
            T1=int,
            T2=unsigned int
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(536): note: while compiling class template member function 'std::vector<caffe2::TensorShape,std::allocator<_Ty>> caffe2::ConvPoolOpBase<caffe2::CPUContext>::TensorInferenceForSchema(const caffe2::OperatorDef &,const std::vector<_Ty,std::allocator<_Ty>> &,int)'
        with
        [
            _Ty=caffe2::TensorShape
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(631): note: see reference to function template instantiation 'std::vector<caffe2::TensorShape,std::allocator<_Ty>> caffe2::ConvPoolOpBase<caffe2::CPUContext>::TensorInferenceForSchema(const caffe2::OperatorDef &,const std::vector<_Ty,std::allocator<_Ty>> &,int)' being compiled
        with
        [
            _Ty=caffe2::TensorShape
        ]
xplat\caffe2\caffe2\operators\pool_op.cc(1053): note: see reference to class template instantiation 'caffe2::ConvPoolOpBase<caffe2::CPUContext>' being compiled
xplat\caffe2\c10\core\memoryformat.h(63): note: see reference to class template instantiation 'c10::ArrayRef<int64_t>' being compiled
```
Use a signed `0` because `.dims_size()` returns a signed integer.

Test Plan: Confirm warning no longer present in Windows build logs

Reviewed By: simpkins

Differential Revision: D28941905

fbshipit-source-id: acdc1281df2fe7f30b14cfad917cbbe8f3336d29
2021-06-09 19:48:29 -07:00
b0fd3ca542 [sparse] Add the AO namespace to torch (#58703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58703

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28970962

Pulled By: z-a-f

fbshipit-source-id: 0d0f62111a0883af4143a933292dfaaf8fae220d
2021-06-09 19:47:21 -07:00
3dfb94c17c Construct a -Wall around Torch (#59668)
Summary:
Removes unused variables and functions and performs other minor mods sufficient to introduce `-Wall` as a default build flag. This should enhance code safety in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59668

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28974453

fbshipit-source-id: 011c720dd6e65fdbbd87aa90bf57d67bfef32216
2021-06-09 19:42:43 -07:00
fa030d1213 [DataPipes] Add simple unbatch to DataPipe (#59610)
Summary:
Implements the simple unbatch feature for DataPipe https://github.com/pytorch/pytorch/issues/58148
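The semantics amount to flattening one level of batching; a plain-Python sketch of that behavior (not the DataPipe API itself):

```
def unbatch(batches):
    for batch in batches:
        yield from batch  # flatten one nesting level

print(list(unbatch([[0, 1], [2, 3], [4]])))  # [0, 1, 2, 3, 4]
```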

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59610

Reviewed By: VitalyFedyunin

Differential Revision: D28994180

Pulled By: NivekT

fbshipit-source-id: 4bafe6e26c4f95a808c489b147369413a196fa1c
2021-06-09 16:53:31 -07:00
2f395f3b54 [reland] Document debugability features in torch.distributed (#59726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59726

Reland of https://github.com/pytorch/pytorch/pull/59604 with indentation fix
ghstack-source-id: 130979356

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29001923

fbshipit-source-id: 225d9dc5054c223b453f3b39749e2b62f61b9a2c
2021-06-09 16:40:11 -07:00
c5bee1ec4f [PyTorch] Parallelize gelu via tensoriterator (#58950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58950

Use tensor iterator's API to set grain size in order to parallelize gelu op.
ghstack-source-id: 130947174

Test Plan: test_gelu

Reviewed By: ezyang

Differential Revision: D28689819

fbshipit-source-id: 0a02066d47a4d9648323c5ec27d7e0e91f4c303a
2021-06-09 16:09:38 -07:00
8b63573c31 [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334

Add gelu op benchmark
ghstack-source-id: 130947172

Test Plan: gelu_test

Reviewed By: hl475

Differential Revision: D28842959

fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4
2021-06-09 16:09:37 -07:00
874e7b889d [PyTorch] Expose interface to set grain size on tensor iterator (#58949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58949

To parallelize ops, the grain size setting is exposed at the for_each level. That is too deep in the stack: cpu_kernel_vec does not know what the op is, yet you would want to parallelize an op depending on its type. Non-trivial ops can benefit from threads even when the number of elements in the tensor is not high. This API exposes setting the grain size at the tensor iterator level so that the operator creating it can control it.
ghstack-source-id: 130947175

Test Plan: CI + will add more test

Reviewed By: ezyang

Differential Revision: D26857523

fbshipit-source-id: 09fc2953061069967caa9c78b010cb1b68fcc6c9
2021-06-09 16:08:25 -07:00
1735775662 [Torch] Cast timestamp type to int (#59712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712

When a worker process fails in fb due to a signal failure, the TerminationHandler writes an error reply file. Recently the error reply file was changed for mast jobs. The JSON value of ``timestamp`` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5

This diff adds support for casting the str timestamp to int.
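A minimal sketch of the tolerant parse (illustrative only; the helper name is hypothetical):

```
def parse_timestamp(value):
    # Accept both int (per the thrift struct) and str (as emitted for mast jobs).
    return int(value) if isinstance(value, str) else value
```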

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

Reviewed By: suphoff

Differential Revision: D28995827

fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
2021-06-09 15:56:37 -07:00
44c442293f [torch/elastic] Fix the edge case when no node is alive (#59663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663

This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D28971809

fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
2021-06-09 15:31:50 -07:00
0fa3db5594 Fix subgradient for element-wise max and min (#59669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59669

Fixes #56734

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28975531

fbshipit-source-id: 4e774dc8c6e095bc66962ce2411466de3880c2d3
2021-06-09 15:21:45 -07:00
e3d75b8475 irange for PyTorch sans jit (#59481)
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.

Generated with D28874212.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28909681

fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
2021-06-09 14:46:11 -07:00
804f924504 Fix accuraccy failures when running test_nn on A100s (#59624)
Summary:
Make sure tests that are run explicitly without TF32 don't use TF32 operations.

Fixes https://github.com/pytorch/pytorch/issues/52278

After the tf32 accuracy tolerance was increased to 0.05, this is the only remaining change required to fix the above issue (for TestNN.test_Conv3d_1x1x1_no_bias_cuda).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59624

Reviewed By: heitorschueroff

Differential Revision: D28996279

Pulled By: ngimel

fbshipit-source-id: 7f1b165fd52cfa0898a89190055b7a4b0985573a
2021-06-09 14:38:34 -07:00
47e286d024 Merge c10d elastic agent tests into local_elastic_agent_test.py file (#59657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59657

Introduce tests that exercise the elastic agent with the c10d and etcd-v2 rendezvous backends.
Added a port allocation method that uses sockets to find an available port for the c10d backend. This way, agents that are created will all share the specified address/port and can communicate.
Added a method that abstracts the backend to use when running a test. This way, any tests can quickly be switched to run on the backend of choice (c10d, etcd, or etcd-v2)
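The usual socket-based trick for port allocation looks like the following (a sketch under the assumption that the method binds to port 0 and reads back the OS-assigned port; the helper name is hypothetical):

```
import socket

def get_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))   # port 0: let the OS pick a free port
        return s.getsockname()[1]  # read back the assigned port number
```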

Test Plan: Tests various components of the elastic agent with 3 different backends: etcd, etcd-v2, and c10d.

Reviewed By: tierex

Differential Revision: D28972604

fbshipit-source-id: fd4cff6417fefdf0de9d7a114820914b968006a8
2021-06-09 14:28:59 -07:00
13a2025469 Delete empty caffe2/quantization/CMakeLists.txt (#59717)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59717

Reviewed By: walterddr

Differential Revision: D28997598

Pulled By: malfet

fbshipit-source-id: ef2654577c0784254f3d74bc340cdabc76fa345c
2021-06-09 14:20:33 -07:00
171142f9cc Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D28926135 (0897df18a3)

Original commit changeset: 4eb21dcebbff

fbshipit-source-id: c15f51661641f455ae265cdf048051a3c01198f9
2021-06-09 14:05:56 -07:00
9bb5663979 Use commit stats from viable/strict instead of nightlies for sharding (#59727)
Summary:
Currently, not all of CI runs on nightlies, so it's better to use viable/strict.

For example, current 11.1 test jobs do not get to use automatic sharding because of the lack of stats: https://app.circleci.com/jobs/github/pytorch/pytorch/14010983?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59727

Reviewed By: heitorschueroff

Differential Revision: D29004910

Pulled By: janeyx99

fbshipit-source-id: eb0c54a7e7947decba8134a1d67e4b0434151a06
2021-06-09 13:52:15 -07:00
8845cbabf0 [CMake] Split caffe2::cudnn into public and private (#59721)
Summary:
This is only important for builds where cuDNN is linked statically into libtorch_cpu.
Before this PR, PyTorch wheels often accidentally contained several partial copies of the cudnn_static library.
Splitting the interface into header-only (cudnn-public) and library+headers (cudnn-private) prevents that from happening.
Preliminary step towards enabling optional linking whole cudnn_library to workaround issue reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59721

Reviewed By: ngimel

Differential Revision: D29000967

Pulled By: malfet

fbshipit-source-id: f054df92b265e9494076ab16c247427b39da9336
2021-06-09 13:18:48 -07:00
c738c13304 Fix typo in checkpoint docs (#59646)
Summary:
This small typo caused this valuable piece of information to be excluded from the docs.

<img width="876" alt="image" src="https://user-images.githubusercontent.com/8812459/121240517-47f2d400-c84f-11eb-9288-23c551c1591a.png">

The last "warning" is missing a second ":", so it doesn't render in the docs {emoji:1f447}

<img width="875" alt="image" src="https://user-images.githubusercontent.com/8812459/121240467-39a4b800-c84f-11eb-9dd6-ec26754c43d3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59646

Reviewed By: mruberry

Differential Revision: D28972541

Pulled By: jbschlosser

fbshipit-source-id: d10c6688d8db4d4ec4b02858a4c7b352365219c0
2021-06-09 12:48:18 -07:00
51af772937 [jit] Set debug name for value coming out of GetAttr nodes. (#59123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59123

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28766023

fbshipit-source-id: 0919f4318fb5a7b1d5adc8f976dfc9309e233d13
2021-06-09 12:24:55 -07:00
bbd58d5c32 fix :attr: rendering in F.kl_div (#59636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59636

Fixes #57538

Test Plan:
Rebuilt docs to verify the fix:

{F623235643}

Reviewed By: zou3519

Differential Revision: D28964825

fbshipit-source-id: 275c7f70e69eda15a807e1774fd852d94bf02864
2021-06-09 12:20:14 -07:00
e385be7611 .circleci: Disable pytorch_windows_test_multigpu (#59725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59725

These are failing on CircleCI with no apparent debug messages, see https://github.com/pytorch/pytorch/issues/59724

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29001353

Pulled By: seemethere

fbshipit-source-id: ce0f4fbcfc7918824f6bad47b922d914eeb2f5a6
2021-06-09 12:12:13 -07:00
f8bb7e2f7c Magma isn't needed in cpu build (#59619)
Summary:
Fix incorrect logic in the Windows CPU build script: VERSION_SUFFIX shouldn't be `cpu`.

https://github.com/pytorch/pytorch/pull/59618/checks?check_run_id=2771591019
![image](https://user-images.githubusercontent.com/16190118/121158840-3f18f700-c87d-11eb-9c03-277856afb1b2.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59619

Reviewed By: samestep

Differential Revision: D29000213

Pulled By: seemethere

fbshipit-source-id: fcc474967e281fbf9be69f14c0aedfd01820573f
2021-06-09 12:06:33 -07:00
ed3884c3e9 Fix timeout with ZeRO test_step() and test_step_with_closure() (#59648)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/59548

**Overview:**
This fixes the timeout issues with `test_step()` and `test_step_with_closure()` for the `ZeroRedundancyOptimizer`.

The existing tests partially assumed a `world_size` of `2` (which is why [this](https://github.com/pytorch/pytorch/pull/59622) seems to be a temporary fix). This change instead avoids baking in that assumption and allows `world_size` to be flexible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59648

Test Plan:
I tested with 2, 3, and 4 GPUs (and hence `world_size`s of 2, 3, and 4, respectively) via the AI AWS cluster.
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step

srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
```

Reviewed By: jbschlosser

Differential Revision: D28975035

Pulled By: andwgu

fbshipit-source-id: 2cbaf6a35e22a95e19fc97e1b64e585e452e774e
2021-06-09 12:03:05 -07:00
61965abad7 Move _PartialWrapper to module scope (#59660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59660

Context https://github.com/pytorch/pytorch/issues/57352

Test Plan: Pytorch CI tests

Reviewed By: vkuzo

Differential Revision: D28972991

fbshipit-source-id: efc9dd3e90e18e1cdf27d5ef0f168abd8169bc42
2021-06-09 11:55:04 -07:00
0f6bd550a4 Revert D28981443: reland D28645531: .github: Add Windows GPU workflow
Test Plan: revert-hammer

Differential Revision:
D28981443 (21121675b3)

Original commit changeset: 5d24cccfb8c8

fbshipit-source-id: 14e5b610978882bace2f834f61e5457f62b11290
2021-06-09 11:43:10 -07:00
167477329d [Reland] adding base commit to scribe report (#59677)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/59570.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59677

Reviewed By: janeyx99

Differential Revision: D28980356

Pulled By: walterddr

fbshipit-source-id: 9c4671d18ce00fda98d774d1b2aa556662ecddfe
2021-06-09 11:06:01 -07:00
d42e6c7f70 Clang format distributed_test.py (#59693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59693

ghstack-source-id: 130931133

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D28987619

fbshipit-source-id: 3681cc262b889653615ec64da8c23c96cc0d997b
2021-06-09 10:58:48 -07:00
68f74966fc [ttk] Store float64 in tensorboard instead of float32 (#59435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59435

Sometimes we need to compare 10+ digits. Currently tensorboard only saves float32. Provide an option to save float64.

Reviewed By: yuguo68

Differential Revision: D28856352

fbshipit-source-id: 05d12e6f79b6237b3497b376d6665c9c38e03cf7
2021-06-09 10:42:37 -07:00
3271853912 hold references to storages during TorchScript serialization (#59642)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59642

Test Plan: Imported from OSS

Reviewed By: jbschlosser, cccclai

Differential Revision: D28968947

Pulled By: Lilyjjo

fbshipit-source-id: 0046da8adb3a29fb108965a1d2201749fe2d0b41
2021-06-09 10:12:07 -07:00
21121675b3 reland D28645531: .github: Add Windows GPU workflow (#59678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59678

This reverts commit 2956bbaf2388d424ef986c22fac8287f7c345978.

Reland of https://github.com/pytorch/pytorch/pull/58782

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28981443

Pulled By: seemethere

fbshipit-source-id: 5d24cccfb8c87832fa0233d0b524575dc04f8f05
2021-06-09 09:51:29 -07:00
0897df18a3 Refactor Foreach Tests: Unary Functions (#58960)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/58833

__changes__
- slowpath tests: pass tensors of every dtype & device and compare the behavior with the regular functions, including inplace variants
- check the number of `cudaLaunchKernel` calls
- rename `ForeachUnaryFuncInfo` -> `ForeachFuncInfo`: This change is mainly for the future binary/pointwise test refactors

cc: ngimel ptrblck mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58960

Reviewed By: ejguan

Differential Revision: D28926135

Pulled By: ngimel

fbshipit-source-id: 4eb21dcebbffffaf79259e31961626e0707fb8d1
2021-06-09 09:45:16 -07:00
62583e51a5 [reland] Add a ci/no-build label (#58778)
Summary:
Depends on https://github.com/pytorch/pytorch-probot/pull/22. Adds a new label called `ci/no-build` that disables the CircleCI `build` workflow on PRs. The current behavior should be the same in the absence of `ci/no-build`.

Specifically, after this PR lands, for anyone who isn't rebased onto the latest `master`, I believe this will happen:
- when they push to their PR, the CircleCI app triggers CI
- the `pytorch-probot` app sees that their PR doesn't have the `ci/no-build` tag, so it also triggers CI
- the latter should auto-cancel the former

After checking with https://github.com/pytorch/pytorch/issues/59087, it looks like this would cause the "errored" number to go up and then go down as Circle jobs are canceled (saying "Your CircleCI tests were canceled") and then restarted:

<img width="868" alt="Screen Shot 2021-05-27 at 12 39 20 PM" src="https://user-images.githubusercontent.com/8246041/119887123-9667b080-bee8-11eb-8acb-e1967899c9d5.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58778

Reviewed By: malfet

Differential Revision: D28995335

Pulled By: samestep

fbshipit-source-id: 8d7543b911e4bbbeef14639baf9d9108110b97c8
2021-06-09 09:05:44 -07:00
b844fd11ee Allow tools/test_history.py to be piped to head (#59676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59676

Test Plan:
```
tools/test_history.py --mode=columns --ref=3cf783 --test=test_set_dir --job pytorch_linux_xenial_py3_6_gcc5_4_test --job pytorch_linux_xenial_py3_6_gcc7_test | head -n10
```
Before this PR, the above command seems to just hang. After this PR, it nicely prints the following, line by line, and then exits:
```
2021-02-10 12:18:50Z 3cf78395cbc32fa9c83b585c9ec63f960b32d17f    0.644s    0.312s
2021-02-10 11:13:34Z 594a66d778a660faed0b0fbbe1dd8c2c318707ff    0.360s  errored
2021-02-10 10:13:25Z 9c0caf0384690cb67dcccb7066ece5184f72ca78    0.819s    0.449s
2021-02-10 10:09:14Z 602434bcbebb82c6f3741b2a3d5ebac7ee482268    0.361s    0.454s
2021-02-10 10:09:10Z 2e35fe953553247d8a22fc38b039374e426f13b8
2021-02-10 10:09:07Z ff73be7e45616fe106b9e5040bc091ca5cdbfc7f
2021-02-10 10:05:39Z 74082f0d6f8dfd60f28c0de0fe43bcb97b95ee5a
2021-02-10 07:42:29Z 0620c96fd6a140e68c49d68ed14721b1ee108ecc    0.414s    0.377s (2 job re-runs omitted)
2021-02-10 07:27:53Z 33afb5f19f4e427f099653139ae45b661b8bc596    0.381s    0.294s
2021-02-10 07:05:15Z 5f9fb93c1423814a20007faa506ceb8b4828c8d1    0.461s    0.361s
```

Reviewed By: seemethere

Differential Revision: D28978017

Pulled By: samestep

fbshipit-source-id: 021e634bbf40eb1d3b131fac574343dd5cef5deb
2021-06-09 08:42:05 -07:00
26beda8ed5 [BE] unsupported backward failing on single sample (#59455)
Summary:
Echoing https://github.com/pytorch/pytorch/pull/58260#discussion_r637467625

Similar to `test_unsupported_dtype`, which only checks that an exception is raised on the first sample, we should do the same for unsupported_backward. The goal of both tests is to remind developers to
1. add a new dtype to the support list if it is fully runnable without failure (over all samples)
2. replace the skip mechanism, which will indefinitely ignore tests without warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59455

Test Plan: CI.

Reviewed By: mruberry

Differential Revision: D28927169

Pulled By: walterddr

fbshipit-source-id: 2993649fc17a925fa331e27c8ccdd9b24dd22c20
2021-06-09 08:17:03 -07:00
12b4e8996f [DataLoader] Add nesting_level argument to map and filter (#59498)
Summary:
This PR makes the `.map` and `.filter` APIs of IterDataPipe sensitive to the `nesting_level` argument (illustrated below).

[DataPipes] Make .map of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58145
[DataPipes] Make .filter of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58147
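
To make the intended semantics concrete, here is a hypothetical, framework-free illustration of applying a function at a given nesting depth (the helper and argument names are mine, not the DataPipe API):
```python
def apply_at_level(fn, item, nesting_level):
    # nesting_level=0 applies fn to the item itself;
    # each extra level descends one layer of containers
    if nesting_level == 0:
        return fn(item)
    return [apply_at_level(fn, sub, nesting_level - 1) for sub in item]

print(apply_at_level(lambda x: x * 2, [[1, 2], [3]], 2))  # [[2, 4], [6]]
```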

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59498

Reviewed By: ejguan

Differential Revision: D28964280

Pulled By: NivekT

fbshipit-source-id: b1ee6cafa3953093ebd7bf30eacc80c3ef7cd190
2021-06-09 07:40:53 -07:00
2693b0bef3 Fix compile error when debugging (#59616)
Summary:
Signed-off-by: caozhong <zhong.z.cao@intel.com>

Triggered this probably because of my full-debug build of Python. ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59616

Reviewed By: jbschlosser

Differential Revision: D28958685

Pulled By: albanD

fbshipit-source-id: fdab622c4d1be93eb27e9006dcf3db7c5b44a04b
2021-06-09 06:34:06 -07:00
f1786b293d Revert D28972444: [pytorch][PR] Document debugability features in torch.distributed
Test Plan: revert-hammer

Differential Revision:
D28972444 (a9d2810817)

Original commit changeset: da5e8ee84f0d

fbshipit-source-id: 94d3b3b75ddec74ea5b2b76f6a7519dc921ee2a7
2021-06-09 03:04:36 -07:00
a56c89a160 Revert D28918331: [pytorch][PR] Automated submodule update: FBGEMM
Test Plan: revert-hammer

Differential Revision:
D28918331 (cc840cf544)

Original commit changeset: def60efe5584

fbshipit-source-id: 88101feb87ebfbd38cf10b45d09af309e9759852
2021-06-09 01:36:06 -07:00
a9d2810817 Document debugability features in torch.distributed (#59604)
Summary:
Adds comprehensive documentation around debuggability features added to `torch.distributed` recently, including `monitored_barrier` and the TORCH_DISTRIBUTED_DEBUG env variable.

![dist_one](https://user-images.githubusercontent.com/8039770/121102672-0f052180-c7b3-11eb-974c-81dbbe102cb6.png)
![dist_two](https://user-images.githubusercontent.com/8039770/121102734-39ef7580-c7b3-11eb-94f7-c75469351440.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59604

Reviewed By: jbschlosser, SciPioneer

Differential Revision: D28972444

Pulled By: rohan-varma

fbshipit-source-id: da5e8ee84f0d6f252c703c4d70ff2a0d5817cc4e
2021-06-08 23:52:19 -07:00
daa35141e8 Reland: "[TensorExpr] Fix handling of 0-dim tensors." (#59508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508

An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918342

Pulled By: ZolotukhinM

fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
2021-06-08 22:48:17 -07:00
9f9904969f Reland: "[TensorExpr] Fix printing of Bool dtype." (#59507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59507

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918344

Pulled By: ZolotukhinM

fbshipit-source-id: b75aa9f316e4f3f648130a3171a35bfbbf1f397d
2021-06-08 22:48:16 -07:00
0b6ec32004 Reland: "[TensorExpr] Improve debug messages." (#59506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59506

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918343

Pulled By: ZolotukhinM

fbshipit-source-id: 168703f6368f5182cf9762600d7f0f6ea5b20280
2021-06-08 22:47:06 -07:00
04986b909f [package] Add docstring for PackageExporter.intern (#59602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59602

**Summary**
This commit adds a docstring for `PackageExporter.intern`.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28972939

Pulled By: SplitInfinity

fbshipit-source-id: 1765541aa2ed88e01beb48c08b90f56df3a591b7
2021-06-08 19:53:36 -07:00
f52e202840 Add warning when accessing Tensor::grad() in the C++ API (#59362)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35379

 - Adds the `retains_grad` attribute, backed by C++ as a native function. The Python bindings for the function are skipped to be consistent with `is_leaf`.
   - Tried writing it without a native function, but the jit test `test_tensor_properties` seems to require that it be a native function (or alternatively maybe it could also work if we manually add a prim implementation?).
 - The Python API now uses the `retain_grad` implementation from C++ (example below)
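
A minimal sketch of how the attribute pairs with `retain_grad()` from Python:
```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                # non-leaf tensor
y.retain_grad()          # ask autograd to keep y.grad after backward
y.sum().backward()

print(y.retains_grad)    # True: the attribute exposed by this PR
print(y.grad)            # populated thanks to retain_grad()
```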

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59362

Reviewed By: jbschlosser

Differential Revision: D28969298

Pulled By: soulitzer

fbshipit-source-id: 335f2be50b9fb870cd35dc72f7dadd6c8666cc02
2021-06-08 19:43:21 -07:00
90303157ab Enable complex dtypes for coo_sparse-coo_sparse matmul [CPU] (#59554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59554

This PR enables complex numbers supports for matrix-matrix
multiplication of COO sparse matrices.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28968309

Pulled By: anjali411

fbshipit-source-id: 4fd471e76a5584366aabc86c08b4564667ee54ca
2021-06-08 19:34:41 -07:00
b386ed6f9b Fix some compiler warnings (#59643)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59643

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28916206

fbshipit-source-id: 4f6c8e0faeb76848f5951ff85db7c9da7fe9bf54
2021-06-08 18:22:57 -07:00
02d380450d [FX][docs][EZ] Fix link to fuser example (#59670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59670

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D28975704

Pulled By: jamesr66a

fbshipit-source-id: 2fb759224b5b1ecc62c0ab26563d2a35ed422794
2021-06-08 17:32:55 -07:00
1733d10399 Warn when backward() is called with create_graph=True (#59412)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4661
- Add warnings in the engine's `execute` function so they can be triggered through both the C++ and Python codepaths (example below)
- Adds an RAII-guard version of `c10::Warning::set_warnAlways` and replaces all prior usages of `set_warnAlways` with the new one
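
A minimal sketch of the Python codepath that now warns, plus the commonly recommended `torch.autograd.grad` alternative for higher-order gradients:
```python
import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
loss.backward(create_graph=True)  # now emits the new warning

# alternative that avoids the parameter/grad reference cycle:
(gx,) = torch.autograd.grad((x ** 2).sum(), x, create_graph=True)
(gxx,) = torch.autograd.grad(gx.sum(), x)  # second-order gradient
```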

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59412

Reviewed By: jbschlosser

Differential Revision: D28969294

Pulled By: soulitzer

fbshipit-source-id: b03369c926a3be18ce1cf363b39edd82a14245f0
2021-06-08 17:19:04 -07:00
82466e0605 Revert D28900487: ger is an alias to outer, not the other way around
Test Plan: revert-hammer

Differential Revision:
D28900487 (4512d75063)

Original commit changeset: e9065c5b2907

fbshipit-source-id: 712c05d2fba28c83958ef760290e1e08c147a907
2021-06-08 17:09:15 -07:00
cc840cf544 Automated submodule update: FBGEMM (#59505)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 77a4792062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59505

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D28918331

fbshipit-source-id: def60efe55843023e70b94726cde1faf6857be0b
2021-06-08 17:03:26 -07:00
2956bbaf23 Revert D28645531: .github: Add Windows GPU workflow
Test Plan: revert-hammer

Differential Revision:
D28645531 (51884c6479)

Original commit changeset: 6ed1a2dead9c

fbshipit-source-id: e082d7d50de77d0572596111e95a3da3a350a319
2021-06-08 16:59:56 -07:00
97dfc7e300 [Reland] Adding run specified tests option to run_test.py (#59649)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/59487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59649

Reviewed By: samestep

Differential Revision: D28970751

Pulled By: janeyx99

fbshipit-source-id: 6e28d4dcfdab8a49da4b6a02c57516b08bacd7b5
2021-06-08 16:04:46 -07:00
51884c6479 .github: Add Windows GPU workflow (#58782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58782

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28645531

Pulled By: seemethere

fbshipit-source-id: 6ed1a2dead9cca29e26e613afdbcf46ba7cee88c
2021-06-08 16:00:21 -07:00
6104ac5aaf [libkineto] Refactor trace activities (#59360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59360

Pull Request resolved: https://github.com/pytorch/kineto/pull/206

Replace ClientTraceActivity with GenericActivity.
In addition:
* Add a couple of new activity types for user annotations
* Simplify code for GPU-side user annotations
* Add accessor to containing trace span object in activities. Later we can replace this with a trace context / trace session object.
* Simplified MemoryTraceLogger
* Added early exit for cupti push/pop correlation ID

Reviewed By: ilia-cher

Differential Revision: D28231675

fbshipit-source-id: 7129f2493016efb4d3697094f24475e2c39e6e65
2021-06-08 15:49:35 -07:00
acc47357b5 Fix torch.conj for zero-dimensional sparse coo matrix (#59553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59553

Added a test for 0x0 sparse coo input for sparse_unary_ufuncs.
This test fails for `conj` on master.

Modified `unsupportedTypes` for test_sparse_consistency: complex dtypes now pass, but float16 doesn't pass for `conj` because `to_dense()` doesn't work with float16.

Fixes https://github.com/pytorch/pytorch/issues/59549
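
A minimal reproducer of the kind the new test covers (shapes and dtype are illustrative):
```python
import torch

i = torch.empty((2, 0), dtype=torch.int64)   # no nonzero entries
v = torch.empty((0,), dtype=torch.complex64)
s = torch.sparse_coo_tensor(i, v, (0, 0))    # 0x0 sparse COO matrix
print(torch.conj(s))                         # failed on master before this fix
```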

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28968215

Pulled By: anjali411

fbshipit-source-id: 44e99f0ce4aa45b760d79995a021e6139f064fea
2021-06-08 15:46:49 -07:00
894aaa3997 Revert D28943928: [pytorch][PR] adding base commit to scribe report
Test Plan: revert-hammer

Differential Revision:
D28943928 (92ed70a048)

Original commit changeset: ae3d279005f5

fbshipit-source-id: fda98b6c54425bba2f937a1cb921027531d61842
2021-06-08 15:43:57 -07:00
6ca141fe6c Make detach return an alias even under inference mode (#59633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59633

Fixes #59614

This fix isn't 100% correct, but it appears to stem the bleeding. A better fix would be to understand how to detect when function implementations don't uphold the required invariants, leading to refcount disaster.
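
A minimal illustration of the expected aliasing behavior (assuming "alias" means shared storage, checked via `data_ptr`):
```python
import torch

with torch.inference_mode():
    x = torch.ones(3)
    y = x.detach()

# after this fix, detach() should alias rather than copy
print(y.data_ptr() == x.data_ptr())  # expected: True
```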

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28962183

Pulled By: ezyang

fbshipit-source-id: 6ec71994666289dadef47bac363e6902df90b094
2021-06-08 15:31:29 -07:00
14f4c8d333 Revert D28387762: Forward AD formulas batch 3
Test Plan: revert-hammer

Differential Revision:
D28387762 (58348bea06)

Original commit changeset: fc395c92af7e

fbshipit-source-id: 608d704ff5bc560714790a576eaf9ed7f1f44e13
2021-06-08 15:19:26 -07:00
528d82d6a6 [torch] Add debug name to assert message for useOf
Summary:
Make an assert message in Pytorch's JIT provide better information by
printing the debug name of a value in `PythonPrintImpl::useOf` if it's not
found in any tables.

Test Plan:
Tested printing a `module.code` where the module had an invalid value used
as an operand. Before it asserted without any more details, afterwards it
printed the debug name which made it easy to track down the offending value.

Reviewed By: SplitInfinity

Differential Revision: D28856026

fbshipit-source-id: 479f66c458a0a2d9a161ade09f20382e7b19d60e
2021-06-08 15:03:58 -07:00
9d533ef3ac Renorm fix (#59615)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59584
albanD, soulitzer, `renorm` grad was completely busted. Fast gradcheck is definitely not doing its job.
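
A minimal check of the kind that should now pass (full/slow gradcheck, not the fast variant):
```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
print(gradcheck(lambda t: t.renorm(p=2, dim=0, maxnorm=1.0), (x,)))
```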

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59615

Reviewed By: jbschlosser

Differential Revision: D28964271

Pulled By: ngimel

fbshipit-source-id: b6878cd24db9189b64b67eb58bd2cd8956cda78a
2021-06-08 14:59:24 -07:00
67b8e6410d [OSS] Add podspec for libtorch-lite (#59638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59638

ghstack-source-id: 130847775

Test Plan: .

Reviewed By: husthyc, cccclai

Differential Revision: D28966693

fbshipit-source-id: 1b82623279709d0118c0967e2ba730d5dec040cc
2021-06-08 14:46:23 -07:00
1bb1a9e22b [ROCm] enable test_cufft_plan_cache test (#57520)
Summary:
This pr enables the test_cufft_plan_cache in test_spectral suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57520

Reviewed By: ejguan

Differential Revision: D28936128

Pulled By: ngimel

fbshipit-source-id: c843ab31c50855b624a986155c17c8d24e89a2ac
2021-06-08 14:42:01 -07:00
43274ca145 test_store multiworker remove multiprocessing (#59599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59599

This will fix the flakiness for these tests internally when running under TSAN. We don't need multiprocessing since we should restrict the testing to the `wait_for_workers` and `world_size` parameters of the tcp store master store.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28947838

Pulled By: H-Huang

fbshipit-source-id: d3e3904aa7ac81ae4c744a193a3b7167c2227bc8
2021-06-08 14:38:42 -07:00
40cbf342d3 Fix vectorized calculations on POWER (#59382)
Summary:
This fixes multiple bugs introduced by the VSX optimized code in https://github.com/pytorch/pytorch/pull/41541

- min/max/clamp now consistently return NaN when any input is NaN, as on other architectures
- The non-complex angle functions now return PI for negative values
- The complex angle functions have been corrected and optimized
- The float32 log implementation returned a wrong result when inf was passed (and possibly other inputs); it has been replaced by the Sleef function, just as for float64 (see the snippet below)
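
A quick illustration of the now-consistent NaN propagation (the same snippet should behave identically on POWER and x86):
```python
import torch

t = torch.tensor([1.0, float("nan"), 3.0])
print(torch.clamp(t, min=0.0, max=2.0))  # tensor([1., nan, 2.])
print(torch.maximum(t, torch.zeros(3)))  # tensor([1., nan, 3.])
```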

Fixes https://github.com/pytorch/pytorch/issues/59248
Fixes https://github.com/pytorch/pytorch/issues/57537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59382

Reviewed By: jbschlosser

Differential Revision: D28944626

Pulled By: ezyang

fbshipit-source-id: 1ae2782b9e34e458a19cec90617037654279e0e0
2021-06-08 14:18:47 -07:00
ea3b2fd0fa Throw RuntimeError using TORCH_CHECK (#59485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59485

... when a variable is not allowed to require grad

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933808

fbshipit-source-id: ef3536049d3a4a2f6e2f4b1787f0c17763f5828c
2021-06-08 14:03:21 -07:00
5fc105b323 Raise NotImplementedError on forward passes (#59483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59483

... for functions that are not implemented

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933806

fbshipit-source-id: dadae1af6609f15419cf0f47a98361dc87dff849
2021-06-08 14:03:19 -07:00
c268eefe96 Use TORCH_CHECK_NOT_IMPLEMENTED for AD not implemented (#59482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59482

Fixes #53398

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933809

fbshipit-source-id: 53387ec9690fc235b0622b50800feced706ea1ee
2021-06-08 14:02:04 -07:00
84061dadad Add reduce variants for scatter operation. (#57015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56463 #56464

- Add reduce variants for `scatter` in both _native_functions.yaml_ and _TensorAdvancedIndexing.cpp_
- Add `OpInfo` tests and reduce tests in _test_torch.py_
- Fix default reduce argument for `scatter_` in __tensor_docs.py_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57015

Reviewed By: mrshenli

Differential Revision: D28162657

Pulled By: ezyang

fbshipit-source-id: 4d37ed1569ce8560aca1085c9cf5349f11427c4f
2021-06-08 13:37:26 -07:00
9de0c214bd [quant] Fix dimension for output of batchnorm 1d (#59264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59264

Previously, batchnorm 1d unsqueezed twice but only squeezed once before returning when the input Tensor is 2-dimensional; this PR adds the missing squeeze (sketched below).
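
A sketch of the shape bookkeeping the fix restores (plain tensor ops standing in for the batchnorm kernel):
```python
import torch

x = torch.randn(4, 8)               # 2-D input: (N, C)
y = x.unsqueeze(-1).unsqueeze(-1)   # -> (N, C, 1, 1) for the 2d kernel
out = y.squeeze(-1).squeeze(-1)     # both trailing dims must be squeezed
assert out.shape == x.shape         # a single squeeze would leave (N, C, 1)
```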

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D28810597

fbshipit-source-id: 879873bbf39ed3607762684694f6e81b423740c2
2021-06-08 13:07:00 -07:00
58348bea06 Forward AD formulas batch 3 (#58094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58094

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28387762

Pulled By: albanD

fbshipit-source-id: fc395c92af7ebb5ebae95c40f6c76273047f4097
2021-06-08 13:00:21 -07:00
4512d75063 ger is an alias to outer, not the other way around (#59448)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59448

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28900487

Pulled By: albanD

fbshipit-source-id: e9065c5b29078d92ea9b746e188ebc1e62a407a0
2021-06-08 12:59:06 -07:00
d0e84c2f23 Revert D28961233: [pytorch][PR] Adding run-specified-test-cases option in run_test.py
Test Plan: revert-hammer

Differential Revision:
D28961233 (a6c9483c2f)

Original commit changeset: 6b7ddc6e6185

fbshipit-source-id: 4f8471df987a03d5c928a04f989d5d43f9cc47e9
2021-06-08 12:04:15 -07:00
0208e604e3 seems os.environ.get() not working well on windows (#59634)
Summary:
Replace it with os.getenv() instead.

For some reason this was intermittently failing Azure pipelines. I can't log in to the pipeline itself for debugging, but here are 2 examples: [successful](https://app.circleci.com/pipelines/github/pytorch/pytorch/332405/workflows/944609ad-5dcf-49da-984f-26c381d1f16c/jobs/13969059) vs [failed](https://app.circleci.com/pipelines/github/pytorch/pytorch/332518/workflows/21f8a5a6-3b95-432e-be42-ac98008c671b/jobs/13975637)

However, given that the other constants exposed by common_utils.py use `os.getenv()` and were working, I am making them consistent.
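
For reference, `os.getenv` is documented as equivalent to `os.environ.get`, so the swap is behavior-preserving; the variable name below is hypothetical:
```python
import os

value = os.getenv("SOME_TEST_FLAG", "0")            # hypothetical variable
assert value == os.environ.get("SOME_TEST_FLAG", "0")
```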

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59634

Test Plan: CI/master

Reviewed By: jbschlosser

Differential Revision: D28966412

Pulled By: walterddr

fbshipit-source-id: 7bcb9adf06df0acabd9574459eb6637c3e6a2947
2021-06-08 11:59:39 -07:00
1242dd1357 Remove cancel_redundant_workflows job (#59608)
Summary:
After https://github.com/pytorch/pytorch/issues/59019 this workflow itself is redundant, so we don't need it anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59608

Reviewed By: jbschlosser, seemethere

Differential Revision: D28952314

Pulled By: driazati

fbshipit-source-id: 41aa33164be8271210ec23b9641e74596114416d
2021-06-08 11:38:29 -07:00
7949fdd2b6 ninja 1.9.0 couldn't be installed, CI might be broken (#59625)
Summary:
I suddenly found that `pip install ninja==1.9.0` fails in CI. I tested locally and on another colleague's machine as well. It looks like it conflicts with the cmake installed in conda.

https://app.circleci.com/pipelines/github/pytorch/pytorch/332470/workflows/d8b6ed30-1c7e-4863-898a-7f067c6202e1/jobs/13972409
![image](https://user-images.githubusercontent.com/16190118/121175743-02a1c700-c88e-11eb-9596-97b903b727f9.png)

1.10.0 couldn't be installed either.
![image](https://user-images.githubusercontent.com/16190118/121176606-fbc78400-c88e-11eb-931c-aa65bad080f8.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59625

Reviewed By: jbschlosser

Differential Revision: D28966699

Pulled By: seemethere

fbshipit-source-id: a1150e411ba3b4ab65448a087aa65f4ebe6c3596
2021-06-08 11:07:14 -07:00
13917bab7f [Torch] Correct launcher tests (#59635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59635

The diff corrects the launcher tests. The follow-up would be to determine why the tests succeeded during the ``use_env`` removal diff.

Test Plan: buck test mode/dev-tsan //caffe2/test/distributed/launcher:run_test -- --exact 'caffe2/test/distributed/launcher:run_test - test_launch_user_script_python_caffe2_bc (run_test.ElasticLaunchTest)' --run-disabled

Reviewed By: cbalioglu

Differential Revision: D28963813

fbshipit-source-id: a9f9b80787fb5c2f40a69ce31c8c2f3138654cad
2021-06-08 11:05:57 -07:00
3b0c6a7b50 fix AddPadding tensor shape inference (#59572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59572

fix AddPadding tensor shape inference

Test Plan: sandcastle

Reviewed By: dehuacheng

Differential Revision: D28686983

fbshipit-source-id: 03f70335fcfd94a1241562f8fbf12043a0deac2b
2021-06-08 11:02:33 -07:00
7dac2987ce [quant][eager][fix] Fix a typo in convert function in eager mode quantization (#59571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59571

Test Plan:
python test/test_quantization.py TestPostTrainingStatic.test_custom_module_class

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28938355

fbshipit-source-id: 566daeb07d616ae40e52754d3d4581f75f248f04
2021-06-08 10:24:22 -07:00
31d136c81f [DDP] Rename the member divFactor_ as div_factor for naming consistency in reducer (#59523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59523

Should use snake case instead of camel case for consistency.
ghstack-source-id: 130759655

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: cbalioglu

Differential Revision: D28922896

fbshipit-source-id: e04298284a78b2e71b562f790a878731962f873a
2021-06-08 10:04:20 -07:00
b7ee164456 [DDP] Remove the duplicate parseHookResult in reducer (#59510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59510

Address the comment in https://github.com/pytorch/pytorch/pull/58937#discussion_r645822768

#Closes: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130758758

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D28918694

fbshipit-source-id: 7ac4e4e6268e220adefed230bdb377ab3b25e302
2021-06-08 10:04:18 -07:00
2b398d0537 [Reland][Gradient Compression] Apply division first to avoid overflow (#59576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
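
A small numeric sketch of why division must come first for FP16 (max finite value ~65504):
```python
import torch

world_size = 64
grad = torch.full((4,), 2048.0, dtype=torch.float16)

# summing first overflows: 2048 * 64 = 131072 -> inf in fp16
print(grad * world_size)                 # stand-in for allreduce(SUM)

# dividing first keeps values in range: (2048 / 64) * 64 = 2048
print((grad / world_size) * world_size)
```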
ghstack-source-id: 130754510

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28941327

fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
2021-06-08 10:03:21 -07:00
92ed70a048 adding base commit to scribe report (#59570)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59570

Test Plan: CI

Reviewed By: samestep

Differential Revision: D28943928

Pulled By: walterddr

fbshipit-source-id: ae3d279005f54d83d7a3acae508d3ccdf1cd46b8
2021-06-08 09:58:38 -07:00
a6c9483c2f Adding run-specified-test-cases option in run_test.py (#59487)
Summary:
The run-specified-test-cases option would allow us to specify a list of test cases to run by providing a CSV with at least two columns: test_filename and test_case_name.

This PR also adds .json to some files we use for better clarity.

Usage:
`python test/run_test.py --run-specified-test-cases <csv_file>` where the csv file can look like:
```
test_filename,test_case_name,test_total_time,windows_only_failure_sha_count,total_sha_count,windows_failure_count,linux_failure_count,windows_total_count,linux_total_count
test_cuda,test_cudnn_multiple_threads_same_device,8068.8409659525,46,3768,53,0,2181,6750
test_utils,test_load_standalone,8308.8062920459,14,4630,65,0,2718,8729
test_ops,test_forward_mode_AD_acosh_cuda_complex128,91.652619369806,11,1971,26,1,1197,3825
test_ops,test_forward_mode_AD_acos_cuda_complex128,91.825633094915,11,1971,26,1,1197,3825
test_profiler,test_source,60.93786725749,9,4656,21,3,2742,8805
test_profiler,test_profiler_tracing,203.09352795241,9,4662,21,3,2737,8807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59487

Test Plan:
Without specifying the option, everything should be as they were before.

Running `python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv` resulted in this paste P420276949 (you can see internally). A snippet looks like:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv
Loading specified test cases to run from windows_smoke_tests.csv.
Processed 28 test cases.
Running test_cpp_extensions_jit ... [2021-06-04 17:24:41.213644]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_cpp_extensions_jit.py', '-k', 'test_jit_cuda_archflags'] ... [2021-06-04 17:24:41.213781]
s
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)
...
```
With pytest, an example executable would be:
`Running test_dataloader ... [2021-06-04 17:37:57.643039]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', '-m', 'pytest', 'test_dataloader.py', '-v', '-k', 'test_segfault or test_timeout'] ... [2021-06-04 17:37:57.643327]`

Reviewed By: jbschlosser

Differential Revision: D28961233

Pulled By: janeyx99

fbshipit-source-id: 6b7ddc6e61856aa0002e1a0afc845770e4f8400b
2021-06-08 09:49:10 -07:00
ea1de87f4b Sort params by size (decreasing)
Summary:
Pull Request: https://github.com/pytorch/pytorch/pull/59586
Task: https://www.internalfb.com/tasks/?t=90847711

**Overview:**
Suppose we have `n` items with positive integer sizes and `k` buckets. We want to assign items to buckets with the goal of uniformity. The precise criteria for uniformity can vary: e.g. minimize the maximum size, maximize the minimum size, etc. This is known as [multiway number partitioning](https://en.wikipedia.org/wiki/Multiway_number_partitioning). ZeRO's partitioning task reduces to solving this problem. In particular, this is the subproblem to be solved for each `param_group` in `self.param_groups`, where the parameters are the items and the ranks give the buckets.

The existing implementation uses the linear-time [greedy number partitioning algorithm](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Linear-time_algorithm), which assigns the next tensor-parameter to the process with the smallest total parameter size so far. In this task, I explore the [extension](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Improved_algorithm) where each parameter group is sorted by decreasing size before applying the greedy algorithm, requiring linearithmic time (as dominated by the sort).
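
A minimal sketch of both strategies (generic names, not the actual `partition_parameters()` code):
```python
import heapq

def partition_greedy(sizes, k, sort_desc=True):
    # always place the next item into the currently lightest bucket;
    # sort_desc=True gives the "greedy-sorted" (LPT) variant
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i]) \
        if sort_desc else range(len(sizes))
    heap = [(0, b) for b in range(k)]   # (bucket_total, bucket_id)
    heapq.heapify(heap)
    buckets = [[] for _ in range(k)]
    for i in order:
        total, b = heapq.heappop(heap)
        buckets[b].append(i)
        heapq.heappush(heap, (total + sizes[i], b))
    return buckets

print(partition_greedy([5, 9, 3, 7, 1], k=2))  # item indices per rank
```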

**Experiments**
The mean number of parameters represents a perfectly uniform allocation and hence the ideal allocation (which may be even better than the optimal partition). In the following tables, I present the maximum number of parameters for any one process and the difference from the mean in parentheses for ResNet-50, ResNet-152, and BERT (the bare BERT model). The best-performing partitioning strategy for each model is bolded.

Two processes:
| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 13,249,600 (471,084) | **12,794,816 (16,300)** | 12,778,516 |
| ResNet-152 | 30,567,488 (471,084) | **30,111,424 (15,020)** | 30,096,404 |
| BERT | **54,749,184 (8,064)** | 55,327,488 (586,368) | 54,741,120 |

Four processes:
| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 7,524,864 (1,135,606) |  **6,436,864 (47,606)** | 6,389,258 |
| ResNet-152 | 16,232,192 (1,183,990) | **15,090,152 (41,950)** | 15,048,202 |
| BERT | **28,151,040 (780,480)** | 28,352,256 (981,696)  | 27,370,560 |

 ---

I also investigated the latency of `optimizer.step()` for the different partitioning algorithms. I measured the latency for 30 iterations and took the mean latency per process (excluding the first iteration due to cache coldness). In the following tables, I present the maximum of those mean latencies over all processes and the standard deviation of the latencies contributing to that maximum. Again, the best-performing partitioning strategy for each model is bolded. All entries are presented in seconds and used `gloo` backend.

Two processes:
| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | **0.060 (0.002)** | 0.061 (0.002) |
| ResNet-152 | 0.166 (0.003) | **0.160 (0.004)** |
| BERT | 0.220 (0.009) | **0.199 (0.006)** |

Four processes:
| Model | Max `optimizer.step()` Time - Greedy | Max `optimizer.step()` Time - Greedy-Sorted |
| --- | --- | --- |
| ResNet-50 | 0.094 (0.004) | **0.093 (0.004)** |
| ResNet-152 | **0.228 (0.011)** | 0.231 (0.009) |
| BERT | **0.328 (0.015)** | 0.329 (0.021) |

Based on the standard deviations, the differences in the latency measurements across the different algorithms appear to be within the uncertainty in the measurement itself. Hence, it is difficult to argue that one algorithm is clearly the fastest.

 ---

`zero.py` is my experiment script, and I use the AI AWS cluster. The run command looks like:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python zero.py -b nccl greedy 2 4
```
This runs the experiment script on an instance with 4 GPUs using `nccl` backend, outputting to a directory named `greedy/`, and using world sizes of 2 and 4. An analogous command can be used after modifying `partition_parameters()`, e.g. replacing `greedy` with `greedy_sorted` as the output directory name. Then, to run the analysis script:
```
python analyze.py greedy greedy_sorted
```
For more details on the experiment code, refer to: https://www.internalfb.com/diff/D28946756

**Notes:**
There exists an optimal solution to this partitioning problem. An algorithm that finds such a solution is the [complete greedy algorithm (CGA)](https://en.wikipedia.org/wiki/Greedy_number_partitioning#An_exact_algorithm), which reduces to the brute-force combinatorial search in the worst case. There exist heuristics to improve the `k = 2` case (i.e. when there are two processes); however, given that `n` in typical use cases is very large, any algorithm that is quadratic or slower is unrealistic. Other exact algorithms are similarly exponential in the worst case, rendering them intractable. Given this, I do not currently see a need for future proofing the partitioning algorithm against the introduction of algorithms beyond the naive greedy and the sorted greedy algorithms.

 ---

In the current ZeRO implementation, the core `partition_parameters()` computation happens twice upon initialization (i.e. call to `__init__()`): first from a call to `_param_to_rank()` (i.e. an access to `_param_to_rank`) and then from a call to `_update_trainable()`. `_update_trainable()` sees that no optimizer has been constructed yet, so it clears the cache, eliminating the first `partition_parameters()` computation and performing a redundant re-computation.

Here is a typical trace:
- [The ZeRO optimizer object is initialized, calling `__init__()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L142))
- [In `__init__()`, `self._device` is set, so it accesses `self._per_device_params`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L182))
- [`self._per_device_params` is not cached, so it accesses `self._param_to_rank`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L340))
- [`self._param_to_rank` is not cached, so it calls `partition_parameters()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L353)) (first call to `partition_parameters()`)
- [`__init__()` later calls `_update_trainable()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L185))
- [In `_update_trainable()`, `self` does not have `attr` `"optim"`, so it clears the cached objects (notably, `self._partition_parameters_cache`).](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L591))
- [`_update_trainable()` calls `self.partition_parameters()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L593)) (second call to `partition_parameters()`)

Based on the discussion [here](https://github.com/pytorch/pytorch/pull/59410), this recomputation is unintentional and should be addressed in a future diff.

Test Plan: I verified that the total number of parameters across the processes was consistent after the partitioning algorithm change. Otherwise, no additional modifications were made to existing tests.

Reviewed By: mrshenli

Differential Revision: D28946755

fbshipit-source-id: 7ad66a21a963555b3b2e693ba8069d2dddc94c60
2021-06-08 09:47:35 -07:00
935057fc74 [package] turn MockZipReader into DirectoryReader and add test coverage (#59107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59107

Adding documentation, test coverage, and a missing method to the `DirectoryReader` class. `DirectoryReader` was previously named `MockZipReader`, and is used for operating on opened package archives via a `PackageImporter`.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28760410

Pulled By: Lilyjjo

fbshipit-source-id: aa9d0a68e19738a6d5555bb04ce33af6a53f1268
2021-06-08 08:02:34 -07:00
693b2696f8 add dispatch for bitwise_and (#59388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59388

Reviewed By: agolynski

Differential Revision: D28891985

Pulled By: ezyang

fbshipit-source-id: 4f8b301ba615f1e21a920f02166d64c978204adb
2021-06-08 07:51:47 -07:00
4920d5a05a Temporarily add skip to fix slow gradcheck failure on master (#59585)
Summary:
Related https://github.com/pytorch/pytorch/issues/59584

Failure https://app.circleci.com/pipelines/github/pytorch/pytorch/331771/workflows/fed7923c-3490-490f-8769-81a71beae558/jobs/13940286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59585

Reviewed By: albanD

Differential Revision: D28945267

Pulled By: soulitzer

fbshipit-source-id: 72ae4b6c9a04fe9fdfb89888e12bae25c78be23c
2021-06-08 07:21:30 -07:00
5c7e14d2bc [DataLoader] Switch NotImplementedError to TypeError for len (#59464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59464

Fixes #59378

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28944447

Pulled By: ejguan

fbshipit-source-id: 8b3d53a1863b41e578d56f219e452d18d7eae0d8
2021-06-08 07:16:18 -07:00
1b578c4bf5 [DataLoader] Close byte stream explicitly (#58938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58938

When running `test_datapipe.py`, Python `gc` reports lots of `ResourceWarning`s due to unclosed streams. This is not only annoying; there are two potential problems:
- Performance regression, because `gc` requires additional memory and computation to track references
- Python `gc` runs periodically, so we may encounter a "too many open files" error due to OS limits
To reduce the warnings:
- Explicitly close byte streams
- Modify `test_datapipe.py` to use context managers (illustrated below)

Small fix:
- Reorder imports in `test_datapipe.py`

Further investigation:
Can we directly use a context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this error?
- Probably not. It's feasible only if the pipeline is synchronous and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to attach a reference counter to these file byte streams so they close themselves.
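
A minimal illustration of the failure mode and the fix:
```python
def leaky(path):
    f = open(path, "rb")
    return f.read()               # f never closed: gc later emits
                                  # "ResourceWarning: unclosed file ..."

def tidy(path):
    with open(path, "rb") as f:   # closed deterministically on scope exit
        return f.read()
```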

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28689862

Pulled By: ejguan

fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653
2021-06-08 07:15:08 -07:00
90c5b74e47 Back out "[PyTorch Edge] bytecode version bump to v5 and enable share constant table" (#59432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59432

Original commit changeset: 6f5cf4296eaa
ghstack-source-id: 130805860

Test Plan: CI

Reviewed By: raziel, iseeyuan

Differential Revision: D28892955

fbshipit-source-id: ce414a4c7a18001bdd27333cea03c6403b39d146
2021-06-08 07:11:26 -07:00
5d6a10a765 Revert D28913223: [pytorch][PR] Adding run-specified-test-cases option in run_test.py
Test Plan: revert-hammer

Differential Revision:
D28913223 (24432eaa29)

Original commit changeset: 0d1f99109734

fbshipit-source-id: 47c073720cff23a5d4cb64556381c46025e90937
2021-06-08 02:18:16 -07:00
010bcb4c2d Fix xnnpack hardswish memory issue (#59577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59577

Collapse all dimensions of the tensor into the batch dimension and use 1 for channels. This fixes the 1D over-calculation case.

Test Plan:
buck test fbandroid/mode/server fbandroid/mode/asan_ubsan fbsource//xplat/caffe2:pt_xnnpack_test

buck test fbsource//xplat/caffe2:pt_xnnpack_test

Reviewed By: kimishpatel

Differential Revision: D28942141

fbshipit-source-id: b36f820a900b6a2ed649d6b9bac79d3392d3537c
2021-06-07 21:56:05 -07:00
1faba1e4cc [Pytorch Edge] Make RegisterBackendSelect Selective (#59096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59096

RegisterBackendSelect brings ~100 extra ops into the runtime. This interferes with the compatibility API and also adds a nontrivial amount of binary size.

Test Plan: Model Unittests/CI

Reviewed By: iseeyuan

Differential Revision: D28588100

fbshipit-source-id: ffd0b5b9cbe20f27dbf3be418a6c1f80c7396fdb
2021-06-07 19:48:46 -07:00
501320ed81 [pytorch] deprecate default_op_deps.yaml (#59573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573

To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;

We are developing 3. For open source, we used to only support 1, and
currently we support both 1 and 2.

This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can try if they don't bother to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.

Since the workflow to automatically update the file has been broken
for a while, it started to confuse other pytorch developers as people
are already manually editing it, and it might be broken for some models
already.

We reintroduced the static dispatch recently, so we decide to deprecate
this file now and automatically turn on static dispatch if users run
selective build without providing the static analysis graph.

The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.

Differential Revision:
D28941020
D28941020

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: ljk53

fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
2021-06-07 19:37:37 -07:00
c436426be8 [fbgemm] fix gconv + acc16 (#59541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59541

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/621

Fixing 2 issues. These are actually 2 independent issues, one in Caffe2 and another in FBGEMM, so there is no need to wait until FBGEMM is synchronized with PyTorch:

1) conv 16-bit accumulation doesn't support the fast gconv path, so TakeGConvFastPath_ should honor that
2) packed_index_ generates indices up to (G/GTogether_) * F * R * S * OC_per_G * GTogether_ * paddedICPerG, which can exceed the G * kernel_prod * OC_per_G * paddedICPerG allocated in PackWeightMatrixForGConv (kernel_prod = F * R * S): e.g., when G=3, GTogether_=2, we allocate 3 * F * R * S * OC_per_G * paddedICPerG but we access up to 2 * F * R * S * OC_per_G * 2 * paddedICPerG (worked through below)

BTW, not sure how this issue went unnoticed for so long. Any ideas would be appreciated.
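
Working through the example's index arithmetic (reading `G/GTogether_` as a ceiling division, as the example implies; all sizes in units of F * R * S * OC_per_G * paddedICPerG):
```python
import math

G, GTogether = 3, 2
allocated = G                                    # 3 units
accessed = math.ceil(G / GTogether) * GTogether  # 2 * 2 = 4 units
print(accessed > allocated)                      # True -> out-of-bounds access
```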

Test Plan:
In a BDW machine,
buck test //caffe2/caffe2/quantization/server:conv_groupwise_dnnlowp_acc16_op_test -- --run-disabled

Reviewed By: dskhudia

Differential Revision: D28927214

fbshipit-source-id: 3ec98ea2fc177545392a0148daca592d80f40ad3
2021-06-07 19:20:59 -07:00
57d8bccd00 only reorder tests based on git diff if IN_CI (#59565)
Summary:
Do not reorder tests unless IN_CI is set; reordering makes local development test ordering nondeterministic. Most of us branch out from viable/strict, not the head of master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59565

Reviewed By: ejguan

Differential Revision: D28943906

Pulled By: walterddr

fbshipit-source-id: e742e7ce4b3fc017d7563b01e93c4cd774d0a537
2021-06-07 17:54:19 -07:00
dafa4b3517 quantization: improve documentation on natively supported backends (#58925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58925

Cleans up documentation on natively supported backends.  In particular:
* adds a section title
* deduplicates information about fbgemm/qnnpack
* clarifies what `torch.backends.quantized.engine` does
* adds code samples with default settings for `fbgemm` and `qnnpack`

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28681840

Pulled By: vkuzo

fbshipit-source-id: 51a6ab66934f657553351f6c84a638fd5f7b4e12
2021-06-07 17:29:03 -07:00
6575975da9 [Reland2][DDP] Merge work and future_work in reducer (#59574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59574

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove `copy_grad_to_bucket` method since now it's only one-line implementation, and created a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven input.

1) Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, updated `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply division first and hence avoid FP16 overflow.

2) Compared with the reverted https://github.com/pytorch/pytorch/pull/59520, disabled `test_DistributedDataParallel_non_default_stream` on AMD, because now applying division first hurts the gradient averaging accuracy on AMD.
See [07:48:26]:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130752393

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_DistributedDataParallel_non_default_stream

Reviewed By: rohan-varma

Differential Revision: D28940800

fbshipit-source-id: 1ba727ac951ebc1e7875dc1a1be8108a2c8d9462
2021-06-07 16:52:20 -07:00
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
ff553e5b09 enable upload test stats on PR (#59567)
Summary:
Enable test stats upload on PR.

Uses the PR number as part of the key so that it can be properly indexed and later parsed if the PR has been merged/closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59567

Reviewed By: ejguan

Differential Revision: D28943654

Pulled By: walterddr

fbshipit-source-id: f3a7a25ae14c6877067e1b347e3a8658d80d1544
2021-06-07 16:45:10 -07:00
24432eaa29 Adding run-specified-test-cases option in run_test.py (#59487)
Summary:
The run-specified-test-cases option would allow us to specify a list of test cases to run by providing a CSV with at least two columns: test_filename and test_case_name.

This PR also adds .json to some files we use for better clarity.

Usage:
`python test/run_test.py --run-specified-test-cases <csv_file>` where the csv file can look like:
```
test_filename,test_case_name,test_total_time,windows_only_failure_sha_count,total_sha_count,windows_failure_count,linux_failure_count,windows_total_count,linux_total_count
test_cuda,test_cudnn_multiple_threads_same_device,8068.8409659525,46,3768,53,0,2181,6750
test_utils,test_load_standalone,8308.8062920459,14,4630,65,0,2718,8729
test_ops,test_forward_mode_AD_acosh_cuda_complex128,91.652619369806,11,1971,26,1,1197,3825
test_ops,test_forward_mode_AD_acos_cuda_complex128,91.825633094915,11,1971,26,1,1197,3825
test_profiler,test_source,60.93786725749,9,4656,21,3,2742,8805
test_profiler,test_profiler_tracing,203.09352795241,9,4662,21,3,2737,8807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59487

Test Plan:
Without specifying the option, everything should be as they were before.

Running `python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv` resulted in this paste P420276949 (you can see internally). A snippet looks like:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv
Loading specified test cases to run from windows_smoke_tests.csv.
Processed 28 test cases.
Running test_cpp_extensions_jit ... [2021-06-04 17:24:41.213644]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_cpp_extensions_jit.py', '-k', 'test_jit_cuda_archflags'] ... [2021-06-04 17:24:41.213781]
s
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)
...
```
With pytest, an example executable would be:
`Running test_dataloader ... [2021-06-04 17:37:57.643039]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', '-m', 'pytest', 'test_dataloader.py', '-v', '-k', 'test_segfault or test_timeout'] ... [2021-06-04 17:37:57.643327]`

Reviewed By: samestep

Differential Revision: D28913223

Pulled By: janeyx99

fbshipit-source-id: 0d1f9910973426b8756815c697b483160517b127
2021-06-07 16:27:43 -07:00
caf76c2445 Move sharding to after all tests have been excluded (#59583)
Summary:
It would be most accurate if sharding occurred after all other changes to selected_tests were complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59583

Reviewed By: ejguan

Differential Revision: D28944737

Pulled By: janeyx99

fbshipit-source-id: a851473948a5ec942ffeeedeefdc645536a3d9f7
2021-06-07 15:04:36 -07:00
93140a31e2 Use irange in a few places (#55325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55325

Test Plan: Sandcastle

Reviewed By: SciPioneer

Differential Revision: D27573006

fbshipit-source-id: 647b5da3901e92c23e95b2fe5e833e9081d72837
2021-06-07 14:53:41 -07:00
737d920b21 Strictly type everything in .github and tools (#59117)
Summary:
This PR greatly simplifies `mypy-strict.ini` by strictly typing everything in `.github` and `tools`, rather than picking and choosing only specific files in those two dirs. It also removes `warn_unused_ignores` from `mypy-strict.ini`, for reasons described in https://github.com/pytorch/pytorch/pull/56402#issuecomment-822743795: basically, that setting makes life more difficult depending on what libraries you have installed locally vs in CI (e.g. `ruamel`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59117

Test Plan:
```
flake8
mypy --config mypy-strict.ini
```

Reviewed By: malfet

Differential Revision: D28765386

Pulled By: samestep

fbshipit-source-id: 3e744e301c7a464f8a2a2428fcdbad534e231f2e
2021-06-07 14:49:36 -07:00
6ff001c125 DOC Improve documentation for LayerNorm (#59178)
Summary:
Closes https://github.com/pytorch/pytorch/issues/51455

I think the current implementation is aggregating over the correct dimensions. The shape of `normalized_shape` is only used to determine the dimensions to aggregate over. The actual values of `normalized_shape` are used when `elementwise_affine=True` to initialize the weights and biases.

This PR updates the docstring to clarify how `normalized_shape` is used. Here is a short script comparing the implementations for tensorflow and pytorch:

```python
import numpy as np
import torch
import torch.nn as nn

import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization

rng = np.random.RandomState()
x = rng.randn(10, 20, 64, 64).astype(np.float32)
# slightly non-trival
x[:, :10, ...] = x[:, :10, ...] * 10 + 20
x[:, 10:, ...] = x[:, 10:, ...] * 30 - 100

# Tensorflow Layer norm
x_tf = tf.convert_to_tensor(x)
layer_norm_tf = LayerNormalization(axis=[-3, -2, -1], epsilon=1e-5)
output_tf = layer_norm_tf(x_tf)
output_tf_np = output_tf.numpy()

# PyTorch Layer norm
x_torch = torch.as_tensor(x)
layer_norm_torch = nn.LayerNorm([20, 64, 64], elementwise_affine=False)
output_torch = layer_norm_torch(x_torch)
output_torch_np = output_torch.detach().numpy()

# check tensorflow and pytorch
torch.testing.assert_allclose(output_tf_np, output_torch_np)

# manual comutation
manual_output = ((x_torch - x_torch.mean(dim=(-3, -2, -1), keepdims=True)) /
                 (x_torch.var(dim=(-3, -2, -1), keepdims=True, unbiased=False) + 1e-5).sqrt())

torch.testing.assert_allclose(output_torch, manual_output)
```

To get to the layer normalization as shown here:

<img width="157" alt="Screen Shot 2021-05-29 at 2 13 52 PM" src="https://user-images.githubusercontent.com/5402633/120080691-1e37f100-c088-11eb-9060-4f263e4cd093.png">

One needs to pass in `normalized_shape` of length `x.dim() - 1`, containing the sizes of the channel and all spatial dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59178

Reviewed By: ejguan

Differential Revision: D28931877

Pulled By: jbschlosser

fbshipit-source-id: 193e05205b9085bb190c221428c96d2ca29f2a70
2021-06-07 14:34:10 -07:00
a30b359590 fix double backward for binary_cross_entropy loss function when reduction=sum. (#59479)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59477.

```python
In [1]: import torch

In [2]: x = torch.rand(3, 3, dtype=torch.double, requires_grad=True)

In [3]: y = torch.rand(3, 3, dtype=torch.double)

In [4]: torch.autograd.gradgradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='sum'), [x, y])
Out[4]: True

In [5]: torch.autograd.gradgradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='mean'), [x, y])
Out[5]: True

In [6]: torch.autograd.gradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='sum'), [x, y])
Out[6]: True

```

More comprehensive testing could be added in https://github.com/pytorch/pytorch/pull/59447 where explicit `gradcheck` and `gradgradcheck` tests are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59479

Reviewed By: ejguan

Differential Revision: D28934354

Pulled By: albanD

fbshipit-source-id: 12ce68e3c5c499b2531f7cdba3c22548d67e07e9
2021-06-07 14:14:08 -07:00
77dde35f1a Fix error message formatting in _make_grads (#59532)
Summary:
- TORCH_CHECK doesn't handle printf style format and it will output like: `got %ld tensors and %ld gradients21`
- `got 2 tensors and 1 gradients` should be the expected message for this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59532

Reviewed By: ejguan

Differential Revision: D28934680

Pulled By: albanD

fbshipit-source-id: 2d27a754ae81310b9571ae2a2ea09d0f8d8a3d81
2021-06-07 14:05:24 -07:00
24e27af683 [ROCm] enable kernel asserts (#49624)
Summary:
Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624

Reviewed By: agolynski

Differential Revision: D28902459

Pulled By: malfet

fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838
2021-06-07 13:43:07 -07:00
05b571ee8e fix name of 'dims' kwarg in torch.tile docs (#59471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59471

Fixes #59150

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28908569

Pulled By: saketh-are

fbshipit-source-id: 57d0e75d899a1d9979e8bdb20dfd2b136dd63d1b
2021-06-07 13:18:19 -07:00
b0ac9bfb2b Add warning about should_drop for JIT coverage plug-in (#57961)
Summary:
This adds a comment above `should_drop` to prevent someone from inadvertently breaking JIT coverage by renaming the function without updating the correct references.

The current JIT plug-in uses `should_drop` to figure out which code is going to be JIT'd. If the function is named differently, the plug-in would also need to be updated.

Question: I understand this may not be the cleanest solution. Would a cleaner solution be to create a dummy function that would simply exist for the JIT plug-in? I did not immediately do that as that may be adding unnecessary code complexity in torch.jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57961

Reviewed By: samestep

Differential Revision: D28933587

Pulled By: janeyx99

fbshipit-source-id: 260aaf7b11f07de84a81d6c3554c4a5ce479d623
2021-06-07 12:48:01 -07:00
8693e288d7 DOC Small rewrite of interpolate recompute_scale_factor docstring (#58989)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55909

This PR looks to improve the documentation to describe the following behavior:

8130f2f67a/torch/nn/functional.py (L3673-L3685)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58989

Reviewed By: ejguan

Differential Revision: D28931879

Pulled By: jbschlosser

fbshipit-source-id: d1140ebe1631c5ec75f135c2907daea19499f21a
2021-06-07 12:40:05 -07:00
1798ff02e4 [PyTorch] Optimize c10::optional<ArrayRef<T>> for size (#59333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59333

Code comment should explain this in sufficient detail. In brief, making it 16 bytes should get it to be passed in registers.
ghstack-source-id: 130631329

Test Plan: Updated optional_test and added static_assert in Optional.cpp.

Reviewed By: ezyang

Differential Revision: D28843027

fbshipit-source-id: 3029f05e03a9f04ca7337962e7770cdeb9a608d9
2021-06-07 11:35:17 -07:00
cc03ea2c47 [quant] Implemented InputWeightObserver for Linear inputs
Summary: Implemented two observers (InputEqualObserver and WeightEqualObserver) which will be inserted into the graph during prepare_fx().

Test Plan: python test/test_quantization.py TestEqualizeFx

Reviewed By: supriyar

Differential Revision: D28836954

fbshipit-source-id: 25517dc82ae67698ed8b2dc334e3323286976104
2021-06-07 11:19:43 -07:00
c51abf8fca Make binary_cross_entropy differentiable wrt target (#59447)
Summary:
As per title. Resolves https://github.com/pytorch/pytorch/issues/56683.
`gradgradcheck` will fail once `target.requires_grad() == True` because of the limitations of the current double backward implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59447

Reviewed By: agolynski

Differential Revision: D28910140

Pulled By: albanD

fbshipit-source-id: 20934880eb4d22bec34446a6d1be0a38ef95edc7
2021-06-07 09:20:17 -07:00
94cc681fc2 Revert D28922305: [Reland][DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28922305 (3137bbeb1a)

Original commit changeset: 6388a96eda7a

fbshipit-source-id: bc150672e857286eeb129ea683b1cfd2034f0564
2021-06-07 03:58:20 -07:00
f998e63dca Revert D28922548: [Gradient Compression] Apply division first to avoid overflow
Test Plan: revert-hammer

Differential Revision:
D28922548 (459270ac01)

Original commit changeset: 442bd3cc7a35

fbshipit-source-id: 7e4361b4eb283cdb21f15a36d6eebf558dd7386f
2021-06-07 03:57:10 -07:00
459270ac01 [Gradient Compression] Apply division first to avoid overflow (#59522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
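
A minimal Python sketch of the idea, written as a hypothetical DDP comm hook (assuming the `GradBucket.buffer()` accessor; this is not the actual `_AllReduceCommHookWithDivFactor` implementation):

```python
import torch.distributed as dist

def allreduce_div_first_hook(state, bucket):
    # Divide by the world size *before* the allreduce so that large
    # FP16 gradients do not overflow when summed across ranks.
    tensor = bucket.buffer().div_(dist.get_world_size())
    fut = dist.all_reduce(tensor, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])
```

Such a hook would be registered via `model.register_comm_hook(None, allreduce_div_first_hook)`.
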
ghstack-source-id: 130686229

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28922548

fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
2021-06-07 01:43:10 -07:00
a2e56fa0dc Adding users of a node to the serialized JSON. (#59357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59357

Adding users of a node to the serialized JSON. Illustrated in the example:

JSON:
P419734894

Examples:
    {
      "shape": "[7]",
      "dtype": "torch.float16",
      "stride": "[1]",
      "is_quantized": false,
      "target": "conv.bias",
      "op_code": "get_attr",
      "name": "conv_bias",
      "args": [],
      "kwargs": {},
      "users": [
        {
          "is_node": true,
          "name": "to_dtype"
        }
      ]
    }

    {
      "target": "output",
      "op_code": "output",
      "name": "output",
      "args": [
        {
          "is_node": true,
          "name": "fba_layout_transform_1",
          "shape": "[3, 7, 12, 12]",
          "dtype": "torch.float16",
          "stride": "[1008, 144, 12, 1]",
          "is_quantized": false
        }
      ],
      "kwargs": {},
      "users": []
    }

Test Plan: buck test //caffe2/test:test_fx_experimental

Reviewed By: gcatron, jfix71

Differential Revision: D28857487

fbshipit-source-id: a3bac6bdb21ce10ba4a0d170c809aef13e6174a6
2021-06-06 23:15:32 -07:00
de40c8e495 Adds remaining OpInfos and removes redundant test generators (#55558)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558

Reviewed By: ngimel

Differential Revision: D28922522

Pulled By: mruberry

fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93
2021-06-06 14:52:26 -07:00
8c852de54d [PyTorch Edge] Remove legacy and kineto profilers from mobile build (#58730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58730

The sources for the profilers are not needed in the mobile build, and unnecessarily add weight to the build. Remove them from the lite-interpreter build.

ghstack-source-id: 130684568

Test Plan: Build + BSB

Reviewed By: kimishpatel, raziel

Differential Revision: D28563725

fbshipit-source-id: 9d6f76176c2d2bbc25703281af1a076b1f2b4f19
2021-06-06 13:16:07 -07:00
3137bbeb1a [Reland][DDP] Merge work and future_work in reducer (#59520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also handle uneven input.

Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, updated `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply division first and hence avoid FP16 overflow.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

Reviewed By: walterddr

Differential Revision: D28922305

fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
2021-06-06 09:49:08 -07:00
390fe74944 Migrate torch.lstsq to ATen (#59400)
Summary:
Closes  https://github.com/pytorch/pytorch/issues/24726, closes https://github.com/pytorch/pytorch/issues/44011

This builds on the port from https://github.com/pytorch/pytorch/issues/44011. I've rebased on master and addressed mruberry's comments. There were also some unnecessary copies of `B` taking place that I've cleaned up. This function is already deprecated, but since it's the last lapack routine in TH, it's still worth porting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59400

Reviewed By: mruberry

Differential Revision: D28922060

Pulled By: ngimel

fbshipit-source-id: cfd7ec8b50d2ab886f0e04a2a557e4e410ee8184
2021-06-06 02:18:17 -07:00
da972afdcd OpInfo: to_sparse (#59445)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59445

Reviewed By: ngimel

Differential Revision: D28920866

Pulled By: mruberry

fbshipit-source-id: ba8d3071d9937096288b69511000eeb007f53434
2021-06-05 19:13:58 -07:00
96ac0e0340 OpInfo: t (#59442)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59442

Reviewed By: agolynski

Differential Revision: D28898946

Pulled By: mruberry

fbshipit-source-id: be32429fa7306554e4912fdcc382593d00c9f4ad
2021-06-05 18:59:38 -07:00
0a5bfa9919 Support __rmod__ (#58476)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58035.

This PR implements `torch.Tensor.__rmod__` and `torch.remainder(scalar, tensor)` for compatibility with NumPy’s interface.
(cc: mruberry, rgommers, emcastillo, kmaehashi)
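
A quick sketch of the behavior this enables:

```python
import torch

t = torch.tensor([2, 3, 4])
print(5 % t)                  # tensor([1, 2, 1]); dispatches to Tensor.__rmod__
print(torch.remainder(5, t))  # same values via the scalar-first functional form
```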

TODO:
  - [x] Update `tensor_binary_op` in test/test_binary_ufuncs.py after https://github.com/pytorch/pytorch/issues/58216 is merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58476

Reviewed By: ngimel

Differential Revision: D28776810

Pulled By: mruberry

fbshipit-source-id: 74f8aea80f439ef2cc370333524e39971eeb7bf4
2021-06-05 16:19:24 -07:00
344ecb2e71 flip via TI (#59509)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59509

Reviewed By: mruberry

Differential Revision: D28918665

Pulled By: ngimel

fbshipit-source-id: b045c7b35eaf22e53b1bc359ffbe5a4fda05dcda
2021-06-05 15:43:29 -07:00
1be7ca71ee OpInfo: log_softmax (#59336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59336

Reviewed By: agolynski

Differential Revision: D28899052

Pulled By: mruberry

fbshipit-source-id: 60a9a4ffbca5a0f2c899d4d83500dcab4555ffb0
2021-06-05 13:51:50 -07:00
1dcc034fba [caffe2] Avoid attempt to use undefined preprocessor directive
Summary:
This is somewhat more verbose, but it's more correct and addresses this warning on Visual Studio 2017:
```
xplat\caffe2\caffe2\core\common.h(76): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```

Test Plan: Built locally with fix

Reviewed By: simpkins

Differential Revision: D28868632

fbshipit-source-id: f6a583e8275162adedb2a4bc5ed0f64847020871
2021-06-05 09:22:52 -07:00
1d9c1cc00a [4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331

This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class, allowing two server stores to be instantiated with the same host:port pair.
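
A hypothetical Python-side sketch (assuming the binding exposes `multi_tenant` and `wait_for_workers` keyword arguments):

```python
from datetime import timedelta

import torch.distributed as dist

# Two *server* stores sharing the same host:port pair.
s1 = dist.TCPStore("127.0.0.1", 29500, 2, True,
                   timeout=timedelta(seconds=30),
                   wait_for_workers=False, multi_tenant=True)
s2 = dist.TCPStore("127.0.0.1", 29500, 2, True,
                   timeout=timedelta(seconds=30),
                   wait_for_workers=False, multi_tenant=True)
```
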
ghstack-source-id: 130676394

Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.

Reviewed By: H-Huang

Differential Revision: D28453850

fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
2021-06-05 07:50:07 -07:00
844a98758a [3/n] [c10d] Revise the implementation of TCPStore (#58330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58330

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a major refactoring of the `TCPStore` class in preparation of the multi-tenancy feature.

- All TCP sockets are wrapped with a new `TCPSocket` RAII type.
- `BackgroundThread` and daemon types are moved from header to cpp file.
- Server, client, and callback sockets are refactored into their own internal types `TCPServer`, `TCPClient` and `TCPCallbackClient`.
- Calls to `tcputil::send*` and `tcputil::recv*` are wrapped in `TCPClient` for easier readability and maintenance purposes.
- Two `TODO` statements are put to reference future improvements. Based on feedback, I will either create separate GitHub issues for them or address them as part of this stack.
ghstack-source-id: 130676392

Test Plan: Run the existing tests since there are no user-facing behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28448981

fbshipit-source-id: 415b21e74b3cd51d673c1d5c349c6a2cb21dd667
2021-06-05 07:50:06 -07:00
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
cf408c3743 [1/n] [c10d] Introduce a new TCPStore constructor (#58328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58328

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a new `TCPStore` constructor that takes its optional parameters via a newly introduced `TCPStoreOptions` structure. This gives the API callers the flexibility to specify only the desired options while skipping the rest.

The main motivation behind this change is the introduction of the `multiTenant` constructor option in the second PR of this stack.
ghstack-source-id: 130676384

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28417742

fbshipit-source-id: e6ac2a057f7ad1908581176ee6d2c2554c3c74a9
2021-06-05 07:50:02 -07:00
91eb831422 Revert D28698997: [Static Runtime] Add schema check to aten ops
Test Plan: revert-hammer

Differential Revision:
D28698997 (10345010f7)

Original commit changeset: 232fc60c0321

fbshipit-source-id: e351df62779fea85b7afe5160d3c40c4e7cee4ed
2021-06-05 07:48:49 -07:00
c88a0b55b3 Revert D28677383: [DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28677383 (f8bebade47)

Original commit changeset: 85e0620378b7

fbshipit-source-id: ef3c65b88c375aa9a6befe2ab004ec37ae7eb587
2021-06-05 07:25:44 -07:00
f8bebade47 [DDP] Merge work and future_work in reducer (#58937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58937

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also handle uneven input.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130673249

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: agolynski

Differential Revision: D28677383

fbshipit-source-id: 85e0620378b7e9d837e436e94b9d807631d7d752
2021-06-05 01:18:30 -07:00
5117ac3bb4 Revert D28877076: [pytorch][PR] torch.flip via TI
Test Plan: revert-hammer

Differential Revision:
D28877076 (d82bc3feb8)

Original commit changeset: 4fa6eb519085

fbshipit-source-id: c81e7d3283ff6822db913bf9f49a1533268755d0
2021-06-04 23:03:53 -07:00
10345010f7 [Static Runtime] Add schema check to aten ops (#59426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59426

Reviewed By: ajyu

Differential Revision: D28698997

fbshipit-source-id: 232fc60c0321b8e68e4f1b6705233485260c281d
2021-06-04 21:38:45 -07:00
d82bc3feb8 torch.flip via TI (#58747)
Summary:
Implements an idea by ngimel to improve the performance of `torch.flip` via a clever hack into TI to bypass the fact that TI is not designed to work with negative indices.

Something that might be added is vectorisation support on CPU, given how simple the implementation is now.

Some low-hanging fruits that I did not implement:
- Write it as a structured kernel
- Migrate the tests to opinfos
- Have a look at `cumsum_backward` and `cumprod_backward`,  as I think that they could be implemented faster with `flip`, now that `flip` is fast.

**Edit**
This operation already has OpInfos and it cannot be migrated to a structured kernel because it implements quantisation

Summary of the PR:
- x1.5-3 performance boost on CPU
- x1.5-2 performance boost on CUDA
- Comparable performance across dimensions, regardless of the strides (thanks TI)
- Simpler code

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(size, dims, num_threads, device):
    x = torch.rand(*size, device=device)

    timer = Timer(
        "torch.flip(x, dims=dims)",
        globals={"x": x, "dims": dims},
        label=f"Flip {device}",
        description=f"dims: {dims}",
        sub_label=f"size: {size}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    sizes = ((1000,)*2, (1000,)*3, (10000,)*2)
    for size, device in product(sizes, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        list_dims = [(0,), (1,), (0, 1)]
        if len(size) == 3:
            list_dims.append((0, 2))
        for num_threads, dims in product(threads, list_dims):
            yield size, dims, num_threads, device

def compare():
    compare = Compare([get_timer(*params) for params in get_params()])
    compare.trim_significant_figures()
    compare.colorize()
    compare.print()

compare()
```
</details>

<details>
<summary>
Benchmark PR
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139954-81e46d80-ba3b-11eb-9aad-e825e515d41b.png)

</details>

<details>
<summary>
Benchmark master
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139915-76914200-ba3b-11eb-9aa8-84b3ca220c93.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58747

Reviewed By: agolynski

Differential Revision: D28877076

Pulled By: ngimel

fbshipit-source-id: 4fa6eb519085950176cb3a9161eeb3b6289ec575
2021-06-04 20:13:38 -07:00
bca25d97ad [itemwise-dropout][1/x][low-level module] Implement Itemwise Sparse Feature Dropout in Dper3 (#59322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59322

Implement sparse feature dropout (with replacement) that can drop out individual items in each sparse feature. By contrast, the existing sparse feature dropout with replacement drops out the whole feature (e.g., a list of page ids) when the feature is selected for dropout. This itemwise dropout instead assigns a dropout probability to, and drops out, individual items in sparse features.

Test Plan:
```
buck test mode/dev caffe2/torch/fb/sparsenn:test
```

https://www.internalfb.com/intern/testinfra/testrun/281475166777899/

```
buck test mode/dev //dper3/dper3/modules/tests:sparse_itemwise_dropout_with_replacement_test
```
https://www.internalfb.com/intern/testinfra/testrun/6473924504443423

```
buck test mode/opt caffe2/caffe2/python:layers_test
```
https://www.internalfb.com/intern/testinfra/testrun/2533274848456607

```
buck test mode/opt caffe2/caffe2/python/operator_test:sparse_itemwise_dropout_with_replacement_op_test
```
https://www.internalfb.com/intern/testinfra/testrun/8725724318782701

Reviewed By: Wakeupbuddy

Differential Revision: D27867213

fbshipit-source-id: 8e173c7b3294abbc8bf8a3b04f723cb170446b96
2021-06-04 19:59:17 -07:00
68df4d40d2 show_pickle/model_dump: Handle invalid UTF-8 in pickles (#57661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57661

The Pickle "specification" (pickletools.py) states that the argument to
a BINUNICODE opcode must be UTF-8 encoded. However, if a PyTorch custom
class returns a non-UTF-8 std::string from its pickle method, the
libtorch Pickler will write it to the output pickle without complaining.
Python's _Unpickler (the Python implementation of Unpickler) always
throws an exception when trying to deserialize these invalid pickles.

We still want to be able to dump these pickle files.  Update
DumpUnpickler to create its own opcode dispatch table (initialized as a
clone of the _Unpickler dispatch table) and patch in a custom function
for the BINUNICODE op.  We try to emulate the default behavior, but any
UnicodeDecodeError is caught and replaced with a dummy object.  This
could violate the assumptions of a user that expects a str in that
position, so we disable this behavior by default.

Update model_dump to recognize this special object and allow it to be
rendered.
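
A minimal sketch of the dispatch-table-patching idea (not the actual DumpUnpickler code; `FakeString` is a hypothetical placeholder type):

```python
import pickle
import struct

class FakeString:
    """Dummy stand-in for a BINUNICODE payload that wasn't valid UTF-8."""
    def __init__(self, raw_bytes):
        self.raw_bytes = raw_bytes

class LenientUnpickler(pickle._Unpickler):
    # Clone the pure-Python unpickler's dispatch table so only this
    # class sees the patched opcode handler.
    dispatch = dict(pickle._Unpickler.dispatch)

    def load_binunicode(self):
        length, = struct.unpack("<I", self.read(4))
        data = self.read(length)
        try:
            obj = data.decode("utf-8")
        except UnicodeDecodeError:
            obj = FakeString(data)  # dummy object instead of raising
        self.append(obj)

    dispatch[pickle.BINUNICODE[0]] = load_binunicode
```

`LenientUnpickler(f).load()` then behaves like the stock `_Unpickler`, except invalid strings come back as `FakeString` instances.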

Test Plan: Dumped and viewed a model with an invalid string in an object state.

Reviewed By: malfet

Differential Revision: D28531392

Pulled By: dreiss

fbshipit-source-id: ab5aea20975a0ef53ef52a880deaa2c5a626e4a2
2021-06-04 19:42:25 -07:00
ba3a90b55e Revert D28819780: [TensorExpr] Fix handling of 0-dim tensors.
Test Plan: revert-hammer

Differential Revision:
D28819780

Original commit changeset: f3feff35a1ce

fbshipit-source-id: 1dca4ac9cea0b67e9f02800f6d5b3c7e4ae1d81a
2021-06-04 19:25:30 -07:00
88fb5ee84c Revert D28819779: [TensorExpr] Improve debug messages.
Test Plan: revert-hammer

Differential Revision:
D28819779

Original commit changeset: 2eaa0b78fb30

fbshipit-source-id: babc22f75d87b1ba25f78ffe59266560413778ce
2021-06-04 19:20:31 -07:00
aa66990ef1 Automated submodule update: kineto (#54604)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: 88e3332ab9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54604

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: malfet

Differential Revision: D27297755

fbshipit-source-id: 5f5dd2429fb561530e6a59285c6ae708e5818ce9
2021-06-04 18:54:32 -07:00
18848d55b7 Do not use gold linker for CUDA builds (#59490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59490

Reviewed By: agolynski, seemethere

Differential Revision: D28913160

Pulled By: malfet

fbshipit-source-id: d27092c252fc86424028abe146cf5f33a2f74544
2021-06-04 18:12:26 -07:00
a682ff7ef1 Add kMaxSupportedBytecodeVersion for Lite Interpreter (#59472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59472

Previously, the lite interpreter would refuse to load any model
with a version greater than kProducedBytecodeVersion.  Now, we're
able to independently advance the loading and saving code, so we
can roll out changes without breaking forward compatibility.

Test Plan:
CI.
Loaded a bytecode v5 model even with setting kProducedBytecodeVersion
to v4.

Reviewed By: raziel

Differential Revision: D28904350

fbshipit-source-id: 598c22f0adf47d4ed3e976bcbebdf3959dacb1df
2021-06-04 17:55:02 -07:00
d125694d0b Move CUDA async warning to suffix (#59467)
Summary:
After the change async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467

Reviewed By: ngimel

Differential Revision: D28904360

Pulled By: malfet

fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
2021-06-04 17:26:28 -07:00
f23c45bd04 Revert D28841011: [TensorExpr] Fix printing of Bool dtype.
Test Plan: revert-hammer

Differential Revision:
D28841011 (19985d6f84)

Original commit changeset: 9f68dd47e14a

fbshipit-source-id: ff517cfff49e46ed513e79eabbe9e9fd246ccce8
2021-06-04 16:27:14 -07:00
6309b342c3 [nnc] Enable CPU fuser inside FB, take 5 (#59461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59461

long-tail test failures
ghstack-source-id: 130607578

Test Plan: fixed T92123560

Reviewed By: navahgar

Differential Revision: D28892885

fbshipit-source-id: 762a275b5aa14af0847e46cbf4036d3342b82189
2021-06-04 16:26:46 -07:00
f5e3eae82a [nnc] Infer device type from nodes if inputs are all scalars (#59430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59430

With constant support added, we can now have fusion groups with only
scalar inputs.  So, we need to get the device type from the nodes in the graph
rather than just the inputs.
ghstack-source-id: 130613871

Test Plan: new unit test; also see test_tracer test_trace_of_script

Reviewed By: navahgar

Differential Revision: D28891989

fbshipit-source-id: f9e824acbd4856216b85a135c8cb60a2eac3c628
2021-06-04 16:25:33 -07:00
a776072de6 .github: Switch windows instance types (#59473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59473

Switches windows instance types to prep for usage of the AWS built
Windows AMI with pre-installed Tesla Driver.

Unfortunately, neither c5.4xlarge nor g3.8xlarge is supported by this
AMI, but luckily we can swap those out for pretty comparable
alternatives like c5d.4xlarge and p3.2xlarge.

For CPU workflows this shouldn't make any real difference, since the
CPU / memory are the same on c5d.4xlarge. For GPU workflows, the GPU
on the p3.2xlarge is a Tesla V100, which should suit our needs.

<details>
<summary> nvidia-smi.exe (p3.2xlarge) </summary>

```
PS C:\Users\Administrator> & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'
Fri Jun  4 18:53:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 462.31       Driver Version: 462.31       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  TCC  | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    23W / 300W |      0MiB / 16258MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

</details>

It might eventually make sense to also switch linux to these instance types but do bear in mind that p3.2xlarge for linux is ~$0.75 more expensive than g3.8xlarge

* [Price comparison for g3.8xlarge vs. p3.2xlarge](https://instances.vantage.sh/?compare_on=true&selected=p3.2xlarge,g3.8xlarge)
* [Price comparison for c5.4xlarge vs. c5d.4xlarge](https://instances.vantage.sh/?compare_on=true&selected=c5.4xlarge,c5d.4xlarge)

AMI that I'm planning on using as the new base AMI with included Tesla driver: https://aws.amazon.com/marketplace/pp/prodview-jrxucanuabmfm?qid=1622830809415&sr=0-2&ref_=srh_res_product_title#pdp-pricing

Info about c5 instances can be found here: https://aws.amazon.com/ec2/instance-types/c5/

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28913659

Pulled By: seemethere

fbshipit-source-id: 11b4d332e82b078a6801b312dc4ace2928838fc8
2021-06-04 16:22:05 -07:00
bbf7eceaf0 Refactor c10d and dist aliases for torch.distributed (#59456)
Summary:
**Overview:**
This consolidates `c10d` and `dist` to only `dist` as the alias for `torch.distributed` in `test_store.py`. Both aliases were most likely used due to incremental additions to the test file rather than by intention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59456

Test Plan:
```
python test/distributed/test_store.py
```

Reviewed By: agolynski

Differential Revision: D28910169

Pulled By: andwgu

fbshipit-source-id: f830dead29e9de48aaf2845dfa5861c9cccec15d
2021-06-04 16:07:44 -07:00
1183fa3817 Switch PG::Work to Future in default_comm_hooks.cpp (#59398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
2021-06-04 15:27:13 -07:00
aa27136e3c Fix test_randperm_device_compatibility for 1 GPU (#59484)
Summary:
Do not try to create tensors on 2nd device if device_count() == 1

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59484

Reviewed By: ngimel

Differential Revision: D28910673

Pulled By: malfet

fbshipit-source-id: e3517f31a463dd049ce8a5155409b7b716c8df18
2021-06-04 14:41:06 -07:00
a7c8c56b7f torchdeploy allow embedded cuda interp use without cuda (#59459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59459

For any binary that can be used both with and without cuda, it's better to allow just including the cuda flavor of the interpreter.  The previous logic would fail in this case, as it only allows using the cuda flavor if torch::cuda::is_available() reports true.  Now, we unconditionally allow the cuda flavor to be used if it's present.

Test Plan: Added new unit test to exercise this scenario, ran locally on devvm without cuda.

Reviewed By: dzhulgakov

Differential Revision: D28902176

fbshipit-source-id: 5c7c90d84987848471bb6dd5318db15314e0b442
2021-06-04 14:37:39 -07:00
aeb55225e0 [caffe2] add a basic implementation of run-time feature rollout checks (#59355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59355

Add a `CheckKnob()` function for doing run-time checks of feature roll-out
knobs.  This provides an API for safely controlling the roll-out of new
functionality in the code.

Test Plan: Included some basic unit tests.

Reviewed By: voznesenskym

Differential Revision: D26536430

fbshipit-source-id: 2e53234c6d9ce624848fc8b2c76f6833f344f48b
2021-06-04 14:34:41 -07:00
90ad0f316f try fixing checkout dirty issue (#59450)
Summary:
Testing to see if checking out with submodules during the build phase will help.

This tentatively addresses https://github.com/pytorch/pytorch/issues/58867, but since the repro is not reliable, we can't be sure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59450

Reviewed By: malfet

Differential Revision: D28908537

Pulled By: walterddr

fbshipit-source-id: 21ad1392a5066554b5c633f31616ab3e6541c54d
2021-06-04 14:31:43 -07:00
c4349bfa84 [GHA] add upload binary size step (#58341)
Summary:
GHA upload working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58341

Test Plan: Internal table pytorch_binary_size row for this PR: https://github.com/pytorch/pytorch/issues/58341

Reviewed By: agolynski

Differential Revision: D28908549

Pulled By: walterddr

fbshipit-source-id: 313e5b2c5ce2a47af3c37652612af922a68fd246
2021-06-04 14:17:13 -07:00
3607478ecd Conjugate View (#54987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54987

Based off of ezyang (https://github.com/pytorch/pytorch/pull/44799) and bdhirsh (https://github.com/pytorch/pytorch/pull/43702) 's prototype:

Here's a summary of the changes in this PR:
This PR adds a new dispatch key called Conjugate. This enables us to make conjugate operation a view and leverage the specialized library functions that fast path with the hermitian operation (conj + transpose).

1. Conjugate operation will now return a view with conj bit (1) for complex tensors and returns self for non-complex tensors as before. This also means `torch.view_as_real` will no longer be a view on conjugated complex tensors and is hence disabled. To fill the gap, we have added `torch.view_as_real_physical` which would return the real tensor agnostic of the conjugate bit on the input complex tensor. The information about conjugation on the old tensor can be obtained by calling `.is_conj()` on the new tensor.
2. NEW API:
    a) `.conj()` -- now returning a view.
    b) `.conj_physical()` -- does the physical conjugate operation. If the conj bit for input was set, you'd get `self.clone()`, else you'll get a new tensor with conjugated value in its memory.
    c) `.conj_physical_()`, and `out=` variant
    d) `.resolve_conj()`  -- materializes the conjugation. returns self if the conj bit is unset, else returns a new tensor with conjugated values and conj bit set to 0.
    e) `.resolve_conj_()` in-place version of (d)
    f) `view_as_real_physical` -- as described in (1), it's functionally same as `view_as_real`, just that it doesn't error out on conjugated tensors.
    g) `view_as_real` -- existing function, but now errors out on conjugated tensors.
3. Conjugate Fallback
    a) Vast majority of PyTorch functions would currently use this fallback when they are called on a conjugated tensor.
    b) This fallback is well equipped to handle the following cases:
        - functional operation e.g., `torch.sin(input)`
        - Mutable inputs and in-place operations e.g., `tensor.add_(2)`
        - out-of-place operation e.g., `torch.sin(input, out=out)`
        - Tensorlist input args
        - NOTE: Meta tensors don't work with conjugate fallback.
4. Autograd
    a) `resolve_conj()` is an identity function w.r.t. autograd
    b) Everything else works as expected.
5. Testing:
    a) All method_tests run with conjugate view tensors.
    b) OpInfo tests that run with conjugate views
        - test_variant_consistency_eager/jit
        - gradcheck, gradgradcheck
        - test_conj_views (that only run for `torch.cfloat` dtype)

NOTE: functions like `empty_like`, `zero_like`, `randn_like`, `clone` don't propagate the conjugate bit.
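
A short sketch of the view semantics described above:

```python
import torch

x = torch.tensor([1 + 2j, 3 - 4j])

y = x.conj()              # a view; only the conj bit is set
print(y.is_conj())        # True
print(y.resolve_conj())   # materializes: tensor([1.-2.j, 3.+4.j])
print(x.conj_physical())  # eager conjugation into fresh memory
```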

Follow up work:
1. conjugate view RFC
2. Add neg bit to re-enable view operation on conjugated tensors
3. Update linalg functions to call into specialized functions that fast path with the hermitian operation.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28227315

Pulled By: anjali411

fbshipit-source-id: acab9402b9d6a970c6d512809b627a290c8def5f
2021-06-04 14:12:41 -07:00
19985d6f84 [TensorExpr] Fix printing of Bool dtype. (#59328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59328

Before the change we printed:
```
aten_eq[0] = decltype(::c10::impl::ScalarTypeToCPPType< ::c10::ScalarType::Bool>::t)((targ_0[0])==(targ_1[0]) ? 1 : 0);
```
After the change we print:
```
aten_eq[0] = bool((targ_0[0])==(targ_1[0]) ? 1 : 0);
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28841011

Pulled By: ZolotukhinM

fbshipit-source-id: 9f68dd47e14a7bc28156b56414c2d5c0aad6b2d4
2021-06-04 13:59:38 -07:00
285b8a5252 [TensorExpr] Improve debug messages. (#59280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59280

Differential Revision:
D28819779

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 2eaa0b78fb309cccb0efe9025a5c3b039e717027
2021-06-04 13:59:36 -07:00
d60efd8207 [TensorExpr] Fix handling of 0-dim tensors. (#59279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279

There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.

Differential Revision:
D28819780

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
2021-06-04 13:58:15 -07:00
dce8697aea [PyTorch][vulkan] Unify convert as vTensor& convert(const Tensor&) (#59268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59268

There's no reason we can't give `convert` this signature: `Tensor::unsafeGetTensorImpl() const` returns a non-const TensorImpl pointer. (See https://github.com/zdevito/ATen/issues/27#issuecomment-330717839)
ghstack-source-id: 130548716

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D28811477

fbshipit-source-id: 269f58980c1f68b29d4be3cba4cd340299ce39af
2021-06-04 13:16:14 -07:00
c99d6254fb remove THCReduce.cuh (#59431)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59431

Reviewed By: malfet

Differential Revision: D28904504

Pulled By: ngimel

fbshipit-source-id: 25d98b736d74d64fd20a40e0d9c773332f56cc30
2021-06-04 12:57:07 -07:00
780faf52ca [profile] Clarify record_shapes=True docstring (#59469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59469

Clarify that using record_shapes=True may cause extra tensor copies.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28905089

Pulled By: ilia-cher

fbshipit-source-id: 7642cb16f6697b6d255a2b82348d4c17486680d0
2021-06-04 12:01:35 -07:00
b3ee645cbf Migrate _th_std_var to ATen (#59258)
Summary:
Ref https://github.com/pytorch/pytorch/issues/49421

This migrates `std`/`var`'s special case all-reduction from TH to ATen. Using the benchmark from gh-43858 that was used to justify keeping the TH version, I find this PR has similar (slightly better) performance when single-threaded. And unlike the TH version, this is multi-threaded and so much faster for large tensors.

TH Results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       3.6   |       3.8    |     8.2  |      1.2
      80       |       3.7   |       3.8    |     8.4  |      1.2
      800      |       4.2   |       4.3    |     8.7  |      1.2
      8000     |       9.0   |       9.1    |    11.2  |      1.5
      80000    |      58.3   |      59.0    |    30.6  |      4.2
      800000   |     546.9   |     546.9    |   183.4  |     31.3
      8000000  |    5729.7   |    5701.0    |  6165.4  |    484.1
```

ATen results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       4.0   |       4.0    |     8.7  |      1.2
      80       |       3.6   |       3.8    |     9.0  |      1.2
      800      |       4.1   |       4.3    |     8.9  |      1.2
      8000     |       8.9   |       9.2    |    10.6  |      1.5
      80000    |      57.0   |      57.4    |    28.8  |      4.3
      800000   |     526.9   |     526.9    |   178.3  |     30.2
      8000000  |    5568.1   |    5560.6    |  6042.1  |    453.2

[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
8 threads: ---------------------------------------------------------
      8        |      3.9    |      3.8     |     9.1  |      1.2
      80       |      3.8    |      3.9     |     8.8  |      1.2
      800      |      4.2    |      4.3     |     8.9  |      1.3
      8000     |      9.0    |      9.2     |    10.4  |      1.5
      80000    |     26.0    |     26.8     |    26.4  |      4.4
      800000   |     92.9    |     87.3     |    72.1  |     22.4
      8000000  |    793.5    |    791.8     |  5334.8  |    115.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59258

Reviewed By: jbschlosser

Differential Revision: D28860353

Pulled By: ngimel

fbshipit-source-id: 80c9fe1db84dbc864eeb1a319076c7aaff0a04e5
2021-06-04 11:58:12 -07:00
689a5edd0a Revert D28326365: [pytorch][PR] Add torch.cuda.streams.ExternalStream
Test Plan: revert-hammer

Differential Revision:
D28326365 (d7ef9b73fb)

Original commit changeset: b67858c80339

fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478
2021-06-04 11:19:36 -07:00
3472f0c94d Enable torch::deploy GPU tests in sandcastle (#59460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59460

Original commit changeset: 6e01a96d3746

Test Plan: Verify new tests run in sandcastle and existing CI is OK

Reviewed By: H-Huang

Differential Revision: D28900869

fbshipit-source-id: a8962ec48c66bba3b4b8f001ece7231953b29e82
2021-06-04 11:13:43 -07:00
ed993f3243 [CODEOWNERS] spandantiwari -> shubhambhokare1 (#59427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59427

Reviewed By: agolynski

Differential Revision: D28902131

Pulled By: SplitInfinity

fbshipit-source-id: 6a583c5087caf147f9033b73765b1dd3f59a405c
2021-06-04 11:06:55 -07:00
e90caac676 Port gelu_backward to structured (#58665)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58665

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28572527

Pulled By: ezyang

fbshipit-source-id: 0cb286f20c5f91453594a7dfe39ae4e4d24a13a1
2021-06-04 11:06:54 -07:00
153a96054b Port gelu to structured (#58664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58664

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572533

Pulled By: ezyang

fbshipit-source-id: 8be00ecdcc224b516de28bf5f43ec308174053db
2021-06-04 11:06:52 -07:00
5f824ef437 Port hardshrink to structured (#58663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58663

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572531

Pulled By: ezyang

fbshipit-source-id: 3fc8c33445adeae1789774fb6d8099278b93f8f8
2021-06-04 11:06:50 -07:00
b4fa4c86f7 Port hardshrink_backward and softshrink_backward to structured (#58662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58662

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572532

Pulled By: ezyang

fbshipit-source-id: 8ecebc1090d884ee579f5d04a46f1e60a2dd978e
2021-06-04 11:05:44 -07:00
2119efd234 reflection_pad1d_backward: Port to structured (#59103)
Summary:
Tracking Issue: https://github.com/pytorch/pytorch/issues/55070
Port `reflection_pad1d_backward` to structured kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59103

Test Plan: Pre-existing tests

Reviewed By: jbschlosser

Differential Revision: D28836043

Pulled By: ezyang

fbshipit-source-id: 4c3b0880edf305896f540113dcab70c8af24253b
2021-06-04 10:23:53 -07:00
a6bd6b9ca5 [NNC] Fix the uninitialized pointer in loopnest.fuse_loops (#59411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59411

Bug: the uninitialized For* caused a casting error in pybind11.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28882635

Pulled By: huiguoo

fbshipit-source-id: e3f2b659bae94e9617936b1b2368157bed73c2fe
2021-06-04 10:04:34 -07:00
aa06bc0731 OpInfo: minor fix in sample_inputs_diff (#59181)
Summary:
sample_inputs_diff constructs all five positional arguments for [diff](https://pytorch.org/docs/stable/generated/torch.diff.html) but uses only the first three. This doesn't seem to be intentional.
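
For reference, all five arguments of `torch.diff`:

```python
import torch

x = torch.tensor([1, 3, 6])
out = torch.diff(x, n=1, dim=-1,
                 prepend=torch.tensor([0]), append=torch.tensor([10]))
print(out)  # tensor([1, 2, 3, 4])
```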

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59181

Test Plan: This change expands coverage of diff's OpInfo sample inputs. Related tests still pass.

Reviewed By: mruberry

Differential Revision: D28878359

Pulled By: saketh-are

fbshipit-source-id: 1466f6c6c341490885c85bc6271ad8b3bcdf3a3e
2021-06-04 09:53:31 -07:00
b99523832b Remove use_env from torch.distributed.run, clarify bc around that parameter in comment. (#59409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59409

Remove use_env from torch.distributed.run, and clarify bc around that parameter in comment.

Test Plan: n/a

Reviewed By: cbalioglu

Differential Revision: D28876485

fbshipit-source-id: 5f10365968d204985ce517b83c392c688995d76e
2021-06-04 09:02:47 -07:00
4ae5764d47 Add is_inference to native functions (#58729)
Summary:
Adds `is_inference` as a native function w/ manual cpp bindings.
Also changes instances of `is_inference_tensor` to `is_inference` to be consistent with other properties such as `is_complex`.
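
A quick sketch of the property:

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)

print(t.is_inference())              # True
print(torch.ones(3).is_inference())  # False
```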

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58729

Reviewed By: mruberry

Differential Revision: D28874507

Pulled By: soulitzer

fbshipit-source-id: 0fa6bcdc72a4ae444705e2e0f3c416c1b28dadc7
2021-06-04 08:59:11 -07:00
fa597ee17f Fix torch.randperm for CUDA (#59352)
Summary:
Context https://github.com/pytorch/pytorch/issues/58545

The logic keeps the behavior consistent between torch.randperm and torch.randint:

1. Generators can have either a fully specified or a non-fully specified device
2. As long as the generator's device type matches the result's, we don't error out (see the sketch below)
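
A sketch of the resulting rules:

```python
import torch

g = torch.Generator(device="cuda")               # index not fully specified
torch.randperm(4, generator=g, device="cuda:0")  # OK: device types match

cpu_g = torch.Generator(device="cpu")
# torch.randperm(4, generator=cpu_g, device="cuda")  # raises: type mismatch
```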

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59352

Test Plan:
```
python test/test_tensor_creation_ops.py -k TestRandomTensorCreation
```

Reviewed By: ngimel

Differential Revision: D28855920

Pulled By: zhouzhuojie

fbshipit-source-id: f8141a2c4b2f177e1aa7baec6999b65916cba02c
2021-06-04 08:56:18 -07:00
202b2c9fc2 Remove many unnecessary constructor calls of Vectorized<T> (#58875)
Summary:
Refresh https://github.com/pytorch/pytorch/issues/56241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58875

Reviewed By: mruberry

Differential Revision: D28892034

Pulled By: ezyang

fbshipit-source-id: 21074e45f29a780168852be5305420a3cc1148fc
2021-06-04 08:50:53 -07:00
d7ef9b73fb Add torch.cuda.streams.ExternalStream (#57781)
Summary:
This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947

We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api.
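
A hypothetical usage sketch, assuming CuPy as the external stream producer:

```python
import cupy
import torch

s = cupy.cuda.Stream()                      # stream allocated outside PyTorch
ext = torch.cuda.ExternalStream(s.ptr)      # wrap its raw cudaStream_t pointer
with torch.cuda.stream(ext):
    out = torch.ones(4, device="cuda") * 2  # queued on the external stream
```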

cc mruberry rgommers leofang asi1024 kmaehashi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781

Reviewed By: mrshenli

Differential Revision: D28326365

Pulled By: ezyang

fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
2021-06-04 08:47:09 -07:00
c769300301 Fix MaxPool default pad documentation (#59404)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59404

Reviewed By: albanD

Differential Revision: D28879049

Pulled By: Varal7

fbshipit-source-id: 03a86cd347d53ac2d06028b3f213c5b5d5ab7e91
2021-06-04 08:32:03 -07:00
6d51a89778 Fix broken hyperlinks (#59425)
Summary:
**Overview:**
A number of the hyperlinks in the [`CONTRIBUTING.md` file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken since they include an extraneous `/torch/`. This PR fixes those links.

The files whose links are broken are
- `ProcessGroupNCCL.hpp`
- `Store.hpp`
- `FileStore.hpp`
- `TCPStore.hpp`
- `PrefixStore.hpp`
- `rref_impl.h`
- `rref_context.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59425

Test Plan:
The `CONTRIBUTING.md` file is at https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md.

`ProcessGroupNCCL.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupGloo.hpp, which is equivalent to `../lib/c10d/ProcessGroupGloo.hpp`.

`Store.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Store.hpp, which is equivalent to `../lib/c10d/Store.hpp`.

`FileStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/FileStore.hpp, which is equivalent to `../lib/c10d/FileStore.hpp`.

`PrefixStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/PrefixStore.hpp, which is equivalent to `../lib/c10d/PrefixStore.hpp`.

`rref_interface.h` should have link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/rref_interface.h, which is equivalent to `../../aten/src/ATen/core/rref_interface.h`.

`rref_context.h` should have link https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/rpc/rref_context.h, which is equivalent to `../csrc/distributed/rpc/rref_context.h`.

Reviewed By: mruberry

Differential Revision: D28888188

Pulled By: andwgu

fbshipit-source-id: 023219184d42284ea1cbfcf519c1b4277dd5a02b
2021-06-04 08:27:26 -07:00
63956610a7 Search for static OpenBLAS compiled with OpenMP (#59428)
Summary:
Previously, only a dynamically linked OpenBLAS compiled with OpenMP could
be found.

Also get rid of the hardcoded path to libgfortran.a in FindLAPACK.cmake.

Only affects aarch64 Linux builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59428

Reviewed By: agolynski

Differential Revision: D28891314

Pulled By: malfet

fbshipit-source-id: 5af55a14c85ac66551ad2805c5716bbefe8d55b2
2021-06-04 08:09:21 -07:00
c7a3a13bab .circleci: Disable USE_GOLD_LINKER for CUDA 10.2 (#59413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59413

For CUDA 10.2 builds linked with the gold linker we were observing
crashes when exceptions were being raised

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28888054

Pulled By: seemethere

fbshipit-source-id: f9b38147591721803ed3cac607510fe5bbc49d6d
2021-06-04 07:02:54 -07:00
06ed658358 Merge TensorPipe's CPU and CUDA channel registry (#59375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59375

The CPU and CUDA channels used to be separate classes in TensorPipe, but they recently got merged in order to support cross-device-type channels. We used to need two separate registries in PyTorch, but now we can merge them. This simplifies some registration logic, and will help in future PRs.
ghstack-source-id: 130583770

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28796427

fbshipit-source-id: b7db983293cbbddd1aedec6428de08d8944b0344
2021-06-04 06:53:49 -07:00
c09beaaf4a Remove LazyStreamContext (2 out of 2) (#59299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59299

After recent changes, LazyStreamContext had in fact always become eager, making it equivalent to a vector of streams. So it makes more sense now to remove this abstraction and use a more self-descriptive type.

This PR migrates the TensorPipe agent. The previous PR migrated the RequestCallback internals.
ghstack-source-id: 130583773

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28789174

fbshipit-source-id: a27d2b1f40ab3cf2ac0dd946232fd0eecda6d450
2021-06-04 06:53:47 -07:00
03a5c6ea99 Remove LazyStreamContext (1 out of 2) (#59298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59298

After recent changes, LazyStreamContext had in fact always become eager, making it equivalent to a vector of streams. So it makes more sense now to remove this abstraction and use a more self-descriptive type.

This PR migrates the RequestCallback internals. The next PR migrates the TensorPipe agent.
ghstack-source-id: 130583774

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28789175

fbshipit-source-id: fa581a50f9a6a1e42c2ad8c808a9b099bea7433e
2021-06-04 06:53:46 -07:00
3e7396f99d Fix CUDA sync when switching streams in RPC tests (#59297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59297

PyTorch requires users to manually record tensors with the CUDA caching allocator when switching streams. We weren't doing it.

Also, the usage of an Event can be simplified by using `s1.wait(s2)`.
ghstack-source-id: 130583777

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28832902

fbshipit-source-id: cd4f40ff811fa1b0042deedda2456e22f33b92bd
2021-06-04 06:53:44 -07:00
8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.

Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
c0acffa6ef Ensure async_execution works with CUDAFuture (#56863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56863

ghstack-source-id: 130583772

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27985908

fbshipit-source-id: 09469ee0eb70b8e3b61f6278f2c881ce7f5244d6
2021-06-04 06:53:40 -07:00
7bcd8f94a5 Avoid re-doing CUDA stream sync in OwnerRRef (#57355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355

We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or, well, ivalue::Future nowadays, after they got merged) this is all done automatically for us, hence we can undo these "fixes" as they're now duplicated.
ghstack-source-id: 130583771

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118182

fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b
2021-06-04 06:52:33 -07:00
d009c9c129 [RPC Framework] Separate initialize_from_module_rref method out of RemoteModule constructor (#59292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59292

#Closes: https://github.com/pytorch/pytorch/issues/58274

Create an alternate initialization method, and also create a few util functions to avoid duplicate code.
ghstack-source-id: 130575373

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_from_module_rref

Reviewed By: vipannalla

Differential Revision: D28825895

fbshipit-source-id: 87803e94d9b50f94e1b7b2c99b9bf1634e20d065
2021-06-04 03:43:36 -07:00
c3bf42e0d8 Fix symbolic derivative of hardswish (#59405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59405

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28879698

Pulled By: bertmaher

fbshipit-source-id: 2f2d9836bf592b18ed9a19aab4f5967e653b5898
2021-06-03 23:12:18 -07:00
9ac954789d [nnc] Add hardsigmoid (#59069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59069

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28738166

Pulled By: bertmaher

fbshipit-source-id: d9f5b87ef1f2323a3631add79c2670ce794f911e
2021-06-03 23:10:36 -07:00
c717ce6771 [NNC] Add python bindings for Compute2 (#59350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59350

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28854806

Pulled By: huiguoo

fbshipit-source-id: b9091f9183249257aedc1eafb1992e0faf5dea82
2021-06-03 22:37:08 -07:00
db90533b9e Make JIT not assume that the device is CUDA. (#54238)
Summary:
Decouple the JIT argument spec and shape analysis from CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54238

Reviewed By: ngimel

Differential Revision: D28802085

Pulled By: Krovatkin

fbshipit-source-id: 4068c9460cdec2d80733f001ca90ea3f5e6d3a7e
2021-06-03 22:21:27 -07:00
7c4ac9e3ee [NNC] Fix loopnest.cache_accesses for reduce ops (fixed #59002) (#59136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59136

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28768598

Pulled By: huiguoo

fbshipit-source-id: 99ab8430bc0ba395e2a041b03a7761de335ddda5
2021-06-03 21:04:14 -07:00
d9d7d5e24a [torch] Remove migration warning for ScriptDict
Summary:
This commit removes the warning that suggests that users script their
dictionaries before passing them into TorchScript code. The ScriptDict feature
is not fully ready, so it does not make sense to recommend this yet.

Test Plan:
Sandcastle.

In addition, the PyPER test broken by the original diff passes:

```
buck test mode/opt //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_lwt -- --exact 'caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_lwt - caffe2.torch.fb.training_toolkit.backend.tests.test_model_materializer_full_sync_lwt.ModelMaterializerFullSyncLwtTest: test_materialization_determinism_cpu' --run-disabled
```

Differential Revision: D28891351

fbshipit-source-id: 2a3a00cde935d670fb1dc7fd8c709ae9c2ad8cdc
2021-06-03 20:55:40 -07:00
6627c00e63 [Static Runtime] Fix bug in quantized::linear wrapper (#59407)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59407

Reviewed By: ajyu

Differential Revision: D28881307

fbshipit-source-id: 46c169f783cf05c585871c2e074d52255116b9c3
2021-06-03 19:18:04 -07:00
7d38901e7c [NNC] Fix BufHandle arguments in loopnest python API (#59348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59348

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28854233

Pulled By: huiguoo

fbshipit-source-id: 2484249992903ed7af0de504ac27f96f30e993d1
2021-06-03 17:34:17 -07:00
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.
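
A minimal per-rank sketch of the flattened call (a sketch, assuming an initialized NCCL process group; `_reduce_scatter_base` is the private name from this era and may differ in later releases):

```python
import torch
import torch.distributed as dist

# Flattened variant: one contiguous input of numel world_size * out.numel()
# instead of a list of per-rank tensors.
world_size = dist.get_world_size()
inp = torch.ones(world_size * 4, device="cuda") * (dist.get_rank() + 1)
out = torch.empty(4, device="cuda")
dist._reduce_scatter_base(out, inp)  # out = this rank's chunk, summed over ranks
```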

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
46d724c919 Revert D28859795: [nnc] Enable CPU fusion inside Facebook, take 4
Test Plan: revert-hammer

Differential Revision:
D28859795 (6baa66ece9)

Original commit changeset: 826801db24e8

fbshipit-source-id: c85a0fc7e88c95af939d5c0f50c0c8878e1174d3
2021-06-03 16:29:51 -07:00
526445dfa8 Update reviewer list for the distributed package (#59417)
Summary:
Added cbalioglu to the default reviewer list of the distributed package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59417

Reviewed By: mruberry

Differential Revision: D28883997

Pulled By: cbalioglu

fbshipit-source-id: 0ed9638f25bd914b71d96203579507af3b830df4
2021-06-03 15:38:07 -07:00
aa4f27c12a Prefer accurate reciprocal on ARMv8 (#59361)
Summary:
The default NEON-accelerated implementation of reciprocal uses vrecpeq_f32, which yields a Newton-Raphson approximation rather than the actual value.
Use regular NEON-accelerated division for the reciprocal and reciprocal square root operations instead.

This fixes `test_reference_numerics_hard_frac_cpu_float32`, `test_reference_numerics_normal_rsqrt_cpu_float32`, etc.
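
A minimal check of the intended behavior (a sketch assuming a build with this fix; exact bitwise equality for `reciprocal` is an assumption, based on both sides lowering to the same division):

```python
import torch

x = torch.rand(4096, dtype=torch.float32) + 0.5
# With true division, reciprocal matches 1/x instead of the low-precision
# vrecpeq_f32 estimate; rsqrt is checked loosely to stay conservative.
assert torch.equal(torch.reciprocal(x), 1.0 / x)
assert torch.allclose(torch.rsqrt(x), 1.0 / torch.sqrt(x))
```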

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59361

Reviewed By: mruberry

Differential Revision: D28870456

Pulled By: malfet

fbshipit-source-id: e634b0887cce7efb046ea1fd9b74424e0eceb164
2021-06-03 15:28:36 -07:00
3416b8dd70 Automated submodule update: FBGEMM (#59337)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9cb33bcfe5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59337

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: caogao

Differential Revision: D28846199

fbshipit-source-id: b78f087129edef97247d4ceea77cfede0c6800fe
2021-06-03 14:45:32 -07:00
1aa14fcb14 Fix the "tensors to be on the same device" error in HistogramObserver (#59234)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59075

This PR fixes the "tensors to be on the same device" error in `HistogramObserver`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59234

Reviewed By: jbschlosser

Differential Revision: D28837572

Pulled By: vkuzo

fbshipit-source-id: ff7c3229ced7de2cdd8f76d526f0fd33ac643216
2021-06-03 13:30:56 -07:00
2aa463d931 Support switching RemoteModule between train/eval (#59026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59026

#Closes: https://github.com/pytorch/pytorch/issues/51480

Enabled the train and eval methods in RemoteModule to call the underlying train/eval methods on the actual nn.Module.
ghstack-source-id: 130421137
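
A hedged usage sketch (assumes `rpc.init_rpc` has already been called on this worker and that a worker named "worker1" exists):

```python
import torch
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(4, 4))
remote_linear.train()  # now forwards to the remote nn.Module: training -> True
remote_linear.eval()   # now forwards to the remote nn.Module: training -> False
```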

Test Plan:
Call these two updated methods in test_send_remote_module_over_the_wire in remote_module_test.py. To verify correctness: after calling train, the training mode should be set to True; after calling eval, the training mode of the remote module should be set to False.

Related test output:

    ✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (fb.test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithFork) (23.059)
    ✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithFork) (27.965)
    ✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithSpawn) (74.481)
    ✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithSpawn) (77.243)
    ✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (fb.test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithFork) (58.644)
    ✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithSpawn) (90.229)

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D28721078

fbshipit-source-id: aa45c1e5755f583200144ecfec3704f28221972c
2021-06-03 13:13:58 -07:00
c1c9774acb Revert D28538996: Enable torch::deploy GPU tests in sandcastle
Test Plan: revert-hammer

Differential Revision:
D28538996 (4b74c848aa)

Original commit changeset: 1a6ccea07cfe

fbshipit-source-id: 6e01a96d3746d3ca3e4e792a7b623ef960c9d2d6
2021-06-03 13:00:25 -07:00
e66015dadf Add build support for kineto + rocm (#58401)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58399

CMake changes to allow kineto to build with rocm support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58401

Reviewed By: mruberry

Differential Revision: D28479807

Pulled By: walterddr

fbshipit-source-id: fc01f05b2a5592ee1d1dbd71d2d4f7aec1bd74f7
2021-06-03 12:15:20 -07:00
332b01e93f [DDP] log usage of torch_distributed_debug (#59351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59351

Logging PT distributed debug level to track usage internally.
ghstack-source-id: 130443122

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28854914

fbshipit-source-id: a8e85ca4a3c9ac2f18d13190e87c0ebc4a8e7ea2
2021-06-03 11:49:23 -07:00
6408cbd918 Migrate renorm to ATen (CPU and CUDA) (#59250)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/59108, closes https://github.com/pytorch/pytorch/issues/24754, closes https://github.com/pytorch/pytorch/issues/24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor by it with a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
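
A minimal Python sketch of that decomposition (the `eps` mirrors the small constant `renorm` has historically added to the norm; treat the exact value as an assumption):

```python
import torch

def renorm_sketch(t, p, dim, maxnorm, eps=1e-7):
    # p-norm of each sub-tensor along `dim` (reduce over every other dim).
    reduce_dims = [d for d in range(t.dim()) if d != dim]
    norms = torch.linalg.vector_norm(t, ord=p, dim=reduce_dims, keepdim=True)
    # Normalization factor: shrink only sub-tensors whose norm exceeds maxnorm.
    factor = torch.where(norms > maxnorm, maxnorm / (norms + eps),
                         torch.ones_like(norms))
    return t * factor  # the broadcasted mul described above

t = torch.randn(50, 50, 50)
assert torch.allclose(renorm_sketch(t, 2, 0, 1.0), torch.renorm(t, 2, 0, 1.0))
```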

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59250

Reviewed By: mruberry

Differential Revision: D28820359

Pulled By: ngimel

fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
2021-06-03 11:43:27 -07:00
2ad4b8e58c Extract c10d Store tests to dedicated test file (#59271)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/55340

**Overview**
This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`.

Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements are used for `test_store.py`.

Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`.

**Testing**
All commands shown are run on an AI AWS cluster.

I check the Store tests:
```
python test/distributed/test_store.py
```

I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`.
```
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_nccl.py
python test/distributed/test_c10d_gloo.py
```
`test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes.

**Testing Issues** (To Be Revisited)
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py
```
Running the above command fails three tests (written as `[Test]`: `[Error]`):
- `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580`
- `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
- `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
However, running each of the following yields no errors:
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default
```
This suggests the existence of some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet. In particular, I do not have a solid understanding of the tests to be able to explain why using `pytest` and `gpurun` induces the failure (since notably, running the `.py` directly shows no issue).

Similarly, running the following yields 47 errors:
```
WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py
```
The errors seem to all be simply complaining about the usage of `fork()` instead of `spawn()` for CUDA multiprocessing. Though, most of the tests in `test_c10d_nccl.py` ask for at least 2 CUDA devices, so I think that the `gpurun` is warranted (assuming that the test file does not need to be run partially on different machines).

Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`.

**Other Notes**
I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious if this is intentional (as opposed to using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`.

The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can be fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271

Reviewed By: jbschlosser, mrshenli

Differential Revision: D28856920

Pulled By: andwgu

fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446
2021-06-03 10:53:33 -07:00
f05d5bec48 Preserve PyObject even when it goes dead (#56017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56017

Fixes #55686

This patch is seemingly straightforward but some of the changes are very
subtle.  For the general algorithmic approach, please first read the
quoted issue.  Based on the algorithm, there are some fairly
straightforward changes:

- New boolean on TensorImpl tracking if we own the pyobj or not
- PythonHooks virtual interface for requesting deallocation of pyobj
  when TensorImpl is being released and we own its pyobj, and
  implementation of the hooks in python_tensor.cpp
- Modification of THPVariable to MaybeOwned its C++ tensor, directly
  using swolchok's nice new class

And then, there is python_variable.cpp.  Some of the changes follow the
general algorithmic approach:

- THPVariable_NewWithVar is simply adjusted to handle MaybeOwned and
  initializes as owned (like before)
- THPVariable_Wrap adds the logic for reverting ownership back to
  PyObject when we take out an owning reference to the Python object
- THPVariable_dealloc attempts to resurrect the Python object if
  the C++ tensor is live, and otherwise does the same old implementation
  as before
- THPVariable_tryResurrect implements the resurrection logic.  It is
  modeled after CPython code so read the cited logic and see if
  it is faithfully replicated
- THPVariable_clear is slightly updated for MaybeOwned and also to
  preserve the invariant that if owns_pyobj, then pyobj_ is not null.
  This change is slightly dodgy: the previous implementation has a
  comment mentioning that the pyobj nulling is required to ensure we
  don't try to reuse the dead pyobj.  I don't think, in this new world,
  this is possible, because the invariant says that the pyobj only
  dies if the C++ object is dead too.  But I still unset the field
  for safety.

And then... there is THPVariableMetaType.  colesbury explained in the
issue why this is necessary: when destructing an object in Python, you
start off by running the tp_dealloc of the subclass before moving up
to the parent class (much in the same way C++ destructors work).  The
deallocation process for a vanilla Python-defined class does irreparable
harm to the PyObject instance (e.g., the finalizers get run), making it
no longer valid to attempt resurrection later in the tp_dealloc chain.
(BTW, the fact that objects can resurrect but in an invalid state is
one of the reasons why it's so frickin' hard to write correct __del__
implementations).  So we need to make sure that we actually override
the tp_dealloc of the bottom most *subclass* of Tensor to make sure
we attempt a resurrection before we start finalizing.  To do this,
we need to define a metaclass for Tensor that can override tp_dealloc
whenever we create a new subclass of Tensor.  By the way, it was totally
not documented how to create metaclasses in the C++ API, and it took
a good bit of trial error to figure it out (and the answer is now
immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting
tp_basicsize incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on
the metaclass--you want to leave it unset so that it inherits, and
determining that tp_init is what actually gets called when you construct
a class, not tp_call as another not-to-be-named StackOverflow question
suggests).
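
The user-visible footprint of the metaclass can be sketched like this (the exact metaclass name is version-dependent and assumed here; the point is that `type(torch.Tensor)` is no longer plain `type`):

```python
import torch

print(type(torch.Tensor))  # a Tensor metaclass, e.g. torch._C._TensorMeta

class MyTensor(torch.Tensor):
    pass

# Subclasses inherit the metaclass, which is what lets their tp_dealloc be
# overridden to attempt resurrection before Python finalization runs.
assert type(MyTensor) is type(torch.Tensor)
```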

Aside: Ordinarily, adding a metaclass to a class is a user visible
change, as it means that it is no longer valid to mixin another class
with a different metaclass.  However, because _C._TensorBase is a C
extension object, it will typically conflict with most other
metaclasses, so this is not BC breaking.

The desired new behavior of a subclass tp_dealloc is to first test if
we should resurrect, and otherwise do the same old behavior.  In an
initial implementation of this patch, I implemented this by saving the
original tp_dealloc (which references subtype_dealloc, the "standard"
dealloc for all Python defined classes) and invoking it.  However, this
results in an infinite loop, as it attempts to call the dealloc function
of the base type, but incorrectly chooses subclass type (because it is
not a subtype_dealloc, as we have overridden it; see
b38601d496/Objects/typeobject.c (L1261) )
So, with great reluctance, I must duplicate the behavior of
subtype_dealloc in our implementation.  Note that this is not entirely
unheard of in Python binding code; for example, Cython
c25c3ccc4b/Cython/Compiler/ModuleNode.py (L1560)
also does similar things.  This logic makes up the bulk of
THPVariable_subclass_dealloc

To review this, you should pull up the CPython copy of subtype_dealloc
b38601d496/Objects/typeobject.c (L1230)
and verify that I have specialized the implementation for our case
appropriately.  Among the simplifications I made:

- I assume PyType_IS_GC, because I assume that Tensor subclasses are
  only ever done in Python and those classes are always subject to GC.
  (BTW, yes!  This means I have broken anyone who has extended PyTorch
  tensor from C API directly.  I'm going to guess no one has actually
  done this.)

- I don't bother walking up the type bases to find the parent dealloc;
  I know it is always THPVariable_dealloc.  Similarly, I can get rid
  of some parent type tests based on knowledge of how
  THPVariable_dealloc is defined

- The CPython version calls some private APIs which I can't call, so
  I use the public PyObject_GC_UnTrack APIs.

- I don't allow the finalizer of a Tensor to change its type (but
  more on this shortly)

One alternative I discussed with colesbury was instead of copy pasting
the subtype_dealloc, we could transmute the type of the object that was
dying to turn it into a different object whose tp_dealloc is
subtype_dealloc, so the stock subtype_dealloc would then be applicable.
We decided this would be kind of weird and didn't do it that way.

TODO:

- More code comments

- Figure out how not to increase the size of TensorImpl with the new
  bool field

- Add some torture tests for the THPVariable_subclass_dealloc, e.g.,
  involving subclasses of Tensors that do strange things with finalizers

- Benchmark the impact of taking the GIL to release C++ side tensors
  (e.g., from autograd)

- Benchmark the impact of adding a new metaclass to Tensor (probably
  will be done by separating out the metaclass change into its own
  change)

- Benchmark the impact of changing THPVariable to conditionally own
  Tensor (as opposed to unconditionally owning it, as before)

- Add tests that this indeed preserves the Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27765125

Pulled By: ezyang

fbshipit-source-id: 857f14bdcca2900727412aff4c2e2d7f0af1415a
2021-06-03 10:50:36 -07:00
fa72d9a379 [quant] Fix use after free (#59267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59267

fixes: https://github.com/pytorch/pytorch/issues/58868

Test Plan: Imported from OSS

Reviewed By: jbschlosser, supriyar

Differential Revision: D28811529

fbshipit-source-id: f27018ae0a02d1dd229d1ff7638f130c38a00986
2021-06-03 10:35:48 -07:00
6baa66ece9 [nnc] Enable CPU fusion inside Facebook, take 4
Summary:
fixed the awkward configerator initialization issue that broke some
tests.  Trying again

Test Plan: predictor comparisons

Reviewed By: ZolotukhinM

Differential Revision: D28859795

fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
2021-06-03 09:33:13 -07:00
57e452ff5d Revert D28856713: [PyTorch Edge] Add proper error message when loading incompatible model with lite interpreter
Test Plan: revert-hammer

Differential Revision:
D28856713

Original commit changeset: c3f9a3b64459

fbshipit-source-id: cc6ba8ec1047f29e62061107a2e5f245981b8039
2021-06-03 08:40:28 -07:00
6620d7d688 OpInfo: norm (#59259)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

EDIT:
~~Test takes a whopping 4 mins to run 😓~~ (Filtered tests also included linalg norm)

Newly added tests take around 2 mins.
```
==================================================== 193 passed, 224 skipped, 27224 deselected, 5 warnings in 138.87s (0:02:18) ====================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59259

Reviewed By: jbschlosser

Differential Revision: D28833962

Pulled By: mruberry

fbshipit-source-id: 40b24d6a8cb8b7d231b2f6b34b87cee4f136c5f9
2021-06-03 08:25:58 -07:00
4b74c848aa Enable torch::deploy GPU tests in sandcastle
Summary:
Added GPU tests in previous diffs but had to disable them, as they only
pass locally on devgpu and not in sandcastle.

note: local testing requires mode/dev-nosan or else ASAN interferes with CUDA.

Test Plan: Verify tests passing in sandcastle.

Reviewed By: malfet

Differential Revision: D28538996

fbshipit-source-id: 1a6ccea07cfe2f150eee068594e636add620cd91
2021-06-03 08:10:19 -07:00
f1ce7f4b7f Update PyTorch version to 0.10.0a (#59345)
Summary:
Also fix `TestProducerVersion` by removing the assumption that major and minor versions are single digits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59345

Reviewed By: robieta

Differential Revision: D28853720

Pulled By: malfet

fbshipit-source-id: 4b6d03c6b0c9d652a5aef792aaa84eaa522d10e8
2021-06-03 07:55:44 -07:00
c829095590 Revert D28802058: [pytorch][PR] add dispatch for bitwise_and
Test Plan: revert-hammer

Differential Revision:
D28802058 (874f287c52)

Original commit changeset: cccbbff46df5

fbshipit-source-id: 1675fe42966278aa446496445342d6d8a92ecea0
2021-06-03 07:38:13 -07:00
d095ec75a1 Forward AD formulas batch 2 (#57863)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57863

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28387763

Pulled By: albanD

fbshipit-source-id: e1b60ab728bb05b9e3323ee0dc7e401aaf5b8817
2021-06-03 07:33:04 -07:00
add291cf66 [JIT] Add a phase to perform inplace<->functional conversion for activation operators (#57477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57477

Currently the conversion only deals with activation operators. The legality check is somewhat strict for now.
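
Conceptually, the pass rewrites between the following two forms when the legality check allows it (a plain illustration of the source-level effect, not the pass's internal API):

```python
import torch

def functional_form(x):
    return torch.relu(x)   # out-of-place activation

def inplace_form(x):
    return torch.relu_(x)  # in-place; only legal if x has no later uses
```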

Test Plan:
```
python test/test_jit.py -k test_functional_to_inplace_activation
python test/test_jit.py -k test_inplace_to_functional_activation
```

Reviewed By: mrshenli

Differential Revision: D28155153

Pulled By: desertfire

fbshipit-source-id: df092830c4dff3ce9578ff76285eb7a566b7d81b
2021-06-03 06:43:23 -07:00
91b7bcf4c0 [PyTorch Edge] Add proper error message when loading incompatible model with lite interpreter (#59354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59354

Check if the model has `bytecode.pkl` and provide a proper error message before loading the model. Tested by loading a model.pt and a model.ptl.
```
>>> from torch.jit.mobile import _load_for_lite_interpreter
>>> _load_for_lite_interpreter("/Users/chenlai/Documents/pytorch/data/mobilenet_v2.pt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chenlai/pytorch/torch/jit/mobile/__init__.py", line 48, in _load_for_lite_interpreter
    cpp_module = torch._C._load_for_lite_interpreter(f, map_location)  # type: ignore[attr-defined]
RuntimeError: The model is not generated from the api _save_for_lite_interpreter. Please regenerate the module by scripted_module._save_for_lite_interpreter('model.ptl'). Refer to https://pytorch.org/tutorials/prototype/lite_interpreter.html for more details.
```

iOS:
![image](https://user-images.githubusercontent.com/16430979/120593077-cbe23180-c3f3-11eb-9745-ee2b04b78c6c.png)

Android:
![image](https://user-images.githubusercontent.com/16430979/120594357-af46f900-c3f5-11eb-9fb0-500a038148e3.png)

Differential Revision:
D28856713

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: cccclai

fbshipit-source-id: c3f9a3b64459dda6811d296371c8a2eaf22f8b20
2021-06-03 03:18:14 -07:00
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
f914ab193e Use irange in a few places in torch/csrc (#55100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55100

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447708

fbshipit-source-id: 4f21133bd76f29d73a51befcae649ab55637b36e
2021-06-03 00:58:51 -07:00
18642e664a [quant][graphmode][fx][refactor] Split quantize.py to prepare.py and convert.py (#59353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59353

Next: remove Quantizer class

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28856277

fbshipit-source-id: 25f5502be387dbe9706780f667501b46b82789a5
2021-06-02 23:52:39 -07:00
8b4784a9c6 Revert D28821216: [pytorch][PR] Migrate _th_std_var to ATen
Test Plan: revert-hammer

Differential Revision:
D28821216 (1fb5cf5a71)

Original commit changeset: f35992c21f08

fbshipit-source-id: d068a62b7fa941188591a74dcb5d1a24719af7b3
2021-06-02 21:18:26 -07:00
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds the ability to log when the reducer/DDP encounters an error. We add fields "has_error" and "error" to indicate that an error has
occurred in this iteration; the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
d2e03051e0 Fix fetcher continuing to call next after StopIteration (#59313)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59312

cc VitalyFedyunin dzhulgakov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59313

Reviewed By: jbschlosser

Differential Revision: D28837762

Pulled By: dzhulgakov

fbshipit-source-id: 95cc29359aaba0f24ca169c5495ab5c6c53a0dce
2021-06-02 19:14:25 -07:00
1fb5cf5a71 Migrate _th_std_var to ATen (#59258)
Summary:
Ref https://github.com/pytorch/pytorch/issues/49421

This migrates `std`/`var`'s special-case all-reduction from TH to ATen. Using the benchmark from gh-43858 that was used to justify keeping the TH version, I find this PR has similar (slightly better) performance single-threaded. And unlike the TH version, this one is multi-threaded and so much faster for large tensors.

TH Results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       3.6   |       3.8    |     8.2  |      1.2
      80       |       3.7   |       3.8    |     8.4  |      1.2
      800      |       4.2   |       4.3    |     8.7  |      1.2
      8000     |       9.0   |       9.1    |    11.2  |      1.5
      80000    |      58.3   |      59.0    |    30.6  |      4.2
      800000   |     546.9   |     546.9    |   183.4  |     31.3
      8000000  |    5729.7   |    5701.0    |  6165.4  |    484.1
```

ATen results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       4.0   |       4.0    |     8.7  |      1.2
      80       |       3.6   |       3.8    |     9.0  |      1.2
      800      |       4.1   |       4.3    |     8.9  |      1.2
      8000     |       8.9   |       9.2    |    10.6  |      1.5
      80000    |      57.0   |      57.4    |    28.8  |      4.3
      800000   |     526.9   |     526.9    |   178.3  |     30.2
      8000000  |    5568.1   |    5560.6    |  6042.1  |    453.2

[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
8 threads: ---------------------------------------------------------
      8        |      3.9    |      3.8     |     9.1  |      1.2
      80       |      3.8    |      3.9     |     8.8  |      1.2
      800      |      4.2    |      4.3     |     8.9  |      1.3
      8000     |      9.0    |      9.2     |    10.4  |      1.5
      80000    |     26.0    |     26.8     |    26.4  |      4.4
      800000   |     92.9    |     87.3     |    72.1  |     22.4
      8000000  |    793.5    |    791.8     |  5334.8  |    115.1
```
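
For reference, this kind of measurement can be reproduced roughly as follows (a sketch, not the original gh-43858 benchmark script):

```python
import torch
from torch.utils import benchmark

t = torch.randn(8_000_000)
for num_threads in (1, 8):
    m = benchmark.Timer("t.var()", globals={"t": t},
                        num_threads=num_threads).blocked_autorange()
    print(f"{num_threads} threads: {m.median * 1e6:.1f} us")
```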

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59258

Reviewed By: mruberry

Differential Revision: D28821216

Pulled By: ngimel

fbshipit-source-id: f35992c21f08a0a8878053680dc0ca7a8facd155
2021-06-02 19:01:39 -07:00
c03cae49fc [DDP] Remove unused initialize_buckets (#59066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59066

Per title
ghstack-source-id: 130338812

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28734666

fbshipit-source-id: 89ca7f8e625c4068ba0ed9800be2619e469ae515
2021-06-02 17:20:22 -07:00
2a78e896a0 [DDP] use work.result() in _check_global_requires_backward_grad_sync (#59065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065

Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28551203

fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
2021-06-02 17:19:07 -07:00
517ea26eee [deploy] Make load_library a no-op inside a package (#58933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58933

**Summary**
This commit makes load_library calls no-ops inside packages run with
deploy. Libraries containing custom C++ operators and classes are statically linked in C++
and don't need to be loaded. This commit takes advantage of the fact that sys.executable is
set to torch_deploy in deploy and uses that to exit early in load_library if
the program is running inside deploy.
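
The early exit amounts to something like this sketch (the real implementation lives in `torch.ops.load_library`; only the `sys.executable` check is taken from the description above):

```python
import sys

def load_library(path):
    if sys.executable == "torch_deploy":
        return  # custom ops/classes are statically linked inside deploy
    # ... otherwise fall through to the normal shared-library loading path
```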

**Test Plan**
This commit adds a test to `generate_examples`/`test_deploy` that
packages and runs a function that calls `load_library`. The library
doesn't exist, but that's okay because the function should be a no-op
anyway.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D28687159

Pulled By: SplitInfinity

fbshipit-source-id: 4a61fc636698e44f204334e338c5ce35257e7ae2
2021-06-02 17:01:31 -07:00
dfe85d6fd7 Revert D28840199: [pytorch][PR] Update version to 1.10
Test Plan: revert-hammer

Differential Revision:
D28840199 (3453aa44c1)

Original commit changeset: acc5a93e12a3

fbshipit-source-id: a41eb7c882fe0bf8f9a35ef180e99a7e72f6857d
2021-06-02 16:25:51 -07:00
2ce23136d0 Use irange in torch/csrc utils (#55556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55556

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D27625936

fbshipit-source-id: 79065438f582a6f5fe6f1f796b6984767605197e
2021-06-02 15:47:00 -07:00
e6c8e9497c Small fix type hints in mobile optimizer (#59282)
Summary:
Adjusts type hints for optimize_for_mobile to be consistent with the defaults. Right now, using optimize_for_mobile and passing only a script_module gives a type error complaining that preserved_methods can't be None.
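
The adjusted stub amounts to something like this sketch (names and defaults are assumptions drawn from the description, not copied from the actual `.pyi`):

```python
from typing import List, Optional

def optimize_for_mobile(
    script_module,
    optimization_blocklist=None,
    preserved_methods: Optional[List[str]] = None,  # was effectively List[str]
    backend: str = "CPU",
):
    ...
```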

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59282

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Open source tests ran the lints. Internal CI should be enough here.

Reviewed By: jbschlosser

Differential Revision: D28838159

Pulled By: JacobSzwejbka

fbshipit-source-id: dd1e9aff00a759f71d32025d8c5b01e612c869a5
2021-06-02 15:32:16 -07:00
318c858eb5 [fx2trt] Organize converters and add unittests (#59261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59261

Split converters to different files instead of putting them in a single file.

Reviewed By: jackm321

Differential Revision: D28613989

fbshipit-source-id: f25ca3732c457af51a07ef466915a4a08bd45e6e
2021-06-02 15:22:15 -07:00
0eafef5031 Fix internal assert location in custom Function binding (#59301)
Summary:
For facebook employees, this fix some internal failures from https://www.internalfb.com/tasks/?t=92100671

This was not a problem before https://github.com/pytorch/pytorch/pull/58271 because these cycles used to just be leaked (so nothing was cleared/dealloced).
Now that we properly clean up these cycles, we have to fix the assert in the clear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59301

Reviewed By: jbschlosser

Differential Revision: D28841564

Pulled By: albanD

fbshipit-source-id: e2ec51f6abf44c4e3a83c293e90352295a43ba37
2021-06-02 15:09:51 -07:00
c3745dc580 Small change for torch.distributed launcher (#59152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59152

Small change for https://fb.workplace.com/groups/319878845696681

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D28773682

Pulled By: H-Huang

fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c
2021-06-02 15:05:41 -07:00
3453aa44c1 Update version to 1.10 (#59325)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59325

Reviewed By: jbschlosser, seemethere

Differential Revision: D28840199

Pulled By: malfet

fbshipit-source-id: acc5a93e12a3db47d6103ea064bec9e40320f708
2021-06-02 15:00:33 -07:00
7ee68363a8 Add new rpc.barrier API (#53423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53423

closes #40166

This change exposes a new API, rpc.barrier() which blocks the main processes of all workers running RPC until the whole group completes this function. Optionally rpc.barrier can take in a set of worker_names and only synchronize across those worker names.

Example:
```python
import os
import torch.multiprocessing as mp
import torch.distributed.rpc as rpc
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5678"

world_size = 4
odd_num_workers = [f"worker{i}" for i in range(world_size) if i % 2]
even_num_workers = [f"worker{i}" for i in range(world_size) if not i % 2]

def worker(i):
    print(i)
    rpc.init_rpc(f"worker{i}", rank=i, world_size=world_size)
    if i % 2:
        print(f"start barrier {i}")
        rpc.barrier(set(odd_num_workers))
    else:
        print(f"start barrier {i}")
        rpc.barrier(set(even_num_workers))
    rpc.shutdown()
    print(f"shutdown{i}")

if __name__ == '__main__':
    with mp.Pool(processes=world_size) as pool:
        pool.map(worker, range(world_size))
```

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27737145

Pulled By: H-Huang

fbshipit-source-id: 369196bc62446f506d1fb6a3fa5bebcb0b09da9f
2021-06-02 14:20:16 -07:00
1765f51618 [iOS GPU] [BE] use channel-last to transform the weights (#59113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59113

Manually permuting the weights is slower than calling `at::contiguous()`
ghstack-source-id: 130374487

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D28762278

fbshipit-source-id: 1dde3ef82018bc2507d0ca5132b1ee97dc99787f
2021-06-02 14:02:11 -07:00
1968efa2dd [c10d] Remove verbose log (#59070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59070

This log is too verbose, especially in the case where we call monitored
barrier before every collective, as we do in ProcessGroupWrapper.
ghstack-source-id: 130052822

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28738189

fbshipit-source-id: f2899537caa4c13508da31134d5dd0f4fd6a1f3a
2021-06-02 13:50:11 -07:00
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185
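
A minimal illustration of the validated behavior (a sketch; the error text below is paraphrased, not quoted from the source):

```python
import torch
import torch.nn.functional as F

w = torch.randn(10)  # 1-D weight: now rejected up front
try:
    F.embedding(torch.tensor([0, 1]), w)
except RuntimeError as e:
    print(e)  # complains that the weight must be 2-D
```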

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
fb709a8ca5 Build with USE_GLOO_WITH_OPENSSL=1 (#59274) (#59323)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59323

Reviewed By: jbschlosser

Differential Revision: D28839920

Pulled By: malfet

fbshipit-source-id: 63cffa6fe25cf354966354e5dd5490ba6e5b3d11
2021-06-02 12:51:00 -07:00
f7097b0c0b Make unary tests runnable if SCIPY is not installed (#59304)
Summary:
By adding `if TEST_SCIPY else _NOTHING` to special.i1 and special.i1e

Discovered while running tests on M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59304

Reviewed By: jbschlosser

Differential Revision: D28835693

Pulled By: malfet

fbshipit-source-id: e4fde6584da29fa43bc6da75eebe560512754ed0
2021-06-02 12:47:30 -07:00
eae84f0d5d Fix ONNX forward compatibility (#59327)
Summary:
Fixes `onnx.utils.polish_model` not found exception when executed using onnx-1.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59327

Reviewed By: H-Huang

Differential Revision: D28840563

Pulled By: malfet

fbshipit-source-id: 403a29a88e7dee8b3414602b9fe2b31baf737dce
2021-06-02 12:39:56 -07:00
c22ac14969 [Error-reporting] Set upper boundary on border element (#59311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59311

The diff sets an upper boundary on the border element when presenting the error message. This is required in order to avoid unnecessary log contamination.

Test Plan: Example of log contamination: https://www.internalfb.com/fblearner/details/276849996/operator/2942475685?tab=try_27021599785797968

Reviewed By: d4l3k

Differential Revision: D28812745

fbshipit-source-id: 4f491b9acc8cc9831d763f185022879bbbfb4c8a
2021-06-02 12:28:54 -07:00
99f2000a99 Migrate nonzero from TH to ATen (CPU) (#59149)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is needed to write the outputs in the right location; a sketch of this two-pass scheme follows the table below.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |
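
A minimal sketch of the two-pass scheme described above (plain Python/NumPy for illustration, not the ATen kernel): pass 1 counts nonzeros per chunk, an exclusive prefix sum assigns each thread its write offset, and pass 2 writes indices into the shared output.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_nonzero_1d(a, num_threads=4):
    chunks = np.array_split(np.arange(a.size), num_threads)
    counts = [np.count_nonzero(a[c]) for c in chunks]        # pass 1
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    out = np.empty(int(np.sum(counts)), dtype=np.int64)
    def write(i):                                            # pass 2
        idx = chunks[i][a[chunks[i]] != 0]
        out[offsets[i]:offsets[i] + counts[i]] = idx
    with ThreadPoolExecutor(num_threads) as ex:
        list(ex.map(write, range(num_threads)))
    return out

a = np.where(np.random.rand(10_000) > 0.5, np.random.randn(10_000), 0.0)
assert np.array_equal(parallel_nonzero_1d(a), np.flatnonzero(a))
```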

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
2021-06-02 12:26:29 -07:00
b4d30bb583 [PyTorch] Use expect_contiguous in CPU matmul (#58895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58895

There doesn't seem to be any reason we can't use expect_contiguous here.
ghstack-source-id: 130283300

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28666399

fbshipit-source-id: b4a9bcb01ff1c30d991765140c8df34c3ac3a89b
2021-06-02 12:04:18 -07:00
0528325b5f [iOS GPU] Raise the minimum OS support version to 11.0 (#59310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59310

We recently updated the GK to deliver GPU models only to 11.0+ devices. Will do a cleanup in following diffs to remove the shader functions written for iOS 10.0.
ghstack-source-id: 130374598

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28805864

fbshipit-source-id: 4cde34ff9fbbe811a69686a0f29b56d69aeefbee
2021-06-02 11:53:45 -07:00
f8f06e7099 [iOS GPU] Fix the OSS macos build (#59102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59102

ghstack-source-id: 130374334

Test Plan:
- On the OSS side
    - CI
    - `USE_PYTORCH_METAL=ON python setup.py install --cmake`

Reviewed By: IvanKobzarev

Differential Revision: D28757412

fbshipit-source-id: 2efea9dfe7361a73c02d1ca5fbf587835d39d325
2021-06-02 11:47:11 -07:00
874f287c52 add dispatch for bitwise_and (#59125)
Summary:
ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59125

Reviewed By: ngimel

Differential Revision: D28802058

Pulled By: ezyang

fbshipit-source-id: cccbbff46df552235072fa38fea1c19b068991ea
2021-06-02 11:42:49 -07:00
484d53f4a0 [torch][JIT] Warn only once when using unscripted dictionary (#59287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59287

D27211605 added a warning in `toIValue` that warns users to script their
dictionaries before passing them to TorchScript functions in order to get some
performance benefits and reference semantics. However, this warning is emitted
every time `toIValue` is called (e.g. when a dictionary is passed to a
TorchScript function), which can lead to noisy log output. This diff changes
this to use `TORCH_WARN_ONCE` instead.

Test Plan: Sandcastle, OSS CI.

Reviewed By: hyuen

Differential Revision: D28824468

fbshipit-source-id: e651eade4380abaf77c6c8a81ec4e565b0c2c714
2021-06-02 11:41:37 -07:00
82052b0a76 [vulkan] Remove constant duplication for Vulkan optimize_for_mobile (#59276)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59276

Test Plan: Imported from OSS

Reviewed By: cccclai, ngimel

Differential Revision: D28814072

Pulled By: IvanKobzarev

fbshipit-source-id: d5cfd1352a2e07cdd4708d19fe4320444521db78
2021-06-02 11:38:18 -07:00
3ec0904718 docs: Add note about nightly versions bump (#59324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59324

Also updates section on pinning pytorch/builder with an example

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28840049

Pulled By: seemethere

fbshipit-source-id: e5d6722713680e969893d9df97ec269fc9c00411
2021-06-02 11:29:41 -07:00
5386f6935a avg_pool3d: port to structured (#59083)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59083

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28802620

Pulled By: ezyang

fbshipit-source-id: 1e890af3c37912447198aa2f20914b99decda8b2
2021-06-02 11:29:39 -07:00
5dc426a6f6 avg_pool2d_backward: Port to structured (#59082)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59082

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28802621

Pulled By: ezyang

fbshipit-source-id: 15b8ba562eee132ef8390a7de520bdd8e15d0f86
2021-06-02 11:28:25 -07:00
eb1adc4c5e cmake: Add USE_GLOO_WITH_OPENSSL to Summary.cmake (#59321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59321

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28839370

Pulled By: seemethere

fbshipit-source-id: 0d4b35c05c2b1a78b752088cd16cd6263958e7f6
2021-06-02 11:10:55 -07:00
afd5237a4f Revert D28800692: [nnc] Enable CPU fusion inside Facebook, take 3
Test Plan: revert-hammer

Differential Revision:
D28800692 (6e7dae9cec)

Original commit changeset: d791c3b2ccd7

fbshipit-source-id: 5042fecfbab59181572013bf39760bc716e86430
2021-06-02 10:07:46 -07:00
a7aeaaf99e Added missing namespaces for C++ API (#45736)
Summary:
Hello,

depending on the build environment you may encounter
```c++
error: reference to 'optional' is ambiguous
```
when using the Torch-C++-API.

This PR adds `c10::` to avoid possible ambiguities with **std::optional** and does not introduce any functional change.

Fixes https://discuss.pytorch.org/t/linker-failed-with-ambiguous-references/36255 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45736

Reviewed By: dzhulgakov

Differential Revision: D24125123

Pulled By: VitalyFedyunin

fbshipit-source-id: df21420f0a2d0270227c28976a7a4218315cc107
2021-06-02 09:46:20 -07:00
87a25e09f4 [quant][graphmode][fx][refactor] Remove _convert from Quantizer class (#59042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59042

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724867

fbshipit-source-id: 9f87d51020caa20d5408cb2820947e23d92d5fc3
2021-06-02 08:50:56 -07:00
580831bfbb Add support for MatMul to BatchMatMulFP16Acc{16,32}Fake Op Mapping
Test Plan: f276981395

Reviewed By: hx89

Differential Revision: D28815646

fbshipit-source-id: c16b081bf3da2b157b9d42ea67b03dae88e82c6d
2021-06-02 08:32:21 -07:00
599f5058cf [ONNX] Update ONNX to rel-1.9 (#55889) (#57080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57080

The ONNX optimizer was removed in ONNX 1.9.
This PR removes the ONNX optimizer from the C++ code path and uses a `try-except` block in Python to keep it compatible with both ONNX 1.8 and 1.9.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28467330

Pulled By: malfet

fbshipit-source-id: 5e4669dd0537648898e593f9e253da18d6dc7568

Co-authored-by: neginraoof <neginmr@utexas.edu>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-06-02 08:27:17 -07:00
f87aa23125 .github: Remove windows dependency installs (#59283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59283

We were observing 403s when attempting to install dependencies from
chocolatey, leading us to believe that we were getting rate limited by
chocolatey.

We've instead opted to install our dependencies in our base AMIs,
considering we would install them on every workflow anyway. This also
comes with moving the Windows 10 SDK installation to the base image
as well, since we were observing failures there too due to failed
dependency installations.

Also moves the Windows 10 SDK installation to our Visual Studio installation script, which is activated by passing an environment variable

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D28822962

Pulled By: seemethere

fbshipit-source-id: b5e35ffe4537db55deb027376bd2d418683707a5
2021-06-02 08:16:21 -07:00
3a2149a4ce [reland] Make TP agent use streams from Future when sending response (#59212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59212

Reland of https://github.com/pytorch/pytorch/pull/58428

Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 130202842

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623885

fbshipit-source-id: 29333bcb75d077ab801eac92017d0e381e8f5569
2021-06-02 05:46:05 -07:00
258a991027 [reland] Set and propagate devices in RRef completion future (#59211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59211

Reland of https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 130202843

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28623886

fbshipit-source-id: 6c82ed7c785ac3bf32fff7eec67cdd73b96aff28
2021-06-02 05:46:04 -07:00
a3392cafe0 [reland] Set streams when invoking UDFs (#59210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59210

Reland of https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 130202847

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623889

fbshipit-source-id: ed38242f813dac075d162685d52ae89f408932f9
2021-06-02 05:46:02 -07:00
f8a3fd4e34 [reland] Create CUDA-aware futures in RequestCallback (#59209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59209

Reland of https://github.com/pytorch/pytorch/pull/58426

The operations in RequestCallback can return CUDA tensors, thus the futures used to hold them must be CUDA-aware.
ghstack-source-id: 130202844

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623887

fbshipit-source-id: 53561b8ae011458d8f848f0a03830925aff2f0c2
2021-06-02 05:46:00 -07:00
3af6ff98ff [reland] Provide pre-extracted DataPtrs when completing a Future with a Message (#59208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59208

Reland of https://github.com/pytorch/pytorch/pull/58425

Now that callbacks can provide pre-extracted DataPtrs, let's do so. This will become of crucial importance in the next PR, where some of these futures will become CUDA-aware, and thus they will try to extract DataPtrs on their own, but they would fail to do so here because Message isn't "inspectable".
ghstack-source-id: 130202845

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623888

fbshipit-source-id: 1aa4bde8014870c071685ba8f72d5f3f01f0a512
2021-06-02 05:45:59 -07:00
1adc289e10 [reland] Allow Future::then to return pre-extracted DataPtrs (#59207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59207

Reland of https://github.com/pytorch/pytorch/pull/58424

In CUDA mode, Future must inspect its value and extract DataPtrs. However some types are not supported, for example the C++/JIT custom classes, which include Message, which is widely used in RPC. Hence for these scenarios we allow the user to perform the custom DataPtr extraction on their own, and pass the pre-extracted DataPtrs.

Note that `markCompleted` already allowed users to pass in pre-extracted DataPtrs, hence this PR simply extends this possibility to the `then` method too.
ghstack-source-id: 130202846

Test Plan: Used in next PR.

Reviewed By: mrshenli

Differential Revision: D28623890

fbshipit-source-id: 468c5308b40774ba0a778b195add0e0845c1929e
2021-06-02 05:45:57 -07:00
b07d68e24c [reland] Always use intrusive_ptr for Message (2 out of 2) (#59206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59206

Reland of https://github.com/pytorch/pytorch/pull/58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones within toMessageImpl. And since they're the last ones, we make the constructor of Message private, to prevent new usages from emerging.
ghstack-source-id: 130202848

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623892

fbshipit-source-id: f815cf6b93e488c118e5d2298473e6e9d9f4c132
2021-06-02 05:45:55 -07:00
5ec169b4c3 [reland] Always use intrusive_ptr for Message (1 out of 2) (#59205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205

Reland of https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate and simplify many of the problems above.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 130202849

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623891

fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
2021-06-02 05:44:49 -07:00
44c20ce676 Alias for i0 to special namespace (#59141)
Summary:
See https://github.com/pytorch/pytorch/issues/50345

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59141

Reviewed By: ngimel

Differential Revision: D28784097

Pulled By: mruberry

fbshipit-source-id: 9b61a21906ef337292686fd40e328502a79e6f09
2021-06-01 23:04:09 -07:00
059a717c9e Fix breakpad build and add to more images (#59236)
Summary:
This PR
* adds the breakpad build to most of the remaining docker images (except the mobile + slim ones)
* pins to a [fork of breakpad](https://github.com/google/breakpad/compare/master...driazati:master?expand=1) to enable daisy chaining on signal handlers
* renames the API to be nicer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59236

Reviewed By: malfet

Differential Revision: D28792511

Pulled By: driazati

fbshipit-source-id: 83723e74b7f0a00e1695210ac2620a0c91ab4bf2
2021-06-01 22:47:14 -07:00
dbe629c51d [RPC Framework] Support creating a RemoteModule by RRef (#59242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59242

#Original PR Issue: https://github.com/pytorch/pytorch/issues/58274

This can be a workaround: Instead of passing a script `RemoteModule` over RPC, pass its `module_rref` field over RPC, and then construct a new `RemoteModule` on the receiver end.
ghstack-source-id: 130268018

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script_not_supported

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_by_module_rref

Reviewed By: vipannalla

Differential Revision: D28794905

fbshipit-source-id: 1a677ff0d4b47c078ad47b50d7102a198a1fc39b
2021-06-01 22:35:03 -07:00
3218d890dd [quant][graphmode][fx][fix] Fix support for custom module (#59041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59041

Static quantization support for custom modules was removed in a previous refactor
(https://github.com/pytorch/pytorch/pull/57519) since it was not covered by the test case.
This PR re-enables the test case and fixes the support.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724866

fbshipit-source-id: 1974675b88b56a2173daf86965d6f3fb7ebd783b
2021-06-01 22:31:15 -07:00
06af7618e7 [quant][graphmode][fx][refactor] Remove Quantizer class from convert (QuantizeHandler) (#59040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59040

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724870

fbshipit-source-id: c0f748711b825cd46bdfcc05c054c77a41e8207a
2021-06-01 22:00:49 -07:00
0a26781966 fix numpy compatibility in test for torch.kthvalue (#59214)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59201. Should be merged after https://github.com/pytorch/pytorch/issues/59067 to ensure this is actually working correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59214

Reviewed By: albanD

Differential Revision: D28792363

Pulled By: mruberry

fbshipit-source-id: 0cf613463139352906fb567f1efcc582c2c25de8
2021-06-01 21:57:09 -07:00
e9e1bb1a4e Fix device of info tensor for torch.linalg.inv_ex with MAGMA backend (#59223)
Summary:
This PR fixes `torch.linalg.inv_ex` with MAGMA backend.
`info` tensor was returned on CPU device even for CUDA inputs.
Now it's on the same device as input.

Fixes https://github.com/pytorch/pytorch/issues/58769
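A quick illustration of the fixed behavior (a sketch; assumes a CUDA device is available):

```python
import torch

A = torch.randn(3, 3, device="cuda")
inverse, info = torch.linalg.inv_ex(A)
# Before the fix, `info` lived on the CPU even for CUDA inputs;
# after it, both outputs share the input's device.
assert inverse.device == A.device
assert info.device == A.device
```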

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59223

Reviewed By: ngimel

Differential Revision: D28814876

Pulled By: mruberry

fbshipit-source-id: f66c6f06fb8bc305cb2e22b08750a25c8888fb65
2021-06-01 21:49:57 -07:00
50e6ee3ca2 [quant][graphmode][fx][refactor] Remove Quantizer class from quantize_node (#59039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59039

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724874

fbshipit-source-id: bd984716b2da1d6879c3e92fa827574783a41567
2021-06-01 21:40:08 -07:00
2d8f0d966f CUDA support in the CSR layout: CUDA addmm/matvec (#59012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59012

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719631

Pulled By: bhosmer

fbshipit-source-id: 43e2004a61e114aeb0a7c6ad8a25fedda238c6da
2021-06-01 21:16:42 -07:00
3efefc4016 [CUDA graphs] Makes sure all graphs tests call empty_cache() at some point before capture (#59233)
Summary:
Graphs tests are sometimes flaky in CI ([example](https://app.circleci.com/pipelines/github/pytorch/pytorch/328930/workflows/0311199b-a0be-4802-a286-cf1e73f96c70/jobs/13793451)). When the GPU runs near its max memory capacity (which is not unusual during a long test), the caching allocator may, to satisfy new allocations that don't match any existing unused blocks, call `synchronize_and_free_events` to wait on block end-of-life events and cudaFree unused blocks, then re-cudaMalloc a new block. For ungraphed ops this isn't a problem, but synchronizing or calling cudaFree while capturing is illegal, so `synchronize_and_free_events` raises an error if called during capture.

The graphs tests themselves don't use much memory, so calling torch.cuda.empty_cache() at some point before their captures should ensure memory is available and the captures never need `synchronize_and_free_events`.

I was already calling empty_cache() near the beginning of several graphs tests. This PR extends it to the ones I forgot.
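
A minimal sketch of the pattern, using today's public capture API (assumes a CUDA device; the exact capture calls in these 2021-era tests may differ):

```python
import torch

# Release cached-but-unused blocks so that allocations made during capture
# never force the allocator to synchronize or call cudaFree mid-capture.
torch.cuda.empty_cache()

x = torch.zeros(1024, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * 2  # work captured into the graph

g.replay()
```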

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59233

Reviewed By: mruberry

Differential Revision: D28816691

Pulled By: ngimel

fbshipit-source-id: 5cd83e48e43b1107daed5cfa2efff0fdb4f99dff
2021-06-01 21:05:46 -07:00
1d37f41567 [quant][graphmode][fx][refactor] Remove _prepare from Quantizer class (#59038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59038

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724869

fbshipit-source-id: e8501c9720b5ddb654e78bc8fa08de0466c1d52b
2021-06-01 18:01:22 -07:00
970096b624 [Reland] Adds an aten::_ops namespace with unambiguous function names (#59018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59018

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
an non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: bdhirsh

Differential Revision: D28721206

Pulled By: zou3519

fbshipit-source-id: a070017f98e8f4038cb0c64be315eef45d264217
2021-06-01 17:19:06 -07:00
8805093ec5 use long index type for index_add_cuda deterministic path (#59254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59254

index_add can take an int or long index tensor, whereas index_put only takes long indices.

In the deterministic path of index_add_cuda we use index_put, hence we must convert the index tensor to long.
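
For illustration, a small repro of the path in question (a sketch, assuming a CUDA device and determinism mode enabled):

```python
import torch

torch.use_deterministic_algorithms(True)

x = torch.zeros(5, device="cuda")
src = torch.ones(3, device="cuda")
idx = torch.tensor([0, 2, 2], dtype=torch.int32, device="cuda")  # int indices are legal for index_add

# The deterministic implementation routes through index_put, which only
# accepts long indices, so idx is converted to long internally.
x.index_add_(0, idx, src)
```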

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_add_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (14.748)
    ✓ Pass: caffe2/test:torch_cuda - test_index_add_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (27.717)
    ✓ Pass: caffe2/test:torch_cuda - main (27.717)

Reviewed By: ngimel

Differential Revision: D28804038

fbshipit-source-id: de12932a7738f2805f3bceb3ec024497625bce6a
2021-06-01 16:28:18 -07:00
20348fb32e [quant][graphmode][fx][refactor] Remove find_matches from Quantizer class (#59037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59037

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724865

fbshipit-source-id: 6c6824d0af7dd47d4c111d6a08e373bc65f33e08
2021-06-01 16:07:07 -07:00
7d64fc675b [quant][graphmode][fx][refactor] Remove fold_weights from Quantizer class (#59036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59036

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724862

fbshipit-source-id: 5900420127fcc14846bc34c9ac29ff7e6a703f1e
2021-06-01 15:52:57 -07:00
8af6281201 DOC Adds register_module_full_backward_hook into docs (#58954)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54443

Adds `register_module_full_backward_hook` into the index so it is rendered in the html docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58954

Reviewed By: ngimel

Differential Revision: D28801816

Pulled By: jbschlosser

fbshipit-source-id: a2e737fe983e5d7e4e26d7639183bca34b571cb8
2021-06-01 15:47:10 -07:00
6e7dae9cec [nnc] Enable CPU fusion inside Facebook, take 3 (#59253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59253

Fixed a miscompilation exposed by multithreaded profiling collection; let's try again.
ghstack-source-id: 130286580

Test Plan: servicelab

Reviewed By: navahgar, huiguoo

Differential Revision: D28800692

fbshipit-source-id: d791c3b2ccd75fe5e6eca0859083d4cd67460147
2021-06-01 15:42:22 -07:00
cc4891804c [quant][graphmode][fx][refactor] Remove save_state and restore_state from Quantizer class (#59035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59035

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724872

fbshipit-source-id: d32752c635917c9820e5e7cc414ba9d48a258a19
2021-06-01 15:38:36 -07:00
336ac9496f Fix mismatch in README.md Docker Image section (#59199)
Summary:
docker.Makefile has CUDNN_VERSION=8 as the default, but README.md states cuDNN v7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59199

Reviewed By: mruberry

Differential Revision: D28808611

Pulled By: ngimel

fbshipit-source-id: 96cea32bfe33184b2bff69b7bb7f3e50a2b9c6aa
2021-06-01 15:22:30 -07:00
95c26b2806 [ROCm] disable test test_Conv2d_groups_nobias for ROCm (#59158)
Summary:
Disabling the test since it's failing in ROCm 4.2.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59158

Reviewed By: mruberry

Differential Revision: D28808953

Pulled By: ngimel

fbshipit-source-id: 134f147ead6dc559d2cde49cf8343cd976e6c224
2021-06-01 15:10:06 -07:00
3d521e8b40 [quant][graphmode][fx][refactor] Remove prepare_custom_config from Quantizer class (#59034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59034

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724873

fbshipit-source-id: 870e0822843ad1d035f41eaa015bdde9ccf6ec23
2021-06-01 14:52:22 -07:00
a5dcd3c4b7 Revert D28240105: [pytorch][PR] Fix DistributedSampler mem usage on large datasets
Test Plan: revert-hammer

Differential Revision:
D28240105 (a0ce8da26e)

Original commit changeset: 4c6aa493d0f7

fbshipit-source-id: 8a0e17764c2f26c8316f88ad6c8772b08883ceee
2021-06-01 14:44:23 -07:00
a0ce8da26e Fix DistributedSampler mem usage on large datasets (#51841)
Summary:
The current implementation of DistributedSampler generates a python list to hold all of the indices, and then returns a slice of this list for the given rank (creating a partial copy of the list). When the underlying dataset is large, both of these choices waste a large amount of memory. It is much more efficient to create a tensor to hold the indices, and then index into that tensor instead of creating slices.

In the case of a sampler with `shuffle=False`, it would be possible to avoid creating the `indices` tensor entirely (since the index will always match the value), but I have opted instead here to keep the implementation as similar to the existing version as possible. One possible benefit of this approach is that memory usage will not significantly change based on changing this parameter. Still, it might be better to simply return the indices directly without the underlying array.

Additionally, the logic around calculating the number of samples is unnecessarily complex. When dropping the last batch, this can be a simple floor division.

In a simple test script which creates a sampler for a dataset with 100,000,000 items, memory usage is reduced 98% compared to the existing implementation.
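
The core of the approach, as a simplified sketch (not the full sampler; epoch seeding and the padding done for drop_last=False are omitted):

```python
import torch

def rank_indices(dataset_len: int, num_replicas: int, rank: int, seed: int = 0):
    g = torch.Generator()
    g.manual_seed(seed)
    # One tensor holds every index; no Python list of dataset_len ints.
    indices = torch.randperm(dataset_len, generator=g)
    num_samples = dataset_len // num_replicas  # drop_last: simple floor division
    total_size = num_samples * num_replicas
    # Index into the tensor instead of slicing off a partial list copy.
    return indices[rank:total_size:num_replicas]
```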

Fixes https://github.com/pytorch/pytorch/issues/45427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51841

Reviewed By: albanD

Differential Revision: D28240105

Pulled By: rohan-varma

fbshipit-source-id: 4c6aa493d0f75c07ec14c98791b3a531300fb1db
2021-06-01 14:15:14 -07:00
5a42a97c49 Add NCCL_ASYNC_ERROR_HANDLING as an environment variable (#59109)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.

This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).
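
For reference, the variable is read when the NCCL process group is constructed, so it must be set before initialization (a sketch; RANK and WORLD_SIZE are placeholders supplied by the launcher):

```python
import os
import torch.distributed as dist

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # must be set before init_process_group

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```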

The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109

Reviewed By: H-Huang, SciPioneer

Differential Revision: D28761148

Pulled By: andwgu

fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
2021-06-01 14:02:41 -07:00
5f1117226f DOC Update register_buffer/parameter docstring explaining None (#59015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40977
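
A short example of the behavior the docstring now spells out (a sketch):

```python
import torch

class Stats(torch.nn.Module):
    def __init__(self, track: bool = False):
        super().__init__()
        # Passing None reserves the name without allocating a tensor; a None
        # buffer is skipped by state_dict() until a real tensor is assigned.
        self.register_buffer("running_mean", torch.zeros(4) if track else None)

print("running_mean" in Stats(track=False).state_dict())  # False
print("running_mean" in Stats(track=True).state_dict())   # True
```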

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59015

Reviewed By: ngimel

Differential Revision: D28797948

Pulled By: jbschlosser

fbshipit-source-id: 3bf60af5c1cfc5f1786b4975b48f093391374503
2021-06-01 13:55:07 -07:00
e4b2684331 [quant][graphmode][fx][refactor] Remove patterns from Quantizer class (#59033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59033

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724861

fbshipit-source-id: 97b38e851b6bf581510a24636b1d8d6f1d977f5a
2021-06-01 13:44:08 -07:00
83892c1861 [quant][graphmode][fx][refactor] Remove node_name_to_scope from Quantizer (#59032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59032

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724868

fbshipit-source-id: 6df639f20076b480812b6dcf0fc7d2c87ca29d8b
2021-06-01 13:26:09 -07:00
3826f7e8e0 [quant][graphmode][fx][refactor] Remove quantized_graph from Quantizer (#59031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59031

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724871

fbshipit-source-id: dad0332ba271c4cfb6ec1e8f2036443149b5bea4
2021-06-01 13:01:54 -07:00
1b4586ee20 [quant][fx][graphmode][refactor] Remove modules from Quantizer (#59030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59030

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724875

fbshipit-source-id: d6610c1d5eb7755331252be9e348a230abf4175c
2021-06-01 12:42:28 -07:00
aa857850bb Add check_env, getenv api (#59052)
Summary:
Related Issue: https://github.com/pytorch/pytorch/issues/57691
This PR introduces an API for checking environment variables:

```c++
optional<bool> check_env(const char *name)
```
Reads the environment variable `name` and returns
- `optional<true>` if it is set to "1"
- `optional<false>` if it is set to "0"
- `nullopt` otherwise

Issues a warning if the environment variable was set to any value other than 0 or 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59052

Test Plan:
Manually run the following test case:

- Apply this diff to the repo
```
diff --git a/torch/csrc/Exceptions.cpp b/torch/csrc/Exceptions.cpp
index d008643f70..990d254f0d 100644
--- a/torch/csrc/Exceptions.cpp
+++ b/torch/csrc/Exceptions.cpp
@@ -9,6 +9,9 @@

 #include <torch/csrc/THP.h>

+#include <c10/util/Optional.h>
+#include <c10/util/env.h>
+
 // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 PyObject *THPException_FatalError;

@@ -23,18 +26,7 @@ bool THPException_init(PyObject *module)
 namespace torch {

 static bool compute_cpp_stack_traces_enabled() {
-  auto envar = std::getenv("TORCH_SHOW_CPP_STACKTRACES");
-  if (envar) {
-    if (strcmp(envar, "0") == 0) {
-      return false;
-    }
-    if (strcmp(envar, "1") == 0) {
-      return true;
-    }
-    TORCH_WARN("ignoring invalid value for TORCH_SHOW_CPP_STACKTRACES: ", envar,
-               " valid values are 0 or 1.");
-  }
-  return false;
+ return c10::utils::check_env("TORCH_SHOW_CPP_STACKTRACES").value_or(false);
 }

 bool get_cpp_stacktraces_enabled() {
```
This patch replaces the prior `std::getenv` usage in `torch/csrc/Exceptions.cpp` with the new API.
- Run the following python3 script
```python
import torch

print(torch.__version__) # should print local version (not release)

a1 = torch.tensor([1,2,3])
a2 = torch.tensor([2])

a1 @ a2
```
using the following commands
```bash
python3 test.py # should not output CPP trace
TORCH_SHOW_CPP_STACKTRACES=1 python3 test.py # should output CPP trace
```

Reviewed By: ngimel

Differential Revision: D28799873

Pulled By: 1ntEgr8

fbshipit-source-id: 3e23353f48679ba8ce0364c049420ba4ff86ff09
2021-06-01 12:24:14 -07:00
fd2a36369a Fixed torch.nn.MultiMarginLoss equation format error (#59188)
Summary:
Removed the extra parenthesis from the right-hand side of the equation.
Fixes https://github.com/pytorch/pytorch/issues/58634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59188

Reviewed By: ngimel

Differential Revision: D28797720

Pulled By: jbschlosser

fbshipit-source-id: 47e3084526389e7d1cc17c1a01b253e666c58784
2021-06-01 12:04:34 -07:00
06399d441d Create EngineHolder for serializing and running TRT Engines with PyTorch
Test Plan:
**python tests**
`buck test mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_test`

**python tests to generate test models** (this outputs the jit model files for use with cpp tests)
`buck run mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_generate_test_models`

**cpp tests**
`buck test mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_test_cpp`

**run service locally**

*build service*
`buck build mode/opt-split-dwarf -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 smart/inference_platform_sp/predictor_gpu:service`

*run service*
`buck-out/gen/smart/inference_platform_sp/predictor_gpu/service --model_dir="/home/jackmontgomery" --model_id=123_0 --pytorch_predictor_use_cuda`

*build requester*
`buck build mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 glow/fb/test:invoke_cv_pt_predictor`

*run requester*
`buck-out/gen/glow/fb/test/invoke_cv_pt_predictor.par --model_id=123_0 --port=33131 --host="2401:db00:eef0:1100:3560:0:1c02:2115" --num_parallel_requesters=1`

Reviewed By: 842974287

Differential Revision: D28581591

fbshipit-source-id: 7738b05543c2c840ee6b8f0d4818f21dc7f61b19
2021-06-01 11:41:33 -07:00
e9e5588588 Improve Tensor traverse to traverse its grad_fn when possible (#58271)
Summary:
There are two main changes here:
- THPVariable instances will actually visit their grad_fn if there is no other reference to the C++ Tensor and no other reference to the grad_fn. The critical observation compared to the existing comment (thanks Ed!) is that if we also check that the C++ Tensor object is not referenced anywhere else, we're sure that no one can change the grad_fn refcount between the traverse and the clear.
- THPVariable doesn't need a special clear for this new case, as we're the only owner of the C++ Tensor, so cdata.reset() will necessarily free the Tensor and all its resources.

The two tests are to ensure:
- That the cycles are indeed collectible by the gc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58271

Reviewed By: ngimel

Differential Revision: D28796461

Pulled By: albanD

fbshipit-source-id: 62c05930ddd0c48422c79b03118db41a73c1355d
2021-06-01 10:27:52 -07:00
65748f81c9 Un-verbose the build (#59235)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59235

Reviewed By: zou3519

Differential Revision: D28792468

Pulled By: driazati

fbshipit-source-id: 98f730ea0ee28b4b5c13198879bee8f586c0c14c
2021-06-01 10:14:26 -07:00
7523728368 [quant][graphmode][fx] Factor out run_weight_observer (#59029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59029

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724864

fbshipit-source-id: 67ac5e7eb351970fdf46532c3c2ac6ac831bc697
2021-06-01 10:01:42 -07:00
10fc42eacc [quant][graphmode][fx] Merge quant_env and env (#59028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59028

Previously we had both an env and a quant_env in convert, which was a bit confusing;
in this PR we merge them into a single Dict[str, Tuple[Node, torch.dtype]].

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724863

fbshipit-source-id: 722a682c70d300a6ccd2b988786a1ac2d45e880e
2021-06-01 09:21:38 -07:00
afdfd2288a Revert D28767060: [pytorch][PR] Migrate renorm to ATen (CPU and CUDA)
Test Plan: revert-hammer

Differential Revision:
D28767060 (74ec50893d)

Original commit changeset: 93dcbe5483f7

fbshipit-source-id: ae85d90212df4e6bb3a5da310e97ad1c06aa9a77
2021-06-01 05:15:21 -07:00
0b040e17e5 More user-friendly error messages (#59106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59106

Should make debugging a bit easier

Test Plan:
Example error in https://www.internalfb.com/intern/aibench/details/884106485190261 (open log for Portal or Portal+):
```
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 29, in forward
    _0 = uninitialized(__torch__.torch.classes._nnapi.Compilation)
    if torch.__is__(self.comp, None):
      _1 = (self).init(args, )
            ~~~~~~~~~~ <--- HERE
    else:
      pass
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 97, in init
    comp = __torch__.torch.classes._nnapi.Compilation.__new__(__torch__.torch.classes._nnapi.Compilation)
    _22 = (comp).__init__()
    _23 = (comp).init(self.ser_model, self.weights, )
           ~~~~~~~~~~ <--- HERE
    self.comp = comp
    return None

Traceback of TorchScript, original code (most recent call last):
  File "/data/users/dhaziza/fbsource/fbcode/buck-out/dev/gen/mobile-vision/d2go/projects/facegen/tools/export_to_app#link-tree/torch/backends/_nnapi/prepare.py", line 47, in forward
    def forward(self, args: List[torch.Tensor]) -> List[torch.Tensor]:
        if self.comp is None:
            self.init(args)
            ~~~~~~~~~ <--- HERE
        comp = self.comp
        assert comp is not None
  File "/data/users/dhaziza/fbsource/fbcode/buck-out/dev/gen/mobile-vision/d2go/projects/facegen/tools/export_to_app#link-tree/torch/backends/_nnapi/prepare.py", line 42, in init
        self.weights = [w.contiguous() for w in self.weights]
        comp = torch.classes._nnapi.Compilation()
        comp.init(self.ser_model, self.weights)
        ~~~~~~~~~ <--- HERE
        self.comp = comp
RuntimeError: [enforce fail at nnapi_model_loader.cpp:171] result == ANEURALNETWORKS_NO_ERROR. NNAPI returned error: 4
```

Reviewed By: axitkhurana

Differential Revision: D28287450

fbshipit-source-id: ccd10301e1492f8879f9d6dd57b60c4e683ebb9e
2021-06-01 02:05:24 -07:00
cab4849463 [caffe2][glow] Share info about current batch_size (#58902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58902

Pull Request resolved: https://github.com/pytorch/glow/pull/5681

Reviewed By: ChunliF

Differential Revision: D28665162

fbshipit-source-id: 39e173a24ee247bc6fee44009798c74dddb27648
2021-06-01 01:21:42 -07:00
7fb3385f4b Automated submodule update: FBGEMM (#59170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59170

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ffc2e1a91e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58874

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: hx89

Differential Revision: D28648577

Pulled By: jspark1105

fbshipit-source-id: 0ad1a6fdf27cd3f05f9e342030461cb7caa9986b
2021-05-31 23:18:58 -07:00
74ec50893d Migrate renorm to ATen (CPU and CUDA) (#59108)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24754, closes https://github.com/pytorch/pytorch/issues/24616, closes https://github.com/pytorch/pytorch/issues/50874

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns  the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
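
The shape of the new implementation, sketched in Python (the real code fuses the normalization-factor step into a kernel and guards against division by zero; names here are illustrative):

```python
import torch

def renorm_sketch(x: torch.Tensor, p: float, dim: int, maxnorm: float) -> torch.Tensor:
    reduce_dims = [d for d in range(x.dim()) if d != dim]
    # Step 1: reuse a vector norm to measure each sub-tensor along `dim`.
    norms = torch.linalg.vector_norm(x, ord=p, dim=reduce_dims, keepdim=True)
    # Step 2: turn norms into normalization factors (1 where already within budget).
    factors = torch.where(norms > maxnorm, maxnorm / norms, torch.ones_like(norms))
    # Step 3: a plain broadcasted multiply does the rest.
    return x * factors
```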

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59108

Reviewed By: mrshenli

Differential Revision: D28767060

Pulled By: ngimel

fbshipit-source-id: 93dcbe5483f71cc6a6444fbd5b1aa1f29975d857
2021-05-31 22:38:16 -07:00
223725cfb0 OpInfo: div - port pending method_tests entry (#59173)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Depends on: https://github.com/pytorch/pytorch/issues/59154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59173

Reviewed By: ngimel

Differential Revision: D28785178

Pulled By: mruberry

fbshipit-source-id: 902310f2d77e499a2355a23b2d5a8c0b21b8c5bb
2021-05-31 17:32:27 -07:00
6d45d7a6c3 Enables previously "slow" gradgrad checks on CUDA (#57802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57508

Earlier, a few CUDA `gradgrad` checks (see the list of ops below) were disabled because of them being too slow. There have been improvements (see https://github.com/pytorch/pytorch/issues/57508 for reference) and this PR aimed on:

1. Time taken by `gradgrad` checks on CUDA for the ops listed below.
2. Enabling the tests again if the times sound reasonable

Ops considered: `addbmm, baddbmm, bmm, cholesky, symeig, inverse, linalg.cholesky, linalg.cholesky_ex, linalg.eigh, linalg.qr, lu, qr, solve, triangular_solve, linalg.pinv, svd, linalg.svd, pinverse, linalg.householder_product, linalg.solve`.

For numbers (on time taken) on a separate CI run: https://github.com/pytorch/pytorch/pull/57802#issuecomment-836169691.

cc: mruberry albanD pmeier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57802

Reviewed By: ngimel

Differential Revision: D28784106

Pulled By: mruberry

fbshipit-source-id: 9b15238319f143c59f83d500e831d66d98542ff8
2021-05-30 22:16:46 -07:00
ef40757de3 OpInfo: zero_ (#58731)
Summary:
See https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58731

Reviewed By: ngimel

Differential Revision: D28784083

Pulled By: mruberry

fbshipit-source-id: f06de8045afd3728b1fedc014c091d8fd1955a9f
2021-05-30 21:49:29 -07:00
2aeb16c13a [fix] i1-i1e ROCm failure: mark array as const so that it is available for host and device (#59187)
Summary:
Fix failing ROCm build introduced by https://github.com/pytorch/pytorch/issues/56352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59187

Reviewed By: ngimel

Differential Revision: D28784072

Pulled By: mruberry

fbshipit-source-id: 36a5bd11ad2fe80a81aae6eb8b21f0901c842ddc
2021-05-30 21:44:54 -07:00
fea7a79e0b [special] Add ndtr (#58126)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Plot:
![image](https://user-images.githubusercontent.com/19503980/117942099-54efd680-b328-11eb-8948-c3080779ce19.png)
https://colab.research.google.com/drive/1Of67A042rOImj8wrLF_fUTgoy_wVEOZS?usp=sharing

TODO:
* [x] Add docs (https://13385714-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.ndtr)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58126

Reviewed By: anjali411

Differential Revision: D28700957

Pulled By: mruberry

fbshipit-source-id: 5b9991e97ec1e8fd01518cc9d9849108d35fe406
2021-05-30 21:12:04 -07:00
2a78f6376c TensorIterator: Reduce serial_for_each static overhead (#58909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58909

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28776507

Pulled By: ngimel

fbshipit-source-id: 4f0283d03b26aa5785b687b78d77e6b0efcbaf65
2021-05-30 21:08:54 -07:00
445e838210 OpInfo: resize_, resize_as_ (#59176)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59176

Reviewed By: ngimel

Differential Revision: D28780083

Pulled By: mruberry

fbshipit-source-id: 472584e8faa4cb1031908df097849d2d4167fdf5
2021-05-30 18:53:17 -07:00
ea465f7378 OpInfo: true_divide and minor fix (#59154)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59154

Reviewed By: ngimel

Differential Revision: D28780115

Pulled By: mruberry

fbshipit-source-id: 91e254698597fa0c7d4df6053ec017a85e180304
2021-05-30 18:35:30 -07:00
aaccdc3996 SparseCsr: Fix some uses of deprecated Tensor methods (#58990)
Summary:
This fixes some deprecation warnings in the build that were introduced by https://github.com/pytorch/pytorch/issues/58768.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58990

Reviewed By: ngimel

Differential Revision: D28776804

Pulled By: mruberry

fbshipit-source-id: 8abf75ea8f7adca537f9c808e68356829407665e
2021-05-30 03:58:19 -07:00
6ee9466d3a OpInfo: tensor_split: port remaining method_test entries (#59133)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59133

Reviewed By: ngimel

Differential Revision: D28776470

Pulled By: mruberry

fbshipit-source-id: 975a7062788de514f214f8c4ef0146eaf6b407f7
2021-05-30 00:40:29 -07:00
96c549d1c6 Replace dim_apply with TensorIterator (#58656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58656

Ref gh-56794

`dim_apply` is problematic because it calls `Tensor.select` inside of a parallel
region. Instead, replace it with `TensorIterator` by squashing the
apply-dimension. This is similar to the `_dim_apply` function already used by
the sort kernels:

8c91acc161/aten/src/ATen/native/cpu/SortingKernel.cpp (L27)

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28776441

Pulled By: ngimel

fbshipit-source-id: 14449d4b12ed4576f879bb65a35e881ce1a953b1
2021-05-30 00:09:14 -07:00
cab65ea3b9 OpInfo: renorm (#59079)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59079

Reviewed By: ngimel

Differential Revision: D28776789

Pulled By: mruberry

fbshipit-source-id: ca46f2debe918c3de1f3b5bbc9924b7ddfe9442a
2021-05-29 22:38:15 -07:00
5c18994674 [special] Add i1 and i1e (#56352)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

* [x] Check Docs https://12721710-65600975-gh.circle-artifacts.com/0/docs/special.html
* [x] Investigate fp32 failure on CI?! (Fails on clang. Reproduced locally with clang-11)
* [ ] Kernel vs Composite?
* [x] Autograd for `i0e` for zero?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56352

Reviewed By: anjali411

Differential Revision: D28700888

Pulled By: mruberry

fbshipit-source-id: 91a3cbb94f5b8a3b063589ec38179848c11def83
2021-05-29 20:55:23 -07:00
27009d6129 [TensorExpr] Add NNC lowerings for aten::view, aten::reshape and aten::expand_as. (#59157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59157

Currently view is represented as a copy, since we don't support inplace
operations in NNC (similar to `aten::reshape`). Lowering for
`aten::expand_as` is exactly the same as for `aten::expand`, since
we're building the TE expression based on the output shape anyway.

Differential Revision: D28774224

Test Plan: Imported from OSS

Reviewed By: Chillee

Pulled By: ZolotukhinM

fbshipit-source-id: 0a1593c4c6500dcc5a374213adb734180ae1f72e
2021-05-29 20:36:32 -07:00
355b24438c make vector_norm backward call norm_backward (#59135)
Summary:
Per title. Remove duplicated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59135

Reviewed By: mruberry

Differential Revision: D28775716

Pulled By: ngimel

fbshipit-source-id: 50dc77590db15976453fc41c3657a77198749849
2021-05-29 12:14:46 -07:00
9fc0c5a54a OpInfo: tril, triu (#59145)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59145

Reviewed By: ngimel

Differential Revision: D28776433

Pulled By: mruberry

fbshipit-source-id: 2ff11a5202af1e73ffc2b242035c990646bd2259
2021-05-29 02:55:50 -07:00
1871d4e604 avoid explicitly casting low precision inputs to fp32 in norm (#59134)
Summary:
Per title. Now `norm` with fp16/bfloat16 inputs and fp32 outputs on CUDA won't do an explicit cast.
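
Concretely, the path affected is the mixed-dtype reduction (a sketch; assumes a CUDA device):

```python
import torch

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
# Accumulate and return in fp32 without first materializing an fp32 copy of x.
n = torch.norm(x, p=2, dtype=torch.float32)
```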

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59134

Reviewed By: mruberry

Differential Revision: D28775729

Pulled By: ngimel

fbshipit-source-id: 896daa4f02e8a817cb7cb99ae8a93c02fa8dd5e9
2021-05-29 00:48:18 -07:00
d68df54269 OpInfo: fill_ (#59138)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59138

Reviewed By: ngimel

Differential Revision: D28776451

Pulled By: mruberry

fbshipit-source-id: 2e8e9f1805ec7d900223ea749a4a0b86a1bedb54
2021-05-29 00:35:02 -07:00
a427820350 [NNC] Added triangular_solve external call + fixed permute (#59131)
Summary:
The triangular_solve lowering only returns the first output, since the second output is just a copy of one of the inputs. Why does that exist?

Also, I fixed the permute lowering: I was previously applying the inverse of the permutation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59131

Reviewed By: ansley

Differential Revision: D28768169

Pulled By: Chillee

fbshipit-source-id: 8e78611c6145fb2257cb409ba98c14ac55cdbccf
2021-05-28 22:29:30 -07:00
c9af4c2636 OpInfo: where (#58349)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58349

Reviewed By: mrshenli

Differential Revision: D28744220

Pulled By: mruberry

fbshipit-source-id: 893a2fb88a48a60df75c7d6e2f58a42ca949daa7
2021-05-28 18:22:03 -07:00
b977a3b66d [c10d] Split custom class bindings out of python binding code (#58992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58992

Currently, we define Torchbind custom classes in the same place that we define Python bindings.

This is nice from a code location perspective, but has two downsides:
1. These custom classes are not available in a C++-only build.
2. These break when included in torch::deploy.

Some explanation on the second issue: torch::deploy creates many Python
interpreters, and creates a full copy of all the bindings for each one. This
will run the static initialization code once for each copy of the bindings,
leading to multiple registration of the custom classes (and therefore an
error).

This PR splits out the relevant custom class binding code into its own source
file to be included in libc10d, which can be compiled and statically
initialized a single time and linked against from the c10d python bindings.
ghstack-source-id: 130168942

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D28690832

fbshipit-source-id: 3c5e3fff28abb8bcdb4a952794c07de1ee2ae5a8
2021-05-28 15:35:23 -07:00
ab372ba510 [iOS GPU] Add debug information to track memory allocation exception (#59112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59112

ghstack-source-id: 130027273

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28730604

fbshipit-source-id: 2ec7ca1b722a9fe496635cb6eea7e0d88b0c18b1
2021-05-28 12:16:29 -07:00
41054f2ab5 CUDA support in the CSR layout: sparse_to_dense/add_sparse_csr (#59011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59011

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719550

Pulled By: bhosmer

fbshipit-source-id: 530c7cd1b20ae6d8865fd414afaf6fab27a643e6
2021-05-27 20:59:22 -07:00
9c83e4160d Use some c10::ThreadLocal to avoid crashes on old Android toolchains (#59017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017

See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.

Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.

Reviewed By: ilia-cher

Differential Revision: D28720762

fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
2021-05-27 20:49:03 -07:00
4b3d17c0a2 Include Macros.h in ThreadLocal
Summary: This wasn't picking up C10_ANDROID.  Not sure how to prevent stuff like this.

Test Plan: Build for Android+gnustl, saw proper ThreadLocal being defined.

Reviewed By: swolchok

Differential Revision: D28720763

fbshipit-source-id: 58eb4ea80ad32a856fcea6d65e5c1c37ebf3bd55
2021-05-27 20:47:56 -07:00
0c1420aa3c OpInfo: fmod and remainder (#57941)
Summary:
See https://github.com/pytorch/pytorch/issues/54261

cc: mruberry Lezcano kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57941

Reviewed By: mrshenli

Differential Revision: D28744464

Pulled By: mruberry

fbshipit-source-id: 19847277d4f8d3a39a706c2b3c9eddf0dedcb20c
2021-05-27 20:32:56 -07:00
657b75d155 Revert D28700259: [pytorch][PR] Migrate nonzero from TH to ATen (CPU)
Test Plan: revert-hammer

Differential Revision:
D28700259 (95b1bc1009)

Original commit changeset: 9b279ca7c36d

fbshipit-source-id: 267afe63376be598d24c862e02e3b4b3ea75f77c
2021-05-27 20:07:30 -07:00
4e543d017f Move remaining \*Sort\* in THC to ATen (#58953)
Summary:
https://github.com/pytorch/pytorch/issues/24637

CC zasdfgbnm ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58953

Reviewed By: mrshenli

Differential Revision: D28749713

Pulled By: ngimel

fbshipit-source-id: 33ce87cf77e23d5d67d193d6368131cb8dab39ae
2021-05-27 18:35:42 -07:00
f3aa61b9ed Add peephole for len(x.size()) (#59051)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59051

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D28727247

Pulled By: eellison

fbshipit-source-id: 6474d39773b640992bdaf261575a3dbd48c6d56c
2021-05-27 17:57:53 -07:00
b9dc51863c Add more shape functions for mobilenet (#58932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58932

This adds all the operators necessary for mobilenet. I kind of wanted to get these landed to unblock ZolotukhinM, but I'm happy to split these up into multiple PRs if it makes reviewing easier. In terms of testing, I'm going to add an automated shape analysis OpInfo test.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28727246

Pulled By: eellison

fbshipit-source-id: c17f9b7bdf7a43ddf99212b281ae2dd311259374
2021-05-27 17:57:52 -07:00
0ebc665305 Switch symbolic shape registry to operator map (#58890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58890

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28663681

Pulled By: eellison

fbshipit-source-id: 5b44fdf14a8ffe05606cc12897e366a64259650d
2021-05-27 17:57:50 -07:00
d8cbba3ee2 [JIT] Disable Complete Shape Inlining For Testing Purposes (#56966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56966

This PR adds a toggle to shape analysis which won't inline complete tensor shapes as constants into the shape compute graph, which is a good stress test on the partial evaluation pipeline.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28444664

Pulled By: eellison

fbshipit-source-id: a62e424515a8837a4b596546efa93af5e8e61f10
2021-05-27 17:57:48 -07:00
f66fbb1e2e Add unary/binary ops necessary for mobilenet (#56828)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56828

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28444660

Pulled By: eellison

fbshipit-source-id: 656673e6139550f2752c0d3ac2fb8731f4bf9bbb
2021-05-27 17:56:30 -07:00
40f851c53e Use dataclasses to simplify ShardingSpec (#58893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58893

Leverage dataclasses to simplify some of the ShardingSpec classes.
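
For flavor, a hypothetical sketch of what dataclasses buy here (field names are illustrative, not the exact spec):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ShardMetadata:
    # dataclasses generate __init__, __repr__ and __eq__, removing the
    # hand-written boilerplate from each spec class.
    shard_offsets: List[int]
    shard_lengths: List[int]
    placement: str
```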
ghstack-source-id: 130041687

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28665137

fbshipit-source-id: da37517cf2bd8c65d4a5b7cae171fa460e6b0946
2021-05-27 17:33:28 -07:00
89d78851e6 [quant][refactor tests] Move qtensor serialization tests from test_deprecated_jit (#59089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59089

Move these tests into test_quantized_tensor

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28750065

fbshipit-source-id: 5c4350d49dd07710b86ba330de80369403c6013c
2021-05-27 17:04:15 -07:00
886a2ddc83 [quant][refactor tests] Clean up test_quantization.py (#59088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59088

Clean up comments and organize the tests better

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28750064

fbshipit-source-id: 4c36922e25e3adea3aaa8b4d9185dc28b17aa57c
2021-05-27 17:03:01 -07:00
f993ceffb5 TensorIteratorReduce: Avoid tensor operations in parallel_for (#58655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58655

Ref gh-56794

The two pass reduction calls `copy_` and `select` inside a parallel region. The
`copy_` can just be moved outside of the parallel region, but avoiding the
`select` call is more complicated because it's needed to construct the
`TensorIterator`. Instead, I factor out a `serial_for_each` free-function that
just takes pointers and strides. Then manually advance the pointer to the
thread-specific slice of data.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28735330

Pulled By: ngimel

fbshipit-source-id: 8e096eb5801af9381ebd305e3ae7796a79b86298
2021-05-27 15:58:03 -07:00
ef32a29c97 Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59080

Original commit changeset: 3686579517cc

Test Plan: None; reverting diff

Reviewed By: albanD

Differential Revision: D28746799

fbshipit-source-id: 75a7885ab0bf3abadde9a42b56d479f71f57c89c
2021-05-27 15:40:52 -07:00
3d2b55553b Retiring _module_copies field in DDP reducer. (#59094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59094

Commented out _module_copies fields and changed for loops accordingly

Test Plan: Test cases mentioned in T91292908 passed successfully

Reviewed By: SciPioneer

Differential Revision: D28736135

fbshipit-source-id: 1857102f0c57a734026f3025e9653d8fad57d0b6
2021-05-27 15:09:14 -07:00
c6c563fc26 Added minor fixes to Az DevOps Build Logic (#59016)
Summary:
This PR also makes a few minor logic changes to the custom PyTorch PR tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59016

Reviewed By: mrshenli

Differential Revision: D28732437

Pulled By: malfet

fbshipit-source-id: 14b7ed837209d77e0e175d92959aeb0f086e6737
2021-05-27 14:35:11 -07:00
61f946bba6 don't copy indices to the self device in dispatch_index (#59059)
Summary:
Let the index/index_put implementation in ATen take care of moving the indices to the correct device; don't make the Python wrapper do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59059

Reviewed By: mruberry

Differential Revision: D28750562

Pulled By: ngimel

fbshipit-source-id: 2f2b5f875733898f1c0b30b544c89808f91e4a6f
2021-05-27 14:19:59 -07:00
16ae6cad3d Revert D28615349: [pytorch][PR] Add a ci/no-build label
Test Plan: revert-hammer

Differential Revision:
D28615349 (bae06a0293)

Original commit changeset: 1ed521761ca4

fbshipit-source-id: 987439c2570bbffc0f0f8517d82970a3a4add789
2021-05-27 14:17:06 -07:00
bae06a0293 Add a ci/no-build label (#58778)
Summary:
Depends on https://github.com/pytorch/pytorch-probot/pull/22. Adds a new label called `ci/no-build` that disables the CircleCI `build` workflow on PRs. The current behavior should be the same in the absence of `ci/no-build`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58778

Reviewed By: malfet

Differential Revision: D28615349

Pulled By: samestep

fbshipit-source-id: 1ed521761ca4ffa32db954a51918f693beddb3f3
2021-05-27 14:03:03 -07:00
3e2db56dcf [docs] document dim argument to tensor.size() (#58777)
Summary:
[docs] document dim argument to tensor.size()
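
The documented behavior in one snippet:

```python
import torch

t = torch.zeros(2, 3)
print(t.size())    # torch.Size([2, 3])
print(t.size(1))   # 3, equivalent to t.size()[1]
print(t.size(-1))  # negative dims count from the end: 3
```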

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58777

Reviewed By: gchanan

Differential Revision: D28641109

Pulled By: zou3519

fbshipit-source-id: 5cb46bb8abe45ed299843af38515e5db89ad02a1
2021-05-27 13:51:56 -07:00
18302bcdf3 Add script to cancel workflows (#59019)
Summary:
This removes our cancel_redundant_workflows job in favor of GitHub's built in [`concurrency`](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency) keyword which limits runs of a particularly named group. Since the group names have to be unique per job per PR, it should end up looking something like `filename-job_name-{pr number | sha (for non-PR workflows)}`. There's also a script to check workflows and ensure that it is being properly gated so people don't forget to add the key in the future.

`ruamel.YAML` also didn't like some of the spacing, so that is changed; it has the side benefit of making the files more consistent.

This also has a minor change of renaming the workflow templates from `.in` to `.j2` which is the standard Jinja2 extension that the VSCode extension automatically picks up for syntax highlighting / errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59019

Test Plan: pushed a commit `reset` and then immediately another commit `test`: the jobs from `reset` are cancelled: https://github.com/pytorch/pytorch/actions/runs/880099510

Reviewed By: samestep

Differential Revision: D28722419

Pulled By: driazati

fbshipit-source-id: c547a161877a0583be9d7edb29244b086b6bcad1
2021-05-27 12:32:15 -07:00
920619dc2b [PyTorch] Save a refcount bump in meta functions for addmm and mm (#59063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59063

`TensorMeta::maybe_get_output()` returns `const Tensor&`, so there is no need to copy the Tensor.
ghstack-source-id: 130044287

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28735225

fbshipit-source-id: f2bdf39b28de245ec4664718490e7e0b36bc8819
2021-05-27 12:15:52 -07:00
2c17b6a0fe [ONNX] Enable support for roll() op. (#58389) (#58697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58697

1. Add a symbolic function for aten::roll() op in symbolic_opset9.py.
2. Add a test with multiple scenarios as well.

Test Plan: Imported from OSS

Reviewed By: driazati, bhosmer

Differential Revision: D28714807

Pulled By: SplitInfinity

fbshipit-source-id: eae85f2dcf02737c9256a180f6905a935ca3f57e

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-05-27 12:06:45 -07:00
1aabb8f98c [ONNX] handle aten::_set_item on Dict in convertInplaceOpsAndTrackAlias (#58317) (#58696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58696

It seems the JIT produces an output for aten::_set_item on lists but
not on dicts. Previously the code would crash because it assumed it
was operating on a list.

The different behavior can be seen with the following test:

```python
class DictModule(torch.nn.Module):
    def forward(self, x_in: torch.Tensor) -> typing.Dict[str, torch.Tensor]:
        x_out = {}
        x_out["test_key_out"] = x_in
        return x_out

x_in = torch.tensor(1)
dms = torch.jit.script(DictModule())
torch.onnx.export(dms, (x_in,), "/dev/null", example_outputs=(dms(x_in),))
```

Before this change:
`RuntimeError: outputs_.size() == 1INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/ir.h":452, please report a bug to PyTorch.`

After this change:
`RuntimeError: Exporting the operator prim_DictConstruct to ONNX opset version 9 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub.`

This is a more useful error message.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714804

Pulled By: SplitInfinity

fbshipit-source-id: 1e5dc5fb44d1e3f971a22a79b5cf009d7590bf84

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:44 -07:00
0a6828a306 [ONNX] use consistent quoting for string literals (#57757) (#58695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58695

As PEP8 says: "Pick a rule and stick to it." [1]

[1] https://www.python.org/dev/peps/pep-0008/#string-quotes

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714811

Pulled By: SplitInfinity

fbshipit-source-id: c95103aceb1725c17c034dc6fc8216627f189548

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:42 -07:00
b27fc0ff85 [ONNX] Improve lower tuples and handle control flow (#57650) (#58694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58694

Improving the logic for finding tuple patterns within control flow.
Also fixes: https://github.com/pytorch/pytorch/issues/56914

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714806

Pulled By: SplitInfinity

fbshipit-source-id: 1552100cf9cc88e6f58df2e90758e8898ba0a9b3

Co-authored-by: neginraoof <neginmr@utexas.edu>
2021-05-27 12:06:40 -07:00
57c9355a0d [ONNX] Update special post process for SequenceInsert after SequenceEmpty (#56965) (#58693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58693

`ONNX::SequenceEmpty` requires a dtype to be provided and defaults to float. We update the dtype of a previously created `ONNX::SequenceEmpty` node when the dtype is later discovered, through a downstream `ONNX::SequenceInsert` node, to be something other than float. This PR improves the algorithm to cover the nested-loop case.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714808

Pulled By: SplitInfinity

fbshipit-source-id: e45ab3a12d0fec637733acbd3cd0438ff80d2cd4

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-27 12:06:39 -07:00
b8c96e6b08 Support symbolic for conv_tbc (#58359) (#58692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58692

This is a fix for exporting fairseq models, see:
```python
model = torch.hub.load(github, 'conv.wmt14.en-fr', tokenizer='moses', bpe='subword_nmt')
model = torch.hub.load(github, 'conv.wmt17.en-de', tokenizer='moses', bpe='subword_nmt')
```
With this fix, and with the one `GradMultiply` line in the model script commented out, these two models can be exported successfully with performance targets met.

The original PR https://github.com/pytorch/pytorch/pull/57708 has merging issue, use this one instead.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714809

Pulled By: SplitInfinity

fbshipit-source-id: 71c2de6cec7ee05af68560996acf47d97af46fb2

Co-authored-by: David <jiafa@microsoft.com>
2021-05-27 12:06:37 -07:00
d101816fdd [ONNX] RNN scripting (#57564) (#58691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58691

Note the first commit in this PR has its own pull request here since it seemed self-contained:
https://github.com/pytorch/pytorch/pull/57082

* [ONNX] simplify batch_first logic in RNN tests

* [ONNX] support GRU with packed input in scripting mode

This required two changes:
* Add as_tensor to symbolic_opset9.py
* Change torch::jit::pushPackingPastRnn to recognize and properly
  replace another use of the batch_sizes output of prim::PackPadded.
  Previously the code assumed that the first use was as input to the
  RNN operator. However in some cases, it is also used to compute
  max_batch_size. For example in this code:
  https://github.com/pytorch/pytorch/blob/febff45/torch/nn/modules/rnn.py#L815-L815

With these changes the GRU tests now pass in scripting mode for opset
version >= 11.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714805

Pulled By: SplitInfinity

fbshipit-source-id: f19647a04533d9ec76399a8793b3f712ea0337d2

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:35 -07:00
51d14b6859 [ONNX] Update instance_norm2 symbolic to handle track_running_stats=True (#55051) (#58690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58690

Fixes [#53887](https://github.com/pytorch/pytorch/issues/53887)
Use the tracked running_mean and running_var instead of input statistics when track_running_stats=True.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714812

Pulled By: SplitInfinity

fbshipit-source-id: 3f2f2ec9a7eaf8a1432a552d751cbd5974b20195

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-05-27 12:05:26 -07:00
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
7e4e648c2a Enable NNC fusion for relu6 (#58773)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58773

Test Plan:
```
python test/test_ops.py -k relu6
python test/test_jit_fuser_te.py
```

Reviewed By: bertmaher

Differential Revision: D28721791

Pulled By: desertfire

fbshipit-source-id: a94f711977afd080faae052f66eb8dded3cdc79e
2021-05-27 10:54:02 -07:00
0106fe3934 avg_pool2d: port to structured (#58987)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58987

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28717067

Pulled By: ezyang

fbshipit-source-id: 984a8ae8bc05811b787fdac565566f359b55a3d6
2021-05-27 10:51:11 -07:00
d935259171 Remove redundant code from LayerNorm Fake Op. (#59054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59054

Removes redundant code that handled elementwise_affine.

Test Plan: GLOW_NNPI=1 buck run -c fbcode.platform=platform009 //caffe2/caffe2/contrib/fakelowp/test:test_layernorm_nnpi_fp16nnpi

Reviewed By: hx89

Differential Revision: D28726804

fbshipit-source-id: b03485e98d490d4e9e1b178a8c50677b77e27596
2021-05-27 10:35:32 -07:00
b14c3205fd [JIT] Add torch._C.ScriptDict (#52659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52659

**Summary**
This commit adds `torch._C.ScriptDict`, a dictionary type that has reference
semantics across the Python/TorchScript boundary. That is, modifications
made to instances of `torch._C.ScriptDict` in TorchScript are visible in
Python even when it is not returned from the function. Instances can be
constructed by passing an instance of a Python dictionary to
`torch.jit.script`. In the case of an empty dictionary, its type is
assumed to be `Dict[str, Tensor]` to be consistent with the handling of
empty dictionaries in TorchScript source code.

`torch._C.ScriptDict` is implemented using a modified version of pybind's `stl_bind.h`-style bindings attached to `ScriptDict`, `ScriptDictIterator` and `ScriptDictKeyIterator`, wrapper classes around `c10::impl::GenericDict` and `c10::impl::GenericDict::iterator`. These bindings allow instances of `torch._C.ScriptDict` to be used as if they were a regular Python `dict`. Reference semantics are achieved by simply retrieving the `IValue` contained in `ScriptDict` in `toIValue` (invoked when converting Python arguments to `IValue`s before calling TorchScript code).
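
A minimal sketch of the reference semantics described above (constructing the instance by passing a Python dict to `torch.jit.script`, per this commit):

```python
import torch
from typing import Dict

@torch.jit.script
def add_key(d: Dict[str, torch.Tensor]):
    d["b"] = torch.ones(2)

d = torch.jit.script({"a": torch.zeros(2)})  # returns a torch._C.ScriptDict
add_key(d)
print("b" in d)  # True: the mutation made in TorchScript is visible in Python
```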

**Test Plan**
This commit adds `TestScriptDict` to `test_list_dict.py`, a set of tests
that check that all of the common dictionary operations are supported
and that instances have reference semantics across the
Python/TorchScript boundary.

Differential Revision:
D27211605
D27211605

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Pulled By: SplitInfinity

fbshipit-source-id: 446d4e5328375791aa73eb9e8b04dfe3465af960
2021-05-27 10:25:30 -07:00
95b1bc1009 Migrate nonzero from TH to ATen (CPU) (#58811)
Summary:
Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue, but it breaks down, for example, with channels-last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.
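
A small check of the iteration-order guarantee discussed above (a sketch, not one of the PR's actual test cases): `nonzero` should return indices in row-major order even when the input's physical layout is channels-last.

```python
import torch

x = torch.zeros(1, 2, 2, 2).to(memory_format=torch.channels_last)
x[0, 0, 1, 1] = 1.0
x[0, 1, 0, 0] = 1.0
# Rows must follow logical (row-major) index order, not physical memory order:
print(x.nonzero())  # tensor([[0, 0, 1, 1], [0, 1, 0, 0]])
```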

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary to write the outputs in the right locations.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2220 us |            496 us |
| 128,128,32 | 1250 us |           976 us |            175 us |
|  64,128,32 |  581 us |           486 us |             88 us |
|  32,128,32 |  292 us |           245 us |             80 us |
|  16,128,32 |  147 us |           120 us |             71 us |
|  8,128,32  |   75 us |            61 us |             61 us |
|  4,128,32  |   39 us |            32 us |             32 us |
|  2,128,32  |   20 us |            17 us |             17 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58811

Reviewed By: anjali411

Differential Revision: D28700259

Pulled By: ngimel

fbshipit-source-id: 9b279ca7c36d8e348b7e5e4be0dd159e05aee159
2021-05-27 10:06:54 -07:00
934f6dca65 Fix pthreadpool guard test (#58977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58977

* Test was flaky because part of it ran async
* Remove the async part to test only the functionality added

Test Plan:
regular test:

`buck test mode/dev //caffe2/aten:test_thread_pool_guard -- --exact 'caffe2/aten:test_thread_pool_guard - TestThreadPoolGuard.TestRunWithGuard' --run-disabled`

stress test:

`buck test mode/dev //caffe2/aten:test_thread_pool_guard -- --exact 'caffe2/aten:test_thread_pool_guard - TestThreadPoolGuard.TestRunWithGuard' --run-disabled --jobs 18 --stress-runs 10 --record-results`

Reviewed By: kimishpatel

Differential Revision: D28703064

fbshipit-source-id: be19da3f42f44288afc726bdb2f40342eee26e01
2021-05-27 09:49:52 -07:00
e89b150a39 [typing] Pyre fixes for remote_module (#59046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59046

Correcting type hint for _RemoteModule to pass Pyre checks.

Test Plan: N/A

Reviewed By: walterddr, SciPioneer

Differential Revision: D28725237

fbshipit-source-id: 1ca714bbf1a597a29850f70bac826a0c95a4019f
2021-05-27 09:44:50 -07:00
11aa5e4f66 Add underscores to some internal names (#59027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59027

Add underscores to some of the internal names

Test Plan:
python test/test_profiler.py -v

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28724294

fbshipit-source-id: 1f6252e4befdf1928ac103d0042cbbf40616f74a
2021-05-27 09:39:28 -07:00
617b74aa35 [nnc] LLVMCodeGen for any target (#58713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58713

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28585722

Pulled By: bertmaher

fbshipit-source-id: 82885b9780dc1a8610660a90969d8d2baad97920
2021-05-27 09:25:15 -07:00
a1806134a7 [QAT] Fix the runtime run cannot resize variables that require grad (#57068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57068

When training with histogram observer on, we got this runtime error:
```
torch/quantization/observer.py", line 942, in forward
                    self.bins)

            self.histogram.resize_(combined_histogram.shape)
            ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            self.histogram.copy_(combined_histogram)
            self.min_val.resize_(combined_min.shape)
RuntimeError: cannot resize variables that require grad
```

Since the histogram observer is only used to collect histogram statistics, it should not need gradients. So we turn off grad tracking before resizing by calling the `detach_()` method, as sketched below.
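
A self-contained demonstration of the pattern used by the fix (the tensor names are illustrative, not the observer's actual buffers):

```python
import torch

buf = torch.zeros(3, requires_grad=True)
new = torch.ones(5)
# buf.resize_(new.shape) here would raise:
# RuntimeError: cannot resize variables that require grad
buf.detach_()            # drop grad tracking in place, as the fix does
buf.resize_(new.shape)
buf.copy_(new)
print(buf.shape)         # torch.Size([5])
```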

Test Plan:
- arc lint
- Train with histogram observer turned on, training finished successfully

f264139727

Reviewed By: supriyar

Differential Revision: D27147212

fbshipit-source-id: abed5b9c4570ffc6bb60e58e64791cfce66856cd
2021-05-27 09:12:06 -07:00
25ac647f64 [QAT] Auto format torch/quantization/observer.py (#57067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57067

auto format the code

Test Plan: lint

Reviewed By: jerryzh168

Differential Revision: D27147213

fbshipit-source-id: 008871d276c8891b2411549e17617e5c27d16ee3
2021-05-27 09:10:34 -07:00
9baf75c86e add test_filename field in scribe upload (#59024)
Summary:
Add test filename dimension to scribe upload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59024

Test Plan: CI - validate result on scuba table

Reviewed By: janeyx99

Differential Revision: D28726711

Pulled By: walterddr

fbshipit-source-id: 78a1708787f0507d1171800f633e1f7137f629cd
2021-05-27 08:21:05 -07:00
7461792c4a adding upload step on all build jobs (#58998)
Summary:
Relates to https://github.com/pytorch/pytorch/issues/58826.

Currently we don't have the exact build time collected for non-binary jobs. Collecting this reports the exact build time from when the PyTorch checkout finishes until the build stage succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58998

Test Plan: CI - validate result on scuba table

Reviewed By: janeyx99

Differential Revision: D28747962

Pulled By: walterddr

fbshipit-source-id: 715d91d597bc004977fdceaf245263c9c8aacc84
2021-05-27 08:17:01 -07:00
3d70ab08ae bump out repeat_interleave BC allow date (#59057)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59057

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28732990

Pulled By: bhosmer

fbshipit-source-id: 27a9fe9169b2da9405d2c3f235e7c015896fe7fc
2021-05-26 23:32:05 -07:00
74089a0d34 [quant][refactor tests] Move quantization tests into subfolders (#59007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59007

Create folders for each test category and move the tests.
Will follow-up with a cleanup of test_quantization.py

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28718742

fbshipit-source-id: 4c2dbbf36db35d289df9708565b7e88e2381ff04
2021-05-26 23:02:12 -07:00
e146ed21fb [quant][refactor tests] Move TestModelNumerics to a separate file (#59000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59000

These tests span both QAT and PTQ APIs so factor them out

Test Plan:
python test/test_quantization.py TestModelNumericsEager

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28713910

fbshipit-source-id: b2ad27cf59abb7cc0c4e4da705f8c9220410f8ad
2021-05-26 23:02:11 -07:00
b6c5c5d90e [quant][refactor tests] Rename test_numeric_suite and equalization tests (#58999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58999

Rename the test files to be more explicit that they are for eager mode

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28713909

fbshipit-source-id: b4ccd06c841fe96edf8c065a0bceae15fed260f9
2021-05-26 23:02:09 -07:00
82d587f434 [quant][refactor tests] split test_workflow_module into test_workflow_ops and test_workflow_module (#58963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58963

some tests are used to check the op level numerics of the fake quantize operations

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28696599

fbshipit-source-id: 98f9b0c993dd43050176125461ddd5288142989b
2021-05-26 23:01:08 -07:00
0e9a295b41 Refactor GlooDeviceFactory::makeDeviceFor... (#58996)
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost
duplicates, differing only in their default argument values.

Create a generic `makeGlooDevice` anonymous function that takes both the
host name and the interface name, and call it from both
makeDeviceFor[Hostname|Interface].

Also solve two other minor issues:
 - do not call `getenv("GLOO_DEVICE_TRANSPORT")` during library load
   time
 - Raise an exception rather than crash if GLOO_DEVICE_TRANSPORT is set to an unknown value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996

Reviewed By: pbelevich

Differential Revision: D28713324

Pulled By: malfet

fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d
2021-05-26 20:33:11 -07:00
9e60c7dee3 Add docstring for is_inference_mode_enabled (#59047)
Summary:

Testing:
```
>>> import torch
>>> torch.is_inference_mode_enabled.__doc__
'\nis_inference_mode_enabled(input) -> (bool)\n\nReturns True if inference mode is currently enabled.\n\nArgs:\n    input (Tensor): the input tensor.\n'
```
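
For context, a minimal usage sketch of the function being documented:

```python
import torch

print(torch.is_inference_mode_enabled())      # False
with torch.inference_mode():
    print(torch.is_inference_mode_enabled())  # True
```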

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59047

Reviewed By: ailzhang

Differential Revision: D28726991

Pulled By: soulitzer

fbshipit-source-id: c117c7d73e551a1b5f0e215f2aed528bf558ef7c
2021-05-26 19:27:33 -07:00
1bd22e28b3 BFloat16 support for torch.sort (#58196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58196
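
A minimal usage sketch of the newly supported dtype (which devices this covers depends on where the support was added; the default device is shown for illustration):

```python
import torch

x = torch.randn(5, dtype=torch.bfloat16)
values, indices = torch.sort(x)
print(values.dtype)  # torch.bfloat16
```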

Reviewed By: anjali411

Differential Revision: D28721364

Pulled By: ngimel

fbshipit-source-id: 0785f7100fb76d69da7a73022c7d2eb43c91fa6e
2021-05-26 16:49:03 -07:00
f4a890d7c6 fix unique for discontiguous inputs (#59003)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59003

Reviewed By: mruberry

Differential Revision: D28714534

Pulled By: ngimel

fbshipit-source-id: d9bf82f54be5b5919e27281e49fad74e00d8b766
2021-05-26 16:43:19 -07:00
b435a27fb7 CUDA support in the CSR layout: constructors (#59010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59010

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719287

Pulled By: bhosmer

fbshipit-source-id: fbb5784ccb5ce19dcca1f2f95c4ee16f9b7680c4
2021-05-26 16:39:43 -07:00
7c17e1dd90 [fx2trt] Quantized uru on gpu (#58965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58965

Test Plan:
```
// This script is just for playing around
buck run mode/opt -c python.package_style=inplace deeplearning/trt/fx2trt:fx2trt_quantized_test

// To check accuracy
buck run mode/opt -c python.package_style=inplace deeplearning/trt/fx2trt:uru_10x10_to_trt_eval.py
```

Reviewed By: mortzur

Differential Revision: D28445702

fbshipit-source-id: 5357a02a78cb7f9cf772e7a91a08166ef90cc4f8
2021-05-26 16:00:34 -07:00
58d1b3639b fix nn.MHA scriptability (#58727)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58727

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28593830

Pulled By: bhosmer

fbshipit-source-id: 37dee9efededaea9985a2bf040df1ba4b46f6580
2021-05-26 15:29:49 -07:00
ac67cda272 [PyTorch] Rename TI::add_borrowed_{in,out}put to TI::add_{in,out}put (#58608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58608

D28523254 (705dd9ffac) ensures that this was safe: we renamed away all the internal uses of add_input/add_output. (Also, practically everything I found internally could borrow, and the stuff that couldn't wouldn't compile because it was passed unnamed temporaries.)
ghstack-source-id: 129882758

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28524585

fbshipit-source-id: 437235d5cc55c3737c928991a996b8f5e1c5beaa
2021-05-26 15:06:28 -07:00
7db36c0792 [PyTorch] Add temporary guardrail to borrowing_ op variants on TensorIterator (#58607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58607

Don't let code that tries to pass temporaries to these variants compile.
ghstack-source-id: 129882759

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28524227

fbshipit-source-id: e5ce80f048480c67645198eaa0e43532567d4adb
2021-05-26 15:06:27 -07:00
bed0eb5428 [PyTorch] Add TI::add_owned_{in,out}put for clarity & use them (#58606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58606

Removes the pit of non-success around using the owning variants; gives us the option to make add_{in,out}put borrow in the future as a pit of success if we decide that's not bc-breaking.
ghstack-source-id: 129882760

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28523976

fbshipit-source-id: ab5eb7bf5d672a0f8c4a50eb8a21c156d4189709
2021-05-26 15:05:08 -07:00
4f390eb6b6 Document factory_kwargs in nn.Quantize + remove Attributes section (#59025)
Summary:
The `factory_kwargs` kwarg was previously undocumented in `nn.Quantize`. Further, the `Attributes` section of the docs was improperly filled in, resulting in bad formatting. This section doesn't apply since `nn.Quantize` doesn't have parameters, so it has been removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59025

Reviewed By: anjali411

Differential Revision: D28723889

Pulled By: jbschlosser

fbshipit-source-id: ba86429f66d511ac35042ebd9c6cc3da7b6b5805
2021-05-26 14:40:48 -07:00
a749e8edf5 Add UninitializedBuffer to nn docs (#59021)
Summary:
The `UninitializedBuffer` class was previously left out of `nn.rst`, so it was not included in the generated documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59021

Reviewed By: anjali411

Differential Revision: D28723044

Pulled By: jbschlosser

fbshipit-source-id: 71e15b0c7fabaf57e8fbdf7fbd09ef2adbdb36ad
2021-05-26 14:36:05 -07:00
de22657e1c [PyTorch] Replace RecordFunction shouldRun callback with atomic bools (#56504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504

Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.

This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.

There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
ghstack-source-id: 129614296

Test Plan:
Existing CI should cover correctness, right?  Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where we internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.

Reviewed By: ilia-cher

Differential Revision: D27834944

fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92
2021-05-26 14:31:33 -07:00
ac07c6451e [nnc] Use BufHandle in loopnest.cache_accesses python API (#59006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59006

Related github issue: https://github.com/pytorch/pytorch/issues/59002

Test Plan:
Imported from OSS

Tested in the github issue: https://github.com/pytorch/pytorch/issues/59002

Reviewed By: bertmaher

Differential Revision: D28714829

Pulled By: huiguoo

fbshipit-source-id: 5fd7d5426c5cdb5af30731f662b083d2bd611bc4
2021-05-26 13:58:55 -07:00
b93e7a7602 concurrency fixes (#58961)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58961

Reviewed By: anjali411

Differential Revision: D28700307

Pulled By: Krovatkin

fbshipit-source-id: 1fed90c64e88aaf2824c48e006f66a9266d1e163
2021-05-26 13:53:44 -07:00
97c1179c9d Revert D28549240: [typing] Pyre fixes for batch_distributed_inference
Test Plan: revert-hammer

Differential Revision:
D28549240 (671c224b0a)

Original commit changeset: dadfedf93aae

fbshipit-source-id: 820fefccf2b4c6368defd762ce55245dd35505ca
2021-05-26 13:39:30 -07:00
0d5527de7a Back out "Back out "[ONNX] Process const folding progressively when converts to ONNX (#54569)"" (#58923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58923

Original commit changeset: c54597b2048e
ghstack-source-id: 129842041

Test Plan: Sandcastle and OSS CI.

Reviewed By: snisarg

Differential Revision: D28432555

fbshipit-source-id: 2a9ec22cc004c7c6979f1cc8f3124b833cdc6634
2021-05-26 13:29:07 -07:00
b420ded66f ShardedTensor framework for ChunkedShardingSpec (#58517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58517

Building upon the sharding specifications, this PR introduces the
initial skeleton of ShardedTensor and allows building a ShardedTensor by
specifying ChunkedShardingSpec.

In follow up PRs, I'll add further support for GenericShardingSpec.
ghstack-source-id: 129917841

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28526012

fbshipit-source-id: 8e62847b58957d284e40f57a644302c171289138
2021-05-26 13:24:23 -07:00
671c224b0a [typing] Pyre fixes for batch_distributed_inference
Summary:
Pyre does not support dynamic imports, so we can leave the pyre-ignores for those. (https://fb.workplace.com/groups/pyreqa/permalink/3119812734775204/)

Parameterized pyre-ignores are also necessary, as explained by [this Q&A](https://www.internalfb.com/intern/qa/109058/pyre-says-undefined-attribute-16-module-parameteri)

Test Plan:
- `pyre -l .`
- `pyre check`
- `buck test //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`

Reviewed By: vipannalla

Differential Revision: D28549240

fbshipit-source-id: dadfedf93aae860fe6d0a112002bdfe743139b1e
2021-05-26 13:08:19 -07:00
be47060af9 [remove xla from codegen] rename aten_xla_type.h -> DispatchKeyNativeFunctions.h (#58568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58568

I split out the file rename into a separate commit to make the diff easier. The template file name is `aten_xla_type.h` -> `{DispatchKey}NativeFunctions.h`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711298

Pulled By: bdhirsh

fbshipit-source-id: 2fa7d2abede560a2c577300f0b5a1f7de263d897
2021-05-26 12:53:19 -07:00
86ce2950f6 remove xla-specific stuff from codegen (minus CPU fallback) (#58064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58064

**Summary**
This PR tries to remove all xla-specific logic from the codegen except for two places:
- renaming the `aten_xla_type.h/cpp` template files; Going to do that in a separate PR just to make the diff easier to understand
- CPU fallback logic (everything in `aten_xla_type_default.h/cpp` and `gen_external_aten_fallbacks.py`). I'm trying to kill all of that logic in a subsequent PR by making the CPU fallback a boxed kernel, so it felt unnecessary to go through it all and remove the xla references here.

**Notable changes**
The xla codegen includes some custom logging in each kernel wrapper, so I added a few new knobs to the external yaml, that we now test. I have a corresponding [xla-side PR](https://github.com/pytorch/xla/pull/2944) with the new yaml changes, which look like this:
```
per_op_log: XLA_FN_TRACK(3)
per_argument_log: TF_VLOG(3)
cpu_fallback_counter: XLA_COUNTER("aten::{name}", 1)
extra_headers: >
     #include <tensorflow/compiler/xla/xla_client/debug_macros.h>
     #include <tensorflow/compiler/xla/xla_client/metrics.h>
     #include <tensorflow/compiler/xla/xla_client/tf_logging.h>
     #include <torch_xla/csrc/function_call_tracker.h>
     #include <torch_xla/csrc/aten_xla_type.h>
     #include <torch_xla/csrc/aten_xla_type_default.h>
```

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28711095

Pulled By: bdhirsh

fbshipit-source-id: 90a48440f2e865a948184e2fb167ea240ada47bb
2021-05-26 12:52:13 -07:00
511979df85 Define the SYCL device version __assert_fail when the NDEBUG defined. (#58906)
Summary:
## Motivation
The utils in namespace `c10` require `__assert_fail` when NDEBUG is defined in kernel code.

The `__assert_fail` declaration in PyTorch is not compatible with the SYCL specification.

This causes a compile error when using these utils in SYCL kernels.

## Solution
Add an `__assert_fail` declaration for SYCL kernels to PyTorch when compiling SYCL kernels with `c10` utils.

## Additional context
`__assert_fail` in SYCL kernel

`extern SYCL_EXTERNAL void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func);`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58906

Reviewed By: anjali411

Differential Revision: D28700863

Pulled By: ezyang

fbshipit-source-id: 81896d022b35ace8cd16474128649eabedfaf138
2021-05-26 12:48:37 -07:00
2e2a75720b [structured] remainder (#58732)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58732

Reviewed By: gchanan

Differential Revision: D28666480

Pulled By: ezyang

fbshipit-source-id: f247365f2e6b3cdf29f7cc242f179041b968e75a
2021-05-26 12:44:32 -07:00
29487ac7ff Add 11.3 binaries without conda (#58877)
Summary:
Tested specifically for cuda 11.3 in https://github.com/pytorch/pytorch/pull/57522.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58877

Reviewed By: walterddr

Differential Revision: D28697703

Pulled By: janeyx99

fbshipit-source-id: 08ae7f7d023cb93e47a2e0a4f115cee9e8a6156a
2021-05-26 12:40:13 -07:00
24508337f4 Revert D28643215: Adds an aten::_ops namespace with unambiguous function names
Test Plan: revert-hammer

Differential Revision:
D28643215 (28740869a1)

Original commit changeset: 7b2b8459f1b2

fbshipit-source-id: ea869bf4cfde7038087e990b2cff5a86f9e2a531
2021-05-26 12:35:34 -07:00
12418a4f86 Back out "Revert D28664514: [pytorch][PR] various TensorIterator speed improvements"
Summary: Original commit changeset: fcad039b7dc8

Test Plan: Existing tests

Reviewed By: mruberry

Differential Revision: D28720186

fbshipit-source-id: 14ac99ee2d7cafb86b20c979f8917beeefd616e1
2021-05-26 12:22:48 -07:00
17fb651a3b Make torch.Tensor(torch.tensor(1.0)) work (#58885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58885

Fixes #58884

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28687510

Pulled By: ezyang

fbshipit-source-id: 81325f501cc3e83cbac02f7c44ded9d396356bb8
2021-05-26 11:33:05 -07:00
e24362746a [nnc] Concat input shapes must be known to fuse (#58974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58974

I don't know how we overlooked this for so long...
ghstack-source-id: 129932134

Test Plan:
Predictor test of model 184778294_0 using multiple request replay
threads.  It's not clear to me why multithreading matters, except that perhaps
it makes it easier to get an unknown shape in the profile.

Reviewed By: navahgar

Differential Revision: D28702660

fbshipit-source-id: 565550b1d2e571d62d0c8b21150193f2a7ace334
2021-05-26 11:29:26 -07:00
8398ebaa86 Revert D28664514: [pytorch][PR] various TensorIterator speed improvements
Test Plan: revert-hammer

Differential Revision:
D28664514 (8a28bbeeb9)

Original commit changeset: 2e03cf90b37a

fbshipit-source-id: fcad039b7dc823fec8afa694ab74a7ac5011f8ab
2021-05-26 10:49:58 -07:00
c06d2afa99 [caffe2] Add support for int32 lengths in BatchSparseToDense (#58062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58062

Make the function templated to make sure BatchSparseToDense supports int32 lengths/indices

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:batch_sparse_to_dense_op_test
```

Reviewed By: khabinov

Differential Revision: D28271423

fbshipit-source-id: 41b88b7a3663616b533aaf4731ff35cdf6ec4c85
2021-05-26 10:33:32 -07:00
444e195b6d Use docker base for clang-lint in CI (#58964)
Summary:
This PR introduces a docker base image to speed up `clang-tidy`'s dependency-installation stage. Originally I was looking into using the native GitHub Actions cache, but the dependencies are spread across many apt and pip installation steps, so consolidating them into a docker image works better. It shortens the dependency installation time from 4 min down to 1 min by pulling from the docker base image.

Base image used: https://github.com/pytorch/test-infra/pull/15

```
FROM nvidia/cuda:10.2-devel-ubuntu18.04

RUN apt-get update && apt-get upgrade -y
RUN apt install -y software-properties-common wget
RUN wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
RUN apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-11 main"
RUN apt-add-repository ppa:git-core/ppa

RUN apt-get update && apt-get upgrade -y && apt-get install -y git python3-dev python3-pip build-essential cmake clang-tidy-11
RUN update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-11 1000
RUN pip3 install pyyaml typing_extensions dataclasses

```

Previous successful run of clang-tidy: https://github.com/pytorch/pytorch/runs/2671193875?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58964

Reviewed By: samestep

Differential Revision: D28712536

Pulled By: zhouzhuojie

fbshipit-source-id: 0c48a605efe8574c104da6a0cad1a8b7853ba35e
2021-05-26 10:15:24 -07:00
b8d56572a1 Open json config file in context manager (#58077)
Summary:
* Open the json config file safely using a context manager (a with block).
* This makes sure that the file is closed even if an exception is raised, as sketched below.
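
A minimal sketch of the pattern (the file name is hypothetical):

```python
import json

with open("config.json") as f:
    config = json.load(f)
# The file is closed here even if json.load raised an exception.
```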

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58077

Reviewed By: anjali411

Differential Revision: D28711177

Pulled By: H-Huang

fbshipit-source-id: 597ba578311b1f1d6706e487872db4e784c78c3c
2021-05-26 08:58:40 -07:00
8130f2f67a DOC Adds code comment for _ConvNd.reset_parameters (#58931)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55741 by adding a comment regarding the behavior of `kaiming_uniform_`

The docstring is correct in this case. For example:

```python
import math
import matplotlib.pyplot as plt

import torch
import torch.nn as nn

in_channels = 120
groups = 2
kernel = (3, 8)
m = nn.Conv2d(in_channels=in_channels, groups=groups,
              out_channels=100, kernel_size=kernel)

k = math.sqrt(groups / (in_channels * math.prod(kernel)))
print(f"k: {k:0.6f}")

print(f"min weight: {m.weight.min().item():0.6f}")
print(f"max weight: {m.weight.max().item():0.6f}")
```

outputs:
```
k: 0.026352
min weight: -0.026352
max weight: 0.026352
```

And when we plot the distribution, it is uniform with the correct bounds:

```python
_ = plt.hist(m.weight.detach().numpy().ravel())
```

![Unknown](https://user-images.githubusercontent.com/5402633/119552979-21ba3800-bd69-11eb-8e10-e067c943abe3.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58931

Reviewed By: anjali411

Differential Revision: D28689863

Pulled By: jbschlosser

fbshipit-source-id: 98eebf265dfdaceed91f1991fc4b1592c0b3cf37
2021-05-26 08:39:40 -07:00
950e67fa43 [quant][refactor tests] Move test_qat_module into test_quantize_eager_qat (#58928)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58928

Test Plan:
python test/test_quantization.py TestConvBNQATModule

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28683925

fbshipit-source-id: 59d240d521c8067a344c9bdf4bec94e82f52e76f
2021-05-26 07:49:59 -07:00
cc07825a21 [quant][refactor tests] Split test_quantize into test_quantize_eager_ptq, test_quantize_eager_qat and test_fusion (#58927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58927

Part of larger re-factor of quantization tests to make it clearer as to which test belongs where.

proposed folder structure
```
test/quantization
         - bc/
            - test_backward_compatibility.py
         - core/
            - test_quantized_kernels.py
            - test_quantized_workflow_ops.py
            - test_quantized_tensor.py
            - test_workflow_module.py
         - eager/
            - test_quantize_eager_ptq.py
            - test_quantize_eager_qat.py
            - test_fusion.py
         - equalization/
            - test_equalize_eager.py
            - test_bias_correction_eager.py
         - fx/
           - test_quantize_fx.py
         - jit/
            - test_quantize_jit.py
            - test_fusion_passes.py
         - numeric_suite/
            - test_numeric_suite_fx.py
            - test_numeric_suite_eager.py
```

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28683926

fbshipit-source-id: f84a4271c77c418ce9751196241933ea8cc14913
2021-05-26 07:48:28 -07:00
28740869a1 Adds an aten::_ops namespace with unambiguous function names (#58092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58092

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
an non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: mruberry

Differential Revision: D28643215

Pulled By: zou3519

fbshipit-source-id: 7b2b8459f1b2eb5ad01ee7b0d2bb77639f77940e
2021-05-26 07:29:15 -07:00
032d6b0643 Revert D28112689: CUDA support in the CSR layout: constructors
Test Plan: revert-hammer

Differential Revision:
D28112689 (1416e57465)

Original commit changeset: f825cd4bce40

fbshipit-source-id: 421fc590797ac5fab6a55ac6f213361fbba7cd5b
2021-05-26 06:15:05 -07:00
bbdc428db2 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28704311

fbshipit-source-id: f089266771c1ceba127116638a4dd87aa21e2e27
2021-05-26 03:19:49 -07:00
9ba9a16700 [PyTorch Edge] Use stream as backport_vi_to_vi-1 interface (#58790)
Summary:
Two main changes:
1. Change the arguments of the backport_v{i}_to_v{i-1} collection from (reader, writer) to (input_model_stream, output_model_stream), so it's easier to backport a model using option 2.

>  2) [Both format and content change] Use torch.jit.load() to load the stream,
 and save it to output_model_stream.

2. Fix an issue in the test `backportAllVersionCheck`. Previously it declared `std::ostringstream oss` and used `oss.clear()` to reset the stringstream. However, the `clear()` function doesn't reset the stream content, which caused a problematic stream. As a mitigation, checks are added to prevent a corrupted stream in each iteration of the while loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58790

ghstack-source-id: 129929960

Test Plan:
CI
```
buck test mode/dev //caffe2/test/cpp/jit:jit
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28620961

fbshipit-source-id: b0cbe0e88645ae278eb3999e2a84800702b5f985
2021-05-26 02:07:46 -07:00
1416e57465 CUDA support in the CSR layout: constructors (#57274)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57274

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28112689

Pulled By: bhosmer

fbshipit-source-id: f825cd4bce402dd4c3f71db88854f77830b687b8
2021-05-26 01:36:20 -07:00
be4ba29d49 Detect overflow in numel of sparse COO tensor (#57492)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57492

Reviewed By: albanD

Differential Revision: D28273649

Pulled By: mruberry

fbshipit-source-id: 08ba50509556df1981d7ede025d84a836d2e8e5e
2021-05-25 22:16:21 -07:00
948df6c7a9 [numpy] torch.i0: promote integer inputs to float (#52735)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/42515
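
A small sketch of the new promotion behavior:

```python
import torch

out = torch.i0(torch.arange(3))  # integer (int64) input
print(out.dtype)                 # promoted to the default float dtype
```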

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52735

Reviewed By: zou3519

Differential Revision: D28630505

Pulled By: mruberry

fbshipit-source-id: e81a35dfc1a322daf0c44718901470fac677bc94
2021-05-25 22:02:00 -07:00
49c2da0ee0 [testing] improve broadcasts_input error message (#58295)
Summary:
Context:
The error message when `broadcasts_input` is marked incorrectly is uninformative (see "Error Currently" below)
https://github.com/pytorch/pytorch/pull/57941#discussion_r631749435

Error Currently
```
Traceback (most recent call last):
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 326, in test_variant_consistency_eager
    _test_consistency_helper(samples, variants)
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 310, in _test_consistency_helper
    variant_forward = variant(cloned,
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 227, in __exit__
    self._raiseFailure("{} not raised".format(exc_name))
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 164, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: RuntimeError not raised
```

Error After PR
```
Traceback (most recent call last):
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 329, in test_variant_consistency_eager
    _test_consistency_helper(samples, variants)
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 313, in _test_consistency_helper
    variant_forward = variant(cloned,
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 227, in __exit__
    self._raiseFailure("{} not raised".format(exc_name))
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 164, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: RuntimeError not raised : inplace variant either allowed resizing or you have marked the sample SampleInput(input=Tensor, args=(tensor([[[ 2.1750, -8.5027, -3.1403, -6.9942,  3.2609],
         [-2.5057, -5.9123, -5.4633,  6.1203, -8.2124],
         [-3.5802, -8.4869, -6.0700,  2.3431, -8.1955],
         [-7.3316,  1.3248, -6.8661,  7.1483, -8.0719],
         [ 4.5977, -4.0448, -6.2044, -2.1314, -8.4956]],

        [[ 3.2769, -8.4360,  1.2826,  7.1749,  4.7653],
         [-0.2816, -2.5997, -4.7659, -3.7814,  3.9704],
         [-2.1778, -3.8117, -6.0276, -0.8423, -5.9646],
         [ 8.6544, -3.0922,  0.2558, -4.9318, -4.7596],
         [ 4.5583,  4.3830,  5.8793,  0.9713, -2.1481]],

        [[-1.0447,  0.9334,  7.6405, -4.8933, -7.4010],
         [ 7.7168, -8.4266, -5.5980, -6.9368,  7.1309],
         [-8.7720, -5.0890, -0.4975,  1.9518,  1.7074],
         [-8.5783,  8.5510, -8.5459, -3.5451,  8.4319],
         [ 8.5052, -8.9149, -6.6298, -1.2750, -5.7367]],

        [[-6.5625,  8.2795, -4.9311,  1.9501, -7.1777],
         [-8.4035,  1.1136, -7.6418, -7.0726, -2.8281],
         [ 4.2668, -0.2883, -6.2246,  2.3396,  1.2911],
         [ 4.6550, -1.9525,  4.4873, -3.8061, -0.8653],
         [-3.4256,  4.4423,  8.2937, -5.3456, -4.2624]],

        [[ 7.6128, -6.3932,  4.7131, -5.4938,  6.4792],
         [-6.5385,  2.4385,  4.5570,  3.7803, -8.3281],
         [-2.9785, -4.4745, -1.1778, -8.9324,  1.3663],
         [ 3.7437,  3.5171, -6.3135, -8.4519, -2.7033],
         [-5.0568, -8.4630, -4.2870, -3.7284, -1.5238]]], device='cuda:0',
       dtype=torch.float32, requires_grad=True),), broadcasts_input=True) incorrectly with `broadcasts_self=True
```

**NOTE**:
Printing the sample is very verbose, and it may be hard to figure out which sample is incorrectly configured if there are multiple samples with similar input shapes.

Two options to make this error less verbose:
* Don't print the sample and just print `inplace variant either allowed resizing or you have marked one of the samples incorrectly with broadcasts_self=True`
* Have some mechanism to name samples which will be printed in the `repr` (which will need extra machinery)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58295

Reviewed By: ngimel

Differential Revision: D28627308

Pulled By: mruberry

fbshipit-source-id: b3bdeacac3cf9c0d984f0b85410ecce474291d20
2021-05-25 21:14:17 -07:00
083d3bb93b [torch][repeat_interleave] Add to exception list in backward compat check (#58966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58966

Same as title.

Test Plan: CI since updated the check

Reviewed By: ngimel

Differential Revision: D28699577

fbshipit-source-id: 436fdc648a4c653081ff0e1b6b809c4af742055a
2021-05-25 20:04:50 -07:00
26c1f0f72e [skip ci] Skip debug info on PRs (#58897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58897

We don't need to be building debug info on PRs since it's just filling up S3/CircleCI storage with useless 800 MB zips; this flips it so it's only run on master + release branches. See #58898 for the CI signal

Also see pytorch/builder counterpart (unlike the last debuginfo PR there is no hard dependency between these two so there won't be any churn on un-rebased PRs): https://github.com/pytorch/builder/pull/778

Test Plan: Imported from OSS

Reviewed By: seemethere, samestep

Differential Revision: D28689413

Pulled By: driazati

fbshipit-source-id: 77a37e84afe492215008d5e023ceab0c24adb33c
2021-05-25 17:01:51 -07:00
32273e806a Ensure NativeFunctions.h codegen output is deterministic (#58889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58889

fixes https://github.com/pytorch/pytorch/issues/58796

Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing `ccache` not to re-use its cached output.

I built from the commit referenced in https://github.com/pytorch/pytorch/issues/58796 a few times and ran `diff -Naur` on the codegen output in `build/aten/src/ATen`. After a few tries, `NativeFunctions.h` had a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic.

That looks like it's coming from my calling `set()` to filter out duplicate NativeFunction declarations. The earlier version of the codegen also called `set()` to filter out duplicates, but it did so individually for each `NativeFunction` object, before merging the groups (I'm not too sure why this didn't introduce non-determinism before, though). With the refactor from https://github.com/pytorch/pytorch/pull/57361, we're calling `set()` on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28675941

Pulled By: bdhirsh

fbshipit-source-id: bb66de00aafeeb9720d85e8156ac9f7539aed0d6
2021-05-25 16:33:03 -07:00
db5e5781ad replace all remaining occurrences of deadline=1000, to prevent test flakiness
Summary: Per title

Test Plan: Fixes existing tests

Reviewed By: robieta

Differential Revision: D28690296

fbshipit-source-id: d7b5b5065517373b75d501872814c89b24ec8cfc
2021-05-25 15:55:30 -07:00
60af6e928a [PyTorch Edge][Version] Fix torchscript model after backport (#58892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58892

The torchscript model after backport is missing the `constants` archive. Add it back, and extend the unit test to run the torchscript part.
ghstack-source-id: 129853819

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit
- LiteInterpreterTest.BackPortByteCodeModelAllVersions'
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28664507

fbshipit-source-id: 5f98723231cc64ed203c062ee6f00d8adbdccf77
2021-05-25 15:36:56 -07:00
fb120493b1 Make Scalar.to<> for invalid types a compile-time error (#58726)
Summary:
Currently calling `scalar.to<std::complex<double>>()` for example compiles but throws an error at runtime. Instead, marking the non-specialized cases as `= delete` means the code fails to compile and you catch the error sooner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58726

Reviewed By: zou3519, seemethere

Differential Revision: D28646057

Pulled By: ezyang

fbshipit-source-id: 9e4e3d1b4586eeecbb73db61bba56560b2657351
2021-05-25 15:34:01 -07:00
36a77580f5 [docs] Clarify batch_first behavior for nn.LSTM, nn.RNN, and nn.GRU (#58809)
Summary:
Fixes the high-pri doc component of https://github.com/pytorch/pytorch/issues/4145.

To make the input / output shapes more readable for both `batch_first` states, this PR also introduces short dim names. Opinions welcome on the readability of the restructured docs!

Screenshot for `nn.LSTM`:
<img width="791" alt="Screen Shot 2021-05-24 at 5 11 39 PM" src="https://user-images.githubusercontent.com/75754324/119408130-389e5300-bcb3-11eb-9a4f-1df96a0a4d70.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58809

Reviewed By: gchanan

Differential Revision: D28685415

Pulled By: jbschlosser

fbshipit-source-id: e8c92e3d7e052071a505b55dca976fd2ef5a8307
2021-05-25 15:27:17 -07:00
7179e7ea7b [CMake] Prefer third_party/pybind11 by default (#58951)
Summary:
To make the build behaviour aligned with other third_party/ libraries,
introduce the `USE_SYSTEM_PYBIND11` build option, which is set to OFF by
default; this means PyTorch will be built with the bundled pybind11 even if
another version is already installed locally.

Fixes https://github.com/pytorch/pytorch/issues/58750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58951

Reviewed By: driazati

Differential Revision: D28690411

Pulled By: malfet

fbshipit-source-id: e56b5a8f2a23ee1834b2a6d3807f287149decf8c
2021-05-25 15:10:17 -07:00
45aa54d83c relax test deadlines
Summary: Relax test deadlines for c2 tests. We run on loaded machines, and timings are unreliable.

Test Plan: Fixes existing tests

Reviewed By: mruberry

Differential Revision: D28690006

fbshipit-source-id: 457707e81a1ec92548c1f23ea7a0022fa0a3bfda
2021-05-25 15:02:52 -07:00
b4b95fc87a Expose cudaMemGetInfo (#58635)
Summary:
This PR resolves the second issue outlined in https://github.com/pytorch/pytorch/issues/58376, which has previously been discussed in https://github.com/pytorch/pytorch/issues/50722.

`cudaMemGetInfo` is bound/exposed to the Python API. An example function call is provided below:

```
import torch

device_free, device_total = torch.cuda.mem_get_info(torch.device('cuda:0'))
print(device_free, device_total)
```

In `CUDACachingAllocator.cpp`, in contrast to my initial PR, the newly defined function `std::pair<size_t, size_t> raw_cuda_mem_get_info(int device)` has been moved from the `CUDACaching` namespace to the `cuda` namespace. In addition, as suggested by ezyang, `det` has been removed from all function names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58635

Reviewed By: zou3519

Differential Revision: D28649093

Pulled By: ezyang

fbshipit-source-id: d8b7c53e52cf73f35495d8651863c5bb408d7a6a
2021-05-25 14:58:35 -07:00
133133afa8 [PyTorch] Extract non-template parts of torch::class_ (#54548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54548

We don't need to inline most of this class; doing so bloats code size and build time.
ghstack-source-id: 129765666

Test Plan:
Existing CI

buildsizebot some mobile apps

Reviewed By: jamesr66a

Differential Revision: D27277317

fbshipit-source-id: 7643aa35e4d794fee0a48a3bbe0890c2e428ae78
2021-05-25 14:47:00 -07:00
ec89bf6535 .github: Ensure 7zip install for windows (#58924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58924

Was observing behavior where 7zip was nowhere to be found after a build
completed. Let's just have 7zip installed within the workflow as well,
to be completely sure it is there.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28681241

Pulled By: seemethere

fbshipit-source-id: f649c1713edcdeb82c84fd67866700caa2726d71
2021-05-25 13:58:35 -07:00
ede3f5421f [Pytorch Delegated Backend] Save function name in debug info (#57481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57481

This diff introduces the function name to InlinedCallStack.
Since we are using InlinedCallStack for debug information in the lite
interpreter as well as delegate backends, where InlinedCallStack cannot
be constructed from model source code, we need to save the function name.
In the absence of a function name, Function* is used to get the name of
the function; this is the case when the JIT compiles code at runtime.
When that is not possible, this diff introduces a way to obtain the
function name.

Test Plan:
test_backend
test_cs_debug_info_serialization

test_backend
test_cs_debug_info_serialization

Imported from OSS

Differential Revision:
D28159097
D28159097

Reviewed By: raziel, ZolotukhinM

Pulled By: kimishpatel

fbshipit-source-id: deacaea3325e27273f92ae96cf0cd0789bbd6e72
2021-05-25 13:19:02 -07:00
813adf1076 [Pytorch Delegated Backend] Save operator name and function name in (#57441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57441

debug info

Previous diffs did not save the operator name in debug info. For delegated
backends that only identify an op for profiling via its debug handle, the
operator name should be stored as well.
Furthermore, to complete the debug information, also serialize the function name.

Test Plan:
Existing lite interpreter and backend tests

Existing lite interpreter and backend tests

Imported from OSS

Differential Revision:
D28144581
D28144581

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 415210f147530a53b444b07f1d6ee699a3570d99
2021-05-25 13:17:54 -07:00
a7a5992d7d Add no-grad inference mode note (#58513)
Summary:
Adds a note explaining the difference between several often conflated mechanisms in the autograd note
Also adds a link to this note from the docs in `grad_mode` and `nn.module`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58513

Reviewed By: gchanan

Differential Revision: D28651129

Pulled By: soulitzer

fbshipit-source-id: af9eb1749b641fc1b632815634eea36bf7979156
2021-05-25 13:06:54 -07:00
5268b5a29a Add parsing logic for Tuple[()] annotation (#58340)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58340
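
A minimal sketch of the annotation this enables (assuming `torch.jit.script` as the entry point):

```python
import torch
from typing import Tuple

@torch.jit.script
def passthrough(x: Tuple[()]) -> Tuple[()]:
    # Tuple[()] denotes the empty-tuple type; parsing it previously failed.
    return x

print(passthrough(()))  # ()
```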

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28459502

Pulled By: ansley

fbshipit-source-id: 4bb188448d66269b42b068858b895debac86e9ee
2021-05-25 12:12:43 -07:00
b9d1ad9c78 OpInfo: diag_embed, diagonal (#58642)
Summary:
See: https://github.com/pytorch/pytorch/issues/54261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58642

Reviewed By: ngimel

Differential Revision: D28627226

Pulled By: mruberry

fbshipit-source-id: b96fa8410bd53937ddb72a46c02b949691ee9458
2021-05-25 11:52:53 -07:00
f976275858 Run pthreadpool with _NoPThreadPoolGuard on the same thread (#58759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58759

* Makes `pthreadpool()->run` respect `_NoPThreadPoolGuard`
   Runs tasks on the same thread instead of parallelizing when the guard is present

Test Plan:
buck build //xplat/caffe2:aten_test_test_thread_pool_guard
./buck-out/last/aten_test_test_thread_pool_guard

Reviewed By: kimishpatel

Differential Revision: D28597425

fbshipit-source-id: 0365ad9947c239f5b37ce682802d4d401b8b0a48
2021-05-25 11:39:05 -07:00
b703f1e02d [NNC] Add documentation for splitWith APIs (#58270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58270

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427226

Pulled By: navahgar

fbshipit-source-id: 39635e985095c7b581452464d7a515c6f86b24e8
2021-05-25 11:32:53 -07:00
dd7bbe1a63 [NNC] Make splitWithMask transform in-place (#58269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58269

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427227

Pulled By: navahgar

fbshipit-source-id: 4e38a436abcf4752fd7ef6ab3666876eec6ea5ba
2021-05-25 11:32:51 -07:00
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
6b6a27e430 [jit] Add Python API for ScriptProfile (#57398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57398

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28133577

fbshipit-source-id: dcb8338159a24b00b5c495ecec66a3303d9b4aba
2021-05-25 11:09:18 -07:00
c88333484f [resubmit] masked_scatter thrust->cub (#58865)
Summary:
See ae7760cf50bb2cddff4663a07b9d68decf4b6c75 for the fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58865

Reviewed By: mruberry

Differential Revision: D28657940

Pulled By: ngimel

fbshipit-source-id: 9155c710b0e18ebb3bfa2dabfdd117355ac30840
2021-05-25 11:00:50 -07:00
fedf6f2db2 Check memory overlap in sort for large input sizes (#58327)
Summary:
The downstream cub sort doesn't support inplace sorting; this PR adds a check to bail out to allocating a new tensor instead of silently corrupting the returned indices.

CC ngimel zasdfgbnm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58327

Reviewed By: mruberry

Differential Revision: D28661244

Pulled By: ngimel

fbshipit-source-id: 40617a7d3bfcebbe187bb706b6b753371bb99097
2021-05-25 10:57:31 -07:00
7eade660c6 [PyTorch] Reduce errors of foreach functions (#56993)
Summary:
This is based on  https://github.com/pytorch/pytorch/issues/48224.

To make `foreach` more flexible, this PR pushes unsupported cases to the slow path (see the sketch after the list below).
Also, this adds some tests to verify that
- `foreach` functions work with tensors of different dtypes and/or memory layouts in 7bd4b2c89f
- `foreach` functions work with tensors on different devices in a list, but are on the same device if the indices are the same: def4b9b5a1
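
For example, a call like the following (via the private `_foreach_*` entry points) should now take the slow path rather than erroring (a sketch, assuming CPU tensors):

```python
import torch

tensors = [torch.randn(3), torch.randn(3, dtype=torch.float64)]  # mixed dtypes
out = torch._foreach_add(tensors, 1)  # falls back to the slow path
print([t.dtype for t in out])
```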

Future plans:
1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary|pointwise|minmax)_db`.
2. Support broadcasting in slow path. Ref:  https://github.com/pytorch/pytorch/pull/52448
3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449

CC: ngimel mcarilli  ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993

Reviewed By: zou3519

Differential Revision: D28630580

Pulled By: ngimel

fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19
2021-05-25 10:50:20 -07:00
8a28bbeeb9 various TensorIterator speed improvements (#58810)
Summary:
1) remove pushing back to the strides vector for 1D tensors; those strides are never used in the loop anyway
2) avoid calling get_data_ptrs unless necessary
3) don't call into assert_no_partial_overlap if the TensorImpls are the same (assert_no_partial_overlap has this comparison too, but only after a couple of nested function calls)
4) use is_non_overlapping_and_dense instead of is_contiguous in the memory-overlap check (which, for some reason, is faster than is_contiguous, though I expected that after is_contiguous was de-virtualized they would be the same).

Altogether, brings instruction count down from ~110K to 102735 for the following binary inplace benchmark:
```
In [2]:  timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
   ...:  stats=timer.collect_callgrind(number=30, repeats=3)
   ...:  print(stats[1].as_standardized().stats(inclusive=False))
```
similar improvements for unary inplace.

Upd: returned stride packing for now; the count is now 104295, so packing is worth ~52 instructions. We should think about how to remove it safely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810

Reviewed By: bhosmer

Differential Revision: D28664514

Pulled By: ngimel

fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
2021-05-25 10:44:51 -07:00
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375
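A minimal usage sketch (not from the PR), assuming the new activation is exposed as `torch.nn.functional.mish`; Mish is defined as `x * tanh(softplus(x))`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4)
# The dedicated op should match the composed definition up to rounding error.
assert torch.allclose(F.mish(x), x * torch.tanh(F.softplus(x)))
```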

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
bf269fdc98 Re-enable torchdeploy oss tests and move to per-PR cuda11 job (#58872)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58872

Test Plan: verify tests running on CI as expected

Reviewed By: suo

Differential Revision: D28646660

fbshipit-source-id: eb7d784844fb7bc447b4232e2f1e479d4d5aa72f
2021-05-25 10:05:32 -07:00
19bcbfc5cf [c10d] Use pg wrapper in detailed debug mode (#58281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281

When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.

As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.

Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.
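A minimal single-process sketch (not from the PR) of how the wrapper gets picked up; the address/port are illustrative, and the env var must be set before process group creation:

```python
import os
import torch
import torch.distributed as dist

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # enable the wrapper
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
# The returned default pg behaves exactly like a plain gloo pg, but every
# collective first runs the desync/consistency checks described above.
dist.all_reduce(torch.ones(2))
```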

Open to suggestions on how to test better. Currently I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=detail creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand testing everything with debug=detail and the regular tests might be too much, so we have only added it to a few tests for now. We also do have tests in the below diff.
ghstack-source-id: 129817857

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28402301

fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
2021-05-25 09:55:52 -07:00
aad2ad883a Disable test_nccl_errors_blocking_abort (#58921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58921

Reviewed By: ezyang

Differential Revision: D28680061

Pulled By: malfet

fbshipit-source-id: bab4a28f054ed26bcd6431576b60268ba4db8e6b
2021-05-25 09:50:24 -07:00
470160ad78 [Pytorch] Update FuseLinear to map source range information (#58492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58492

Update graph rewrite to specify how values in replacement pattern should
map to values in original pattern for fuse_linear pass

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_fuse_linear

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28512464

fbshipit-source-id: 250a69cebc11eb4328a34c8f685b36e337439aae
2021-05-25 09:19:57 -07:00
e067675167 [Pytorch] Provide API to preserve source range and callstack information during graph rewrite (#58300)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58300

Current state: During graph rewriting that can fuse nodes or add nodes
result in new nodes without debug information that was available in
original node. Thus we lose this information during graph rewrite.

This PR changes graph rewriting API to let user specify how the values
in the replacement pattern map to values in the pattern to be matched.
Then the graph rewriting will copy source range and inlined callstack
from the matched nodes onto the nodes being inserted.

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_jit.py
TestJit.test_pattern_based_rewrite_with_source_range_preserved

Imported from OSS

Reviewed By: malfet

Differential Revision: D28512465

fbshipit-source-id: 863173c29de726be85b3acbd3ddf3257eea36d13
2021-05-25 09:18:59 -07:00
2ef9a1df22 Increase minimum number of warmup runs to 2 (#58801)
Summary:
The JIT will typically need two warmup runs to do profiling and optimization.
This is not a perfect solution, but it will substantially reduce the number of surprised people when the docs say torch.utils.benchmark.Timer takes care of warmup.
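For example (a minimal sketch, not from the PR), timing a TorchScript function now gets at least two warmup invocations before measurement:

```python
import torch
from torch.utils.benchmark import Timer

@torch.jit.script
def f(x):
    return x * 2 + 1

t = Timer("f(x)", setup="x = torch.ones(8)", globals={"f": f, "torch": torch})
# timeit() now performs >= 2 warmup runs first, so the JIT has a chance to
# profile and optimize before the measured iterations.
print(t.timeit(100))
```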

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58801

Reviewed By: desertfire

Differential Revision: D28644244

Pulled By: robieta

fbshipit-source-id: cc54ed019e882a379d6e4a0c6a01fd5873dd41c3
2021-05-25 08:38:52 -07:00
09a1b1cf87 Forward AD formulas batch 1 (#57768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57768

Note that this PR implements formulas only for ops that are supported by OpInfo.

Test Plan: Imported from OSS

Reviewed By: zou3519, malfet

Differential Revision: D28387766

Pulled By: albanD

fbshipit-source-id: b4ba1cf1ac1dfd46cdd889385c9c2d5df3cf7a71
2021-05-25 07:29:25 -07:00
b4f3a989da [torch][repeat_interleave] Fix ambiguous function call (#58881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58881

A new parameter was recently added to the function in PR: https://github.com/pytorch/pytorch/pull/58417

However, this introduced ambiguity when making the call below:
  some_tensor.repeat_interleave(some_integer_value)

Making the new parameter optional avoids the issue.
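A minimal sketch (not from the PR) of the call that became ambiguous, assuming the new parameter is the output size added in #58417:

```python
import torch

t = torch.tensor([1, 2, 3])
# With the new parameter optional, the single-integer overload resolves
# cleanly to "repeats" instead of clashing with the added argument.
print(t.repeat_interleave(2))  # tensor([1, 1, 2, 2, 3, 3])
```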

Reviewed By: ezyang, ngimel

Differential Revision: D28653820

fbshipit-source-id: 5bc0b1f326f069ff505554b51e3b24d60e69c843
2021-05-25 00:31:32 -07:00
3dbfaddfa1 Port elu_backward to structured (#58660)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58660

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572528

Pulled By: ezyang

fbshipit-source-id: 12265c287f178f9435d5d96f3bba49082d9e7f2c
2021-05-25 00:14:13 -07:00
5850553bc0 Port hardsigmoid_backward to structured (#58484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58484

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572529

Pulled By: ezyang

fbshipit-source-id: aee125aa59a1f2b1ddb0c29a287097d866121379
2021-05-25 00:14:12 -07:00
3f0b7e0feb Port leaky_relu_backward to structured (#58483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58483

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572526

Pulled By: ezyang

fbshipit-source-id: a73bdf06967687dbb1d4fbb0f2ca80115db57a07
2021-05-25 00:14:10 -07:00
ad27513430 Port softplus to structured (#58482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58482

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28571059

Pulled By: ezyang

fbshipit-source-id: a1065294f3c459e7c99aaed9edb09f88705f58e9
2021-05-25 00:12:57 -07:00
0b8931fe4b [torch][JIT] Predicate uses of RPC APIs on torch.distributed.rpc.is_available() (#58887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58887

There are some callsites of `torch.distributed.rpc.XXX` APIs that are compiled
or not based on `USE_RPC`. However, `torch::deploy`, at least for now,
is compiled with `USE_RPC=1`, but the `torch.distributed.rpc.XXX` APIs used by
the aforementioned pieces of code are not available (i.e.
`torch.distributed.rpc.is_available()` returns `False`). This can cause
Torchscript compilation to fail, even if the code being compiled doesn't use
RPC.

This commit fixes this problem (at least temporarily) by predicating the use
of all these `torch.distributed.rpc` APIs on the value of
`torch.distributed.rpc.is_available()`.
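A minimal sketch (not from the PR) of the guard pattern:

```python
import torch.distributed.rpc as rpc

# Only reference RPC APIs when the current build actually provides them;
# under torch::deploy, is_available() returns False even with USE_RPC=1.
if rpc.is_available():
    from torch.distributed.rpc import RRef  # safe to touch RPC symbols here
```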

Test Plan: Ran packaged XLM-R model with C++ benchmark.

Reviewed By: suo

Differential Revision: D28660925

fbshipit-source-id: fbff7c7ef9596549105e79f702987a53b04ba6f9
2021-05-24 21:53:53 -07:00
c502f49535 Fix failing torch deploy tests and reenable. (#58871)
Summary:
Fix is simple; alias inputs before feeding them to distinct
torchdeploy interpreters.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes https://github.com/pytorch/pytorch/issues/58832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58871

Reviewed By: wconstab, zou3519

Differential Revision: D28646784

Pulled By: ezyang

fbshipit-source-id: 6d2850f3226b5b99468d1465723b421ce4d7ab89
2021-05-24 20:27:41 -07:00
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (i.e. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
c00eefb6c7 [Static Runtime] Clean up and fix bugs in Static Runtime (#58829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829

- Delete copying and moving of MemoryPlanner.
- Remove `inline` in some of the member functions because member functions implemented in classes are inline by default.
- Clean up and update comments.
- Reorganize some code.

Reviewed By: edvgha

Differential Revision: D28555476

fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
2021-05-24 19:46:58 -07:00
de845020a0 fix docstring for fusing functions (#58638)
Summary:
This PR fixes docstrings of fusing functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58638

Reviewed By: H-Huang

Differential Revision: D28584501

Pulled By: jerryzh168

fbshipit-source-id: 77a53a709d968df8ba8f5b613ad7cf225ba2826a
2021-05-24 18:27:22 -07:00
2b0ec9c3cf Reapply "[jit] Implement ScriptProfile to collect instruction profiles." (#58783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58783

This reverts commit fc804b5def5e7d7ecad24c4d1ca4ac575e588ae8.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28617037

Pulled By: zhxchen17

fbshipit-source-id: 645de2ede20500a5c218d6ec3c7faae94de37a14
2021-05-24 18:23:21 -07:00
705dd9ffac [PyTorch] Migrate remaining stray uses of TI:add_output to borrowing (#58605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58605

Found a few more by grepping.
ghstack-source-id: 129730281

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28523254

fbshipit-source-id: 317baea88885586c5106c8335ebde0d8802a3532
2021-05-24 17:34:54 -07:00
12bb1e86ed Make c10::ThreadPool::available_ atomic. (#58457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58457

This variable had concurrent read/write access without any
synchronization. The issue was caught and reported by TSAN.
ghstack-source-id: 129311384

Test Plan:
1) Verify test locally.
2) waitforbuildbot.

Reviewed By: ezyang

Differential Revision: D28498116

fbshipit-source-id: 89af068467fed64c131d743504c0cecf3017d638
2021-05-24 17:29:44 -07:00
a5250425e0 [quant] Eager mode equalization support for ConvReLU and LinearReLU (#58792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58792

Enabling support for fused modules like ConvReLU or LinearReLU on eager mode cross-layer equalization.

Test Plan:
`python test/test_quantization.py TestEqualizeEager`

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28647242

fbshipit-source-id: 286e057ce70aa7de45d575afd6c13e55120ff18a
2021-05-24 17:25:13 -07:00
b593dd2027 [Gradient Compression] Re-enable test_ddp_hook_parity_powerSGD on Gloo backend (#58882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58882

Re-enable this test since https://github.com/facebookincubator/gloo/pull/309 is already picked up by Gloo submodule.
ghstack-source-id: 129760436

Test Plan: waitforbuildbot

Reviewed By: agolynski

Differential Revision: D28654433

fbshipit-source-id: dfc002936e88c074be529d6024f889214130b1b9
2021-05-24 16:52:54 -07:00
a566005679 [skip ci] Update readme to use hud.pytorch.org (#58835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58835

Pulled By: driazati

Reviewed By: seemethere

Differential Revision: D28632504

fbshipit-source-id: 867f061be039bc63c1478b1b1eed8c0380e94faa
2021-05-24 15:02:18 -07:00
f29e75c4dc [reland][quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer (#58455) (#58756)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58756

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: supriyar

Differential Revision: D28607564

fbshipit-source-id: 979cf165941bb3a9044d03077a170b5ea64dc36a
2021-05-24 14:57:45 -07:00
76f03bc42f Fix torch.finfo.bits typo in stub (#58819)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58818.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58819

Reviewed By: walterddr, malfet

Differential Revision: D28641587

Pulled By: ezyang

fbshipit-source-id: b19b519db43f2075c64f4f9ba922310f2561ca70
2021-05-24 14:52:49 -07:00
bc2ee078d1 Update Gloo submodule (#58853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58853

Reviewed By: pbelevich, SciPioneer

Differential Revision: D28642642

Pulled By: agolynski

fbshipit-source-id: 8c31f9ab86c5f3063733199474022e7e2c6e9a2f
2021-05-24 14:23:41 -07:00
51b7224f8f [vulkan] Add max_pool2d op (#58806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58806

Adds the max_pool2d op to Vulkan.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D28625049

fbshipit-source-id: 75c82a84f0eca51627e33a6182ef51cb7e82e068
2021-05-24 14:16:19 -07:00
a679bb5ecf Refactor local lint (#58798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58798

In #58623 there was a bug in `make quicklint` where ShellCheck would run on the entire repo when there were no files. This PR fixes that by refactoring out the common logic (such as skipping quicklint when there are no files and letting checks do their own file filtering) and pushing it into a runner class.

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28649889

Pulled By: driazati

fbshipit-source-id: b19f32cdb63396c806cb689b2f6daf97e1724d44
2021-05-24 13:52:53 -07:00
a7f4f80903 ENH Adds dtype to nn.functional.one_hot (#58090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33046
Related to https://github.com/pytorch/pytorch/issues/53785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58090

Reviewed By: zou3519

Differential Revision: D28640893

Pulled By: jbschlosser

fbshipit-source-id: 3686579517ccc75beaa74f0f6d167f5e40a83fd2
2021-05-24 13:48:25 -07:00
e4be80c1b8 simplify cpu_kernel to not have contiguous special case (#58830)
Summary:
Per title
`unroll_contiguous_scalar_checks` tries to verify that all arguments (including outputs) are contiguous except at most one scalar (with stride 0). It then calls the passed lambda with the index of the scalar argument if this verification succeeded, or with 0 if the arguments were not contiguous or there was no scalar. Depending on the value of this index (with 0 = not found), a different function can be called: in vectorized kernels it's the vectorized loop if the arguments are contiguous plus a scalar, and the basic loop otherwise. This makes sense for the vectorized kernel (the vectorized loop can still be used in some broadcast cases), but all the others (cpu_kernel, serial_cpu_kernel, cpu_kernel_multiple_outputs) don't even use the idx argument in the lambda, so regardless of what `unroll_contiguous_scalar_checks` does, they will do the same thing. There is no point in calling `unroll_contiguous_scalar_checks` then.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58830

Reviewed By: zou3519, mruberry

Differential Revision: D28632668

Pulled By: ngimel

fbshipit-source-id: c6db3675933184e17cc249351c4f170b45d28865
2021-05-24 12:07:29 -07:00
1c5f63d86d [Pytorch Edge] Model Ops compatibility api (#57501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57501

Add an api _get_model_ops_and_info to get root operators and versioning info of a model in both cxx and python, and the input can be from a file path or buffer.
ghstack-source-id: 129620112

Test Plan: unit test.

Reviewed By: xcheng16, raziel

Differential Revision: D28162765

fbshipit-source-id: 4413c1e906b8a872e4a717d849da37347adbbea4
2021-05-24 12:00:06 -07:00
2a456e4f49 [skip ci] Move debug wheels out of package dir before test (#58685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58685

This moves debug packages out of the artifacts dir before running tests (as a counterpart to https://github.com/pytorch/builder/pull/770). Doing it this way allows us to keep the CI configs simple since there's one directory to use for artifacts / upload to S3.

See #58684 for actual CI signals (the ones on this PR are all cancelled since it depends on the builder branch set in the next PR up the stack)

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28646995

Pulled By: driazati

fbshipit-source-id: 965265861968906770a6e6eeecfe7c9458631b5a
2021-05-24 11:46:37 -07:00
2733555ed1 replace all_gather with more efficient collective api _all_gather_base (#57769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57769

_all_gather_base saved copies in all_gather, so it is more efficient

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D28227193

fbshipit-source-id: ddd8590095a5b45676497a71ed792a457f9825c6
2021-05-24 11:34:45 -07:00
c58709b7bb Helper function for skipping module parameter / buffer initialization (#57555)
Summary:
This PR introduces a helper function named `torch.nn.utils.skip_init()` that accepts a module class object + `args` / `kwargs` and instantiates the module while skipping initialization of parameter / buffer values. See discussion at https://github.com/pytorch/pytorch/issues/29523 for more context. Example usage:

```python
import torch

m = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1)
print(m.weight)

m2 = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1, device='cuda')
print(m2.weight)

m3 = torch.nn.utils.skip_init(torch.nn.Linear, in_features=5, out_features=1)
print(m3.weight)
```
```
Parameter containing:
tensor([[-3.3011e+28,  4.5915e-41, -3.3009e+28,  4.5915e-41,  0.0000e+00]],
       requires_grad=True)
Parameter containing:
tensor([[-2.5339e+27,  4.5915e-41, -2.5367e+27,  4.5915e-41,  0.0000e+00]],
       device='cuda:0', requires_grad=True)
Parameter containing:
tensor([[1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       requires_grad=True)
```

Bikeshedding on the name / namespace is welcome, as well as comments on the design itself - just wanted to get something out there for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57555

Reviewed By: zou3519

Differential Revision: D28640613

Pulled By: jbschlosser

fbshipit-source-id: 5654f2e5af5530425ab7a9e357b6ba0d807e967f
2021-05-24 11:28:32 -07:00
277f587496 rename benchmark_cpp_extension (#58708)
Summary:
Currently, the cpp_extension build in benchmarks is misleading, as it has the same name as torch.utils.cpp_extension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708

Test Plan:
Run from `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```

Note: CI doesn't matter as currently benchmarks/ folder is not compiled/test against CI

Reviewed By: robieta

Differential Revision: D28585582

Pulled By: walterddr

fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57
2021-05-24 11:04:02 -07:00
a083933d2a .github: Add windows.8xlarge.nvidia.gpu (#58781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58781

Adds windows GPU workers to our GHA self hosted infra

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28645532

Pulled By: seemethere

fbshipit-source-id: b00d0caef727c597ee15d19c76bda384231f42c9
2021-05-24 10:40:46 -07:00
8ae4d07dac .circleci: Disable windows CPU builds for CircleCI (#58855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58855

We have successfully migrated windows CPU builds to Github Actions so
let's go ahead and disable them in CircleCI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D28642875

Pulled By: seemethere

fbshipit-source-id: 8ffe9338e58952531a70002891a19ea33363d958
2021-05-24 10:28:41 -07:00
1fca1545d4 fixing csr addmm bug (#58768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58768

Fixes gh-58757

This PR fixes the CPU version of the addmm op. For context, before this PR only CSR @ vector was supported. I found a minor bug in addmm_out_sparse_csr_dense_cpu in the non-MKL code, which is solved in this PR.

Moreover, I discovered a limitation in the current MKL implementation: it only works well (acceptable tolerance for output error) with square matrices. I looked into this issue in depth and found that it could be a limitation of the MKL API.

I used this [gist code](https://gist.github.com/aocsa/0606e833cd16a8bfb7d37a5fbb3a5b14) based on [this](https://github.com/baidu-research/DeepBench/blob/master/code/intel/spmm/spmm_bench.cpp) to test this behavior.

As you can see, the output error (last column) is acceptable when the matrices are square and unacceptable when they are not. I reported the issue here: https://github.com/pytorch/pytorch/issues/58770

Looking forward to your comments.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28629563

Pulled By: malfet

fbshipit-source-id: 5ee00ae667336e0d9301e5117057213f472cbc86
2021-05-24 09:54:07 -07:00
2dda8d7571 Move cublas dependency after CuDNN (#58287)
Summary:
Library linking order matters during static linking.
Not sure whether it's a bug or a feature, but if cublas is referenced
before CuDNN, it will be partially statically linked into the library,
even if it is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58287

Reviewed By: janeyx99

Differential Revision: D28433165

Pulled By: malfet

fbshipit-source-id: 8dffa0533075126dc383428f838f7d048074205c
2021-05-24 09:39:09 -07:00
bb4770462f .github: Enable Windows workflow for pull_request (#58418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58418

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28483418

Pulled By: seemethere

fbshipit-source-id: c9f5a4df5a308e0ac6fc6fdc1a26d04723ffded7
2021-05-24 09:34:47 -07:00
007fe949aa Adding a new include directory in BLIS search path (#58166)
Summary:
While trying to build PyTorch with BLIS as the backend library,
we found a build issue due to some missing include files.
This was caused by a missing directory in the search path.
This patch adds that path in FindBLIS.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58166

Reviewed By: zou3519

Differential Revision: D28640460

Pulled By: malfet

fbshipit-source-id: d0cd3a680718a0a45788c46a502871b88fbadd52
2021-05-24 08:57:02 -07:00
0e16087064 [DataLoader] Fix bugs for typing (#58450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58450

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28507877

Pulled By: ejguan

fbshipit-source-id: f4051ff51ce77ef45214f11cba10c8a7e1da4dad
2021-05-24 07:14:40 -07:00
5c7dace309 Automated submodule update: FBGEMM (#58161)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4b8aaad426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58161

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D28385619

fbshipit-source-id: ace938b1e43760b4bedd596ebbd355168a8706b7
2021-05-23 23:33:19 -07:00
74c12da451 add deterministic path for scatter_add_cuda for 1D tensors (#58761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58761

previously we implemented deterministic path for gather_backward in https://github.com/pytorch/pytorch/pull/55573, which replaced non-deterministic scatter_add_cuda.

It's better to move it inside scatter_add so scatter_add can benefit from the deterministic path.
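A minimal sketch (not from the PR, assuming a CUDA device) of hitting the new deterministic 1-D path directly through scatter_add_:

```python
import torch

torch.use_deterministic_algorithms(True)
src = torch.randn(4, device="cuda")
index = torch.tensor([0, 1, 0, 1], device="cuda")
out = torch.zeros(2, device="cuda")
# 1-D scatter_add_ now takes the deterministic path instead of raising a
# nondeterministic alert when deterministic algorithms are requested.
out.scatter_add_(0, index, src)
```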

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_scatter_add_one_dim_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.063)
    ✓ Pass: caffe2/test:torch_cuda - test_scatter_add_one_dim_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (30.909)
    ✓ Pass: caffe2/test:torch_cuda - main (30.909)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (4.613)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.369)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.356)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - main (28.146)
Summary
  Pass: 30
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28585659

fbshipit-source-id: 1ad003d4130501ceff5f6a7a870ca3dbc9a3f1f2
2021-05-23 21:36:02 -07:00
50ded095e4 [deploy] temporarily disable deploy tests (#58832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58832

While we investigate breakage.

Differential Revision: D28631469

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: suo

fbshipit-source-id: 43d51c1c9d81e951074824ccf624e42f6bec4242
2021-05-23 19:26:06 -07:00
a7fdd487e5 Port kthvalue tests to OpInfo (#58654)
Summary:
Tracking issue https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58654

Reviewed By: ngimel

Differential Revision: D28627207

Pulled By: mruberry

fbshipit-source-id: f662f178ab87a9d461f1e0c91b02942c64125e73
2021-05-23 16:44:16 -07:00
4709fdb117 Add GenericShardingSpec for generic tensor sharding. (#57409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57409

Full design: https://github.com/pytorch/pytorch/issues/55207

In https://github.com/pytorch/pytorch/issues/55207, we proposed
`MeshShardingSpec` as a generic sharding mechanism. However, that proposal does
not provide the flexibility to specify shards which have uneven
sizes/partitions and assumes even partitioning. Uneven partitioning is one of
the requirements of an internal use case.

As a result, instead of that we introduce a `GenericShardingSpec` which allows
specifying any arbitrary partitioning of a multi dimensional tensor. Basically
it specifies the start offsets of each shard and the length of each dim of the
shard allowing for greater flexibility
ghstack-source-id: 129604155

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28137616

fbshipit-source-id: 61255762485fb8fa3ec3a43c27bbb222ca25abff
2021-05-23 16:06:05 -07:00
0d6fa1adc5 Introduce ChunkShardingSpec as a model sharding specification. (#55728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55728

Full design: https://github.com/pytorch/pytorch/issues/55207

This PR introduces ChunkShardingSpec (SingleShardingSpec in the design). Used
the name ChunkShardingSpec since it is very similar to `torch.chunk` in terms
of how a Tensor is split up and feels more clear compared to SingleShardingSpec.
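To illustrate the torch.chunk-style partitioning the spec mirrors (plain torch.chunk shown here; the spec itself additionally carries placement info):

```python
import torch

t = torch.arange(10)
# Like ChunkShardingSpec, torch.chunk splits along one dim into roughly
# equal pieces, with the trailing chunk absorbing the remainder.
print(torch.chunk(t, 4))
# (tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9]))
```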
ghstack-source-id: 129603318

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27694108

fbshipit-source-id: c8764abe6a4d5fc56d023fda29b74b5af2a73b49
2021-05-23 16:04:57 -07:00
c5a1f04367 Enabled BFloat16 support for cumsum, logcumsumexp, cumprod, cummin & cummax on CUDA (#57904)
Summary:
Enabled BFloat16 support for `cumsum`, `logcumsumexp`, `cumprod`, `cummin` & `cummax` on CUDA
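A minimal sketch (not from the PR, assuming a CUDA device):

```python
import torch

x = torch.randn(5, device="cuda", dtype=torch.bfloat16)
# These scan ops now accept bfloat16 inputs on CUDA.
print(x.cumsum(0).dtype, x.cumprod(0).dtype, x.cummax(0).values.dtype)
```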

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57904

Reviewed By: ailzhang

Differential Revision: D28558722

Pulled By: ngimel

fbshipit-source-id: 2a8e49c271e968f841d24534b6cc7be162d3a5aa
2021-05-23 15:51:23 -07:00
ee3ea31f12 OpInfo: split, split_with_sizes (#58184)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58184

Reviewed By: ngimel

Differential Revision: D28627271

Pulled By: mruberry

fbshipit-source-id: e6c0d2b005904ddebc9dab76685403530a6f6519
2021-05-23 15:47:35 -07:00
52a8031e8c [ROCm] disable test test_Conv2d_groups_nobias_v2 for ROCm (#58701)
Summary:
Disable test_Conv2d_groups_nobias_v2 test because it is failing on ROCm 4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58701

Reviewed By: ngimel

Differential Revision: D28626651

Pulled By: mruberry

fbshipit-source-id: a74bdf45335ae2afee0aa5e3bece6e208e75a63f
2021-05-23 15:43:36 -07:00
fa0b89bbf7 Change list striding kernel implementation to handle optional integers (#58536)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58536

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28531720

Pulled By: tugsbayasgalan

fbshipit-source-id: c06a8933aa9b4aa562ea65ac2558353b05d0f624
2021-05-23 12:34:22 -07:00
28840b9a44 [Gradient Compression] Disable test_ddp_hook_parity_powerSGD on Gloo backend (#58802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58802

This test can only be re-enabled once https://github.com/facebookincubator/gloo/pull/309 is picked up by Gloo submodule.
ghstack-source-id: 129661729

Test Plan: unit test.

Reviewed By: rohan-varma

Differential Revision: D28623214

fbshipit-source-id: 0249ae816469c3e8cabd08db415821091a064d58
2021-05-22 23:41:27 -07:00
4ca4640bae [torch][repeat_interleave] remove stream synchronization if output size is given (#58417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58417

Same as title.

Test Plan:
Rely on CI signal.

Update unit test to exercise new code path as well.

Reviewed By: ngimel

Differential Revision: D28482927

fbshipit-source-id: 3ec8682810ed5c8547b1e8d3869924480ce63dcd
2021-05-22 20:53:28 -07:00
c1c9be16c4 port mm to structured kernel (#57755)
Summary:
Relates to https://github.com/pytorch/pytorch/issues/57417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57755

Reviewed By: ezyang

Differential Revision: D28426111

Pulled By: walterddr

fbshipit-source-id: 943d3e36433ca846990b940177fb040553961156
2021-05-22 19:24:14 -07:00
f9e8dc005a OpInfo: clone, contiguous (#58390)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58390

Reviewed By: soulitzer

Differential Revision: D28567821

Pulled By: mruberry

fbshipit-source-id: bcf42cb4a9a57d8a15a76819b8a9e2df97cf00be
2021-05-22 18:25:31 -07:00
a70020465b adding test_sparse_csr to run_test (#58666)
Summary:
fixes https://github.com/pytorch/pytorch/issues/58632.

Added several skips related to test asserts and MKL; they will be addressed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58666

Reviewed By: seemethere, janeyx99

Differential Revision: D28607966

Pulled By: walterddr

fbshipit-source-id: 066d4afce2672e4026334528233e69f68da04965
2021-05-22 13:17:46 -07:00
22776f0857 [PyTorch] Remove device check from a few indexing methods (#58800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58800

These methods leverages TensorIterator which will handle
(or skip) device check.
ghstack-source-id: 129654358

Test Plan: CI && sandcastle

Reviewed By: ngimel

Differential Revision: D28622626

fbshipit-source-id: 6153299780d4f7bf286423520ba4cb60b554335e
2021-05-22 13:13:39 -07:00
796c97a88f [Pytorch Delegated Backend] Add python binding for (#57156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57156

generate_debug_handles

To be able to generate debug handles for preprocess functions written in Python.

Test Plan:
CI

Imported from OSS

Differential Revision: D28062328

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 8795d089edc00a292a2221cfe80bbc671468055c
2021-05-22 08:34:19 -07:00
d6d726f781 [Pytorch Backend delegation] Add api for backend lowering to query debug (#55462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55462

handles and symbolicate exception callstack thrown from backend.

The objective of this diff is to improve error reporting when
exceptions are raised from a lowered backend. We would effectively like to
get the same model-level stack trace that you would get without having
lowered some module to a backend.

For example:
```
class AA(nn.Module):
  def forward(self, x, y):
    return x + y

class A(nn.Module):
  def __init__(...):
    self.AA0 = AA()
  def forward(self, x, y):
    return self.AA0.forward(x, y) + 3

class B(nn.Module):
  def forward(self, x):
    return x + 2

class C(nn.Module):
  def __init__(...):
    self.A0 = A()
    self.B0 = B()
  def forward(self, x, y):
    return self.A0.forward(x, y) + self.B0.forward(x)
```
If we then do C().forward(torch.rand((2,3)), torch.rand((14,2))), we
will likely see an error stack like:
```
C++ exception with description "The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 3, in forward

    def forward(self, x, y):
      return self.A0.forward(x, y) + self.B0.forward(x)
             ~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in forward

    def forward(self, x, y):
      return self.AA0.forward(x, y) + 3
             ~~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in forward

    def forward(self, x, y):
      return x + y
             ~~~~~ <--- HERE
```

We would like to see the same error stack if we lowered C.A0 to some
backend.

With this diff we get something like:
```
  Module hierarchy:top(C).A0(backend_with_compiler_demoLoweredModule).AA0(AA)
Traceback of TorchScript (most recent call last):
  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return self.A0.forward(x, y) + self.B0.forward(x)
             ~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 5, in FunctionName_UNKNOWN
                typed_inputs: List[Any] = [x, y, ]
                if self.__backend.is_available() :
                  _0, = self.__backend.execute(self.__handles["forward"], typed_inputs)
                        ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                  assert isinstance(_0, Tensor)
                  return _0
  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return self.AA0.forward(x, y) + 3
             ~~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return x + y
             ~~~~~ <--- HERE
```
This is achieved in 3 parts:
Part 1:
A. BackendDebugInfoRecorder:
   Instantiated during backend lowering, in `to_backend`, before calling the
   preprocess function corresponding to the backend. This facilitates recording
   of debug info (such as source range + inlined callstack) for the lowered module.
B. Instantiate WithBackendDebugInfoRecorder with BackendDebugInfoRecorder.
   This initializes thread local pointer to BackendDebugInfoRecorder.
C. generate_debug_handles:
   In preprocess function, the backend will call generate_debug_handles
   for each method being lowered separately. generate_debug_handles
   takes `Graph` of the method being lowered and returns a map
   of Node*-to-debug_handles. Backend is responsible for storing debug
   handles appropriately so as to raise exception (and later profiling)
   using debug handles when the exception being raised corresponds to
   particular Node that was lowered.
   Inside generate_debug_handles, we will query the current
   BackendDebugHandleInfoRecorder, that is issuing debug handles. This debug
   handle manager will issue debug handles as well as record
   debug_handles-to-<source range, inlined callstack> map.
D. Back in `to_backend`, once the preprocess function has finished
   lowering the module, we will call `stopRecord` on
   BackendDebugInfoRecorder. This will return the debug info map. This
   debug info is then stored inside the lowered module.

Part 2:
Serialization:
During serialization for bytecode (lite interpreter), we will do two
things:
1. Extract all the source ranges that are contained inside the
debug_handles-to-<source range, inlined callstack> map for the lowered
module. These are the source ranges corresponding to the debug handles,
including what is in the inlined callstack. Since we replaced the original
module with the lowered module, we won't be serializing code for the original
module and thus have no source range for it. That is why the source ranges have
to be stored separately. We will lump all the source ranges for all the
lowered modules into one single debug_pkl file.
2. Then we will serialize debug_handles-to-<source range, inlined
callstack> map.

Now during deserialization we will be able to reconstruct
debug_handles-to-<source range, inlined callstack> map. Given all
debug_handles are unique we would not need any module information.

Test Plan:
Tests are added in test_backend.cpp

Imported from OSS

Differential Revision: D27621330

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 0650ec68cda0df0a945864658cab226a97ba1890
2021-05-22 08:33:07 -07:00
e7c35a3363 Revert D28617214: [Gradient Compression] Do not skip the comm hook tests on Gloo backend
Test Plan: revert-hammer

Differential Revision:
D28617214 (3e88acbf05)

Original commit changeset: 3bafb0c837a1

fbshipit-source-id: 0b6254e9766436633faea63cd64c454b739f74b4
2021-05-22 07:47:18 -07:00
6093161158 Separated out working tests from not working tests for NNC OpInfo (#58788)
Summary:
This gets rid of a lot of the try/else rigamarole.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58788

Reviewed By: ZolotukhinM

Differential Revision: D28621054

Pulled By: Chillee

fbshipit-source-id: d0d8a1b6466eb318d939a1ed172b78f492ee0d5b
2021-05-22 02:24:23 -07:00
dc8bc6ba4b [PyTorch Edge] Check if open paren ( occurs in an operator name string (#58687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58687

We want to validate if the usages are all okay.
ghstack-source-id: 129639560

Test Plan: Tested on master: the build fails. Then tested with D28549578 (db67699ae6) applied: the build succeeds.

Reviewed By: JacobSzwejbka

Differential Revision: D28579734

fbshipit-source-id: 1ac65474762855562109adc0bac2897b59f637ce
2021-05-21 20:23:42 -07:00
4c961beacb Revert D28474878: Always use intrusive_ptr for Message (1 out of 2)
Test Plan: revert-hammer

Differential Revision:
D28474878 (4d704e607d)

Original commit changeset: 5b76d45e05f6

fbshipit-source-id: 677c5bc7f02dca23213f778eb0e626a2f6600f3b
2021-05-21 19:24:22 -07:00
a6b9268f31 Revert D28474879: Always use intrusive_ptr for Message (2 out of 2)
Test Plan: revert-hammer

Differential Revision:
D28474879 (ebf55a7d13)

Original commit changeset: 498652a8b80a

fbshipit-source-id: 4d81e9769699356bf2a2ffc14b26f480bfeef9a1
2021-05-21 19:24:20 -07:00
c1a9befba2 Revert D28474880: Allow Future::then to return pre-extracted DataPtrs
Test Plan: revert-hammer

Differential Revision:
D28474880 (a0ee299d92)

Original commit changeset: 91a0dde5e29d

fbshipit-source-id: fabf7b0bcbd41342553660a4d1e4bfc3d1dd2d41
2021-05-21 19:24:19 -07:00
a1719be07f Revert D28474877: Provide pre-extracted DataPtrs when completing a Future with a Message
Test Plan: revert-hammer

Differential Revision:
D28474877 (bdf6a4bffd)

Original commit changeset: e68d7d45f1c1

fbshipit-source-id: b89858b4e82f4f766031cfaad9fc736cf8097816
2021-05-21 19:24:17 -07:00
341f83d6a2 Revert D28474981: Create CUDA-aware futures in RequestCallback
Test Plan: revert-hammer

Differential Revision:
D28474981 (027c68ef00)

Original commit changeset: 492b8e71a43d

fbshipit-source-id: 0697c0922cd6bcbea2505efeecbbcbb3ffcfff2b
2021-05-21 19:24:15 -07:00
7a8336a5a7 Revert D28474983: Set streams when invoking UDFs
Test Plan: revert-hammer

Differential Revision:
D28474983 (ab1e958d20)

Original commit changeset: 358292764d0a

fbshipit-source-id: b4d4c25fe551d83848a9d023c139a9f1acc4c23d
2021-05-21 19:24:14 -07:00
89c81c5bba Revert D28574083: Set and propagate devices in RRef completion future
Test Plan: revert-hammer

Differential Revision:
D28574083 (23df70359a)

Original commit changeset: 5c89902cdc5c

fbshipit-source-id: e48043b6c4fb8a6f383f78e1aa88f7614f9fa13a
2021-05-21 19:24:12 -07:00
b8a04e25ec Revert D28474982: Make TP agent use streams from Future when sending response
Test Plan: revert-hammer

Differential Revision:
D28474982 (19a7472702)

Original commit changeset: c0034eb3f2a2

fbshipit-source-id: fb260c71e6c9dd5a2c44121fe4729a4f4418532b
2021-05-21 19:23:01 -07:00
dceaf98e79 [torch][package] Fix importlib.resources.path for python <3.8.8 (#58718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58718

`PackageImporter` does not populate `module.__spec__.origin`, which causes an
unhandled `Exception` to be raised when using `importlib.resources.path` to get
a path to a binary file resource in the package in python <3.8.6.

This commit fixes this issue by setting `module.__spec__.origin` to
"<package_importer>". The actual value is not important as far as I can tell;
the simple fact that it is not `None` allows `importlib` to avoid raising an
`Exception` in `importlib.resources.path`.
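A minimal sketch (not from the commit) of the call that used to raise; the package and resource names are hypothetical:

```python
import importlib.resources

# With __spec__.origin populated, this no longer raises on python <3.8.6
# for modules loaded through PackageImporter.
with importlib.resources.path("my_packaged_module", "weights.bin") as p:
    data = p.read_bytes()
```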

Test Plan:
This commit adds a unit test to `test_resources.py` that tests that
`importlib.resources.path` can be used within a package.

Reviewed By: suo

Differential Revision: D28589117

fbshipit-source-id: 870d606a30fce6884ae48b03ff71c0864e4b325f
2021-05-21 19:16:54 -07:00
071d49a970 Document monitored barrier (#58322)
Summary:
Will not land before the release, but would be good to have this function documented in master for its use in distributed debugability.
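A minimal single-process sketch (not from the PR; gloo only, address/port illustrative):

```python
import datetime
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)
# Unlike barrier(), this raises on failure and reports which ranks did not
# reach the barrier within the timeout.
dist.monitored_barrier(timeout=datetime.timedelta(seconds=10))
```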

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322

Reviewed By: SciPioneer

Differential Revision: D28595405

Pulled By: rohan-varma

fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
2021-05-21 19:04:57 -07:00
84b6c629d3 [lint] Move shellcheck to its own step (#58623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58623

This splits out everything shellcheck related into its own job that generates and checks GHA workflows, then shellchecks those + jenkins scripts. This PR also integrates shellcheck into the changed-only stuff in `actions_local_runner.py` so that shellcheck won't do anything unless someone edits a shell script in their local checkout. This is the final piece to clean up the output of `make quicklint` and speeds it up by a good bit (before it was shellchecking everything which took a few seconds):

```
$ make quicklint -j $(nproc)
✓ quick-checks: Ensure no unqualified noqa
✓ quick-checks: Ensure canonical include
✓ quick-checks: Ensure no unqualified type ignore
✓ quick-checks: Ensure no direct cub include
✓ quick-checks: Ensure no tabs
✓ quick-checks: Ensure no non-breaking spaces
✓ shellcheck: Regenerate workflows
✓ quick-checks: Ensure no versionless Python shebangs
✓ quick-checks: Ensure correct trailing newlines
✓ shellcheck: Assert that regenerating the workflows didn't change them
✓ mypy (skipped typestub generation)
✓ cmakelint: Run cmakelint
✓ quick-checks: Ensure no trailing spaces
✓ flake8
✓ shellcheck: Extract scripts from GitHub Actions workflows
✓ shellcheck: Run Shellcheck
real 0.92
user 6.12
sys 2.45
```

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28617293

Pulled By: driazati

fbshipit-source-id: af960ed441db797d07697bfb8292aff5010ca45b
2021-05-21 18:23:40 -07:00
b842351b4f Skip SVE acceleration on M1 (#58785)
Summary:
As it's not supported by the chip and also crashes the compiler, see https://bugs.llvm.org/show_bug.cgi?id=50407

Fixes https://github.com/pytorch/pytorch/issues/58653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58785

Reviewed By: zhouzhuojie, driazati

Differential Revision: D28619231

Pulled By: malfet

fbshipit-source-id: 34367c074f9624b21d239eec757891cbb51f5bed
2021-05-21 18:08:30 -07:00
3e88acbf05 [Gradient Compression] Do not skip the comm hook tests on Gloo backend (#58784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58784

DDP communication hooks are already supported on Gloo backend. No longer need to skip these tests on Gloo.

Original PR issue: https://github.com/pytorch/pytorch/issues/58467
ghstack-source-id: 129635828

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_comm_hook_logging
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce_process_group
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_powerSGD

Reviewed By: rohan-varma

Differential Revision: D28617214

fbshipit-source-id: 3bafb0c837a15ad203a8570f90750bc5177d5207
2021-05-21 17:47:52 -07:00
041bff77b6 Make tools/actions_local_runner.py PY-3.X compatible (#58787)
Summary:
Do not use `shlex.join`, which is a simple join over quoted args, i.e.
a9e43615c2/Lib/shlex.py (L318-L320)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58787

Reviewed By: driazati

Differential Revision: D28619996

Pulled By: malfet

fbshipit-source-id: dd4e939a88e2923b41084da2b5fbdbee859c0104
2021-05-21 17:40:48 -07:00
829a096cd7 Fix arange functions for VSX specializations of Vec256 (#58553)
Summary:
Need a templated 2nd parameter to support e.g. double steps even for int vectors.

This extends https://github.com/pytorch/pytorch/pull/34555 x86 specific fix to VSX instruction set.

Fixes https://github.com/pytorch/pytorch/issues/58551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58553

Reviewed By: ailzhang

Differential Revision: D28551266

Pulled By: malfet

fbshipit-source-id: de7d23685da06b1b3089933d74398667cfb43c9f
2021-05-21 17:30:12 -07:00
e094980060 Makefile should use python3 instead of python alias (#58786)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58786

Reviewed By: driazati

Differential Revision: D28619802

Pulled By: malfet

fbshipit-source-id: 8f81298d39ba89c4e007f537ec2dd64bb23338af
2021-05-21 17:23:27 -07:00
1d885fbd0e Update GraphTask::owner_ in a single thread for DistEngine. (#58625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58625

Several TSAN tests were failing for distributed since `owner_` was not
atomic and was being accessed by several threads. As an example:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/autograd/engine/dist_engine.cpp#L333.

To fix this, I've set the owner_ only once when the graphTask is created.

Test Plan:
1) Validated change fixes failing TSAN test.
2) waitforbuildbot

Reviewed By: albanD

Differential Revision: D28496878

fbshipit-source-id: 473f4f6d859595749a02563a204ba7aa35ea19e3
2021-05-21 17:12:27 -07:00
d9aa0b53eb [PyTorch] Migrate TI usage in ATen/native/quantized to borrowing (#58307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58307

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129598791

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28445922

fbshipit-source-id: ce12743980296bab72a0cb83a8baff0bb6d80091
2021-05-21 16:31:01 -07:00
3ddb4b3e68 [PyTorch] Migrate TI usage in ATen/native to borrowing (#58305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58305

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129598793

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28445712

fbshipit-source-id: 0822f1408a0a71c8f8934e6d90659ae3baa085ac
2021-05-21 16:29:50 -07:00
e574c2c025 [quant][fx] Validate qconfig_dict keys (#58566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58566

Validates the keys of the qconfig_dict, prepare_custom_config_dict, convert_custom_config_dict, and
fuse_custom_config_dict. If the user passes in an invalid key or makes a typo, we will throw an error and let the user know what keys are supported.
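A minimal sketch (not from the PR) of the new behavior; the model and the misspelled key are illustrative:

```python
import torch
from torch.quantization.quantize_fx import prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
bad_qconfig_dict = {"module_nmae": []}  # typo: should be "module_name"
try:
    prepare_fx(model, bad_qconfig_dict)
except Exception as e:
    print(e)  # the error message lists the supported qconfig_dict keys
```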

Test Plan:
Imported from OSS

python test/test_quantization.py

Reviewed By: jerryzh168

Differential Revision: D28540923

fbshipit-source-id: 5958c32017b7d16abd219aefc8e92c42543897c2
2021-05-21 15:20:05 -07:00
ed4cda0183 [pkg] opt into autoformat
Summary: woooo

Test Plan: arc lint --apply-patches --take BLACK --paths-cmd 'hg files -I "caffe2/**/*.py"'

Reviewed By: SplitInfinity

Differential Revision: D28608934

fbshipit-source-id: 7768fed50a87883a95319376c0a6d73a9492bdcc
2021-05-21 15:03:52 -07:00
e5ba9307b7 catch exception when running print regression (#58751)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58751

Test Plan: https://github.com/pytorch/pytorch/issues/58752

Reviewed By: samestep

Differential Revision: D28605667

Pulled By: walterddr

fbshipit-source-id: 3796c924df8e50849dd08ecbeab612ba4f0c569b
2021-05-21 14:59:42 -07:00
378b2af93d T90561249: Enforce kernel launch checks for OSS CI (#58465)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58465

Test Plan: how to test?

Reviewed By: r-barnes

Differential Revision: D28500258

fbshipit-source-id: 19e56d5e18d77b951acb510e1e7ac834ce1ffc9b
2021-05-21 14:03:48 -07:00
19a7472702 Make TP agent use streams from Future when sending response (#58428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58428

Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 129567045

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474982

fbshipit-source-id: c0034eb3f2a2ea525efb63a31b839bc086060e7e
2021-05-21 13:15:35 -07:00
23df70359a Set and propagate devices in RRef completion future (#58674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.

I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 129567042

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28574083

fbshipit-source-id: 5c89902cdc5cc12f1ebeea860b90cd9c3d7c7da1
2021-05-21 13:15:34 -07:00
ab1e958d20 Set streams when invoking UDFs (#58427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 129567052

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474983

fbshipit-source-id: 358292764d0a6832081c34bf6736f0961475ff3d
2021-05-21 13:15:32 -07:00
027c68ef00 Create CUDA-aware futures in RequestCallback (#58426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58426

The operations in RequestCallback can return CUDA tensors, thus the futures used to hold them must be CUDA-aware.
ghstack-source-id: 129567051

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474981

fbshipit-source-id: 492b8e71a43da5f63b4b7a31f820427cde9736e4
2021-05-21 13:15:30 -07:00
bdf6a4bffd Provide pre-extracted DataPtrs when completing a Future with a Message (#58425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58425

Now that callbacks can provide pre-extracted DataPtrs, let's do so. This will become of crucial importance in the next PR, where some of these futures will become CUDA-aware, and thus they will try to extract DataPtrs on their own, but they would fail to do so here because Message isn't "inspectable".
ghstack-source-id: 129567057

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474877

fbshipit-source-id: e68d7d45f1c1dc6daa5e05cf984cfc93d2dce0d0
2021-05-21 13:15:29 -07:00
a0ee299d92 Allow Future::then to return pre-extracted DataPtrs (#58424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58424

In CUDA mode, Future must inspect its value and extract DataPtrs. However some types are not supported, for example the C++/JIT custom classes, which include Message, which is widely used in RPC. Hence for these scenarios we allow the user to perform the custom DataPtr extraction on their own, and pass the pre-extracted DataPtrs.

Note that `markCompleted` already allowed users to pass in pre-extracted DataPtrs, hence this PR simply extends this possibility to the `then` method too.
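
A hedged sketch of the completion path (the `markCompleted` overload is assumed, and `extractDataPtrsFromMessage` is a hypothetical helper):

```
// Message is an opaque custom-class IValue the Future cannot inspect, so
// the caller pre-extracts the DataPtrs and passes them alongside the value.
std::vector<std::reference_wrapper<const at::DataPtr>>
extractDataPtrsFromMessage(const torch::distributed::rpc::Message& m); // hypothetical

void completeWithMessage(
    const c10::intrusive_ptr<c10::ivalue::Future>& future,
    c10::intrusive_ptr<torch::distributed::rpc::Message> message) {
  auto dataPtrs = extractDataPtrsFromMessage(*message);
  future->markCompleted(c10::IValue(std::move(message)), std::move(dataPtrs));
}
```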
ghstack-source-id: 129567044

Test Plan: Used in next PR.

Reviewed By: mrshenli

Differential Revision: D28474880

fbshipit-source-id: 91a0dde5e29d1afac55650c5dfb306873188d785
2021-05-21 13:15:27 -07:00
ebf55a7d13 Always use intrusive_ptr for Message (2 out of 2) (#58423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones within toMessageImpl. And since they're the last ones, we make the constructor of Message private, to prevent new usages from emerging.
ghstack-source-id: 129567049

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474879

fbshipit-source-id: 498652a8b80a953396cd5d4b275c0b2e869c9ecf
2021-05-21 13:15:25 -07:00
4d704e607d Always use intrusive_ptr for Message (1 out of 2) (#58422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate many of the problems above and simplify the code.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
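
The shape of the migration, in an illustrative sketch (not PyTorch's real declarations):

```
#include <c10/util/intrusive_ptr.h>

struct Message : c10::intrusive_ptr_target { /* payload elided */ };

// Before: rvalue-reference parameters forced callers to std::move, and
// pass-by-value elsewhere made accidental deep copies easy.
void sendOld(Message&& message);

// After: one cheap-to-copy refcounted handle; how the callee consumes the
// argument no longer dictates the signature.
void sendNew(c10::intrusive_ptr<Message> message);
```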
ghstack-source-id: 129567053

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474878

fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
2021-05-21 13:15:24 -07:00
35ea8779da Prevent using anything other than intrusive_ptr for Future (#58421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58421

Here I make it impossible to create Futures that do not use intrusive_ptr, by making the constructor private. This makes it safer (by "forcing" people to do the right thing) and prevents a proliferation of new shared_ptrs or of accidental copies/moves.
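
The general idiom, sketched with std::shared_ptr for brevity (the actual change applies it to c10::intrusive_ptr):

```
#include <memory>

// A private constructor plus a static factory means every instance is
// born inside the smart pointer; nobody can stack-allocate or copy one.
class Widget {
 public:
  static std::shared_ptr<Widget> create() {
    // std::make_shared cannot reach the private constructor.
    return std::shared_ptr<Widget>(new Widget());
  }

 private:
  Widget() = default;
};
```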
ghstack-source-id: 129567047

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474484

fbshipit-source-id: 82c487e1bb7c27a2e78cb5d594e00e54c752bf09
2021-05-21 13:15:22 -07:00
44daf1930b Migrate remaining shared_ptr<Future> to intrusive_ptr (#58420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420

In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28477285

fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
2021-05-21 13:15:20 -07:00
59454ce36e Make remaining autograd methods return futures (#57861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57861

The very last methods left that still didn't return Futures were the autograd ones, but they're very easy to port.

We've now finished the conversion of RequestCallback to be fully Future-based!
ghstack-source-id: 129567055

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286173

fbshipit-source-id: 1de58cee1b4513fb25b7e089eb9c45e2dda69fcb
2021-05-21 13:15:19 -07:00
d6d2fb3323 Make remaining RRef methods return futures (#57860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57860

The other methods for RRefs just did bookkeeping and are trivially easy to migrate to Futures (which is done mainly for consistency at this point).
ghstack-source-id: 129567068

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286175

fbshipit-source-id: 1d97142803f73fe522435ca75200403c78babc68
2021-05-21 13:15:17 -07:00
797dff55b5 Unify fetching RRefs (#57859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859

Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to future, we can have that logic once, and then connect it to different follow-up steps.
ghstack-source-id: 129567050

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286172

fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a
2021-05-21 13:15:15 -07:00
b9b41f6d1b Deduplicate Python object serialization (#57858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57858

Just a small deduplication, which moves complexity out of the way and ensures consistent error checking.
ghstack-source-id: 129567056

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286174

fbshipit-source-id: 6eab8d3f30405d49c51f8b9220453df8773ff410
2021-05-21 13:15:14 -07:00
cd9dbbd93a Simplify process(Script|Python)(Remote)?Call (#57857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57857

There used to be a whole lot of methods: `processPythonCall`, `processScriptCall`, `processScriptRemoteCall`, `processPythonRemoteCall`, `processScriptCallOp`, `processBaseScriptRemoteCall` and `processScriptRemoteCallOp`. Thanks to the previous simplification, we can now drop all but the first four, which map nicely 1:1 to the four message types we need to handle. Also their signatures become much simpler: they take an RPC command and return a future.
ghstack-source-id: 129567070

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253848

fbshipit-source-id: e0e45345c414a96900f9d70ee555359d28908833
2021-05-21 13:15:12 -07:00
c96a05d148 Unify assignment of OwnerRRef result (#57856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57856

Thanks to Futures providing a "common language" between various steps, we can now deduplicate the creation of OwnerRRef, by having two different ways of creating the result (JIT and Python) but then connecting them to a single method that wraps and stores that result in an OwnerRRef.
ghstack-source-id: 129567072

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253845

fbshipit-source-id: a156e56cac60eb22f557c072b61ebac421cfad43
2021-05-21 13:15:10 -07:00
e220a1bbcd Make processPythonExecution return a future (#57855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57855

We already had a helper to run Python functions, which was nice (it de-duplicated some code). This helper was however taking a callback which, as I said, isn't as nice as it returning a Future. Hence here I change this.
ghstack-source-id: 129567054

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253846

fbshipit-source-id: d854d4aa163798fb015cd6d46932f9ff1d18262e
2021-05-21 13:15:09 -07:00
20d02cb7dd Remove getScriptRemoteCallType (#57854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854

Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef. Which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
ghstack-source-id: 129567060

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253843

fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
2021-05-21 13:15:07 -07:00
60fc37393e Simplify OwnerRRef completion (#57853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57853

A bunch of methods received an OwnerRRef to "fill in". I think it will be more flexible to do it the other way around, and have these methods return a value (wrapped in a Future), which can then be "connected" to an OwnerRRef, but which can also potentially be consumed in different ways.
ghstack-source-id: 129567059

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253844

fbshipit-source-id: 7e3772312dbacfc75a6ac0f62189fc9828001fc7
2021-05-21 13:15:05 -07:00
ea2f5bbb4c Unify async execution for JIT functions (#57852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57852

Another great example of the benefits of Futures. Thanks to the "right abstraction" (i.e., the `thenAsync` method), adding support for async execution becomes trivial, and the code much simpler than what it used to be.
ghstack-source-id: 129567063

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253842

fbshipit-source-id: b660151ca300f3d6078db0f3e380c80a4d8f5190
2021-05-21 13:15:04 -07:00
bfdc279134 Unify invoking JIT functions (#57851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57851

The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253841

fbshipit-source-id: 2b8affde16c106f5c76efa8be49af070213708bf
2021-05-21 13:15:02 -07:00
77428159f5 Unify invoking JIT operands (#57850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57850

What I envision is modular, decomposed code, with separate steps which each consume/produce Futures, and which can be chained together to obtain the desired results. One common "starting point" for these chains is the execution of a remote function (Python or JIT or otherwise). I'm thus creating a helper function for one of these, the JIT operators (by deduplicating the places where we used to run them). More will follow.

This deduplication will also help to add CUDA support to JIT RPC, since the execution of the JIT function/operators is where we need to set our custom streams.
ghstack-source-id: 129567058

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253847

fbshipit-source-id: 24ab67ad89c8796861e9bbcb78878b26704c0c48
2021-05-21 13:15:00 -07:00
f94f1db938 Make some methods of RequestCallback return void instead of bool (#57849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57849

Some methods are currently returning bool, but I'll soon want them to return a Future. I could have them return a tuple of bool and Future, but that's a bit heavy. Instead it turns out we can very easily make them return void, which will simplify things.
ghstack-source-id: 129567061

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224476

fbshipit-source-id: 26dc796b7e38f03aa269cf0731b0059d58e57e94
2021-05-21 13:14:59 -07:00
4ac18f6710 Centralize setting messageId in RequestCallback (#57848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57848

This PR looks large, but all it does is add a dozen lines and remove a lot of other ones.

One first advantage of using Futures is that we can easily chain some "post-processing" to them. Until now we needed to pass the message ID around everywhere because it was set separately by each method. Instead, we could simply add a follow-up step to the final future which sets this ID, and remove all the former logic.
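
A hedged sketch of that follow-up step (names and the exact `then` signature are assumed):

```
using torch::distributed::rpc::Message;

// One post-processing step on the final future replaces threading
// `messageId` through every handler.
c10::intrusive_ptr<c10::ivalue::Future> attachId(
    c10::intrusive_ptr<c10::ivalue::Future> retFuture,
    int64_t messageId) {
  return retFuture->then(
      [messageId](c10::ivalue::Future& fut) {
        auto msg = fut.value().toCustomClass<Message>();
        msg->setId(messageId); // assumed setter
        return c10::IValue(std::move(msg));
      },
      retFuture->elementType());
}
```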
ghstack-source-id: 129567065

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224477

fbshipit-source-id: 7b6e21646262abe5bbbf268897e2d792e5accc27
2021-05-21 13:14:57 -07:00
f6844eafce Make RequestCallback collect Futures from methods, rather than providing them (#57847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then wait on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

 ---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various method could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
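
Illustrative signatures only, to show the direction of the refactor:

```
// Before: the result future is created up front and pushed down for each
// handler to fill in.
void processRpc(RpcCommandBase& rpc, c10::intrusive_ptr<JitFuture> responseFuture);

// After: each handler creates or obtains its futures itself and returns
// them, letting results bubble up.
c10::intrusive_ptr<JitFuture> processRpc(RpcCommandBase& rpc);
```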
ghstack-source-id: 129567062

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224478

fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028
2021-05-21 13:14:55 -07:00
7e1f2b33ce Add helpers to manipulate futures (#57846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57846

In later PRs I'll need to create already-completed futures (it'll make sense then, I hope). Here are a few helpers for that, which I'm adding separately to reduce the noise later.
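
A minimal sketch of such a helper, assuming the Future constructor is reachable here via make_intrusive:

```
c10::intrusive_ptr<c10::ivalue::Future> asCompletedFuture(c10::IValue value) {
  auto future = c10::make_intrusive<c10::ivalue::Future>(value.type());
  future->markCompleted(std::move(value));
  return future;
}
```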
ghstack-source-id: 129567064

Test Plan: See later.

Reviewed By: mrshenli

Differential Revision: D28253664

fbshipit-source-id: f091e1d3ea353bb5bfbd2f582f1b8f84e4b0114f
2021-05-21 13:14:54 -07:00
1d7cf4b248 Reduce overhead when Future invokes callbacks inline (#57638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57638

In RPC there are a few instances of "fastpaths" which do `if (fut.isCompleted()) { do_sth(); } else { fut.addCallback(do_sth); }`. I intend to get rid of them, for reasons I'll clarify later but which in a nutshell have to do with CUDA correctness and readability. Note that dropping the fastpath introduces no change in behavior (because `addCallback` invokes the callback inline anyways), thus the only perf concern comes from the fact that the fastpath avoids constructing and passing around a `std::function`. I don't think this is a significant performance hit. Regardless, this PR preemptively addresses this concern, by tweaking `addCallback` (and similar methods) so they can handle raw lambdas, and so that they do _not_ wrap them into `std::function`s if they are invoked inline.
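
A minimal sketch (not PyTorch's actual class) of that optimization: a templated `addCallback` only type-erases the callable when it must be stored, and an already-completed future runs it inline as a raw lambda:

```
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class ToyFuture {
 public:
  template <typename F>
  void addCallback(F&& cb) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (completed_) {
      lock.unlock();
      std::forward<F>(cb)(); // invoked inline; no std::function built
      return;
    }
    callbacks_.emplace_back(std::forward<F>(cb)); // type-erased only here
  }

  void markCompleted() {
    std::unique_lock<std::mutex> lock(mutex_);
    completed_ = true;
    auto cbs = std::move(callbacks_);
    lock.unlock();
    for (auto& cb : cbs) cb();
  }

 private:
  std::mutex mutex_;
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```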
ghstack-source-id: 129567067

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28222808

fbshipit-source-id: eb1c7114cf7aca3403cb708f14287cab0907ecfa
2021-05-21 13:14:52 -07:00
ce2f1c29f9 Introduce thenAsync method on Future (#57637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57637

I had proposed a similar method in https://github.com/pytorch/pytorch/pull/48790, although that PR also exposed it to Python and thus required a bit more work. This PR only introduces this method as a C++ API. Python can be added later.

This new method is useful when one wants to use `then` but the callback does perform some async operation itself, and one wants to "reconcile" the future produced inside the callback with the one produced outside.
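
A hedged usage sketch (the signature and helper are assumed):

```
c10::intrusive_ptr<c10::ivalue::Future> startAsyncWork(c10::IValue input); // hypothetical

c10::intrusive_ptr<c10::ivalue::Future> chainAsync(
    const c10::intrusive_ptr<c10::ivalue::Future>& outer,
    c10::TypePtr resultType) {
  // The callback kicks off async work and returns a Future itself;
  // thenAsync yields a Future that completes only once that inner
  // Future does, with no manual reconciling of the two.
  return outer->thenAsync(
      [](c10::ivalue::Future& completed) {
        return startAsyncWork(completed.value());
      },
      std::move(resultType));
}
```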
ghstack-source-id: 129567066

Test Plan: Used (and thus tested) later in the stack.

Reviewed By: mrshenli

Differential Revision: D28222809

fbshipit-source-id: 869f11ab390b15e80c0855750e616f41248686c5
2021-05-21 13:13:02 -07:00
d7d0fa2069 Fix typo. (#58728)
Summary:
Fix typo in docs and comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58728

Reviewed By: mruberry

Differential Revision: D28603612

Pulled By: H-Huang

fbshipit-source-id: b3cd8f6f19354201d597254d0b3cb4e2062471ab
2021-05-21 11:45:10 -07:00
13c975684a c10/util/thread_name.cpp: pthread_setname_np requires Glibc 2.12 (#55063)
Summary:
`pthread_setname_np` requires Glibc 2.12. The patch reproduces what numactl does: 93867c59b0/syscall.c (L132-L136)

Related to issue https://github.com/pytorch/pytorch/issues/23482 and the `pthread_setname_np.patch` patch that adamjstewart shared.
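
A sketch of the fallback (conditions per the description above):

```
#include <pthread.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
#if __GLIBC_PREREQ(2, 12)
#define HAVE_PTHREAD_SETNAME_NP 1
#endif
#endif

// pthread_setname_np needs glibc >= 2.12; older glibc takes the prctl
// path that numactl uses (which can only name the calling thread).
static void setThreadName(const char* name) {
#ifdef HAVE_PTHREAD_SETNAME_NP
  pthread_setname_np(pthread_self(), name);
#elif defined(__linux__)
  prctl(PR_SET_NAME, name, 0, 0, 0);
#else
  (void)name; // unsupported platform
#endif
}
```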

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55063

Reviewed By: soulitzer

Differential Revision: D28577146

Pulled By: malfet

fbshipit-source-id: 85867b6f04795b1ae7bd46dbbc501cfd0ec9f163
2021-05-21 10:26:51 -07:00
76ce925257 [c10d] Fix monitored_barrier with wait_all_ranks (#58702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58702

Off-by-one error when determining whether some ranks failed with
`wait_all_ranks=True`. This wasn't caught by tests because the tests only
exercised failure scenarios, not success scenarios with `wait_all_ranks=True`.
ghstack-source-id: 129559840

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28583235

fbshipit-source-id: a8f376efb13a3f36c788667acab86543c80aff59
2021-05-21 09:40:50 -07:00
9e261de630 Revert D28547564: [pytorch][PR] masked_scatter thrust->cub
Test Plan: revert-hammer

Differential Revision:
D28547564 (5152cf8647)

Original commit changeset: 83aeddfaf702

fbshipit-source-id: d5259afb584e0f6c0a11de4d4cb3d56a2a562eb7
2021-05-21 09:18:34 -07:00
5313bafd31 [JIT] integer value refinement (#56438)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56438

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27924239

Pulled By: eellison

fbshipit-source-id: ace54fcb594853f30c242369ea203b0eb5527ac1
2021-05-21 08:51:01 -07:00
483ea176b3 Factor out isDominatedBy (#56437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56437

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27924240

Pulled By: eellison

fbshipit-source-id: d600f895bfb06304957fe65155fceab0e5f873ea
2021-05-21 08:50:59 -07:00
0d9f1c1ec6 Add Value * == Value * peephole (#55978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55978

This is needed for broadcasting two of the same symbolic shape

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27755328

Pulled By: eellison

fbshipit-source-id: d38d9458a9e28d31558f0bc55206516b78131032
2021-05-21 08:50:57 -07:00
391603d883 Factor out non tensor peephole (#55977)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55977

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27755329

Pulled By: eellison

fbshipit-source-id: 0e8948c0607fa59133310e4db8e05ac6847c9f8b
2021-05-21 08:50:55 -07:00
5cebf29b4e Add list len refinement (#55926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55926

This is necessary for code like conv2d, where we wish to share generic convolution shape function logic but, for conv2d, always infer that the output has dimension 4. I'm also hoping the refinement algorithm here could be refactored out and used to support refining tensor types from user annotations. I have a lengthy comment explaining how this works, and the logic outside of the data structures is pretty small and contained. Additionally, you might check out https://fb.quip.com/X7EVAdQ99Zzm for a very similar description of how to refine values based on comparison operators.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27750997

Pulled By: eellison

fbshipit-source-id: d962415af519ac37ebc9de88f2e1ea60a1374f7c
2021-05-21 08:50:54 -07:00
9fd2306036 Add handling of symbolic shapes (#55925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55925

This sets up the initial handling of symbolic shapes. As in the test, it doesn't work perfectly yet because it needs a couple other optimization passes. The basic description is pretty simple: we resolve tensor dimension indices to the same Value *, and before extracting out the output Tensor shape we substitute in symbolic shapes. We don't substitute during optimization because they are represented as negative numbers so we don't want them inadvertently used in Constant prop or something else.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27750996

Pulled By: eellison

fbshipit-source-id: 6984e7276b578f96b00fc2025cef0e13f594b6e6
2021-05-21 08:50:52 -07:00
f39471a171 Initial Symbolic Shape Analysis (#54809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54809

I'm going to post on dev-discuss soon with a more thorough explanation of the design and advantages of this shape analysis, so I'm leaving that out for now.

There is still a ton left to do; I'm posting this initial version so we can get something on master that multiple people can work on. List of many remaining steps:

- [ ] Add symbolic shapes support
- [ ] Bind shape functions for operators in C++
- [ ] Make classes of operators share the same shape function (e.g. pointwise, broadcast two inputs)
- [ ] Refactor APIs
- [ ] Only iteratively optimize shape function while a change has been made
- [ ] Expand coverage to common ops
- [ ] Add shape analysis pass on Graph that handles Ifs and Loops
- [ ] Allow concurrent reads to the operator map
- [ ] Successive applications of the same inputs to the same shape function (e.g. a series of pointwise ops)

For this review, I am mostly looking for comments related to the implementation of symbolic_shape_analysis.cpp, with the caveats listed above. I am not really looking for comments related to api/registration/graph level analysis as those are all planned to be changed. I am fine landing this as is or waiting until necessary components of the TODOs above are finished.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27750998

Pulled By: eellison

fbshipit-source-id: 4338b99e8651df076291c6b781c0e36a1bcbec03
2021-05-21 08:49:46 -07:00
72ae924fad Added sublist support for torch.einsum (#56625)
Summary:
This PR adds an alternative way of calling `torch.einsum`. Instead of specifying the subscripts as letters in the `equation` parameter, one can now specify the subscripts as a list of integers as in `torch.einsum(operand1, subscripts1, operand2, subscripts2, ..., [subscripts_out])`. This would be equivalent to `torch.einsum('<subscripts1>,<subscripts2>,...,->[<subscript_out>]', operand1, operand2, ...)`

TODO
- [x] Update documentation
- [x] Add more error checking
- [x] Update tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56625

Reviewed By: zou3519

Differential Revision: D28062616

Pulled By: heitorschueroff

fbshipit-source-id: ec50ad34f127210696e7c545e4c0675166f127dc
2021-05-21 08:36:45 -07:00
fc804b5def Revert D28133579: [jit] Implement ScriptProfile to collect instruction profiles.
Test Plan: revert-hammer

Differential Revision:
D28133579 (034a238bab)

Original commit changeset: e7e30e961513

fbshipit-source-id: 5a7756468b4f2eeed24d2abb7b52ab46d081a95e
2021-05-21 08:18:40 -07:00
e56d3b0238 Added OpInfo tests for NNC (#58719)
Summary:
Finds a couple of bugs:

1. permute needs to wrap dimensions
2. slice needs to wrap dimensions
3. frac doesn't work correctly for negative values
4. Permute has some other failures.

This PR also fixes 1 + 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58719

Reviewed By: SplitInfinity

Differential Revision: D28590457

Pulled By: Chillee

fbshipit-source-id: a67fce67799602f9396bfeef615e652364918fbd
2021-05-21 01:41:28 -07:00
d88d321ee3 More robust slicing logic for nn.ModuleList (#58361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58361

Fixes: https://github.com/pytorch/pytorch/issues/16123

Test Plan: Imported from OSS

Reviewed By: ppwwyyxx

Differential Revision: D28464855

Pulled By: tugsbayasgalan

fbshipit-source-id: db8c41b15dbe6550035e8230dea68ce60e5a6f9a
2021-05-20 23:00:17 -07:00
b301558410 [Reducer] Remove replica size == 1 checks (#58603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58603

No longer need these checks
ghstack-source-id: 129498227

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28549893

fbshipit-source-id: a89bf8c3fc3aba311a70fd37e5a6aa5dc14b41b9
2021-05-20 22:34:23 -07:00
1d67c6d639 [DDP] Remove train call to module copies (#58595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595

No longer needed since this list is always of size 1.
ghstack-source-id: 129498229

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28548426

fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
2021-05-20 22:34:20 -07:00
88c76b43fb [Reducer] move comment to the right place (#58594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58594

This comment was misplaced after some changes, move it to the right
place.
ghstack-source-id: 129498228

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28548100

fbshipit-source-id: a9163fc3b25a9d9b8b6d4bfa2a77af290108fc09
2021-05-20 22:34:17 -07:00
d83c5a5c7f Format reducer.cpp, hpp (#58593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58593

Per title
ghstack-source-id: 129498230

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528465

fbshipit-source-id: 89e4bfcb4a0275dc17090a934d4c0a41a3c54046
2021-05-20 22:32:30 -07:00
6d97a80dd2 [fx][graph_drawer] Improve graph drawer coloring and tensor_meta handling (#58699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58699

Make `call_function`/`call_method` nodes random colors based on their target name. This coloring is stable according to the name of the target. Also handle tensor_meta more elegantly for quantized types, including printing q_scale/q_zero_point if they're used.

Test Plan: Tested locally

Reviewed By: chenccfb, 842974287

Differential Revision: D28580333

fbshipit-source-id: ad9961e1106a1bfa5a018d009b0ddb8802d2163c
2021-05-20 21:26:04 -07:00
5455df2b99 [codemod][dirsync] Apply clang-format
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D28477071

fbshipit-source-id: e844e0fad2f5599fd27e0fd113a328031cb63aa7
2021-05-20 21:23:24 -07:00
21a9334034 Revert D28497967: [quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer
Test Plan: revert-hammer

Differential Revision:
D28497967 (1cf8f7a439)

Original commit changeset: 421ce3d86fad

fbshipit-source-id: b1b290be47d847ab0e0128e3ae89f528578550ee
2021-05-20 20:56:12 -07:00
1cf8f7a439 [quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer (#58455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58455

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497967

fbshipit-source-id: 421ce3d86fadd3d92f4120b850b0167270509189
2021-05-20 20:34:47 -07:00
62adf9e1c9 [Reducer] Completely remove VariableIndex (#58592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58592

Completely removes VariableIndex from reducer code, as it is not
needed. replica_index is always 0 so simplify the code to only use the
parameter index. Next, we should also remove all of the nested data structures
that were needed when num_replicas > 1 was possible.
ghstack-source-id: 129498226

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528440

fbshipit-source-id: e0568399264ab4f86de3b7a379a4f0831f8f42e9
2021-05-20 19:47:50 -07:00
8e4fc0063a [Try] [PyTorch Edge] Trim unused code related to CUDA and HIP Interfaces (#58689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58689

This doesn't seem to be mobile related, but ends up getting called from multiple places, so is hard to get rid of entirely.
ghstack-source-id: 129413850

Test Plan: Build

Reviewed By: iseeyuan

Differential Revision: D28543374

fbshipit-source-id: 867b3e2fafdcbf6030d7029a82a2b711bcecefc5
2021-05-20 18:23:36 -07:00
773cfae93b Tag PyObject on TensorImpl per torchdeploy interpreter (#57985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57985

Fixes https://github.com/pytorch/pytorch/issues/57756

This PR introduces a new `pyobj_interpreter_` field on TensorImpl which tracks what Python interpreter (if any) owns the TensorImpl. This makes it illegal to bind a TensorImpl from multiple Python interpreters, and means that we can now directly store PyObject pointer on TensorImpl even in the presence of multiple Python interpreters, as is the case in torchdeploy. This is a necessary step for PyObject preservation, which cannot be easily implemented when there are multiple Python interpreters.

Although the PR is not that long, there is a very subtle portion of the implementation devoted to ensuring that the tagging process is thread safe, since multiple threads can concurrently try to tag a PyObject. Check Note [Python interpreter tag] and Note [Memory ordering on Python interpreter tag] for detailed discussion of how this is handled. You will have to check this code carefully in code review; I did not torture test the multithreaded paths in any meaningful way.
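
A toy model of the tagging race (names hypothetical, not the PR's code):

```
#include <atomic>

struct PyInterpreter; // opaque

std::atomic<PyInterpreter*> pyobj_interpreter{nullptr};

// Many threads may race to claim a TensorImpl for their interpreter;
// exactly one compare-exchange wins, and losers observe the winner.
bool tryTag(PyInterpreter* self) {
  PyInterpreter* expected = nullptr;
  // acq_rel: the winner publishes its writes; losers acquire them.
  return pyobj_interpreter.compare_exchange_strong(
      expected, self, std::memory_order_acq_rel);
}
```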

In a follow-up PR, I will pack the interpreter and PyObject fields into a single atomic word on 64-bit.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D28390242

Pulled By: ezyang

fbshipit-source-id: a6d9b244ee6b9c7209e1ed185e336297848e3017
2021-05-20 18:18:39 -07:00
fe8e5eb260 Change native functions to take c10::string_view args instead of std::string (#57680)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57680

Reviewed By: malfet

Differential Revision: D28511799

Pulled By: ezyang

fbshipit-source-id: 43142f994d048b28b3279ccdb7a28cbaa3190973
2021-05-20 18:15:45 -07:00
d1d24304ee [Caffe2] [Easy] Fix comment on caffe2_serialize_using_bytes_as_holder to reflect correct types
Summary:
the logic is:

```
template <typename T>
typename std::enable_if<
    std::is_same<T, bool>::value || std::is_same<T, uint8_t>::value ||
        std::is_same<T, int8_t>::value || std::is_same<T, uint16_t>::value ||
        std::is_same<T, int16_t>::value,
    void>::type
```

Test Plan: N/A

Reviewed By: simpkins

Differential Revision: D28587311

fbshipit-source-id: 970c673a9c1256600ec8bdd5f9ca53333a60d588
2021-05-20 18:03:34 -07:00
db67699ae6 [Pytorch Edge] NAME -> SCHEMA (#58604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58604

Minor bug fix. Schemas should be defined with the schema macro, not the name one.

Test Plan: ci and buck test fbsource//xplat/pytorch_models/build/cair_messaging_2021_05_17/v2:cair_messaging_2021_05_17_test

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D28549578

fbshipit-source-id: 0c64eb8c60f1aee8213a1fc1fb7231226b905795
2021-05-20 17:51:38 -07:00
0ede83db7a enable torch.cpu.amp.autocast (#57386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57386

Here is the PR for what's discussed in the RFC https://github.com/pytorch/pytorch/issues/55374 to enable autocast for the CPU device. Currently, this PR only enables BF16 as the lower precision datatype.

Changes:
1.  Enable the new API `torch.cpu.amp.autocast` for autocast on the CPU device: includes the Python API, C++ API, a new DispatchKey, etc.
2.  Consolidate the implementation for each cast policy, shared between the CPU and GPU devices.
3.  Add the operation lists to the corresponding cast policy for CPU autocast.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572219

Pulled By: ezyang

fbshipit-source-id: db3db509973b16a5728ee510b5e1ee716b03a152
2021-05-20 17:48:36 -07:00
b6dcdeacc9 [quant][graphmode][fx] Move qat_swap_modules outside of Quantizer (#58454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58454

Trying to remove Quantizer in the end

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497966

fbshipit-source-id: 800f8e4afd99918d7330345f8ae7bcf018a5bde7
2021-05-20 17:27:49 -07:00
fdc5dfdd50 [PyTorch] Migrate TI usage in ATen/native/cpu to borrowing (#58303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58303

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129471191

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28444032

fbshipit-source-id: f6a9e9effb43c273f464ef6ff410274962f3ab23
2021-05-20 17:24:13 -07:00
7c15d3206d [PyTorch] Add TI::borrowing_nullary_op and use it (#58280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58280

All core PyTorch uses of TensorIterator::nullary_op look like they can safely borrow.
ghstack-source-id: 129471193

Test Plan: Existing CI

Reviewed By: bhosmer

Differential Revision: D28429695

fbshipit-source-id: 404cf6db31e45e5cf7ae6d2f113c5a8eff6f7c3d
2021-05-20 17:22:58 -07:00
618be18a41 Enable the quantization on XPU devices (#54857)
Summary:
Enable quantization on XPU devices. Keep the model as-is if it is on an XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54857

Reviewed By: ailzhang

Differential Revision: D28501381

Pulled By: jerryzh168

fbshipit-source-id: 6d3e9b04075393248b30776c69881f957a1a837c
2021-05-20 17:02:13 -07:00
ce3788d6a5 Add #pragma once to CUDA foreach headers (#58209)
Summary:
Per the title, adding `#pragma once` to cuda headers related to foreach functions.

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58209

Reviewed By: ailzhang

Differential Revision: D28558620

Pulled By: ngimel

fbshipit-source-id: 195f68435999eb7409ba904daf6fc5f0962d375d
2021-05-20 16:35:43 -07:00
f879e70fc1 [quant][fx][graphmode][refactor] Factor out generate_qconfig_map to qconfig_utils.py (#58453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58453

Move the class method generate_qconfig_map to qconfig_utils; more PRs will follow
to move functions out of Quantizer and eventually remove the Quantizer object

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497965

fbshipit-source-id: 3c78cfe676965d20a8834a859ffed4d8e9ecade4
2021-05-20 16:26:24 -07:00
bf1c936e06 [static runtime] out variant for full_like (#58079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58079

Support full_like

Test Plan:
`buck test mode/dev caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.IndividualOps_FullLike`

Test on regenerated local inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/dec_6x/266377643_shrunk.predictor.disagg.local.regenerated.pt --pt_inputs=/data/users/ansha/tmp/adfinder/dec_6x/local_inputs --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=5000 --warmup_iters=5000 --num_threads=1 --do_profile=0 --do_benchmark=1 --adsfinder_compatibility=1 --v=1
```

`V0511 10:59:57.187054 1911683 impl.cpp:1229] Switch to out variant for node: %5571 : Tensor = aten::full_like(%blob_for_shape.1, %235, %654, %75, %75, %75, %75)`

Reviewed By: hlu1

Differential Revision: D28361997

fbshipit-source-id: 89c41e37ce23d6008cfe4d80536832ee76d3405e
2021-05-20 16:17:40 -07:00
5211eeb22b Support aten::leaky_relu for TE (#58464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58464

Test Plan:
./bin/test_tensorexpr

python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops

Reviewed By: Krovatkin

Differential Revision: D28499776

fbshipit-source-id: 20094a1bc78aa485f76aec4e065ff69e43d692d7
2021-05-20 16:12:03 -07:00
4668d09ca6 [quant][graphmode][fx] Quantize the output of statically quantized fp16 op in QuantizeHandler (#58445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58445

Previously the output of a statically quantized fp16 operator was not quantized in QuantizeHandler, which is inconsistent with
the behavior of static int8 operators and does not work well with reference functions. This PR
changes the static fp16 QuantizeHandler to quantize (i.e., call to(torch.float16)) in the QuantizeHandler, which also
makes future support for reference functions easier.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28495830

fbshipit-source-id: 2140eab8ab2dd08f6570d9e305485e3029e1f47d
2021-05-20 16:03:42 -07:00
6edd49a8e8 [Android]Removed dependency with AppCompat. (#58527)
Summary:
I build using [Bazel](https://bazel.build/).

When I use `pytorch_android` in latest Android app, I get the following error due to dependencies:

```
$ bazel build //app/src/main:app
WARNING: API level 30 specified by android_ndk_repository 'androidndk' is not available. Using latest known API level 29
INFO: Analyzed target //app/src/main:app (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/H1Gdev/android-bazel-app/app/src/main/BUILD.bazel:3:15: Merging manifest for //app/src/main:app failed: (Exit 1): ResourceProcessorBusyBox failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/bazel_tools/src/tools/android/java/com/google/devtools/build/android/ResourceProcessorBusyBox --tool MERGE_MANIFEST -- --manifest ... (remaining 11 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox ResourceProcessorBusyBox failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/bazel_tools/src/tools/android/java/com/google/devtools/build/android/ResourceProcessorBusyBox --tool MERGE_MANIFEST -- --manifest ... (remaining 11 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
Error: /home/H1Gdev/.cache/bazel/_bazel_H1Gdev/29e18157a4334967491de4cc9a879dc0/sandbox/linux-sandbox/914/execroot/__main__/app/src/main/AndroidManifest.xml:19:18-86 Error:
	Attribute application@appComponentFactory value=(androidx.core.app.CoreComponentFactory) from [maven//:androidx_core_core] AndroidManifest.xml:19:18-86
	is also present at [maven//:com_android_support_support_compat] AndroidManifest.xml:19:18-91 value=(android.support.v4.app.CoreComponentFactory).
	Suggestion: add 'tools:replace="android:appComponentFactory"' to <application> element at AndroidManifest.xml:5:5-19:19 to override.
May 19, 2021 10:45:03 AM com.google.devtools.build.android.ManifestMergerAction main
SEVERE: Error during merging manifests
com.google.devtools.build.android.AndroidManifestProcessor$ManifestProcessingException: Manifest merger failed : Attribute application@appComponentFactory value=(androidx.core.app.CoreComponentFactory) from [maven//:androidx_core_core] AndroidManifest.xml:19:18-86
	is also present at [maven//:com_android_support_support_compat] AndroidManifest.xml:19:18-91 value=(android.support.v4.app.CoreComponentFactory).
	Suggestion: add 'tools:replace="android:appComponentFactory"' to <application> element at AndroidManifest.xml:5:5-19:19 to override.
	at com.google.devtools.build.android.AndroidManifestProcessor.mergeManifest(AndroidManifestProcessor.java:186)
	at com.google.devtools.build.android.ManifestMergerAction.main(ManifestMergerAction.java:217)
	at com.google.devtools.build.android.ResourceProcessorBusyBox$Tool$5.call(ResourceProcessorBusyBox.java:93)
	at com.google.devtools.build.android.ResourceProcessorBusyBox.processRequest(ResourceProcessorBusyBox.java:233)
	at com.google.devtools.build.android.ResourceProcessorBusyBox.main(ResourceProcessorBusyBox.java:177)

Warning:
See http://g.co/androidstudio/manifest-merger for more information about the manifest merger.
Target //app/src/main:app failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.221s, Critical Path: 1.79s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
```

This is due to a conflict between `AndroidX` and the `Support Library`, on which `pytorch_android_torch` depends.
(In the case of `Gradle`, it is avoided by `android.useAndroidX`.)

I created [Android application](https://github.com/H1Gdev/android-bazel-app) for comparison.

At first, I updated `AppCompat` from the `Support Library` to `AndroidX`, but `pytorch_android` and `pytorch_android_torchvision` didn't seem to need these dependencies, so I removed them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58527

Reviewed By: xta0

Differential Revision: D28585234

Pulled By: IvanKobzarev

fbshipit-source-id: 78aa6b1525543594ae951a6234dd88a3fdbfc062
2021-05-20 15:49:19 -07:00
d84121421e [third-party] Update nccl to 2.9.8 (#58667)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58667

Reviewed By: ngimel

Differential Revision: D28577042

Pulled By: malfet

fbshipit-source-id: 62f1c67f35bf5a004852806c1a74bb068cefb79b
2021-05-20 15:42:17 -07:00
bbf92e6176 Add missing .to_sparse(ndim) gradient (#58413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46720, extends PR https://github.com/pytorch/pytorch/issues/46825 by adding test requested in [this comment](https://github.com/pytorch/pytorch/pull/46825#issuecomment-842304079).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58413

Reviewed By: ailzhang

Differential Revision: D28540550

Pulled By: albanD

fbshipit-source-id: d7e292e09b5402336c43844ee233b83b0a095035
2021-05-20 15:08:34 -07:00
8a3d9962e0 Enable ceil, floor, frac, round & trunc for BFloat16 on CUDA (#57910)
Summary:
Enable `ceil`, `floor`, `frac`, `round` & `trunc` for BFloat16 on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57910

Reviewed By: soulitzer

Differential Revision: D28579486

Pulled By: ngimel

fbshipit-source-id: 2f90354339dbccb69cea7ec9caf9b066ea13a666
2021-05-20 14:52:45 -07:00
034a238bab [jit] Implement ScriptProfile to collect instruction profiles. (#57397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57397

Introduces two main classes in C++ runtime:

ScriptProfile is the implementation for enabling and disabling interpreter
profiling in C++. This should only be used from Python, and we will add a
corresponding Python API in the next diff.

InstructionSpan is a utility class to instrument execution of each single
instruction. A start timestamp is recorded in the constructor, and an end
timestamp is recorded in the destructor. During destruction, this will send
runtime data to all enabled ScriptProfile instances.

Test Plan:
build/bin/test_jit --gtest_filter='ScriptProfileTest.Basic'

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28133579

fbshipit-source-id: e7e30e96151367022793ab3ad323f01c51ad4a3b
2021-05-20 14:11:03 -07:00
e8c6a65074 Adds grid_sampler to autocast fp32 list for 1.9 (#58679)
Summary:
Temporary fix for https://github.com/pytorch/pytorch/issues/42218.

Numerically, grid_sampler should be fine in fp32 or fp16. So grid_sampler really belongs on the promote list. But performancewise, native grid_sampler backward kernels use gpuAtomicAdd, which is notoriously slow in fp16. So the simplest functionality fix is to put grid_sampler on the fp32 list.

In https://github.com/pytorch/pytorch/pull/58618 I implement the right long-term fix (refactoring kernels to use fp16-friendly fastAtomicAdd and moving grid_sampler to the promote list). But that's more invasive, and for 1.9 ngimel says this simple temporary fix is preferred.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58679

Reviewed By: soulitzer

Differential Revision: D28576559

Pulled By: ngimel

fbshipit-source-id: d653003f37eaedcbb3eaac8d7fec26c343acbc07
2021-05-20 14:05:09 -07:00
691c139144 Do not use TF32 matmul in linalg and DDP tests (#56114)
Summary:
This PR does several things to relax test tolerance

- Do not use TF32 in cuda matmul in test_c10d. See https://github.com/pytorch/pytorch/issues/52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See https://github.com/pytorch/pytorch/issues/50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
2021-05-20 14:01:19 -07:00
a7f06e1e55 Added statistic related to out variant nodes
Summary: added more statistics for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
2021-05-20 13:57:07 -07:00
056287aec4 turn off deadline for adagrad test
Summary: Tests are frequently failing with "exceeded the deadline of 1000.00ms"; we expect this to happen, so remove the deadline

Test Plan: N/A: Fix breakages

Reviewed By: robieta

Differential Revision: D28581051

fbshipit-source-id: 4825ada9af151fa5d57c45c549138c15ba613705
2021-05-20 13:47:02 -07:00
9db64e6e56 Revert "Striding for lists Part 2 (#49352)" (#58523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58523

This reverts commit fee7e8b91d4434b976a339330bfa89bd827ab9ec.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28528023

Pulled By: tugsbayasgalan

fbshipit-source-id: 9fa1d86f0c81fcc6fd3798e0d51a712a3c9b3952
2021-05-20 13:20:33 -07:00
9123229684 Cleanup functional.py after lu_unpack was removed (#58669)
Summary:
Remove code in functional.py that became unused after PR c790fd2bf8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58669

Reviewed By: driazati

Differential Revision: D28572377

Pulled By: heitorschueroff

fbshipit-source-id: c90d80ead5f3d69100667488bc6b14ef54b95b54
2021-05-20 13:06:30 -07:00
0e1bed364d [nnc] Use int64 to compute matmul flops heuristic (#58676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58676

We only generate asm for small matmuls, but we were computing the # of
flops using an int32, which is too small.
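
The bug class, in miniature:

```
#include <cstdint>

int64_t matmulFlops(int32_t m, int32_t n, int32_t k) {
  // Widen before multiplying: `2 * m * n * k` evaluates in int32 and
  // overflows for moderately sized matmuls (2048^3 already > 2^31).
  return int64_t{2} * m * n * k;
}
```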

Test Plan:
```
buck test mode/dev //caffe2/test:static_runtime -- --exact 'caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule)'
```

Reviewed By: navahgar

Differential Revision: D28562157

fbshipit-source-id: a07ceba5209ef6022ead09140380c116994755cf
2021-05-20 13:05:21 -07:00
a60ce98a2e Remove opinfo warning from floor_divide (#58682)
Summary:
This warning causes downstream users of OpInfo to error when they use this OpInfo, unless they actually run the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58682

Reviewed By: mruberry

Differential Revision: D28577334

Pulled By: Chillee

fbshipit-source-id: f10e64f8ad3fb50907531d8cb89ce5b0d06ac076
2021-05-20 12:57:58 -07:00
1981904c8d [Static Runtime] Check input container type in aten::__getitem__ (#58639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58639

Fix two tests in `//caffe2/test:static_runtime` that were previously broken.

Reviewed By: ajyu, edvgha

Differential Revision: D28561185

fbshipit-source-id: 3cfb0960666c808523d65da267f70bd51e828313
2021-05-20 12:47:01 -07:00
84500d03d2 .github: Upload /download large artifacts to s3 (#58506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58506

We were experiencing 500 errors when it came to downloading large
artifacts, so let's just use S3 for those larger artifacts just in case

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D28520792

Pulled By: seemethere

fbshipit-source-id: 3aa15c4872fe46c9491ac31dc969bf71175378aa
2021-05-20 11:52:05 -07:00
151ec56311 ENH Adds check for input sizes in cosine_similarity (#58559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55273

Adds check for input sizes to be consistent with the docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58559

Reviewed By: soulitzer

Differential Revision: D28562376

Pulled By: ailzhang

fbshipit-source-id: f292e8a26f11a40d146fbed94a28025794808216
2021-05-20 11:40:06 -07:00
3c55db8065 Add Deploy to PredictorContainer (#58503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58503

add gflags to force using deploy for torchscript models

Test Plan: Add parametrization to PredictorContainer test to exercise gflag override and test deploy codepath.  Add test case to exercise new torch.package codepath.

Reviewed By: suo

Differential Revision: D28246793

fbshipit-source-id: 88a2c8322c89284e3c8e14fee5f20e9d8a4ef300
2021-05-20 11:29:31 -07:00
1fc3e1e1fb Abladawood patch 1 (#58496)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58496

Reviewed By: soulitzer

Differential Revision: D28562333

Pulled By: ailzhang

fbshipit-source-id: aa9fcc03ba7ffe03db6cc5da353d37d679a0a160
2021-05-20 10:32:18 -07:00
5152cf8647 masked_scatter thrust->cub (#56750)
Summary:
Benchmark:

```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M):
    a = torch.randn(M, device='cuda')
    m = torch.randint(1, (M,), dtype=torch.long, device='cuda').bool()
    v = torch.randn(M, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.masked_scatter_(m, v))

for M in (100, 1000, 100000, 10000000):
    print(M)
    benchmark(M)
```

Before:
```
100
8.65 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.75 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
9.27 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
33.6 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

After
```
100
8.04 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.09 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
8.63 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
31.9 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56750

Reviewed By: ailzhang

Differential Revision: D28547564

Pulled By: ngimel

fbshipit-source-id: 83aeddfaf7023f9f9501c6b1e2faf91e8b6277b1
2021-05-20 10:27:58 -07:00
4942fe0290 [DataLoader] Introduce MapMapDataPipe functional datapipe (#58258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58258

As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the `MapMapDataPipe` functional datapipe for the `MapDataPipe` class.

Usage:
```
def fn(x):
    return x * 10

dp = CountingDataset(n=10)
dp.map(fn)
```
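
For a self-contained flavor of the semantics, a hypothetical minimal stand-in (not the real datapipe classes):
```python
class CountingDataset:
    # Hypothetical stand-in for a MapDataPipe: index -> value.
    def __init__(self, n):
        self.n = n
    def __getitem__(self, i):
        return i
    def __len__(self):
        return self.n

class MapMap:
    # Applies fn lazily on each lookup, mirroring dp.map(fn).
    def __init__(self, source, fn):
        self.source, self.fn = source, fn
    def __getitem__(self, i):
        return self.fn(self.source[i])
    def __len__(self):
        return len(self.source)

dp = MapMap(CountingDataset(10), lambda x: x * 10)
print(dp[3])  # 30
```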

Reviewed By: ejguan

Differential Revision: D28394510

fbshipit-source-id: 8d71b1f5723dff52385c3ce753944304896af678
2021-05-20 09:00:21 -07:00
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

1. Enable the passthrough autograd function when find_unused_parameters=True.
2. With the above, move prepare_for_backward, which does the unused-parameter-checking logic, to the beginning of the backwards pass, only when find_unused_parameters=True.
3. Enhance the process of unused parameter checking to account for outputs not being used in the loss.

The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.

Tested by ensuring that when a model output is not used in loss, the corresponding grad is not modified. Also verified that the grads are the same in local vs DDP training case. Also verified that gradients are not touched in this case, i.e. if grad is originally None, it stays as None, not zero, after.

Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
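
As a minimal local sketch of the scenario being supported (no DDP here; the module and shapes are made up for illustration):
```python
import torch
import torch.nn as nn

class TwoHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)

    def forward(self, x):
        return self.used(x), self.unused(x)

model = TwoHead()
out_used, out_unused = model(torch.randn(2, 4))
loss = out_used.sum()  # the second output never participates in the loss
loss.backward()

# The unused head receives no gradient; under DDP with
# find_unused_parameters=True, the reducer now leaves such grads
# untouched (None stays None) rather than zeroing them.
print(model.unused.weight.grad)  # None
```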
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
abb215e229 Fix dtype inference in sparse_csr_tensor_ctor (#58631)
Summary:
`NULL` return from `PyObject_GetAttrString` should never be ignored without handling the exception, as the behavior of subsequent Python C API calls is undefined until `PyErr_Fetch` or `PyErr_Clear` is called.

This accidentally leads to the `list` type being incorrectly identified as `Tensor`.
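
A minimal sketch of the Python-level symptom (the values here are illustrative; see the linked issue for the original reproducer):
```python
import torch

# Constructing from plain lists should infer the dtype from the values,
# not misidentify the lists as Tensors.
t = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [1, 2, 3, 4])
print(t.dtype)  # torch.int64 on builds with the fix
```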

Fixes https://github.com/pytorch/pytorch/issues/58520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58631

Reviewed By: albanD

Differential Revision: D28559454

Pulled By: malfet

fbshipit-source-id: 46f044b5f0f94264779a6108474d04a8ba851c53
2021-05-20 08:02:05 -07:00
9ac0bd23a2 Fix bug in test_fx_experimental codegen (#58587)
Summary:
This PR fixes a bug in test_fx_experimental where code generated for ops with kwarg-only Tensor parameters would fail to execute because they would be called as positional parameters.
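
A self-contained sketch of the bug class, using a hypothetical op rather than the actual generated code:
```python
import torch

def fake_op(x, *, weight):  # `weight` is a keyword-only Tensor parameter
    return x * weight

t = torch.ones(2)
# fake_op(t, t)  # TypeError: fake_op() takes 1 positional argument but 2 were given
print(fake_op(t, weight=t))  # generated code must pass kwarg-only params by keyword
```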

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58587

Reviewed By: ailzhang

Differential Revision: D28548365

Pulled By: heitorschueroff

fbshipit-source-id: 8f1746053cbad1b11e817b0099db545d8dd22232
2021-05-20 07:49:08 -07:00
bf00d26deb Enables builds with Compute Library backend for oneDNN (#55913)
Summary:
Since v1.7, oneDNN (MKL-DNN) has supported the use of Compute Library
for the Arm architecture to provide optimised convolution primitives
on AArch64.

This change enables the use of Compute Library in the PyTorch build.
Following the approach used to enable the use of CBLAS in MKLDNN,
it is enabled by setting the env vars USE_MKLDNN and USE_MKLDNN_ACL.
The location of the Compute Library build must be set using `ACL_ROOT_DIR`.

This is an extension of the work in https://github.com/pytorch/pytorch/pull/50400
which added support for the oneDNN/MKL-DNN backend on AArch64.

_Note: this assumes that Compute Library has been built and installed at
ACL_ROOT_DIR. Compute library can be downloaded here:
`https://github.com/ARM-software/ComputeLibrary`_

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55913

Reviewed By: ailzhang

Differential Revision: D28559516

Pulled By: malfet

fbshipit-source-id: 29d24996097d0a54efc9ab754fb3f0bded290005
2021-05-20 07:43:56 -07:00
145a6f7985 DOC Adds code comment to clarify nn.Linear.reset_parameters (#58487)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57109

Adds comment to clarify `a=sqrt(5)` in `nn.Linear.reset_parameters`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58487

Reviewed By: ailzhang

Differential Revision: D28548391

Pulled By: jbschlosser

fbshipit-source-id: 2d5910b2576a04f19edbd8b8515cdb55fc249ce5
2021-05-20 06:15:47 -07:00
5caccbe39e [pkg] Catch exceptions where dependency resolution gets invalid imports (#58573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58573

Users can create invalid imports, like:
```
# in a top-level package
if False:
  from .. import foo
```

Since this code is never executed, it will not cause the module to fail to
load. But our dependency analysis walks every `import` statement in the AST,
and will attempt to resolve the (incorrectly formed) import, throwing an exception.

For posterity, the code that triggered this: https://git.io/JsCgM

Differential Revision: D28543980

Test Plan: Added a unit test

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: 03b7e274633945b186500fab6f974973ef8c7c7d
2021-05-19 23:04:21 -07:00
703f24397b [pkg] simplifications to broken dependency handling (#58572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58572

Right now, we have three categories of error (broken, denied, unhandled). This
PR unifies them into a single "error" field in the node, with optional context.
It also generalizes how formatting of the error in PackagingError occurs.

Differential Revision: D28543982

Test Plan: sandcastle

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: d99d37699ec2e172e3798763e60aafe9a66ed6f4
2021-05-19 23:03:12 -07:00
c4f0c5ee50 Quote in setup-ci-env (#58637)
Summary:
Do not put quotes around arguments that do not have spaces in them in add_to_env_file.

The ENV file is used both by bash and by Docker; Docker does not strip the
quotes when they are present.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58637

Reviewed By: wconstab

Differential Revision: D28561159

Pulled By: malfet

fbshipit-source-id: 0843aad22703b6c3adebeb76175de1cfc1a974b5
2021-05-19 22:20:13 -07:00
8615fd65e3 Fix GIL issue when acquiring multiple sessions. (#58584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58584

Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy

Reviewed By: wconstab

Differential Revision: D28545314

fbshipit-source-id: 45cb0e4d80d4766ec1aed6a51679af3424cb0878
2021-05-19 22:05:52 -07:00
24786bd6ef Make torch::deploy work with or without cuda (#58493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58493

In fbcode, we want torch::deploy to be a target that works with or without cuda, depending only on whether cuda is linked in the final binary.  To enable this, we build both flavors of libinterpreter,  and choose which to load at runtime depending on whether cuda is available in the application.  This comes at a cost to binary size, as it includes two copies of libinterpreter instead of one.  However, it does not require _loading_ two copies of libinterpreter into memory at runtime, so the memory footprint of the interpreter (which we make N copies of) is not impacted.

In oss/cmake, this change is a no-op.  cuda is already handled there by building just one libinterpreter, but building cuda or not for the whole pytorch build based on a global cmake flag.

Test Plan: test in fbcode with new gpu mode unit tests, verify existing oss CI passes

Reviewed By: suo

Differential Revision: D28512178

fbshipit-source-id: 61354bf78b1932605a841388fcbc4bafc0c4bbb4
2021-05-19 21:44:23 -07:00
fbc235c226 port sgn to structured (#58197)
Summary:
https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58197

Reviewed By: ejguan

Differential Revision: D28416538

Pulled By: ezyang

fbshipit-source-id: bd78172ff4b11bfc69304c426d5817a47bcbb567
2021-05-19 20:10:01 -07:00
b5e39bceec Port fmax & fmin to structured kernel (#58458)
Summary:
Port fmax & fmin to structured kernel
Related https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58458

Reviewed By: ailzhang

Differential Revision: D28509263

Pulled By: ezyang

fbshipit-source-id: 3fccb46746e5c0695fe8fa498ce32f8ab4609f04
2021-05-19 20:06:06 -07:00
e179a56839 [FX Splitter] dump final graph and print operator stats via to_glow API
Summary:
- dump final graph in glow
- print operator stats via to_glow API
   - 1) node stats for final glow graph
   - 2) operator stats in TorchGlowBackend for torch::jit::graph to lower

Reviewed By: khabinov

Differential Revision: D28444501

fbshipit-source-id: 743755c320071edc4c045ad004adeb16b4a9c323
2021-05-19 19:16:19 -07:00
9a622f4cd9 refactor ASGD to use functional API (#58410)
Summary:
Functional API is used in large scale distributed training to enable multithreaded training instead of multiprocess, as it gives more optimal resource utilization and efficiency.

In this PR, we provide code migration and refactoring for functional API for ASGD algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58410

Reviewed By: ailzhang

Differential Revision: D28546702

Pulled By: iramazanli

fbshipit-source-id: 4f62b6037d53f35b19f98340e88af2ebb6243a4f
2021-05-19 18:55:52 -07:00
208b36f109 remove redundant getDispatchKeySetUnboxed(eligibleKeys) (#58535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58535

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28531377

Pulled By: bhosmer

fbshipit-source-id: ade1427c8c9ada10ecdc69ef80c5d90be23f5787
2021-05-19 17:08:03 -07:00
47c566ebb1 Rename namespace vec256 to vec, struct Vec256 to Vectorized (and other related classes/structs) (#58438)
Summary:
To make it more convenient for maintainers to review the ATen AVX512 implementation, this PR renames the namespace `vec256` to `vec`. Modifying 77 files & creating 2 new files only took a few minutes, as these changes aren't significant, and fewer files will have to be reviewed while reviewing https://github.com/pytorch/pytorch/issues/56992.
The struct `Vec256` is being renamed to `Vectorized` rather than `Vec`, because there are some `using Vec=` statements in the codebase, which makes `Vectorized` the more convenient choice. However, I can still rename it to `Vec`, if required.

### Changes made in this PR -
Created `aten/src/ATen/cpu/vec` with subdirectory `vec256` (vec512 would be added via https://github.com/pytorch/pytorch/issues/56992).
The changes were made in this manner -

1. First, a script was run to rename `vec256` to `vec` & `Vec` to `Vectorized` -
```
# Ref: https://stackoverflow.com/a/20721292
cd aten/src
grep -rli 'vec256\/vec256\.h' * | xargs -i@ sed -i 's/vec256\/vec256\.h/vec\/vec\.h/g' @
grep -rli 'vec256\/functional\.h' * | xargs -i@ sed -i 's/vec256\/functional\.h/vec\/functional\.h/g' @
grep -rli 'vec256\/intrinsics\.h' * | xargs -i@ sed -i 's/vec256\/intrinsics\.h/vec\/vec256\/intrinsics\.h/g' @
grep -rli 'namespace vec256' * | xargs -i@ sed -i 's/namespace vec256/namespace vec/g' @
grep -rli 'Vec256' * | xargs -i@ sed -i 's/Vec256/Vectorized/g' @
grep -rli 'vec256\:\:' * | xargs -i@ sed -i 's/vec256\:\:/vec\:\:/g' @
grep -rli 'at\:\:vec256' * | xargs -i@ sed -i 's/at\:\:vec256/at\:\:vec/g' @
cd ATen/cpu
mkdir vec
mv vec256 vec
cd vec/vec256
grep -rli 'cpu\/vec256\/' * | xargs -i@ sed -i 's/cpu\/vec256\//cpu\/vec\/vec256\//g' @
grep -rli 'vec\/vec\.h' * | xargs -i@ sed -i 's/vec\/vec\.h/vec\/vec256\.h/g' @
```

2. `vec256` & `VEC256` were replaced with `vec` & `VEC` respectively in 4 CMake files.

3. In `pytorch_vec/aten/src/ATen/test/`, `vec256_test_all_types.h` & `vec256_test_all_types.cpp` were renamed.

4. `pytorch_vec/aten/src/ATen/cpu/vec/vec.h` & `pytorch_vec/aten/src/ATen/cpu/vec/functional.h` were created.
Both currently have one line each & would have 5 when AVX512 support would be added for ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58438

Reviewed By: malfet

Differential Revision: D28509615

Pulled By: ezyang

fbshipit-source-id: 63840df5f23b3b59e203d25816e2977c6a901780
2021-05-19 16:04:36 -07:00
a6b358d53b Revert D28461013: [nnc] Enable CPU fusion inside Facebook, take 2
Test Plan: revert-hammer

Differential Revision:
D28461013 (c76405d3b1)

Original commit changeset: 79a80b6ffb65

fbshipit-source-id: d9cc5c512542153f39664635fb080d797a9de7d0
2021-05-19 15:27:38 -07:00
36adc3f04d [FX] Add APIs to mutate specific args/kwargs (#58571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58571

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D28543359

Pulled By: jamesr66a

fbshipit-source-id: 44812d04886e653b5439c880dd831ecbc893fe23
2021-05-19 14:54:16 -07:00
296d2a4399 [THC] Rename THCTensorMathMagma from cu to cpp (#58521)
Summary:
This is supposed to be a no-op (as the .cu file does not contain any CUDA code)
that reduces compilation time 2.5x:
```
$ time /usr/local/cuda/bin/nvcc /home/nshulga/git/pytorch/aten/src/THC/THCTensorMathMagma.cu -c ...
real	0m7.701s
$ time /usr/local/cuda/bin/nvcc /home/nshulga/git/pytorch/aten/src/THC/THCTensorMathMagma.cpp -c ...
real	0m2.657s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58521

Reviewed By: ngimel

Differential Revision: D28526946

Pulled By: malfet

fbshipit-source-id: ed42a9db3349654b75dcf63605bb4256154f01ff
2021-05-19 14:26:21 -07:00
ae99640a78 Added publishing of test results and minor fixes to Az DevOps Build Logic (#58436)
Summary:
This PR adds the ability to publish the xml test data of custom PyTorch PR tests. This PR also adds a few fixes to the custom PyTorch PR tests logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58436

Reviewed By: seemethere, mruberry

Differential Revision: D28512958

Pulled By: malfet

fbshipit-source-id: d3a1a251d3d126c923d5f733dccfb31a4b701b7e
2021-05-19 14:17:48 -07:00
b9b8522e00 [profile] fix recorded data type (#58531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58531

fix data type of alltoall(v) when recording communication metadata via DebugInfo in NCCL PG

Reviewed By: chaekit

Differential Revision: D28529372

fbshipit-source-id: 2917653f73f5fe4f6dc901803235994ca042bba2
2021-05-19 14:14:54 -07:00
8de8b492f7 Revert "Move Azure MultiGPU tests back to nightly (#58242)" (#58451)
Summary:
This reverts commit 2afcb7e8fde0476db2e32feae9a80e36f23c1b19.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58451

Reviewed By: ailzhang

Differential Revision: D28497920

Pulled By: malfet

fbshipit-source-id: 7e9e4f1e3e6e46d8d2a4cba2e6147e0b50d27f6d
2021-05-19 13:55:26 -07:00
3113a1de4a Fix some tensor operators to return NotImplemented for invalid inputs (#58216)
Summary:
Same as https://github.com/pytorch/pytorch/issues/57934. (cc/ albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58216

Reviewed By: ailzhang

Differential Revision: D28494886

Pulled By: albanD

fbshipit-source-id: 380205867ee1cde90e1c6fcfe2a31749e1243530
2021-05-19 13:09:57 -07:00
6c70cbedb6 step 0 of cuDNN v8 convolution API integration (#51390)
Summary:
This PR is step 0 of adding PyTorch convolution bindings using the cuDNN frontend. The cuDNN frontend is the recommended way of using the cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find a specific kernel has a bug, they can report it, that kernel will be blocked in the cuDNN frontend, and frameworks can just update the submodule without waiting for a whole cuDNN release.

The work is not complete, and this PR is only step 0.

**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and running `test_nn.py`. All tests pass except those mentioned below.

**What this PR doesn't:**
- Only convolution forward, no backward. The backward will use v7 API.
- No 64-bit-indexing support for some configurations. This is a known issue of cuDNN and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue; instead, the v8 API should be disabled on problematic cuDNN versions.
- No test beyond PyTorch's unit tests.
  - Not tested for correctness on real models.
  - Not benchmarked for performance.
- Benchmark cache is not thread-safe. (This is marked as `FIXME` in the code, and will be fixed in a follow-up PR)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:
  ```
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
  FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
  FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  ```

Although this is not a complete implementation of cuDNN v8 API binding, I still want to merge this first. This would allow me to do small and incremental work, for the ease of development and review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51390

Reviewed By: malfet

Differential Revision: D28513167

Pulled By: ngimel

fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
2021-05-19 12:54:09 -07:00
954d39ba38 [ATen][Quant] Pass at::Tensor by reference (#58284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58284

- Passing at::Tensor by value can incur a lot of refcount-bump overhead. Passing by reference is much more efficient.
- Use Tensor::expect_contiguous() where possible to remove refcount bump overhead when input tensor is already contiguous.

Reviewed By: supriyar, swolchok

Differential Revision: D28432300

fbshipit-source-id: 089ceed08f0d54f109e441f8a1314d726e8481ce
2021-05-19 12:36:50 -07:00
a91375432a model_dump: Accept variable-length debug info (#57660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57660

Ignore trailing elements so we're compatible with both old and new
models.

Test Plan: Dumped and old model.  Unit test.

Reviewed By: malfet

Differential Revision: D28531391

Pulled By: dreiss

fbshipit-source-id: 197a55ab0e6a7d8e25cbee83852e194afacc988e
2021-05-19 12:25:27 -07:00
ab1fdbefe1 model_dump: Use DumpUnpickler.load instead of .dump (#57659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57659

Faster since we don't do an automatic pprint, and shorter, simpler code.

Test Plan: Dumped some models.

Reviewed By: malfet

Differential Revision: D28531398

Pulled By: dreiss

fbshipit-source-id: 47f1f646d4576af9f7e680933e0512f616dab5c0
2021-05-19 12:25:25 -07:00
53078924ad model_dump: Add a section that summarizes tensor memory usage (#57658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57658

Since there is no Python change here and we only do the analysis when
rendering the open section, this should have no impact on page size or
load time!  (Well, a constant impact on page size due to the added
code.)  Before I made it lazy, I observed that it increased load time by
over 100ms for a large model.

Test Plan: Dumped a CUDA model and saw the size summary.

Reviewed By: malfet

Differential Revision: D28531394

Pulled By: dreiss

fbshipit-source-id: f77012b7bab069de861a4ba23486c665e1306aa0
2021-05-19 12:25:23 -07:00
ef4e6036bc model_dump: Handle dict rendering (#57657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57657

Test Plan: Clicked around a model with some dicts in it.

Reviewed By: malfet

Differential Revision: D28531397

Pulled By: dreiss

fbshipit-source-id: 069690f147e91eadd76fec5f5ca4eec057abcb98
2021-05-19 12:25:21 -07:00
72ff3163bd model_dump: Handle torch.device objects (#57656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57656

This came up when dumping a CUDA model.

Test Plan: Dumped a CUDA model.

Reviewed By: malfet

Differential Revision: D28531396

Pulled By: dreiss

fbshipit-source-id: fe0e94248c8085a8b760d253ba0b517f153b3442
2021-05-19 12:25:19 -07:00
a380575f5b model_dump: Refactor renderTensor into a helper method (#57655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57655

Now lots of code is shared between tensor and qtensor rendering.  Net
lines of code is actually +1, but it should result in a savings if/when
we implement some of those todos.

Test Plan: Clicked around in Chrome.

Reviewed By: malfet

Differential Revision: D28531395

Pulled By: dreiss

fbshipit-source-id: 190a04ed587b54d27f3410246763cd636c0634be
2021-05-19 12:25:17 -07:00
3ff76af23c model_dump: Implement "Hider" properly (#57654)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57654

I learned how to use children in React/Preact. :)  Now it's not
necessary to give every hidable section its own id and synchonize the
"shown=false" with "style='display:none;'".

This also means that the hidden elements aren't rendered to the DOM
unless the hider is open.

Test Plan: Clicked around in Chrome.

Reviewed By: malfet

Differential Revision: D28531393

Pulled By: dreiss

fbshipit-source-id: bc86c823ae4b7e80c000f50c5429d89dff6ae64d
2021-05-19 12:23:59 -07:00
3f0b081636 move code to Blas.cpp, clean up THC magma (#58526)
Summary:
To improve compilation times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58526

Reviewed By: malfet

Differential Revision: D28540035

Pulled By: ngimel

fbshipit-source-id: 01a6b1e2b12aa246c5ecfa810ad4e87bde040553
2021-05-19 12:04:18 -07:00
703cfdc9ed [JIT] improve documentation (#57991)
Summary:
* Fix lots of links.
* Minor improvements for consistency, clarity or grammar.
* Update jit_python_reference to note the limitations on __exit__.
  (Related to https://github.com/pytorch/pytorch/issues/41420).
* Fix a comment in exit_transforms.cpp: removed the word "not" which
  made the comment say the opposite of the truth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57991

Reviewed By: malfet

Differential Revision: D28522247

Pulled By: SplitInfinity

fbshipit-source-id: fc63a59d19ea6c89f957c9f7d451be17d1c5fc91
2021-05-19 11:47:32 -07:00
79a258f448 s/foward/forward/g (#58497)
Summary:
Annoying typo.

Prompted by these profiling results: https://github.com/pytorch/pytorch/issues/56419#issuecomment-825787828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58497

Reviewed By: malfet

Differential Revision: D28521081

Pulled By: Chillee

fbshipit-source-id: ab91a2e167dd7d3387fd56106a6cff81f7a32f10
2021-05-19 11:42:42 -07:00
ccad77aa22 Added OperatorMap for mapping Operator to any template <T> (#58060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58060

A generic way to check whether an Operator belongs to a predefined map and, if so, to access the map value via public method(s). In general, the value can be anything, for example the Operator's schema.

Test Plan: buck test caffe2/test/cpp/jit:jit -- OperatorMap

Reviewed By: Krovatkin

Differential Revision: D28357933

fbshipit-source-id: ba3248cf06c07f16aebafccb7ae71c1245afb083
2021-05-19 11:38:49 -07:00
1ba05efd26 [Reducer] Remove some unused variables (#58524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58524

Per title
ghstack-source-id: 129311600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28528223

fbshipit-source-id: 239a15de4b602e35ed9b15b8a4bea3c28b61de12
2021-05-19 09:55:04 -07:00
4cf9b11022 Fix issues regarding binary_checkout (#58558)
Summary:
Cherry-pick of https://github.com/pytorch/pytorch/issues/58495 back to master

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes https://github.com/pytorch/pytorch/issues/58557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58558

Reviewed By: albanD

Differential Revision: D28538867

Pulled By: malfet

fbshipit-source-id: 3517d8729df7c0c0a221d26f6966c8dcef2f3076
2021-05-19 08:24:34 -07:00
baf05c3f5e Split CUDA SpectralOp (#58459)
Summary:
Move all cuFFT related parts to SpectralOps.cpp
Leave only _fft_fill_with_conjugate_symmetry_cuda_ in SpectralOps.cu

Keep `CUDAHooks.cpp` in torch_cuda_cpp by introducing `at::cuda::detail::THCMagma_init` functor and registering it from global constructor in `THCTensorMathMagma.cu`

Move entire detail folder to torch_cuda_cpp library.

This is a no-op that helps greatly reduce binary size for CUDA-11.x builds by avoiding cufft/cudnn symbol duplication between torch_cuda_cpp(that makes most of cuFFT calls) and torch_cuda_cu (that only needed it to compile SpectralOps.cu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58459

Reviewed By: ngimel

Differential Revision: D28499001

Pulled By: malfet

fbshipit-source-id: 425a981beb383c18a79d4fbd9b49ddb4e5133291
2021-05-19 07:59:03 -07:00
029bec4505 [lint] Fix uninitialized variable lint error in Module.cpp (#58499)
Summary:
This PR fixes two uninitialized variable lint warnings in `Module.cpp` by initializing them to `nullptr`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58499

Reviewed By: driazati, samestep

Differential Revision: D28519192

Pulled By: 1ntEgr8

fbshipit-source-id: 293cd4b296eea70b72adf02cd73f354063b124c6
2021-05-19 07:55:24 -07:00
b45a105acb Automated submodule update: tensorpipe (#58477)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: a0c6aa1422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58477

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D28506522

fbshipit-source-id: 2da92feae212a568cfe441d33e4966ffe6c182e5
2021-05-19 05:49:29 -07:00
4d7abdbdad [Quant] Add out variant for int8 quantized::linear (#58282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58282

Reviewed By: ajyu

Differential Revision: D28428734

fbshipit-source-id: f25243cdbc220e59659605a3a29e2b161dd7c1f2
2021-05-19 00:24:23 -07:00
c76405d3b1 [nnc] Enable CPU fusion inside Facebook, take 2 (#58347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347

Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584

Test Plan: Tests for bugfix included in this stack

Reviewed By: navahgar

Differential Revision: D28461013

fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
2021-05-18 21:45:48 -07:00
dcfc2050bd VaryingShape<Strides>::isComplete() needs to consider whether each Stride is complete (#58510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58510

In a case that I don't fully understand, we're getting a stride that is:
```
{2:1, 1:1, 0:*}
```
(in this debug output, M:N means stride index M, stride value N).  This shape
should be considered incomplete, since we don't actually know the values of the
stride, but VaryingShape::isComplete considers it complete because it only
checks the presence of elements in the vector, not whether those elements are
themselves complete.
ghstack-source-id: 129279583

Test Plan:
new unit test in test/cpp/jit

To see the failure in the context of a real model:
```
./fblearner/predictor/loadgen/download-requests.sh 272478342_0 10 ~/local/requests/272478342_0.recordio

buck-out/gen/fblearner/predictor/loadgen/replay_model_requests --model_id=272478342_0 --replay_record_source=recordio:/data/users/bertrand/requests/272478342_0.recordio --remote_port=9119 --output_file=/data/users/bertrand/responses/272478342_0_actual.recordio --output_type=recordio

buck-out/gen/fblearner/predictor/loadgen/replay_model_requests --model_id=272478342_0 --replay_record_source=recordio:/data/users/bertrand/requests/272478342_0.recordio --remote_port=9119 --output_file=/data/users/bertrand/responses/272478342_0_actual.recordio --output_type=recordio
```

Reviewed By: Krovatkin

Differential Revision: D28520062

fbshipit-source-id: 3ca900337d86480a40fbd90349a698cbb2fa5f11
2021-05-18 21:45:46 -07:00
3d20ddfe92 [nnc] Do not fuse unsqueeze with variable dim (#58346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58346

If `dim` is a variable, NNC doesn't know how to translate the result,
since the shape is unknown.  This issue manifested as a `bad_variant_access`
when we try to pull an int constant out of that arg.

Note that, while the PE will pick up the resultant shape, it won't set guards accordingly.
ghstack-source-id: 129078971

Test Plan: new fuser test

Reviewed By: navahgar

Differential Revision: D28460956

fbshipit-source-id: 57ef918ef309ee57bfdf86717b910b6549750454
2021-05-18 21:44:37 -07:00
2ddd841635 [nnc] Make the pretty printer prettier (#57874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57874

Before:
```
{
  for (int v = 0; v < 100; v++) {
    aten_sin[v] = sin(x_1[v]);
  }
{
    sum = float(0);
    for (int v_1 = 0; v_1 < 100; v_1++) {
      sum = ReduceOp((sum) + float(aten_sin[v_1]), reduce_args={});
    }
  }  for (int v_2 = 0; v_2 < 100; v_2++) {
    aten_cos[v_2] = cos(x_1[v_2]);
  }
  for (int v_3 = 0; v_3 < 100; v_3++) {
    aten_mul[v_3] = (_tensor_constant0[v_3]) * (aten_cos[v_3]);
  }
}
```

After:
```
{
  for (int v = 0; v < 100; v++) {
    aten_sin[v] = sin(x_1[v]);
  }
  {
    sum = float(0);
    for (int v_1 = 0; v_1 < 100; v_1++) {
      sum = ReduceOp((sum) + float(aten_sin[v_1]), reduce_args={});
    }
  }
  for (int v_2 = 0; v_2 < 100; v_2++) {
    aten_cos[v_2] = cos(x_1[v_2]);
  }
  for (int v_3 = 0; v_3 < 100; v_3++) {
    aten_mul[v_3] = (_tensor_constant0[v_3]) * (aten_cos[v_3]);
  }
}
```

Test Plan: Imported from OSS

Reviewed By: navahgar, malfet

Differential Revision: D28455842

Pulled By: bertmaher

fbshipit-source-id: 6d5ca9be12afd66a9ba32c129a3f4d618247cd35
2021-05-18 18:26:58 -07:00
3a3959d253 [jit] Add a utility class SourceRef to represent Source as keys (#57396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57396

A new type SourceRef is introduced to represent a unique identifier to source
text. The type holds refcount to underlying source, and supports comparators
and hash functions, such that it can be used in C++ and Python maps. In later
diffs we will use this to aggregate and print profiling information.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28133578

fbshipit-source-id: c3d5199a8269c5006c85a145b281bcaaf3e2dc1c
2021-05-18 18:20:53 -07:00
0362b753db [BE] Use __func__ as checkAllSameGPU() 1st arg (#58502)
Summary:
Hardcoded names often get out of date; for example, in AdaptiveAveragePooling those names contained a cudnn_ prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58502

Reviewed By: samestep

Differential Revision: D28518917

Pulled By: malfet

fbshipit-source-id: 9b16adae85a179e335da4facb4e769b9f67824bc
2021-05-18 16:45:54 -07:00
ea0f7c4720 move unused parameters to end of bucket orders when rebuild buckets for static graph (#58097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58097

Move unused parameters to the end of the bucket order when rebuilding buckets for static graph.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D28366689

fbshipit-source-id: fbd224aeb761d5aa3bab35a00d64974eb4455b2e
2021-05-18 16:36:40 -07:00
a7b62abeb0 [PyTorch Edge] bytecode version bump to v5 and enable share constant table (#57888)
Summary:
As title, main change:
1. Enable sharing of the constant table, reducing model size by up to 50%.
2. Bump bytecode version from v4 to v5.
3. Add the unit test back. (It was partially removed because the `script_module_v5.ptl` bytecode version is v5; when the current runtime is v4 and tries to load a v5 model, it raises an error because the version is not within the supported range.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57888

As title
ghstack-source-id: 129255867

Test Plan:
CI
```
buck test papaya/toolkit/frontend/torch/...
buck test mode/opt papaya/integration/service/test/smartkeyboard:smartkeyboard_system_test
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28309381

fbshipit-source-id: 6f5cf4296eaadde913d55f27d5bfb9d1dea2fbaf
2021-05-18 16:17:13 -07:00
9eee782cb6 [nnc][scripts] Add a script for bisecting the TE fuser pass (#58357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58357

Finding a miscompilation in a large program can be tedious; this
script automates the process of bisecting based on the number of fused
instructions.  Since fusing aten::cat without the corresponding
prim::ListConstruct will cause an assertion failure, we treat that case as a
"skip" and ignore it for the purpose of bisection.
ghstack-source-id: 129079484

Test Plan:
Tried it on some failing testcases, plus I wrote a simple bash
script to simulate "failure" and "skip" and verified a few different cases.

Reviewed By: huiguoo

Differential Revision: D28463808

fbshipit-source-id: 64836f1d37a573549179410316ea7168e3dc1f23
2021-05-18 16:10:20 -07:00
7d78d72d7b removing old comment (#56430)
Summary:
Removing a comment which is no longer relevant after
https://github.com/pytorch/pytorch/pull/56089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56430

Reviewed By: desertfire

Differential Revision: D28515547

Pulled By: Krovatkin

fbshipit-source-id: c4e62741a872fef015248cd7ab1b3213d35109ee
2021-05-18 14:56:22 -07:00
a07cd22efb Comment why render_test_results is its own step (#58505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58505

Reviewed By: seemethere

Differential Revision: D28520332

Pulled By: samestep

fbshipit-source-id: 6637b58b399caf6019d6fd8bfab21646cbd219b6
2021-05-18 14:40:32 -07:00
8efaab1b83 Add long tensor type to AddFakeFp16 Op (#58504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58504

.. to support QRT inline_CVR models and avoid the failure
```
[DataPreproc] User preprocessing error: c10::Error: [enforce fail at operator.h:1307] . Unsupported type of tensor: long (Error from operator:
input: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/cast_22/cast_22_5" input: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/mul_5/Mul" output: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/add_7/Add_2" name: "" type: "AddFakeFp16" arg { name: "broadcast" i: 1 } device_option { extra_info: "inference_split:force_merge" extra_info: "inference_split:force_merge" })
```
f273407515

Test Plan: f273692411

Reviewed By: hx89

Differential Revision: D28513550

fbshipit-source-id: 86892e1a98b5219cd187731018ce2692b231fb58
2021-05-18 14:25:56 -07:00
4b859cbca1 [NNC] Do not optimize conditionals when the corresponding loop is not normalized (#57675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57675

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231375

Pulled By: navahgar

fbshipit-source-id: bcbcebca25577744c7190a0aa9fa376f76dea77d
2021-05-18 14:25:53 -07:00
a71b99b50d [NNC] Add a method to check if a loop is normalized (#57674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57674

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231377

Pulled By: navahgar

fbshipit-source-id: 3d92d532f1e1f78c9d94619980340622b73f99ec
2021-05-18 14:25:50 -07:00
3fe72d30dc [NNC] Optimize conditionals that correspond to the form generated for aten::cat op. (#57673)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57673

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231374

Pulled By: navahgar

fbshipit-source-id: 1777a63df4e5ebed6d515683bd772a88be465b3a
2021-05-18 14:23:48 -07:00
db42ec4297 [Pytorch Sparsity] Add sparse sources to build target
Summary:
This adds to internal build target and makes it ready for selective build
workflow.

Test Plan: CI builds

Reviewed By: z-a-f

Differential Revision: D28103697

fbshipit-source-id: 19c8b27aae4de1cece8d88d13ea51ca4ac7d79b6
2021-05-18 14:19:14 -07:00
ad97fd8031 Support symbolic diff for leaky_relu (#58337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58337

Supports symbolic differentiation for leaky_relu.
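
The gradient being encoded, as a minimal numeric sketch: d/dx leaky_relu(x) is 1 where x > 0 and negative_slope elsewhere.
```python
import torch
import torch.nn.functional as F

def leaky_relu_grad(x, negative_slope=0.01):
    # Gradient of leaky_relu w.r.t. its input.
    return torch.where(x > 0, torch.ones_like(x),
                       torch.full_like(x, negative_slope))

x = torch.randn(5, requires_grad=True)
F.leaky_relu(x, 0.01).sum().backward()
print(torch.allclose(x.grad, leaky_relu_grad(x.detach())))  # True
```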

Test Plan:
test/test_jit.py
test/test_ops.py

Reviewed By: Krovatkin

Differential Revision: D28458898

fbshipit-source-id: bdde74d689d2c2ea1f59507456c2efa4e38de1cc
2021-05-18 14:13:40 -07:00
e1551f1678 Clarify .github/scripts/generate_ci_workflows.py (#58498)
Summary:
Followup to https://github.com/pytorch/pytorch/issues/58491:

- use f-string to remove the literal `generated` string from the generator script, so Phabricator no longer thinks it is a generated file
- remove the special logic for `test_runner_type` and instead explicitly specify for every workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58498

Test Plan:
```
make generate-gha-workflows
```
Also, check that Phabricator doesn't classify `.github/scripts/generate_ci_workflows.py` as "Generated changes" in this diff.

Reviewed By: seemethere

Differential Revision: D28516291

Pulled By: samestep

fbshipit-source-id: 8736eaad5d28082490be0a9b2e271c9493c2ba9d
2021-05-18 12:50:00 -07:00
5fcf49f596 [PyTorch] Add a guard rail to TensorIterator::add_borrowed_{in,out}put (#58279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58279

See comment in source code.
ghstack-source-id: 129002040

Test Plan: CI

Reviewed By: wenleix

Differential Revision: D28428962

fbshipit-source-id: e011819e5579396f3ca2d87978c84965260adb1b
2021-05-18 12:46:33 -07:00
03f2f0f88f [PyTorch] Migrate remaining CUDA TI usage to borrowing where possible (#58278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58278

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002042

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28428809

fbshipit-source-id: 23ccf508c4413371a88085271f11c7d0cc861a9e
2021-05-18 12:46:32 -07:00
1fd256dc3b [PyTorch] Migrate CUDA indexing TI usage to borrowing (#58277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58277

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002044

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D28428441

fbshipit-source-id: 243b746aeb5fdf8b95c8e591c066c5eab140deb6
2021-05-18 12:46:30 -07:00
029289bd6c [PyTorch] Migrate TensorAdvancedIndexing TI usage to borrowing where possible (#58276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58276

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002045

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D28428234

fbshipit-source-id: 9eada7725a070799b55e6683509e359505a2b80a
2021-05-18 12:46:28 -07:00
439ba27dea [PyTorch] Migrate all extant uses of build_binary_float_op to build_borrowing_binary_float_op (#58273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58273

Borrowing is more efficient, and structured kernels can always borrow.
ghstack-source-id: 129002041

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28427914

fbshipit-source-id: eed27a10603b412af5357d3554477ba407abba73
2021-05-18 12:46:26 -07:00
8a4a511ff5 [PyTorch] Migrate all extant uses of build_binary_op to build_borrowing_binary_op (#58272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58272

Borrowing is more efficient, and structured kernels can always borrow.
ghstack-source-id: 129002046

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28427768

fbshipit-source-id: 6314a682556c6914c843aaacf2d75b2adb164e9a
2021-05-18 12:44:50 -07:00
07da584dbd Fix KeyError returned by _maybe_get_last_node_only_observer (#58443)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58443

Test Plan: arc lint

Reviewed By: vkuzo

Differential Revision: D28494119

fbshipit-source-id: 05abf4e12051afc237096812fb0ee08a8b9447f9
2021-05-18 12:41:19 -07:00
46484e8dfe Simplify .github/scripts/generate_ci_workflows.py (#58491)
Summary:
This PR simplifies `.github/scripts/generate_ci_workflows.py` by using the same strategy as https://github.com/pytorch/pytorch/issues/54344, representing workflows as plain data to avoid duplicating the definition of the `generate_workflow_file` function. This will make the script easier to maintain if/when that function is modified and/or more workflow types are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58491

Test Plan:
The Lint job in CI; specifically:
```
make generate-gha-workflows
mypy --config mypy-strict.ini
```

Reviewed By: malfet, seemethere

Differential Revision: D28511918

Pulled By: samestep

fbshipit-source-id: aaf415a954d938a29aee7c9367c9bc2b9f44bb01
2021-05-18 11:49:51 -07:00
f7c15610aa Collect kernel version (#58485)
Summary:
Collect env should collect kernel and glibc version

Fixes https://github.com/pytorch/pytorch/issues/58387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58485

Reviewed By: walterddr

Differential Revision: D28510564

Pulled By: malfet

fbshipit-source-id: ad3d4b93f51db052720bfaa4322138c55816921b
2021-05-18 10:57:59 -07:00
92e36240f5 fix nonzero perf regression (#58468)
Summary:
https://github.com/pytorch/pytorch/issues/55292 introduced a perf regression for nonzero on CUDA; this fixes it. nvcc is still pretty bad at unrolling loops whose boundaries are not known at compile time, which made the `write_indices` kernels ~5x slower than they should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58468

Reviewed By: mruberry

Differential Revision: D28511147

Pulled By: ngimel

fbshipit-source-id: fe7303ec77da1abbe5e874093eca247b3919616f
2021-05-18 10:33:10 -07:00
4ce8378ec5 [local lint] Remove success checks in tests (#58490)
Summary:
Testing for both that a lint job ran and that it was successful depends
on having lint pass for the PR, which can create confusion if it doesn't
(i.e. a flake8 failure also causes this job to fail, and it's not
immediately clear why). With this PR we just check for the presence of
job names to see that something ran.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58490

Reviewed By: samestep

Differential Revision: D28511229

Pulled By: driazati

fbshipit-source-id: 3036deff9f9d0ef2e78b44a9a43b342acdcfa296
2021-05-18 09:31:13 -07:00
afe23b8f8b Fix alpine image (#58462)
Summary:
Fixes dockerhub rate limiting issue, use the ECR image instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58462

Reviewed By: malfet

Differential Revision: D28510603

Pulled By: zhouzhuojie

fbshipit-source-id: 2cac59da1d1efdf31df71e9f76d802f8e9a0bfd5
2021-05-18 09:22:28 -07:00
821a97595b fx quant: improve performance of all_node_args_have_no_tensors (#58461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58461

Improves the logic which calculates whether a node has any tensors
in its arguments by terminating the recursion early when possible.
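
A generic sketch of the pattern (a hypothetical helper, not the actual FX quant code):
```python
import torch

def any_tensor_arg(args):
    # Walk nested args, but return as soon as one Tensor is found
    # instead of exhaustively visiting every element.
    for a in args:
        if isinstance(a, torch.Tensor):
            return True  # early exit
        if isinstance(a, (list, tuple)) and any_tensor_arg(a):
            return True
    return False
```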

In a future PR, we should probably ditch this entire approach and switch to
using dtype propagation.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28499455

fbshipit-source-id: bedd844022b90e1fcb7d7a3cb4cc65440dc9cc59
2021-05-18 07:19:59 -07:00
e059fd40a8 Remove master documentation from being indexable by search engines (#58056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58056

This PR addresses an action item in #3428: disabling search engine
indexing of master documentation. This is desireable because we want to
direct users to our stable documentation (instead of master
documentation) because they are more likely to have a stable version of
PyTorch installed.

Test Plan:
1. run `make html`, check that the noindex tags are there
2. run `make html-stable`, check that the noindex tags aren't there

Reviewed By: bdhirsh

Differential Revision: D28490504

Pulled By: zou3519

fbshipit-source-id: 695c944c4962b2bd484dd7a5e298914a37abe787
2021-05-18 06:20:09 -07:00
52b45b7655 Revert D28494073: [Gradient Compression] Do not skip the comm hook tests for Gloo/MPI backends
Test Plan: revert-hammer

Differential Revision:
D28494073 (df44f015fe)

Original commit changeset: 6ba14082f986

fbshipit-source-id: 0e094f09b59c93f5ee13a667aacfb3ccf608547e
2021-05-18 05:39:09 -07:00
34d6618386 [NNC] Fixing a bug in simplifier (#58291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28435393

Pulled By: navahgar

fbshipit-source-id: 517e47385a93a43d2ddf054382adc81c18484066
2021-05-18 01:28:33 -07:00
df44f015fe [Gradient Compression] Do not skip the comm hook tests for Gloo/MPI backends (#58444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58444

DDP communication hooks are already supported on Gloo and MPI backends. No longer need to skip these tests on Gloo/MPI backends.

TODO: `test_ddp_hook_parity_powerSGD` failes on Gloo backend. Filed a bug #58467.
ghstack-source-id: 129209528

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_comm_hook_logging
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce_process_group
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_powerSGD

Reviewed By: rohan-varma

Differential Revision: D28494073

fbshipit-source-id: 6ba14082f98696bc4bd8c02395cb58b9c1795015
2021-05-17 23:05:01 -07:00
c38616491f Conservatively move all suitable prim ops from full-jit to mobile, and make them selective. (#58353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58353

There are long-tail operators in register_prim_ops_fulljit.cpp that may be used in the mobile build. In this PR:
1. All of the ops that are likely to be used on mobile are moved to register_prim_ops.cpp.
2. Note that this move is conservative. If an op is likely to have a full-jit dependency, or cannot be made selective, it is kept. If it is later needed on mobile (rare), it will be adapted and moved case by case.
3. All the moved ops are marked selective. The registration function is changed from `Operator()` to `OperatorGenerator()`. Size regression is not expected.

Test Plan:
* Internal size tests
* CI

Reviewed By: dhruvbird

Differential Revision: D28463158

Pulled By: iseeyuan

fbshipit-source-id: 34536b8a569f1274329ccf1dac809fe9b891b4ff
2021-05-17 23:01:22 -07:00
b5a834a739 [Pytorch] Build lite interpreter as default for iOS
Summary:
Two changes:
1. Build lite interpreter as default for iOS
2. Switch the previous lite interpreter test to full jit build test

Test Plan: Imported from OSS

Differential Revision: D27698039

Reviewed By: xta0

Pulled By: cccclai

fbshipit-source-id: 022b554f4997ae577681f2b79a9ebe9236ca4f7d
2021-05-17 22:36:05 -07:00
8a3fb2689f Wrap torch::deploy API functions in safe rethrow macros (#58412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58412

Second try: avoid ctor/dtor handling this time, as it is kind of
pointless if the rethrow will still terminate(), and upsets -Werror=terminate
Original commit changeset: 1775bed18269

Test Plan: existing unit tests and CI

Reviewed By: suo

Differential Revision: D28478588

fbshipit-source-id: 84191cecc3ef52e23f11bfea07bbb9773ebc5df4
2021-05-17 22:09:19 -07:00
7b73fdf597 [FX] Fix retracing wrapped functions (#58061)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58061

Test Plan: Imported from OSS

Reviewed By: yuhc

Differential Revision: D28358801

Pulled By: jamesr66a

fbshipit-source-id: c7c9a8a80e5bfe1eb1f6d2cf858ac7e57153a860
2021-05-17 19:50:16 -07:00
5fa4541c65 Make new_ones an operator (#58405)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58394
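
A quick usage reminder of the method being promoted to an operator:
```python
import torch

x = torch.zeros(2, 3, dtype=torch.int32)
y = x.new_ones(4)  # inherits dtype and device from x
print(y)  # tensor([1, 1, 1, 1], dtype=torch.int32)
```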

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58405

Reviewed By: HDCharles

Differential Revision: D28480075

Pulled By: Chillee

fbshipit-source-id: bd29399867e2a002a2f395554621761d3c701f68
2021-05-17 19:24:34 -07:00
0547a3be63 Change link order for BUILD_SPLIT_CUDA option (#58437)
Summary:
torch_cuda_cu depends on torch_cuda_cpp, so it should be linked first.
Otherwise the linker keeps lots of cudnn symbols for no good reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58437

Reviewed By: janeyx99

Differential Revision: D28496472

Pulled By: malfet

fbshipit-source-id: 338605ff755591476070c172a6ea0a0dcd0beb23
2021-05-17 18:38:04 -07:00
af463d2235 Add shape documentation for CosineEmbeddingLoss (#58403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58403

Reviewed By: HDCharles

Differential Revision: D28480076

Pulled By: jbschlosser

fbshipit-source-id: c2c51e9da86e274e80126bbcabebb27270f2d2d0
2021-05-17 18:14:16 -07:00
e24dee00d4 add kernel launch checks after each kernel launch to silence the check (#58432)
Summary:
T90898552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58432

Reviewed By: r-barnes

Differential Revision: D28487446

Pulled By: ngimel

fbshipit-source-id: 3a756ffa3cd68720e132af27cd5ae36f7fd4a2d8
2021-05-17 18:03:19 -07:00
7dd08504f6 [package] fix persistent_load error (#58439)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58439

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28494250

Pulled By: Lilyjjo

fbshipit-source-id: c068760db9c25dcbf5a88ea9343eab11f0e7736a
2021-05-17 17:38:53 -07:00
314a578154 Clang format distributed_c10d.py (#58435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58435

Prepare for #53962

ghstack-source-id: 129171617

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D28490326

fbshipit-source-id: 2ed3c5850788b9702a8020f6ee6d0b579625bf89
2021-05-17 16:47:35 -07:00
b6d3929b51 [ATen] Use MaybeOwned<T> in at::argmin/argmax (#58338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58338

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D28458968

fbshipit-source-id: 2c759bdb9fbdbef32d804f6d8efb09fb1d2bb30a
2021-05-17 16:42:52 -07:00
6989eb60e5 Remove timeouts for C2 tests
Summary: When run on very heavily loaded machines, some of these tests are timing out. It's not an issue with the test, it's an issue with the environment. I've removed the timeout so we at least keep unit test coverage.

Test Plan: N/A: Fix breakages

Reviewed By: ngimel

Differential Revision: D28492334

fbshipit-source-id: aed3ee371763161aab2d356f5623c7df053fda6f
2021-05-17 16:39:30 -07:00
4310decfbf .github: Add initial Windows CPU GHA workflow (#58199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58199

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28465272

Pulled By: seemethere

fbshipit-source-id: d221ad71d160088883896e018c58800dae85ff2c
2021-05-17 15:04:16 -07:00
c156a4ffaa fx quant: fix crash on output dicts and lists (#58416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58416

https://github.com/pytorch/pytorch/pull/57519 had a regression not
caught by CI: it added an assertion which failed on various model
output types.

This PR removes the assertion and adds the logic to observe graph
outputs in a way that supports arbitrary output formats.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_output_lists_and_dicts
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D28479946

fbshipit-source-id: bcce301f98a057b134c0cd34ab0ca96ba457863f
2021-05-17 15:02:09 -07:00
a1cacf3b5d fx quant: remove test debug logs (#58415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58415

Removes test debugging logs which were committed; probably
someone forgot to remove them before landing.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D28479947

fbshipit-source-id: 3adba87c51652e3353f455b293abc90debe3dd7d
2021-05-17 15:01:03 -07:00
3d12ab452e [ONNX] Fix split export in opset13 (#56277) (#57605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57605

Fix split export in opset13

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393522

Pulled By: SplitInfinity

fbshipit-source-id: 4de83345ec7bc9bafe778fe534d9a8760ce16ab3

Co-authored-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com>
Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-17 14:50:33 -07:00
0c3db1cb33 [Pytorch] Build lite interpreter as default for Android
Summary:
Build lite interpreter as default for android, should wait until https://github.com/pytorch/pytorch/pull/56002 lands
Mainly two changes:
1. Use lite interpreter as default for Android
2. Switch the lite interpreter build test to full jit build test

Test Plan: Imported from OSS

Differential Revision: D27695530

Reviewed By: IvanKobzarev

Pulled By: cccclai

fbshipit-source-id: e1b2c70fee6590accc22c7404b9dd52c7d7c36e2
2021-05-17 14:12:48 -07:00
d645088f2f [torch] Format repeat_interleave op files (#58313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58313

Same as title.

I am planning to send a follow-up diff to this op, so sending formatting diff ahead to keep PR simple.

Test Plan: Rely on existing signals since this is simple formatting diff.

Reviewed By: ngimel

Differential Revision: D28447685

fbshipit-source-id: c7cd473b61e40e6f50178aca88b9af197a759099
2021-05-17 13:51:53 -07:00
06c1094ea0 Merge CreationMeta MULTI_OUTPUT_SAFE with MULTI_OUTPUT_NODE (#58285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57679

##### Release Notes
This is part of the end of the deprecation of inplace/view:
- `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead (see the sketch after this list).
- The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_sizes`, or `chunk` has been changed from "This view is **an** output of a function..." to "This view is **the** output of a function...".
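
A minimal sketch of the new behavior (tensor values are illustrative):

```
import torch

x = torch.randn(4, requires_grad=True)
a, b = x.chunk(2)

# a.detach_()   # after this change: raises an error on views from chunk/split
c = a.detach()  # the supported, non-inplace replacement
print(c.requires_grad)  # False
```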

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58285

Reviewed By: bdhirsh

Differential Revision: D28441980

Pulled By: soulitzer

fbshipit-source-id: e2301d7b8cbc3dcdd328c46f24bcb9eb7f3c0d87
2021-05-17 13:48:39 -07:00
3507ca320b Remove unused python2 shebang (#58409)
Summary:
This is the only line (not in `third_party`) matching the regex `^#!.*python2`, and [it is not the first line of its file](https://github.com/koalaman/shellcheck/wiki/SC1128), so it has no effect. As a followup to https://github.com/pytorch/pytorch/issues/58275, this PR removes that shebang to reduce confusion, so now all Python shebangs in this repo are `python3`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58409

Reviewed By: walterddr

Differential Revision: D28478469

Pulled By: samestep

fbshipit-source-id: c17684c8651e45d3fc383cbbc04a31192d10f52f
2021-05-17 13:19:32 -07:00
98cc0aa6b0 Use torch.allclose to check tensor equality (#58429)
Summary:
This fixes test_lkj_cholesky_log_prob when the default codepath is used, i.e. when the test is executed as follows:
```
 ATEN_CPU_CAPABILITY=default python3 distributions/test_distributions.py -v -k test_lkj_cholesky_log_prob
```
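
For context, a sketch of the difference between exact comparison and `torch.allclose` (values chosen to mimic the small drift a different CPU codepath can introduce):

```
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = a + 1e-6  # tiny float drift, e.g. from a different vectorized codepath

print(torch.equal(a, b))     # False: exact elementwise comparison fails
print(torch.allclose(a, b))  # True: within default rtol=1e-5, atol=1e-8
```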

Fixes https://github.com/pytorch/pytorch/issues/58381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58429

Reviewed By: neerajprad

Differential Revision: D28484340

Pulled By: malfet

fbshipit-source-id: 32afcc75e5250f5a11d66b4fa194ea1c784454a6
2021-05-17 13:16:35 -07:00
50f9a1812e Enable NNAPI in internal build (#58324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58324

Test Plan: Build Size Bot.  Segmentation in Spark Player.

Reviewed By: axitkhurana

Differential Revision: D28435176

fbshipit-source-id: f2fb25e3cd331433e7a3156a528811abd3bcbf3a
2021-05-17 12:52:56 -07:00
532632ca26 Don't bind Android NNAPI on Apple platforms (#58323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58323

Currently there is no way to run NNAPI on Apple platforms.
Disabling the binding with the preprocessor makes it easier
to enable NNAPI in the internal build without affecting iOS size.

This should be reverted soon and migrated to selective build.

Test Plan: Build Size Bot on later diff.

Reviewed By: axitkhurana

Differential Revision: D28435179

fbshipit-source-id: 040eeb74532752630d329b15d5f95c538c2e3f9e
2021-05-17 12:51:46 -07:00
1891e4bf1e [Pytorch] Remove run_on_bundled_input (#58344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58344

Remove a helper function that's more trouble than it's worth.

ghstack-source-id: 129131889

Test Plan: ci and {P414950111}

Reviewed By: dhruvbird

Differential Revision: D28460607

fbshipit-source-id: 31bd6c1cc169785bb360e3113d258b612cad47fc
2021-05-17 12:44:00 -07:00
443ce1e8a1 Improve error message when Proxy object is iterated (#58302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58302
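
For context, a hedged sketch of the kind of code that triggers the improved message (the module is illustrative, not from the PR):

```
import torch
import torch.fx

class Loops(torch.nn.Module):
    def forward(self, x):
        total = 0
        for row in x:  # iterating a Proxy is not symbolically traceable
            total = total + row.sum()
        return total

try:
    torch.fx.symbolic_trace(Loops())
except Exception as e:  # a TraceError explaining that Proxy can't be iterated
    print(type(e).__name__, ":", e)
```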

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28444030

Pulled By: ansley

fbshipit-source-id: ee29b0f7b2199f8590de4c5945b0d4ce59230ce2
2021-05-17 12:42:23 -07:00
a4ce85ad68 Chown workspace in calculate-docker-image (#58398)
Summary:
Since https://github.com/pytorch/pytorch/issues/58299 changed the calculate-docker-image job from `ubuntu-18.04` to `linux.2xlarge`, it has sometimes been failing with this message:

```
Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch'
Error: Command failed: rm -rf "/home/ec2-user/actions-runner/_work/pytorch/pytorch/.azure_pipelines"
```

- https://github.com/pytorch/pytorch/runs/2587348894
- https://github.com/pytorch/pytorch/runs/2592943274
- https://github.com/pytorch/pytorch/runs/2600707737

This PR hopes to fix that issue by adding the "Chown workspace" step that we already use for the other jobs in the Linux CI workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58398

Reviewed By: seemethere

Differential Revision: D28476902

Pulled By: samestep

fbshipit-source-id: a7dbf0ad9c18ac44cc1a3cef7647f56489958fe6
2021-05-17 12:40:55 -07:00
e8981e7c5d Improve CONTRIBUTING.md (#58396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58396

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28476510

Pulled By: ansley

fbshipit-source-id: 3f45bee93dfeda06a44570305f9699bcafc45d2e
2021-05-17 12:36:38 -07:00
9afe9fba29 Reland OpInfo support for forward AD (#58304)
Summary:
Third attempt to land this.
Trying the ci-all label to ensure we test everything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58304

Reviewed By: heitorschueroff

Differential Revision: D28474343

Pulled By: albanD

fbshipit-source-id: 8230fa3c0a8d3633f09999e7c2f47dbdc5fe57e9
2021-05-17 12:33:27 -07:00
1a9efbbc92 generate inplace/out kernels for xla (#57510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57510

This is a re-write of https://github.com/pytorch/pytorch/pull/56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks (sketched below). After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops: if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.
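
As a hedged sketch of that generation decision (the helper name and kernel table are hypothetical, not the PR's code):

```
def should_emit_namespaced_entry(op_name, xla_kernels):
    # Only emit an at::xla::{op} wrapper when the backend has its own
    # kernel; ops that hit the CPU fallback get no namespaced entry,
    # mirroring how composite ops get no at::cpu::{op} entry.
    return op_name in xla_kernels
```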

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
2021-05-17 12:25:38 -07:00
9354a68e7d [codegen] split out backend-specific information from NativeFunction in the model (#57361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57361

Data model change in the codegen, which splits backend-specific information out of `NativeFunction`

### Overview
Currently in the codegen, native_functions.yaml has backend-specific information about each operator that is encoded directly into the data model, in the `NativeFunction` object. That's reasonable, since the native_functions.yaml is the source of truth for information about an operator, and the data model encodes that information into types.

Now that external backends can use the codegen though, that information is technically incomplete/inaccurate. In another PR, I tried patching the information on the `NativeFunction` object with the additional external information, by updating the `dispatch` entry to contain the external backend kernel name and dispatch key.

Instead, this PR tries to split out that information. The `NativeFunction` class contains all information about an operator from native_functions.yaml that's backend-independent and is known never to change regardless of what extra information backends provide. We also build up a backend "index", which is basically a mapping from [backend] -> [backend-specific-metadata]. Reading in an external backend yaml just involves updating that index with the new backend.

There were a few places where `NativeFunction` used the dispatch table directly, which I encoded as properties directly on the NativeFunction object (e.g. `is_abstract`). They were mostly about whether or not the operator has a composite kernel, which isn't something that's going to change for any external backends.

This has a few advantages:
- We can more easily re-use the existing logic in `native_function.py` and `register_dispatch_key.py` for both native and external backends, since they both involve a NativeFunction + a particular backend index
- The data in the data model will be the same regardless of how the codegen is run. Running the codegen with a new external backend doesn't change the data inside of NativeFunction or an existing backend index. It just adds a new index for that backend.
- There are several codegen areas that don't care about backend-specific information: mostly the tracing and autograd codegen. We can reason about the codegen there more easily, knowing that backend-specific info is entirely uninvolved.

An alternative to this split would be to augment the NativeFunction objects with external backend information at the time that we create them. So the external codegen could read both native_functions.yaml and the external backend's yaml at the same time, and construct a NativeObject with a full dispatch table (including the XLA entry), and the correct setting of structured (taking into account both yamls). One disadvantage to this approach is that NativeFunction objects now contain different stuff depending on how you ran the codegen, and you have to make sure that any changes to the codegen can properly handle all the different variants.

### Data Model Changes
Removed 3 classes, which are used by the external codegen:
- ExternalBackendFunction
- ExternalBackendFunctionsGroup
- ExternalBackendMetadata

And added two new ones:
- BackendIndex
- BackendMetadata

`BackendIndex` contains any info that's specific to that backend, plus a mapping from operator names to backend specific metadata about the operator. One example of backend-specific info that's not operator-dependent is the fact that XLA prefers to implement functional kernels instead of out kernels (and so when they eventually mark an op as structured, they're going to mark the functional op and not the out op).

`BackendMetadata` contains info specific to an (operator, backend) pair. Right now, that's just (a) the name of the kernel, and (b) whether or not that operator is structured.
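
A condensed sketch of that data model (field names follow this description; everything beyond them is an assumption):

```
from dataclasses import dataclass, field
from typing import Dict

@dataclass(frozen=True)
class BackendMetadata:
    # info specific to an (operator, backend) pair
    kernel: str       # name of the kernel for this backend
    structured: bool  # whether this backend implements the op as structured

@dataclass
class BackendIndex:
    # info specific to one backend, plus per-operator metadata
    dispatch_key: str
    external: bool
    # operator name -> backend-specific metadata
    index: Dict[str, BackendMetadata] = field(default_factory=dict)
```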

### Questions
I wanted to get this PR up earlier so I could get feedback, but there are a few things I want to call out:

**Dealing with `structured`.**
This PR separates out the notion of `structured` into two bits of information:
- Does [operator] have a meta() function. This is backend-agnostic, and is represented by the `structured` property on `NativeFunction`, same as before. This is used, e.g., to decide what signatures to add to `MetaFunctions.h`.
- Does [operator, backend] have an impl() function. This is backend-dependent; even though technically all in-tree backends are forced to write impl() functions for an operator when we port the op to structured in native_functions.yaml, out-of-tree backends can decide to opt in independently. This is represented as a property on `BackendMetadata`. This is used in most other cases, e.g. in `RegisterDispatchKey` when we're deciding whether to generate a structured or unstructured wrapper.

I also baked `is_structured_dispatch_key` directly into each BackendIndex. So for operators marked "structured" in native_functions.yaml, their corresponding CPU/CUDA BackendIndex entries will be marked structured, and all others (except for potentially external backends) will not.

I ended up trying to deal with `structured` in this change since it's technically backend-dependent (XLA can opt kernels into structured separately from in-tree ops), but that may have been too ambitious: it's not actually relevant until we add support for structured external kernels. If it's not clear that this is the right path for dealing with structured and we want to push that off, I'm fine with backing out the bits of this PR that make `structured` backend-dependent. I don't see anything *too* controversial related to structured in the change, but I tried to call out any areas in the comments.

**Localizing the fact that external backends follow Dispatcher convention.**
Another thing that's sort of backend-specific that I didn't totally address in this PR is the fact that in-tree backends follow the Native API while external backends follow the Dispatcher API. I painted over that in `native_functions.py` by adding a helper, `kernel_signature`, that takes in a native function and gives you the "correct" signature for the specified backend: NativeSignature for in-tree backends, and DispatcherSignature for out-of-tree backends. In order to make that fully usable though, we'll need `NativeSignature` and `DispatcherSignature` to have matching interfaces. I didn't bother with that in this PR, which is why `gen_external_aten_fallbacks.py` still has a bunch of direct references to the dispatcher API. Thinking of adding it in a later PR but wanted to see if anyone has other opinions.
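
A sketch of the `kernel_signature` helper as described (the stub signature classes stand in for the real codegen types and are assumptions):

```
class NativeSignature:      # stand-in for the real codegen class
    def __init__(self, func):
        self.func = func

class DispatcherSignature:  # stand-in for the real codegen class
    def __init__(self, func):
        self.func = func

def kernel_signature(f, backend_index):
    # In-tree backends follow the native convention; external backends
    # follow the dispatcher convention.
    if backend_index.external:
        return DispatcherSignature(f.func)
    return NativeSignature(f.func)
```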

Maybe `is_external()` shouldn't even be a property on the BackendMetadata, and anything the codegen does that requires asking for that information should just be better abstracted away.

**Thoughts on the `BackendIndex` / `BackendMetadata` breakdown.**
One thing that's annoying right now is that to query for various pieces of metadata, you call helper functions like `backend_index.structured(f)`, which queries that particular backend and tells you if that specific NativeFunctionGroup is structured for that backend. It has to return an `Optional[bool]` though, since you have to handle the case where that operator doesn't have a kernel for that backend at all. So users of those helpers end up with a bunch of optionals that they need to unpack, even if they know at some point that the result isn't None. I think it would be easier instead to just store the NativeFunction object as a field directly on the BackendMetadata. Curious if there are any other opinions on a better way to model it though.
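
Reusing the `BackendIndex` sketch above, the `Optional[bool]` shape looks roughly like this (the exact helper is an assumption):

```
from typing import Optional

def backend_structured(backend, op_name) -> Optional[bool]:
    # None means "this backend has no kernel for the op at all", which is
    # why callers end up unpacking Optionals even when they know the answer.
    metadata = backend.index.get(op_name)
    return None if metadata is None else metadata.structured
```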

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474362

Pulled By: bdhirsh

fbshipit-source-id: 41a00821acf172467d764cb41e771e096542f661
2021-05-17 12:25:35 -07:00
0db33eda2a remove bridge API from codegen (#55796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55796

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474361

Pulled By: bdhirsh

fbshipit-source-id: c7f5ce35097f8eaa514f3df8f8559548188b265b
2021-05-17 12:25:32 -07:00
3d9f10f530 [external codegen] better yaml error messaging, added explicit error message tests (#56597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56597

3 small changes, all centered around error messaging.

1) Improved error messages when `gen_backend_stubs.py` receives invalid yaml

2) Added error message tests. I wasn't sure if there was a canonical way to do this, so I just wrote a test that takes in a list of (yaml input, expected error message) pairs and runs the codegen pipeline on each of them (see the sketch after this list).

3) I also removed the LineLoader from the yaml parsing bit that reads in the external backend yaml file. Two reasons that I took it out:
 - The main reason we use it with native_functions.yaml is to easily pinpoint problems with new ops as they're added that the codegen can pick up. 99% of these problems have to do with schema, which is irrelevant to the external yaml since it pulls the schema from native_functions
 - Not all operators have to appear in the external yaml. We could do something like "line: -1", but that's kind of weird.

If you think the line numbers would actually be of more use than I'm thinking of in the external yaml though, let me know!
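
A sketch of the (yaml input, expected error message) harness from point 2, with a toy validator standing in for the real `gen_backend_stubs` pipeline (field checks and messages are illustrative assumptions):

```
import unittest
import yaml

def check_backend_yaml(text):
    # Toy stand-in for the real pipeline: validate two required fields.
    data = yaml.safe_load(text)
    assert "backend" in data, 'You must provide a value for "backend"'
    assert "cpp_namespace" in data, 'You must provide a value for "cpp_namespace"'

class TestYamlErrorMessages(unittest.TestCase):
    CASES = [  # (yaml input, expected error message) pairs
        ("cpp_namespace: torch_xla\nsupported: [abs]", 'value for "backend"'),
        ("backend: XLA\nsupported: [abs]", 'value for "cpp_namespace"'),
    ]

    def test_error_messages(self):
        for text, expected in self.CASES:
            with self.assertRaisesRegex(AssertionError, expected):
                check_backend_yaml(text)
```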

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474363

Pulled By: bdhirsh

fbshipit-source-id: 8b5ec804b388dbbc0350a20c053da657fad0474f
2021-05-17 12:25:29 -07:00
4dc1b8e06b add _to_cpu() operator (#55795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55795

description coming soon

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474365

Pulled By: bdhirsh

fbshipit-source-id: 0704d7ce354308601a0af9ab48851459f34ce7a0
2021-05-17 12:23:35 -07:00
6168 changed files with 579678 additions and 227783 deletions

View File

@ -44,7 +44,7 @@ jobs:
is_official_build: ${{ parameters.is_official_build}}
# Sync and update PyTorch submodules
- bash: git submodule update --init --recursive
- bash: git submodule update --init --recursive --jobs 0
displayName: Update PyTorch submodules
# Build PyTorch and run unit tests - no packaging

View File

@ -47,7 +47,7 @@ jobs:
is_official_build: ${{ parameters.is_official_build}}
# Sync and update PyTorch submodules
- script: git submodule update --init --recursive
- script: git submodule update --init --recursive --jobs 0
displayName: Update PyTorch submodules
# Build PyTorch and run unit tests - no packaging

View File

@ -0,0 +1,26 @@
parameters:
name: ''
pool: ''
customMatrixes: ''
jobs:
- job: ${{parameters.name}}
timeoutInMinutes: 600
strategy:
matrix:
${{ insert }}: ${{parameters.customMatrixes}}
pool:
name: ${{ parameters.pool}}
steps:
# Clone PyTorch Tests repository
- bash: |
B64_PAT=$(echo -n ":$_ADOTOKEN" | base64)
git -c http.extraHeader="Authorization: Basic ${B64_PAT}" clone $(AZURE_DEVOPS_PYTORCH_TESTS_REPO_URL)
cd pytorch_tests
git checkout $(PYTORCH_TESTS_CHECKOUT_BRANCH)
env:
_ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
displayName: Clone PyTorch Tests repo
- bash: |
bash $(Build.SourcesDirectory)/pytorch_tests/webapp/notify_webapp.sh
displayName: Notify Webapp

View File

@ -46,7 +46,7 @@ steps:
curl -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output .\tmp_bin\sccache.exe
curl -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output .\tmp_bin\sccache-cl.exe
copy .\tmp_bin\sccache.exe .\tmp_bin\nvcc.exe
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.3/randomtemp.exe --output .\tmp_bin\randomtemp.exe
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output .\tmp_bin\randomtemp.exe
displayName: Install sccache and randomtemp
condition: not(eq(variables.CUDA_VERSION, ''))

View File

@ -33,7 +33,7 @@ jobs:
# Clone PyTorch Tests repository
- bash: |
B64_PAT=$(printf "%s"":$_ADOTOKEN" | base64)
B64_PAT=$(echo -n ":$_ADOTOKEN" | base64)
git -c http.extraHeader="Authorization: Basic ${B64_PAT}" clone $(AZURE_DEVOPS_PYTORCH_TESTS_REPO_URL)
cd pytorch_tests
git checkout $(PYTORCH_TESTS_CHECKOUT_BRANCH)
@ -48,4 +48,14 @@ jobs:
_TS_CLONE_P: $(TS_CLONE_PASSWORD)
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
_AZUREML_CLONE_PASSWORD: $(AZUREML_CLONE_PASSWORD)
_SPPASSWORD: $(SPPASSWORD)
displayName: Run PyTorch Unit Tests
# Tests results are available outside the docker container since
# the current directory is mounted as a volume of the container.
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFiles: '**/test-*.xml'
testRunTitle: 'Publish test results for Python'

View File

@ -47,3 +47,11 @@ jobs:
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
displayName: Run PyTorch Unit Tests
# Tests results are available outside the docker container since
# the current directory is mounted as a volume of the container.
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFiles: '**\test-*.xml'
testRunTitle: 'Publish test results for Python'

View File

@ -120,9 +120,7 @@ steps:
Write-Host "##vso[task.setvariable variable=CMAKE_LIBRARY_PATH;]$(Build.SourcesDirectory)\mkl\lib;$env:CMAKE_LIBRARY_PATH"
Write-Host "##vso[task.setvariable variable=ADDITIONAL_PATH;]$(Build.SourcesDirectory)\tmp_bin"
Write-Host "##vso[task.setvariable variable=SCCACHE_IDLE_TIMEOUT;]1500"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\nvcc.exe"
Write-Host "##vso[task.setvariable variable=CUDA_NVCC_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\randomtemp.exe"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_BASEDIR;]$(Build.SourcesDirectory)\tmp_bin"
Write-Host "##vso[task.setvariable variable=CMAKE_CUDA_COMPILER_LAUNCHER;]$(Build.SourcesDirectory)/tmp_bin/randomtemp.exe;$(Build.SourcesDirectory)/tmp_bin/sccache.exe"
displayName: Set MKL, sccache and randomtemp environment variables
# View current environment variables

View File

@ -8,7 +8,7 @@ steps:
connectionType: 'connectedServiceName'
serviceConnection: circleciconn
method: 'POST'
headers: '{"Content-Type":"application/json", "BranchName":"$(TARGET_BRANCH_TO_CHECK_PR)", "JobName":"$(TARGET_CIRCLECI_PR)", "PlanUrl":"$(System.CollectionUri)", "ProjectId":"$(System.TeamProjectId)", "HubName":"$(System.HostType)", "PlanId":"$(System.PlanId)", "JobId":"$(System.JobId)", "TimelineId":"$(System.TimelineId)", "TaskInstanceId":"$(System.TaskInstanceId)", "AuthToken":"$(System.AccessToken)"}'
headers: '{"Content-Type":"application/json", "BranchName":"$(_TARGET_BRANCH_TO_CHECK)", "JobName":"$(TARGET_CIRCLECI_BUILD_PR)", "PRNumber":"$(_TARGET_PR_NUMBER)", "TargetCommit":"$(_TARGET_COMMIT)", "PlanUrl":"$(System.CollectionUri)", "ProjectId":"$(System.TeamProjectId)", "HubName":"$(System.HostType)", "PlanId":"$(System.PlanId)", "JobId":"$(System.JobId)", "TimelineId":"$(System.TimelineId)", "TaskInstanceId":"$(System.TaskInstanceId)", "AuthToken":"$(System.AccessToken)"}'
body: ''
urlSuffix: 'api/JobStatus'
waitForCompletion: true

View File

@ -1,6 +1,6 @@
# Initiate 5 agentless-server waiting jobs to check on the
# status of PR artifact builds, for a maximum wait time of
# 5 * 60 min =300 minutes. These jobs will pass immediately
# 11*60 min=660 mins. These jobs will pass immediately
# once targeted CircleCI build is ready.
jobs:
@ -8,7 +8,6 @@ jobs:
pool: server
timeoutInMinutes: 60
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -17,7 +16,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob1
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -26,7 +24,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob2
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -35,7 +32,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob3
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -44,6 +40,53 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob4
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob6
pool: server
timeoutInMinutes: 60
dependsOn: checkjob5
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob7
pool: server
timeoutInMinutes: 60
dependsOn: checkjob6
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob8
pool: server
timeoutInMinutes: 60
dependsOn: checkjob7
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob9
pool: server
timeoutInMinutes: 60
dependsOn: checkjob8
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob10
pool: server
timeoutInMinutes: 60
dependsOn: checkjob9
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob11
pool: server
timeoutInMinutes: 60
dependsOn: checkjob10
continueOnError: true
steps:
- template: wheel-wait-job-template.yml

View File

@ -48,3 +48,13 @@ stages:
_PYTHON_VERSION: $(PYTHON_VERSION_WIN_2)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_WIN_2)
_RUN_TESTS: $(RUN_TESTS_WIN)
- stage: 'NotifyWebapp'
displayName: 'Notify Webapp that pipeline is finished'
dependsOn: NightlyCustomTests
condition: succeededOrFailed()
jobs:
- template: job_templates/notify-webapp-template.yml
parameters:
name: ubuntu_1804_CPU
pool: $(BUILD_POOL_LIN_1)

View File

@ -7,14 +7,28 @@
# 2) runs custom PyTorch unit-tests on PyTorch
# wheels generated during PR builds.
resources:
webhooks:
- webhook: GitHubPyTorchPRTrigger
connection: GitHubPyTorchPRTriggerConnection
filters:
- path: repositoryName
value: pytorch_tests
stages:
- stage: 'EnsureArtifactsReady'
displayName: 'Ensure PyTorch PR Artifacts are ready'
jobs:
- template: job_templates/wheel-wait-template.yml
variables:
_TARGET_BRANCH_TO_CHECK: ${{parameters.GitHubPyTorchPRTrigger.TARGET_BRANCH_TO_CHECK_AZ_DEVOPS_PR}}
_TARGET_PR_NUMBER: ${{parameters.GitHubPyTorchPRTrigger.PR_NUMBER}}
_TARGET_COMMIT: ${{parameters.GitHubPyTorchPRTrigger.TARGET_COMMIT}}
- stage: 'PRCustomTests'
displayName: 'Run custom unit tests on PyTorch wheels'
dependsOn: EnsureArtifactsReady
condition: succeeded()
jobs:
- template: job_templates/pytorch-template-unix.yml
parameters:
@ -24,7 +38,25 @@ stages:
PR_Custom_Tests:
_PYTHON_VERSION: $(PYTHON_VERSION_PR)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_PR)
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_PR)
_TARGET_BRANCH_TO_CHECK: $(TARGET_BRANCH_TO_CHECK_PR)
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_BUILD_PR)
_TARGET_BRANCH_TO_CHECK: ${{parameters.GitHubPyTorchPRTrigger.TARGET_BRANCH_TO_CHECK_AZ_DEVOPS_PR}}
_TARGET_PR_NUMBER: ${{parameters.GitHubPyTorchPRTrigger.PR_NUMBER}}
_TARGET_COMMIT: ${{parameters.GitHubPyTorchPRTrigger.TARGET_COMMIT}}
_DOCKER_IMAGE: $(DOCKER_IMAGE_PR)
_RUN_TESTS: $(RUN_TESTS_PR)
- stage: 'NotifyWebapp'
displayName: 'Notify Webapp that pipeline is finished'
dependsOn: PRCustomTests
condition: succeededOrFailed()
jobs:
- template: job_templates/notify-webapp-template.yml
parameters:
name: ubuntu_1804_CPU
pool: $(BUILD_POOL_LIN_1)
customMatrixes:
PR_Notify_WebApp:
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_BUILD_PR)
_TARGET_BRANCH_TO_CHECK: ${{parameters.GitHubPyTorchPRTrigger.TARGET_BRANCH_TO_CHECK_AZ_DEVOPS_PR}}
_TARGET_PR_NUMBER: ${{parameters.GitHubPyTorchPRTrigger.PR_NUMBER}}
_TARGET_COMMIT: ${{parameters.GitHubPyTorchPRTrigger.TARGET_COMMIT}}

View File

@ -1,3 +1,19 @@
build --copt=--std=c++14
build --copt=-I.
build --copt=-isystem --copt bazel-out/k8-fastbuild/bin
build --experimental_ui_max_stdouterr_bytes=2048576
# Configuration to disable tty features for environments like CI
build:no-tty --curses no
build:no-tty --progress_report_interval 10
build:no-tty --show_progress_rate_limit 10
# Configuration to build with GPU support
build:gpu --define=cuda=true
# define a separate build folder for faster switching between configs
build:gpu --platform_suffix=-gpu
# rules_cuda configuration
build:gpu --@rules_cuda//cuda:enable_cuda
build:gpu --@rules_cuda//cuda:cuda_targets=sm_52
build:gpu --@rules_cuda//cuda:compiler=nvcc
build:gpu --repo_env=CUDA_PATH=/usr/local/cuda

View File

@ -1 +1 @@
3.1.0
4.2.1

View File

@ -343,7 +343,6 @@ All linux builds occur in docker images. The docker images are
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda92
* pytorch/manylinux-cuda100
* Also used for cpu builds

View File

@ -30,21 +30,7 @@ def get_processor_arch_name(gpu_version):
"cu" + gpu_version.strip("cuda") if gpu_version.startswith("cuda") else gpu_version
)
LINUX_PACKAGE_VARIANTS = OrderedDict(
manywheel=[
"3.6m",
"3.7m",
"3.8m",
"3.9m"
],
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7m",
],
)
CONFIG_TREE_DATA = OrderedDict(
linux=(dimensions.GPU_VERSIONS, LINUX_PACKAGE_VARIANTS),
macos=([None], OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
@ -63,7 +49,8 @@ CONFIG_TREE_DATA = OrderedDict(
],
)),
windows=(
[v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS],
# Stop building Win+CU102, see https://github.com/pytorch/pytorch/issues/65648
[v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS and v != "cuda102"],
OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
@ -126,6 +113,7 @@ class PackageFormatConfigNode(ConfigNode):
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
def get_children(self):
if self.find_prop("os_name") == "linux":
return [LinuxGccConfigNode(self, v) for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]]

View File

@ -124,9 +124,9 @@ class Conf(object):
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_upload
name: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_test
requires: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_test
filters:
branches:
only:
@ -134,7 +134,7 @@ class Conf(object):
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu92
upload_subfolder: cu113
"""
return {
"binary_upload": OrderedDict({

View File

@ -3,12 +3,13 @@ PHASES = ["build", "test"]
CUDA_VERSIONS = [
"102",
"111",
"113",
"115",
]
ROCM_VERSIONS = [
"4.0.1",
"4.1",
"4.2",
"4.3.1",
"4.5.2",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
@ -16,8 +17,8 @@ ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = [
"3.6",
"3.7",
"3.8",
"3.9"
"3.9",
"3.10"
]

View File

@ -1,99 +1,7 @@
from cimodel.lib.conf_tree import ConfigNode, X, XImportant
from cimodel.lib.conf_tree import ConfigNode
CONFIG_TREE_DATA = [
("xenial", [
("gcc", [
("5.4", [ # All this subtree rebases to master and then build
("3.6", [
("important", [X(True)]),
("parallel_tbb", [X(True)]),
("parallel_native", [X(True)]),
("pure_torch", [X(True)]),
]),
]),
# TODO: bring back libtorch test
("7", [X("3.6")]),
]),
("clang", [
("5", [
("3.6", [
("asan", [
(True, [
("shard_test", [XImportant(True)]),
]),
]),
]),
]),
("7", [
("3.6", [
("onnx", [XImportant(True)]),
]),
]),
]),
("cuda", [
("10.2", [
("3.6", [
("shard_test", [X(True)]),
("libtorch", [
(True, [
('build_only', [X(True)]),
]),
]),
]),
]),
("11.1", [
("3.8", [
("shard_test", [XImportant(True)]),
("libtorch", [
(True, [
('build_only', [X(True)]),
]),
]),
]),
]),
]),
]),
("bionic", [
("clang", [
("9", [
("3.6", [
("noarch", [XImportant(True)]),
]),
]),
("9", [
("3.6", [
("xla", [XImportant(True)]),
("vulkan", [XImportant(True)]),
]),
]),
]),
("cuda", [
("10.2", [
("3.9", [
("shard_test", [XImportant(True)]),
]),
]),
]),
("gcc", [
("9", [
("3.8", [
("coverage", [
(True, [
("shard_test", [XImportant(True)]),
]),
]),
]),
]),
]),
("rocm", [
("3.9", [
("3.6", [
('build_only', [XImportant(True)]),
]),
]),
]),
]),
]
@ -174,12 +82,19 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
"build_only": BuildOnlyConfigNode,
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"coverage": CoverageConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
return next_nodes[experimental_feature]
class SlowGradcheckConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_slow_gradcheck"] = True
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
@ -310,14 +225,6 @@ class ShardTestConfigNode(TreeConfigNode):
return ImportantConfigNode
class CoverageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_coverage"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)

View File

@ -31,6 +31,7 @@ class Conf:
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
build_only: bool = False
@staticmethod
def is_test_phase(phase):
@ -112,6 +113,8 @@ class Conf:
parameters["resource_class"] = "xlarge"
if hasattr(self, 'filters'):
parameters['filters'] = self.filters
if self.build_only:
parameters['build_only'] = miniutils.quote(str(int(True)))
return parameters
def gen_workflow_job(self, phase):
@ -175,35 +178,6 @@ class DocPushConf(object):
}
}
# TODO Convert these to graph nodes
def gen_dependent_configs(xenial_parent_config):
extra_parms = [
(["multigpu"], "large"),
(["nogpu", "NO_AVX2"], None),
(["nogpu", "NO_AVX"], None),
(["slow"], "medium"),
]
configs = []
for parms, gpu in extra_parms:
c = Conf(
xenial_parent_config.distro,
["py3"] + parms,
pyver=xenial_parent_config.pyver,
cuda_version=xenial_parent_config.cuda_version,
restrict_phases=["test"],
gpu_resource=gpu,
parent_build=xenial_parent_config,
is_important=False,
)
configs.append(c)
return configs
def gen_docs_configs(xenial_parent_config):
configs = []
@ -211,7 +185,7 @@ def gen_docs_configs(xenial_parent_config):
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
filters=gen_filter_dict(branches_list=["master", "nightly"],
tags_list=RC_PATTERN),
)
)
@ -227,7 +201,7 @@ def gen_docs_configs(xenial_parent_config):
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
filters=gen_filter_dict(branches_list=["master", "nightly"],
tags_list=RC_PATTERN),
)
)
@ -238,13 +212,6 @@ def gen_docs_configs(xenial_parent_config):
branch="master",
)
)
configs.append(
HiddenConf(
"pytorch_doc_test",
parent_build=xenial_parent_config
)
)
return configs
@ -258,7 +225,7 @@ def gen_tree():
return configs_list
def instantiate_configs():
def instantiate_configs(only_slow_gradcheck):
config_list = []
@ -272,13 +239,16 @@ def instantiate_configs():
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_coverage = fc.find_prop("is_coverage") or False
is_noarch = fc.find_prop("is_noarch") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
is_slow_gradcheck = fc.find_prop("is_slow_gradcheck") or False
parms_list_ignored_for_docker_image = []
if only_slow_gradcheck ^ is_slow_gradcheck:
continue
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
@ -313,10 +283,6 @@ def instantiate_configs():
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_coverage:
parms_list_ignored_for_docker_image.append("coverage")
python_version = fc.find_prop("pyver")
if is_noarch:
parms_list_ignored_for_docker_image.append("noarch")
@ -342,6 +308,10 @@ def instantiate_configs():
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_slow_gradcheck:
parms_list_ignored_for_docker_image.append("old")
parms_list_ignored_for_docker_image.append("gradcheck")
gpu_resource = None
if cuda_version and cuda_version != "10":
gpu_resource = "medium"
@ -361,6 +331,7 @@ def instantiate_configs():
is_libtorch=is_libtorch,
is_important=is_important,
parallel_backend=parallel_backend,
build_only=build_only,
)
# run docs builds on "pytorch-linux-xenial-py3.6-gcc5.4". Docs builds
@ -381,36 +352,14 @@ def instantiate_configs():
tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
if cuda_version == "10.2" and python_version == "3.6" and not is_libtorch:
c.dependent_tests = gen_dependent_configs(c)
if (
compiler_name == "gcc"
and compiler_version == "5.4"
and not is_libtorch
and not is_vulkan
and not is_pure_torch
and parallel_backend is None
):
bc_breaking_check = Conf(
"backward-compatibility-check",
[],
is_xla=False,
restrict_phases=["test"],
is_libtorch=False,
is_important=True,
parent_build=c,
)
c.dependent_tests.append(bc_breaking_check)
config_list.append(c)
return config_list
def get_workflow_jobs():
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs()
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:

View File

@ -1,119 +0,0 @@
import cimodel.data.simple.util.branch_filters as branch_filters
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK
)
import cimodel.lib.miniutils as miniutils
class AndroidJob:
def __init__(self,
variant,
template_name,
is_master_only=True):
self.variant = variant
self.template_name = template_name
self.is_master_only = is_master_only
def gen_tree(self):
base_name_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"android",
"ndk",
"r19c",
] + self.variant + [
"build",
]
full_job_name = "_".join(base_name_parts)
build_env_name = "-".join(base_name_parts)
props_dict = {
"name": full_job_name,
"build_environment": "\"{}\"".format(build_env_name),
"docker_image": "\"{}\"".format(DOCKER_IMAGE_NDK),
"requires": [DOCKER_REQUIREMENT_NDK]
}
if self.is_master_only:
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
return [{self.template_name: props_dict}]
class AndroidGradleJob:
def __init__(self,
job_name,
template_name,
dependencies,
is_master_only=True,
is_pr_only=False,
extra_props=tuple()):
self.job_name = job_name
self.template_name = template_name
self.dependencies = dependencies
self.is_master_only = is_master_only
self.is_pr_only = is_pr_only
self.extra_props = dict(extra_props)
def gen_tree(self):
props_dict = {
"name": self.job_name,
"requires": self.dependencies,
}
if self.is_master_only:
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
elif self.is_pr_only:
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.PR_BRANCH_LIST)
if self.extra_props:
props_dict.update(self.extra_props)
return [{self.template_name: props_dict}]
WORKFLOW_DATA = [
AndroidJob(["x86_32"], "pytorch_linux_build", is_master_only=False),
AndroidJob(["x86_64"], "pytorch_linux_build"),
AndroidJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32",
"pytorch_android_gradle_build-x86_32",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch_android_gradle_custom_build_single",
[DOCKER_REQUIREMENT_NDK],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"pytorch_android_gradle_custom_build_single",
[DOCKER_REQUIREMENT_NDK],
is_master_only=False,
is_pr_only=True,
extra_props=tuple({
"lite_interpreter": miniutils.quote(str(int(False)))
}.items())),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build",
"pytorch_android_gradle_build",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]

View File

@ -120,9 +120,9 @@ WORKFLOW_DATA = [
),
SmoketestJob(
"binary_windows_build",
["wheel", "3.7", "cu102"],
["wheel", "3.7", "cu113"],
None,
"binary_windows_wheel_3_7_cu102_build",
"binary_windows_wheel_3_7_cu113_build",
is_master_only=True,
),
@ -144,11 +144,11 @@ WORKFLOW_DATA = [
),
SmoketestJob(
"binary_windows_test",
["wheel", "3.7", "cu102"],
["wheel", "3.7", "cu113"],
None,
"binary_windows_wheel_3_7_cu102_test",
"binary_windows_wheel_3_7_cu113_test",
is_master_only=True,
requires=["binary_windows_wheel_3_7_cu102_build"],
requires=["binary_windows_wheel_3_7_cu113_build"],
extra_props={
"executor": "windows-with-nvidia-gpu",
},

View File

@ -4,37 +4,24 @@ from cimodel.lib.miniutils import quote
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
# TODO: make this generated from a matrix rather than just a static list
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = [
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7",
"pytorch-linux-bionic-py3.6-clang9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9",
"pytorch-linux-bionic-py3.8-gcc9",
"pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
"pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
"pytorch-linux-xenial-py3-clang5-asan",
"pytorch-linux-xenial-py3-clang7-onnx",
"pytorch-linux-xenial-py3.8",
"pytorch-linux-xenial-py3.6-clang7",
"pytorch-linux-xenial-py3.6-gcc5.4", # this one is used in doc builds
"pytorch-linux-xenial-py3.6-gcc7.2",
"pytorch-linux-xenial-py3.6-gcc7",
"pytorch-linux-bionic-rocm3.9-py3.6",
"pytorch-linux-bionic-rocm4.0.1-py3.6",
"pytorch-linux-bionic-rocm4.1-py3.6",
"pytorch-linux-bionic-rocm4.2-py3.6",
]
# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
def get_workflow_jobs():
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in IMAGE_NAMES:
for image_name in images:
if image_name.startswith('docker-'):
image_name = image_name.lstrip('docker-')
if only_slow_gradcheck and image_name is not SLOW_GRADCHECK_IMAGE_NAME:
continue
parameters = OrderedDict({
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),

View File

@ -1,78 +0,0 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.versions import MultiPartVersion, CudaVersion
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_BASIC, DOCKER_IMAGE_CUDA_10_2
class GeConfigTestJob:
def __init__(self,
py_version,
gcc_version,
cuda_version,
variant_parts,
extra_requires,
use_cuda_docker=False,
build_env_override=None):
self.py_version = py_version
self.gcc_version = gcc_version
self.cuda_version = cuda_version
self.variant_parts = variant_parts
self.extra_requires = extra_requires
self.use_cuda_docker = use_cuda_docker
self.build_env_override = build_env_override
def get_all_parts(self, with_dots):
maybe_py_version = self.py_version.render_dots_or_parts(with_dots) if self.py_version else []
maybe_gcc_version = self.gcc_version.render_dots_or_parts(with_dots) if self.gcc_version else []
maybe_cuda_version = self.cuda_version.render_dots_or_parts(with_dots) if self.cuda_version else []
common_parts = [
"pytorch",
"linux",
"xenial",
] + maybe_cuda_version + maybe_py_version + maybe_gcc_version
return common_parts + self.variant_parts
def gen_tree(self):
resource_class = "gpu.medium" if self.use_cuda_docker else "large"
docker_image = DOCKER_IMAGE_CUDA_10_2 if self.use_cuda_docker else DOCKER_IMAGE_BASIC
full_name = "_".join(self.get_all_parts(False))
build_env = self.build_env_override or "-".join(self.get_all_parts(True))
props_dict = {
"name": full_name,
"build_environment": build_env,
"requires": self.extra_requires,
"resource_class": resource_class,
"docker_image": docker_image,
}
if self.use_cuda_docker:
props_dict["use_cuda_docker_runtime"] = miniutils.quote(str(1))
return [{"pytorch_linux_test": props_dict}]
WORKFLOW_DATA = [
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["jit_legacy", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"]),
GeConfigTestJob(
None,
None,
CudaVersion(10, 2),
["cudnn7", "py3", "jit_legacy", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]

View File

@ -1,7 +1,7 @@
from cimodel.data.simple.util.versions import MultiPartVersion
import cimodel.lib.miniutils as miniutils
XCODE_VERSION = MultiPartVersion([12, 0, 0])
XCODE_VERSION = MultiPartVersion([12, 5, 1])
class ArchVariant:
@ -75,6 +75,12 @@ WORKFLOW_DATA = [
IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={
"op_list": "mobilenetv2.yaml",
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True)))}),
]

View File

@ -4,12 +4,6 @@ PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_ASAN,
DOCKER_REQUIREMENT_ASAN,
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class MobileJob:
@ -52,33 +46,6 @@ class MobileJob:
WORKFLOW_DATA = [
MobileJob(
DOCKER_IMAGE_ASAN,
[DOCKER_REQUIREMENT_ASAN],
["build"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "dynamic"]
),
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "static"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
# Most of this CI is already covered by "mobile-custom-build-dynamic" job
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["code", "analysis"],
True
),
]

View File

@ -1,77 +0,0 @@
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class AndroidNightlyJob:
def __init__(self,
variant,
template_name,
extra_props=None,
with_docker=True,
requires=None,
no_build_suffix=False):
self.variant = variant
self.template_name = template_name
self.extra_props = extra_props or {}
self.with_docker = with_docker
self.requires = requires
self.no_build_suffix = no_build_suffix
def gen_tree(self):
base_name_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"android",
"ndk",
"r19c",
] + self.variant
build_suffix = [] if self.no_build_suffix else ["build"]
full_job_name = "_".join(["nightly"] + base_name_parts + build_suffix)
build_env_name = "-".join(base_name_parts)
props_dict = {
"name": full_job_name,
"requires": self.requires,
"filters": {"branches": {"only": "nightly"}},
}
props_dict.update(self.extra_props)
if self.with_docker:
props_dict["docker_image"] = DOCKER_IMAGE_NDK
props_dict["build_environment"] = build_env_name
return [{self.template_name: props_dict}]
BASE_REQUIRES = [DOCKER_REQUIREMENT_NDK]
WORKFLOW_DATA = [
AndroidNightlyJob(["x86_32"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["android_gradle"], "pytorch_android_gradle_build",
with_docker=False,
requires=[
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]),
AndroidNightlyJob(["x86_32_android_publish_snapshot"], "pytorch_android_publish_snapshot",
extra_props={"context": "org-member"},
with_docker=False,
requires=["nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build"],
no_build_suffix=True),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]

View File

@ -1,12 +1,15 @@
import cimodel.data.simple.ios_definitions as ios_definitions
import cimodel.lib.miniutils as miniutils
class IOSNightlyJob:
def __init__(self,
variant,
is_full_jit=False,
is_upload=False):
self.variant = variant
self.is_full_jit = is_full_jit
self.is_upload = is_upload
def get_phase_name(self):
@ -16,8 +19,11 @@ class IOSNightlyJob:
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
extra_name = ["full_jit"] if self.is_full_jit else []
common_name_pieces = [
"ios",
] + extra_name + [
] + ios_definitions.XCODE_VERSION.render_dots_or_parts(with_version_dots) + [
"nightly",
self.variant,
@ -30,7 +36,8 @@ class IOSNightlyJob:
return "_".join(["pytorch"] + self.get_common_name_pieces(False))
def gen_tree(self):
extra_requires = [x.gen_job_name() for x in BUILD_CONFIGS] if self.is_upload else []
build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS
extra_requires = [x.gen_job_name() for x in build_configs] if self.is_upload else []
props_dict = {
"build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(True)),
@ -43,6 +50,11 @@ class IOSNightlyJob:
props_dict["ios_arch"] = self.variant
props_dict["ios_platform"] = ios_definitions.get_platform(self.variant)
props_dict["name"] = self.gen_job_name()
props_dict["use_metal"] = miniutils.quote(str(int(True)))
props_dict["use_coreml"] = miniutils.quote(str(int(True)))
if self.is_full_jit:
props_dict["lite_interpreter"] = miniutils.quote(str(int(False)))
template_name = "_".join([
"binary",
@ -58,9 +70,14 @@ BUILD_CONFIGS = [
IOSNightlyJob("arm64"),
]
BUILD_CONFIGS_FULL_JIT = [
IOSNightlyJob("x86_64", is_full_jit=True),
IOSNightlyJob("arm64", is_full_jit=True),
]
WORKFLOW_DATA = BUILD_CONFIGS + [
IOSNightlyJob("binary", is_upload=True),
WORKFLOW_DATA = BUILD_CONFIGS + BUILD_CONFIGS_FULL_JIT + [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]

View File

@ -11,7 +11,7 @@ def gen_docker_image_requires(image_name):
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc5.4"
"pytorch-linux-xenial-py3.7-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
@ -19,7 +19,7 @@ DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc7"
"pytorch-linux-xenial-py3.7-gcc7"
)

View File

@ -1,164 +0,0 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN, NON_PR_BRANCH_LIST
from cimodel.data.simple.util.versions import CudaVersion
class WindowsJob:
def __init__(
self,
test_index,
vscode_spec,
cuda_version,
force_on_cpu=False,
multi_gpu=False,
master_only=False,
nightly_only=False,
master_and_nightly=False
):
self.test_index = test_index
self.vscode_spec = vscode_spec
self.cuda_version = cuda_version
self.force_on_cpu = force_on_cpu
self.multi_gpu = multi_gpu
self.master_only = master_only
self.nightly_only = nightly_only
self.master_and_nightly = master_and_nightly
def gen_tree(self):
base_phase = "build" if self.test_index is None else "test"
numbered_phase = (
base_phase if self.test_index is None else base_phase + str(self.test_index)
)
key_parts = ["pytorch", "windows", base_phase]
if self.multi_gpu:
key_parts.append('multigpu')
key_name = "_".join(key_parts)
cpu_forcing_name_parts = ["on", "cpu"] if self.force_on_cpu else []
target_arch = self.cuda_version.render_dots() if self.cuda_version else "cpu"
base_name_parts = [
"pytorch",
"windows",
self.vscode_spec.render(),
"py36",
target_arch,
]
prerequisite_jobs = []
if base_phase == "test":
prerequisite_jobs.append("_".join(base_name_parts + ["build"]))
if self.cuda_version:
self.cudnn_version = 8 if self.cuda_version.major == 11 else 7
arch_env_elements = (
["cuda" + str(self.cuda_version.major), "cudnn" + str(self.cudnn_version)]
if self.cuda_version
else ["cpu"]
)
build_environment_string = "-".join(
["pytorch", "win"]
+ self.vscode_spec.get_elements()
+ arch_env_elements
+ ["py3"]
)
is_running_on_cuda = bool(self.cuda_version) and not self.force_on_cpu
if self.multi_gpu:
props_dict = {"requires": prerequisite_jobs}
else:
props_dict = {
"build_environment": build_environment_string,
"python_version": miniutils.quote("3.6"),
"vc_version": miniutils.quote(self.vscode_spec.dotted_version()),
"vc_year": miniutils.quote(str(self.vscode_spec.year)),
"vc_product": self.vscode_spec.get_product(),
"use_cuda": miniutils.quote(str(int(is_running_on_cuda))),
"requires": prerequisite_jobs,
}
if self.master_only:
props_dict[
"filters"
] = gen_filter_dict()
elif self.nightly_only:
props_dict[
"filters"
] = gen_filter_dict(branches_list=["nightly"], tags_list=RC_PATTERN)
elif self.master_and_nightly:
props_dict[
"filters"
] = gen_filter_dict(branches_list=NON_PR_BRANCH_LIST + ["nightly"], tags_list=RC_PATTERN)
name_parts = base_name_parts + cpu_forcing_name_parts + [numbered_phase]
if not self.multi_gpu:
if base_phase == "test":
test_name = "-".join(["pytorch", "windows", numbered_phase])
props_dict["test_name"] = test_name
if is_running_on_cuda:
props_dict["executor"] = "windows-with-nvidia-gpu"
props_dict["cuda_version"] = (
miniutils.quote(str(self.cuda_version))
if self.cuda_version
else "cpu"
)
props_dict["name"] = "_".join(name_parts)
return [{key_name: props_dict}]
class VcSpec:
def __init__(self, year, version_elements=None, hide_version=False):
self.year = year
self.version_elements = version_elements or []
self.hide_version = hide_version
def get_elements(self):
if self.hide_version:
return [self.prefixed_year()]
return [self.prefixed_year()] + self.version_elements
def get_product(self):
return "BuildTools"
def dotted_version(self):
return ".".join(self.version_elements)
def prefixed_year(self):
return "vs" + str(self.year)
def render(self):
return "_".join(self.get_elements())
_VC2019 = VcSpec(2019)
WORKFLOW_DATA = [
# VS2019 CUDA-10.1
WindowsJob(None, _VC2019, CudaVersion(10, 1), master_only=True),
WindowsJob(1, _VC2019, CudaVersion(10, 1), master_only=True),
WindowsJob(2, _VC2019, CudaVersion(10, 1), master_only=True),
# VS2019 CUDA-11.1
WindowsJob(None, _VC2019, CudaVersion(11, 1)),
WindowsJob(1, _VC2019, CudaVersion(11, 1), master_only=True),
WindowsJob(2, _VC2019, CudaVersion(11, 1), master_only=True),
WindowsJob('_azure_multi_gpu', _VC2019, CudaVersion(11, 1), multi_gpu=True, nightly_only=True),
# VS2019 CPU-only
WindowsJob(None, _VC2019, None),
WindowsJob(1, _VC2019, None),
WindowsJob(2, _VC2019, None),
WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only=True),
]
def get_windows_workflows():
return [item.gen_tree() for item in WORKFLOW_DATA]

.circleci/config.yml generated

File diff suppressed because it is too large

View File

@ -27,5 +27,5 @@ Docker builds are now defined with `.circleci/cimodel/data/simple/docker_definit
./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
# Set flags (see build.sh) and build image
sudo bash -c 'BREAKPAD=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
sudo bash -c 'PROTOBUF=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
```

View File

@ -51,9 +51,9 @@ android {
dependencies {
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'androidx.appcompat:appcompat:1.0.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
implementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'
implementation 'com.google.code.findbugs:jsr305:3.0.1'
implementation 'com.facebook.soloader:nativeloader:0.8.0'
implementation 'com.facebook.soloader:nativeloader:0.10.1'
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion

View File

@ -78,127 +78,127 @@ TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/u
case "$image" in
pytorch-linux-xenial-py3.8)
ANACONDA_PYTHON_VERSION=3.8
CMAKE_VERSION=3.10.3
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py3.6-gcc5.4)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-xenial-py3.7-gcc5.4)
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=5
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-py3.6-gcc7.2)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-xenial-py3.7-gcc7.2)
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py3.6-gcc7)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-xenial-py3.7-gcc7)
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7)
CUDA_VERSION=10.0
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7)
CUDA_VERSION=10.1
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)
CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
TENSORRT_VERSION=8.0.1.6
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.5-cudnn8-py3-gcc7)
CUDA_VERSION=11.5.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-py3-clang5-asan)
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=5.0
CMAKE_VERSION=3.13.5
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang7-asan)
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=7
CMAKE_VERSION=3.10.3
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-py3-clang7-onnx)
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=7
CMAKE_VERSION=3.10.3
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-py3-clang5-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=5.0
CMAKE_VERSION=3.13.5
LLVMDEV=yes
PROTOBUF=yes
ANDROID=yes
ANDROID_NDK_VERSION=r19c
GRADLE_VERSION=6.8.3
CMAKE_VERSION=3.7.0
NINJA_VERSION=1.9.0
;;
pytorch-linux-xenial-py3.6-clang7)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-xenial-py3.7-clang7)
ANACONDA_PYTHON_VERSION=3.7
CMAKE_VERSION=3.10.3
CLANG_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-py3.6-clang9)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-bionic-py3.7-clang9)
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
;;
@ -208,28 +208,15 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9)
pytorch-linux-bionic-cuda10.2-cudnn7-py3.7-clang9)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)
CUDA_VERSION=10.2
@ -239,53 +226,42 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9)
pytorch-linux-bionic-cuda11.0-cudnn8-py3.7-gcc9)
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=3.9
;;
pytorch-linux-bionic-rocm4.0.1-py3.6)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-bionic-rocm4.3.1-py3.7)
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=4.0.1
ROCM_VERSION=4.3.1
;;
pytorch-linux-bionic-rocm4.1-py3.6)
ANACONDA_PYTHON_VERSION=3.6
pytorch-linux-bionic-rocm4.5-py3.7)
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=4.1
;;
pytorch-linux-bionic-rocm4.2-py3.6)
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=4.2
ROCM_VERSION=4.5.2
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *xenial* ]]; then
CMAKE_VERSION=3.10.3
fi
if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION
fi
@ -320,7 +296,17 @@ if [ -n "${JENKINS:-}" ]; then
JENKINS_GID=$(id -g jenkins)
fi
tmp_tag="tmp-$(cat /dev/urandom | tr -dc 'a-z' | head -c 32)"
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
# If we are trying to use nvidia cuda image make sure it exists, otherwise use IMAGE from ghcr.io
# this logic currently only exists for ubuntu
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if ! DOCKER_CLI_EXPERIMENTAL=enabled docker manifest inspect "${IMAGE_NAME}" >/dev/null 2>/dev/null; then
IMAGE_NAME="ghcr.io/pytorch/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
INSTALL_CUDNN="True"
fi
fi
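# `docker manifest inspect` exits non-zero when a tag is not published, which is
# what makes it usable above as an existence probe without pulling the image.
# A minimal sketch of that check; the tag below is illustrative only:
```
if DOCKER_CLI_EXPERIMENTAL=enabled \
   docker manifest inspect "nvidia/cuda:11.5.0-cudnn8-devel-ubuntu18.04" >/dev/null 2>&1; then
  echo "upstream tag exists, use it directly"
else
  echo "tag missing upstream, fall back to the ghcr.io mirror"
fi
```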
# Build image
# TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm
@ -348,7 +334,7 @@ docker build \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "BREAKPAD=${BREAKPAD}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
@ -358,6 +344,9 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "INSTALL_CUDNN=${INSTALL_CUDNN}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \
@ -376,6 +365,7 @@ function drun() {
}
if [[ "$OS" == "ubuntu" ]]; then
if !(drun lsb_release -a 2>&1 | grep -qF Ubuntu); then
echo "OS=ubuntu, but:"
drun lsb_release -a


@ -26,11 +26,14 @@ login() {
docker login -u AWS --password-stdin "$1"
}
# Retry on timeouts (can happen on job stampede).
retry login "${registry}"
# Logout on exit
trap "docker logout ${registry}" EXIT
# Only run these steps if not on github actions
if [[ -z "${GITHUB_ACTIONS}" ]]; then
# Retry on timeouts (can happen on job stampede).
retry login "${registry}"
# Logout on exit
trap "docker logout ${registry}" EXIT
fi
# export EC2=1
# export JENKINS=1
@ -45,8 +48,8 @@ trap "docker logout ${registry}" EXIT
docker push "${image}:${tag}"
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
if [ -z "${DOCKER_SKIP_S3_UPLOAD:-}" ]; then
trap "rm -rf ${IMAGE_NAME}:${tag}.tar" EXIT
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read
fi


@ -4,6 +4,10 @@ FROM centos:${CENTOS_VERSION}
ARG CENTOS_VERSION
# Set AMD gpu targets to build for
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Install required packages to build Caffe2
# Install common dependencies (so that this step can be cached separately)
@ -11,6 +15,12 @@ ARG EC2
ADD ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Update CentOS git version
RUN yum -y remove git
RUN yum -y remove git-*
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm
RUN yum install -y git
# Install devtoolset
ARG DEVTOOLSET_VERSION
ADD ./common/install_devtoolset.sh install_devtoolset.sh
@ -27,7 +37,7 @@ RUN rm install_glibc.sh
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, coverage, pytest)
# Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh


@ -11,8 +11,13 @@ install_ubuntu() {
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
maybe_libiomp_dev="libiomp-dev"
elif [[ "$UBUNTU_VERSION" == "20.04"* ]]; then
cmake3="cmake=3.16*"
maybe_libiomp_dev=""
else
cmake3="cmake=3.5*"
maybe_libiomp_dev="libiomp-dev"
fi
# Install common dependencies
@ -33,7 +38,7 @@ install_ubuntu() {
git \
libatlas-base-dev \
libc6-dbg \
libiomp-dev \
${maybe_libiomp_dev} \
libyaml-dev \
libz-dev \
libjpeg-dev \
@ -44,6 +49,10 @@ install_ubuntu() {
wget \
vim
# Should resolve issues related to various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931
apt-get install -y libgnutls30
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -109,14 +118,11 @@ esac
# Install Valgrind separately since the apt-get version is too old.
mkdir valgrind_build && cd valgrind_build
VALGRIND_VERSION=3.16.1
if ! wget http://valgrind.org/downloads/valgrind-${VALGRIND_VERSION}.tar.bz2
then
wget https://sourceware.org/ftp/valgrind/valgrind-${VALGRIND_VERSION}.tar.bz2
fi
wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make -j 4
make -j6
sudo make install
cd ../../
rm -rf valgrind_build


@ -1,19 +0,0 @@
#!/bin/bash
set -ex
git clone https://github.com/malfet/breakpad.git -b pytorch/release-1.9
pushd breakpad
git clone https://chromium.googlesource.com/linux-syscall-support src/third_party/lss
pushd src/third_party/lss
# same as with breakpad, there are no real releases for this repo so use a
# commit as the pin
git checkout e1e7b0ad8ee99a875b272c8e33e308472e897660
popd
./configure
make
make install
popd
rm -rf breakpad


@ -4,6 +4,9 @@ set -ex
[ -n "$CMAKE_VERSION" ]
# Remove system cmake install so it won't get used instead
apt-get remove cmake -y
# Turn 3.6.3 into v3.6
path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/')
file="cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz"


@ -13,7 +13,12 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
CONDA_FILE="Miniconda2-latest-Linux-x86_64.sh"
;;
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [ "$ANACONDA_PYTHON_VERSION" = "3.6" ]; then
# Latest release of Conda that still supports python-3.6
CONDA_FILE="Miniconda3-py37_4.10.3-Linux-x86_64.sh"
else
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
fi
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
@ -56,7 +61,9 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pushd /opt/conda
# Track latest conda update
as_jenkins conda update -y -n base conda
if [ "$ANACONDA_PYTHON_VERSION" != "3.6" ]; then
as_jenkins conda update -y -n base conda
fi
# Install correct Python version
as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION"
@ -69,8 +76,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
}
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
# DO NOT install cmake here as it would install a version newer than 3.5, but
# we want to pin to version 3.5.
# DO NOT install cmake here as it would install a version newer than 3.10, but
# we want to pin to version 3.10.
SCIPY_VERSION=1.1.0
if [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -86,18 +93,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_install numpy=1.18.5 astunparse pyyaml mkl mkl-include setuptools cffi future six dataclasses typing_extensions
fi
if [[ "$CUDA_VERSION" == 10.0* ]]; then
conda_install magma-cuda100 -c pytorch
elif [[ "$CUDA_VERSION" == 10.1* ]]; then
conda_install magma-cuda101 -c pytorch
elif [[ "$CUDA_VERSION" == 10.2* ]]; then
conda_install magma-cuda102 -c pytorch
elif [[ "$CUDA_VERSION" == 11.0* ]]; then
conda_install magma-cuda110 -c pytorch
elif [[ "$CUDA_VERSION" == 11.1* ]]; then
conda_install magma-cuda111 -c pytorch
elif [[ "$CUDA_VERSION" == 11.3* ]]; then
conda_install magma-cuda113 -c pytorch
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
if [ -n "$CUDA_VERSION" ]; then
conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch
fi
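# The nested parameter expansions above are dense; a minimal sketch of the
# derivation, assuming the magma-cudaXY package naming the comment describes:
```
# ${CUDA_VERSION/./} drops the first dot; ${TMP%.*[0-9]} strips a trailing
# ".<revision>" if one is present
for CUDA_VERSION in 10.2 10.2.89 11.3.0; do
  TMP=${CUDA_VERSION/./}
  echo "$CUDA_VERSION -> magma-cuda${TMP%.*[0-9]}"
done
# 10.2    -> magma-cuda102
# 10.2.89 -> magma-cuda102
# 11.3.0  -> magma-cuda113
```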
# TODO: This isn't working atm
@ -107,22 +106,21 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# TODO: Why is scipy pinned
# Pin MyPy version because new errors are likely to appear with each release
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
# Pin coverage so we can use COVERAGE_RCFILE
as_jenkins pip install --progress-bar off pytest \
scipy==$SCIPY_VERSION \
scikit-image \
psutil \
unittest-xml-reporting \
boto3==1.16.34 \
coverage==5.5 \
hypothesis==4.53.2 \
expecttest==0.1.3 \
mypy==0.812 \
tb-nightly
# Install numba only on python-3.8 or below
# For numba issue see https://github.com/pytorch/pytorch/issues/51511
if [[ $(python -c "import sys; print(int(sys.version_info < (3, 9)))") == "1" ]]; then
as_jenkins pip install --progress-bar off numba librosa>=0.6.2
as_jenkins pip install --progress-bar off numba==0.54.1 librosa>=0.6.2
else
as_jenkins pip install --progress-bar off numba==0.49.0 librosa>=0.6.2
fi


@ -0,0 +1,10 @@
#!/bin/bash
sudo apt-get update
# also install ssh to avoid error of:
# --------------------------------------------------------------------------
# The value of the MCA parameter "plm_rsh_agent" was set to a path
# that could not be found:
# plm_rsh_agent: ssh : rsh
sudo apt-get install -y ssh
sudo apt-get update && apt-get install -y --no-install-recommends libcudnn8=8.2.0.53-1+cuda11.3 libcudnn8-dev=8.2.0.53-1+cuda11.3 && apt-mark hold libcudnn8


@ -2,23 +2,6 @@
set -ex
# This function installs protobuf 2.6
install_protobuf_26() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/google/protobuf/releases/download/v2.6.1/protobuf-2.6.1.tar.gz"
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-2.6.1.tar.gz
pushd "$pb_dir" && ./configure && make && make check && sudo make install && sudo ldconfig
popd
rm -rf $pb_dir
}
install_ubuntu() {
apt-get update
apt-get install -y --no-install-recommends \


@ -7,15 +7,18 @@ if [ -n "$GCC_VERSION" ]; then
# Need the official toolchain repo to get alternate packages
add-apt-repository ppa:ubuntu-toolchain-r/test
apt-get update
if [ "$UBUNTU_VERSION" = "16.04" -a "$GCC_VERSION" = "5" ]; then
if [[ "$UBUNTU_VERSION" == "16.04" && "${GCC_VERSION:0:1}" == "5" ]]; then
apt-get install -y g++-5=5.4.0-6ubuntu1~16.04.12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-5 50
else
apt-get install -y g++-$GCC_VERSION
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-"$GCC_VERSION" 50
fi
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-"$GCC_VERSION" 50
# Cleanup package manager
apt-get autoclean && apt-get clean


@ -1,4 +0,0 @@
#!/bin/bash
sudo apt-get -qq update
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages libnccl-dev=2.5.6-1+cuda10.1 libnccl2=2.5.6-1+cuda10.1


@ -1,4 +1,10 @@
#!/bin/bash
sudo apt-get update
# also install ssh to avoid error of:
# --------------------------------------------------------------------------
# The value of the MCA parameter "plm_rsh_agent" was set to a path
# that could not be found:
# plm_rsh_agent: ssh : rsh
sudo apt-get install -y ssh
sudo apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev


@ -4,11 +4,11 @@ set -ex
OPENSSL=openssl-1.1.1k
wget -q -O "${OPENSSL}.tar.gz" "https://www.openssl.org/source/${OPENSSL}.tar.gz"
wget -q -O "${OPENSSL}.tar.gz" "https://ossci-linux.s3.amazonaws.com/${OPENSSL}.tar.gz"
tar xf "${OPENSSL}.tar.gz"
cd "${OPENSSL}"
./config --prefix=/opt/openssl -d '-Wl,--enable-new-dtags,-rpath,$(LIBRPATH)'
# NOTE: opensl errors out when built with the -j option
make install_sw
# NOTE: openssl install errors out when built with the -j option
make -j6; make install_sw
cd ..
rm -rf "${OPENSSL}"


@ -2,8 +2,8 @@
set -ex
# This function installs protobuf 2.6
install_protobuf_26() {
# This function installs protobuf 3.17
install_protobuf_317() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
@ -12,37 +12,32 @@ install_protobuf_26() {
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/google/protobuf/releases/download/v2.6.1/protobuf-2.6.1.tar.gz"
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-2.6.1.tar.gz
pushd "$pb_dir" && ./configure && make && make check && sudo make install && sudo ldconfig
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz"
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
# -j6 to balance memory usage and speed.
# naked `-j` seems to use too much memory.
pushd "$pb_dir" && ./configure && make -j6 && make -j6 check && sudo make -j6 install && sudo ldconfig
popd
rm -rf $pb_dir
}
install_ubuntu() {
# Ubuntu 14.04 ships with protobuf 2.5, but ONNX needs protobuf >= 2.6
# so we install that here if on 14.04
# Ubuntu 14.04 also has cmake 2.8.12 as the default option, so we will
# Ubuntu 14.04 has cmake 2.8.12 as the default option, so we will
# install cmake3 here and use cmake3.
apt-get update
if [[ "$UBUNTU_VERSION" == 14.04 ]]; then
apt-get install -y --no-install-recommends cmake3
install_protobuf_26
else
apt-get install -y --no-install-recommends \
libprotobuf-dev \
protobuf-compiler
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
install_protobuf_317
}
install_centos() {
# Centos7 ships with protobuf 2.5, but ONNX needs protobuf >= 2.6
# so we always install that here
install_protobuf_26
install_protobuf_317
}
# Install base packages depending on the base OS


@ -6,14 +6,23 @@ install_magma() {
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
git checkout 878b1ce02e9cfe4a829be22c8f911e9c0b6bd88f
# fix for magma_queue memory leak issue
git checkout c62d700d880c7283b33fb1d615d62fc9c7f7ca21
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
echo 'DEVCCFLAGS += --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --gpu-max-threads-per-block=256' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
export PATH="${PATH}:/opt/rocm/bin"
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
@ -25,12 +34,19 @@ ver() {
printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
}
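# ver() zero-pads each dot-separated component into one integer so plain
# numeric -ge/-lt comparison orders versions correctly. A sketch of the values
# it produces, assuming at most four components (extra components would make
# printf cycle its format and break the scheme):
```
ver() { printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' '); }
ver 4.2; echo      #   4002000000
ver 4.5; echo      #   4005000000
ver 4.5.2; echo    #   4005002000
[[ $(ver 4.5.2) -ge $(ver 4.5) ]] && echo "4.5.2 >= 4.5"
```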
# Map ROCm version to AMDGPU version
declare -A AMDGPU_VERSIONS=( ["4.5.2"]="21.40.2" )
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
if [[ $UBUNTU_VERSION == 20.04 ]]; then
# gpg-agent is not available by default on 20.04
apt-get install -y --no-install-recommends gpg-agent
fi
apt-get install -y kmod
apt-get install -y wget
@ -38,6 +54,13 @@ install_ubuntu() {
apt-get install -y libc++1
apt-get install -y libc++abi1
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu"
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
fi
ROCM_REPO="ubuntu"
if [[ $(ver $ROCM_VERSION) -lt $(ver 4.2) ]]; then
ROCM_REPO="xenial"
@ -45,7 +68,8 @@ install_ubuntu() {
# Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
echo "deb [arch=amd64] http://repo.radeon.com/rocm/apt/${ROCM_VERSION} ${ROCM_REPO} main" > /etc/apt/sources.list.d/rocm.list
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
echo "deb [arch=amd64] ${rocm_baseurl} ${ROCM_REPO} main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
@ -82,11 +106,24 @@ install_centos() {
yum install -y epel-release
yum install -y dkms kernel-headers-`uname -r` kernel-devel-`uname -r`
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository
local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64"
echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo
echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo
echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo
echo "enabled=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgcheck=1" >> /etc/yum.repos.d/amdgpu.repo
echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo
fi
local rocm_baseurl="http://repo.radeon.com/rocm/yum/${ROCM_VERSION}"
echo "[ROCm]" > /etc/yum.repos.d/rocm.repo
echo "name=ROCm" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/${ROCM_VERSION}" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=${rocm_baseurl}" >> /etc/yum.repos.d/rocm.repo
echo "enabled=1" >> /etc/yum.repos.d/rocm.repo
echo "gpgcheck=0" >> /etc/yum.repos.d/rocm.repo
echo "gpgcheck=1" >> /etc/yum.repos.d/rocm.repo
echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/rocm.repo
yum update -y


@ -0,0 +1,7 @@
#!/bin/bash
if [ -n "$TENSORRT_VERSION" ]; then
python3 -m pip install --upgrade setuptools pip
python3 -m pip install nvidia-pyindex
python3 -m pip install nvidia-tensorrt==${TENSORRT_VERSION} --extra-index-url https://pypi.ngc.nvidia.com
fi


@ -2,23 +2,6 @@
set -ex
# This function installs protobuf 2.6
install_protobuf_26() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/google/protobuf/releases/download/v2.6.1/protobuf-2.6.1.tar.gz"
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-2.6.1.tar.gz
pushd "$pb_dir" && ./configure && make && make check && sudo make install && sudo ldconfig
popd
rm -rf $pb_dir
}
install_ubuntu() {
apt-get update
apt-get install -y --no-install-recommends \


@ -1,13 +1,15 @@
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG CUDNN_VERSION
ARG IMAGE_NAME
FROM nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}
FROM ${IMAGE_NAME}
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG CUDNN_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
@ -24,7 +26,7 @@ ARG KATEX
ADD ./common/install_katex.sh install_katex.sh
RUN bash ./install_katex.sh && rm install_katex.sh
# Install conda and other packages (e.g., numpy, coverage, pytest)
# Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
@ -65,22 +67,29 @@ ADD ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
# (optional) Install TensorRT
ARG TENSORRT_VERSION
ADD ./common/install_tensorrt.sh install_tensorrt.sh
RUN if [ -n "${TENSORRT_VERSION}" ]; then bash ./install_tensorrt.sh; fi
RUN rm install_tensorrt.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
ENV CUDA_NVCC_EXECUTABLE=/opt/cache/lib/nvcc
ENV CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
# Add jni.h for java host build
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
# Install NCCL for when CUDA is version 10.1
ADD ./common/install_nccl.sh install_nccl.sh
RUN if [ "${CUDA_VERSION}" = 10.1 ]; then bash ./install_nccl.sh; fi
RUN rm install_nccl.sh
# Install Open MPI for CUDA
ADD ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
@ -93,9 +102,17 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
# Hack for CUDA 11.5.0 image to install cudnn8 since cudnn8 is not included with CUDA 11.5 image
# Also note cudnn 8.2.0.53 is labeled for cuda 11.3
ARG INSTALL_CUDNN
ADD ./common/install_cudnn8.sh install_cudnn8.sh
RUN if [ -n "${INSTALL_CUDNN}" ]; then bash install_cudnn8.sh; fi
RUN rm install_cudnn8.sh
USER jenkins
CMD ["bash"]


@ -6,6 +6,10 @@ ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Set AMD gpu targets to build for
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Install common dependencies (so that this step can be cached separately)
ARG EC2
ADD ./common/install_base.sh install_base.sh
@ -21,7 +25,7 @@ RUN bash ./install_clang.sh && rm install_clang.sh
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, coverage, pytest)
# Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh


@ -33,7 +33,7 @@ ARG KATEX
ADD ./common/install_katex.sh install_katex.sh
RUN bash ./install_katex.sh && rm install_katex.sh
# Install conda and other packages (e.g., numpy, coverage, pytest)
# Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
@ -82,13 +82,6 @@ RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
# (optional) Install breakpad
ARG BREAKPAD
ADD ./common/install_breakpad.sh install_breakpad.sh
RUN if [ -n "${BREAKPAD}" ]; then bash ./install_breakpad.sh; fi
RUN rm install_breakpad.sh
ENV INSTALLED_BREAKPAD ${BREAKPAD}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
ADD ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh


@ -1,13 +0,0 @@
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3-pip git && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log
ADD requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt
ADD gc.py /usr/bin/gc.py
ADD docker_hub.py /usr/bin/docker_hub.py
ENTRYPOINT ["/usr/bin/gc.py"]


@ -1,125 +0,0 @@
#!/usr/bin/env python3
from collections import namedtuple
import boto3
import requests
import os
IMAGE_INFO = namedtuple(
"IMAGE_INFO", ("repo", "tag", "size", "last_updated_at", "last_updated_by")
)
def build_access_token(username, password):
r = requests.post(
"https://hub.docker.com/v2/users/login/",
data={"username": username, "password": password},
)
r.raise_for_status()
token = r.json().get("token")
return {"Authorization": "JWT " + token}
def list_repos(user, token):
r = requests.get("https://hub.docker.com/v2/repositories/" + user, headers=token)
r.raise_for_status()
ret = sorted(
repo["user"] + "/" + repo["name"] for repo in r.json().get("results", [])
)
if ret:
print("repos found:")
print("".join("\n\t" + r for r in ret))
return ret
def list_tags(repo, token):
r = requests.get(
"https://hub.docker.com/v2/repositories/" + repo + "/tags", headers=token
)
r.raise_for_status()
return [
IMAGE_INFO(
repo=repo,
tag=t["name"],
size=t["full_size"],
last_updated_at=t["last_updated"],
last_updated_by=t["last_updater_username"],
)
for t in r.json().get("results", [])
]
def save_to_s3(tags):
table_content = ""
client = boto3.client("s3")
for t in tags:
table_content += (
"<tr><td>{repo}</td><td>{tag}</td><td>{size}</td>"
"<td>{last_updated_at}</td><td>{last_updated_by}</td></tr>"
).format(
repo=t.repo,
tag=t.tag,
size=t.size,
last_updated_at=t.last_updated_at,
last_updated_by=t.last_updated_by,
)
html_body = """
<html>
<head>
<link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh"
crossorigin="anonymous">
<link rel="stylesheet" type="text/css"
href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">
</script>
<script type="text/javascript" charset="utf8"
src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.js"></script>
<title> docker image info</title>
</head>
<body>
<table class="table table-striped table-hover" id="docker">
<caption>Docker images on docker hub</caption>
<thead class="thead-dark">
<tr>
<th scope="col">repo</th>
<th scope="col">tag</th>
<th scope="col">size</th>
<th scope="col">last_updated_at</th>
<th scope="col">last_updated_by</th>
</tr>
</thead>
<tbody>
{table_content}
</tbody>
</table>
</body>
<script>
$(document).ready( function () {{
$('#docker').DataTable({{paging: false}});
}} );
</script>
</html>
""".format(
table_content=table_content
)
client.put_object(
Bucket="docker.pytorch.org",
ACL="public-read",
Key="docker_hub.html",
Body=html_body,
ContentType="text/html",
)
if __name__ == "__main__":
username = os.environ.get("DOCKER_HUB_USERNAME")
password = os.environ.get("DOCKER_HUB_PASSWORD")
token = build_access_token(username, password)
tags = []
for repo in list_repos("pytorch", token):
tags.extend(list_tags(repo, token))
save_to_s3(tags)


@ -1,218 +0,0 @@
#!/usr/bin/env python3
import argparse
import boto3
import datetime
import pytz
import re
import sys
def save_to_s3(project, data):
table_content = ""
client = boto3.client("s3")
for repo, tag, window, age, pushed in data:
table_content += "<tr><td>{repo}</td><td>{tag}</td><td>{window}</td><td>{age}</td><td>{pushed}</td></tr>".format(
repo=repo, tag=tag, window=window, age=age, pushed=pushed
)
html_body = """
<html>
<head>
<link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh"
crossorigin="anonymous">
<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.js"></script>
<title>{project} nightly and permanent docker image info</title>
</head>
<body>
<table class="table table-striped table-hover" id="docker">
<thead class="thead-dark">
<tr>
<th scope="col">repo</th>
<th scope="col">tag</th>
<th scope="col">keep window</th>
<th scope="col">age</th>
<th scope="col">pushed at</th>
</tr>
</thead>
<tbody>
{table_content}
</tbody>
</table>
</body>
<script>
$(document).ready( function () {{
$('#docker').DataTable({{paging: false}});
}} );
</script>
</html>
""".format(
project=project, table_content=table_content
)
# for pytorch, file can be found at
# http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html
# and later on we can configure docker.pytorch.org to point to the location
client.put_object(
Bucket="docker.pytorch.org",
ACL="public-read",
Key="{project}.html".format(project=project),
Body=html_body,
ContentType="text/html",
)
def repos(client):
paginator = client.get_paginator("describe_repositories")
pages = paginator.paginate(registryId="308535385114")
for page in pages:
for repo in page["repositories"]:
yield repo
def images(client, repository):
paginator = client.get_paginator("describe_images")
pages = paginator.paginate(
registryId="308535385114", repositoryName=repository["repositoryName"]
)
for page in pages:
for image in page["imageDetails"]:
yield image
parser = argparse.ArgumentParser(description="Delete old Docker tags from registry")
parser.add_argument(
"--dry-run", action="store_true", help="Dry run; print tags that would be deleted"
)
parser.add_argument(
"--debug", action="store_true", help="Debug, print ignored / saved tags"
)
parser.add_argument(
"--keep-stable-days",
type=int,
default=14,
help="Days of stable Docker tags to keep (non per-build images)",
)
parser.add_argument(
"--keep-unstable-days",
type=int,
default=1,
help="Days of unstable Docker tags to keep (per-build images)",
)
parser.add_argument(
"--filter-prefix",
type=str,
default="",
help="Only run cleanup for repositories with this prefix",
)
parser.add_argument(
"--ignore-tags",
type=str,
default="",
help="Never cleanup these tags (comma separated)",
)
args = parser.parse_args()
if not args.ignore_tags or not args.filter_prefix:
print(
"""
Missing required arguments --ignore-tags and --filter-prefix
You must specify --ignore-tags and --filter-prefix to avoid accidentally
pruning a stable Docker tag which is being actively used. This will
make you VERY SAD. So pay attention.
First, which filter-prefix do you want? The list of valid prefixes
is in jobs/private.groovy under the 'docker-registry-cleanup' job.
You probably want either pytorch or caffe2.
Second, which ignore-tags do you want? It should be whatever the most
up-to-date DockerVersion for the repository in question is. Follow
the imports of jobs/pytorch.groovy to find them.
"""
)
sys.exit(1)
client = boto3.client("ecr", region_name="us-east-1")
stable_window = datetime.timedelta(days=args.keep_stable_days)
unstable_window = datetime.timedelta(days=args.keep_unstable_days)
now = datetime.datetime.now(pytz.UTC)
ignore_tags = args.ignore_tags.split(",")
def chunks(chunkable, n):
""" Yield successive n-sized chunks from l.
"""
for i in range(0, len(chunkable), n):
yield chunkable[i: i + n]
SHA_PATTERN = re.compile(r'^[0-9a-f]{40}$')
def looks_like_git_sha(tag):
"""Returns a boolean to check if a tag looks like a git sha
For reference a sha1 is 40 characters with only 0-9a-f and contains no
"-" characters
"""
return re.match(SHA_PATTERN, tag) is not None
stable_window_tags = []
for repo in repos(client):
repositoryName = repo["repositoryName"]
if not repositoryName.startswith(args.filter_prefix):
continue
# Keep list of image digests to delete for this repository
digest_to_delete = []
for image in images(client, repo):
tags = image.get("imageTags")
if not isinstance(tags, (list,)) or len(tags) == 0:
continue
created = image["imagePushedAt"]
age = now - created
for tag in tags:
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
if tag in ignore_tags or age < window:
if args.debug:
print("Ignoring {}:{} (age: {})".format(repositoryName, tag, age))
break
else:
for tag in tags:
print("{}Deleting {}:{} (age: {})".format("(dry run) " if args.dry_run else "", repositoryName, tag, age))
digest_to_delete.append(image["imageDigest"])
if args.dry_run:
if args.debug:
print("Skipping actual deletion, moving on...")
else:
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
save_to_s3(args.filter_prefix, stable_window_tags)


@ -1,3 +0,0 @@
boto3
pytz
requests


@ -11,19 +11,11 @@ import sys
from collections import namedtuple
import cimodel.data.binary_build_definitions as binary_build_definitions
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.simple.android_definitions
import cimodel.data.simple.bazel_definitions
import cimodel.data.simple.binary_smoketest
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.ge_config_tests
import cimodel.data.simple.ios_definitions
import cimodel.data.simple.macos_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_android
import cimodel.data.simple.nightly_ios
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.data.windows_build_definitions as windows_build_definitions
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
@ -80,15 +72,15 @@ class Header(object):
for line in filter(None, lines):
output_filehandle.write(line + "\n")
def filter_master_only_jobs(items):
def _for_all_items(items, functor) -> None:
if isinstance(items, list):
for item in items:
_for_all_items(item, functor)
if isinstance(items, dict) and len(items) == 1:
item_type, item = next(iter(items.items()))
functor(item_type, item)
def _for_all_items(items, functor) -> None:
if isinstance(items, list):
for item in items:
_for_all_items(item, functor)
if isinstance(items, dict) and len(items) == 1:
item_type, item = next(iter(items.items()))
functor(item_type, item)
def filter_master_only_jobs(items):
def _is_master_item(item):
filters = item.get('filters', None)
branches = filters.get('branches', None) if filters is not None else None
@ -126,33 +118,45 @@ def filter_master_only_jobs(items):
_for_all_items(items, _save_requires_if_master)
return _do_filtering(items)
def generate_required_docker_images(items):
required_docker_images = set()
def _requires_docker_image(item_type, item):
requires = item.get('requires', None)
if not isinstance(requires, list):
return
for requirement in requires:
requirement = requirement.replace('"', '')
if requirement.startswith('docker-'):
required_docker_images.add(requirement)
_for_all_items(items, _requires_docker_image)
return required_docker_images
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs,
pytorch_build_definitions.get_workflow_jobs,
cimodel.data.simple.macos_definitions.get_workflow_jobs,
cimodel.data.simple.android_definitions.get_workflow_jobs,
cimodel.data.simple.ios_definitions.get_workflow_jobs,
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.ge_config_tests.get_workflow_jobs,
cimodel.data.simple.bazel_definitions.get_workflow_jobs,
cimodel.data.simple.binary_smoketest.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
cimodel.data.simple.nightly_android.get_workflow_jobs,
cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs,
windows_build_definitions.get_windows_workflows,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
build_jobs.extend(
cimodel.data.simple.docker_definitions.get_workflow_jobs(
# sort for consistency
sorted(generate_required_docker_images(build_jobs))
)
)
master_build_jobs = filter_master_only_jobs(build_jobs)
binary_build_functions = [
binary_build_definitions.get_binary_build_jobs,
binary_build_definitions.get_nightly_tests,
binary_build_definitions.get_nightly_uploads,
]
build_jobs = [f() for f in build_workflows_functions]
master_build_jobs = filter_master_only_jobs(build_jobs)
return {
"workflows": {
"binary_builds": {
@ -181,7 +185,6 @@ YAML_SOURCES = [
File("build-parameters/binary-build-params.yml"),
File("build-parameters/promote-build-params.yml"),
Header("Job specs"),
File("job-specs/pytorch-job-specs.yml"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/job-specs-promote.yml"),
@ -190,8 +193,6 @@ YAML_SOURCES = [
File("job-specs/docker_jobs.yml"),
Header("Workflows"),
Treegen(gen_build_workflows_tree, 0),
File("workflows/workflows-scheduled-ci.yml"),
File("workflows/workflows-ecr-gc.yml"),
File("workflows/workflows-promote.yml"),
]


@ -55,7 +55,7 @@ else
echo "Can't tell what to checkout"
exit 1
fi
retry git submodule update --init --recursive
retry git submodule update --init --recursive --jobs 0
echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
@ -63,7 +63,6 @@ popd
# Clone the Builder master repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
git checkout release/1.9
echo "Using builder from "
git --no-pager log --max-count 1
popd


@ -22,7 +22,7 @@ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive
git submodule update --init --recursive --jobs 0
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
@ -31,8 +31,12 @@ cat ${PROJ_ROOT}/scripts/build_ios.sh
echo "########################################################"
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL: ${USE_PYTORCH_METAL}"
echo "USE_COREML_DELEGATE: ${USE_COREML_DELEGATE}"
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
export USE_PYTORCH_METAL=${USE_PYTORCH_METAL}
export USE_COREML_DELEGATE=${USE_COREML_DELEGATE}
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
#store the binary


@ -8,16 +8,17 @@ cd ${PROJ_ROOT}/ios/TestApp
# install fastlane
sudo gem install bundler && bundle install
# install certificates
echo "${IOS_CERT_KEY}" >> cert.txt
echo "${IOS_CERT_KEY_2022}" >> cert.txt
base64 --decode cert.txt -o Certificates.p12
rm cert.txt
bundle exec fastlane install_cert
bundle exec fastlane install_root_cert
bundle exec fastlane install_dev_cert
# install the provisioning profile
PROFILE=PyTorch_CI_2021.mobileprovision
PROFILE=PyTorch_CI_2022.mobileprovision
PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
mkdir -pv "${PROVISIONING_PROFILES}"
cd "${PROVISIONING_PROFILES}"
echo "${IOS_SIGN_KEY}" >> cert.txt
echo "${IOS_SIGN_KEY_2022}" >> cert.txt
base64 --decode cert.txt -o ${PROFILE}
rm cert.txt
# run the ruby build script
@ -25,5 +26,5 @@ if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
exit 1
fi
PROFILE=PyTorch_CI_2021
PROFILE=PyTorch_CI_2022
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}


@ -23,15 +23,27 @@ do
fi
done
lipo -i ${ZIP_DIR}/install/lib/*.a
echo "BUILD_LITE_INTERPRETER: ${BUILD_LITE_INTERPRETER}"
# copy the umbrella header and license
cp ${PROJ_ROOT}/ios/LibTorch.h ${ZIP_DIR}/src/
if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then
cp ${PROJ_ROOT}/ios/LibTorch-Lite.h ${ZIP_DIR}/src/
else
cp ${PROJ_ROOT}/ios/LibTorch.h ${ZIP_DIR}/src/
fi
cp ${PROJ_ROOT}/LICENSE ${ZIP_DIR}/
# zip the library
ZIPFILE=libtorch_ios_nightly_build.zip
export DATE="$(date -u +%Y%m%d)"
export IOS_NIGHTLY_BUILD_VERSION="1.11.0.${DATE}"
if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then
# libtorch_lite_ios_nightly_1.11.0.20210810.zip
ZIPFILE="libtorch_lite_ios_nightly_${IOS_NIGHTLY_BUILD_VERSION}.zip"
else
ZIPFILE="libtorch_ios_nightly_build.zip"
fi
cd ${ZIP_DIR}
#for testing
touch version.txt
echo $(date +%s) > version.txt
echo "${IOS_NIGHTLY_BUILD_VERSION}" > version.txt
zip -r ${ZIPFILE} install src version.txt LICENSE
# upload to aws
# Install conda then 'conda install' awscli
@ -48,3 +60,16 @@ set +x
# echo "AWS KEY: ${AWS_ACCESS_KEY_ID}"
# echo "AWS SECRET: ${AWS_SECRET_ACCESS_KEY}"
aws s3 cp ${ZIPFILE} s3://ossci-ios-build/ --acl public-read
if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then
# create a new LibTorch-Lite-Nightly.podspec from the template
echo "cp ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec.template ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec"
cp ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec.template ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec
# update pod version
sed -i '' -e "s/IOS_NIGHTLY_BUILD_VERSION/${IOS_NIGHTLY_BUILD_VERSION}/g" ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec
cat ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec
# push the new LibTorch-Lite-Nightly.podspec to CocoaPods
pod trunk push --verbose --allow-warnings --use-libraries --skip-import-validation ${PROJ_ROOT}/ios/LibTorch-Lite-Nightly.podspec
fi


@ -4,10 +4,14 @@ echo "RUNNING ON $(uname -a) WITH $(nproc) CPUS AND $(free -m)"
set -eux -o pipefail
source /env
# Defaults here so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}
# Because most Circle executors only have 20 CPUs, using more causes OOMs w/ Ninja and nvcc parallelization
MEMORY_LIMIT_MAX_JOBS=18
NUM_CPUS=$(( $(nproc) - 2 ))
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
# Defaults here for **binary** linux builds so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( ${NUM_CPUS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${NUM_CPUS} ))}
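# The ternary above clamps build parallelism to min(nproc - 2, 18), while a
# caller-provided MAX_JOBS still wins via the ${MAX_JOBS:-...} default. A
# sketch of the arithmetic with illustrative CPU counts:
```
MEMORY_LIMIT_MAX_JOBS=18
for NUM_CPUS in 8 20 36; do
  echo "$NUM_CPUS cpus -> $(( NUM_CPUS > MEMORY_LIMIT_MAX_JOBS ? MEMORY_LIMIT_MAX_JOBS : NUM_CPUS )) jobs"
done
# 8 cpus -> 8 jobs, 20 cpus -> 18 jobs, 36 cpus -> 18 jobs
```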
if [[ "${DESIRED_CUDA}" =~ cu11[0-9] ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
@ -22,5 +26,9 @@ else
build_script='manywheel/build.sh'
fi
if [[ "$CIRCLE_BRANCH" == "master" ]] || [[ "$CIRCLE_BRANCH" == release/* ]]; then
export BUILD_DEBUG_INFO=1
fi
# Build the package
SKIP_ALL_TESTS=1 "/builder/$build_script"


@ -1,10 +1,24 @@
#!/bin/bash
source /home/circleci/project/env
cat >/home/circleci/project/ci_test_script.sh <<EOL
OUTPUT_SCRIPT=${OUTPUT_SCRIPT:-/home/circleci/project/ci_test_script.sh}
# only source if file exists
if [[ -f /home/circleci/project/env ]]; then
source /home/circleci/project/env
fi
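# In the heredoc below, unescaped variables (e.g. $DESIRED_PYTHON) expand when
# the script is generated, while escaped ones (\$@, \$python_nodot) survive
# into the generated file and expand only when it later runs inside the Docker
# container. A minimal sketch of that split, with made-up variable names:
```
NOW="host-value"
cat >/tmp/generated.sh <<EOL
echo "baked in: $NOW"
echo "deferred: \$LATER"
EOL
LATER="container-value" bash /tmp/generated.sh
# baked in: host-value
# deferred: container-value
```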
cat >"${OUTPUT_SCRIPT}" <<EOL
# =================== The following code will be executed inside Docker container ===================
set -eux -o pipefail
retry () {
"\$@" || (sleep 1 && "\$@") || (sleep 2 && "\$@")
}
# Source binary env file here if exists
if [[ -e "${BINARY_ENV_FILE:-/nofile}" ]]; then
source "${BINARY_ENV_FILE:-/nofile}"
fi
python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
# Set up Python
@ -23,14 +37,23 @@ fi
EXTRA_CONDA_FLAGS=""
NUMPY_PIN=""
if [[ "\$python_nodot" = *39* ]]; then
PROTOBUF_PACKAGE="defaults::protobuf"
if [[ "\$python_nodot" = *310* ]]; then
EXTRA_CONDA_FLAGS="-c=conda-forge"
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
# we set a lower boundary here just to be safe
NUMPY_PIN=">=1.21.2"
PROTOBUF_PACKAGE="protobuf>=3.19.0"
fi
if [[ "\$python_nodot" = *39* ]]; then
EXTRA_CONDA_FLAGS="-c=conda-forge"
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
# we set a lower boundary here just to be safe
NUMPY_PIN=">=1.20"
fi
if [[ "$DESIRED_CUDA" == "cu112" ]]; then
if [[ "$DESIRED_CUDA" == "cu112" || "$DESIRED_CUDA" == "cu115" ]]; then
EXTRA_CONDA_FLAGS="-c=conda-forge"
fi
@ -59,7 +82,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
ninja \
dataclasses \
typing-extensions \
defaults::protobuf \
${PROTOBUF_PACKAGE} \
six
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
retry conda install -c pytorch -y cpuonly
@ -92,4 +115,4 @@ EOL
echo
echo
echo "The script that will run in the next step is:"
cat /home/circleci/project/ci_test_script.sh
cat "${OUTPUT_SCRIPT}"


@ -14,6 +14,10 @@ chmod +x "$build_script"
# Build
cat >"$build_script" <<EOL
export PATH="$workdir/miniconda/bin:$PATH"
if [[ "$CIRCLE_BRANCH" == "nightly" ]]; then
export USE_PYTORCH_METAL_EXPORT=1
export USE_COREML_DELEGATE=1
fi
if [[ "$PACKAGE_TYPE" == conda ]]; then
"$workdir/builder/conda/build_pytorch.sh"
else


@ -19,39 +19,47 @@ tagged_version() {
fi
}
# We need to write an envfile to persist these variables to following
# steps, but the location of the envfile depends on the circleci executor
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
else
# docker executor (binary builds)
workdir="/"
fi
envfile="$workdir/env"
touch "$envfile"
chmod +x "$envfile"
# These are only relevant for CircleCI
# TODO: Remove these later once migrated fully to GHA
if [[ -z ${IS_GHA:-} ]]; then
# We need to write an envfile to persist these variables to following
# steps, but the location of the envfile depends on the circleci executor
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
else
# docker executor (binary builds)
workdir="/"
fi
envfile="$workdir/env"
touch "$envfile"
chmod +x "$envfile"
# Parse the BUILD_ENVIRONMENT to package type, python, and cuda
configs=($BUILD_ENVIRONMENT)
export PACKAGE_TYPE="${configs[0]}"
export DESIRED_PYTHON="${configs[1]}"
export DESIRED_CUDA="${configs[2]}"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export DESIRED_DEVTOOLSET=""
export LIBTORCH_CONFIG="${configs[3]:-}"
if [[ "$LIBTORCH_CONFIG" == 'debug' ]]; then
export DEBUG=1
# Parse the BUILD_ENVIRONMENT to package type, python, and cuda
configs=($BUILD_ENVIRONMENT)
export PACKAGE_TYPE="${configs[0]}"
export DESIRED_PYTHON="${configs[1]}"
export DESIRED_CUDA="${configs[2]}"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export DESIRED_DEVTOOLSET=""
export LIBTORCH_CONFIG="${configs[3]:-}"
if [[ "$LIBTORCH_CONFIG" == 'debug' ]]; then
export DEBUG=1
fi
else
export DESIRED_DEVTOOLSET="${configs[3]:-}"
fi
else
export DESIRED_DEVTOOLSET="${configs[3]:-}"
envfile=${BINARY_ENV_FILE:-/tmp/env}
workdir="/pytorch"
fi
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
export BUILD_PYTHONLESS=1
fi
@ -62,7 +70,7 @@ if [[ -z "$DOCKER_IMAGE" ]]; then
if [[ "$PACKAGE_TYPE" == conda ]]; then
export DOCKER_IMAGE="pytorch/conda-cuda"
elif [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux-cuda100"
export DOCKER_IMAGE="pytorch/manylinux-cpu"
else
export DOCKER_IMAGE="pytorch/manylinux-cuda${DESIRED_CUDA:2}"
fi
@ -85,7 +93,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="1.9.0.dev$DATE"
BASE_BUILD_VERSION="1.11.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@ -131,24 +139,24 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
fi
fi
cat >>"$envfile" <<EOL
cat >"$envfile" <<EOL
# =================== The following code will be executed inside Docker container ===================
export TZ=UTC
echo "Running on $(uname -a) at $(date)"
export PACKAGE_TYPE="$PACKAGE_TYPE"
export DESIRED_PYTHON="$DESIRED_PYTHON"
export DESIRED_PYTHON="${DESIRED_PYTHON:-}"
export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export DESIRED_DEVTOOLSET="$DESIRED_DEVTOOLSET"
export DESIRED_DEVTOOLSET="${DESIRED_DEVTOOLSET:-}"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
export DEBUG="${DEBUG:-}"
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.9.0.dev
export NIGHTLIES_DATE_PREAMBLE=1.11.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
@ -156,6 +164,7 @@ export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
# TODO: We don't need this anymore IIUC
export TORCH_PACKAGE_NAME='torch'
export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'
export ANACONDA_USER='pytorch'
export USE_FBGEMM=1
export JAVA_HOME=$JAVA_HOME
@ -163,23 +172,6 @@ export BUILD_JNI=$BUILD_JNI
export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"
export DOCKER_IMAGE="$DOCKER_IMAGE"
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
export USE_GOLD_LINKER="${USE_GOLD_LINKER}"
export USE_GLOO_WITH_OPENSSL="ON"
@ -187,6 +179,42 @@ export USE_WHOLE_CUDNN="${USE_WHOLE_CUDNN}"
# =================== The above code will be executed inside Docker container ===================
EOL
# nproc doesn't exist on darwin
if [[ "$(uname)" != Darwin ]]; then
# Because most Circle executors only have 20 CPUs, using more causes OOMs w/ Ninja and nvcc parallelization
MEMORY_LIMIT_MAX_JOBS=18
NUM_CPUS=$(( $(nproc) - 2 ))
# Defaults here for **binary** linux builds so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( ${NUM_CPUS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${NUM_CPUS} ))}
cat >>"$envfile" <<EOL
export MAX_JOBS="${MAX_JOBS}"
EOL
fi
if [[ -z "${IS_GHA:-}" ]]; then
cat >>"$envfile" <<EOL
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
EOL
fi
echo 'retry () {' >> "$envfile"
echo ' $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)' >> "$envfile"
echo '}' >> "$envfile"
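# Once the envfile is sourced, the three echo lines above yield a retry helper
# with exponential backoff: up to five attempts, sleeping 1/2/4/8 seconds
# between them. A sketch of the generated function with a hypothetical call:
```
retry () {
  $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Hypothetical usage; the unquoted $* re-splits arguments, so this variant is
# only safe for commands whose arguments contain no whitespace.
retry curl -fsSL https://example.invalid/flaky-endpoint >/dev/null
```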

View File

@ -63,6 +63,10 @@ s3_upload() {
)
}
# Install dependencies (should be a no-op if previously installed)
conda install -yq anaconda-client
pip install -q awscli
case "${PACKAGE_TYPE}" in
conda)
conda_upload


@ -8,15 +8,45 @@ export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache-windows
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
export VC_YEAR=2019
if [[ "$CUDA_VERSION" == "92" || "$CUDA_VERSION" == "100" ]]; then
export VC_YEAR=2017
else
export VC_YEAR=2019
if [[ "${DESIRED_CUDA}" == *"cu11"* ]]; then
export BUILD_SPLIT_CUDA=ON
fi
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
export BUILD_SPLIT_CUDA="ON"
echo "Free Space for CUDA DEBUG BUILD"
if [[ "$CIRCLECI" == 'true' ]]; then
if [[ -d "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft.NET" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft.NET"
fi
if [[ -d "C:\\Program Files\\dotnet" ]]; then
rm -rf "C:\\Program Files\\dotnet"
fi
if [[ -d "C:\\Program Files (x86)\\dotnet" ]]; then
rm -rf "C:\\Program Files (x86)\\dotnet"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft SQL Server" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft SQL Server"
fi
if [[ -d "C:\\Program Files (x86)\\Xamarin" ]]; then
rm -rf "C:\\Program Files (x86)\\Xamarin"
fi
if [[ -d "C:\\Program Files (x86)\\Google" ]]; then
rm -rf "C:\\Program Files (x86)\\Google"
fi
fi
set +x
@@ -32,7 +62,8 @@ if [[ "$CIRCLECI" == 'true' && -d "C:\\ProgramData\\Microsoft\\VisualStudio\\Pac
fi
if [[ "$CIRCLECI" == 'true' && -d "C:\\Microsoft" ]]; then
rm -rf "C:\\Microsoft\\Android*"
# don't use quotes here
rm -rf /c/Microsoft/AndroidNDK*
fi
echo "Free space on filesystem before build:"


@@ -4,13 +4,7 @@ set -eux -o pipefail
source "/c/w/env"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2017
if [[ "$CUDA_VERSION" == "92" || "$CUDA_VERSION" == "100" ]]; then
export VC_YEAR=2017
else
export VC_YEAR=2019
fi
export VC_YEAR=2019
pushd "$BUILDER_ROOT"


@@ -10,18 +10,27 @@ pt_checkout="/var/lib/jenkins/workspace"
# Since we're cat-ing this file, we need to escape all $'s
echo "cpp_doc_push_script.sh: Invoked with $*"
# Argument 1: Where to copy the built documentation for Python API to
# (pytorch.github.io/$install_path)
install_path="$1"
if [ -z "$install_path" ]; then
echo "error: cpp_doc_push_script.sh: install_path (arg1) not specified"
# for statements like ${1:-${DOCS_INSTALL_PATH:-docs/}}
# the order of operations goes:
# 1. Check if there's an argument $1
# 2. If no argument check for environment var DOCS_INSTALL_PATH
# 3. If no environment var fall back to default 'docs/'
# NOTE: It might seem weird to gather the second argument before gathering the first argument
# but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to
# try and gather it first, just so we don't potentially break people who rely on this script
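A quick demonstration of that fallback chain (values hypothetical):
DOCS_INSTALL_PATH=from_env
set -- from_arg
echo "${1:-${DOCS_INSTALL_PATH:-docs/}}"   # from_arg  -- the positional argument wins
set --
echo "${1:-${DOCS_INSTALL_PATH:-docs/}}"   # from_env  -- the environment variable is next
unset DOCS_INSTALL_PATH
echo "${1:-${DOCS_INSTALL_PATH:-docs/}}"   # docs/     -- the hard-coded default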
# Argument 2: What version of the Python API docs we are building.
version="${2:-${DOCS_VERSION:-master}}"
if [ -z "$version" ]; then
echo "error: cpp_doc_push_script.sh: version (arg2) not specified"
exit 1
fi
# Argument 2: What version of the Python API docs we are building.
version="$2"
if [ -z "$version" ]; then
echo "error: cpp_doc_push_script.sh: version (arg2) not specified"
# Argument 1: Where to copy the built documentation for Python API to
# (pytorch.github.io/$install_path)
install_path="${1:-${DOCS_INSTALL_PATH:-docs/${DOCS_VERSION}}}"
if [ -z "$install_path" ]; then
echo "error: cpp_doc_push_script.sh: install_path (arg1) not specified"
exit 1
fi
@@ -56,7 +65,6 @@ cp torch/_utils_internal.py tools/shared
# Generate PyTorch files
time python tools/setup_helpers/generate_code.py \
--declarations-path build/aten/src/ATen/Declarations.yaml \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--nn-path aten/src/
@@ -88,8 +96,12 @@ git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate C++ docs from pytorch/pytorch@$CIRCLE_SHA1" || true
git commit -m "Generate C++ docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -13,18 +13,27 @@ echo "python_doc_push_script.sh: Invoked with $*"
set -ex
# Argument 1: Where to copy the built documentation to
# (pytorch.github.io/$install_path)
install_path="$1"
if [ -z "$install_path" ]; then
echo "error: python_doc_push_script.sh: install_path (arg1) not specified"
# for statements like ${1:-${DOCS_INSTALL_PATH:-docs/}}
# the order of operations goes:
# 1. Check if there's an argument $1
# 2. If no argument check for environment var DOCS_INSTALL_PATH
# 3. If no environment var fall back to default 'docs/'
# NOTE: It might seem weird to gather the second argument before gathering the first argument
# but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to
# try and gather it first, just so we don't potentially break people who rely on this script
# Argument 2: What version of the docs we are building.
version="${2:-${DOCS_VERSION:-master}}"
if [ -z "$version" ]; then
echo "error: python_doc_push_script.sh: version (arg2) not specified"
exit 1
fi
# Argument 2: What version of the docs we are building.
version="$2"
if [ -z "$version" ]; then
echo "error: python_doc_push_script.sh: version (arg2) not specified"
# Argument 1: Where to copy the built documentation to
# (pytorch.github.io/$install_path)
install_path="${1:-${DOCS_INSTALL_PATH:-docs/${DOCS_VERSION}}}"
if [ -z "$install_path" ]; then
echo "error: python_doc_push_script.sh: install_path (arg1) not specified"
exit 1
fi
@@ -34,7 +43,7 @@ if [ "$version" == "master" ]; then
fi
# Argument 3: The branch to push to. Usually is "site"
branch="$3"
branch="${3:-${DOCS_BRANCH:-site}}"
if [ -z "$branch" ]; then
echo "error: python_doc_push_script.sh: branch (arg3) not specified"
exit 1
@@ -122,8 +131,12 @@ git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate Python docs from pytorch/pytorch@$CIRCLE_SHA1" || true
git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin "${branch}"
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -7,6 +7,9 @@ sudo rm -f /etc/apt/heroku.list
sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
# To increase the network reliability, let apt decide which mirror is best to use
sudo sed -i -e 's/http:\/\/.*archive/mirror:\/\/mirrors/' -e 's/\/ubuntu\//\/mirrors.txt/' /etc/apt/sources.list
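Concretely, that sed turns a stock EC2 archive entry into the mirror protocol; a sketch with an assumed input line:
echo "deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ xenial main" \
  | sed -e 's/http:\/\/.*archive/mirror:\/\/mirrors/' -e 's/\/ubuntu\//\/mirrors.txt/'
# prints: deb mirror://mirrors.ubuntu.com/mirrors.txt xenial main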
retry () {
$* || $* || $* || $* || $*
}
@@ -29,7 +32,7 @@ if ! command -v aws >/dev/null; then
fi
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-460.39.run"
DRIVER_FN="NVIDIA-Linux-x86_64-495.44.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
@@ -40,9 +43,9 @@ if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update -qq
retry sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
sudo apt-get install -y nvidia-container-toolkit
retry sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
@@ -64,6 +67,7 @@ add_to_env_file() {
}
add_to_env_file IN_CI 1
add_to_env_file CI_MASTER "${CI_MASTER:-}"
add_to_env_file COMMIT_SOURCE "${CIRCLE_BRANCH:-}"
add_to_env_file BUILD_ENVIRONMENT "${BUILD_ENVIRONMENT}"
add_to_env_file CIRCLE_PULL_REQUEST "${CIRCLE_PULL_REQUEST}"


@@ -1,149 +0,0 @@
import glob
import json
import logging
import os
import os.path
import pathlib
import re
import sys
import time
import zipfile
import requests
def get_size(file_dir):
try:
# we should only expect one file, if no, something is wrong
file_name = glob.glob(os.path.join(file_dir, "*"))[0]
return os.stat(file_name).st_size
except Exception:
logging.exception(f"error getting file from: {file_dir}")
return 0
def build_message(size):
pkg_type, py_ver, cu_ver, *_ = os.environ.get("BUILD_ENVIRONMENT", "").split() + [
None,
None,
None,
]
os_name = os.uname()[0].lower()
if os_name == "darwin":
os_name = "macos"
return {
"normal": {
"os": os_name,
"pkg_type": pkg_type,
"py_ver": py_ver,
"cu_ver": cu_ver,
"pr": os.environ.get("CIRCLE_PR_NUMBER"),
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
"workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID"),
},
"int": {
"time": int(time.time()),
"size": size,
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
},
}
def send_message(messages):
access_token = os.environ.get("SCRIBE_GRAPHQL_ACCESS_TOKEN")
if not access_token:
raise ValueError("Can't find access token from environment variable")
url = "https://graph.facebook.com/scribe_logs"
r = requests.post(
url,
data={
"access_token": access_token,
"logs": json.dumps(
[
{
"category": "perfpipe_pytorch_binary_size",
"message": json.dumps(message),
"line_escape": False,
}
for message in messages
]
),
},
)
print(r.text)
r.raise_for_status()
def report_android_sizes(file_dir):
def gen_sizes():
# we should only expect one file, if no, something is wrong
aar_files = list(pathlib.Path(file_dir).rglob("pytorch_android-*.aar"))
if len(aar_files) != 1:
logging.exception(f"error getting aar files from: {file_dir} / {aar_files}")
return
aar_file = aar_files[0]
zf = zipfile.ZipFile(aar_file)
for info in zf.infolist():
# Scan ".so" libs in `jni` folder. Examples:
# jni/arm64-v8a/libfbjni.so
# jni/arm64-v8a/libpytorch_jni.so
m = re.match(r"^jni/([^/]+)/(.*\.so)$", info.filename)
if not m:
continue
arch, lib = m.groups()
# report per architecture library size
yield [arch, lib, info.compress_size, info.file_size]
# report whole package size
yield ["aar", aar_file.name, os.stat(aar_file).st_size, 0]
def gen_messages():
android_build_type = os.environ.get("ANDROID_BUILD_TYPE")
for arch, lib, comp_size, uncomp_size in gen_sizes():
print(android_build_type, arch, lib, comp_size, uncomp_size)
yield {
"normal": {
"os": "android",
# TODO: create dedicated columns
"pkg_type": "{}/{}/{}".format(android_build_type, arch, lib),
"cu_ver": "", # dummy value for derived field `build_name`
"py_ver": "", # dummy value for derived field `build_name`
"pr": os.environ.get("CIRCLE_PR_NUMBER"),
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
"workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID"),
},
"int": {
"time": int(time.time()),
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
"size": comp_size,
"raw_size": uncomp_size,
},
}
send_message(list(gen_messages()))
if __name__ == "__main__":
file_dir = os.environ.get(
"PYTORCH_FINAL_PACKAGE_DIR", "/home/circleci/project/final_pkgs"
)
if len(sys.argv) == 2:
file_dir = sys.argv[1]
print("checking dir: " + file_dir)
if "-android" in os.environ.get("BUILD_ENVIRONMENT", ""):
report_android_sizes(file_dir)
else:
size = get_size(file_dir)
if size != 0:
try:
send_message([build_message(size)])
except Exception:
logging.exception("can't send message")


@@ -1,8 +1,8 @@
# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
# Where to find the links: https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
# 16.8.5 BuildTools
$VS_DOWNLOAD_LINK = "https://download.visualstudio.microsoft.com/download/pr/20130c62-1bc8-43d6-b4f0-c20bb7c79113/145a319d79a83376915d8f855605e152ef5f6fa2b2f1d2dca411fb03722eea72/vs_BuildTools.exe"
# BuildTools from S3
$VS_DOWNLOAD_LINK = "https://s3.amazonaws.com/ossci-windows/vs${env:VS_VERSION}_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
@@ -14,32 +14,45 @@ $VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStud
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version 16.8.5 installer failed"
exit 1
if (${env:INSTALL_WINDOWS_SDK} -eq "1") {
$VS_INSTALL_ARGS += "--add Microsoft.VisualStudio.Component.Windows10SDK.19041"
}
if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath
if ($existingPath -ne $null) {
echo "Found existing BuildTools installation in $existingPath"
$VS_UNINSTALL_ARGS = @("uninstall", "--installPath", "`"$existingPath`"", "--quiet","--wait")
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_UNINSTALL_ARGS -NoNewWindow -Wait -PassThru
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "Original BuildTools uninstall failed with code $exitCode"
exit 1
}
echo "Original BuildTools uninstalled"
$VS_VERSION_major = [int] ${env:VS_VERSION}.split(".")[0]
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[${env:VS_VERSION}, ${env:VS_VERSION_major + 1})" -property installationPath
if (($existingPath -ne $null) -and (!${env:CIRCLECI})) {
echo "Found correctly versioned existing BuildTools installation in $existingPath"
exit 0
}
$pathToRemove = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -property installationPath
}
echo "Downloading VS installer from S3."
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version ${env:VS_VERSION} installer failed"
exit 1
}
if ($pathToRemove -ne $null) {
echo "Uninstalling $pathToRemove."
$VS_UNINSTALL_ARGS = @("uninstall", "--installPath", "`"$pathToRemove`"", "--quiet","--wait")
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_UNINSTALL_ARGS -NoNewWindow -Wait -PassThru
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "Original BuildTools uninstall failed with code $exitCode"
exit 1
}
echo "Other versioned BuildTools uninstalled."
}
echo "Installing Visual Studio version ${env:VS_VERSION}."
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2017 installer exited with code $exitCode, which should be one of [0, 3010]."
echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
@@ -47,6 +60,6 @@ if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\circleci\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
Copy-Item -Path "${env:TEMP}\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}


@@ -1,70 +1,78 @@
#!/bin/bash
set -eux -o pipefail
cuda_major_version=${CUDA_VERSION%.*}
if [[ "$cuda_major_version" == "10" ]]; then
cuda_installer_name="cuda_10.1.243_426.00_win10"
msbuild_project_dir="CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1"
elif [[ "$cuda_major_version" == "11" ]]; then
if [[ "${CUDA_VERSION}" == "11.1" ]]; then
cuda_installer_name="cuda_11.1.0_456.43_win10"
msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
case ${CUDA_VERSION} in
10.1)
cuda_installer_name="cuda_10.1.243_426.00_win10"
cuda_install_packages="nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1"
;;
10.2)
cuda_installer_name="cuda_10.2.89_441.22_win10"
cuda_install_packages="nvcc_10.2 cuobjdump_10.2 nvprune_10.2 cupti_10.2 cublas_10.2 cublas_dev_10.2 cudart_10.2 cufft_10.2 cufft_dev_10.2 curand_10.2 curand_dev_10.2 cusolver_10.2 cusolver_dev_10.2 cusparse_10.2 cusparse_dev_10.2 nvgraph_10.2 nvgraph_dev_10.2 npp_10.2 npp_dev_10.2 nvrtc_10.2 nvrtc_dev_10.2 nvml_dev_10.2"
;;
11.1)
cuda_installer_name="cuda_11.1.1_456.81_win10"
cuda_install_packages="nvcc_11.1 cuobjdump_11.1 nvprune_11.1 nvprof_11.1 cupti_11.1 cublas_11.1 cublas_dev_11.1 cudart_11.1 cufft_11.1 cufft_dev_11.1 curand_11.1 curand_dev_11.1 cusolver_11.1 cusolver_dev_11.1 cusparse_11.1 cusparse_dev_11.1 npp_11.1 npp_dev_11.1 nvrtc_11.1 nvrtc_dev_11.1 nvml_dev_11.1"
elif [[ "${CUDA_VERSION}" == "11.3" ]]; then
;;
11.3)
cuda_installer_name="cuda_11.3.0_465.89_win10"
msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="thrust_11.3 nvcc_11.3 cuobjdump_11.3 nvprune_11.3 nvprof_11.3 cupti_11.3 cublas_11.3 cublas_dev_11.3 cudart_11.3 cufft_11.3 cufft_dev_11.3 curand_11.3 curand_dev_11.3 cusolver_11.3 cusolver_dev_11.3 cusparse_11.3 cusparse_dev_11.3 npp_11.3 npp_dev_11.3 nvrtc_11.3 nvrtc_dev_11.3 nvml_dev_11.3"
else
echo "This should not happen! ABORT."
;;
11.5)
cuda_installer_name="cuda_11.5.0_496.13_win10"
cuda_install_packages="thrust_11.5 nvcc_11.5 cuobjdump_11.5 nvprune_11.5 nvprof_11.5 cupti_11.5 cublas_11.5 cublas_dev_11.5 cudart_11.5 cufft_11.5 cufft_dev_11.5 curand_11.5 curand_dev_11.5 cusolver_11.5 cusolver_dev_11.5 cusparse_11.5 cusparse_dev_11.5 npp_11.5 npp_dev_11.5 nvrtc_11.5 nvrtc_dev_11.5 nvml_dev_11.5"
;;
*)
echo "CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
fi
;;
esac
if [[ -f "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/bin/nvcc.exe" ]]; then
echo "Existing CUDA v${CUDA_VERSION} installation found, skipping install"
else
echo "CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
tmp_dir=$(mktemp -d)
(
# no need to popd after, the subshell shouldn't affect the parent shell
pushd "${tmp_dir}"
cuda_installer_link="https://ossci-windows.s3.amazonaws.com/${cuda_installer_name}.exe"
curl --retry 3 -kLO $cuda_installer_link
7z x ${cuda_installer_name}.exe -o${cuda_installer_name}
pushd ${cuda_installer_name}
mkdir cuda_install_logs
set +e
# This breaks for some reason if you quote cuda_install_packages
# shellcheck disable=SC2086
./setup.exe -s ${cuda_install_packages} -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
set -e
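The quoting caveat above is ordinary word splitting: left unquoted, the package list expands to one argument per package, while quoting would hand setup.exe the whole list as a single argument. A minimal sketch:
pkgs="nvcc_11.3 cupti_11.3"
printf '<%s>\n' $pkgs     # two arguments: <nvcc_11.3> then <cupti_11.3>
printf '<%s>\n' "$pkgs"   # one argument:  <nvcc_11.3 cupti_11.3>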
if [[ ! -f "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/bin/nvcc.exe" ]]; then
echo "CUDA installation failed"
mkdir -p /c/w/build-results
7z a "c:\\w\\build-results\\cuda_install_logs.7z" cuda_install_logs
exit 1
fi
)
rm -rf "${tmp_dir}"
fi
if [[ "$cuda_major_version" == "11" && "${JOB_EXECUTOR}" == "windows-with-nvidia-gpu" ]]; then
cuda_install_packages="${cuda_install_packages} Display.Driver"
fi
cuda_installer_link="https://ossci-windows.s3.amazonaws.com/${cuda_installer_name}.exe"
curl --retry 3 -kLO $cuda_installer_link
7z x ${cuda_installer_name}.exe -o${cuda_installer_name}
cd ${cuda_installer_name}
mkdir cuda_install_logs
set +e
./setup.exe -s ${cuda_install_packages} -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
set -e
if [[ "${VC_YEAR}" == "2017" ]]; then
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
if [[ -f "/c/Program Files/NVIDIA Corporation/NvToolsExt/bin/x64/nvToolsExt64_1.dll" ]]; then
echo "Existing nvtools installation found, skipping install"
else
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
# create tmp dir for download
tmp_dir=$(mktemp -d)
(
# no need to popd after, the subshell shouldn't affect the parent shell
pushd "${tmp_dir}"
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
)
rm -rf "${tmp_dir}"
fi
if ! ls "/c/Program Files/NVIDIA Corporation/NvToolsExt/bin/x64/nvToolsExt64_1.dll"
then
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
fi
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/bin/nvcc.exe"
then
echo "CUDA installation failed"
mkdir -p /c/w/build-results
7z a "c:\\w\\build-results\\cuda_install_logs.7z" cuda_install_logs
exit 1
fi
cd ..
rm -rf ./${cuda_installer_name}
rm -f ./${cuda_installer_name}.exe


@@ -1,28 +1,49 @@
#!/bin/bash
set -eux -o pipefail
cuda_major_version=${CUDA_VERSION%.*}
# This is typically blank but for CUDA 10* it'll be set to 10
windows_version_qualifier=""
if [[ "$cuda_major_version" == "10" ]]; then
cudnn_installer_name="cudnn-${CUDA_VERSION}-windows10-x64-v7.6.4.38"
elif [[ "$cuda_major_version" == "11" ]]; then
if [[ "${CUDA_VERSION}" == "11.1" ]]; then
cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.0.5.39"
elif [[ "${CUDA_VERSION}" == "11.3" ]]; then
cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.2.0.53"
else
echo "This should not happen! ABORT."
case ${CUDA_VERSION} in
10.1)
archive_version="v7.6.4.38"
windows_version_qualifier="10"
;;
10.2)
archive_version="v7.6.5.32"
windows_version_qualifier="10"
;;
11.1)
archive_version="v8.0.5.39"
;;
11.3)
archive_version="v8.2.0.53"
;;
11.5)
archive_version="v8.2.0.53"
;;
*)
echo "CUDA_VERSION: ${CUDA_VERSION} not supported yet"
exit 1
fi
;;
esac
cudnn_installer_name="cudnn_installer.zip"
cudnn_installer_link="https://ossci-windows.s3.amazonaws.com/cudnn-${CUDA_VERSION}-windows${windows_version_qualifier}-x64-${archive_version}.zip"
cudnn_install_folder="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/"
if [[ -f "${cudnn_install_folder}/include/cudnn.h" ]]; then
echo "Existing cudnn installation found, skipping install..."
else
echo "CUDNN for CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
tmp_dir=$(mktemp -d)
(
pushd "${tmp_dir}"
curl --retry 3 -o "${cudnn_installer_name}" "$cudnn_installer_link"
7z x "${cudnn_installer_name}" -ocudnn
# Use '${var:?}/*' to avoid potentially expanding to '/*'
# Remove all of the directories before attempting to copy files
rm -rf "${cudnn_install_folder:?}/*"
cp -rf cudnn/cuda/* "${cudnn_install_folder}"
)
rm -rf "${tmp_dir}"
fi
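The ${var:?} expansion used in that rm makes the shell abort when the variable is unset or empty, instead of letting the path collapse toward '/'; a small sketch (variable name illustrative):
folder=""
rm -rf "${folder:?}/stale"   # aborts with 'folder: parameter null or not set'; rm never runs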
cudnn_installer_link="https://ossci-windows.s3.amazonaws.com/${cudnn_installer_name}.zip"
curl --retry 3 -O $cudnn_installer_link
7z x ${cudnn_installer_name}.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/"
rm -rf cudnn
rm -f ${cudnn_installer_name}.zip


@@ -15,31 +15,17 @@ pytorch_params: &pytorch_params
build_only:
type: string
default: ""
ci_master:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
BUILD_ONLY: << parameters.build_only >>
CI_MASTER: << pipeline.parameters.run_master_build >>
resource_class: << parameters.resource_class >>
pytorch_android_params: &pytorch_android_params
parameters:
build_environment:
type: string
default: ""
op_list:
type: string
default: ""
lite_interpreter:
type: string
default: "1"
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
SELECTED_OP_LIST: << parameters.op_list >>
BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
@@ -60,6 +46,9 @@ pytorch_ios_params: &pytorch_ios_params
lite_interpreter:
type: string
default: "1"
use_coreml:
type: string
default: "0"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
@@ -67,6 +56,7 @@
SELECTED_OP_LIST: << parameters.op_list >>
USE_PYTORCH_METAL: << parameters.use_metal >>
BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
USE_COREML_DELEGATE: << parameters.use_coreml >>
pytorch_windows_params: &pytorch_windows_params
parameters:
@@ -84,7 +74,10 @@ pytorch_windows_params: &pytorch_windows_params
default: "10.1"
python_version:
type: string
default: "3.6"
default: "3.8"
vs_version:
type: string
default: "16.8.6"
vc_version:
type: string
default: "14.16"
@@ -102,6 +95,7 @@
SCCACHE_BUCKET: "ossci-compiler-cache"
CUDA_VERSION: <<parameters.cuda_version>>
PYTHON_VERSION: <<parameters.python_version>>
VS_VERSION: <<parameters.vs_version>>
VC_VERSION: <<parameters.vc_version>>
VC_YEAR: <<parameters.vc_year>>
VC_PRODUCT: <<parameters.vc_product>>


@@ -171,4 +171,4 @@ commands:
cd ~/project
export ANDROID_BUILD_TYPE="<< parameters.build_type >>"
export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
python3 .circleci/scripts/upload_binary_size_to_scuba.py android
python3 -m tools.stats.upload_binary_size_to_scuba android


@@ -17,6 +17,9 @@ parameters:
run_master_build:
type: boolean
default: false
run_slow_gradcheck_build:
type: boolean
default: false
executors:
windows-with-nvidia-gpu:


@@ -1,3 +1,4 @@
jobs:
binary_linux_build:
<<: *binary_linux_build_params
steps:
@@ -22,14 +23,14 @@
command: |
ls -lah /final_pkgs
- run:
name: save binary size
name: upload build & binary data
no_output_timeout: "5m"
command: |
source /env
cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
python3 -mpip install requests && \
SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \
python3 /pytorch/.circleci/scripts/upload_binary_size_to_scuba.py || exit 0
python3 -m tools.stats.upload_binary_size_to_scuba || exit 0
- persist_to_workspace:
root: /
paths: final_pkgs
@@ -239,7 +240,7 @@
binary_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.0"
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
@@ -266,7 +267,7 @@
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "12.0"
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace


@@ -54,61 +54,3 @@
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .circleci/docker && ./build_docker.sh
docker_for_ecr_gc_build_job:
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- run:
name: build_docker_image_for_ecr_gc
no_output_timeout: "1h"
command: |
cd .circleci/ecr_gc_docker
docker build . -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
--password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/gc/ecr
ecr_gc_job:
parameters:
project:
type: string
default: "pytorch"
tags_to_keep: # comma separate values
type: string
environment:
PROJECT: << parameters.project >>
# TODO: Remove legacy image tags once we feel comfortable with new docker image tags
IMAGE_TAG: << parameters.tags_to_keep >>
docker:
- image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
aws_auth:
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
steps:
- checkout
- run:
# NOTE: see 'docker_build_job' for how these tags actually get built
name: dynamically generate tags to keep
no_output_timeout: "1h"
command: |
GENERATED_IMAGE_TAG=$(\
git log --oneline --pretty='%H' .circleci/docker \
| xargs -I '{}' git rev-parse '{}:.circleci/docker' \
| paste -sd "," -)
echo "export GENERATED_IMAGE_TAG='${GENERATED_IMAGE_TAG}'" >> ${BASH_ENV}
- run:
name: garbage collecting for ecr images
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
/usr/bin/gc.py --filter-prefix ${PROJECT} --ignore-tags "${IMAGE_TAG},${GENERATED_IMAGE_TAG}"


@@ -27,7 +27,7 @@
pytorch_python_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-python-doc-push
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4"
resource_class: large
machine:
image: ubuntu-2004:202104-01
@@ -41,9 +41,10 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
tag=${CIRCLE_TAG:1:5}
# turn v1.12.0rc3 into 1.12.0
tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9.]*\).*/\1/')
target=${tag:-master}
echo "building for ${target}"
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
@@ -72,7 +73,7 @@
pytorch_cpp_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-cpp-doc-push
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4"
resource_class: large
machine:
image: ubuntu-2004:202104-01
@@ -86,8 +87,10 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
# turn v1.12.0rc3 into 1.12.0
tag=$(echo $CIRCLE_TAG | sed -e 's/v*\([0-9.]*\).*/\1/')
tag=${CIRCLE_TAG:1:5}
target=${tag:-master}
echo "building for ${target}"
@@ -126,6 +129,7 @@
set -e
export IN_CI=1
export CROSS_COMPILE_ARM64=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
@@ -162,6 +166,7 @@
command: |
set -e
export IN_CI=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
@@ -198,6 +203,7 @@
command: |
set -e
export IN_CI=1
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x .jenkins/pytorch/macos-test.sh
unbuffer .jenkins/pytorch/macos-test.sh 2>&1 | ts
@@ -207,13 +213,15 @@
command: |
set -ex
source /Users/distiller/workspace/miniconda3/bin/activate
pip install boto3
export PYTHONPATH="$PWD"
python3 -m pip install boto3==1.19.12
export IN_CI=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Using the same IAM user to write stats to our OSS bucket
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
python -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test
when: always
- store_test_results:
path: test/test-reports
@@ -235,6 +243,7 @@
set -e
export IN_CI=1
export BUILD_LITE_INTERPRETER=1
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x ${HOME}/project/.jenkins/pytorch/macos-lite-interpreter-build-test.sh
unbuffer ${HOME}/project/.jenkins/pytorch/macos-lite-interpreter-build-test.sh 2>&1 | ts
- store_test_results:
@@ -244,7 +253,7 @@
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
@@ -258,7 +267,7 @@
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
@@ -333,7 +342,7 @@
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
@@ -347,7 +356,7 @@
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32_gradle=${docker_image_commit}-android-x86_32-gradle
@@ -369,7 +378,7 @@
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
@@ -384,7 +393,7 @@
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
@@ -407,47 +416,10 @@
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_gradle_custom_build_single:
<<: *pytorch_android_params
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- calculate_docker_image_tag
- setup_ci_environment
- run:
name: pytorch android gradle custom build single architecture (for PR)
no_output_timeout: "1h"
command: |
set -e
# Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and share via docker, because:
# 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build;
# 2) Not parallelizable by architecture: it only builds libtorch for one architecture;
echo "DOCKER_IMAGE: ${DOCKER_IMAGE}:${DOCKER_TAG}"
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
git submodule sync && git submodule update -q --init --recursive --depth 1
VOLUME_MOUNTS="-v /home/circleci/project/:/var/lib/jenkins/workspace"
export id=$(docker run --env-file "${BASH_ENV}" ${VOLUME_MOUNTS} --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
export COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Skip docker push as this job is purely for size analysis purpose.
# Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied.
- upload_binary_size_for_android_build:
build_type: custom-build-single
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.0"
xcode: "12.5.1"
steps:
- checkout
- run_brew_for_ios_build
@@ -461,16 +433,17 @@
# install fastlane
sudo gem install bundler && bundle install
# install certificates
echo ${IOS_CERT_KEY} >> cert.txt
echo ${IOS_CERT_KEY_2022} >> cert.txt
base64 --decode cert.txt -o Certificates.p12
rm cert.txt
bundle exec fastlane install_cert
bundle exec fastlane install_root_cert
bundle exec fastlane install_dev_cert
# install the provisioning profile
PROFILE=PyTorch_CI_2021.mobileprovision
PROFILE=PyTorch_CI_2022.mobileprovision
PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
mkdir -pv "${PROVISIONING_PROFILES}"
cd "${PROVISIONING_PROFILES}"
echo ${IOS_SIGN_KEY} >> cert.txt
echo ${IOS_SIGN_KEY_2022} >> cert.txt
base64 --decode cert.txt -o ${PROFILE}
rm cert.txt
- run:
@@ -500,7 +473,7 @@
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive --depth 1
git submodule update --init --recursive --depth 1 --jobs 0
# export
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
@@ -511,6 +484,7 @@
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL": "${USE_METAL}"
echo "BUILD_LITE_INTERPRETER": "${BUILD_LITE_INTERPRETER}"
echo "USE_COREML_DELEGATE": "${USE_COREML_DELEGATE}"
# check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
@@ -519,6 +493,7 @@
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
export USE_COREML_DELEGATE=${USE_COREML_DELEGATE}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
export USE_PYTORCH_METAL=${USE_METAL}
fi
@@ -528,12 +503,8 @@
no_output_timeout: "30m"
command: |
set -e
if [ ${BUILD_LITE_INTERPRETER} == 0 ]; then
echo "Run Build Test is not for full jit, skipping."
exit 0
fi
PROJ_ROOT=/Users/distiller/project
PROFILE=PyTorch_CI_2021
PROFILE=PyTorch_CI_2022
# run the ruby build script
if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
@@ -557,21 +528,40 @@
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
echo "not SIMULATOR build, skip it."
exit 0
elif [ ${BUILD_LITE_INTERPRETER} == 0 ]; then
echo "Run Simulator Tests is not for full jit, skipping."
exit 0
fi
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
source ~/anaconda/bin/activate
pip install torch torchvision --progress-bar off
#run unit test
# use the pytorch nightly build to generate models
pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
# generate models for different backends
cd ${PROJ_ROOT}/ios/TestApp/benchmark
python trace_model.py
ruby setup.rb
mkdir -p ../models
if [ ${USE_COREML_DELEGATE} == 1 ]; then
pip install coremltools==5.0b5
pip install six
python coreml_backend.py
else
python trace_model.py
fi
if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then
echo "Setting up the TestApp for LiteInterpreter"
ruby setup.rb --lite 1
else
echo "Setting up the TestApp for Full JIT"
ruby setup.rb
fi
cd ${PROJ_ROOT}/ios/TestApp
instruments -s -devices
fastlane scan
if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then
if [ ${USE_COREML_DELEGATE} == 1 ]; then
fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML
else
fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter
fi
else
fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT
fi
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
@@ -593,7 +583,7 @@
echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
git submodule sync && git submodule update -q --init --recursive --depth 1
git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
@@ -604,7 +594,7 @@
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
@@ -624,7 +614,7 @@
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
@@ -670,7 +660,7 @@
pytorch_doc_test:
environment:
BUILD_ENVIRONMENT: pytorch-doc-test
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4"
resource_class: medium
machine:
image: ubuntu-2004:202104-01
@@ -684,7 +674,7 @@
no_output_timeout: "30m"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})


@@ -1,378 +0,0 @@
jobs:
pytorch_linux_build:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- optional_merge_target_branch
- setup_ci_environment
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then
echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}"
fi
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
git submodule sync && git submodule update -q --init --recursive --depth 1
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "sudo chown -R jenkins workspace && export CIRCLE_JOB="$CIRCLE_JOB" && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Copy dist folder back
docker cp $id:/var/lib/jenkins/workspace/dist /home/circleci/project/. || echo "Dist folder not found"
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Note [Special build images]
# The xla build uses the same docker image as
# pytorch_linux_bionic_py3_6_clang9_build. In the push step, we have to
# distinguish between them so the test can pick up the correct image.
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_64"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_64
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v7a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v7a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v8a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v8a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
- store_artifacts:
path: /home/circleci/project/dist
pytorch_linux_test:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Download Docker image
no_output_timeout: "90m"
command: |
set -e
export PYTHONUNBUFFERED=1
if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then
export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6"
fi
# See Note [Special build images]
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
# TODO: Make this less painful
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all --shm-size=2g -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
elif [[ ${BUILD_ENVIRONMENT} == *"rocm"* ]]; then
hostname
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=8g --ipc=host --device /dev/kfd --device /dev/dri --group-add video -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=1g --ipc=host -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
echo "id=${id}" >> "${BASH_ENV}"
- run:
name: Check for no AVX instruction by default
no_output_timeout: "20m"
command: |
set -e
is_vanilla_build() {
if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-bionic-py3.6-clang9-test" ]; then
return 0
fi
if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-xenial-py3.6-gcc5.4-test" ]; then
return 0
fi
return 1
}
if is_vanilla_build; then
echo "apt-get update && apt-get install -y qemu-user gdb" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU & gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' -ex 'continue' -ex 'bt' -ex='set confirm off' -ex 'quit \$_isvoid(\$_exitcode)'" | docker exec -u jenkins -i "$id" bash
else
echo "Skipping for ${BUILD_ENVIRONMENT}"
fi
- run:
name: Run tests
no_output_timeout: "90m"
command: |
set -e
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
export CIRCLE_JOB="$CIRCLE_JOB"
${PARALLEL_FLAGS}
cd workspace
EOL
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
echo ".jenkins/pytorch/multigpu-test.sh" >> docker_commands.sh
elif [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
echo "pip install click mock tabulate networkx==2.0" >> docker_commands.sh
echo "pip -q install --user \"file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx\"" >> docker_commands.sh
echo ".jenkins/caffe2/test.sh" >> docker_commands.sh
else
echo ".jenkins/pytorch/test.sh" >> docker_commands.sh
fi
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
if [[ ${BUILD_ENVIRONMENT} == *"coverage"* ]]; then
echo "Retrieving C++ coverage report"
docker cp $id:/var/lib/jenkins/workspace/build/coverage.info ./test
fi
if [[ ${BUILD_ENVIRONMENT} == *"coverage"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then
echo "Retrieving Python coverage report"
docker cp $id:/var/lib/jenkins/workspace/test/.coverage ./test
docker cp $id:/var/lib/jenkins/workspace/test/coverage.xml ./test
python3 -mpip install codecov
python3 -mcodecov
fi
- run:
name: Report results
no_output_timeout: "5m"
command: |
set -e
# Retrieving test results should be done as very first step as command never fails
# But is always executed if previous step fails for some reason
echo "Retrieving test reports"
docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
docker stats --all --no-stream
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_JOB="$CIRCLE_JOB"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
cd workspace
export PYTHONPATH="\${PWD}"
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
EOL
echo "(cat docker_commands.sh | docker exec -u jenkins -e LANG=C.UTF-8 -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
when: always
- store_test_results:
path: test-reports
- store_artifacts:
path: test/.coverage
- store_artifacts:
path: test/coverage.xml
pytorch_windows_build:
<<: *pytorch_windows_params
parameters:
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10.1"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.16"
vc_year:
type: string
default: "2019"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
executor: <<parameters.executor>>
steps:
- checkout
- run:
name: Install VS2019 toolchain
no_output_timeout: 10m
command: |
powershell .circleci/scripts/vs_install.ps1
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${USE_CUDA}" == "1" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
- run:
name: Install Cudnn
command : |
if [[ "${USE_CUDA}" == "1" ]]; then
.circleci/scripts/windows_cudnn_install.sh
fi
- run:
name: Build
no_output_timeout: "90m"
command: |
set -e
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
set -x
.jenkins/pytorch/win-build.sh
- persist_to_workspace:
root: "C:/w"
paths: build-results
- store_artifacts:
path: C:/w/build-results
pytorch_windows_test:
<<: *pytorch_windows_params
parameters:
executor:
type: string
default: "windows-medium-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10.1"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.16"
vc_year:
type: string
default: "2019"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
executor: <<parameters.executor>>
steps:
- checkout
- attach_workspace:
at: c:/users/circleci/workspace
- run:
name: Install VS2019 toolchain
no_output_timeout: 10m
command: |
powershell .circleci/scripts/vs_install.ps1
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
if [[ "${CUDA_VERSION}" != "10" || "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
fi
- run:
name: Install Cudnn
command : |
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
.circleci/scripts/windows_cudnn_install.sh
fi
- run:
name: Test
no_output_timeout: "30m"
command: |
set -e
export IN_CI=1
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
set -x
.jenkins/pytorch/win-test.sh
- run:
name: Report results
no_output_timeout: "5m"
command: |
set -ex
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
export PYTHONPATH="$PWD"
pip install typing_extensions boto3
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
when: always
- store_test_results:
path: test/test-reports
- store_artifacts:
path: test/coverage.xml


@@ -26,6 +26,7 @@
# (smoke tests and upload jobs do not need the pytorch repo).
binary_checkout: &binary_checkout
name: Checkout pytorch/builder repo
no_output_timeout: "30m"
command: .circleci/scripts/binary_checkout.sh
# Parses circleci arguments in a consistent way, essentially routing to the


@@ -1,34 +0,0 @@
ecr_gc:
triggers:
- schedule:
cron: "45 * * * *"
filters:
branches:
only:
- master
jobs:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
name: ecr_gc_job_for_pytorch
project: pytorch
tags_to_keep: "271,262,256,278,282,291,300,323,327,347,389,401,402,403,405,a8006f9a-272d-4478-b137-d121c6f05c83,6e7b11da-a919-49e5-b2ba-da66e3d4bb0a,f990c76a-a798-42bb-852f-5be5006f8026,e43973a9-9d5a-4138-9181-a08a0fc55e2f,8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59,9a3986fa-7ce7-4a36-a001-3c9bef9892e2,1bc00f11-e0f3-4e5c-859f-15937dd938cd,209062ef-ab58-422a-b295-36c4eed6e906,be76e8fd-44e2-484d-b090-07e0cc3a56f0,fff7795428560442086f7b2bb6004b65245dc11a,ab1632df-fa59-40e6-8c23-98e004f61148"
requires:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
name: ecr_gc_job_for_caffe2
project: caffe2
tags_to_keep: "376,373,369,348,345,336,325,324,315,306,301,287,283,276,273,266,253,248,238,230,213"
requires:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
name: ecr_gc_job_for_translate
project: translate
tags_to_keep: "8"
requires:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
name: ecr_gc_job_for_tensorcomp
project: tensorcomp
tags_to_keep: "34"
requires:
- docker_for_ecr_gc_build_job


@@ -1,195 +0,0 @@
scheduled-ci:
triggers:
- schedule:
# runs every 4 hours on the 45th minute
cron: "45 0,4,8,12,16,20 * * *"
filters:
branches:
only:
- master
jobs:
- docker_build_job:
name: "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
- pytorch_linux_build:
name: periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_build
requires:
- "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
- pytorch_linux_test:
name: periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_test
requires:
- periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_build
build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-test"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium
- pytorch_linux_build:
name: periodic_libtorch_xenial_cuda11_3_cudnn8_gcc7_build
requires:
- "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
build_environment: "pytorch-libtorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
- pytorch_windows_build:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
name: periodic_pytorch_windows_cuda11.3_build
python_version: "3.6"
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
- pytorch_windows_test:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: periodic_pytorch_windows_cuda11.3_test1
python_version: "3.6"
requires:
- periodic_pytorch_windows_cuda11.3_build
test_name: pytorch-windows-test1
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
- pytorch_windows_test:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: periodic_pytorch_windows_cuda11.3_test2
python_version: "3.6"
requires:
- periodic_pytorch_windows_cuda11.3_build
test_name: pytorch-windows-test2
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
# The following allows these jobs to run on ci-all and release branches
debuggable-scheduled-ci:
jobs:
- docker_build_job:
name: "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_linux_build:
name: pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
requires:
- "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_linux_test:
name: pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_test
requires:
- pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-test"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_linux_build:
name: pytorch_libtorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
requires:
- "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
build_environment: "pytorch-libtorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_build:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
name: pytorch_windows_vs2019_py36_cuda11.3_build
python_version: "3.6"
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: pytorch_windows_vs2019_py36_cuda11.3_test1
python_version: "3.6"
requires:
- pytorch_windows_vs2019_py36_cuda11.3_build
test_name: pytorch-windows-test1
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: pytorch_windows_vs2019_py36_cuda11.3_test2
python_version: "3.6"
requires:
- pytorch_windows_vs2019_py36_cuda11.3_build
test_name: pytorch-windows-test2
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
vc_year: "2019"
filters:
branches:
only:
- /ci-all\/.*/
- /release\/.*/
# the following clones pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7's tests but enables
# slow tests and sets an environment variable so gradcheck runs with fast_mode=False
slow-gradcheck-scheduled-ci:
triggers:
- schedule:
# runs every 8 hours on the 45th minute
cron: "45 0,8,16 * * *"
filters:
branches:
only:
- master
jobs:
- docker_build_job:
name: "docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
- pytorch_linux_build:
name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
requires:
- "docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
- pytorch_linux_test:
name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_old_gradcheck_tests
requires:
- periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-old-gradcheck-tests"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium
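As the comment above the `slow-gradcheck-scheduled-ci` block notes, these jobs run gradcheck with `fast_mode=False`. At the API level that corresponds to something like the following minimal sketch (a toy check of `torch.sin`, not the actual test-suite wiring):

```python
import torch
from torch.autograd import gradcheck

# gradcheck expects double-precision inputs with requires_grad=True
x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

# fast_mode=False forces the full (slow) finite-difference Jacobian check,
# which is what this scheduled workflow exercises across the test suite
assert gradcheck(torch.sin, (x,), fast_mode=False)
```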

@@ -9,6 +9,7 @@ bugprone-*,
-bugprone-reserved-identifier,
cppcoreguidelines-*,
-cppcoreguidelines-avoid-magic-numbers,
-cppcoreguidelines-avoid-non-const-global-variables,
-cppcoreguidelines-interfaces-global-init,
-cppcoreguidelines-macro-usage,
-cppcoreguidelines-owning-memory,
@@ -21,6 +22,7 @@ cppcoreguidelines-*,
-cppcoreguidelines-pro-type-union-access,
-cppcoreguidelines-pro-type-vararg,
-cppcoreguidelines-special-member-functions,
-cppcoreguidelines-non-private-member-variables-in-classes,
-facebook-hte-RelativeInclude,
hicpp-exception-baseclass,
hicpp-avoid-goto,
@@ -31,11 +33,13 @@ modernize-*,
-modernize-use-default-member-init,
-modernize-use-using,
-modernize-use-trailing-return-type,
-modernize-use-nodiscard,
performance-*,
-performance-noexcept-move-constructor,
-performance-unnecessary-value-param,
'
HeaderFilterRegex: 'torch/csrc/.*'
HeaderFilterRegex: 'torch/csrc/(?!deploy/interpreter/cpython).*'
AnalyzeTemporaryDtors: false
WarningsAsErrors: '*'
CheckOptions:
...
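The updated `HeaderFilterRegex` adds a negative lookahead so headers under `torch/csrc/deploy/interpreter/cpython` are excluded from analysis. A quick demonstration of the pattern's intent, using Python's `re` purely for illustration (clang-tidy's own regex engine may treat lookahead differently):

```python
import re

# The new HeaderFilterRegex, read with Perl-style regex semantics
pat = re.compile(r"torch/csrc/(?!deploy/interpreter/cpython).*")

assert pat.match("torch/csrc/jit/ir/ir.h")  # still analyzed
assert pat.match("torch/csrc/deploy/interpreter/cpython/Include/object.h") is None  # excluded
```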

@@ -16,7 +16,6 @@ per-file-ignores = __init__.py: F401 torch/utils/cpp_extension.py: B950
optional-ascii-coding = True
exclude =
./.git,
./build_code_analyzer,
./build_test_custom_build,
./build,
./caffe2,

.gitattributes

@@ -1 +1,4 @@
*.bat text eol=crlf
*.bat text eol=crlf
.circleci/config.yml linguist-generated=true
.github/workflows/generated-*.yml linguist-generated=true
.github/generated-* linguist-generated=true
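These `linguist-generated=true` entries mark the generated CI files so GitHub collapses them in diffs. One way to confirm the attribute resolves as intended, assuming you run this from inside a pytorch checkout (`git check-attr` is the plumbing command that evaluates `.gitattributes` for a path):

```python
import subprocess

# Ask git which value of linguist-generated applies to the generated config
out = subprocess.run(
    ["git", "check-attr", "linguist-generated", "--", ".circleci/config.yml"],
    capture_output=True, text=True, check=True,
).stdout
print(out)  # expected: .circleci/config.yml: linguist-generated: true
```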

@@ -1,49 +0,0 @@
---
name: "\U0001F41B Bug Report"
about: Submit a bug report to help us improve PyTorch
---
## 🐛 Bug
<!-- A clear and concise description of what the bug is. -->
## To Reproduce
Steps to reproduce the behavior:
1.
1.
1.
<!-- If you have a code sample, error messages, stack traces, please provide it here as well -->
## Expected behavior
<!-- A clear and concise description of what you expected to happen. -->
## Environment
Please copy and paste the output from our
[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py)
(or fill out the checklist below manually).
You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (`conda`, `pip`, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
## Additional context
<!-- Add any other context about the problem here. -->

.github/ISSUE_TEMPLATE/bug-report.yml

@@ -0,0 +1,56 @@
name: 🐛 Bug Report
description: Create a report to help us reproduce and fix the bug
body:
- type: markdown
attributes:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/pytorch/pytorch/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
attributes:
label: 🐛 Describe the bug
description: |
Please provide a clear and concise description of what the bug is.
If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:
```python
# All necessary imports at the beginning
import torch
# A succinct reproducing example trimmed down to the essential parts:
t = torch.rand(5, 10) # Note: the bug is here, we should pass requires_grad=True
t.sum().backward()
```
If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
placeholder: |
A clear and concise description of what the bug is.
```python
# Sample code to reproduce the problem
```
```
The error message you got, with the full traceback.
```
validations:
required: true
- type: textarea
attributes:
label: Versions
description: |
Please run the following and paste the output below.
```sh
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
validations:
required: true
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!

.github/ISSUE_TEMPLATE/ci-sev.md

@@ -0,0 +1,39 @@
---
name: "⚠CI SEV"
about: Tracking incidents for PyTorch's CI infra.
---
> NOTE: Remember to label this issue with "`ci: sev`"
## Current Status
*Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (e.g., rebase)*.
## Error looks like
*Provide some way users can tell that this SEV is causing their issue.*
## Incident timeline (all times pacific)
*Include when the incident began, when it was detected, mitigated, root caused, and finally closed.*
<details>
<summary> Click for example </summary>
e.g.
- 10/30 7:27a incident began
- 10/30 8:30a detected by <method>
- 10/30 9:00 pm root caused as…
- 10/30 9:10 pm mitigated by…
- 10/31 10:00 am closed by…
</details>
## User impact
*How does this affect users of PyTorch CI?*
## Root cause
*What was the root cause of this issue?*
## Mitigation
*How did we mitigate the issue?*
## Prevention/followups
*How do we prevent issues like this in the future?*

.github/ISSUE_TEMPLATE/config.yml

@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
- name: Questions
url: https://discuss.pytorch.org/
about: Ask questions and discuss with other pytorch community members

@@ -1,9 +0,0 @@
---
name: "\U0001F4DA Documentation"
about: Report an issue related to https://pytorch.org/docs
---
## 📚 Documentation
<!-- A clear and concise description of what content in https://pytorch.org/docs is an issue. If this has to do with the general https://pytorch.org website, please file an issue at https://github.com/pytorch/pytorch.github.io/issues/new/choose instead. If this has to do with https://pytorch.org/tutorials, please file an issue at https://github.com/pytorch/tutorials/issues/new -->

@@ -0,0 +1,20 @@
name: 📚 Documentation
description: Report an issue related to https://pytorch.org/docs/stable/index.html
body:
- type: textarea
attributes:
label: 📚 The doc issue
description: >
A clear and concise description of what content in https://pytorch.org/docs/stable/index.html is an issue. If this has to do with the general https://pytorch.org website, please file an issue at https://github.com/pytorch/pytorch.github.io/issues/new/choose instead. If this has to do with https://pytorch.org/tutorials, please file an issue at https://github.com/pytorch/tutorials/issues/new.
validations:
required: true
- type: textarea
attributes:
label: Suggest a potential alternative/fix
description: >
Tell us how we could improve the documentation in this regard.
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!

@@ -1,24 +0,0 @@
---
name: "\U0001F680Feature Request"
about: Submit a proposal/request for a new PyTorch feature
---
## 🚀 Feature
<!-- A clear and concise description of the feature proposal -->
## Motivation
<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too -->
## Pitch
<!-- A clear and concise description of what you want to happen. -->
## Alternatives
<!-- A clear and concise description of any alternative solutions or features you've considered, if any. -->
## Additional context
<!-- Add any other context or screenshots about the feature request here. -->

@@ -0,0 +1,25 @@
name: 🚀 Feature request
description: Submit a proposal/request for a new pytorch feature
body:
- type: textarea
attributes:
label: 🚀 The feature, motivation and pitch
description: >
A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
validations:
required: true
- type: textarea
attributes:
label: Alternatives
description: >
A description of any alternative solutions or features you've considered, if any.
- type: textarea
attributes:
label: Additional context
description: >
Add any other context or screenshots about the feature request.
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!

@@ -1,13 +0,0 @@
---
name: "❓Questions/Help/Support"
about: Do you need support? We have resources.
---
## ❓ Questions and Help
### Please note that this issue tracker is not a help form and this issue will be closed.
We have a set of [listed resources available on the website](https://pytorch.org/resources). Our primary means of support is our discussion forum:
- [Discussion Forum](https://discuss.pytorch.org/)

@@ -1 +1 @@
Fixes #{issue number}
Fixes #ISSUE_NUMBER

.github/actionlint.yaml

@@ -0,0 +1,11 @@
self-hosted-runner:
labels:
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- windows.4xlarge
- windows.8xlarge.nvidia.gpu
- bm-runner
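This actionlint config declares the self-hosted runner labels that workflows may reference in `runs-on`. A rough sketch of the kind of check actionlint performs with it, assuming PyYAML (`pip install pyyaml`); the function name and the `linux.`/`windows.` prefix heuristic for spotting self-hosted labels are illustrative assumptions, not actionlint's actual implementation:

```python
import yaml  # third-party: pip install pyyaml

ALLOWED = {  # the self-hosted-runner labels declared above
    "linux.large", "linux.2xlarge", "linux.4xlarge", "linux.4xlarge.nvidia.gpu",
    "linux.8xlarge.nvidia.gpu", "linux.16xlarge.nvidia.gpu",
    "windows.4xlarge", "windows.8xlarge.nvidia.gpu", "bm-runner",
}

def unknown_self_hosted_labels(workflow_path: str) -> set:
    """Collect runs-on labels that look self-hosted but are not declared above."""
    with open(workflow_path) as f:
        wf = yaml.safe_load(f)
    found = set()
    for job in (wf.get("jobs") or {}).values():
        runs_on = job.get("runs-on", [])
        found.update([runs_on] if isinstance(runs_on, str) else runs_on)
    # Naive heuristic: treat the repo's label namespaces as self-hosted
    self_hosted = {l for l in found if l.startswith(("linux.", "windows.")) or l == "bm-runner"}
    return self_hosted - ALLOWED
```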

.github/generated-ciflow-ruleset.json

@@ -0,0 +1,266 @@
{
"__comment": "@generated DO NOT EDIT MANUALLY, Generation script: .github/scripts/generate_ci_workflows.py",
"label_rules": {
"ciflow/all": [
"caffe2-linux-xenial-py3.7-gcc5.4",
"docker-builds",
"ios-12-5-1-arm64",
"ios-12-5-1-arm64-coreml",
"ios-12-5-1-arm64-custom-ops",
"ios-12-5-1-arm64-full-jit",
"ios-12-5-1-arm64-metal",
"ios-12-5-1-x86-64",
"ios-12-5-1-x86-64-coreml",
"ios-12-5-1-x86-64-full-jit",
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7",
"libtorch-linux-xenial-cuda11.3-py3.7-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-bionic-py3.7-clang9",
"linux-docs",
"linux-docs-push",
"linux-vulkan-bionic-py3.7-clang9",
"linux-xenial-cuda11.3-py3.7-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test",
"linux-xenial-cuda11.3-py3.7-gcc7-no-ops",
"linux-xenial-py3-clang5-mobile-build",
"linux-xenial-py3-clang5-mobile-custom-build-static",
"linux-xenial-py3.7-clang7-asan",
"linux-xenial-py3.7-clang7-onnx",
"linux-xenial-py3.7-gcc5.4",
"linux-xenial-py3.7-gcc7",
"linux-xenial-py3.7-gcc7-no-ops",
"macos-10-15-py3-arm64",
"macos-10-15-py3-lite-interpreter-x86-64",
"macos-11-py3-x86-64",
"parallelnative-linux-xenial-py3.7-gcc5.4",
"periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7",
"periodic-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck",
"periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug",
"periodic-win-vs2019-cuda11.1-py3",
"periodic-win-vs2019-cuda11.5-py3",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"win-vs2019-cpu-py3",
"win-vs2019-cuda11.3-py3"
],
"ciflow/android": [
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit"
],
"ciflow/bazel": [
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test"
],
"ciflow/binaries": [
"linux-binary-conda",
"linux-binary-libtorch-cxx11-abi",
"linux-binary-libtorch-pre-cxx11",
"linux-binary-manywheel"
],
"ciflow/binaries/conda": [
"linux-binary-conda"
],
"ciflow/binaries/libtorch": [
"linux-binary-libtorch-cxx11-abi",
"linux-binary-libtorch-pre-cxx11"
],
"ciflow/binaries/wheel": [
"linux-binary-manywheel"
],
"ciflow/cpu": [
"caffe2-linux-xenial-py3.7-gcc5.4",
"linux-bionic-py3.7-clang9",
"linux-docs",
"linux-docs-push",
"linux-vulkan-bionic-py3.7-clang9",
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test",
"linux-xenial-py3.7-clang7-asan",
"linux-xenial-py3.7-clang7-onnx",
"linux-xenial-py3.7-gcc5.4",
"linux-xenial-py3.7-gcc7",
"linux-xenial-py3.7-gcc7-no-ops",
"parallelnative-linux-xenial-py3.7-gcc5.4",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"win-vs2019-cpu-py3"
],
"ciflow/cuda": [
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7",
"libtorch-linux-xenial-cuda11.3-py3.7-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7-no-ops",
"periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7",
"periodic-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck",
"periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug",
"periodic-win-vs2019-cuda11.1-py3",
"periodic-win-vs2019-cuda11.5-py3",
"win-vs2019-cuda11.3-py3"
],
"ciflow/default": [
"linux-bionic-py3.7-clang9",
"linux-docs",
"linux-vulkan-bionic-py3.7-clang9",
"linux-xenial-cuda11.3-py3.7-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test",
"linux-xenial-py3-clang5-mobile-build",
"linux-xenial-py3-clang5-mobile-custom-build-static",
"linux-xenial-py3.7-clang7-asan",
"linux-xenial-py3.7-clang7-onnx",
"linux-xenial-py3.7-gcc5.4",
"linux-xenial-py3.7-gcc7",
"linux-xenial-py3.7-gcc7-no-ops",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"win-vs2019-cpu-py3",
"win-vs2019-cuda11.3-py3"
],
"ciflow/docs": [
"linux-docs"
],
"ciflow/ios": [
"ios-12-5-1-arm64",
"ios-12-5-1-arm64-coreml",
"ios-12-5-1-arm64-custom-ops",
"ios-12-5-1-arm64-full-jit",
"ios-12-5-1-arm64-metal",
"ios-12-5-1-x86-64",
"ios-12-5-1-x86-64-coreml",
"ios-12-5-1-x86-64-full-jit"
],
"ciflow/libtorch": [
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7",
"libtorch-linux-xenial-cuda11.3-py3.7-gcc7",
"periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7"
],
"ciflow/linux": [
"caffe2-linux-xenial-py3.7-gcc5.4",
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7",
"libtorch-linux-xenial-cuda11.3-py3.7-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-bionic-py3.7-clang9",
"linux-docs",
"linux-docs-push",
"linux-vulkan-bionic-py3.7-clang9",
"linux-xenial-cuda11.3-py3.7-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test",
"linux-xenial-cuda11.3-py3.7-gcc7-no-ops",
"linux-xenial-py3-clang5-mobile-build",
"linux-xenial-py3-clang5-mobile-custom-build-static",
"linux-xenial-py3.7-clang7-asan",
"linux-xenial-py3.7-clang7-onnx",
"linux-xenial-py3.7-gcc5.4",
"linux-xenial-py3.7-gcc7",
"linux-xenial-py3.7-gcc7-no-ops",
"parallelnative-linux-xenial-py3.7-gcc5.4",
"periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7",
"periodic-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck",
"periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit"
],
"ciflow/macos": [
"ios-12-5-1-arm64",
"ios-12-5-1-arm64-coreml",
"ios-12-5-1-arm64-custom-ops",
"ios-12-5-1-arm64-full-jit",
"ios-12-5-1-arm64-metal",
"ios-12-5-1-x86-64",
"ios-12-5-1-x86-64-coreml",
"ios-12-5-1-x86-64-full-jit",
"macos-10-15-py3-arm64",
"macos-10-15-py3-lite-interpreter-x86-64",
"macos-11-py3-x86-64"
],
"ciflow/mobile": [
"linux-xenial-py3-clang5-mobile-build",
"linux-xenial-py3-clang5-mobile-custom-build-static"
],
"ciflow/noarch": [
"linux-bionic-py3.7-clang9"
],
"ciflow/onnx": [
"linux-xenial-py3.7-clang7-onnx"
],
"ciflow/sanitizers": [
"linux-xenial-py3.7-clang7-asan"
],
"ciflow/scheduled": [
"linux-docs-push",
"periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7",
"periodic-linux-bionic-cuda11.5-py3.7-gcc7",
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck",
"periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug",
"periodic-win-vs2019-cuda11.1-py3",
"periodic-win-vs2019-cuda11.5-py3"
],
"ciflow/slow": [
"linux-bionic-cuda10.2-py3.9-gcc7",
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck"
],
"ciflow/slow-gradcheck": [
"periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck"
],
"ciflow/trunk": [
"caffe2-linux-xenial-py3.7-gcc5.4",
"docker-builds",
"ios-12-5-1-arm64",
"ios-12-5-1-arm64-coreml",
"ios-12-5-1-arm64-custom-ops",
"ios-12-5-1-arm64-full-jit",
"ios-12-5-1-arm64-metal",
"ios-12-5-1-x86-64",
"ios-12-5-1-x86-64-coreml",
"ios-12-5-1-x86-64-full-jit",
"libtorch-linux-xenial-cuda10.2-py3.7-gcc7",
"libtorch-linux-xenial-cuda11.3-py3.7-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-bionic-py3.7-clang9",
"linux-docs",
"linux-vulkan-bionic-py3.7-clang9",
"linux-xenial-cuda11.3-py3.7-gcc7",
"linux-xenial-cuda11.3-py3.7-gcc7-bazel-test",
"linux-xenial-cuda11.3-py3.7-gcc7-no-ops",
"linux-xenial-py3-clang5-mobile-build",
"linux-xenial-py3-clang5-mobile-custom-build-static",
"linux-xenial-py3.7-clang7-asan",
"linux-xenial-py3.7-clang7-onnx",
"linux-xenial-py3.7-gcc5.4",
"linux-xenial-py3.7-gcc7",
"linux-xenial-py3.7-gcc7-no-ops",
"macos-10-15-py3-arm64",
"macos-10-15-py3-lite-interpreter-x86-64",
"macos-11-py3-x86-64",
"parallelnative-linux-xenial-py3.7-gcc5.4",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"win-vs2019-cpu-py3",
"win-vs2019-cuda11.3-py3"
],
"ciflow/vulkan": [
"linux-vulkan-bionic-py3.7-clang9"
],
"ciflow/win": [
"periodic-win-vs2019-cuda11.1-py3",
"periodic-win-vs2019-cuda11.5-py3",
"win-vs2019-cpu-py3",
"win-vs2019-cuda11.3-py3"
],
"ciflow/xla": [
"linux-bionic-py3.7-clang9"
]
},
"version": "v1"
}
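The generated ruleset above maps each `ciflow/*` label to the workflows it triggers. A minimal sketch of consuming it, assuming you run this from a pytorch checkout (the helper name `workflows_for` is hypothetical):

```python
import json

# Resolve a ciflow label to its generated workflows via the label_rules mapping above
with open(".github/generated-ciflow-ruleset.json") as f:
    ruleset = json.load(f)

def workflows_for(label: str) -> list:
    return ruleset["label_rules"].get(label, [])

print(workflows_for("ciflow/bazel"))  # ['linux-xenial-cuda11.3-py3.7-gcc7-bazel-test']
```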

Some files were not shown because too many files have changed in this diff.