Compare commits

...

3467 Commits

Author SHA1 Message Date
d69c22dd61 [docs] Add torch.package documentation for beta release (#59886)
**Summary**
This commit adds documentation for the `torch.package` module to
accompany its beta release in 1.9.

**Test Plan**
Continuous integration.
2021-06-11 13:43:27 -07:00
4ad4f6db7f hold references to storages during TorchScript serialization (#59672)
Fixes a serialization problem caused by using the memory address of storages as identifiers for mobile and torch.package models.

 - https://github.com/pytorch/pytorch/pull/59642 hold references to storages during TorchScript serialization

Uses StorageContext to hold a reference to all storages seen during TorchScript serialization, allowing tensors to be created and destroyed during the serialization process. Tracking the storages solves the ABA memory problem.
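A minimal Python sketch of the ABA hazard (illustrative only; the actual fix lives in the C++ serializer): an allocator may give a new storage the same address a freed one had, so keying a dedup table on the raw address can conflate distinct tensors.

```
import torch

# Addresses are allocator-dependent, so reuse is not guaranteed -- but when
# it happens, an address-keyed table sees "the same" storage twice (the ABA
# problem). Holding references, as StorageContext does, keeps every address
# alive and therefore unique for the duration of serialization.
a = torch.ones(4)
addr_a = a.storage().data_ptr()
del a                       # storage may be freed here...
b = torch.zeros(4)          # ...and its address reused by a new storage
addr_b = b.storage().data_ptr()
print(addr_a == addr_b)     # may print True: same key, different tensor
```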
2021-06-11 13:42:58 -07:00
90e67738b1 [Release/1.9] Link whole CuDNN for CUDA-11.1 (#59873)
* Move cublas dependency after CuDNN (#58287)

Summary:
Library linking order matters during static linking.
Not sure whether it's a bug or a feature, but if cublas is referenced
before CuDNN, it will be partially statically linked into the library,
even if it is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58287

Reviewed By: janeyx99

Differential Revision: D28433165

Pulled By: malfet

fbshipit-source-id: 8dffa0533075126dc383428f838f7d048074205c

* [CMake] Split caffe2::cudnn into public and private (#59721)

Summary:
This is only important for builds where cuDNN is linked statically into libtorch_cpu.
Before this PR, PyTorch wheels often accidentally contained several partial copies of the cudnn_static library.
Splitting the interface into header-only (cudnn-public) and library+headers (cudnn-private) prevents that from happening.
A preliminary step towards optionally linking the whole cudnn library to work around the issue reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59721

Reviewed By: ngimel

Differential Revision: D29000967

Pulled By: malfet

fbshipit-source-id: f054df92b265e9494076ab16c247427b39da9336

* Add USE_WHOLE_CUDNN option (#59744)

Summary:
It is only enabled if USE_STATIC_CUDNN is enabled.

Next step after https://github.com/pytorch/pytorch/pull/59721 towards resolving the fast-kernel stripping reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59744

Reviewed By: seemethere, ngimel

Differential Revision: D29007314

Pulled By: malfet

fbshipit-source-id: 7091e299c0c6cc2a8aa82fbf49312cecf3bb861a

* [Binary] Link whole CuDNN for CUDA-11.1 (#59802)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59802

Reviewed By: driazati, seemethere

Differential Revision: D29033537

Pulled By: malfet

fbshipit-source-id: e816fc71f273ae0b4ba8a0621d5368a2078561a1
2021-06-11 10:38:31 -07:00
43c581aa62 Make detach return an alias even under inference mode (#59633) (#59757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59633

Fixes #59614

This fix isn't 100% correct but it appears to stem the bleeding.
A better fix would be to understand how to detect when function
implementations don't uphold required invariants, leading to
refcount disaster.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28962183

Pulled By: ezyang

fbshipit-source-id: 6ec71994666289dadef47bac363e6902df90b094
2021-06-11 10:04:14 -07:00
bc446f6a54 Fix test_randperm_device_compatibility for 1 GPU (#59484) (#59502)
Summary:
Do not try to create tensors on 2nd device if device_count() == 1
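A hedged sketch of the guard this fix adds (public API calls; the actual test code may differ):

```
import torch

# Only exercise a second CUDA device when one actually exists; otherwise
# 'cuda:1' is an invalid device ordinal and the test fails spuriously.
if torch.cuda.device_count() > 1:
    g = torch.Generator(device='cuda:1')
    torch.randperm(10, generator=g, device='cuda:1')
```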

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59484

Reviewed By: ngimel

Differential Revision: D28910673

Pulled By: malfet

fbshipit-source-id: e3517f31a463dd049ce8a5155409b7b716c8df18
2021-06-04 20:01:02 -07:00
abe996a7fb Move CUDA async warning to suffix (#59467) (#59501)
Summary:
After the change async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467

Reviewed By: ngimel

Differential Revision: D28904360

Pulled By: malfet

fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
2021-06-04 20:00:55 -07:00
795df76568 Do not use gold linker for CUDA builds (#59490) (#59500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59490

Reviewed By: agolynski, seemethere

Differential Revision: D28913160

Pulled By: malfet

fbshipit-source-id: d27092c252fc86424028abe146cf5f33a2f74544
2021-06-04 20:00:45 -07:00
3b9cd08901 Prefer accurate reciprocal on ARMv8 (#59361) (#59470)
Summary:
The default NEON-accelerated implementation of reciprocal uses vrecpeq_f32, which yields a Newton-Raphson approximation rather than the actual value.
Use regular NEON-accelerated division for the reciprocal and reciprocal-square-root operations.

This fixes `test_reference_numerics_hard_frac_cpu_float32`, `test_reference_numerics_normal_rsqrt_cpu_float32` etc
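For intuition, a small Python sketch of the underlying math: `vrecpeq_f32` returns only a coarse initial estimate, meant to be refined with Newton-Raphson steps (`r' = r * (2 - x * r)`); using it unrefined is what broke the reference-numerics tests.

```
# Newton-Raphson refinement of a reciprocal estimate -- the iteration that
# NEON's vrecpeq_f32/vrecpsq_f32 pair implements in hardware. The raw
# estimate is only ~8 bits accurate; exact division avoids the issue.
def refine(x, r):
    return r * (2.0 - x * r)

x, r = 3.0, 0.33            # 0.33 stands in for the coarse hardware estimate
for _ in range(3):
    r = refine(x, r)
print(r, 1.0 / x)           # converges toward 0.3333...
```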

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59361

Reviewed By: mruberry

Differential Revision: D28870456

Pulled By: malfet

fbshipit-source-id: e634b0887cce7efb046ea1fd9b74424e0eceb164
2021-06-04 18:34:39 -07:00
226c274f70 Search for static OpenBLAS compiled with OpenMP (#59428) (#59463)
Summary:
Before this change, only dynamically linked OpenBLAS compiled with OpenMP could be found.

Also get rid of hardcoded codepath for libgfortran.a in FindLAPACK.cmake

Only affects aarch64 linux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59428

Reviewed By: agolynski

Differential Revision: D28891314

Pulled By: malfet

fbshipit-source-id: 5af55a14c85ac66551ad2805c5716bbefe8d55b2
2021-06-04 11:15:58 -07:00
ce24cab257 Fix torch.randperm for CUDA (#59352) (#59452)
Summary:
Context https://github.com/pytorch/pytorch/issues/58545

The logic is that we are going to keep it consistent for both
torch.randperm and torch.randint (see the sketch below):

1. Generators can have either a fully-specified or a non-fully-specified device
2. As long as the device type matches the result's, we don't error out
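A hedged illustration of the intended rule (the exact error message may differ):

```
import torch

g = torch.Generator(device='cuda')                      # device type only
torch.randperm(5, generator=g, device='cuda:0')         # OK: types match
torch.randint(10, (5,), generator=g, device='cuda:0')   # same rule applies
# torch.randperm(5, generator=g, device='cpu')          # error: type mismatch
```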

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59352

Test Plan:
```
python test/test_tensor_creation_ops.py -k TestRandomTensorCreation
```

Reviewed By: ngimel

Differential Revision: D28855920

Pulled By: zhouzhuojie

fbshipit-source-id: f8141a2c4b2f177e1aa7baec6999b65916cba02c
2021-06-04 10:23:29 -07:00
d98d113810 .circleci: Disable USE_GOLD_LINKER for CUDA 10.2 (#59413) (#59462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59413

For CUDA 10.2 builds linked with the gold linker we were observing
crashes when exceptions were being raised

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28888054

Pulled By: seemethere

fbshipit-source-id: f9b38147591721803ed3cac607510fe5bbc49d6d
(cherry picked from commit c7a3a13baba0d547c5c20579328b0b3d83b94656)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-06-04 10:22:51 -07:00
17a44c2bb5 Added missing namespaces for C++ API (#45736) (#59367)
Summary:
Hello,

depending on the build environment you may encounter
```c++
error: reference to 'optional' is ambiguous
```
when using the Torch-C++-API.

This PR adds `c10::` to avoid possible ambiguities with **std::optional** and does not introduce any functional change.

Fixes https://discuss.pytorch.org/t/linker-failed-with-ambiguous-references/36255 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45736

Reviewed By: dzhulgakov

Differential Revision: D24125123

Pulled By: VitalyFedyunin

fbshipit-source-id: df21420f0a2d0270227c28976a7a4218315cc107

Co-authored-by: Johannes Czech <QueensGambit@users.noreply.github.com>
2021-06-03 10:39:51 -07:00
26e6fa380e [vulkan] Remove constant duplication for Vulkan optimize_for_mobile (#59341)
ghstack-source-id: bb809586d27d1285660d1db2c3561b46d158f499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59276
2021-06-03 09:45:56 -07:00
bf16699cc8 [Release-1.9] Disable failing ROCM-4.2 tests (#59339)
* [ROCm] disable test test_Conv2d_groups_nobias for ROCm (#59158)

Summary:
Disabling the test since its failing in ROCm4.2

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59158

Reviewed By: mruberry

Differential Revision: D28808953

Pulled By: ngimel

fbshipit-source-id: 134f147ead6dc559d2cde49cf8343cd976e6c224

* [ROCm] disable test test_Conv2d_groups_nobias_v2 for ROCm (#58701)

Summary:
Disable test_Conv2d_groups_nobias_v2 test because it is failing on ROCm 4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58701

Reviewed By: ngimel

Differential Revision: D28626651

Pulled By: mruberry

fbshipit-source-id: a74bdf45335ae2afee0aa5e3bece6e208e75a63f

Co-authored-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Co-authored-by: Kyle Chen <kylechen@amd.com>
2021-06-02 15:07:06 -07:00
6d4fe05502 Build with USE_GLOO_WITH_OPENSSL=1 (#59274)
Needed for https://github.com/pytorch/builder/pull/779

Co-authored-by: Your Name <driazati@users.noreply.github.com>
2021-06-02 08:18:25 -07:00
b046542f8a Add breakpad + debug builds (#59275)
This is the combination of #59236 and #58685 which will enable <insert builder PR here> to land on the release branch. This enables breakpad for minidump collection (which is still opt-in) and debug builds for the release.

Co-authored-by: Your Name <driazati@users.noreply.github.com>
2021-06-01 23:32:08 -07:00
5d57b9392c [pkg] Catch exceptions where dependency resolution gets invalid imports (#58573) (#59272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58573

Users can create invalid imports, like:
```
HG: in a top-level package
if False:
  from .. import foo
```

Since this code is never executed, it will not cause the module to fail to
load. But our dependency analysis walks every `import` statement in the AST,
and will attempt to resolve the (incorrectly formed) import, throwing an exception.

For posterity, the code that triggered this: https://git.io/JsCgM

Differential Revision: D28543980

Test Plan: Added a unit test

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: 03b7e274633945b186500fab6f974973ef8c7c7d

Co-authored-by: Michael Suo <suo@fb.com>
2021-06-01 15:51:38 -07:00
f6a9351776 [pkg] simplifications to broken dependency handling (#58572) (#59273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58572

Right now, we have three categories of error (broken, denied, unhandled). This
PR unifies them into a single "error" field in the node, with optional context.
It also generalizes how formatting of the error in PackagingError occurs.
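A Python sketch of the shape this unification implies (assumed structure, not the actual torch.package internals):

```
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    # A single optional error replaces the broken/denied/unhandled trio;
    # 'context' carries the optional detail used when PackagingError
    # formats its message.
    error: Optional[str] = None
    context: Optional[str] = None
```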

Differential Revision: D28543982

Test Plan: sandcastle

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: d99d37699ec2e172e3798763e60aafe9a66ed6f4

Co-authored-by: Michael Suo <suo@fb.com>
2021-06-01 15:51:30 -07:00
3071601491 [c10d] Fix monitored_barrier with wait_all_ranks (#58702) (#59266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58702

Off-by-one error when determining whether some ranks failed with
`wait_all_ranks=True`. This wasn't caught by tests because the tests only
covered failure scenarios, not success scenarios with `wait_all_ranks=True`.
ghstack-source-id: 129559840

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28583235

fbshipit-source-id: a8f376efb13a3f36c788667acab86543c80aff59
2021-06-01 15:45:16 -07:00
d417a094f3 Document factory_kwargs in nn.Quantize + remove Attributes section (#59025) (#59045)
Summary:
The `factory_kwargs` kwarg was previously undocumented in `nn.Quantize`. Further, the `Attributes` section of the docs was improperly filled in, resulting in bad formatting. This section doesn't apply since `nn.Quantize` doesn't have parameters, so it has been removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59025

Reviewed By: anjali411

Differential Revision: D28723889

Pulled By: jbschlosser

fbshipit-source-id: ba86429f66d511ac35042ebd9c6cc3da7b6b5805

Co-authored-by: Joel Schlosser <jbschlosser@fb.com>
2021-05-27 20:53:52 -07:00
1fdbbc96ae fix unique for discontiguous inputs (#59003) (#59055)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59003

Reviewed By: mruberry

Differential Revision: D28714534

Pulled By: ngimel

fbshipit-source-id: d9bf82f54be5b5919e27281e49fad74e00d8b766
2021-05-27 20:52:42 -07:00
e761f16ad5 Collect kernel version (#58485) (#59121)
Summary:
Collect env should collect kernel and glibc version

Fixes https://github.com/pytorch/pytorch/issues/58387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58485

Reviewed By: walterddr

Differential Revision: D28510564

Pulled By: malfet

fbshipit-source-id: ad3d4b93f51db052720bfaa4322138c55816921b
2021-05-27 17:12:31 -07:00
0544a765d3 Split CUDA SpectralOp (#58459) (#59120)
Summary:
Move all cuFFT-related parts to SpectralOps.cpp.
Leave only _fft_fill_with_conjugate_symmetry_cuda_ in SpectralOps.cu.

Keep `CUDAHooks.cpp` in torch_cuda_cpp by introducing an `at::cuda::detail::THCMagma_init` functor and registering it from a global constructor in `THCTensorMathMagma.cu`.

Move the entire detail folder to the torch_cuda_cpp library.

This is a no-op that helps greatly reduce binary size for CUDA-11.x builds by avoiding cufft/cudnn symbol duplication between torch_cuda_cpp (which makes most of the cuFFT calls) and torch_cuda_cu (which only needed it to compile SpectralOps.cu).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58459

Reviewed By: ngimel

Differential Revision: D28499001

Pulled By: malfet

fbshipit-source-id: 425a981beb383c18a79d4fbd9b49ddb4e5133291
2021-05-27 17:08:20 -07:00
1ea8ae5d93 Refactor GlooDeviceFactory::makeDeviceFor... (#58996) (#59118)
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost
duplicates except for different default argument values.

Create a generic `makeGlooDevice` anonymous function that takes both a host
name and an interface name, and call it from both
makeDeviceFor[Hostname|Interface] (sketched in Python below).

Also solve two other minor issues:
 - do not call `getenv("GLOO_DEVICE_TRANSPORT")` at library load time
 - raise an exception rather than crash if GLOO_DEVICE_TRANSPORT is set to an unknown value
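The refactor pattern, sketched in Python for brevity (the real code is C++; names follow the commit message):

```
def make_gloo_device(hostname=None, interface=None):
    """Single generic constructor that both wrappers delegate to."""
    ...

def make_device_for_hostname(hostname=""):
    return make_gloo_device(hostname=hostname)

def make_device_for_interface(interface=""):
    return make_gloo_device(interface=interface)
```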

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996

Reviewed By: pbelevich

Differential Revision: D28713324

Pulled By: malfet

fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d
2021-05-27 17:08:09 -07:00
97ca7303b0 [ROCm] fix JIT codegen (#57400) (#59116)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2021-05-27 17:07:45 -07:00
ac94547143 Change link order for BUILD_SPLIT_CUDA option (#58437) (#59119)
Summary:
torch_cuda_cu depends on torch_cuda_cpp, so it should be linked first.
Otherwise the linker keeps lots of cudnn symbols for no good reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58437

Reviewed By: janeyx99

Differential Revision: D28496472

Pulled By: malfet

fbshipit-source-id: 338605ff755591476070c172a6ea0a0dcd0beb23
2021-05-27 17:07:39 -07:00
e2027acebe Add underscores to some internal names (#59105)
* Add underscores to some internal names

Summary:
Add underscores to some of the internal names

Test Plan:
python test/test_profiler.py -v

Reviewers: anjali411

[ghstack-poisoned]

Co-authored-by: ilia-cher <iliacher@fb.com>
2021-05-27 14:19:13 -07:00
0896c6b1f0 fix nn.MHA scriptability (#58727) (#59072)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58727

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28593830

Pulled By: bhosmer

fbshipit-source-id: 37dee9efededaea9985a2bf040df1ba4b46f6580
2021-05-27 10:30:10 -07:00
43f6675363 [PyTorch] Remove device check from a few indexing methods (#58800) (#59048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58800

These methods leverages TensorIterator which will handle
(or skip) device check.
ghstack-source-id: 129654358

Test Plan: CI && sandcastle

Reviewed By: ngimel

Differential Revision: D28622626

fbshipit-source-id: 6153299780d4f7bf286423520ba4cb60b554335e

Co-authored-by: Wenlei Xie <wxie@fb.com>
2021-05-27 10:28:56 -07:00
450f5c6f4d Add docstring for is_inference_mode_enabled (#59047) (#59085)
Summary:
Fixes #{issue number}

Testing:
```
>>> import torch
>>> torch.is_inference_mode_enabled.__doc__
'\nis_inference_mode_enabled(input) -> (bool)\n\nReturns True if inference mode is currently enabled.\n\nArgs:\n    input (Tensor): the input tensor.\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59047

Reviewed By: ailzhang

Differential Revision: D28726991

Pulled By: soulitzer

fbshipit-source-id: c117c7d73e551a1b5f0e215f2aed528bf558ef7c
2021-05-27 10:27:32 -07:00
310e528a0d Add UninitializedBuffer to nn docs (#59021) (#59044)
Summary:
The `UninitializedBuffer` class was previously left out of `nn.rst`, so it was not included in the generated documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59021

Reviewed By: anjali411

Differential Revision: D28723044

Pulled By: jbschlosser

fbshipit-source-id: 71e15b0c7fabaf57e8fbdf7fbd09ef2adbdb36ad

Co-authored-by: Joel Schlosser <jbschlosser@fb.com>
2021-05-27 10:27:20 -07:00
e4161d0b2b Add sparse_csr_tensor to BC allow-list (#59093)
Fix for intentional regression in #59001

Co-authored-by: driazati <driazati@users.noreply.github.com>
2021-05-27 10:27:00 -07:00
016dc8cb68 Fix build regression caused by https://github.com/pytorch/pytorch/pull/58940 (#59008)
s/Vectorized/Vec256/

Vec256 was renamed to Vectorized on master after the branch cut.
2021-05-26 11:55:50 -07:00
a3ea5cee52 [docs] Clarify batch_first behavior for nn.LSTM, nn.RNN, and nn.GRU (#58809) (#58958)
Summary:
Fixes the high-pri doc component of https://github.com/pytorch/pytorch/issues/4145.

To make the input / output shapes more readable for both `batch_first` states, this PR also introduces short dim names. Opinions welcome on the readability of the restructured docs!

A screenshot for `nn.LSTM` is attached to the PR (image omitted here).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58809

Reviewed By: gchanan

Differential Revision: D28685415

Pulled By: jbschlosser

fbshipit-source-id: e8c92e3d7e052071a505b55dca976fd2ef5a8307

Co-authored-by: Joel Schlosser <jbschlosser@fb.com>
2021-05-26 11:12:56 -07:00
dfc58f4faa Underscore prefix sparse_csr_tensor and to_sparse_csr (#59001)
* Underscore prefix sparse_csr_tensor and to_sparse_csr

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* fix lint

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2021-05-26 11:11:25 -07:00
b5e2635281 Add mish activation function (#58648) (#58940)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375
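A quick sanity check of the definition, runnable on builds that include this commit: Mish is `x * tanh(softplus(x))`.

```
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
manual = x * torch.tanh(F.softplus(x))
print(torch.allclose(F.mish(x), manual))            # functional form
print(torch.allclose(torch.nn.Mish()(x), manual))   # module form
```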

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4

Co-authored-by: Adnios <2780199647@qq.com>
2021-05-25 13:30:36 -07:00
9dfd2e7b56 Add no-grad inference mode note (#58513) (#58939)
Summary:
Adds a note to the autograd docs explaining the difference between several often-conflated mechanisms (illustrated below).
Also adds a link to this note from the docs in `grad_mode` and `nn.module`.
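A minimal illustration of the mechanisms the note compares (both context managers are public API in 1.9):

```
import torch

with torch.no_grad():
    y = torch.ones(2, requires_grad=True) * 2   # no graph is recorded

with torch.inference_mode():
    z = torch.ones(2) * 2                       # also skips view/version tracking
    print(torch.is_inference_mode_enabled())    # True

print(y.requires_grad, z.requires_grad)         # False False
```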

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58513

Reviewed By: gchanan

Differential Revision: D28651129

Pulled By: soulitzer

fbshipit-source-id: af9eb1749b641fc1b632815634eea36bf7979156
2021-05-25 13:30:29 -07:00
f0bdbb4ce1 [Release/1.9][DataLoader] Add keyword arg to meta and support abc for typing (#58848)
ghstack-source-id: 36e1ae3e08cf19da25c00a0a5e8a2bd0ab9530c3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58450
2021-05-24 10:26:54 -07:00
bc4471c8c9 catch exception when running print regression (#58751) (#58752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58751

Test Plan: https://github.com/pytorch/pytorch/issues/58752

Reviewed By: samestep

Differential Revision: D28605667

Pulled By: walterddr

fbshipit-source-id: 3796c924df8e50849dd08ecbeab612ba4f0c569b
2021-05-23 22:30:07 -07:00
317fd72526 Quote in setup-ci-env (#58637) (#58763)
Summary:
Do not quote arguments that contain no spaces in add_to_env_file (sketched below).

The ENV file is used both by bash and by docker, and docker does not
strip quotes when they are present.
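A sketch of the rule in Python (`add_to_env_file` is the helper named above; this body is illustrative, not the actual script):

```
def add_to_env_file(name, value, env_file="/tmp/env"):
    # docker's --env-file treats quotes as literal characters, while bash
    # strips them, so only quote values that actually contain a space.
    quoted = f'"{value}"' if " " in value else value
    with open(env_file, "a") as f:
        f.write(f"{name}={quoted}\n")
```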

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58637

Reviewed By: wconstab

Differential Revision: D28561159

Pulled By: malfet

fbshipit-source-id: 0843aad22703b6c3adebeb76175de1cfc1a974b5
2021-05-21 13:55:42 -07:00
b8d36033f0 Enables builds with Compute Library backend for oneDNN (#55913) (#58746)
Summary:
Since v1.7, oneDNN (MKL-DNN) has supported the use of Compute Library
for the Arm architecture to provide optimised convolution primitives
on AArch64.

This change enables the use of Compute Library in the PyTorch build.
Following the approach used to enable the use of CBLAS in MKLDNN,
it is enabled by setting the env vars USE_MKLDNN and USE_MKLDNN_ACL.
The location of the Compute Library build must be set using `ACL_ROOT_DIR`.

This is an extension of the work in https://github.com/pytorch/pytorch/pull/50400
which added support for the oneDNN/MKL-DNN backend on AArch64.

_Note: this assumes that Compute Library has been built and installed at
ACL_ROOT_DIR. Compute library can be downloaded here:
`https://github.com/ARM-software/ComputeLibrary`_

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55913

Reviewed By: ailzhang

Differential Revision: D28559516

Pulled By: malfet

fbshipit-source-id: 29d24996097d0a54efc9ab754fb3f0bded290005
2021-05-21 10:59:00 -07:00
47507259b9 [PyTorch Edge] Use lite interpreter as default and bump model version (#58630)
* [PyTorch Edge] bytecode version bump to v5 and enable share constant table

* [Pytorch] Build lite interpreter as default for iOS

* [Pytorch] Build lite interpreter as default for Android
2021-05-20 17:43:14 -07:00
e77e8d52da Add grid_sample to fp32 list (#58683) 2021-05-20 17:34:03 -07:00
b9fb6d1c7e fix nonzero perf regression (#58714) 2021-05-20 17:31:34 -07:00
1ea310bc8e [1.9] remove gate for beta feature (torchscript support in torch.package) (#58620) 2021-05-19 15:21:11 -07:00
8e6b8d8d46 Add shape documentation for CosineEmbeddingLoss (#58403) (#58590)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58403

Reviewed By: HDCharles

Differential Revision: D28480076

Pulled By: jbschlosser

fbshipit-source-id: c2c51e9da86e274e80126bbcabebb27270f2d2d0

Co-authored-by: Joel Schlosser <jbschlosser@fb.com>
2021-05-19 14:05:11 -07:00
87c46a5e32 [1.9] Remove torch.vmap (#58589)
torch.vmap is a prototype feature and should not be in the stable
binary. This PR:
- Removes the torch.vmap API
- Removes the documentation entry for torch.vmap
- Changes the vmap tests to use an internal API instead of torch.vmap.

Test Plan:
- Tested locally (test_torch, test_autograd, test_type_hints, test_vmap),
but also wait for CI.
2021-05-19 14:04:27 -07:00
5092364d78 [release/1.9] Pin builder and xla repos (#58514)
Pin builder to https://github.com/pytorch/builder/commits/release/1.9
Pin xla to https://github.com/pytorch/xla/tree/r1.9

Co-authored-by: driazati <driazati@users.noreply.github.com>
2021-05-18 18:52:06 -07:00
085a3bcb77 [release/1.9] Fix issues regarding binary_chekcout (#58495)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-05-18 10:36:16 -07:00
5f0bbb38ec ci: Release branch specific changes
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-05-17 17:30:59 -07:00
cce156ac94 .github: Make on_pull_request a conditional block (#58363)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58363

The previous implementation relied on us directly writing the yaml instead of
just having a conditional block; this gives us better readability for
pull request triggers.

Signed-off-by: Eli Uriegas <seemethere101@gmail.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28465271

Pulled By: seemethere

fbshipit-source-id: fd556bb6bac4954fcddb4a2b0383e996f292a794
2021-05-17 12:08:58 -07:00
c29e6d37e8 [Vulkan] Switch to Image2D for Convolution biases (#57201)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57201

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D28293602

Pulled By: SS-JIA

fbshipit-source-id: 7f9691ea9e8e2505616ee7cefc0a1f3fe4bf95e7
2021-05-17 12:06:52 -07:00
2879f0f780 [Vulkan] Use 2D tensor views when possible (#57198)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57198

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D28293561

Pulled By: SS-JIA

fbshipit-source-id: 77099cfe54870a19e926a067de9852787eb55d0b
2021-05-17 12:05:12 -07:00
95fd1e9045 reduce number of randperm template instantiations (#58362)
Summary:
Per title. Benchmarks in https://github.com/pytorch/pytorch/issues/54113 don't regress, the size of torch_cuda_cu_generated_Randperm.cu.o drops from 8562152 to 2585792 bytes for a single architecture, and compilation time decreases as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58362

Reviewed By: heitorschueroff

Differential Revision: D28477697

Pulled By: ngimel

fbshipit-source-id: 32dbe44ca6b3807668d548512d7484f8488834c4
2021-05-17 11:40:59 -07:00
a3b33139da [Pytorch] Add non mutator bundled inputs method (#58408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58408

It'd be nice to have a version of bundled inputs that didn't mutate the original class/object. So now there is!
ghstack-source-id: 129127316

Test Plan: The new unittests

Reviewed By: dhruvbird

Differential Revision: D28460231

fbshipit-source-id: f6f7a19e264bddfaa177304cbde40336060a237a
2021-05-17 11:36:49 -07:00
ae9b66dd94 Fix TP agent not recording outgoing tensors with caching allocator (#58384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58384

When the caller sends tensors within a request, it does so on fresh streams obtained from the caching allocator. However, it wasn't recording those tensors with the caching allocator. This carried the risk that, if those tensors were deleted before the async CUDA ops were done, the caching allocator could reuse the storage and thus overwrite the previous data while it was still being used.
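The mechanism in question, sketched with the public Python API (the agent does the equivalent in C++): `Tensor.record_stream` tells the caching allocator a tensor is in use on a side stream, so its block is not recycled until that stream's pending work finishes.

```
import torch

side = torch.cuda.Stream()
t = torch.ones(4, device='cuda')
with torch.cuda.stream(side):
    out = t * 2            # async use of t on a non-default stream
t.record_stream(side)      # without this, freeing t early could let the
                           # allocator reuse its memory mid-kernel
```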
ghstack-source-id: 129107582

Test Plan: eyes

Reviewed By: mrshenli

Differential Revision: D28473429

fbshipit-source-id: 3f2617048d984cec7a270858d282cecf1140ecf0
2021-05-17 10:57:44 -07:00
affed3b04d Prevent lock inversions with GIL in Future (#58391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58391

An additional (and hopefully more robust) way of fixing the same problem https://github.com/pytorch/pytorch/pull/58382 fixed.
ghstack-source-id: 129110325

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474154

fbshipit-source-id: 625ebe782e380c60b3ead4c4ed8a51d4bc917153
2021-05-17 10:54:26 -07:00
5a238eb96e Fix deadlock in Future due to lock inversion with GIL (#58382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58382

Calling markCompleted on a Future now first acquires the Future's mutex (as usual) but then sometimes tries to acquire the GIL during the DataPtr extraction while still holding the Future's mutex. (This happens when the value passed to markCompleted is a Python object). This can cause a deadlock if someone else calls any of the other methods of Future while holding the GIL.

There are two solutions to this: avoid holding the Future's mutex when extracting DataPtrs, and avoid holding the GIL while invoking the Future's method. In this PR I'm going for the latter, because it's a very simple immediate fix, but I believe this is brittle and that we should probably also consider the former fix.
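A generic Python illustration of the inversion (stand-in locks, not PyTorch code): one thread takes the Future's mutex and then wants the GIL, while another holds the GIL and calls a Future method that wants the mutex.

```
import threading

mutex, gil = threading.Lock(), threading.Lock()   # stand-ins

def mark_completed():        # takes mutex, then needs the "GIL"
    with mutex:
        with gil:
            pass

def other_future_method():   # holds the "GIL", then needs the mutex
    with gil:
        with mutex:
            pass

# Run concurrently, each can grab its first lock and block forever on the
# second -- the deadlock this PR avoids by releasing the GIL first.
```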
ghstack-source-id: 129105358

Test Plan: The repro in https://github.com/pytorch/pytorch/issues/58239 now doesn't deadlock.

Reviewed By: mrshenli

Differential Revision: D28472816

fbshipit-source-id: 1bc9bca426dd004f9eb2568db1ffd38f014450e2
2021-05-17 10:53:19 -07:00
eab59bae15 Fix cmake_minimum_require in libshm (#58306)
Summary:
Deprecation warning reported by cmake:

```
CMake Deprecation Warning at CMakeLists.txt (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of CMake.
  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.
```

This is the only place that requires bumping the minimum version. There are two others, but only in the `third_party` folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58306

Reviewed By: bdhirsh

Differential Revision: D28446097

Pulled By: zhouzhuojie

fbshipit-source-id: af5ef50e61bd57dc36089ebe62db70ba0081864c
2021-05-17 09:55:07 -07:00
bef0e07e09 Remove unused Dockerfile_runtime (#58333)
Summary:
Related to the effort to upgrade Ubuntu base images (https://github.com/pytorch/pytorch/issues/58309), this PR removes the unused tools/docker/Dockerfile_runtime.

It was introduced in https://github.com/pytorch/pytorch/issues/1619, https://github.com/pytorch/pytorch/pull/1732

- No code references in pytorch github org https://github.com/search?q=org%3Apytorch+Dockerfile_runtime&type=code
- Runtime images are available https://hub.docker.com/r/pytorch/pytorch/tags?page=1&ordering=last_updated&name=runtime (~2GB image size)

One less thing to maintain...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58333

Reviewed By: samestep

Differential Revision: D28457139

Pulled By: zhouzhuojie

fbshipit-source-id: 3c7034c52eb71463ac284dc48f0f9bbbf3af1312
2021-05-17 09:50:47 -07:00
4454b18e14 Revert D28371127: Wrap torch::deploy API functions in safe rethrow macros
Test Plan: revert-hammer

Differential Revision:
D28371127 (1ad06ba3f5)

Original commit changeset: c0ced2f19442

fbshipit-source-id: 1775bed182692b3246ff591e6a655264f3546315
2021-05-17 09:30:34 -07:00
432676599c Stop installing libuv on Windows (#51936)
Summary:
Fixes #{issue number}
gunandrose4u

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51936

Reviewed By: malfet

Differential Revision: D28467662

Pulled By: seemethere

fbshipit-source-id: 28d203ee3af13d6a3158f188c2e889e310ee6010
2021-05-17 08:52:29 -07:00
1ad06ba3f5 Wrap torch::deploy API functions in safe rethrow macros (#58192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58192

Exceptions thrown by deploy internals need to be sanitized
for application safety.

See commment in deploy.h for detailed explanation.

Test Plan: Added unit test

Reviewed By: suo

Differential Revision: D28371127

fbshipit-source-id: c0ced2f194424a394c5852bd4ab5cb41b0f4e87b
2021-05-17 08:02:33 -07:00
b1b9fb0147 Specify the exact commit when triggering multi-gpu pipeline (#58219)
Summary:
Previously only the **branch** was specified when triggering the multi-gpu pipeline, which could result in the incorrect commit being targeted: by the time the pipeline actually runs, there could be a newer commit on the specified branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58219

Reviewed By: malfet, bdhirsh

Differential Revision: D28446453

Pulled By: seemethere

fbshipit-source-id: 680c0b3a9f3f20b61787cc90fda73b87d66e6af8
2021-05-17 06:32:18 -07:00
ee93a348de ENH Raises nicer error when calling module.train with invalid modes (#58247)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46763
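A hedged usage sketch (the exact exception type and message may differ):

```
import torch

m = torch.nn.Linear(2, 2)
m.train(True)      # fine
m.train("train")   # now raises a clear error instead of silently
                   # accepting a truthy non-bool mode
```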

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58247

Reviewed By: ejguan

Differential Revision: D28418080

Pulled By: albanD

fbshipit-source-id: fef8f4f641ef75e801ed8b8d04c4016579aea8b0
2021-05-17 05:57:18 -07:00
9c7d5ed9b0 Clarifies cholesky_ex role and makes batched support a common string (#58217)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34272

Also updates and creates a common string for when the linear algebra operations support batching
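For reference, the role being clarified, in a hedged example against the 1.9 API:

```
import torch

A = torch.eye(3)
L, info = torch.linalg.cholesky_ex(A)
# Unlike torch.linalg.cholesky, failure is reported through 'info'
# (a nonzero leading-minor index) instead of raising, which is cheaper
# when checking many matrices in a loop.
print(info.item())   # 0: decomposition succeeded
```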

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58217

Reviewed By: ngimel

Differential Revision: D28405908

Pulled By: mruberry

fbshipit-source-id: a9d81a5a4712cfdedc22d614986d3707f10742a2
2021-05-17 05:23:06 -07:00
6060684609 Automated submodule update: tensorpipe (#57613)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 9ed4fb12a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57613

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D28220987

fbshipit-source-id: 4ecd2589d01f91678194d9e3ac309ad6f6df3e70
2021-05-17 01:35:48 -07:00
71f4c5c1f4 Fix "ci/master" workflow (#58335)
Summary:
Include the jobs that master-only jobs depend on in the workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58335

Reviewed By: walterddr

Differential Revision: D28458406

Pulled By: malfet

fbshipit-source-id: 217a8996daacd494af1bbc54e725bbcacc0c7784
2021-05-16 12:01:38 -07:00
1a91892f90 Added fix for missing ops aten::sorted.str (#58339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58339

The operator was present as part of the full-JIT ops but wasn't included for mobile. This diff copies it for mobile.

Test Plan: buck run xplat/langtech/mobile:giga5_bin -- --voice /data/users/prabhavag/experiments/embedded_new_stateful_conv_may6/nicole_batch.giga5 --frontend /data/users/prabhavag/experiments/tools_pkg/en_US.embedded.frontend.titan --icudata xplat/third-party/icu/stubdata/reduced/icudt55l.dat --text "haha"

Reviewed By: iseeyuan

Differential Revision: D28452179

fbshipit-source-id: ef7a929f1a6d40573438785a4959c1c1e39762f0
2021-05-15 19:26:35 -07:00
211bac53ef [JIT] Add optimize_for_inference API (#58193)
Summary:
Freezing exists as a pass which partially evaluates your model and applies generic optimizations which should speed it up. Optimize for inference is a counterpart to these optimizations which runs build- and server-specific optimizations; a usage sketch follows below. The interaction with the existing `optimize_frozen_module` is not great; I guess we could just deprecate that API entirely? It was never officially released but just existed to document the `optimize_numerics` keyword.

Eventually, I would like to add a way of providing example inputs, but I didn't add that here because they are not being used at all yet. I also have not yet included a way to blacklist individual optimizations, and would like to wait until we move this to Beta and have a little more clarity on how everything will fit together. I also think blacklisting will be an uncommon use case for the current optimizations.
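A minimal usage sketch of the API this PR adds:

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1)

scripted = torch.jit.script(M().eval())
# Freezes the module, then applies build- and server-specific optimizations.
optimized = torch.jit.optimize_for_inference(scripted)
print(optimized(torch.randn(3)))
```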

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58193

Reviewed By: bertmaher, navahgar

Differential Revision: D28443714

Pulled By: eellison

fbshipit-source-id: b032355bb2585720a6d2f00c89d0d9a7ef60e649
2021-05-15 15:50:14 -07:00
fad2ce439e [nnc] Link all available LLVM targets (#58312)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58312

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28449168

Pulled By: bertmaher

fbshipit-source-id: 4c72f8dbb28d860377dcd19f5934927e7347409a
2021-05-15 15:32:14 -07:00
4f50fdc2a3 fx quant: refactor observer insertion
Summary:
tl;dr; rewrites the FX graph mode quantization observer insertion to be easier to understand and extend.
The key conceptual difference from before is:
* before: for each node, observers are always inserted to the output of the current node, even if they are needed for the next node. This is hard to reason about.
* after: for each node, observers are inserted to the inputs (if needed, as calculated by the dtype of the argument and dtype of current node) and to the output (if needed for the type of pattern and qconfig).  There is no knowledge of future nodes needed to insert observers for the current node.

This allows us to significantly simplify various things:
* all new observers needed for a node are inserted together.  This makes it easier to understand and debug things.  We add an invariant that node X will never change any observers inserted by any preceding or subsequent node, so to debug an issue the user can just understand what is happening for node X, without having to understand what happens before or after it.
* all the state tracking of activation_post_process_map and activation_post_process_indices are removed, instead observers are looked up by graph traversals
* since there is no longer a need for overlapping graph passes which mutate each other's intermediate state, it is easier to understand what the rules are for inserting observers, and to create new rules in the future (see the conceptual sketch below)
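A conceptual sketch of the "after" rule (hypothetical helper names, not the actual pass):

```
def insert_observers_for_node(node, qconfig):
    # Every decision is local to this node: look at its inputs' dtypes and
    # its own pattern/qconfig, never at future nodes.
    for arg in node.args:
        if input_needs_observer(arg, node):       # hypothetical helper
            insert_observer_before(node, arg)     # hypothetical helper
    if output_needs_observer(node, qconfig):      # hypothetical helper
        insert_observer_after(node)               # hypothetical helper
```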

Test Plan:
```
# all OSS tests pass
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Differential Revision: D28241864

Reviewed By: jerryzh168

Pulled By: vkuzo

fbshipit-source-id: 950d58972d26362808564cc0a2dfb30413a3734d
2021-05-15 09:51:33 -07:00
2436377a7d Remote the list for the attributes that will be ignored for pickling (#58345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58345

1. Add a sanity check to make sure any new attribute added to the constructor is added to either `_REMOTE_MODULE_PICKLED_ATTRIBUTES` or `_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING`.
2. Update some comments and warnings -- now if a new attribute is added after construction, it will not be pickled. Previously this would trigger a runtime error, which is hard to unit test (one worker hits the runtime error, but the other worker then times out).
Context: https://github.com/pytorch/pytorch/pull/58019#discussion_r632322083
ghstack-source-id: 129070358

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28460744

fbshipit-source-id: 8028186fc447c88fbf2bf57f5c5d321f42ba54ed
2021-05-15 00:47:48 -07:00
9def776cd6 [fx_acc] e2e quantized resnet18 (#58204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58204

Pull Request resolved: https://github.com/pytorch/glow/pull/5657

E2E quantized ResNet18 via accelerated graph module. Accuracy matches

Test Plan: `buck test glow/fb/fx/fx_glow:test_fx_glow -- test_fx_glow_binding_quantized_resnet`

Reviewed By: khabinov

Differential Revision: D27717265

fbshipit-source-id: 6c6a40eb07f19c7b4d663a5dfb07e5d16fd05e03
2021-05-14 18:49:37 -07:00
bcacf91a71 [fx_glow]Add Support for importing quantized linear in FXIRImporter (#57483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57483

Pull Request resolved: https://github.com/pytorch/glow/pull/5622

Quantized linear has packed parameters. We want to unpack them so that it is easier for graph optimization and the importer to deal with the weight and bias. A customized remapping function is used to unpack quantized linear and map it to acc_op.linear.

Test Plan: `buck test glow/fb/fx/nnpi_importer:test_importer`

Reviewed By: gcatron, jfix71, khabinov

Differential Revision: D27451237

fbshipit-source-id: e46e961734788fd5333e227ca6143fd37c33204e
2021-05-14 18:48:31 -07:00
998374a702 [tsm] add support for jetter to Role (base_image) for mast launches (#58252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58252

Pull Request resolved: https://github.com/pytorch/elastic/pull/149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
2021-05-14 17:39:18 -07:00
b0819b0b73 [CircleCI] s/ubuntu-1604:202007-01/ubuntu-2004:202104-01/ (#58308)
Summary:
Switch to latest ubuntu supported by CircleCI, according to https://circleci.com/docs/2.0/configuration-reference/#machine

Also upgrade awscli from 1.x to 2.x, which requires replacing aws ecr get-login with awc ecr get-login-password, per https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration.html#cliv2-migration-ecr-get-login

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58308

Reviewed By: walterddr

Differential Revision: D28446659

Pulled By: malfet

fbshipit-source-id: 260f795dce83b3d191d8e3fa629c77c1b9fae36c
2021-05-14 17:28:35 -07:00
67583122f0 Use pip3 instead of pip when building ECR GC image (#58334)
Summary:
A followup to https://github.com/pytorch/pytorch/issues/58309, to fix the broken docker_for_ecr_gc_build_job:

- https://app.circleci.com/pipelines/github/pytorch/pytorch/322672/workflows/4877ddfe-eee1-4116-91ae-6ee9dd3a97ad/jobs/13486207
- https://app.circleci.com/pipelines/github/pytorch/pytorch/322710/workflows/8d33afb6-7b85-48c7-94fd-ac9176f4a16e/jobs/13488388
- https://app.circleci.com/pipelines/github/pytorch/pytorch/322759/workflows/b480989a-b39e-48f7-929d-66f1bdc50c89/jobs/13490919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58334

Test Plan:
Before this PR, this fails:
```
cd .circleci/ecr_gc_docker && docker build .
```
After this PR, it succeeds.

Reviewed By: zhouzhuojie

Differential Revision: D28457290

Pulled By: samestep

fbshipit-source-id: d8d8c8759412d9d7876d5908768ee5cb7261132d
2021-05-14 17:21:48 -07:00
00a46a5eb4 Fix incorrect inplace sort in topk (#58314) (#58318)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58314

https://github.com/pytorch/pytorch/issues/55392 introduced a bug by not allocating a separate value tensor for sorting
CC ngimel zasdfgbnm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58318

Reviewed By: mruberry

Differential Revision: D28450698

Pulled By: ngimel

fbshipit-source-id: dea1201ebfcbaab8536580b80f8321bda2468fc4
2021-05-14 17:15:24 -07:00
c4c2039fb2 Revert D27652484: [nnc] Enable CPU fusion inside Facebook
Test Plan: revert-hammer

Differential Revision:
D27652484 (ac04cc775b)

Original commit changeset: a82681455dae

fbshipit-source-id: ecfef3ee1e7197148b172234691e72faf4b95cf8
2021-05-14 16:41:23 -07:00
a4075fca9a extract dispatch keys from optional Tensors (unboxed) (#58296)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58296

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D28436822

Pulled By: bhosmer

fbshipit-source-id: 8031c9a3c121483dd0e5ed7b8b165952477108e4
2021-05-14 16:12:57 -07:00
3dc70d8f78 [iOS][Metal] Add target for testing metal ops (#57832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57832

- Added a shared BUCK target: `//fbobjc/Apps/Internal/PyTorchMetalOpTester:PyTorchMetalOpTester`
- Invoke this target from 3 Apps: pp-ios, pp-macos, PyTorchBenckmark
ghstack-source-id: 129037985

(Note: this ignores all push blocking failures!)

Test Plan:
## PyTorch Playground - macOS
`buck test pp-macos`
```
➜  fbsource buck test pp-macos
Building: finished in 0.3 sec (100%) 204/6264 jobs, 0 updated
  Total time: 0.5 sec
Testing: finished in 8.4 sec (6 PASS/0 FAIL)
RESULTS FOR //fbobjc/Apps/Internal/PyTorchPlaygroundMac:PyTorchPlaygroundMacTests
PASS     999ms  2 Passed   0 Skipped   0 Failed   ClassificationTests
PASS      6.4s  1 Passed   0 Skipped   0 Failed   MetalOpsTests
PASS     181ms  3 Passed   0 Skipped   0 Failed   PersonSegmentationTests
Updated test logs: buck-out/log/test.log
TESTS PASSED
```
## PyTorch Playground - iOS
- `arc focus2 pp-ios`
- Build and run via Xcode

### AI Bench
- `buck run mode/mac aibench:run_bench_macos -- -b aibench/specifications/models/pytorch/metal/metal_op_test.json --platform ios --framework pytorch --remote --devices D53pAP`
- Result: https://fburl.com/aibench/d2gtlndd

Reviewed By: xta0

Differential Revision: D28235867

fbshipit-source-id: dcee8aee140b5f665a97efe278ee621f436c7c68
2021-05-14 15:06:24 -07:00
84d8e3b0f6 [FX] Finished prepare_for_inference API for release (#58293)
Summary:
Added an ability to configure which passes to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58293

Reviewed By: bdhirsh

Differential Revision: D28435948

Pulled By: Chillee

fbshipit-source-id: dfc7f1ef6b38e6f49c2423a5efe8477a645171d0
2021-05-14 14:10:07 -07:00
00156d4845 [FX][WIP] Proxyable classes (#56737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56737

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D27953073

Pulled By: jamesr66a

fbshipit-source-id: fafc681af7bd5200a9ead2fd0720940913885575
2021-05-14 14:07:04 -07:00
d3fbb41c61 [PyTorch Edge] share tensors in mobile with new api (#58182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58182

As title, the v5 model format will be
```
(base) chenlai@chenlai-mp reuse_constant % zipinfo /Users/chenlai/Documents/pytorch/reuse_constant/tmp/zip/script_module_v5_unify.ptl
Archive:  /Users/chenlai/Documents/pytorch/reuse_constant/tmp/zip/script_module_v5_unify.ptl
Zip file size: 3120 bytes, number of entries: 7
-rw----     0.0 fat       77 bl stor 80-000-00 00:00 script_module_v4_unify/data.pkl
-rw----     0.0 fat      240 bl defN 80-000-00 00:00 script_module_v4_unify/code/__torch__/___torch_mangle_5.py
-rw----     0.0 fat      422 bl defN 80-000-00 00:00 script_module_v4_unify/code/__torch__/___torch_mangle_5.py.debug_pkl
-rw----     0.0 fat       64 bl stor 80-000-00 00:00 script_module_v4_unify/constants/140245072983168.storage
-rw----     0.0 fat      172 bl stor 80-000-00 00:00 script_module_v4_unify/constants.pkl
-rw----     0.0 fat      678 bl stor 80-000-00 00:00 script_module_v4_unify/bytecode.pkl
-rw----     0.0 fat        2 bl stor 80-000-00 00:00 script_module_v4_unify/version
7 files, 1655 bytes uncompressed, 1453 bytes compressed:  12.2%
```
bytecode.pkl is:
```
(5,
 ('__torch__.___torch_mangle_5.TestModule.forward',
  (('instructions',
    (('STOREN', 1, 2),
     ('DROPR', 1, 0),
     ('LOADC', 0, 0),
     ('LOADC', 1, 0),
     ('MOVE', 2, 0),
     ('OP', 0, 0),
     ('LOADC', 1, 0),
     ('OP', 1, 0),
     ('RET', 0, 0))),
   ('operators', (('aten::add', 'int'), ('aten::add', 'Scalar'))),
   ('constants',
    (torch._utils._rebuild_tensor_v2(pers.obj(('storage',
          torch.DoubleStorage,
          '140245072983168.storage',
          'cpu',
          8),),
       0,
       (2, 4),
       (4, 1),
       False,
       collections.OrderedDict()),
     1)),
   ('types', ()),
   ('register_size', 2)),
  (('arguments',
    ((('name', 'self'),
      ('type', '__torch__.___torch_mangle_5.TestModule'),
      ('default_value', None)),
     (('name', 'y'), ('type', 'int'), ('default_value', None)))),
   ('returns',
    ((('name', ''), ('type', 'Tensor'), ('default_value', None)),)))))
```

constants.pkl is:
```
(torch._utils._rebuild_tensor_v2(pers.obj(('storage', torch.DoubleStorage, '140245072983168.storage', 'cpu', 8),),
   0,
   (2, 4),
   (4, 1),
   False,
   collections.OrderedDict()),)
```

Both tensors will refer to the tensor in at the path `script_module_v4_unify/constants/140245072983168.storage`.

## Note
According to the unified format, all tensors should be written to the folder `.data`; however, torch.jit.load() can't handle the unified format at the moment, so this change writes tensors to the `constants` folder, and mobile will write/read tensors from the `constants` folder, such that the model can be interpreted by both JIT and mobile.
ghstack-source-id: 129010347

Test Plan: buck test mode/dev //caffe2/test/cpp/jit:jit

Reviewed By: raziel, iseeyuan

Differential Revision: D28375257

fbshipit-source-id: 6544472db4c957c5ea037e0bb5112b637dd15897
2021-05-14 14:03:56 -07:00
c034bce979 Back out "[ONNX] Process const folding progressively when converts to ONNX (#54569)"
Summary: Original commit changeset: 833dac7c71f2

Test Plan:
```
buck test mode/dev //pytext/fb/assistant/lite/test:test -- --exact
'pytext/fb/assistant/lite/test:test - test_export_bytes_model_to_caffe2
(pytext.fb.assistant.lite.test.test.TestExport)'
```

Reviewed By: jeanm

Differential Revision: D28431840

fbshipit-source-id: 0f1d530034404421a5d51691173e1cc0ee16fdd6
2021-05-14 13:45:49 -07:00
0a561f83ca [PyTorch Mobile]Fix unit test (#58202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58202

This unit test was testing the wrong target. It should test the sampler under jit::mobile. This diff fixes it.

Test Plan: run unit tests

Reviewed By: shreyanb98

Differential Revision: D28384839

fbshipit-source-id: 35cc63be2e73ca9b1a7d30d6f67fffcfe5021fa2
2021-05-14 13:43:22 -07:00
34ac28b5ff Bump Ubuntu/Python versions in ECR GC Docker image (#58309)
Summary:
Should fix the ECR GC jobs that broke as a result of https://github.com/pytorch/pytorch/issues/58275; examples:

- https://app.circleci.com/pipelines/github/pytorch/pytorch/322385/workflows/c26788cb-2147-4279-9813-224af3c01583/jobs/13480923
- https://app.circleci.com/pipelines/github/pytorch/pytorch/322385/workflows/c26788cb-2147-4279-9813-224af3c01583/jobs/13473074
- https://app.circleci.com/pipelines/github/pytorch/pytorch/322385/workflows/c26788cb-2147-4279-9813-224af3c01583/jobs/13473077
- https://app.circleci.com/pipelines/github/pytorch/pytorch/322385/workflows/c26788cb-2147-4279-9813-224af3c01583/jobs/13473073

See also https://github.com/pytorch/pytorch/issues/58308.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58309

Reviewed By: malfet, seemethere

Differential Revision: D28447014

Pulled By: samestep

fbshipit-source-id: db857154a94482f4da1db8d74c383527d1b14b49
2021-05-14 13:12:02 -07:00
623d63d9da [reland] Build and push Docker images in GitHub Actions (#58299)
Summary:
This is a reland of https://github.com/pytorch/pytorch/issues/58174.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58299

Reviewed By: malfet, seemethere, janeyx99

Differential Revision: D28445451

Pulled By: samestep

fbshipit-source-id: 2654118fe80f50bbdaaad9b7ee58dfd8ef25980d
2021-05-14 13:07:25 -07:00
73d51406fa [PyTorch Mobile]Move train related files to their own folder (#58205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58205

It's worth moving train-related files into their own folder since we are adding more code under the mobile directory.

This diff does that.

Test Plan: run unit tests and ci

Reviewed By: iseeyuan

Differential Revision: D28402432

fbshipit-source-id: cd76a1c4f8ff06508cdc3aad8a169fbf34bb4995
2021-05-14 12:54:44 -07:00
49a8942a77 Revert D25399466: add channels last support for AvgPool2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399466 (8ac0917cc7)

Original commit changeset: 9477b0c281c0

fbshipit-source-id: e0245f0e390f5eca228445fd82d6e5142a827abc
2021-05-14 12:45:29 -07:00
0caec739a3 Revert D25399468: optimize channels last for BatchNorm2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399468 (0be334a1ba)

Original commit changeset: a4cd7a09cd4e

fbshipit-source-id: cef74881adcdf193355fa5a77e816addd2e2c56e
2021-05-14 12:44:14 -07:00
94ef2b9b48 [Pytorch] Better doc strings for bundled inputs (#56591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56591

title
ghstack-source-id: 128926699

Test Plan: na

Reviewed By: dreiss

Differential Revision: D27912185

fbshipit-source-id: 1a8f267af21afb7b4393b9ec0792dd17c48e57cb
2021-05-14 11:13:23 -07:00
0be334a1ba optimize channels last for BatchNorm2d on CPU (#48919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48919

move data indexing utils

parallel inference contiguous path

parallel inference channels last path

add dim apply

optimize update stats

add channels last support for backward

Revert "add channels last support for backward"

This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.

Revert "optimize update stats"

This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.

Revert "add dim apply"

This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.

bug fix

add batchnorm nhwc test for cpu, including C=1 and HW=1

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399468

Pulled By: VitalyFedyunin

fbshipit-source-id: a4cd7a09cd4e1a8f5cdd79c7c32c696d0db386bd
2021-05-14 11:09:42 -07:00
0d11dbf511 [ONNX] Support index_add_ function. (#56867) (#57830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57830

This PR aims to support the tensor.index_add_() method in symbolic functions. We leverage scatter_add() to implement it, since ONNX doesn't have a corresponding operator (a sketch of the equivalence follows the notes).

Notes:

      1.  4 tests have been added for some scenarios.
      2.  If there are duplicated values in the 'index' parameter, the export will still execute successfully but the results will be wrong, so add a warning for every call to this symbolic function. And if we detect that the rank of 'index' is greater than the size of the 'dim' dimension, raise an exception to stop exporting an incorrect ONNX file.
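A sketch of the scatter_add equivalence the symbolic function relies on, in plain PyTorch for intuition (the exporter does this at the ONNX graph level):

```
import torch

x = torch.zeros(3, 4)
index = torch.tensor([0, 2])
src = torch.ones(2, 4)

via_index_add = x.clone().index_add_(0, index, src)
# Expand the index across the non-'dim' dimensions, then scatter-add:
expanded = index.unsqueeze(1).expand(-1, x.size(1))
via_scatter = x.clone().scatter_add(0, expanded, src)
print(torch.equal(via_index_add, via_scatter))   # True
```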

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393518

Pulled By: SplitInfinity

fbshipit-source-id: f487ca2c63fec47c6ab74f1a7783dae7f3b8d1ef

Co-authored-by: Jay Zhang <jiz@microsoft.com>
2021-05-14 09:51:55 -07:00
520f90f359 [ONNX] Handling incorrect format for example_outputs (#55802) (#57829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57829

Handling incorrect format for example_outputs, fixing exception behavior.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393521

Pulled By: SplitInfinity

fbshipit-source-id: aa518483f94e31194b951198aefa7c897932356e

Co-authored-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com>
Co-authored-by: Negin Raoof <neginmr@utexas.edu>
2021-05-14 09:50:58 -07:00
52bb8120b8 Mention distributed profiling in documentation (#58286)
Summary:
Added a simple section indicating that distributed profiling is expected to work similarly to other torch operators, and is supported for all communication backends out of the box.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58286

Reviewed By: bdhirsh

Differential Revision: D28436489

Pulled By: rohan-varma

fbshipit-source-id: ce1905a987c0ede8011e8086a2c30edc777b4a38
2021-05-14 09:43:00 -07:00
064923e635 Improve native_batch_norm_backward performance (CUDA) (#58240)
Summary:
Fixes  https://github.com/pytorch/pytorch/issues/38915

The original code uses a single kernel to do both the reduction and the elementwise backward calculations, whereas the `SyncBatchNorm` kernels are split, which makes them slightly slower in some cases. I try to use the fused kernel when it's beneficial, and otherwise choose the optimized channels-last split kernels. There is also eval mode, where the reduction is sometimes unnecessary, in which case split kernels are a win even without channels last.

Benchmarks on my system show significant speedups for channels-last reductions and eval mode, with only a few minor slowdowns in training mode. These slowdowns are for the 2 x 2048 shape in training, which is a small channels-last input. For larger batches or channel counts, the channels-last kernels are much faster.

|N   |C   |L   |training|backward|old   |new   |cudnn |
|----|----|----|--------|--------|------|------|------|
|1   |256 |3136|TRUE    |all     |70.25 |64.93 |68.90 |
|    |    |    |TRUE    |self    |69.77 |64.61 |69.42 |
|    |    |    |FALSE   |all     |70.10 |51.12 |x     |
|    |    |    |FALSE   |self    |70.17 |51.17 |x     |
|3136|256 |    |TRUE    |all     |554.08|76.63 |549.88|
|    |    |    |TRUE    |self    |553.34|78.19 |552.36|
|    |    |    |FALSE   |all     |565.40|55.09 |x     |
|    |    |    |FALSE   |self    |565.71|54.84 |x     |
|2   |8192|1   |TRUE    |all     |155.47|47.26 |202.26|
|    |    |    |TRUE    |self    |155.46|48.36 |203.72|
|    |    |    |FALSE   |all     |178.28|40.90 |x     |
|    |    |    |FALSE   |self    |178.21|40.69 |x     |
|2   |2048|1   |TRUE    |all     |43.50 |48.21 |57.47 |
|    |    |    |TRUE    |self    |43.63 |47.24 |55.22 |
|    |    |    |FALSE   |all     |49.36 |39.27 |x     |
|    |    |    |FALSE   |self    |49.25 |42.02 |x     |
|128 |8192|1   |TRUE    |all     |762.70|106.45|336.52|
|    |    |    |TRUE    |self    |765.79|107.04|337.32|
|    |    |    |FALSE   |all     |792.68|74.94 |x     |
|    |    |    |FALSE   |self    |793.86|74.83 |x     |
|128 |2048|1   |TRUE    |all     |188.37|46.20 |85.02 |
|    |    |    |TRUE    |self    |188.47|47.57 |85.04 |
|    |    |    |FALSE   |all     |191.57|40.44 |x     |
|    |    |    |FALSE   |self    |190.13|41.55 |x     |
|2   |8192|    |TRUE    |all     |156.03|43.01 |155.19|
|    |    |    |TRUE    |self    |156.24|46.59 |156.93|
|    |    |    |FALSE   |all     |179.34|40.06 |x     |
|    |    |    |FALSE   |self    |179.20|41.85 |x     |
|2   |2048|    |TRUE    |all     |44.05 |50.15 |44.21 |
|    |    |    |TRUE    |self    |44.10 |48.97 |44.11 |
|    |    |    |FALSE   |all     |49.72 |40.95 |x     |
|    |    |    |FALSE   |self    |49.87 |43.43 |x     |
|128 |8192|    |TRUE    |all     |775.19|96.60 |777.64|
|    |    |    |TRUE    |self    |776.20|96.85 |774.21|
|    |    |    |FALSE   |all     |797.64|68.01 |x     |
|    |    |    |FALSE   |self    |806.25|68.05 |x     |
|128 |2048|    |TRUE    |all     |188.49|48.10 |188.97|
|    |    |    |TRUE    |self    |188.07|46.97 |187.98|
|    |    |    |FALSE   |all     |192.32|43.78 |x     |
|    |    |    |FALSE   |self    |193.72|40.82 |x     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58240

Reviewed By: bdhirsh

Differential Revision: D28435158

Pulled By: ngimel

fbshipit-source-id: bf62a1ee1c5d95a2caf55bee6176ae5c965688ec
2021-05-14 09:29:05 -07:00
c711c30c74 Revert "Revert D28387764: Codegen inplace forward AD formula from out of place one if needed" (#58231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58231

This reverts commit 066e7699eb8c375a441e6de168da3ba7a73c3f27.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412495

Pulled By: albanD

fbshipit-source-id: 97dd4580baac903805ab66ad55fe9570dec993ee
2021-05-14 08:35:38 -07:00
6e1718277c Make GHA test-reports upload regex more permissive (#58250)
Summary:
Currently, our test stats [uploaded to S3](fee7e8b91d/&showversions=false) by GitHub Actions are missing the reports from `test/custom_backend/test_custom_backend.py` and `test/custom_operator/test_custom_ops.py`. From [this log](https://github.com/pytorch/pytorch/runs/2573747177), we know that those tests are indeed being run, but the artifact on that workflow run shows that the XML files are currently not being uploaded for use in the render-test-results job. This PR makes the regex for that artifact upload more permissive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58250

Test Plan:
For context, before this PR, the test-reports artifact of Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) before this PR looks like this:

- `test-reports`
  - `cpp-rpc`
    - ...
  - `cpp-unittest`
    - ...
  - `dist-gloo`
    - ...
  - `python-unittest`
    - ...

Wait for Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) to run on this PR, then download and unzip the test-reports artifact and check that its directory structure looks like this:

- `custom_backend`
  - `test-reports`
    - `python-unittest`
      - ...
- `custom_operator`
  - `test-reports`
    - `python-unittest`
      - ...
- `test-reports`
  - `cpp-rpc`
    - ...
  - `cpp-unittest`
    - ...
  - `dist-gloo`
    - ...
  - `python-unittest`
    - ...

Also, [this run](https://github.com/pytorch/pytorch/runs/2579875947) shows the following line of output, which is exactly what we would expect to see if this PR correctly adds the 9 tests across `custom_backend` and `custom_operator`:

> ```
> Added    (across    2 suites)     9 tests, totaling +   0.10s
> ```

Reviewed By: walterddr

Differential Revision: D28442396

Pulled By: samestep

fbshipit-source-id: 893a397a8e701e4180e1812d6f83352b5920ced6
2021-05-14 08:28:51 -07:00
4bcaa5ae20 Revert D28412496: Revert "Revert D28387767: Add forward AD test for op info"
Test Plan: revert-hammer

Differential Revision:
D28412496 (4f28c0b590)

Original commit changeset: 5b8e30b5e807

fbshipit-source-id: 5a47aad4d5428e97e2d2b4acb4192909360870cd
2021-05-14 08:26:03 -07:00
2e26976ad3 Disallow versionless Python shebangs (#58275)
Summary:
Some machines don't have a versionless `python` on their PATH, which breaks these existing shebangs.

I'm assuming that all the existing versionless `python` shebangs are meant to be `python3` and not `python2`; please let me know if my assumption was incorrect for any of these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58275

Test Plan: CI.

Reviewed By: zhouzhuojie

Differential Revision: D28428143

Pulled By: samestep

fbshipit-source-id: 6562be3d12924db72a92a0207b060ef740f61ebf
2021-05-14 08:26:02 -07:00
e6adc06221 Revert D28425179: Build and push Docker images in GitHub Actions
Test Plan: revert-hammer

Differential Revision:
D28425179 (2ead01f796)

Original commit changeset: acea02d300c2

fbshipit-source-id: 01bf91a58238a9cd47e6b503c197ac346e8bbabe
2021-05-14 08:24:49 -07:00
76d2cb3b8e [torch.package/TorchScript] flag to gate allowance of TS serialization in torch.package (#57678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57678

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28232891

Pulled By: Lilyjjo

fbshipit-source-id: f6b2f4557cb98c4e811b7e3b665e0ffe88115555
2021-05-14 08:21:46 -07:00
e27f861db7 [torch.Package/TorchScript] TS into torch.Package test cases (#54894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54894

Test cases to test torch.Package's handling of TorchScript objects.

TODO: test mapping storages to different device

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27832544

Pulled By: Lilyjjo

fbshipit-source-id: 6a67938a428b57520fead698da1412623ece9dbd
2021-05-14 08:21:44 -07:00
9403fe17ce [torch.package/TorchScript] logic to enable sharing of tensors on load (#57573)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57573

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28226975

Pulled By: Lilyjjo

fbshipit-source-id: bc8cb3e8052fa18336c437e0601d8b0028fd1895
2021-05-14 08:21:43 -07:00
307375a88e [torch.Package/TorchScript] torch.Package python logic to save TorchScript (#54893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54893

Adds logic to torch.Package's `PackageExporter` and `PackageImporter` to handle TorchScript objects. Also adds the necessary `__reduce_package__` methods to `ScriptModule` and `RecursiveScriptModule` to enable this.

API:
```
# create scripted objects
scripted_mod = torch.jit.script(Mod1("initial_1"))
scripted_mod2 = torch.jit.script(Mod2("initial_2"))

# save objects into package
with PackageExporter(filename, verbose=False) as e:
    e.save_pickle("res", "mod.pkl", scripted_mod)
    e.save_pickle("res", "mod2.pkl", scripted_mod2)

# load scripted objects from package
importer = PackageImporter(filename)
scripted_mod_loaded = importer.load_pickle("res", "mod.pkl")
scripted_mod2_loaded = importer.load_pickle("res", "mod2.pkl")
```

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27832547

Pulled By: Lilyjjo

fbshipit-source-id: 73bf254c311fee2a2b21a9a7861d6cdc53709bd1
2021-05-14 08:21:41 -07:00
3ad11803f7 [torch.Package/TorchScript] ScriptModuleSerializer add unified format (#56299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56299

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27832545

Pulled By: Lilyjjo

fbshipit-source-id: 1b2880a8458f99bd66a8c9656c5ca700f43cffe8
2021-05-14 08:21:40 -07:00
8ab3aa464a [torch.Package/TorchScript] refactor ScriptModuleSerializer Exporter (#55958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55958

This PR refactors the existing ScriptModuleSerializer to be exposed to the public. Most of the code is the same; git just thinks it's different because it was shifted over by one level of whitespace. I commented on the actual changes that weren't due to the whitespace shift.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27832546

Pulled By: Lilyjjo

fbshipit-source-id: c73e33211e46fca56053aa45ea2b9a2803eab82c
2021-05-14 08:21:38 -07:00
07de11c26d [torch.Package/TorchScript] TS serialization importer to handle unified format (#54891)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54891

Changed TorchScript's jit/serialization importer logic to handle both the original TS serialization format and the new unified TS format

Original TS file format:
```
resnet.pt
├── data  # tensor data
│   ├── 94286146172688
│   ├── 94286146172784
│   └── ...
├── code/  # TorchScript code
│   ├── __torch__
│   │   ├── torch
│   │   │   └── nn ...
│   │   └── torchvision ...
│   ├── __torch__.py
│   └── __torch__.py.debug_pkl
├── data.pkl  # the ScriptModule object, pickled by the TS pickler
├── version  # version metadata
├── constants.pkl  # any tensor constants present in the TS code
└── extra
     ├── name_of_file
     └── foo
```

Unified file format:
```
─── package_name.pt
    ├── .data
    │   ├── ts_code # code shared between models
    │   │   ├── 0
    │   │   │   ├── constants.pkl
    │   │   │   └── data.pkl
    │   │   ├── 1
    │   │   │   ├── constants.pkl
    │   │   │   └── data.pkl
    │   │   └── code
    │   │       ├── __torch__
    │   │       │   ├── torch
    │   │       │   │   └── nn ...
    │   │       │   └── torchvision ...
    │   │       ├── __torch__.py
    │   │       └── __torch__.py.debug_pkl
    │   ├── 0.storage
    │   ├── 1.storage
    │   ├── <many more storages>
    │   ├── 201.storage
    │   ├── extern_modules
    │   └── version
    └── res
        ├── mod.pkl  # maps to ts_id 0 and .data/ts_code/0
        └── mod2.pkl # maps to ts_id 1 and .data/ts_code/1
```

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27832548

Pulled By: Lilyjjo

fbshipit-source-id: 4a6e84c3a9bac8eed6a4e4afc2ac76dd691858b0
2021-05-14 08:20:34 -07:00
2ead01f796 Build and push Docker images in GitHub Actions (#58174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58174

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28425179

Pulled By: samestep

fbshipit-source-id: acea02d300c2547ced55e0e5586e95a6b5e1876d
2021-05-14 08:13:21 -07:00
1f5ed1ff69 [metal] Fix binary elementwise ops to handle inputs with mismatched dim() (#58262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58262

When broadcasting, it can be fine for input tensors to have a different number of dims. Fix the checks in arithmetic ops to accept these cases.
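
For reference, the standard broadcasting behavior these checks are being aligned with (plain PyTorch, not Metal-specific):

```python
import torch

a = torch.randn(2, 3, 4)
b = torch.randn(4)        # fewer dims than `a`; aligned from the trailing dim
print((a + b).shape)      # torch.Size([2, 3, 4])
```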

Test Plan:
Test on device:
```
arc focus2 pp-ios
```
Test on mac
```
buck test pp-macos
```

Reviewed By: xta0

Differential Revision: D27093367

fbshipit-source-id: 797eeffa1864291cb0e40277372842dca145c9c0
2021-05-14 07:53:39 -07:00
72a90c3ea5 [metal] Add reflection_pad2d for metal (#58263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58263

Add the `reflection_pad2d` op in preparation for newer xirp models.

Test Plan:
Test on device:
```
arc focus2 pp-ios
```
Test on mac
```
buck test pp-macos
```

Reviewed By: xta0

Differential Revision: D27047892

fbshipit-source-id: 815856e19e4885c352f5d7174866480db7641cdf
2021-05-14 07:52:30 -07:00
4f28c0b590 Revert "Revert D28387767: Add forward AD test for op info" (#58230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58230

This reverts commit f88297c66bd36d075e9d50eb09a81bea74a669c6.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412496

Pulled By: albanD

fbshipit-source-id: 5b8e30b5e80771dedf999c3aaa9791fc9026f8c1
2021-05-14 06:55:44 -07:00
ccd7141919 Modify DispatchKeyExtractor to also work for optional Tensors (#58283)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58283

Reviewed By: bhosmer

Differential Revision: D28436443

Pulled By: Chillee

fbshipit-source-id: ba6aae74e8ec3c5a6157cc4517c29b36bdd4a30d
2021-05-14 03:09:04 -07:00
88ff651e90 torch.jit.ignore as a context manager (#55172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55172

Description:
This is part 1 of a series of PRs for supporting torch.jit.ignore as a context manager. The following features are implemented in this PR:

- Unique name for the registered function under the torch.jit.frontend module. The unique name is generated based on the file name and line number of the context manager
- Forcing the user to explicitly annotate the inputs and outputs.
- No side effects are considered.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27895283

Pulled By: tugsbayasgalan

fbshipit-source-id: 5d36d9aa5d457055a6bb1676f264647a745ec36a
2021-05-14 01:53:50 -07:00
cf1daf571d Port silu to structured (#58050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58050

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28382790

Pulled By: ezyang

fbshipit-source-id: 5aeedfe39b5f15d14022d1e9edec1b30e98e5076
2021-05-14 00:49:10 -07:00
f23e10f27b Port softshrink to structured (#57623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57623

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224703

Pulled By: ezyang

fbshipit-source-id: 62e40d53eb130205f6c4d2775082e436e6adadce
2021-05-14 00:49:09 -07:00
d65dff463a Port hardsigmoid to structured (#57622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57622

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224704

Pulled By: ezyang

fbshipit-source-id: 3feea6d87f2de4da5ae1f973a53ee136957ec807
2021-05-14 00:49:07 -07:00
401d0fe8c5 Port leaky_relu to structured (#57621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57621

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224706

Pulled By: ezyang

fbshipit-source-id: 168b175d0fd9e0cc3335ea00df4c7967fea77819
2021-05-14 00:49:05 -07:00
9dba26eed1 Port softplus to structured (#57620)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57620

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224705

Pulled By: ezyang

fbshipit-source-id: a48419f5958e4d29427fb1fec7ff929f0297e4e4
2021-05-14 00:49:04 -07:00
03398b7edb Port elu to structured (#57619)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57619

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224707

Pulled By: ezyang

fbshipit-source-id: 9e1cad3f5536c65ada2d951366de134ebcb6bb3f
2021-05-14 00:47:41 -07:00
ac04cc775b [nnc] Enable CPU fusion inside Facebook (#58029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58029

We've been testing this for months, it's time.
ghstack-source-id: 128932738

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D27652484

fbshipit-source-id: a82681455dae0db19c8ac9918065b6e186c9e71a
2021-05-14 00:10:10 -07:00
6b8b591a84 [nnc] Fix output restriding of size-1 dimensions (#58256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58256

Size-1 dims mess up our output restriding logic, because they're
technically "dense" no matter what stride the dimension has.  In this example a
size-1 dim has stride 1, which causes all the indices to be taken mod 1 (i.e.,
all indices become 0).  We work around this peculiar case by skipping size-1 in
our layout logic, since it has no impact on the rest of the tensor's indexing.
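
A small eager-mode illustration of why a size-1 dim is "dense" no matter its stride (a sketch, independent of the NNC code):

```python
import torch

a = torch.randn(1, 4)                        # natural strides (4, 1)
b = a.as_strided((1, 4), (1, 1))             # arbitrary stride on the size-1 dim
print(torch.equal(a, b))                     # True -- same elements addressed
print(a.is_contiguous(), b.is_contiguous())  # True True -- both count as dense
```
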
ghstack-source-id: 128932739

Test Plan:
new unit test, plus
```
buck test mode/dev //langtech/mobile/audio_stream_processor:audio_stream_processor_test -- --exact 'langtech/mobile/audio_stream_processor:audio_stream_processor_test - AudioStreamProcessorTest.DemucsReadWriteFloat'
```

Reviewed By: eellison

Differential Revision: D28424388

fbshipit-source-id: e33e39eef2a5bf2797bee78a5987558308b6d110
2021-05-14 00:09:12 -07:00
cb7c6a536b [doc] update distributed optimizer doc (#58084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58084

Update the doc for the distributed optimizer with TorchScript support.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D28363971

Pulled By: wanchaol

fbshipit-source-id: df9d2acc1bbb2292d683d2231e1349b8d3946c8f
2021-05-13 23:37:00 -07:00
a8122062c0 [PyTorch Mobile]Add light version of RandomSampler (#58201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58201

Add a light version of RandomSampler which can be used in torch mobile.

Test Plan: run unit test

Reviewed By: iseeyuan

Differential Revision: D28364467

fbshipit-source-id: 3148129fa56533f5f4b76b63b60e8778eeaf815f
2021-05-13 22:53:21 -07:00
38e606d056 [RFC] Add method torch.jit._clone_module_with_class (#56152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56152

Currently, the Bundled Inputs API mutates the module in-place. It adds class methods and not instance methods. This results in a small problem that one can't re-run an already executed cell in Bento if the class has already been subject to bundled inputs.

In addition, there is no way to add bundled inputs to a module that already has bundled inputs. This API provides a way to solve that problem as well, by adding an `ignored_methods` argument to the call to `clone()`: the bundled-inputs implementation passes in the methods it will add as `ignored_methods`, so that when it does try to add those methods, it will be able to do so successfully.

We'll have to be careful when ignoring those methods during the call to `torch.jit._clone_module_with_class` since any bundled input that relies on a user-provided method will need to be preserved and not ignored during the clone.

Looking for feedback on whether this is an acceptable direction.
ghstack-source-id: 128908360

Test Plan:
Added unit test and ran it as `buck test //caffe2/test:mobile`

Also see this Bento Notebook: https://www.internalfb.com/intern/anp/view/?id=550829

Reviewed By: gmagogsfm

Differential Revision: D27788394

fbshipit-source-id: 48109cd4583506d4efdb345e4ba31385db23a273
2021-05-13 22:31:05 -07:00
452569dffb cfloat and cdouble functions (#58137)
Summary:
This adds the methods `Tensor.cfloat()` and `Tensor.cdouble()`.

I was not able to find the tests for `.float()` functions. I'd be happy to add similar tests for these functions  once someone points me to them.

Fixes https://github.com/pytorch/pytorch/issues/56014
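
A quick usage example of the new methods:

```python
import torch

x = torch.randn(3)
print(x.cfloat().dtype)   # torch.complex64
print(x.cdouble().dtype)  # torch.complex128
```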

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58137

Reviewed By: ejguan

Differential Revision: D28412288

Pulled By: anjali411

fbshipit-source-id: ff3653cb3516bcb3d26a97b9ec3d314f1f42f83d
2021-05-13 21:13:37 -07:00
f6408c0dc1 [ATen][quant] Use expect_contiguous in quantized::linear fbgemm version (#58221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58221

- Use expect_contiguous to avoid Tensor refcount bumps if input tensor is already contiguous
- Use Tensor::sizes()[i] in place of Tensor::size(i) which goes through the dispatcher
- Use at::Dimvector in place of std::vector to avoid heap allocation

Since the qnnpack version needs on-device testing, I'll skip that one for now.

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D28406942

fbshipit-source-id: 3c1bdfd1c859fe71869d4daec22158be5c2719d4
2021-05-13 20:51:47 -07:00
31607ad41d [nnc] Started codegenning some external calls (#58118)
Summary:
Currently only supports native ops that have all tensor arguments, an out variant, and no kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58118

Reviewed By: ejguan

Differential Revision: D28421323

Pulled By: Chillee

fbshipit-source-id: 1c75c900415deca63fcc0e496e3bac126f21bf49
2021-05-13 19:56:50 -07:00
04970057d8 Code-dedup in PowKernel (#57873)
Summary:
Both the CPU and CUDA versions of PowKernel reimplement functionality that
already exists in UnaryOps, such as sqrt, rsqrt and reciprocal.

Found this out while looking at the sluggish compilation of PowKernel.cu:
 - Before the change, compilation took 11m5s and produced a 7.6Mb .o file
 - After the change, compilation finished in 10m20s and produced a 6.4Mb .o file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57873

Reviewed By: ezyang

Differential Revision: D28304929

Pulled By: malfet

fbshipit-source-id: ac499476280de55a92044b1b041b1246eea74c64
2021-05-13 19:52:34 -07:00
64d23cc040 Revert D28379394: Update internal code for torch.linalg.solve
Test Plan: revert-hammer

Differential Revision:
D28379394 (b0833533a7)

Original commit changeset: b47f66bc1ee1

fbshipit-source-id: c81b34f45a1d82a2b1cecc8987048fa1055203d6
2021-05-13 19:49:41 -07:00
3072c97017 Gelu Backward, Contribution from Kevin Stephano (#58249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58249

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28425629

Pulled By: Krovatkin

fbshipit-source-id: 494ab165d548aa76f036344ab1c19c5fd64bae82
2021-05-13 19:39:39 -07:00
f3ead05d77 hardtanh (#57750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57750

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28425975

fbshipit-source-id: a5e3dfbd6c77c595528c052e0b4325ef452983eb
2021-05-13 19:39:37 -07:00
c524448dd1 init hardshrink (#57749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57749

add to a fx test

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28425974

fbshipit-source-id: 195c7a1944decb7a2a99c2831cab38485f32be17
2021-05-13 19:38:05 -07:00
047ae6b713 [profiler][small] CUDA synchronize guard, minor fix (#58254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58254

Don't use CUDA synchronize when profiling in CPU-only mode.
Minor fixes (a clarification for a docstring, a fix for spammy logging).

(Note: this ignores all push blocking failures!)

Test Plan: manual + CI

Reviewed By: gdankel, chaekit

Differential Revision: D28423667

Pulled By: ilia-cher

fbshipit-source-id: 04c71727f528ae8e2e0ff90e88271608d291bc69
2021-05-13 19:21:56 -07:00
8ac0917cc7 add channels last support for AvgPool2d on CPU (#48918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48918

enable test case on AvgPool2d channels last for CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399466

Pulled By: VitalyFedyunin

fbshipit-source-id: 9477b0c281c0de5ed981a97e2dcbe6072d7f0aef
2021-05-13 18:05:57 -07:00
fd3d3ef900 [RPC Framework] Add _script_module_reducer unconditionally for RecursiveScriptModule in RPC pickler (#58020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58020

Previously there is no RPC pickler for `RecursiveScriptModule`. Although it is a subclass of `ScriptModule`, the reducer of `ScriptModule` is not triggered for `RecursiveScriptModule` when a script remote module is sent over RPC.

This PR checkpoints the investigation of #58274, which makes sure that a RPC pickler is invoked here. This still cannot fix `test_send_remote_module_over_the_wire_script`. Will revisit this bug once there is a feature request from users.

ghstack-source-id: 128949642

Test Plan:
TODO: re-enable these tests

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script

Reviewed By: rohan-varma

Differential Revision: D28346758

fbshipit-source-id: 3cff84ca665da03da6ed6acb094a1f594fcd945e
2021-05-13 17:51:25 -07:00
993a35a8cb [Static Runtime] Support clamp.Tensor (#58191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58191

There are two clamp overloads: clamp.Scalar and clamp.Tensor. SR needs to support both, or have checks in place to avoid runtime errors. Supporting both is not too hard, so here we are.
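
For reference, the two overloads as seen from Python (a sketch, assuming a build that includes the clamp.Tensor overload):

```python
import torch

x = torch.randn(4)
torch.clamp(x, min=0.0)                  # dispatches to clamp.Scalar
torch.clamp(x, min=torch.tensor(0.0))    # dispatches to clamp.Tensor
```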

Reviewed By: edvgha

Differential Revision: D28371949

fbshipit-source-id: 0ec6b8a0b8c6277e50d8e51e4e7a45aa62211e22
2021-05-13 17:46:59 -07:00
1f3807ce5d More stable and faster implementation of the gradient of torch.linalg.eigh (#55049)
Summary:
This PR:
- Renames symeig_backward to eigh_backward
- Improves the stability and speed of the gradient computation by doing `V(A + B)Vh` instead of `VAVh + VBVh`  when both the gradients of the eigenvectors and eigenvalues are defined.
- Updates the comments of the function to make them arguably clearer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55049

Reviewed By: ngimel

Differential Revision: D28396823

Pulled By: mruberry

fbshipit-source-id: a144482bfb1054e281b58ae1fe3cf1015bab505d
2021-05-13 17:17:35 -07:00
b0833533a7 Update internal code for torch.linalg.solve (#56613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56613

Replace linalg_solve_helper with `lu_stub` + `lu_solve_stub`.
Once `lu_stub` and `lu_solve_stub` have a cuSOLVER-based codepath,
`torch.linalg.solve` will have it as well.
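
At the Python level, the factor-then-solve decomposition this routes through looks roughly like the following sketch (the stubs themselves are C++ internals):

```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)
b = torch.randn(3, 1, dtype=torch.float64)

LU, pivots = torch.lu(A)            # analogous to the lu_stub step
x = torch.lu_solve(b, LU, pivots)   # analogous to the lu_solve_stub step
print(torch.allclose(A @ x, b))     # True
```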

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28379394

Pulled By: mruberry

fbshipit-source-id: b47f66bc1ee12715da11dcffc92e31e67fa8c8f6
2021-05-13 16:57:29 -07:00
d304bb070a Gelu Backward, Contribution from Kevin Stephano (#58249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58249

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28425381

Pulled By: Krovatkin

fbshipit-source-id: 21b7ac972220b6c35b285e3b66f05eb392002408
2021-05-13 16:36:44 -07:00
3a898c26c0 Print stderrs in tools/mypy_wrapper.py (#58265)
Summary:
Uncovered while investigating https://github.com/pytorch/pytorch/issues/58253.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58265

Test Plan:
Before this PR:

```
$ mypy tools/stats_utils/foo.txt
mypy: can't read file 'tools/stats_utils/foo.txt': No such file or directory
$ echo $?
2
$ tools/mypy_wrapper.py $PWD/tools/stats_utils/foo.txt
$ echo $?
2
```

After this PR:

```
$ mypy tools/stats_utils/foo.txt
mypy: can't read file 'tools/stats_utils/foo.txt': No such file or directory
$ echo $?
2
$ tools/mypy_wrapper.py $PWD/tools/stats_utils/foo.txt > /dev/null
mypy: can't read file 'tools/stats_utils/foo.txt': No such file or directory
mypy: can't read file 'tools/stats_utils/foo.txt': No such file or directory
$ echo $?
2
```

Reviewed By: zhouzhuojie

Differential Revision: D28426439

Pulled By: samestep

fbshipit-source-id: c47a85a696ed44c9873416decc9fed8d99bc556c
2021-05-13 16:25:42 -07:00
7756cb6a5b Migrate pytorch_python_doc_build to github action (#57371)
Summary:
# Changes

This PR migrates `pytorch_python_doc_build` from circleci to github actions.

Noticeable changes
- Refactor `docker cp` into a single `docker run` with a volume mount, because in CircleCI the volume is not accessible from its remote Docker engine
- The `pytorch_python_doc_push` job would have a race condition with CircleCI; it will be migrated in separate PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57371

Reviewed By: samestep

Differential Revision: D28416289

Pulled By: zhouzhuojie

fbshipit-source-id: 04caccccf3d7eb7e2225846a406a53ccda356d44
2021-05-13 15:42:52 -07:00
3f9126f399 Only quicklint files that exist (#58261)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58253.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58261

Test Plan: The repro steps for https://github.com/pytorch/pytorch/issues/58253.

Reviewed By: janeyx99

Differential Revision: D28425900

Pulled By: samestep

fbshipit-source-id: b4abe910bd9ba5a34ec5a413d4df21b85f96a89f
2021-05-13 15:16:07 -07:00
f6532468c8 Make norm and vector_norm use the same kernels. (#58214)
Summary:
Fixes a few problems with `torch.norm` (incorrect behavior for empty inputs and negative p, https://github.com/pytorch/pytorch/issues/52783, and incorrect imaginary part for complex).
Most importantly, makes linalg_norm and vector_norm use the same kernels, reducing compile time and binary size.
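
A minimal check that the two entry points agree on vectors (a sketch):

```python
import torch

x = torch.randn(5)
print(torch.linalg.vector_norm(x, ord=2))  # vector 2-norm
print(torch.linalg.norm(x))                # same result; per this PR, backed by the same kernels
```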

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58214

Reviewed By: ejguan

Differential Revision: D28422439

Pulled By: ngimel

fbshipit-source-id: afe088a866963068e8c85eb9c3b2218a21ff2d48
2021-05-13 15:06:37 -07:00
26aeec35a1 Disable more of quicklint test (#58257)
Summary:
Essentially a followup to https://github.com/pytorch/pytorch/issues/57968. For now, this test is just too flaky to run on every PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58257

Test Plan: The repro steps in https://github.com/pytorch/pytorch/issues/58253.

Reviewed By: walterddr

Differential Revision: D28424862

Pulled By: samestep

fbshipit-source-id: 00aed872fe505db67e48414b1234505a71099262
2021-05-13 14:45:45 -07:00
d833caaf6b [PyTorch Mobile][Forward/backward compatibility] Number of arguments for operators (#56845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56845

Handle forward/backward compatibility issues caused by added default arguments in mobile. As an example,

In older version, operator aten::foo's schema is
```
foo(Tensor a, Tensor b) -> Tensor
```
In the new version, the schema is updated to
```
foo(Tensor a, Tensor b, int groups=1) -> Tensor
```

## Model file
Serialize the number of specified arguments to each operator into the bytecode operator table. Previously, the operator table contained only the operator name and overload name:
```
('operators', (('aten::foo', ''),))
```
Now the number of specified arguments is added:
```
# bytecode version 6
('operators', (('aten::foo', '', 2),))
```
where "2" means the number of specified arguments.

Since there's bytecode schema change, the bytecode version number is bumped. This PR is to be landed after #56002 , where the version number is bumped from 4 to 5. This PR bumps the version number from 5 to 6.

## Runtime and backward compatibility
When the operator is found (either jit or c10), we have the OperatorHandle, where the operator schema can be accessed by
```
op.value().schema().arguments()
```
Adaptation is implemented to handle backward compatibility. For the example above, the new runtime holds the updated schema:
```
foo(Tensor a, Tensor b, int groups=1) -> Tensor
```
Whereas the model file carries
```
(('aten::foo', ''), 2)
```
We can implement a wrapper around the original function pointer to push the default argument to the stack.
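
A minimal Python sketch of the adaptation idea (the helper below is hypothetical; the real implementation wraps the C++ function pointer and pushes IValues onto the interpreter stack):

```python
def adapt(fn, all_defaults, num_specified):
    """Hypothetical sketch: append defaults for trailing arguments the old
    model omitted. `all_defaults` has one entry per schema argument (None
    for required args); `num_specified` comes from the bytecode operator
    table, e.g. 2 for ('aten::foo', '', 2)."""
    trailing = all_defaults[num_specified:]
    def wrapped(*args):
        return fn(*args, *trailing)  # e.g. foo(a, b) -> foo(a, b, 1)
    return wrapped
```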

## Delivery time and forward compatibility
At model delivery time, two checks can be done:
### Operator check
Two APIs to be provided:
* Runtime: An API to get a runtime’s ops and their schemas (i.e. the # of args). D27920185(WIP)
* Model: An API to get a model’s ops and their schema requirements (i.e. the # of args required).

The APIs can be used to check
* runtime.ops() is a superset of model.ops()
* for each op in model.ops() validate their schemas are compatible with those in runtime.ops() -- i.e. the # args required in a model op are <= # args in the runtime op.

Note that only root ops in the model need to be checked here. For transient ops it's not necessary. For example, if a root op "aten::root" calls "aten::foo", it's "aten::root"'s responsibility to adapt to "aten::foo"'s change, or "aten::root" itself needs to be updated too.
### Bytecode version backport (PR coming)
When delivering a model with bytecode v6, if the runtime only works with bytecode v5 and lower, backport is needed.
* The number of arguments is removed from the operator table
* The bytecode version is changed from 6 to 5

Note that this backport is a pure format change; it does not guarantee the backported model always runs in an old runtime. The operator check mentioned before should be done first, before the model is backported to v5.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27986544

Pulled By: iseeyuan

fbshipit-source-id: 143e19d4798cfb96b65095538dd648eead4e3fda
2021-05-13 14:20:47 -07:00
e1bb9d2d99 Reimplement spectral_norm using new parametrization functionality (#57784)
Summary:
Adds a new file under `torch/nn/utils/parametrizations.py` which should contain all the parametrization implementations

For spectral_norm we add the `SpectralNorm` module, which can be registered using `torch.nn.utils.parametrize.register_parametrization` or via a wrapper, `spectral_norm`, with the same API the old implementation provided (see the usage sketch after the lists below).

Most of the logic is borrowed from the old implementation:
 - Just like the old implementation, there are cases where retrieving the weight should perform another power iteration (thus updating the weight) and cases where it shouldn't. For example, in eval mode (`self.training=False`) we do not perform power iteration.

There are also some differences/difficulties with the new implementation:
 - Using the new parametrization functionality as-is, there doesn't seem to be a good way to tell whether a forward call was the result of parametrizations being unregistered (with `leave_parametrized=True`) or of the injected property's getter being invoked. The issue is that we want to perform power iteration in the latter case but not the former, and as-is we don't have this control. So, in this PR I modified the parametrization functionality to change the module to eval mode before triggering the forward call
 - Updates the vectors based on the weight on initialization to fix https://github.com/pytorch/pytorch/issues/51800 (this avoids silently updating weights in eval mode). This also means that we perform twice as many power iterations by the first forward.
 - right_inverse is just the identity for now, but maybe it should assert that the passed value already satisfies the constraints
 - So far, all the old spectral_norm tests have been cloned, but maybe we don't need so much testing now that the core functionality is already well tested
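
A usage sketch of the wrapper path described above (module path per this PR's description):

```python
import torch
from torch.nn.utils import parametrize
from torch.nn.utils.parametrizations import spectral_norm

linear = spectral_norm(torch.nn.Linear(4, 4))         # same wrapper API as before
print(parametrize.is_parametrized(linear, "weight"))  # True
```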

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57784

Reviewed By: ejguan

Differential Revision: D28413201

Pulled By: soulitzer

fbshipit-source-id: e8f1140f7924ca43ae4244c98b152c3c554668f2
2021-05-13 14:16:13 -07:00
51cd89ecc6 [ONNX] Handle mixed mask, index input for index_put (#57604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57604

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393524

Pulled By: SplitInfinity

fbshipit-source-id: 6c0cd9db981a7e4ece2fdd375a814a13449e1ab0

Co-authored-by: David <jiafa@microsoft.com>
2021-05-13 13:42:56 -07:00
01374d69e4 [ONNX] ListUnpack on dynamic tensor list (#56592) (#57603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57603

With explicit list-unpack code from the user, it is possible to observe `prim::ListUnpack` on an `ONNX::Sequence` object. This PR supports the conversion.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393527

Pulled By: SplitInfinity

fbshipit-source-id: 1e6234d349b94c97c6ff20880a801433a9a428e9

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-13 13:42:55 -07:00
8e29863a0d [ONNX] Handle NoneType in Assign Output Shapes (#54623) (#57602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57602

Needed for ONNX Export of Huggingface-reformer models

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393517

Pulled By: SplitInfinity

fbshipit-source-id: bab6f91e624bb31e804fe2cf7ec0970164a6f29e

Co-authored-by: shubhambhokare1 <shubhambhokare1@gmail.com>
2021-05-13 13:42:53 -07:00
bfe7728f18 [ONNX] Process const folding progressively when converts to ONNX (#54569) (#57601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57601

This PR automatically solves onnx const attribute issue in PR https://github.com/pytorch/pytorch/pull/53784.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393525

Pulled By: SplitInfinity

fbshipit-source-id: 833dac7c71f24a88af62d5dd2be0a702ed34d053

Co-authored-by: David <jiafa@microsoft.com>
2021-05-13 13:42:51 -07:00
346dc88bfa [ONNX] Support registering custom export for prim::PythonOp from torch.autograd.Function (#55630) (#57600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57600

Demo script:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, scalar_tuple, scalar, scalar_list):
        ctx.save_for_backward(input)
        return input.clamp(min=scalar)
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_a = torch.nn.Linear(2, 2)
        self.linear_b = torch.nn.Linear(2, 2)
        self.relu = MyReLU.apply
    def forward(self, x):
        h = self.linear_a(x)
        h = self.relu(h, (5, 3), 2, [1, 2, 3])
        h = self.linear_b(h)
        return h

"""
User define how to export prim::PythonOp into custom op.
"""
def symbolic_pythonop(g, n, *args, **kwargs):
    # Print information:
    print('arguments of ', kwargs['name'], ':')
    print('original node: ', n)
    for i, out in enumerate(n.outputs()):
        print('original output {}: {}, requires grad: {}'.format(i, out, out.requiresGrad()))
    import torch.onnx.symbolic_helper as sym_helper
    for i, arg in enumerate(args):
        print('arg {}: {}, requires grad: {}'.format(i, arg, arg.requiresGrad() if sym_helper._is_value(arg) else False))
    for k, v in kwargs.items():
        print('key: ', k, ' v: ', v)

    # TODO: all inputs (tensors and scalars) are in args.
    #       backend can define CustomDomain::PythonOp and how info are stored however it deem fit.
    return g.op("CustomDomain::PythonOp", args[0], name_s=kwargs['name'])

torch.onnx.register_custom_op_symbolic("::prim_PythonOp", symbolic_pythonop, 9)

# Define input.
x = torch.tensor([[0.3971, 0.7544],
                  [0.5695, 0.4388]], requires_grad=True)

model = MyModule()
# Forward.
y = model(x)

torch.onnx.export(model, (x,), 'model.onnx', opset_version=12, verbose=True)
```

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393528

Pulled By: SplitInfinity

fbshipit-source-id: e0d55b7c737c5916fda08a3b26b3306037f970df

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-13 13:42:49 -07:00
2b0f481d3f Add support to to(device) op. (#56857) (#57599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57599

Currently, if we call the tensor.to() method and pass a device as the parameter, it will fail, because the symbolic function for to() didn't handle such a case.

So we add a check at the beginning of this symbolic function: if this is a device cast, we return self directly. A test has also been added.
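
A minimal repro of the pattern this change enables (a sketch):

```python
import io
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.to("cpu") + 1  # device-only cast; the exporter now returns self

f = io.BytesIO()
torch.onnx.export(M(), (torch.randn(2, 2),), f)
```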

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393523

Pulled By: SplitInfinity

fbshipit-source-id: c41e3c0293932fc90dedb544daadd9c5d4b48792

Co-authored-by: Jay Zhang <jiz@microsoft.com>
2021-05-13 13:42:48 -07:00
9e56314d2c onnx.symbolic_helper.parse_args: document and clean up (#56956) (#57598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57598

Add a doc string to explain what it does and how to use it.

Remove a hack around a bug in Python 2's functools.wraps().
Python 2 is no longer supported.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393519

Pulled By: SplitInfinity

fbshipit-source-id: aae8c5e7b49e2ad2d24a0e86f8ba47f1cd080e46

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-13 13:42:46 -07:00
dc0071dfa5 [ONNX] Special post process for onnx::Cast and onnx::ConstantOfShape shape type inference (#55962) (#57597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57597

* Special post process for onnx::Cast and onnx::ConstantOfShape
* Update `test_pytorch_onnx_shape_inference.py` to be unit test over shape inference patterns.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393529

Pulled By: SplitInfinity

fbshipit-source-id: fc26032ddb842d4e299447da39564b28049752ed

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-13 13:42:44 -07:00
ac9e79e561 Add a new operator for fill_() function. (#56859) (#57596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57596

Add the corresponding symbolic function and test for fill_() function.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393520

Pulled By: SplitInfinity

fbshipit-source-id: 3e177f88d3776d0d4a9d5e7ec7df4e6629738799

Co-authored-by: Jay Zhang <jiz@microsoft.com>
2021-05-13 13:42:43 -07:00
6d7fe76317 [ONNX] Warning when using __len__ to calculate tensor shape (#55151) (#57595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57595

Difference in traced graph and outputs, when using len(tensor) as compared to tensor.shape[0]

An example model is (with tensor.shape):
```
# Test len fix with variable inputs
import torch
import onnxruntime

class Model(torch.nn.Module):
    def forward(self, x):
        return x.size(1) + x.shape[0]

# Call export
dummy_x = torch.randn(3, 5)
model = Model()

import io
onnx_io = io.BytesIO()
torch.onnx.export(model, (dummy_x,), onnx_io,
                  input_names=['input'],
                  dynamic_axes={'input': {0:'h'}},
                  verbose=True)

# Call onnxruntime runtime and compare outputs on dynamic inputs
ort_session = onnxruntime.InferenceSession(onnx_io.getvalue())

x = torch.randn(2, 5)
print(model(x))
print(ort_session.run(None, {ort_session.get_inputs()[0].name: x.numpy()}))
```
The output graph is as follows:
```
graph(%input : Float(*, 5, strides=[5, 1], requires_grad=0, device=cpu)):
  %1 : Long(2, strides=[1], device=cpu) = onnx::Shape(%input)
  %2 : Long(device=cpu) = onnx::Constant[value={1}]()
  %3 : Long(device=cpu) = onnx::Gather[axis=0](%1, %2) # test/onnx/test_m.py:9:0
  %4 : Long(2, strides=[1], device=cpu) = onnx::Shape(%input)
  %5 : Long(device=cpu) = onnx::Constant[value={0}]()
  %6 : Long(device=cpu) = onnx::Gather[axis=0](%4, %5) # test/onnx/test_m.py:9:0
  %7 : Long(requires_grad=0, device=cpu) = onnx::Add(%3, %6) # test/onnx/test_m.py:9:0
  return (%7)
```
Torch output: 7
ORT output: 7

Now replacing tensor.shape[0] with len(tensor), the graph looks like:
```
graph(%input : Float(*, 5, strides=[5, 1], requires_grad=0, device=cpu)):
  %1 : Long(2, strides=[1], device=cpu) = onnx::Shape(%input)
  %2 : Long(device=cpu) = onnx::Constant[value={1}]()
  %3 : Long(device=cpu) = onnx::Gather[axis=0](%1, %2) # test/onnx/test_m.py:9:0
  %4 : Long(requires_grad=0, device=cpu) = onnx::Constant[value={3}]()
  %5 : Long(requires_grad=0, device=cpu) = onnx::Add(%3, %4)
  return (%5)
```
Torch output: 7
ORT output: 8

In the case with __len__, %4 is traced as a constant

**Workaround to avoid the mismatch when using len to get tensor.shape**

Add the following pattern around _export call
```
    import builtins
    len_backup = builtins.len
    builtins.len = lambda x : x.__len__()

    # Call export
    _export(model, args, .....

    builtins.len = len_backup

```

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393526

Pulled By: SplitInfinity

fbshipit-source-id: a7d50740442c7e913119f9f92deab48aa8c02843

Co-authored-by: shubhambhokare1 <shubhambhokare1@gmail.com>
2021-05-13 13:42:41 -07:00
3bc8a2264d [ONNX] Support .item() export & NumberType to tensor conversion (#55697) (#57594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57594

Support .item() export & NumberType to tensor conversion

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393516

Pulled By: SplitInfinity

fbshipit-source-id: 94d0aec0a8fe144ee2567dc3c9c19fcb18ed21fa

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-13 13:41:29 -07:00
061c7a1e17 Overwrite with ln if libc10.so already exists (#58243)
Summary:
This should fix the issue noted in https://github.com/pytorch/pytorch/pull/57622#issuecomment-840612300 and demonstrated in [this run](https://github.com/pytorch/pytorch/runs/2566809365). Please review this PR carefully, because I do not know enough context to know whether this is the right thing to do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58243

Test Plan: n/a

Reviewed By: walterddr

Differential Revision: D28414358

Pulled By: samestep

fbshipit-source-id: 0eb1c598f353ebac7f0a85c290be6fce4e00b6d5
2021-05-13 13:31:29 -07:00
9b95568dc3 update abs forward ad formula (#58235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58235

This is to make the OpInfo change Python-only.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412937

Pulled By: albanD

fbshipit-source-id: 1d6eb1e4baaa837c300ee8aa00b57986ba3e3eb2
2021-05-13 13:19:27 -07:00
3c4a90ce38 Revert "Revert D28387764: Codegen inplace forward AD formula from out of place one if needed" (#58231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58231

This reverts commit 066e7699eb8c375a441e6de168da3ba7a73c3f27.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412480

Pulled By: albanD

fbshipit-source-id: 7a231aa81b9e89537e6dca19642c4f12cd4b5ea5
2021-05-13 13:18:16 -07:00
098d9975a7 Port heaviside to structured kernel (#57933)
Summary:
Port heaviside to structured kernel
Related https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57933

Reviewed By: mruberry

Differential Revision: D28362533

Pulled By: ezyang

fbshipit-source-id: 96b4591db3f609434784bd0ef9e54c61c918fb88
2021-05-13 10:48:11 -07:00
770f8cea2d Add support for real and imag tensor attributes (#54692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54692

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412240

Pulled By: anjali411

fbshipit-source-id: e6965c55539a44260a812fdaa4a982f02067bb05
2021-05-13 10:44:27 -07:00
8888565597 T90561249: Enforce kernel launch checks (#58178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58178

https://www.internalfb.com/T90561249: change the test to enforce the checks

Test Plan:
buck test //caffe2/test:kernel_launch_checks

before fixing LinearAlgebra.cu and file close: https://www.internalfb.com/intern/testinfra/testconsole/testrun/1970324893386017/

after: https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749824394650/

Reviewed By: r-barnes

Differential Revision: D28390166

fbshipit-source-id: 8a217ce8c0b204922005c731aa38bc03f70fabcc
2021-05-13 10:41:20 -07:00
1de9f51782 [Pytorch Edge] Runtime ops compatibility api (#57570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57570

Move runtime ops compatibility api to OSS and introduce schema information
ghstack-source-id: 128789159

Test Plan: unit test and manually ran it for a runtime with all (non custom) ops, and the bixray models unittest {P412728176}

Reviewed By: raziel

Differential Revision: D28203104

fbshipit-source-id: 432a7d0247bccfb2e1ce90e8d41f81596efa3d67
2021-05-13 10:20:41 -07:00
2294fd61c6 .github: Add windows.4xlarge to scale-config.yml (#58198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58198

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28401791

Pulled By: seemethere

fbshipit-source-id: dabaf58417114cc6138feca26d0121036476e04b
2021-05-13 10:07:22 -07:00
d8c6b74b0b Deprecate torch.solve (#57741)
Summary:
Deprecate deprecate deprecate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57741

Reviewed By: agolynski

Differential Revision: D28379337

Pulled By: mruberry

fbshipit-source-id: a7a35ce1d3f25d8593698d89761c6c2d940db31a
2021-05-13 09:54:21 -07:00
020e2ff115 Add tests for PDT (#58211)
Summary:
This is a duplicate of the PR https://github.com/pytorch/pytorch/issues/56029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58211

Reviewed By: gmagogsfm

Differential Revision: D28403903

Pulled By: nikithamalgifb

fbshipit-source-id: 290c46709c77c1a6fd647a2348419d12bf0a5ed6
2021-05-13 09:51:09 -07:00
5e65428503 Fix NumPy compatibility issue for torch.linalg.cond (#58041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58041

The shape of the returned result was different for NumPy and PyTorch for
`ord={-2, 2, None}`. Now it's fixed.
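
A quick check of the batched output shape (a sketch):

```python
import torch

A = torch.randn(2, 3, 3)
print(torch.linalg.cond(A).shape)  # torch.Size([2]), matching numpy.linalg.cond
```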

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28405147

Pulled By: mruberry

fbshipit-source-id: 30293a017a0c0a7e9e3aabd470386235fef7b6a6
2021-05-13 09:42:18 -07:00
a49406b331 Fixed batched version of torch.linalg.cond for singular inputs (#58040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58040

This PR uses `torch.linalg.inv_ex` to determine the non-invertible
inputs and return the condition number of infinity for such inputs.

Added OpInfo entry for `torch.linalg.cond`.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28405146

Pulled By: mruberry

fbshipit-source-id: 524b9a38309851fa6461cb787ef3fba5aa7d5328
2021-05-13 09:42:17 -07:00
c1430c3425 Add torch.linalg.inv_ex without checking for errors by default (#58039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58039

The new function has the following signature
`inv_ex(Tensor input, *, bool check_errors=False) -> (Tensor inverse, Tensor info)`.
When `check_errors=True`, an error is thrown if the matrix is not invertible; when `check_errors=False`, responsibility for checking the result is on the user.

`linalg_inv` is implemented using calls to `linalg_inv_ex` now.

Resolves https://github.com/pytorch/pytorch/issues/25095
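
A usage sketch of the new function:

```python
import torch

A = torch.randn(3, 3)
inverse, info = torch.linalg.inv_ex(A)  # no error checking by default
if info.item() == 0:                    # info == 0 signals success (LAPACK convention)
    print(torch.allclose(A @ inverse, torch.eye(3), atol=1e-4))
```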

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28405148

Pulled By: mruberry

fbshipit-source-id: b8563a6c59048cb81e206932eb2f6cf489fd8531
2021-05-13 09:42:15 -07:00
9e156b01e5 linalg.eig backwards and linalg.eigvals (#57276)
Summary:
This PR adds backwards support for `eig` and `eigvals`.
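
A small check of the new backward (a sketch; assumes the random matrix has non-degenerate eigenvalues, which holds generically):

```python
import torch

A = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)
w = torch.linalg.eigvals(A)
w.abs().sum().backward()  # differentiating through eigvals now works
print(A.grad.shape)       # torch.Size([3, 3])
```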

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57276

Reviewed By: ngimel

Differential Revision: D28405056

Pulled By: mruberry

fbshipit-source-id: 27ef03f139f44d75f4d319b0f3e77e99eea9bb01
2021-05-13 09:42:13 -07:00
2afcb7e8fd Move Azure MultiGPU tests back to nightly (#58242)
Summary:
As its currently broken on master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58242

Reviewed By: samestep

Differential Revision: D28414152

Pulled By: malfet

fbshipit-source-id: 2566be294d62e39f9f7d316a039ab9372d685577
2021-05-13 09:41:02 -07:00
e507771294 [RPC Framework] Replace Python Pickler with internal RPC pickler for RemoteModule (#58019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58019

In order to support sending `RemoteModule` over RPC, previously the pickling/unpickling of `RemoteModule` was implemented based on `__setstate__` and `__getstate__`. However, this meant that the user could call the regular Python pickler/unpickler to invoke the same logic, which should not be allowed.

This PR ensures that the pickling can only happen over RPC and not via regular python pickle.

Additionally, when a new attribute is added to `RemoteModule`, if it's not added to either `_REMOTE_MODULE_PICKLED_ATTRIBUTES` or `_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING`, the attribute will be ignored and an error message will be printed to stderr. However, it will not raise an exception like before, because such an exception raised at the RPC layer will somehow cause a timeout.

#Closes: https://github.com/pytorch/pytorch/issues/57516
ghstack-source-id: 128868501

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_with_a_new_attribute_ignored_over_the_wire
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

buck test mode/dev-nosan //caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test -- --exact 'caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test - test_multiple_processes (caffe2.torch.fb.csrc.concurrency.test.atomic_int_interprocess_test.ForkMultipleProcessTest)'
buck test mode/dev //caffe2/torch/distributed/fb/test:app_test -- --exact 'caffe2/torch/distributed/fb/test:app_test - test_custom_init_rpc (caffe2.torch.distributed.fb.test.app_test.TestRpc)'

Reviewed By: mrshenli

Differential Revision: D28318270

fbshipit-source-id: 7e7df2a6690f0860c4531a244d38789db424496f
2021-05-13 09:37:42 -07:00
470cd64514 [TensorExpr] Remove disabled tests that we do not plan to re-enable. (#58207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58207

We probably don't even know what these tests check and there are no
plans on re-enabling them - let's just nuke them to keep the code clean.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28403251

Pulled By: ZolotukhinM

fbshipit-source-id: fe12e978636a74f309f57e3408ab78d459fe4d29
2021-05-13 09:19:20 -07:00
a0f4b7cd48 [TensorExpr] Re-enable skipped tests, they seem to be working now. (#58206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58206

Tested on CUDA with and without `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1`.

Closes #48053.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28403250

Pulled By: ZolotukhinM

fbshipit-source-id: 1ae1cfed691e0077a37db646937e580fbd32b23f
2021-05-13 09:18:09 -07:00
dd3bd0286b T89509943 - Improve error message during Glow ONNXIFI (#58069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58069

We want to tell the user that status code 5821 means ONNXIFI_EVENT_STATE_NONSIGNALLED in the error message.

Added that status code to the mapping and the error message output.

Reviewed By: hl475

Differential Revision: D28359864

fbshipit-source-id: 87f50ddd4ded9ced03ec6af6a1a4ef85bd2195d6
2021-05-13 09:02:36 -07:00
e71b526e7e Add inference mode python bindings and tests (#58045)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56608

 - Adds binding to the `c10::InferenceMode` RAII class in `torch._C._autograd.InferenceMode` through pybind. Also binds the `torch.is_inference_mode` function.
 - Adds context manager `torch.inference_mode` to manage an instance of `c10::InferenceMode` (global).  Implemented in `torch.autograd.grad_mode.py` to reuse the `_DecoratorContextManager` class.
 - Adds some tests based on those linked in the issue + several more for just the context manager

Issues/todos (not necessarily for this PR):
- Improve the short inference mode description
- Add a small example (a minimal sketch follows below)
- Improve testing, since there is no direct way of checking TLS/dispatch keys
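
A minimal usage sketch of the new context manager (per the bindings described above):

```python
import torch

x = torch.ones(2, 2)
with torch.inference_mode():
    y = x * 2  # no autograd tracking inside the context
print(y.requires_grad)  # False
```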

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58045

Reviewed By: agolynski

Differential Revision: D28390595

Pulled By: soulitzer

fbshipit-source-id: ae98fa036c6a2cf7f56e0fd4c352ff804904752c
2021-05-13 08:55:35 -07:00
002ce5c1df port addmm to structure kernel (#57417)
Summary:
Port addmm to a structured kernel

Follow-ups:
- migrate `mm` and `addbmm` to structured kernels
- move the TORCH_CHECKs currently in `addmm_cpu_impl_` and `addmm_out_cuda_impl` to meta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57417

Reviewed By: bdhirsh

Differential Revision: D28291001

Pulled By: walterddr

fbshipit-source-id: 4eafaa30a465e225fbb4d2a69a36f1e037df9122
2021-05-13 08:33:42 -07:00
52e9a192b1 [ROCm] add 4.2 to nightly builds (#58143)
Summary:
Depends on https://github.com/pytorch/builder/pull/764.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58143

Reviewed By: agolynski

Differential Revision: D28385532

Pulled By: malfet

fbshipit-source-id: 1a37b1d4636327f8e1d0d5cfaa03f652565f8e38
2021-05-13 08:14:23 -07:00
e8574b84bf Fix legacy tensor constructor/new matching incorrect signature with device (#58108)
Summary:
Previously, it was possible for torch.Tensor(tensor, device) or Tensor.new(tensor, device) to map to IntArrayRef or PyObject*.

PyObject* was not a problem because that would error out later.
But IntArrayRef would create an uninitialized tensor, which is confusing.

Fixes https://github.com/pytorch/pytorch/issues/47112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58108

Reviewed By: agolynski, mruberry

Differential Revision: D28372426

Pulled By: gchanan

fbshipit-source-id: 795ab4f0561939d002a661c5cc14c6cdb579f31a
2021-05-13 08:11:08 -07:00
ab5c273950 Remove the matmul complex backward skip (#58138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58138

related https://github.com/pytorch/pytorch/issues/55754

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28403156

Pulled By: anjali411

fbshipit-source-id: dca4dd643f190b314a8a4c01c698c6a1e5229f6f
2021-05-13 07:48:08 -07:00
cf7d56d8f2 Gradgradcheck runs successfully with unrelated inputs (#58049)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58049

Reviewed By: agolynski

Differential Revision: D28390033

Pulled By: albanD

fbshipit-source-id: a0809b918321f3ea6fc59bfbec1f37e566d3611d
2021-05-13 06:42:29 -07:00
6997e7bd39 Update Kineto submodule (#58179)
Summary:
Update Kineto submodule; minor API changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58179

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D28391369

Pulled By: ilia-cher

fbshipit-source-id: 61fbf63d9ec2db66fac203944679e4b99cb0d568
2021-05-13 04:03:04 -07:00
2b99bce1d7 [profiler] CUDA event fallback (#58133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58133

Adding a CUDA event fallback for cases when CUPTI tracing is not
available; this corresponds to the legacy profiler's GPU profiling.

Test Plan: python test/test_profiler.py -v

Reviewed By: gdankel

Differential Revision: D28379596

Pulled By: ilia-cher

fbshipit-source-id: 2db3b2cd8c1c3e6e596784ab00a226c69db2ef27
2021-05-13 03:41:03 -07:00
fee7e8b91d Striding for lists Part 2 (#49352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49352

In this PR, we replace all definitions of slice to take None parameters for start, end, and step. This will simplify the compiler logic.
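
A minimal sketch of what this enables, assuming list slices with an explicit step (and None start/end) now compile in TorchScript:

```python
import torch
from typing import List

@torch.jit.script
def every_other(xs: List[int]) -> List[int]:
    # start and end are None here; only the step is given
    return xs[::2]

print(every_other([1, 2, 3, 4, 5]))  # [1, 3, 5]
```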

Test Plan:
test_jit test cases

Imported from OSS

Reviewed By: jamesr66a, nikithamalgifb

Differential Revision: D25929903

fbshipit-source-id: 5bfc6bad514a8aafbef2dacc706f95f867fe85f1
2021-05-13 00:16:02 -07:00
82d714935e [TS] Add complex support for more ops (#54541)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54541

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27599114

Pulled By: anjali411

fbshipit-source-id: 182d4480fd788599c408bfaf0d23baf3d9a4e967
2021-05-12 23:33:29 -07:00
7a95cccbc7 Revert D28393469: [pytorch][PR] Enable ceil, floor, frac, round & trunc for BFloat16 on CUDA
Test Plan: revert-hammer

Differential Revision:
D28393469 (e6d8f45523)

Original commit changeset: b0f02ade7c6e

fbshipit-source-id: 5e900f240e738168b9db9a617c6a75c949ad36d6
2021-05-12 23:29:34 -07:00
c8644326a7 Revert D28177553: [tsm] add support for jetter to Role (base_image) for mast launches
Test Plan: revert-hammer

Differential Revision:
D28177553 (8a1dab3d26)

Original commit changeset: 29daada4bc26

fbshipit-source-id: 28132684dfdc28915d5fa5217a4591fec8d880fe
2021-05-12 23:21:59 -07:00
e554731b32 Hide set_enabled since it's not public facing. (#58078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58078

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28362048

Pulled By: ailzhang

fbshipit-source-id: 4c78a7c58860ec4963bc8d05d133ea26e47dcf00
2021-05-12 22:52:17 -07:00
8a1dab3d26 [tsm] add support for jetter to Role (base_image) for mast launches
Summary:
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex, yifuwang

Differential Revision: D28177553

fbshipit-source-id: 29daada4bc26e5ef0949bf75215f35e557bd35b8
2021-05-12 22:10:15 -07:00
ad4b2571b6 Fix multi gpu test break on Windows (#58213)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58213

Reviewed By: robieta, mruberry

Differential Revision: D28405126

Pulled By: malfet

fbshipit-source-id: 48c0aa8a113c554e3a007c1900fae2ff453cf85b
2021-05-12 21:39:08 -07:00
6b1eeef601 OpInfo: squeeze (#58080)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58080

Reviewed By: agolynski

Differential Revision: D28379485

Pulled By: mruberry

fbshipit-source-id: 2b288036f595a5bd6b948a072494ee87f82322ce
2021-05-12 21:29:31 -07:00
a31daf381f Move libtorch builds to be master-only (#58183)
Summary:
There were almost no libtorch-specific regressions recently

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58183

Reviewed By: janeyx99

Differential Revision: D28393091

Pulled By: malfet

fbshipit-source-id: 6dadd915ba574294afa6a95eaa759564af3154d4
2021-05-12 21:16:16 -07:00
2d7d6922b6 Revert D28387765: Add forward AD gradcheck
Test Plan: revert-hammer

Differential Revision:
D28387765 (647282cb0c)

Original commit changeset: ed15049b5bda

fbshipit-source-id: b47ac5de90da8fce3697a4d16aa10feea5668c99
2021-05-12 20:42:31 -07:00
f88297c66b Revert D28387767: Add forward AD test for op info
Test Plan: revert-hammer

Differential Revision:
D28387767 (26b6d044cd)

Original commit changeset: 369d76921c84

fbshipit-source-id: 91ac961339bdd5e1e2530d2655364f9fe46cdafb
2021-05-12 20:41:25 -07:00
87f7fdfd5c Allow instruction counting to use shared memory as a staging ground. (And a couple other tweaks.) (#56711)
Summary:
This is actually something I discovered a while ago with the wall of serotonin. It was really easy for large scale runs to get bottlenecked on disk access. I have a hack in the working files of that machine to use `/dev/shm`, but I figured I should formalize and actually make a respectable utility.

I also added a param to tweak the run cadence and print when a CorePool is created; these are just to make the CI logs a bit nicer. (A printout each second on a 40 minute CI job is a bit much...)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56711

Reviewed By: agolynski

Differential Revision: D28392248

Pulled By: robieta

fbshipit-source-id: b6aa7445c488d8e4ab9d4b31ab18df4e12783d8f
2021-05-12 20:37:41 -07:00
066e7699eb Revert D28387764: Codegen inplace forward AD formula from out of place one if needed
Test Plan: revert-hammer

Differential Revision:
D28387764 (2279962162)

Original commit changeset: 7bf3929dd214

fbshipit-source-id: 473851cf7527b0edf303fdb46b9c07357ff7f340
2021-05-12 20:35:02 -07:00
ce1a8620d9 Enabled roll & diag for BFloat16 dtype on CUDA (#57916)
Summary:
Enabled `roll` & `diag` for BFloat16 dtype on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57916

Reviewed By: agolynski

Differential Revision: D28393534

Pulled By: ngimel

fbshipit-source-id: fc1d8555b23a75f8b24c2ad826f89cd4e14cf487
2021-05-12 20:29:17 -07:00
f9aa6b2432 Enable lerp for BFloat16 on CUDA (#57907)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57907

Reviewed By: agolynski

Differential Revision: D28393597

Pulled By: ngimel

fbshipit-source-id: 27ebfaf175c9eeb8d411ce782fdbc468082c6af3
2021-05-12 20:23:52 -07:00
e6d8f45523 Enable ceil, floor, frac, round & trunc for BFloat16 on CUDA (#57910)
Summary:
Enable `ceil`, `floor`, `frac`, `round` & `trunc` for BFloat16 on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57910

Reviewed By: agolynski

Differential Revision: D28393469

Pulled By: ngimel

fbshipit-source-id: b0f02ade7c6e2ed122aa5d80f6d442823dc1f221
2021-05-12 20:22:19 -07:00
c4a486f4b1 Enable atan2 & hypot for BFloat16 on CUDA (#57905)
Summary:
Enable `atan2` & `hypot` for BFloat16 on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57905

Reviewed By: agolynski

Differential Revision: D28393706

Pulled By: ngimel

fbshipit-source-id: e505e5f098d35e4f7417508443cb0fedf6562dd1
2021-05-12 20:19:14 -07:00
f4a5730a6b Add LowerSimpleTuples for freeze tuples (#57915)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57698

Following the suggestion mentioned in https://github.com/pytorch/pytorch/issues/57698,
add a call to LowerSimpleTuples after the call at
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/passes/freeze_module.cpp#L89.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57915

Reviewed By: agolynski

Differential Revision: D28387310

Pulled By: nikithamalgifb

fbshipit-source-id: 5becb608c5352240b30dfdf03a821d0297e9609c
2021-05-12 19:07:20 -07:00
f0a5500722 [torch/elastic] Add logging to the sanitize function of RendezvousStateHolder (#58169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58169

This PR adds logging to the `_sanitize()` function of `RendezvousStateHolder` to output the nodes that had no recent heartbeat and are considered "dead".
ghstack-source-id: 128798389

Test Plan: Run the existing tests.

Reviewed By: tierex

Differential Revision: D28333394

fbshipit-source-id: ba0a398a759815e4224b58323c0e743eb383f723
2021-05-12 18:53:55 -07:00
2279962162 Codegen inplace forward AD formula from out of place one if needed (#57767)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57767

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28387764

Pulled By: albanD

fbshipit-source-id: 7bf3929dd21425be653da112385e902aa50455a1
2021-05-12 18:49:20 -07:00
26b6d044cd Add forward AD test for op info (#57701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57701

The new OpInfo flag has the following semantics:
- If it says that it supports forward AD, we run gradcheck with forward AD to ensure it is correct
- If it says that it does not support it, we check that the corresponding error is raised

All the added tests take 3s to run for CPU builds and 1min for GPU builds, which should be pretty negligible compared to the test_ops runtime for each of these architectures.

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28387767

Pulled By: albanD

fbshipit-source-id: 369d76921c8460aa4548f9b5159b7297994672f5
2021-05-12 18:49:18 -07:00
647282cb0c Add forward AD gradcheck (#57633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57633

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28387765

Pulled By: albanD

fbshipit-source-id: ed15049b5bdacca54f775b50ef166d540ba0b847
2021-05-12 18:48:07 -07:00
bc30c3165c Update docs for get_future support (#58107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58107

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28387374

Pulled By: agolynski

fbshipit-source-id: 70052afbb0b07ba341ea55f7ec30f7d9759b7bd4
2021-05-12 18:29:28 -07:00
645a5f706a move flatten_dense_tensors and unflatten_dense_tensors to Native (#58006)
Summary:
https://github.com/pytorch/pytorch/issues/55240

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58006

Reviewed By: agolynski

Differential Revision: D28386749

Pulled By: ngimel

fbshipit-source-id: 4860c35d5ff95bcc38a243d7001180e7bd536314
2021-05-12 18:18:34 -07:00
f1ac9b6598 fix lint (#58203)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58203

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28401974

Pulled By: suo

fbshipit-source-id: cc244e0fc81c5f699ff4bd30754a3f6467f232c4
2021-05-12 18:01:50 -07:00
028f2f62ac [torch/elastic] Update the rendezvous docs (#58160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58160

This PR updates the Torch Distributed Elastic documentation with references to the new `c10d` backend.
ghstack-source-id: 128783809

Test Plan: Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28384996

fbshipit-source-id: a40b0c37989ce67963322565368403e2be5d2592
2021-05-12 16:54:28 -07:00
ae63b1d1c6 [torch/elastic] Revise distributed run script (#58159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159

This PR includes the following changes:

- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.

- The `import` statement for `EtcdServer` has been removed from the run script.

- The docstrings and parameter descriptions of the run script have been revised and improved.

- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 to improve the user experience when used along with the run script, which uses port 29500 for the distributed job store (a.k.a. `MASTER_PORT`) by default.
ghstack-source-id: 128782267

Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.

Reviewed By: tierex

Differential Revision: D28383681

fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
2021-05-12 16:53:31 -07:00
166a8df65f [reland] make ddp logging api to be private (#58089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58089

Make the DDP logging API private.
ghstack-source-id: 128796419

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28365412

fbshipit-source-id: 374c01d443ffb47a3706f59e296d6e47eb5f4c85
2021-05-12 16:45:13 -07:00
8a45006765 enable deterministic path for index_copy_cuda with index_put (#58144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58144

Reland of D28291041 (14badd9929), which was reverted due to a type error from Tuple[torch.Tensor]; it seems that mypy requires Tuple[torch.Tensor, torch.Tensor, torch.Tensor].

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28383178

fbshipit-source-id: 38896fd6ddd670cfcce36e079aee7ad52adc2a28
2021-05-12 16:26:50 -07:00
01d0eb9dac [package] Add an intern keyword (#57341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57341

Require that users be explicit about what they are going to be
interning. There are a lot of changes that are enabled by this. The new
overall scheme is:

PackageExporter maintains a dependency graph. Users can add to it,
either explicitly (by issuing a `save_*` call) or implicitly (through
dependency resolution). Users can also specify what action to take when
PackageExporter encounters a module (deny, intern, mock, extern).

Nothing (except pickles, though that can be changed with a small amount
of work) is written to the zip archive until we are finalizing the
package. At that point, we consult the dependency graph and write out
the package exactly as it tells us to.

This accomplishes two things:
1. We can gather up *all* packaging errors instead of showing them one at a time.
2. We require that users be explicit about what's going in packages, which is a common request.
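
A minimal sketch of the resulting workflow, assuming the pattern-based `intern`/`extern`/`mock` actions described above (the patterns and file names are illustrative):

```python
from torch.package import PackageExporter

with PackageExporter("out.pt") as exporter:
    exporter.intern("my_package.**")  # copy these modules into the package
    exporter.extern("numpy.**")       # resolve these against the loading environment
    exporter.mock("pandas.**")        # replace these with stubs
    exporter.save_pickle("config", "config.pkl", {"lr": 0.1})
```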

Differential Revision: D28114185

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: suo

fbshipit-source-id: fa1abf1c26be42b14c7e7cf3403ecf336ad4fc12
2021-05-12 16:22:43 -07:00
d230045fde Combine backtrace print into one string to avoid interleaving. (#56961)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56961

As described in https://github.com/pytorch/pytorch/issues/56583, the
backtrace amongst several processes was garbled.
https://github.com/pytorch/pytorch/pull/56198 would've alleviated this to some
extent, but this PR combines all the logging into just one string to reduce
interleaving further.
ghstack-source-id: 128730047

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D28013191

fbshipit-source-id: 8bd8978a92ee2fbcd18472e1293d5809455b411b
2021-05-12 15:52:05 -07:00
d09abf004c OpInfo: narrow (#58082)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58082

Reviewed By: agolynski

Differential Revision: D28379371

Pulled By: mruberry

fbshipit-source-id: 484e560b1e6ceba234e497585ed308a27cd8b7a0
2021-05-12 15:39:15 -07:00
9148f19e85 enable support for nested containers in torch.testing.assert(equal|close) (#57270)
Summary:
In contrast to the initial opinion in https://github.com/pytorch/pytorch/issues/55385, there are legitimate use cases for nested containers. One such example is the [output of `LSTM`'s](https://pytorch.org/docs/stable/generated/torch.nn.LSTM):

```python
output: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] = torch.nn.LSTM()(input)
assert_close(output, expected)
```
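
For instance, a dict holding both a tensor and a nested tuple (a hypothetical shape, using the `torch.testing.assert_close` named in the title) should compare recursively:

```python
import torch

actual = {"loss": torch.tensor(0.5), "state": (torch.ones(2), torch.zeros(2))}
expected = {"loss": torch.tensor(0.5), "state": (torch.ones(2), torch.zeros(2))}
torch.testing.assert_close(actual, expected)  # recurses into the dict and tuple
```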

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57270

Reviewed By: albanD

Differential Revision: D28249303

Pulled By: mruberry

fbshipit-source-id: 75caa4414cc184ff0ce4cfc0dd5aafddfad42bcf
2021-05-12 15:37:42 -07:00
9063cb0a3c Infer types for arguments of methods not invoked directly by monkeytype (#57202)
Summary:
Support adding type annotations for class methods and nn.Module methods which are not invoked under the hood of MonkeyType

** Changes **
* This PR involves a slight change in how the example inputs are passed while scripting `class` and `nn.Module` objects.
* The example inputs passed to `_script_pdt` are of the following format:
     - example_inputs= [(obj.method1, (arg_list)), (obj.method2, (arg_list)),]
* For nn.Modules, to infer types for `forward` methods, example_inputs can be passed in two ways:
    - example_inputs= [(obj.forward, (arg_list, ))]
    - example_inputs = [(obj, (arg_list, ) )]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57202

Reviewed By: desertfire

Differential Revision: D28382827

Pulled By: nikithamalgifb

fbshipit-source-id: 5481467f3e909493bf3f439ee312056943508534
2021-05-12 15:32:38 -07:00
1de3525ca8 [ONNX] Handle PackedParams inputs for _propagate_and_assign_input_shapes (#56449) (#57079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57079

While testing the onnx 1.9 release, we see that the old bug is triggered for the caffe2 test:
`pytest test/onnx/test_pytorch_onnx_caffe2_quantized.py::TestQuantizedOps::test_small_model`
This is because the graph inputs
```python
graph(%x.1 : Tensor,
      %conv1._packed_params : __torch__.torch.classes.quantized.Conv2dPackedParamsBase,
      %conv2._packed_params : __torch__.torch.classes.quantized.Conv2dPackedParamsBase,
      %fc.bias : Float(10, strides=[1], requires_grad=0, device=cpu),
      %fc.weight : Float(10, 72, strides=[72, 1], requires_grad=0, device=cpu)):
```
contains `Conv2dPackedParamsBase`, which is a PackedParams.
When the inputs are flattened, a PackedParams entry flattens into several tensors, which misaligns the shape inference for the inputs.
This PR records how many tensors were flattened from PackedParams and skips by that number rather than 1, which makes the unit test pass.
Note that the tuple case should still follow the original logic.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28393949

Pulled By: malfet

fbshipit-source-id: 98d48aad27e5ca03fb10d260f8e625478d996ee2

Co-authored-by: David <jiafa@microsoft.com>
2021-05-12 15:20:26 -07:00
3d5bb71020 Back out "[PyTorch Edge] Reuse constant table from ts in bytecode" (#58099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58099

Original commit changeset: 34e0cb814901
ghstack-source-id: 128749184

Test Plan: CI

Reviewed By: raziel, iseeyuan

Differential Revision: D28369142

fbshipit-source-id: 631034126cebbd1c94ead6316b66e83a4812a890
2021-05-12 15:12:18 -07:00
85d64648d3 Port threshold to structure (#57810)
Summary:
Related: https://github.com/pytorch/pytorch/issues/55070
Port threshold and threshold_backward to structured kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57810

Reviewed By: agolynski

Differential Revision: D28382716

Pulled By: ezyang

fbshipit-source-id: 8d0702ad074b52e8512524d9807c93bfe04c51d6
2021-05-12 15:04:55 -07:00
82b2013eac Delete move constructor on TensorImpl (#58048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58048

It's never used, and it is also a bit dangerous, because a move
typically destroys the source location, but there may be other owning
references to the original location!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28390241

Pulled By: ezyang

fbshipit-source-id: 68f22756ac066a7a0fc8baedd2b7834c01c2c534
2021-05-12 15:03:49 -07:00
9bfc1c4e0e [Gradient Compression] Update the docstring of fp16_compress_hook (#58168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58168

Update the documentation to be consistent to https://github.com/pytorch/pytorch/pull/57410.
ghstack-source-id: 128797174

Test Plan: N/A

Reviewed By: agolynski, zhengwy888

Differential Revision: D28388160

fbshipit-source-id: 6ba13ad9f9d7b4d003cdc112545573e452df8b65
2021-05-12 14:28:41 -07:00
2073e866ad Switch GHA test stats S3 upload token (#58156)
Summary:
TODOs:

- [x] generate a temporary new token on this repo for testing purposes
- [x] change the name of the S3 secret used in the workflow YAML definitions
- [x] check the test plan
- [x] replace the temporary token with a more permanent one
- [x] check the test plan again
- [x] uncomment the `if` statement that guards against uploading PR test stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58156

Test Plan: Check the [ossci-metrics bucket](https://s3.console.aws.amazon.com/s3/buckets/ossci-metrics) after CI runs on this PR. Specifically, [this prefix](a3445bfbd7/pytorch-linux-xenial-py3.6-gcc5.4/&showversions=false) has two objects under it.

Reviewed By: janeyx99

Differential Revision: D28393138

Pulled By: samestep

fbshipit-source-id: 2c39c102652d471afa016cfc4942bb1e5bbb4163
2021-05-12 14:20:18 -07:00
581bf01074 [Gradient Compression] Remove unnecessary warning on the rst file and the check on C++ version (#58170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58170

Comm hooks are now supported on the MPI and GLOO backends besides NCCL, so these warnings and the check are no longer needed.
ghstack-source-id: 128799123

Test Plan: N/A

Reviewed By: agolynski

Differential Revision: D28388861

fbshipit-source-id: f56a7b9f42bfae1e904f58cdeccf7ceefcbb0850
2021-05-12 14:15:10 -07:00
4c24d820ff [TensorExpr] Implement 'call_raw' in CUDA codegen. (#57901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57901

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28312107

Pulled By: ZolotukhinM

fbshipit-source-id: 53b4fd418d0c7bf70647278ee03efa5ef60b3af8
2021-05-12 14:08:20 -07:00
c751e53800 [TensorExpr] Implement 'call_raw' in IREval. (#57882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57882

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28306752

Pulled By: ZolotukhinM

fbshipit-source-id: 11d0034f9bfbadf8483de90c457f952a2161f10b
2021-05-12 14:08:18 -07:00
cbba3db21b [TensorExpr] Minor cleanup in IREval. (#57881)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57881

Test Plan: Imported from OSS

Reviewed By: navahgar, ngimel

Differential Revision: D28306751

Pulled By: ZolotukhinM

fbshipit-source-id: aad9774d62d2e54b3ca51f5cc2ced841c6b9206b
2021-05-12 14:07:08 -07:00
5e83c62a9e Revert D28351931: [pytorch][PR] Fix some tensor operators to return NotImplemented for invalid inputs
Test Plan: revert-hammer

Differential Revision:
D28351931 (35521a2629)

Original commit changeset: 985457a44dba

fbshipit-source-id: 10724c219e53648f10a70719e25bcf774c6c7852
2021-05-12 13:58:03 -07:00
46e4b2dbda Convert assert -> cast. (#57458)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55868.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57458

Reviewed By: mruberry

Differential Revision: D28365745

Pulled By: walterddr

fbshipit-source-id: 35cc3fa85f87b0ef98cf970f620ab909d240c7be
2021-05-12 13:54:16 -07:00
614437751f make remote model instantiation async when possible (#58052)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58052 for the cases where `module_interface_cls` is not provided

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58052

Reviewed By: mruberry

Differential Revision: D28369064

Pulled By: mrzzd

fbshipit-source-id: 3ded7ea943a5ff0425bedc05448a59e6eefbeaaf
2021-05-12 13:48:09 -07:00
0bfcc3e3f4 fix topk with k=0 on cuda (#58086)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58086

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28364964

Pulled By: bdhirsh

fbshipit-source-id: 4d02bf5b27ca5e8b6f7b6cc6aa99d9e31233578b
2021-05-12 13:38:10 -07:00
cbd1227809 Add a note in the parametrize doc about the naming choice (#58142)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58142

Reviewed By: agolynski

Differential Revision: D28386655

Pulled By: albanD

fbshipit-source-id: c2793ac377ef7082c1840e1a50604da3ff9c61ac
2021-05-12 13:15:56 -07:00
3c973de543 HABANA Device registration key and Autograd key addition (#57094)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57094

Reviewed By: mruberry

Differential Revision: D28355895

Pulled By: wconstab

fbshipit-source-id: 5d8b5762a69f444f4fe7f476891150fa5483d893
2021-05-12 13:07:33 -07:00
c9eb381aac Allow zero jobs in tools/explicit_ci_jobs.py (#58176)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58176

Test Plan:
```
tools/explicit_ci_jobs.py --keep-gha
```

Reviewed By: driazati

Differential Revision: D28390351

Pulled By: samestep

fbshipit-source-id: 1dc01c523b465efd0b617d98d0cdd1a759882110
2021-05-12 13:03:34 -07:00
6955d4d0f7 [nnc] Handle only the first argument of aten::to (#58028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58028

We were trying to translate the device argument and thus throwing an
unsupported dtype.
ghstack-source-id: 128748658

Test Plan: predictor models

Reviewed By: navahgar

Differential Revision: D28347704

fbshipit-source-id: 331a5786339e01f9df1b1878970b0c5983a92980
2021-05-12 12:52:29 -07:00
a88673e93e Enable cat wo conditionals iff cpu (#58026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58026

Cat-without-conditionals is a valuable optimization on CPU but on GPU
it can generate invalid code since it may introduce allocations (i.e. extra
kernel launches)
ghstack-source-id: 128748630

Test Plan: predictor

Reviewed By: navahgar

Differential Revision: D28347703

fbshipit-source-id: f9e68cd7bcf5d316082ce8378ddf99f2d33fcc07
2021-05-12 12:51:10 -07:00
ab6b5fa036 Add HIP (ROCm) semantics doc (#57871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57871

Reviewed By: agolynski

Differential Revision: D28385510

Pulled By: malfet

fbshipit-source-id: 9cf69e52d026a1cf74cc12d8727ca17ae026235e
2021-05-12 12:34:07 -07:00
af36d084fd reland [ROCm] ubuntu version check in install_rocm.sh (#58164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58164

Reland of https://github.com/pytorch/builder/pull/764

This reverts commit 6404184700159213f7d64df62537e238822f8b15.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28387195

Pulled By: seemethere

fbshipit-source-id: 905a14d9ed5e85a7dff7da3d1c4a628320ff7451
2021-05-12 12:29:17 -07:00
53bc6f79f3 Added DevOps PR and Nightly Build logic (#58007)
Summary:
This PR adds Azure DevOps support for running custom PyTorch unit tests on PyTorch PR and Nightly builds.

PR Builds on Azure DevOps:
- Ensures that the wheel artifacts for a given PR build are ready
- Once the wheels are ready, PyTorch custom tests are run against the torch installation from the built wheels

Nightly Builds on Azure DevOps:
- Queues 4 builds ({Win,Linux} x {CPU,CUDA}) to run PyTorch custom unit tests on nightly PyTorch builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58007

Reviewed By: seemethere, mruberry

Differential Revision: D28342428

Pulled By: malfet

fbshipit-source-id: a454accf69163f9ba77845eeb54831ef91437981
2021-05-12 12:24:41 -07:00
7156168f71 Port max_pool2d_with_indices_backward to structure (#57797)
Summary:
Related: https://github.com/pytorch/pytorch/issues/55070
Port max_pool2d_with_indices_backward to a structured kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57797

Reviewed By: bdhirsh

Differential Revision: D28289731

Pulled By: ezyang

fbshipit-source-id: 4c562d0b9fddbf9a445062f8723eeec607bd1108
2021-05-12 12:11:29 -07:00
3b977b3b4d [DataLoader] Add context manager for runtime type validation (#55936)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55936

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27743476

Pulled By: ejguan

fbshipit-source-id: 8f0454ccf3ec37807598056433bff91013fa9bb9
2021-05-12 11:59:16 -07:00
5c696443c7 [DataLoader] Modify construct_time_validation to argument_validation (#55836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55836

Change construct_time_validation to argument_validation, since we should give users the flexibility to use this decorator on any function that requires type validation.

It can also work as a construct-time validation
```py
class ExampleDataPipe(IterDataPipe):
    @argument_validation
    def __init__(self, dp: IterDataPipe[int]):
        self.dp = dp

    ...
```
Notebook is also updated.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27743478

Pulled By: ejguan

fbshipit-source-id: 49743152d121028cd7d72d89dc7df5c7c7b94c41
2021-05-12 11:58:05 -07:00
a0ac80ec76 [DDP] Don't find tensors if static graph (#58105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105

When find_unused_parameters=True but static_graph is also set, static graph handles unused parameter accounting, so this code path is not needed
ghstack-source-id: 128736289

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28371954

fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
2021-05-12 11:40:18 -07:00
87afcea0cc T90561249: Enforce kernel launch checks (#58116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58116

T90561249: Enforce kernel launch checks

Test Plan: how to test?

Reviewed By: r-barnes

Differential Revision: D28367890

fbshipit-source-id: 159dd3e14a4532c1325a0a332c02ef58d720a91b
2021-05-12 11:34:22 -07:00
35521a2629 Fix some tensor operators to return NotImplemented for invalid inputs (#57934)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57719.

This PR fixes `torch.Tensor{__rsub__, __rdiv__, __rtruediv__, __pow__, __rmatmul__}` to return `NotImplemented` instead of raising a `TypeError`.
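
A minimal sketch of why returning `NotImplemented` matters here: Python can then fall back to the other operand's reflected method instead of failing outright (the `Scalar` class is purely illustrative):

```python
import torch

class Scalar:
    def __init__(self, v):
        self.v = v

    def __rpow__(self, base):
        # reached only if base.__pow__(self) returns NotImplemented
        return Scalar(base ** self.v)

t = torch.tensor(2.0)
result = t ** Scalar(3)  # previously a TypeError inside Tensor.__pow__
```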

cc/ mruberry: The first commit of this PR is the same as 1d209db1cc except for the commit message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57934

Reviewed By: mruberry

Differential Revision: D28351931

Pulled By: albanD

fbshipit-source-id: 985457a44dba24d2496794dfb8c1661cbcd4ff8f
2021-05-12 11:03:23 -07:00
6404184700 Revert D28385479: [pytorch][PR] [ROCm] ubuntu version check in install_rocm.sh
Test Plan: revert-hammer

Differential Revision:
D28385479 (94bb1150a7)

Original commit changeset: 10ad225b7185

fbshipit-source-id: ac9167c906404e87ec0b94cf1c4e9c4cab7aca0f
2021-05-12 10:52:01 -07:00
9d56176034 Fix splitter and add a unittest (#58075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58075

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1687

Reviewed By: mikekgfb

Differential Revision: D28357724

fbshipit-source-id: 36c2d211576a90107bc75468a39408ffecaeed43
2021-05-12 10:40:37 -07:00
bfd0a46156 [fx] Arg normalization not save output node in the node_map (#58058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58058

Don't save the output node in the `node_map`, because the result of the output node could be a list of proxies, which would throw an error when used as a key.

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D28329580

fbshipit-source-id: a29f3ef1763930faa20cb20eb9ffd04ef7e52dc1
2021-05-12 10:39:37 -07:00
3603ba24d5 Trigger Windows multi gpu tests on master (#57817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57817

Reviewed By: mruberry

Differential Revision: D28368299

Pulled By: malfet

fbshipit-source-id: 765ef740c25477ba8a5d41489ffad4e5d8456236
2021-05-12 10:36:02 -07:00
8f83bfeb98 Update CI images for rocm4.2 (#58017)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58017

Reviewed By: agolynski

Differential Revision: D28385181

Pulled By: malfet

fbshipit-source-id: b4bb02d4dfaaa741ee6a804bbd7d7e9e394f7321
2021-05-12 10:31:07 -07:00
94bb1150a7 [ROCm] ubuntu version check in install_rocm.sh (#57751)
Summary:
In preparation for the ROCm 4.2 release, which changes the apt repo name from xenial to ubuntu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57751

Reviewed By: agolynski

Differential Revision: D28385479

Pulled By: malfet

fbshipit-source-id: 10ad225b71857226d8e36eaa62eba4511d9362e7
2021-05-12 10:23:45 -07:00
16d617c3e5 test experiment script (#57925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57925

1. adds test_scripts.py that will run added scripts and verify that there are no errors
2. adds local ddp_nccl_allreduce experiment script

test with command `pytest test_scripts.py`

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28382452

Pulled By: gcramer23

fbshipit-source-id: 21028a990ebfedf1aad6b007a723c02403e8bea8
2021-05-12 10:22:47 -07:00
d212bf1863 Enable BFloat16 for nan_to_num on CUDA (#58063)
Summary:
Enabled BFloat16 for `nan_to_num` on CUDA. For comparison with numpy, a [workaround suggested](https://github.com/pytorch/pytorch/issues/57982#issuecomment-839150556) by ngimel is being used - the OpInfo's `sample.kwargs` is used to set two `numpy.kwargs`, viz. `posinf` & `neginf` for `BFloat16`.
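
A minimal sketch of the enabled path (assumes a CUDA device is available):

```python
import torch

x = torch.tensor([float('nan'), float('inf'), -float('inf')],
                 device='cuda', dtype=torch.bfloat16)
print(torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4))
```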

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58063

Reviewed By: mruberry

Differential Revision: D28373478

Pulled By: ngimel

fbshipit-source-id: 6493b560d83632a8519c1d3bfc5c54be9b935fb9
2021-05-12 09:50:26 -07:00
c52700dbcd [wip] enhance DDPSink to work for general outputs (#57073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073

Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.

TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27756985

fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
2021-05-12 09:45:10 -07:00
4faa427383 Remove printout from distributed tests (#58095)
Summary:
These were added to help debug a flaky test; the flaky test has since been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58095

Reviewed By: SciPioneer

Differential Revision: D28368077

Pulled By: rohan-varma

fbshipit-source-id: 9618f64de2b7015401bb8cb7816b09e1a44e0fef
2021-05-12 09:34:38 -07:00
30f26c5893 Reimplement torch::flip based on advanced indexing (#56713)
Summary:
## Rationale
This PR improves the performance of `torch::flip` by using `TensorIterator` in the same fashion as `AdvancedIndexing`. This means the implementation is semantically equivalent to indexing a tensor with reversed indices: `A[dim0_size - 1:0 ..., dimN_size - 1:0, ...]`.
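
The equivalence can be checked directly (a minimal sketch):

```python
import torch

A = torch.arange(12).reshape(3, 4)
# flipping a dimension equals indexing it with reversed indices
reversed_cols = torch.arange(A.size(1) - 1, -1, -1)
assert torch.equal(torch.flip(A, dims=[1]), A[:, reversed_cols])
```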

## Benchmark results
The following benchmark compares the runtime of this implementation of `flip` against the current implementation, AdvancedIndexing with reversed indices, as well as the OpenCV one. The comparison scenarios consider a 4D tensor `[B, C, H, W]`, where the flipped dimensions correspond to `H` (vertical flip) and `W` (horizontal flip) under float32 and uint8 datatypes.

The benchmark implementation details can be found in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/benchmarks.py. Additionally, there are correctness tests against the current flip implementation in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/main.cpp, which tests against different layouts, datatypes and contiguous/non-contiguous tensors.

The following plots show the mean runtime of each operator over 100 samples. As can be observed, the latest implementation of flip has a runtime similar to the indexing one, and the performance gains are up to 6X under some scenarios.

### Horizontal flip (float)
![bokeh_plot](https://user-images.githubusercontent.com/1878982/115766715-e72a3d80-a36d-11eb-8552-9005028900b1.png)

### Horizontal flip (uint8)
![bokeh_plot(1)](https://user-images.githubusercontent.com/1878982/115766720-e7c2d400-a36d-11eb-822d-44046882c976.png)

### Vertical flip (float)
![bokeh_plot(2)](https://user-images.githubusercontent.com/1878982/115766721-e7c2d400-a36d-11eb-8f4b-d44c8c33d104.png)

### Vertical flip (uint8)
![bokeh_plot(3)](https://user-images.githubusercontent.com/1878982/115766725-e85b6a80-a36d-11eb-907a-cfcddba555ad.png)

cc fmassa vfdev-5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56713

Reviewed By: datumbox

Differential Revision: D28255088

Pulled By: fmassa

fbshipit-source-id: 5b8684812357c331e83a677b99cf0d78f0821678
2021-05-12 09:17:03 -07:00
5ea87f9c24 Grammatically updated the tech docs (complex_numbers.rst) (#57540)
Summary:
Small grammatical change in complex_numbers.rst.
You can see the changes in the screenshot below:
![Capture](https://user-images.githubusercontent.com/38073192/117013956-01aed000-acf9-11eb-9d17-1e369de68585.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57540

Reviewed By: albanD

Differential Revision: D28233650

Pulled By: mrshenli

fbshipit-source-id: 0cec7bb1f4bd61e929e2a8fc5292bc20b77aee35
2021-05-12 09:05:18 -07:00
ff982ef73d OpInfo: reshape, reshape_as and minor clean-up (#57460)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57460

Reviewed By: nairbv

Differential Revision: D28151675

Pulled By: anjali411

fbshipit-source-id: 2b3bcadab3ff5d1761b2922b63afd70a354e785c
2021-05-12 06:05:21 -07:00
c911c30520 Revert D28291041: enable deterministic path for index_copy_cuda with index_put
Test Plan: revert-hammer

Differential Revision:
D28291041 (14badd9929)

Original commit changeset: 7f0cf3ec7280

fbshipit-source-id: 6117bc6e5b2044ce70d4e4a19bccd8c183ea3702
2021-05-12 03:33:57 -07:00
c7fb0a0e82 Remove beta warning for use_deterministic_algorithms (#58074)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58074

Reviewed By: ngimel

Differential Revision: D28373676

Pulled By: mruberry

fbshipit-source-id: cae9a92ebbf6ac5f8d3008aa6a6a9cd5c1041c9f
2021-05-12 03:30:12 -07:00
e1078d42f0 std/var: Return real results for complex input (#58066)
Summary:
Fixes gh-56627
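
A minimal check of the new behavior described in the title:

```python
import torch

z = torch.randn(4, dtype=torch.complex64)
print(z.std().dtype)  # now a real dtype (torch.float32) rather than a complex one
print(z.var().dtype)  # torch.float32
```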

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58066

Reviewed By: ngimel

Differential Revision: D28372987

Pulled By: mruberry

fbshipit-source-id: c34d55f1a48047ceefa298ef2f4f33ad7dd1e577
2021-05-12 03:26:55 -07:00
db13119fc4 Deprecate symeig (#57732)
Summary:
This one had a tricky usage of `torch.symeig` that had to be replaced. I tested the replacement locally though.
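
A minimal sketch of the replacement pattern (`torch.linalg.eigh` is the documented successor for symmetric/Hermitian inputs):

```python
import torch

A = torch.randn(3, 3)
A = A + A.T  # symmetric input, as symeig required
# old: e, v = torch.symeig(A, eigenvectors=True)
e, v = torch.linalg.eigh(A)
```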

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57732

Reviewed By: bdhirsh

Differential Revision: D28328189

Pulled By: mruberry

fbshipit-source-id: 7f000fcbf2b029beabc76e5a89ff158b47977474
2021-05-12 02:21:35 -07:00
e18f5f1d13 [profiler][small] Add skip_first parameter to the default schedule (#58025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58025

Add `skip_first` parameter to allow for arbitrary profiler step ranges
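
A minimal sketch of the new parameter with the default schedule helper:

```python
from torch.profiler import schedule

# ignore the first 10 steps entirely, then cycle: wait 5, warm up for 1, record 3
my_schedule = schedule(skip_first=10, wait=5, warmup=1, active=3)
```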

Test Plan: python test/test_profiler.py

Reviewed By: gdankel

Differential Revision: D28347768

Pulled By: ilia-cher

fbshipit-source-id: bb6fd3cedfa4a5d1307b91002def733896dd03eb
2021-05-12 02:06:11 -07:00
cdf161c382 [profiler][small] Speed up postprocessing (#58021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58021

Improve complexity of _remove_dup_nodes function

Test Plan:
using trivial microbenchmark:
```
import torch
from torch.autograd.profiler import *
import time

evts = EventList()
id_cnt = 0
for r in range(10*1000):
    st = r * 1000
    evts.append(FunctionEvent(id_cnt, thread=0, name="parent", start_us=st, end_us=st+100))
    evts.append(FunctionEvent(id_cnt+1, thread=0, name="parent", start_us=st+1, end_us=st+99))
    evts.append(FunctionEvent(id_cnt+2, thread=0, name="child", start_us=st+10, end_us=st+90))
    id_cnt+=3

st = time.time()
evts._build_tree()
print("Elapsed: {:.3f}s".format(time.time() - st))
```

```
After:
python test_prof.py
Elapsed: 0.203s

Before:
python test_prof.py
Elapsed: 3.653s
```

Reviewed By: gdankel

Differential Revision: D28347217

Pulled By: ilia-cher

fbshipit-source-id: d62da3400009f1fa8cb41a11a828aa8307f190bf
2021-05-12 02:06:09 -07:00
bf2ebfc9f6 [profiler][small] Handle empty trace (#58013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58013

Add a test case and a fix (legacy profiler) for empty trace handling

Test Plan: python test/test_profiler.py

Reviewed By: gdankel

Differential Revision: D28345388

Pulled By: ilia-cher

fbshipit-source-id: 4727589ab83367ac8b506cc0f186e5292d974671
2021-05-12 02:06:08 -07:00
f1defeaea4 [profiler][resend] Add cuda memory and distributed metadata (#58010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58010

Resending https://github.com/pytorch/pytorch/pull/57252

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D28345161

Pulled By: ilia-cher

fbshipit-source-id: 18be07b275403205f5b5487ae3589bd39a8eac96
2021-05-12 02:04:48 -07:00
14badd9929 enable deterministic path for index_copy_cuda with index_put (#57870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57870

This is similar to index_add_cuda with index_put accumulate=True.

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
    ✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
    ✓ Pass: caffe2/test:torch_cuda - main (25.750)

Reviewed By: ngimel

Differential Revision: D28291041

fbshipit-source-id: 7f0cf3ec72805f3617fd1de9ff03e1d49114fed8
2021-05-12 00:32:35 -07:00
a07a0190f9 enable deterministic path for index_put with accumulate=False on CPU and CUDA (#57839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57839

We reuse `index_put_accum_kernel`, rename it to `index_put_deterministic_kernel`, and add a bool `accumulate` to `index_backward_kernel`.
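
A minimal sketch of exercising the deterministic path (assumes a CUDA device; `torch.use_deterministic_algorithms` selects it):

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(4, device='cuda')
idx = torch.tensor([0, 0, 2], device='cuda')
vals = torch.tensor([1., 2., 3.], device='cuda')
x.index_put_((idx,), vals, accumulate=False)  # duplicate indices now resolve reproducibly
print(x)
```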

Test Plan:
buck test mode/opt //caffe2/test:torch -- test_index_put_non_accumulate_deterministic

    ✓ Pass: caffe2/test:torch - test_index_put_non_accumulate_deterministic_cpu (test_torch.TestTorchDeviceTypeCPU) (5.120)
Summary
  Pass: 1
  Skip: 1
    ↻ caffe2/test:torch - test_index_put_non_accumulate_deterministic_meta (test_torch.TestTorchDeviceTypeMETA)
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_index_put_non_accumulate_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (6.397)
    ✓ Pass: caffe2/test:torch_cuda - test_index_put_non_accumulate_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (26.030)
    ✓ Pass: caffe2/test:torch_cuda - main (26.030)
Summary
  Pass: 2
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28290699

fbshipit-source-id: df8bbe7af2e72017566161b05b85737fda4ceb3f
2021-05-12 00:31:19 -07:00
d623fb7e04 Add a disclaimer about limited CUDA support in RPC (#58023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58023

Clearly state that some features of RPC aren't yet compatible with CUDA.
ghstack-source-id: 128688856

Test Plan: None

Reviewed By: agolynski

Differential Revision: D28347605

fbshipit-source-id: e8df9a4696c61a1a05f7d2147be84d41aeeb3b48
2021-05-12 00:11:22 -07:00
c3d40fdf56 [ATen] Use expect_contiguous in layer_norm (#58067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58067

- Use expect_contiguous in layer_norm to avoid unnecessary refcount bumps when the tensors are contiguous
- Clean up some leftovers from the hacky wrappers removal: use c10::MaybeOwned<Tensor> for bias tensors
- Skip dispatcher for at::empty in the layer_norm impl in Static Runtime

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D28214298

fbshipit-source-id: 73150fa62d5c18f41a2264f8e56bbe5e377ad045
2021-05-11 22:56:32 -07:00
c790fd2bf8 ATen lu_unpack. Required for making torch.lu_solve differentiable. (#46913)
Summary:
Backward methods for `torch.lu` and `torch.lu_solve` require the `torch.lu_unpack` method.
While `torch.lu` is a Python wrapper over a native function (so its gradient is implemented via `autograd.Function`),
`torch.lu_solve` is a native function and cannot access `torch.lu_unpack`, which is implemented in Python.

Hence this PR presents a native (ATen) `lu_unpack` version. With this function, it is also possible to update the gradients for `torch.lu` so that backward+JIT is supported (there is no JIT for `autograd.Function`).

~~The interface for this method is different from the original `torch.lu_unpack`, so it is decided to keep it hidden.~~
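
A minimal usage sketch of the resulting op:

```python
import torch

A = torch.randn(3, 3)
LU, pivots = A.lu()
P, L, U = torch.lu_unpack(LU, pivots)
assert torch.allclose(P @ L @ U, A, atol=1e-5)
```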

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46913

Reviewed By: albanD

Differential Revision: D28355725

Pulled By: mruberry

fbshipit-source-id: 281260f3b6e93c15b08b2ba66d5a221314b00e78
2021-05-11 22:53:21 -07:00
32acc96f78 [Static Runtime] Fix bug in aten::clone (#58100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58100

aten::clone has a second arg, memory_format, which was not previously supported.
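
A minimal eager-mode sketch of the signature in question (the second argument that Static Runtime now handles):

```python
import torch

x = torch.randn(2, 3, 8, 8)
y = x.clone(memory_format=torch.channels_last)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```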

Reviewed By: ajyu

Differential Revision: D28347171

fbshipit-source-id: e083cc24c3228048429bba3497326415bc3d1f5a
2021-05-11 22:47:25 -07:00
8c91acc161 port topk to structured (#57790)
Summary:
https://github.com/pytorch/pytorch/issues/55070

There are a few places where `const_cast` is used with utility functions shared with unstructured operators.
The RFC says that assigning to the `out` tensor doesn't work, but that seems to be what, e.g., `_allocate_or_resize_output_with_indices` does. Does assignment "work" when the tensors are not allocated?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57790

Reviewed By: bdhirsh

Differential Revision: D28289685

Pulled By: ezyang

fbshipit-source-id: 7027f162581af0bc0f5b750ff4439b0ecb01ec7b
2021-05-11 22:14:53 -07:00
e9e125475e [Static Runtime] Add schema check to aten::repeat and fb::fast_gather (#58106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58106

Followup for D28047955 (1f83d8eec2).

Reviewed By: ajyu

Differential Revision: D28369472

fbshipit-source-id: 36aa10082589f4b6f0cc2d79f032fe72a19cda57
2021-05-11 22:07:21 -07:00
8824f49e68 Split test_testing.py::TestAsserts for multiple devices (#56365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56365

Follow-up to https://github.com/pytorch/pytorch/pull/54784#discussion_r614156172. Instead of having one large testcase where most methods are decorated with `onlyCPU`, this factors out all tests that actually need another device into a separate test case.

Test Plan: Imported from OSS

Reviewed By: walterddr, albanD

Differential Revision: D28247529

Pulled By: mruberry

fbshipit-source-id: 946e7694b70e736941565f29b5dd459ed7fbca4e
2021-05-11 19:47:56 -07:00
8b816e9010 To implement gradient for Pytorch (#54617)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54617

Reviewed By: anjali411

Differential Revision: D28057452

Pulled By: iramazanli

fbshipit-source-id: 9bd86679282d34f5e5393e6447121586517eb4f0
2021-05-11 18:52:20 -07:00
0d4dc6cb39 Let submodules be collected as args/kwargs (#57840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57840

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28294984

Pulled By: ansley

fbshipit-source-id: d64fe109a349516da69d2d17f58e42f98af564fd
2021-05-11 18:17:11 -07:00
b7d674eb21 Revert D28331386: [pytorch][PR] [torch/elastic] Update the rendezvous docs
Test Plan: revert-hammer

Differential Revision:
D28331386 (e4418b67c7)

Original commit changeset: 95dd32146222

fbshipit-source-id: 5522d4a09bc06ac42943eec9aa8bf5292cc778b2
2021-05-11 18:10:46 -07:00
aaca12bcc2 Deprecate in docs torch.svd and change svd -> linalg_svd (#57981)
Summary:
This PR adds a note to the documentation that torch.svd is deprecated together with an upgrade guide on how to use `torch.linalg.svd` and `torch.linalg.svdvals` (Lezcano's instructions from https://github.com/pytorch/pytorch/issues/57549).
In addition, all usage of the old svd function is replaced with the new one from the torch.linalg module, except in the `at::linalg_pinv` function, which fails the XLA CI build (https://github.com/pytorch/xla/issues/2755; see the failure in draft PR https://github.com/pytorch/pytorch/pull/57772).
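
A minimal sketch of the upgrade path described in the linked instructions:

```python
import torch

A = torch.randn(5, 3)
# old: U, S, V = torch.svd(A)  # returned V, not its conjugate transpose
U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # returns V^H instead of V
S_only = torch.linalg.svdvals(A)  # when only the singular values are needed
```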

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57981

Reviewed By: ngimel

Differential Revision: D28345558

Pulled By: mruberry

fbshipit-source-id: 02dd9ae6efe975026e80ca128e9b91dfc65d7213
2021-05-11 18:04:10 -07:00
e573987bea remove syncs in one_hot (#57902)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55579
On CUDA, one_hot now relies on device-side asserts thrown by scatter instead of synchronizing with the host to validate indices.
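
A minimal sketch (assumes a CUDA device): valid indices behave as before, while out-of-range indices now trip a device-side assert from scatter instead of a host-side check:

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([0, 2, 1], device='cuda')
out = F.one_hot(idx, num_classes=3)  # no host synchronization to validate indices
```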

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57902

Reviewed By: bdhirsh

Differential Revision: D28328698

Pulled By: ngimel

fbshipit-source-id: 1cd13e2c123c733cde7dbe4cbe6ff5335063bb70
2021-05-11 17:54:08 -07:00
7a23a5e8e9 Shut up sccache couldn't connect error (#58047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58047

This error ALWAYS gets picked up by Dr. CI and IT DRIVES ME NUTS.
Consign it to the /dev/null bin.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28352658

Pulled By: ezyang

fbshipit-source-id: a55f99ed76728d46f02d6a61a45c7691e8be7a47
2021-05-11 17:34:09 -07:00
29cfcf70be [package] add mock/extern hooks (#58000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58000

Directly overriding save_extern and save_mock may mess with our
invariants in weird ways. This is less pronounced now, but once we
switch to graph-based dependency management things will get broken
subtly if people fail to call `super()`.

Better to add hook support, reflecting that really you can only perform a side
effect. It also has the bonus that people are likely familiar with the pattern
from `nn.Module` hooks.
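
A hypothetical sketch of the hook style this describes; the `register_extern_hook` name and its callback signature are assumptions modeled on `nn.Module` hooks:

```python
from torch.package import PackageExporter

def log_extern(exporter, module_name):
    # side effect only, as described above; the return value is not used
    print(f"externing {module_name}")

with PackageExporter("out.pt") as pe:
    handle = pe.register_extern_hook(log_extern)  # assumed API
    pe.extern("sys")
    pe.save_pickle("config", "config.pkl", {"seed": 0})
    handle.remove()
```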

Differential Revision: D28339191

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: suo

fbshipit-source-id: 63ffd39d2dcb1a7524f3c2c6a23bd399e754cc44
2021-05-11 16:46:54 -07:00
d9ea93181b Some types for remote_module (#58012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58012

Test Plan: Sandcastle

Reviewed By: SciPioneer

Differential Revision: D28334611

fbshipit-source-id: 5e4645a7de65e064cb6a919cdc2372151ec48d44
2021-05-11 16:43:55 -07:00
1f83d8eec2 [Static Runtime] Return nullptr if the number of input args doesn't match (#58018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58018

- Add checks for the number of input args and return nullptr if it doesn't match. This is intended to make Static Runtime more robust, so that an op schema change is less likely to break things. Imagine that a new arg is added to an op, or a new overload is added that has the extra arg: SR would simply ignore it. If the arg has a default value, SR would run the model with the default value and give you wrong results, which can be hard to track down.

Reviewed By: ajyu

Differential Revision: D28047955

fbshipit-source-id: 01067059edd5cfea80c4ee121829f7733b11f601
2021-05-11 16:30:45 -07:00
a90c229900 Remove the BETA status for torch.linalg (#58043)
Summary:
We are ready to move to the new stage for our `torch.linalg` module, which is stable (or STABLE?).

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58043

Reviewed By: ngimel

Differential Revision: D28356172

Pulled By: mruberry

fbshipit-source-id: e2c1effa79b9635b2ef0a820a03a0685105042bd
2021-05-11 16:11:48 -07:00
a1f9a3c643 Fix UB in library.h (#57962)
Summary:
The function name and the return type are both called `class_`, so the reference is ambiguous; this is UB and does not work on NVCC. See the tests for the failure case.

Thanks for the help of Thibaut Lutz from NVIDIA's compiler team.

cc: yueyericardo ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57962

Reviewed By: mruberry

Differential Revision: D28359400

Pulled By: ezyang

fbshipit-source-id: c64ec89203f99f656611aba34f7424eed7bc9e7c
2021-05-11 16:04:02 -07:00
c36055bb42 Make mypy_wrapper.py accept multiple filenames (#57998)
Summary:
A followup to https://github.com/pytorch/pytorch/issues/57752.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57998

Test Plan:
```
mypy --config=mypy-strict.ini
python tools/test/test_mypy_wrapper.py
python tools/test/test_actions_local_runner.py -k mypy
```

Reviewed By: driazati

Differential Revision: D28338531

Pulled By: samestep

fbshipit-source-id: ae31e3fa4a2b8060c200f9a13f768beaf2f55694
2021-05-11 15:54:12 -07:00
f9c8b7f1a8 [FX][docs] minor fixes (#58085)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58085

Reviewed By: mruberry

Differential Revision: D28364553

Pulled By: jamesr66a

fbshipit-source-id: 0d953672de9a86ecf5b1900b22e6ddef850dbe8f
2021-05-11 15:35:49 -07:00
a13718b69f [FX] Make stack trace testing less strict (#58088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58088

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D28365398

Pulled By: jamesr66a

fbshipit-source-id: 4d5d173721b4a917893a6f1202e3980aa6e85fcc
2021-05-11 15:34:06 -07:00
e4418b67c7 [torch/elastic] Update the rendezvous docs (#57973)
Summary:
This PR updates the rendezvous documentation for the Torch Distributed Elastic section of PyTorch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57973

Reviewed By: kiukchung

Differential Revision: D28331386

Pulled By: cbalioglu

fbshipit-source-id: 95dd32146222aaeff246bd3c3d2caf0036a9011b
2021-05-11 15:32:50 -07:00
8b12c8e8b3 Fixes: register_full_backward_hook crash if first argument don't require a gradient (#57944) (#57945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57945

Reviewed By: mruberry

Differential Revision: D28351929

Pulled By: albanD

fbshipit-source-id: d0db898e6bf13d1877cd81892a5a65c7854c8102
2021-05-11 15:07:35 -07:00
4ef94265e9 Add Futures to ProcessGroupGloo (#57818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57818

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28304171

Pulled By: agolynski

fbshipit-source-id: dbf7f5538890d138582831aa0279ede89619ea1e
2021-05-11 14:47:09 -07:00
111c99cdfd [vulkan] Fix glslc path for desktop build (#56507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56507

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D27951058

Pulled By: IvanKobzarev

fbshipit-source-id: 29443b61264bb28ae4982ed9f4c21f1c45f6b519
2021-05-11 14:18:39 -07:00
d49f6d556b [DataLoader] Fix tempfile binding and removing for torch_shm_manager (#57566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57566

Fix the problem that `tempfile` has never been deleted even after `torch_shm_manager` is destroyed.
- The previous implementation used the wrong path length for the Linux socket, so the last character of the tempfile's name was lost when binding the pathname to the socket. As a result, the file could never be deleted because its actual name didn't match the expected one.
- After we solved the racing problem by introducing a temporary directory, this became more dangerous: the tempfile persisting in the temporary directory prevented `torch_shm_manager` from deleting the directory.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28202866

Pulled By: ejguan

fbshipit-source-id: 912cfd8fec0cc309d47df223b2b0faa599c60799
2021-05-11 14:14:58 -07:00
1d4d9ffca0 [torch/elastic] Refactor rendezvous store initialization logic (#58057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58057

This PR refactors the store initialization logic and moves it to the `create_backend` function for both C10d and etcd backends.
ghstack-source-id: 128671579

Test Plan: Run the existing and revised tests.

Reviewed By: tierex

Differential Revision: D28356587

fbshipit-source-id: caf9416ab811eefe4834268d8a11a48f2236ed5b
2021-05-11 13:46:07 -07:00
b58a7c95aa [DataLoader] Raise detailed Error for ForwardRef type (#57824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57824

Implements a type check for string (forward-reference) types and re-raises a detailed exception at compile time.
```py
>>> class InvalidData(Generic[T_co], NamedTuple):  # Invalid generic namedtuple in Python typing
...     name: str
...     data: T_co

>>> class DP(IterDataPipe['InvalidData[int]']):
...     pass
TypeError: InvalidData[int] is not supported by Python typing
```

Adds a `__type_class__` attribute to the class, which optimizes the static-checking flow by reducing the number of checks.
```py
>>> class DP1(IterDataPipe[Union[int, str]]):
...     pass
>>> class DP2(DP1[int]):
...     pass
>>> list((cls, getattr(cls, '__type_class__', None)) for cls in DP2.__mro__)
[(<class '__main__.DP2'>, False), (<class 'abc.DP1[int]'>, True), (<class '__main__.DP1'>, False), (<class 'abc.IterableDataset[typing.Union[int, str]]'>, True), (<class 'torch.utils.data.dataset.IterableDataset'>, False), (<class 'torch.utils.data.dataset.Dataset'>, None), (<class 'typing.Generic'>, None), (<class 'object'>, None)]
```
Among the classes in `DP2`'s MRO, only `DP2` and `DP1` are statically checked, since their `__type_class__` is `False`. `abc.DP1[int]` and `abc.IterableDataset[typing.Union[int, str]]` are ignored, since they are just classes carrying typing information.

## Future
When Python 3.6 is deprecated, using TypeAlias rather than TypeMeta can eliminate the usage of the `__type_class__` attribute.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28289104

Pulled By: ejguan

fbshipit-source-id: 1da97460c8bfc48cea7396033fde484a24caba7c
2021-05-11 13:38:30 -07:00
dd876120f9 Out version for aten::repeat (#57683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57683

Support aten::repeat for static runtime

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D27639482

fbshipit-source-id: e6e706cb1d52750eea74f19536245f0484e945e6
2021-05-11 13:21:58 -07:00
86b7ae181a Automated submodule update: FBGEMM (#57983)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 566d74c27c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57983

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D28334558

fbshipit-source-id: fcc41aae7c8309e8baccbf71442436a1ebb42378
2021-05-11 12:42:11 -07:00
eb1ffa91d8 [pyper] allow static runtime on and glow on simultaneously (#57972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57972

Allow static runtime to be on when glow is on. This should be fine as long as glow AOT has already been run.

Test Plan: Test on replayer with remote_other net. D28291326 fixes remaining issue removing loops from the remote_other model. Need to test on regenerated model.

Reviewed By: hlu1

Differential Revision: D28275514

fbshipit-source-id: ee78972660dfdc3fcfb9af2cf7ebb19ee745a4f1
2021-05-11 12:24:07 -07:00
698be31262 Adding support for normalization of __is__ op (#57862)
Summary:
Normalizes `__is__` to `eq`, and `__isnot__` to `ne`, in the case of bools.
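
A hedged illustration of the equivalence the pass relies on (the rewrite itself happens in the peephole optimizer on the JIT graph):

```python
import torch

@torch.jit.script
def f(x: bool) -> bool:
    return x is True  # for bools, semantically the same as x == True

assert f(True) and not f(False)
```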

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57862

Test Plan:
```
python test/test_jit.py TestPeephole
```
11 Tests, 1 skipped, no failures
Fixes https://github.com/pytorch/pytorch/issues/57387

Reviewed By: eellison

Differential Revision: D28335646

Pulled By: Gamrix

fbshipit-source-id: c9f885044b32897ba35483091bcf7037759b7517
2021-05-11 12:20:47 -07:00
ad4cd6ef89 Revert D28338485: make ddp logging api to be private
Test Plan: revert-hammer

Differential Revision:
D28338485 (ac44569b0d)

Original commit changeset: bd2ae7c78904

fbshipit-source-id: d383f42a2051457147dec42ea273ed4fa82ffa1f
2021-05-11 12:12:51 -07:00
a02305925c [local lint] Force color output on mypy (#58071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58071

There's an environment variable that mypy will use to force color output, so turn that on if the runner detects a terminal.
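
A minimal sketch of the mechanism, assuming `MYPY_FORCE_COLOR` is the environment variable in question:

```python
import os
import subprocess
import sys

env = dict(os.environ)
if sys.stdout.isatty():  # the runner detected a terminal
    env["MYPY_FORCE_COLOR"] = "1"
subprocess.run(["mypy", "--config=mypy-strict.ini"], env=env)
```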

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28360742

Pulled By: driazati

fbshipit-source-id: c0dc372a44ab3a16e67115ce54784f4d5a4833ee
2021-05-11 12:07:08 -07:00
0da5421837 Doc deprecate norm and add seealso to linalg.norm (#57986)
Summary:
**BC-breaking note**

This PR updates the deprecation notice for torch.norm to point users to the new torch.linalg.vector_norm and torch.linalg.matrix_norm functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57986

Reviewed By: nikithamalgifb

Differential Revision: D28353625

Pulled By: heitorschueroff

fbshipit-source-id: 5de77d89f0e84945baa5fea91f73918dc7eeafd4
2021-05-11 12:02:12 -07:00
e385aa863a Add tools/ script to limit circleci to a set of jobs (#58001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58001

Adds a script so that devs can generate a commit (at the base of a stack) that removes all CI jobs but the set that they care about. See CONTRIBUTING.md changes for usage

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28359910

Pulled By: driazati

fbshipit-source-id: 2741570f2bab2c28f4a9d7aef727b1b2399d0ce1
2021-05-11 11:58:35 -07:00
18edb77a28 Add pad_sequence as a native function (#57868)
Summary:
https://github.com/pytorch/pytorch/issues/56229

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57868

Reviewed By: mruberry

Differential Revision: D28334174

Pulled By: ngimel

fbshipit-source-id: f1647718ada596686117703b682c0af7e92e16f5
2021-05-11 11:18:13 -07:00
ac44569b0d make ddp logging api to be private (#57999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999

Makes the DDP logging API private.
ghstack-source-id: 128607185

Test Plan: unit test

Reviewed By: rohan-varma

Differential Revision: D28338485

fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
2021-05-11 10:37:03 -07:00
e0539b0ba6 [DataLoader] Remove redundant len >= 0 (#57951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57951

As pmeier suggested in another PR, this just removes all redundant `len >= 0` checks for the prior DataPipe.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28325414

Pulled By: ejguan

fbshipit-source-id: 17497745fef1647c24a25f4ca08082dd4df6f4a7
2021-05-11 10:34:07 -07:00
7faac089ca Enable cusolver potrf batched for Cholesky decomposition when cuda >= 11.3 (#57788)
Summary:
This PR enables the usage of cusolver potrf batched as the backend of Cholesky decomposition (`torch.linalg.cholesky` and `torch.linalg.cholesky_ex`) when cuda version is greater than or equal to 11.3.

Benchmark available at https://github.com/xwang233/code-snippet/tree/master/linalg/cholesky-new. It is seen that cusolver potrf batched performs better than magma potrf batched in most cases.

## cholesky dispatch heuristics:

### before:

- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

### after:

cuda >= 11.3:
- batch size == 1: cusolver potrf
- batch size > 1: cusolver potrf batched

cuda < 11.3 (not changed):
- batch size == 1: cusolver potrf
- batch size > 1: magma xpotrf batched

 ---

See also https://github.com/pytorch/pytorch/issues/42666 #47953 https://github.com/pytorch/pytorch/issues/53104 #53879
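
For reference, a minimal batched call that exercises the heuristics above (backend selection happens internally; the input is an illustrative SPD batch):

```python
import torch

A = torch.randn(8, 4, 4, device="cuda")
A = A @ A.transpose(-2, -1) + 4 * torch.eye(4, device="cuda")  # make SPD

L = torch.linalg.cholesky(A)             # batch size > 1 path
L2, info = torch.linalg.cholesky_ex(A)   # also returns per-matrix info codes
assert torch.allclose(L, L2)
```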

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57788

Reviewed By: ngimel

Differential Revision: D28345530

Pulled By: mruberry

fbshipit-source-id: 3022cf73b2750e1953c0e00a9e8b093dfc551f61
2021-05-11 10:26:34 -07:00
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states, such as how many times a grad is calculated and whether the grad is used or not. It then queues a delay_all_reduce callback to all-reduce the grads.
2. Since an autograd callback is associated with the current target graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that we can queue the delay_all_reduce callback in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used or not and whether a grad is ready for communication.
4. Bucket rebuilding is done in the second iteration, after graph states have been recorded in the first iteration.
5. If the graph states change, DDP throws an error.
ghstack-source-id: 128599464
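
A minimal usage sketch (assuming a process group is already initialized and `rank` names the local GPU; `_set_static_graph` is private API, per the leading underscore):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(nn.Linear(10, 10).cuda(rank), device_ids=[rank])
model._set_static_graph()  # graph states are recorded during the first iteration
```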

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
502eb664ae OpInfo: chunk (#57935)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57935

Reviewed By: ngimel

Differential Revision: D28346217

Pulled By: mruberry

fbshipit-source-id: 331995aa18fd2983fc2122a9af31fba43ab9839c
2021-05-11 10:16:10 -07:00
90f05c005d refactor multi_head_attention_forward (#56674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56674

`torch.nn.functional.multi_head_attention_forward` supports a long tail of options and variations of the multihead attention computation. Its complexity is mostly due to arbitrating among options, preparing values in multiple ways, and so on - the attention computation itself is a small fraction of the implementation logic, which is relatively simple but can be hard to pick out.

The goal of this PR is to
- make the internal logic of `multi_head_attention_forward` less entangled and more readable, with the attention computation steps easily discernible from their surroundings.
- factor out simple helpers to perform the actual attention steps, with the aim of making them available to other attention-computing contexts.

Note that these changes should leave the signature and output of `multi_head_attention_forward` completely unchanged, so not BC-breaking. Later PRs should present new multihead attention entry points, but deprecating this one is out of scope for now.

Changes are in two parts:
- the implementation of `multi_head_attention_forward` has been extensively resequenced, which makes the rewrite look more total than it actually is. Changes to argument-processing logic are largely confined to a) minor perf tweaks/control flow tightening, b) error message improvements, and c) argument prep changes due to helper function factoring (e.g. merging `key_padding_mask` with `attn_mask` rather than applying them separately)
- factored helper functions are defined just above `multi_head_attention_forward`, with names prefixed with `_`. (A future PR may pair them with corresponding modules, but for now they're private.)
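
A hedged sketch of the `key_padding_mask`/`attn_mask` merge mentioned above (shapes are illustrative; the actual helper operates on the internal attention-score layout):

```python
import torch

B, L, S = 2, 5, 7
attn_mask = torch.zeros(L, S)                            # additive float mask
key_padding_mask = torch.zeros(B, S, dtype=torch.bool)
key_padding_mask[:, -2:] = True                          # last two keys are padding

# Fold the padding into the additive mask as -inf, so a single mask can be
# applied to the attention scores before softmax.
merged = attn_mask.expand(B, 1, L, S).masked_fill(
    key_padding_mask.view(B, 1, 1, S), float("-inf")
)
```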

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28344707

Pulled By: bhosmer

fbshipit-source-id: 3bd8beec515182c3c4c339efc3bec79c0865cb9a
2021-05-11 10:09:56 -07:00
4fb8676cea Add dot implementation for BFloat16 on CUDA (#57903)
Summary:
Enabled `dot` for BFloat16 on CUDA (version 11+).
It also enabled `matmul` & `vdot` for BFloat16.
Backward for `matmul` isn't supported for `BFloat16`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57903

Reviewed By: mruberry

Differential Revision: D28346031

Pulled By: ngimel

fbshipit-source-id: 0917e9e0d6cf3694f45fe1c7e76370581502036a
2021-05-11 09:46:58 -07:00
067147ac7d Enable BFloat16 for logaddexp & logaddexp2 on CUDA (#57908)
Summary:
Enabled BFloat16 for `logaddexp` & `logaddexp2` on CUDA, with a [workaround](https://github.com/pytorch/pytorch/pull/57908#issuecomment-837320532) suggested by zasdfgbnm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57908

Reviewed By: mruberry

Differential Revision: D28344976

Pulled By: ngimel

fbshipit-source-id: edef654b5819b236fbd9996f962115beb6e147e1
2021-05-11 09:44:15 -07:00
fa318911be Enable geometric ops, exp2, expm1, rsqrt & erfc for BFloat16 on CUDA (#57913)
Summary:
Ops enabled for BFloat16 on CUDA (12 in total):

`acos`
`asin`
`atan`
`cosh`
`sin`
`sinh`
`tan`
`sinc`
`exp2`
`erfc`
`expm1`
`rsqrt`

Enabled backward for `cos` on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57913

Reviewed By: mruberry

Differential Revision: D28342969

Pulled By: ngimel

fbshipit-source-id: 3c140fe408cbf93b21296a52d95ef0a0ccd96503
2021-05-11 09:43:05 -07:00
dbedb1fa1c [CUDA graphs] Sync after replay (#57556)
Summary:
Right now** there's a bug in libcuda.so that triggers sometimes when graphs with certain topologies are replayed back to back without a sync in between. Replays that hit this bug turn into spaghetti: kernels reordered ignoring dependencies, kernels elided, corrupted results. Currently, the only workaround I know that fixes all our repros is a manual sync between replays.

I'll remove the sync (or special case it based on cuda version) in a later PR, as soon as a fixed libcuda.so is available.

The only substantive change is the cudaDeviceSynchronize, other lines changed are de-indenting an unneeded scope.

** The bug is in current and semi-recent public versions of libcuda.so. We discovered the bug recently and we're not sure yet which public release was first affected. The version that ships with 11.3 is definitely affected, versions that shipped with 11.1 and earlier are likely not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57556

Reviewed By: mruberry

Differential Revision: D28343043

Pulled By: ngimel

fbshipit-source-id: 3b907241aebdb8ad47ae96a6314a8b02de7bfa77
2021-05-11 09:38:47 -07:00
565550d89a [iOS GPU][perf][5/n] Replace std:vector with IntArrayRef and SmallVector (#57668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57668

As title says
ghstack-source-id: 128338530

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D28053671

fbshipit-source-id: 8052398a1f31dc34f427e8eecb31ddf7a27a0754
2021-05-11 09:33:02 -07:00
dc55ab3f77 [fbgemm] fix bug handling bias in rowwise quantization of FC (#58022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58022

Caffe2 Int8FC + rowwise quantization was not handling bias correctly.

Test Plan: The example in D28347336 doesn't show bigger error with rowwise quantization any more

Reviewed By: hx89, janeyx99

Differential Revision: D28347336

fbshipit-source-id: 3ac95fd2f29ef6e52705c3a2361b605813c2bcc5
2021-05-11 08:38:39 -07:00
3e46d6c9e4 Update docs to mention CUDA support for Future (#50048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50048

To reflect the many changes introduced recently.

In my mind, CUDAFuture should be considered a "private" subclass, which in practice should always be returned as a downcast pointer to an ivalue::Future. Hence, we should document the CUDA behavior in the superclass, even if it's CUDA-agnostic, since that's the interface the users will see also for CUDA-enabled futures.
ghstack-source-id: 128640983

Test Plan: Built locally and looked at them.

Reviewed By: mrshenli

Differential Revision: D25757474

fbshipit-source-id: c6f66ba88fa6c4fc33601f31136422d6cf147203
2021-05-11 08:26:33 -07:00
9e94921a55 combine consecutive layes on the same device (#55973)
Summary:
Implements proposal https://github.com/pytorch/pytorch/issues/53438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55973

Reviewed By: pritamdamania87

Differential Revision: D28340034

Pulled By: mrzzd

fbshipit-source-id: d7fe476c0364603f36d41f348769245dac0acd88
2021-05-11 08:04:08 -07:00
cf7a0e5af4 Use RPC context streams to cover serde ops (#57926)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57926

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28316526

Pulled By: mrshenli

fbshipit-source-id: 1907ec8f46e40fa5049d810c6ad959263361b6aa
2021-05-11 07:07:51 -07:00
0d564904b5 [iOS GPU][Perf][4/n] Reuse the same command buffer when copying results to CPU (#57667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57667

Context - https://fb.workplace.com/groups/pytorch.edge.team/permalink/855194118368662/

Got 5% win for mobilenetv2 and unet
ghstack-source-id: 128338532

Test Plan: - CI

Reviewed By: kimishpatel

Differential Revision: D28116806

fbshipit-source-id: b9c766c58ae41f3408724ec962695f38985ace05
2021-05-11 04:47:58 -07:00
43f6deb6e4 Deprecate chain_matmul (#57735)
Summary:
This one's easy. I also included a bugfix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57735

Reviewed By: bdhirsh

Differential Revision: D28318277

Pulled By: mruberry

fbshipit-source-id: c3c4546a11ba5b555b99ee79b1ce6c0649fa7323
2021-05-11 00:09:36 -07:00
7707efed8f Deprecate matrix_rank (#57734)
Summary:
This one's straightforward

**BC-breaking Note**

This PR deprecates matrix_rank in favor of linalg.matrix_rank. An upgrade guide from matrix_rank to linalg.matrix_rank is provided in the documentation of matrix_rank.

It DOES NOT remove matrix_rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57734

Reviewed By: bdhirsh

Differential Revision: D28318301

Pulled By: mruberry

fbshipit-source-id: b9a27f58fdad72f408ca8b83a70c9b1fc2ef28e9
2021-05-10 23:58:46 -07:00
415ae54c31 Deprecate torch.eig (#57727)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57727

Reviewed By: bdhirsh

Differential Revision: D28317984

Pulled By: mruberry

fbshipit-source-id: fa1aa1b78fd3611ac208bca93e2b745a1bac41f1
2021-05-10 23:31:02 -07:00
ee48bd089c Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#55189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55189

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets' dtype to the indices' dtype when they are not the same.
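
A minimal sketch of the now-supported mixed-dtype call (values are illustrative):

```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3)
indices = torch.tensor([1, 2, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 2], dtype=torch.int32)  # int32 offsets, int64 indices

# Offsets are cast internally to the indices' dtype when they differ.
out = F.embedding_bag(indices, weight, offsets)
```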

Test Plan: unit tests

Reviewed By: allwu

Differential Revision: D27482738

fbshipit-source-id: deeadd391d49ff65d17d016092df1839b82806cc
2021-05-10 23:23:50 -07:00
3ec16035f2 TST Migrates some of test_nn.py from assertEqualIgnoreTypes to assertEqual (#57642)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38095, https://github.com/pytorch/pytorch/issues/50006

Migrates some of `test_nn.py` from `assertEqualIgnoreTypes` to `assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57642

Reviewed By: bdhirsh

Differential Revision: D28317761

Pulled By: mruberry

fbshipit-source-id: 6bea6f669569922b2a391d1523917edde976f014
2021-05-10 23:10:29 -07:00
24087d07ca Deprecate QR (#57745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57745

Reviewed By: bdhirsh

Differential Revision: D28318164

Pulled By: mruberry

fbshipit-source-id: b8e3cb9d7ab33f30c8653ec39f932a8af8bd2a50
2021-05-10 22:56:37 -07:00
4fef1c1d74 Deprecate torch.cholesky (#57725)
Summary:
**BC-breaking note:**

This PR deprecates torch.cholesky in favor of torch.linalg.cholesky. An upgrade guide is added to the documentation for torch.cholesky.

Note this PR DOES NOT remove torch.cholesky.
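
A minimal sketch of that upgrade (both calls return the lower-triangular factor by default; the input is made SPD by construction):

```python
import torch

A = torch.randn(3, 3)
A = A @ A.T + 3 * torch.eye(3)     # symmetric positive definite

L_old = torch.cholesky(A)          # deprecated (upper=False by default)
L_new = torch.linalg.cholesky(A)   # replacement
assert torch.allclose(L_old, L_new)
```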

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57725

Reviewed By: bdhirsh

Differential Revision: D28318260

Pulled By: mruberry

fbshipit-source-id: e7ba049321810e70f4de08e6ac37ff800e576152
2021-05-10 22:44:25 -07:00
f3e014f37b [iOS GPU][Perf][3/n] Cache the compuation pipeline state object (#57666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57666

Got 5% improvement on mobilenetv2 and Unet

1. `std::unordered_map` is faster than `NSMutableDictionary`
2. `std::string` is cheaper than `NSString`
ghstack-source-id: 128338531

Test Plan: CI

Reviewed By: kimishpatel, SS-JIA

Differential Revision: D28048992

fbshipit-source-id: fc4f7e41928c524acde48947d2cd6b9f6ef7cbc8
2021-05-10 22:41:57 -07:00
36a22967b7 [fx ir] Handle the case when output consumes get_attr directly (#57844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57844

Reviewed By: 842974287

Differential Revision: D28294298

fbshipit-source-id: db337fadca9f10f208324c9da6d95620178a189b
2021-05-10 22:04:43 -07:00
a93314dec3 Alias det, slogdet, matrix_power, inverse, pinverse (#57821)
Summary:
When doing this, I realised that `torch.linalg.pinv` did not have a note on the problems of its derivative (`torch.pinverse` did have it), so I added that.

While I was at it, I made the recommendation to prefer other functions a bit more explicit for some functions in `torch.linalg`. I also changed the mentions of "stable" to "numerically stable", as discussed with IvanYashchuk and mruberry.

If it seems like too much, I'm happy to move the recommendations part of `torch.linalg` to a different PR, but it was such a small thing that I figured it wouldn't be that big a deal if it was here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57821

Reviewed By: bdhirsh

Differential Revision: D28317959

Pulled By: mruberry

fbshipit-source-id: 6b116561bf3cba46fadc5ac14448e5d28ea88039
2021-05-10 22:00:59 -07:00
ba84c91197 Deprecate torch.lstsq (#57743)
Summary:
**BC-breaking note:**

This PR deprecates torch.lstsq; it adds an upgrade guide for how to use torch.linalg.lstsq instead.

It DOES NOT remove torch.lstsq, but warns once when it's called
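
A minimal sketch of the upgrade (note the swapped argument order and the named-tuple result; the old API padded its solution to `A.shape[0]` rows):

```python
import torch

A = torch.randn(5, 3)
B = torch.randn(5, 2)

X_old, _ = torch.lstsq(B, A)               # deprecated: (B, A) argument order
X_new = torch.linalg.lstsq(A, B).solution  # replacement: (A, B) argument order
assert torch.allclose(X_old[:3], X_new, atol=1e-5)
```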

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57743

Reviewed By: bdhirsh

Differential Revision: D28318196

Pulled By: mruberry

fbshipit-source-id: 0d6df29648a91a44c7d0ac58062c1099fcb61fb8
2021-05-10 21:39:19 -07:00
5840c8cfd8 [nccl] log rank when communicator is aborted (#57974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57974

We see this error quite a bit in internal workflows; it would be useful
to have this additional logging information here.
ghstack-source-id: 128602199

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28331693

fbshipit-source-id: 25398c6a3420a2b594d79aa8f46936cd0addd426
2021-05-10 21:23:31 -07:00
a0d686c9cd OpInfo: select (#57731)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57731

Reviewed By: bdhirsh

Differential Revision: D28318229

Pulled By: mruberry

fbshipit-source-id: ec9058fd188b82de80d3a2f1a1ba07f36d8d0741
2021-05-10 21:18:58 -07:00
e90fcffb65 [c10d] Log when store based barrier succeeds (#57711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711

We are seeing some hangs/issues around the store-based barrier internally; it would
be good to have this log to indicate whether the store-based barrier has completed
successfully or not for a particular rank, to help debug further.
ghstack-source-id: 128605600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28249087

fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
2021-05-10 21:09:40 -07:00
71ca3e99af Only use actually mismatched elements for reporting in torch.testing (#57923)
Summary:
Redo of https://github.com/pytorch/pytorch/issues/57135 out of stack

 ---

Currently all values are used for the reported absolute and relative differences. This usually works fine, but breaks down for the extremals:

```python
torch.testing.assert_close(torch.tensor([1.0, 0.0]), torch.tensor([2.0, 0.0]))
```

```
[...]
Greatest absolute difference: 1.0 at 0 (up to 1e-05 allowed)
Greatest relative difference: nan at 1 (up to 1.3e-06 allowed)
```

Although the second element is matching, it is listed as the offender for the greatest relative difference. The `NaN` stems from the `0 / 0` division.

To overcome this, we should only use the values that were considered a mismatch for the reported stats.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57923

Reviewed By: ngimel

Differential Revision: D28317316

Pulled By: mruberry

fbshipit-source-id: 4c604493bbe13b37f41225ea9af9e839a7304161
2021-05-10 20:58:47 -07:00
c714596027 [kineto] Update Kineto submodule, cupti library paths (#57789)
Summary:
Update kineto submodule, improve cupti detection

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57789

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28297175

Pulled By: ilia-cher

fbshipit-source-id: 5895270fae160097ae8872a592984d0e4a1b187b
2021-05-10 19:15:59 -07:00
f97650e70b [nnc] Fix float->bool conversion on cpu (#57798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57798

Our instruction sequence was just plain wrong: instead of `fcmp une %x, +0.0`
(unordered-or-not-equal to 0.0), we were emitting `fcmp uno`, which is just an unordered check
(i.e., is either side NaN?).
ghstack-source-id: 128586464
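
A minimal sketch of the intended semantics (bool conversion is `x != 0.0`, which is true for NaN; that is what `une`, unordered-or-not-equal, encodes):

```python
import torch

x = torch.tensor([0.0, 1.0, float("nan")])
print(x.to(torch.bool))  # tensor([False, True, True])
```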

Test Plan: New unit test against the full cross-product of dtypes.

Reviewed By: navahgar

Differential Revision: D28276269

fbshipit-source-id: ba5e59778e07770fb78ef02309f10edde333a800
2021-05-10 18:31:38 -07:00
b8ca1219de Add tests for custom state_dict save/load methods in TorchScript (#57886)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57886

Reviewed By: jamesr66a

Differential Revision: D28309228

Pulled By: gmagogsfm

fbshipit-source-id: 6ac60b1d4a8017aefb6f6dff49cde598de000265
2021-05-10 18:04:56 -07:00
fc9c486044 Add enabling default instructions flag for mobile (#57778)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57778

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D28268997

Pulled By: tugsbayasgalan

fbshipit-source-id: 5571b233d03d3aa80c820ee4245b4d0d3b70f924
2021-05-10 17:26:05 -07:00
38500d5d7b [RPC Framework] Move the annotation w/ bold effect out of the quotes (#57965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57965

The bold effect does not work under quotes, so move it out.
ghstack-source-id: 128570357

Test Plan:
locally view

{F614715259}

Reviewed By: rohan-varma

Differential Revision: D28329694

fbshipit-source-id: 299b427f4c0701ba70c84148f65203a6e2d6ac61
2021-05-10 16:51:23 -07:00
747312bf61 Support for accumulate nodes traversal and to access op names in the compare function (#57685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57685

- Accumulate traversal : `minimizer.settings.traverse_method = "accumulate" `
   - Feature
   - net_min_tests
- Return op name to the compare function so that we can map the cosine similarity to the individual ops
- Fix the settings combinations in net_min_tests

Test Plan:
buck test glow/fb/nnpi/lowering:net_min_tests

NNPI_LOG_LEVEL=5 USE_INF_API=1 buck run mode/opt -j 12 --config fbcode//cxx.link_weight=3 --config misc.strip_binaries=debug-non-line -c glow.nnpi_project_name='fb-nnpi-nextgen' ai_codesign/video/inference:xrayvideo_2019a_eval -- --job create --model_a model_prod --device_a PTCPU --trace_a none --model_b model_v3 --device_b NNPI --trace_b fusion --replace_b true --log_level INFO --use_scrambled false --save_repro false --num_ab_runs 0 --symbolic_trace_b true --save_modified_model_b false

USE_INF_API=1 buck test glow/fb/nnpi/lowering:net_min_tests

Reviewed By: 842974287

Differential Revision: D27867010

fbshipit-source-id: 6a756468b1f1fe24ef0400669d911825a7562484
2021-05-10 15:52:17 -07:00
036167111d Revert D28294662: [pytorch][PR] add cuda memory and distributed metadata
Test Plan: revert-hammer

Differential Revision:
D28294662 (98fcdb8005)

Original commit changeset: 3c71ffa333e3

fbshipit-source-id: 7c96e13b227fe0dff60ccb1c57cfd6790f8591b7
2021-05-10 15:28:53 -07:00
36172f347a [iOS GPU][Perf][2/n] Prepack linear + Fuse relu/hardtanh (#57665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57665

Prepacking linear weights in JIT passes gives us a 25% win on mobilenetv2.
ghstack-source-id: 128338535

allow-large-files

Test Plan: - CI

Reviewed By: kimishpatel

Differential Revision: D28033081

fbshipit-source-id: b006313f6b94b31b8d7ddacc0165ceab5a23dce9
2021-05-10 14:58:05 -07:00
f1d01b9488 Disable test for quicklint (#57968)
Summary:
Disabling until we fix https://github.com/pytorch/pytorch/issues/57967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57968

Pulled By: driazati

Reviewed By: samestep

Differential Revision: D28330226

fbshipit-source-id: 7ea130e0cf7b94959a7db18838d21e4711716625
2021-05-10 14:40:59 -07:00
f0f69c5dc1 torch.where is now mentioning Bool rather than Byte when given wrong dtype mask (#57942)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49769

torch.where is now mentioning Bool instead of Byte when given wrong dtype mask

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57942

Reviewed By: bdhirsh

Differential Revision: D28330921

Pulled By: ngimel

fbshipit-source-id: 44a01e1daf1790308804ca7bb606f745c3eb71e1
2021-05-10 14:36:45 -07:00
ebb1b74f65 Fix json parse error for profiler call stack (#57099)
Summary:
The call stack in the profiler's result JSON file lacks surrounding double quotes, resulting in a JSON parse error.
This PR just adds them.

ilia-cher

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57099

Reviewed By: gdankel

Differential Revision: D28324182

Pulled By: ilia-cher

fbshipit-source-id: dc479a023bb25de27c414629a27d624d64457c3e
2021-05-10 13:46:01 -07:00
98fcdb8005 add cuda memory and distributed metadata (#57252)
Summary:
Implementation for https://github.com/pytorch/kineto/issues/155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57252

Reviewed By: gdankel

Differential Revision: D28294662

Pulled By: ilia-cher

fbshipit-source-id: 3c71ffa333e341ff8113e891681a4905f54802dc
2021-05-10 13:29:18 -07:00
ba07aaf211 Fix typo in warning for spawn method (#57927)
Summary:
Fix typo in warning for spawn method

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57927

Reviewed By: suo

Differential Revision: D28326390

Pulled By: bdhirsh

fbshipit-source-id: b0c12b1020d713865687f94f28ab2873ae260c23
2021-05-10 13:12:38 -07:00
19706d91cd [vulkan] Add sigmoid activation functions (#57867)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57867

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D28324220

Pulled By: SS-JIA

fbshipit-source-id: eae9d8ecca1c641cb7b356db66c368304bc92311
2021-05-10 12:41:10 -07:00
481806be97 Fix creation_meta for multi view outputs in NoGradMode/InferenceMode. (#57842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57842

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28295649

Pulled By: ailzhang

fbshipit-source-id: e0e11f537a97825e3fb7255aa561d3e855a6d3ce
2021-05-10 12:37:30 -07:00
478f639779 [Vulkan] Fix seg fault during descriptor set allocation on some platforms (#57825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57825

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D28294939

Pulled By: SS-JIA

fbshipit-source-id: a7847fdc118ce96bfee83e755888e28dd605c1fb
2021-05-10 12:33:49 -07:00
e1cbc43f50 Use tools/print_test_stats.py in GHA (#57647)
Summary:
Judging from https://github.com/pytorch/pytorch/issues/57584, it seems like the test-reports artifact was originally intended to be downloaded to `$PWD/test-reports` instead of just directly into `$PWD`. To minimize confusion, this PR changes it to download into `test/test-reports`, which should match where the files came from in the `test` step anyway.

TODOs:

- [x] change the extract path for test-reports
- [x] install Python dependencies
- [x] call `tools/print_test_stats.py`
- [x] use deep clone to allow `git` commands
- [x] correctly set `CIRCLE_*` environment variables
- [x] set Scribe credentials
- [x] set AWS credentials

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57647

Test Plan: CI.

Reviewed By: seemethere

Differential Revision: D28325833

Pulled By: samestep

fbshipit-source-id: cc322bad76747f59b764a1a0a863153bb26095e7
2021-05-10 12:29:40 -07:00
bf053a1296 Fix hasattr support type (#57950)
Summary:
`hasattr` is only partially supported; this PR fixes its classification in the builtin table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57950

Reviewed By: pbelevich

Differential Revision: D28329005

Pulled By: nikithamalgifb

fbshipit-source-id: c4cfba9badcc8f7cbc8250a5c21dfb62b35a83fc
2021-05-10 12:21:56 -07:00
fea3824214 Ensure torch.save() deterministic output (#57536)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42163.

## Pitch

Currently, the binary outputs produced by `torch.save()` are non-deterministic (as pointed out in https://github.com/pytorch/pytorch/issues/42163). This means that running a simple snippet that creates a tensor (or a model) twice will produce output files with a different `md5` sum.

**Why does this occur?**
The cause of this behavior lies in the fact that the `obj._cdata` is used to identify a tensor and is written to a file, but the `_cdata` attribute is of course non-deterministic:
a80b215a9a/torch/serialization.py (L416)

**Why does this matter?**
Reproducibility is essential for many Machine Learning projects.
For instance, when using [`dvc`](https://dvc.org/) you would expect that if none of the dependencies of a stage of an ML pipeline has changed, then running the same stage another time will produce the same binary output. For the reasons explained above, with `torch` this was not the case, so this PR tries to fix this issue.

## Content of this PR
### What changes?
- The `persistent_id()` function now returns a deterministic value, rather than `obj._cdata` (which depends on runtime).
- As a consequence, `torch.save(obj, "output.pt")` produces a deterministic output, i.e. the `md5` hash of `output.pt` is deterministic. See **Test 1** and **Test 2** below.

### What does not change?
- If an `obj` contains several tensors that share the same underlying data (e.g. they are views of the same tensor), the `obj_key` returned by `persistent_id()` is still going to be the same for all of them
- As a consequence, serialization optimizes disk storage by storing only necessary tensors, rather than writing one tensor per view. See **Test 3** below.

## How to test

### Test 1: snipped from https://github.com/pytorch/pytorch/issues/42163
Consider the following `snippet_1.py` (from https://github.com/pytorch/pytorch/issues/42163).
```python
import hashlib
import torch

def get_sha256_hash(file: str, chunk_size: int = 4096) -> str:
    hasher = hashlib.sha256()
    with open(file, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

file = "tensor.pt"
hashes = []
for _ in range(5):
    obj = torch.ones(1)
    torch.save(obj, file)
    hashes.append(get_sha256_hash(file)[:8])
    del obj

hash = hashes[0]
assert all(other == hash for other in hashes[1:])
print(hash)
```

On `master` you obtain an error
```bash
$ python snippet_1.py
Traceback (most recent call last):
  File "save_tensor.py", line 84, in <module>
    assert all(other == hash for other in hashes[1:])
AssertionError
```
while on this PR branch you should get the following consistent behaviour:
```bash
$ for run in {1..2}; do python snippet_1.py; done
600a83cb
600a83cb
```

### Test 2: Deterministic save of `Tensor` and `nn.Module` instances
Consider the following `snippet_2.py`
```python
import torch
torch.manual_seed(0)
x = torch.tensor([8., 8., 5., 0.])
torch.save(x, "out_tensor.pt")

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
torch.save(model, "out_model.pt")
```
On `master` branch, the `md5` hash of `out_tensor.pt` and `out_model.pt` are non-determinstic, for instance you may get
```bash
$ for run in {1..2}; do python snippet_2.py; md5 out_*pt; done
MD5 (out_model.pt) = 92dca4a310b691e893f3cb41d64d5af1
MD5 (out_tensor.pt) = a4ef290583f50a9c203a42d0cfc078af
MD5 (out_model.pt) = de3cb9791a66af8aed77ed7224bd1d5c
MD5 (out_tensor.pt) = 3b8a6009d3a0be5b9dd94152dcc0c7cb
```
while on this PR branch you should get the following consistent behaviour:
```bash
$ for run in {1..2}; do python snippet_2.py; md5 out_*pt; done
MD5 (out_model.pt) = dba75fd50a190e4e7fa89b7a2477bab7
MD5 (out_tensor.pt) = 029f52f0706d6c813cc796d3cdcd3eb0
MD5 (out_model.pt) = dba75fd50a190e4e7fa89b7a2477bab7
MD5 (out_tensor.pt) = 029f52f0706d6c813cc796d3cdcd3eb0
```

### Test 3: Views of the same tensor are not re-written to file
Consider the following `snippet_3.py`.
```python
import torch
torch.manual_seed(0)
x = torch.rand(1_000, 1_000)
y = x.T
z = x.view(1_000_000, 1)

torch.save({"x": x}, "out_tensor_x.pt")
torch.save({"x": x, "y": y, "z": z}, "out_tensor_xyz.pt")
```
Both on `master` branch and on this  PR branch you should get two output files with same size:
```bash
$ python snippet_3.py && du -sh out_tensor*pt && md5 out_*pt
3.8M    out_tensor_x.pt
3.8M    out_tensor_xyz.pt
MD5 (out_tensor_x.pt) = eda516d9156177b27bdc2a75c9064d9b
MD5 (out_tensor_xyz.pt) = 333b869f5b93ced7b8649ab1571eb8e3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57536

Reviewed By: bdhirsh

Differential Revision: D28304728

Pulled By: ailzhang

fbshipit-source-id: 49788e566a3cd2c6c36dc801e6bdd8f42c9459cb
2021-05-10 11:51:55 -07:00
fe3c63d9d3 [DDP] fix param to name mapping (#57771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771

This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28265636

fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
2021-05-10 11:47:46 -07:00
b587354e4c Add Python-3.9 CI testing (#50992)
Summary:
Skips a number of tests and adjusts typing handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50992

Reviewed By: walterddr

Differential Revision: D26170388

Pulled By: malfet

fbshipit-source-id: 47852512aa3d5c25faf6687bcd0b1cbb332b0b20
2021-05-10 10:51:39 -07:00
29753339b7 Do not download slow test when on sandcastle (#57953)
Summary:
Downloading the slow_test list on Sandcastle causes timeouts; this is an even bigger issue since `common_utils.py` is reused in many internal projects/modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57953

Test Plan: CI

Reviewed By: janeyx99

Differential Revision: D28325527

fbshipit-source-id: ae47c9e43ad6f416008005bb26ceb2f3d6966f2e
2021-05-10 10:39:10 -07:00
710a83d09f Remove code and logic for old style custom autograd Function (#57357)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30696

### Release Notes
Instantiating a custom autograd function is now deprecated. Users should call `.apply()` on the class itself because it is a static method.

--end release notes--
 - There are a couple error messages that we can't entirely remove because accessing these attributes of the autograd function instance may segfault (due to cdata being nullptr). Also added a TORCH_CHECK for the name attribute which previously segfaulted.
 - Error message updated to convey 1) old-style functions have been deprecated 2) this access pattern was once valid
 - Updates variable -> Tensor for some error messages
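
For reference, a minimal sketch of the supported (new-style) pattern: `forward`/`backward` as static methods, invoked via `.apply()` on the class itself.

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(3, requires_grad=True)
y = Square.apply(x)  # not Square()(x): instantiation is deprecated
y.sum().backward()
```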

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57357

Reviewed By: mrshenli

Differential Revision: D28193095

Pulled By: soulitzer

fbshipit-source-id: f021b105e9a3fd4a20d6ee3dfb6a06a8c34b10ca
2021-05-10 10:26:06 -07:00
d115e81a32 Fix document around DDP uneven inputs (#57448)
Summary:
Typo fix and additional clarifications on the API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57448

Reviewed By: SciPioneer

Differential Revision: D28153264

Pulled By: rohan-varma

fbshipit-source-id: 9bd35d918299ad7e080785d755f97b966f826615
2021-05-10 09:33:59 -07:00
4d181ba51c Port maximum and minimum to structured (#57630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57630

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224831

Pulled By: ezyang

fbshipit-source-id: de4c40560613b68473aa53bb7424476dc558a6b2
2021-05-10 08:51:24 -07:00
727c1d69d7 Remove unnecessary indirection through torch::autograd::impl::pyobj/set_pyobj (#57733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57733

I'm going to be modifying the APIs here, so the less API surface
covering these functions the better.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28289082

Pulled By: ezyang

fbshipit-source-id: 4b71270bb82e0d6baa4dfed2f2e4ee8831f590b5
2021-05-10 08:18:33 -07:00
807bea1c4e [JIT] initial support for PEP-585 types (#57363)
Summary:
Relates to https://github.com/pytorch/pytorch/issues/56210. Initial attempt to add support for the PEP 585 `list`, `tuple`, and `dict` types.
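
A hedged sketch of what the new support enables (PEP 585 builtin generics require Python 3.9 to be subscriptable at runtime):

```python
import torch

@torch.jit.script
def tally(xs: list[int]) -> dict[str, int]:
    return {"total": sum(xs), "count": len(xs)}

print(tally([1, 2, 3]))  # {'total': 6, 'count': 3}
```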

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57363

Test Plan:
- newly added `test_pep585_type`
- CI

Reviewed By: ngimel

Differential Revision: D28128230

Pulled By: walterddr

fbshipit-source-id: e5ba487dfd8c42e89f851d22b3aebfa56dd419bf
2021-05-10 07:29:10 -07:00
bc798cdc1d Add run_master_build workflow (#57899)
Summary:
Automatically generates this workflow by filtering all jobs that have the *filters:branches:only:master* restriction.

Adds a probot config to schedule this workflow if the `ci/master` label is set on a PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57899

Reviewed By: walterddr

Differential Revision: D28311838

Pulled By: malfet

fbshipit-source-id: 63df81212279f5edd8463d1f6b22f37253c53a98
2021-05-10 07:10:05 -07:00
ece15f6902 [DataLoader] Change Decoder signature and add MatHandler (#57391)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57391

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28151601

Pulled By: ejguan

fbshipit-source-id: 34814197d2f068cab0c7ca2330152ad588eb1ef0
2021-05-10 06:29:00 -07:00
cbfce376a8 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28319469

fbshipit-source-id: 8295597a8ee16b2fef3f7aacdd6c892cb22db988
2021-05-10 03:39:31 -07:00
b84a28b50a tweak sync note wording for linalg docs (#57924)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57924

Reviewed By: mruberry

Differential Revision: D28317350

Pulled By: ngimel

fbshipit-source-id: 2411edb392ba52621941269c4835dfe573a6225a
2021-05-10 02:40:36 -07:00
3c87fe9b14 Revert D28117714: [pytorch][PR] ATen lu_unpack. Required for making torch.lu_solve differentiable.
Test Plan: revert-hammer

Differential Revision:
D28117714 (5c67d8dfd3)

Original commit changeset: befd33db12ec

fbshipit-source-id: 295b2134935542a903a73f90a7998239dfe6cc81
2021-05-09 23:20:06 -07:00
259d19a733 [JIT] Adding a concat optimization pass (#55474)
Summary:
This PR adds a new pass in JIT that optimizes `aten::cat` ops.

Specifically, here are optimizations performed:
* Eliminate redundant `cat` inputs by performing CSE on the list of inputs.
   - This includes eliminating fully redundant `cat` ops when all the inputs are the same, as well as the case when "all but one" of the inputs have already been concatenated.
* Expand `cat` into multiple copies and eliminate redundancies.
   - This also includes eliminating redundancies in the underlying buffers used for `cat`.

These optimizations are not enabled in any compilation flow at this point.
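
A hedged illustration of the input patterns the pass targets (the rewrite happens on the JIT graph, not in eager mode):

```python
import torch

x = torch.randn(2, 3)
y = torch.cat([x, x])  # duplicate inputs: CSE can deduplicate the list
z = torch.cat([y, x])  # "all but one" of z's effective inputs already live in y
```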

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55474

Reviewed By: albanD

Differential Revision: D27624511

Pulled By: navahgar

fbshipit-source-id: d509289fafc23e73b02f64a90219148896817339
2021-05-09 22:06:44 -07:00
e7e73192f6 Added cuBLAS path for torch.linalg.lstsq (#54725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54725

cuBLAS's gelsBatched is faster than MAGMA's for matrices with fewer than 128 rows.

Performance comparison cuSOLVER vs cuBLAS: https://github.com/pytorch/pytorch/pull/54725#issuecomment-832234456.
Performance comparison MAGMA vs cuBLAS: https://github.com/pytorch/pytorch/pull/54725#issuecomment-827649039.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28248803

Pulled By: mruberry

fbshipit-source-id: d3661bccb85c6fc1cee3a246ae8233492964f400
2021-05-09 21:20:16 -07:00
d11cce4f5e Add cuSOLVER path for torch.linalg.lstsq (#57317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57317

This PR implements QR-based least squares solver using geqrf, ormqr, and
triangular_solve operations.

Internal code of triangular_solve was fixed to handle correctly larger
sized rectangular arrays.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28312683

Pulled By: mruberry

fbshipit-source-id: dc8ae837a5fb0685d85c8733a47d7d25dc46443a
2021-05-09 21:19:10 -07:00
300363b54f CLN Removes unused RReLU code (#57672)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/49788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57672

Reviewed By: mruberry

Differential Revision: D28291489

Pulled By: ngimel

fbshipit-source-id: 4691051165756d38ef37b48a78f456fa44d27022
2021-05-09 20:59:03 -07:00
50e22e1e08 Remove tmp folder when run unit test (#57800)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57799

Removes the tmp folder when running cpp_api_parity test cases one by one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57800

Reviewed By: mruberry

Differential Revision: D28303956

Pulled By: ngimel

fbshipit-source-id: ec313d0c14ae432bc3862988eb00742810ef53e2
2021-05-09 20:07:14 -07:00
5c67d8dfd3 ATen lu_unpack. Required for making torch.lu_solve differentiable. (#46913)
Summary:
Backward methods for `torch.lu` and `torch.lu_solve` require the `torch.lu_unpack` method.
However, `torch.lu` is a Python wrapper over a native function, so its gradient can be implemented via `autograd.Function`; `torch.lu_solve`, on the other hand, is a native function, so it cannot access `torch.lu_unpack`, which is implemented in Python.

Hence this PR presents a native (ATen) `lu_unpack` version. It is also possible to update the gradients for `torch.lu` so that backward+JIT is supported (no JIT for `autograd.Function`) with this function.

~~The interface for this method is different from the original `torch.lu_unpack`, so it is decided to keep it hidden.~~
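
A minimal usage sketch of the op in question (an eager-mode call; the PR's point is the native implementation behind it):

```python
import torch

A = torch.randn(3, 3)
LU, pivots = torch.lu(A)
P, L, U = torch.lu_unpack(LU, pivots)
assert torch.allclose(P @ L @ U, A, atol=1e-6)
```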

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46913

Reviewed By: astaff

Differential Revision: D28117714

Pulled By: mruberry

fbshipit-source-id: befd33db12ecc147afacac792418b6f4948fa4a4
2021-05-09 19:12:56 -07:00
fc55290e5b Fix distributed autograd gradients synchronization (#57792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57792

There are two problems when using CUDA RPC with distributed autograd
and distributed optimizer:

1) In local autograd engine, all autograd functions/nodes, including
AccumualteGrad will use the forward stream for backward computation.
But distributed autograd skips AccumulateGrad autograd function/node
and directly calls into `AccumulateGrad::accumulateGrad`. As the
result, it will use the default stream to accumulate gradients
instead of the forward stream. This commit changes that and uses the
forward stream to accumulate gradients, matching forward behavior.
2) Distributed optimizer and distributed autograd backward are
separate RPC calls, and CUDA streams are not synchronized across
different RPC calls. As a result, distributed optimizer might
consume gradients before they are ready. This commit uses CUDA
events to record the completion of gradient computation, and use
those events to block current streams when getGradients() are called.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D28274876

Pulled By: mrshenli

fbshipit-source-id: 22e607152324ae918084066cde8c5dbb418bba7c
2021-05-09 17:32:59 -07:00
14282232d9 Fix generate_not_implemented_tests not testing unknown types correctly (#56997)
Summary:
Currently, the test code does not test unknown types correctly because `op` is overwritten in the for-loop (i.e., currently only `__ior__` is tested).
This PR fixes the test `generate_not_implemented_tests` to bind the operator name to each method, and removes operators that are currently unsupported (`__rand__`, …).

cc/ mruberry This fix is needed to add tests for the operators we are going to introduce (e.g., `__rand__`)
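
A minimal sketch of the late-binding pitfall being fixed (names are illustrative):

```python
ops = ["__iand__", "__ior__"]

broken = [lambda: op for op in ops]       # every lambda sees the final `op`
assert [f() for f in broken] == ["__ior__", "__ior__"]

fixed = [lambda op=op: op for op in ops]  # bind per iteration via a default arg
assert [f() for f in fixed] == ["__iand__", "__ior__"]
```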

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56997

Reviewed By: astaff

Differential Revision: D28118465

Pulled By: mruberry

fbshipit-source-id: c5a466a7604262ed5490862300d47043aff63d0b
2021-05-09 05:34:10 -07:00
4cf2c646c2 Added torch.linalg.matrix_norm (#57127)
Summary:
This PR is focused on  the API for `linalg.matrix_norm` and delegates computations to `linalg.norm` for the moment.

The main difference between the norms is when `dim=None`. In this case
- `linalg.norm` will compute a vector norm on the flattened input if `ord=None`, otherwise it requires the input to be either 1D or 2D in order to disambiguate between vector and matrix norm
- `linalg.vector_norm` will flatten the input
- `linalg.matrix_norm` will compute the norm over the last two dimensions, treating the input as batch of matrices

In future PRs, the computations will be moved to `torch.linalg.matrix_norm` and `torch.norm` and `torch.linalg.norm` will delegate computations to either `linalg.vector_norm` or `linalg.matrix_norm` based on the arguments provided.
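
A minimal sketch of the `dim=None` behaviors listed above:

```python
import torch

A = torch.randn(2, 3, 3)             # a batch of matrices
m = torch.linalg.matrix_norm(A)      # norm over the last two dims -> shape (2,)
v = torch.linalg.vector_norm(A)      # norm of the flattened input -> scalar
assert m.shape == (2,) and v.dim() == 0
```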

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57127

Reviewed By: mrshenli

Differential Revision: D28186736

Pulled By: mruberry

fbshipit-source-id: 99ce2da9d1c4df3d9dd82c0a312c9570da5caf25
2021-05-09 04:50:33 -07:00
9ad19af935 [TensorExpr] Fix a condition when we use a native depthwise conv2d lowering. (#57906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57906

I think it was accidentally flipped in #56875.

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D28312947

Pulled By: ZolotukhinM

fbshipit-source-id: 8d0f45e540f47daefbc270f5a2ade87f2171b958
2021-05-08 23:04:14 -07:00
0c2d38264a Improve BatchNorm1d performance (CUDA) (#57786)
Summary:
Part of gh-38915, resubmit of gh-57034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57786

Reviewed By: mruberry

Differential Revision: D28290284

Pulled By: ngimel

fbshipit-source-id: 8768578ba9ace6a948cb8145c0091e0ea49b12da
2021-05-08 19:09:29 -07:00
e8fb167b17 [PyTorch Edge] Reuse constant table from ts in bytecode (#56002)
Summary:
## Note:
**This change will include the feature, but the feature is not on. It will be enabled and bytecode version will be bumped in D27844651 (8c04593c0a).**

JIT generates constant tensors, which live in the constants folder (you can find them after unzipping model.ptl). Bytecode generated by the lite interpreter also includes constant tensors, which are almost the same as the constant tensor values from JIT. This PR lets the lite interpreter reuse the constant tensors from JIT, instead of reproducing similar tensor values. The reading and writing sessions work as follows.

More details and background can found in [Lite Interpreter Model Size Issue](https://fb.quip.com/OSidAcjhL9LS).
Data size comparison can be found in [Model size analysis](https://fb.quip.com/oEm6A4bhbo06)

### Write
1. In `export_module.cpp`, store all constant tensor values from JIT in an `unordered_map constants_from_jit`, keyed by the tensor's string representation. `constants_from_jit` is a map: (tensor) => (archive_name, index). When writing the bytecode archive in `writeByteCode()`, the map `constants_from_jit` is also passed all the way down to its pickler.

2. In `pickler.cpp`, a new map `tensors_archive_table_` is added. It is also a map: (tensor) => (archive_name, index). The corresponding function to update the map is `updateTensorsArchiveTable`. When pushing the storage of a tensor, if the tensor exists in `tensors_archive_table_`, the root key will be `{archive_name}/{index}` instead of `{index}`. For example, the tensor
```
     torch._utils._rebuild_tensor_v2(pers.obj(('storage', torch.FloatStorage, '0', 'cpu', 90944),),
       0,
       (1, 116, 28, 28),
       (90944, 784, 28, 1),
       False,
       collections.OrderedDict()),
```
will be like following instead
```
     torch._utils._rebuild_tensor_v2(pers.obj(('storage', torch.FloatStorage, 'constants/0', 'cpu', 90944),),
       0,
       (1, 116, 28, 28),
       (90944, 784, 28, 1),
       False,
       collections.OrderedDict()),
```

**Note**: Only tensors in the bytecode archive will be different. The tensors in other archives remain the same, because `updateTensorsArchiveTable()` is only called when `use_tensors_archive_table_` is `true`, and that flag is only set to `true` when `bytecode_version` is a valid number.

### Read
1. In `import.cpp`, the function `read_record` passed to the Unpickler is updated. The argument of `read_record` is the root key. In version 4, the root key is just the index, and `archive_name_plus_slash` + `name` is used to get the tensor. With this change (version 5+), `read_record` checks whether a slash exists in the argument `name`. If it does, the argument is already `archive_name/index` and can be used to get the tensor directly.
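A rough Python sketch of this root-key handling (the real logic is C++ in `import.cpp`):

```python
def resolve_root_key(name: str, archive_name_plus_slash: str) -> str:
    # Version 5+: a name containing a slash is already "archive_name/index"
    if "/" in name:
        return name
    # Version 4: the name is just the index within the current archive
    return archive_name_plus_slash + name

print(resolve_root_key("0", "bytecode/"))            # bytecode/0
print(resolve_root_key("constants/0", "bytecode/"))  # constants/0
```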

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56002

ghstack-source-id: 128498244

Test Plan:
### Verify the new model generated from this pr can reuse constant table and the numerical result is the same.
1. Build pytorch locally. `MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ USE_CUDA=0 DEBUG=1 MAX_JOBS=16 python setup.py develop`
2. Run `python save_lite.py`
```
import torch

# ~/Documents/pytorch/data/dog.jpg
model = torch.hub.load('pytorch/vision:v0.6.0', 'shufflenet_v2_x1_0', pretrained=True)
model.eval()

# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
import pathlib
import tempfile
import torch.utils.mobile_optimizer

input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

# move the input and model to GPU for speed if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

with torch.no_grad():
    output = model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output[0], dim=0))

traced = torch.jit.trace(model, input_batch)
sum(p.numel() * p.element_size() for p in traced.parameters())
tf = pathlib.Path('~/Documents/pytorch/data/data/example_debug_map_with_tensorkey.ptl')

torch.jit.save(traced, tf.name)
print(pathlib.Path(tf.name).stat().st_size)
traced._save_for_lite_interpreter(tf.name)
print(pathlib.Path(tf.name).stat().st_size)
print(tf.name)

```

3. Run `python test_lite.py`
```
import torch
from torch.jit.mobile import _load_for_lite_interpreter
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms

input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model
reload_lite_model = _load_for_lite_interpreter('~/Documents/pytorch/experiment/example_debug_map_with_tensorkey.ptl')

with torch.no_grad():
    output_lite = reload_lite_model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output_lite[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output_lite[0], dim=0))

```
4. Compare the results from PyTorch master and PyTorch built locally with this change; the outputs are the same.
5. The model size was 16.1 MB and becomes 12.9 MB with this change.

Size comparison in production models:

{F603127047}

Reviewed By: iseeyuan

Differential Revision: D27759891

fbshipit-source-id: 34e0cb8149011c46c1910165b545c137d7a0b855
2021-05-08 13:08:09 -07:00
737f48dfc5 Remove _save_data() and _load_data() from mobile (#57879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57879

_save_data() and _load_data() were designed as a data-serialization protocol for the trainer client. As confirmed with kwanmacher and dreiss, they are not used. In addition, there's no plan to use them in the Federated Learning flow. Remove them for now.

Test Plan: Imported from OSS

Reviewed By: kwanmacher

Differential Revision: D28306682

Pulled By: iseeyuan

fbshipit-source-id: 1b993ce4d78e372ae9b83bcbe496a196f9269d47
2021-05-08 10:52:44 -07:00
88a1e8eb01 Add EMA to DecayAdagrad (#57866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57866

As titled

Test Plan: f271267365

Reviewed By: lanlanfb

Differential Revision: D28292875

fbshipit-source-id: f6532048eb558afce87fdada3b7dfa8457a1f538
2021-05-07 23:09:08 -07:00
a46e927b1a [torch] handle embedding bag with empty bag (#57446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57446

GPU EmbeddingBag now handles L == 0, matching the CPU version.
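A small repro of the case being handled, assuming a CUDA build:

```python
import torch

emb = torch.nn.EmbeddingBag(10, 3, mode="sum").cuda()
inp = torch.tensor([1, 2, 4], device="cuda")
# Bag 0 spans offsets[0]:offsets[1] == 0:0, i.e. it is empty (L == 0).
offsets = torch.tensor([0, 0], device="cuda")
out = emb(inp, offsets)
print(out[0])  # all zeros for the empty bag, matching the CPU behavior
```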

Test Plan: buck test //deeplearning/fbgemm/fbgemm_gpu:split_table_batched_embeddings_test -- test_forward

Reviewed By: jiyuanzFB

Differential Revision: D28145090

fbshipit-source-id: d91d0050ddd5636293a8965d3eece02633918f4c
2021-05-07 22:20:26 -07:00
f51798d0dc [TensorExpr] Fix UB in LoopNest::distribute. (#57883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57883

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28307300

Pulled By: ZolotukhinM

fbshipit-source-id: 5c35d50759904ed10c54e71b8bcb91572341f991
2021-05-07 22:08:19 -07:00
8639fd104e [profiler][kineto] Support for memory allocs/deallocs in the traces (#57835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57835

Pull Request resolved: https://github.com/pytorch/kineto/pull/208

Adding ability to save memory allocs/deallocs into the trace

Test Plan: python test/test_profiler.py -v

Reviewed By: gdankel

Differential Revision: D28260915

fbshipit-source-id: d7905d38d7fac9750754ac1b293d3a1951590b5f
2021-05-07 21:23:30 -07:00
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add the cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy.
Remove existing nolint warnings using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
bc2540f0be benchmark rpc ps (#57454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57454

DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF

I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers.

```
Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer')

benchmark warmup done

metrics for trainer=0
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.972    | 0.097122   | 0.311644  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.58138 | 4.31439  | 0.00229848 | 0.0479424 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.4807  | 0.222566 | 0.0555432  | 0.235676  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.191488 | 3.54099 | 3.11694  | 0.557106   | 0.746395  |
+-----------------------------------+----------+---------+----------+------------+-----------+
metrics for trainer=1
+-----------------------------------+----------+---------+----------+-------------+------------+
| name                              |      min |     max |     mean |    variance |      stdev |
+===================================+==========+=========+==========+=============+============+
| backward_metric,backward          | 2.4617   | 2.59174 | 2.51196  | 0.000938276 | 0.0306313  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| batch_level_metric,batch_all      | 4.22605  | 4.71757 | 4.27921  | 0.00468424  | 0.0684415  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| foward_metric,forward_pass        | 0.807936 | 1.50118 | 0.846008 | 0.00601693  | 0.0775688  |
+-----------------------------------+----------+---------+----------+-------------+------------+
| hook_future_metric,nccl_allreduce | 0.108544 | 0.1536  | 0.11222  | 2.16726e-05 | 0.00465538 |
+-----------------------------------+----------+---------+----------+-------------+------------+
metrics for all trainer
+-----------------------------------+----------+---------+----------+------------+-----------+
| name                              |      min |     max |     mean |   variance |     stdev |
+===================================+==========+=========+==========+============+===========+
| backward_metric,backward          | 2.45248  | 4.18304 | 3.24198  | 0.584391   | 0.764455  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| batch_level_metric,batch_all      | 4.11955  | 4.71757 | 4.2968   | 0.00378467 | 0.0615197 |
+-----------------------------------+----------+---------+----------+------------+-----------+
| foward_metric,forward_pass        | 0.141312 | 1.50118 | 0.534287 | 0.128284   | 0.358167  |
+-----------------------------------+----------+---------+----------+------------+-----------+
| hook_future_metric,nccl_allreduce | 0.108544 | 3.54099 | 1.61458  | 2.5456     | 1.59549   |
+-----------------------------------+----------+---------+----------+------------+-----------+
```

Test Plan: Imported from OSS

Reviewed By: H-Huang, ngimel

Differential Revision: D28296175

Pulled By: gcramer23

fbshipit-source-id: 5dd208fc86f8b5558d7c8860d685bb25c2e09fe7
2021-05-07 19:58:40 -07:00
94080f45ab [RPC Framework] Update rpc.rst (#57876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57876

ghstack-source-id: 128484049

Test Plan: N/A

Reviewed By: pritamdamania87

Differential Revision: D28305719

fbshipit-source-id: cc0d79fb46077a0d1cf6026c373893e7d3b7761e
2021-05-07 19:42:29 -07:00
4db88307d9 [RPC Framework] Add a link to the tutorial in RemoteModule docstring (#57875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57875

This tutorial combines DDP and RemoteModule.
ghstack-source-id: 128482681

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D28305382

fbshipit-source-id: 572e1ec4b4aa00735fff16a6ce6ae4c7cad0b27f
2021-05-07 19:42:27 -07:00
74d493cc07 [RPC Framework] Support passing RemoteModule as an arg (#57695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57695

Add pickling/unpickling support for `RemoteModule`.

#Closes: https://github.com/pytorch/pytorch/issues/57516
ghstack-source-id: 128472946

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_with_a_new_attribute_over_the_wire

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: rohan-varma

Differential Revision: D28233108

fbshipit-source-id: 94eea2251fa53fb71912457c80d0a1e44504fc85
2021-05-07 19:41:17 -07:00
8c04593c0a [PyTorch Edge] Add backport to export old bytecode models (#56802)
Summary:
Add an API to backport a model from version vn to version vi. It accepts an input model (file or buffer) and outputs a model (file or buffer) with the expected bytecode version.

In this change, the input model can come from a file or a buffer. The output model can be written to either a file path or a buffer.

When backport fails, the function returns false with a warning message:
```
/Users/chenlai/pytorch/cmake-build-debug/bin/test_jit --gtest_filter=LiteInterpreterTest.BackPortByteCodeModelV4:LiteInterpreterTest/*.BackPortByteCodeModelV4:*/LiteInterpreterTest.BackPortByteCodeModelV4/*:*/LiteInterpreterTest/*.BackPortByteCodeModelV4 --gtest_color=no
Testing started at 2:32 PM ...
CUDA not available. Disabling CUDA and MultiCUDA tests

[W backport.cpp:419] Warning: Backport doesn't support backport to version3 (function _backport_for_mobile_impl)
Process finished with exit code 0
```
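A hypothetical usage sketch; the binding name and signature here are assumptions based on the test plan (`caffe2/test/mobile/test_bytecode.py`) and the warning above:

```python
from torch.jit.mobile import _backport_for_mobile  # assumed entry point

# Backport a version-5 model to bytecode version 4; returns False on failure.
ok = _backport_for_mobile("model_v5.ptl", "model_v4.ptl", 4)
if not ok:
    print("backport failed; see the emitted warning for details")
```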

## Test
1. Run both `caffe2/test/cpp/jit/test_lite_interpreter.cpp` and `caffe2/test/mobile/test_bytecode.py`.
2. Run all prod models with backport api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56802

ghstack-source-id: 128425510

Test Plan: CI

Reviewed By: raziel, iseeyuan

Differential Revision: D27844651

fbshipit-source-id: 8a803cf6c76433ee0a3049b1a5570585d569f8d6
2021-05-07 18:14:33 -07:00
e9c3ce30d4 Fix flaky test_barrier_timeout_global. (#57523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57523

`_test_barrier_timeout` would run a barrier on rank 0 and sleep for
`timeout` on the other ranks. In some cases, if the other ranks were faster,
they would enter the sleep call much earlier than rank 0 would enter the barrier.
As a result, they would exit before the timeout was up and rank 0 would receive
a connection closed error instead of a timeout error. This would result in the
barrier call exiting before the timeout and the subsequent assertion failing.

#Closes: https://github.com/pytorch/pytorch/issues/57176
ghstack-source-id: 128278775

Test Plan:
1) waitforbuildbot
2) Tested synthetically by forcing a rank to exit earlier.

Reviewed By: rohan-varma

Differential Revision: D28170821

fbshipit-source-id: a67456a1784dd0657f264c4f5498638e0aa00de2
2021-05-07 17:32:05 -07:00
73f22bcbf9 [fx ir] Handle cases in GraphDrawer when shape, type or stride are not present (#57845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57845

As title says

Test Plan: N/A

Reviewed By: 842974287

Differential Revision: D28295999

fbshipit-source-id: f2cbf80c468f13685b17bb396c1f48972744ced0
2021-05-07 17:24:48 -07:00
ee4be5322b Fix lint in test_tensorexpr_pybind (#57869)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57869

Pulled By: driazati

Reviewed By: gmagogsfm

Differential Revision: D28303542

fbshipit-source-id: a4c77c1e7ee92c39fd1c7d88422728e2f1c31680
2021-05-07 15:58:21 -07:00
4fad8d1a2c Update the default detach semantic for forward mode AD (#57820)
Summary:
This makes detach both forward and backward non-differentiable by default.
You can pass the `only_backward_mode=True` argument to make it forward differentiable but backward non-differentiable.

The important side effect of this change is that, by default, detach is not tracking any view information.
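A sketch of the new default, assuming the forward-mode AD API in `torch.autograd.forward_ad`:

```python
import torch
import torch.autograd.forward_ad as fwAD

x, t = torch.randn(3), torch.randn(3)
with fwAD.dual_level():
    dual = fwAD.make_dual(x, t)
    y = dual.detach()  # forward non-differentiable by default now
    _, tangent = fwAD.unpack_dual(y)
    print(tangent)  # None: the tangent is dropped along with the backward graph
```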

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57820

Reviewed By: ezyang

Differential Revision: D28287633

Pulled By: albanD

fbshipit-source-id: bdc4726fcd05889f6ac84e5a3a3ef71b2ec41015
2021-05-07 15:51:18 -07:00
bc0965ac85 [Vulkan] Use more optimal command buffer submission rate (#57196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57196

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D28293756

Pulled By: SS-JIA

fbshipit-source-id: 3099e2112c5ba665e2045cbfc57acc131143f864
2021-05-07 15:42:54 -07:00
b0c27b44cf Enable backward/forward compatibility for TS runtime (#57498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57498

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28162448

Pulled By: tugsbayasgalan

fbshipit-source-id: 5c21ced42a22aca7cee089e876e9d98d32f68955
2021-05-07 15:41:45 -07:00
b38f153d91 [nnc] Added NNC lowerings for t/transpose/permute/expand + other cleaning (#57426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57426

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28293191

Pulled By: Chillee

fbshipit-source-id: b8fc44299acf2569c11e87e1991a2b724434b15d
2021-05-07 15:38:56 -07:00
c88167d2ed Respect .ini for flake8 and mypy (#57752)
Summary:
Previously `make quicklint` would lint all changed files for both mypy `ini`s, regardless of whether that file was actually supposed to be run under that configuration. This PR fixes that so we are using `tools/mypy_wrapper.py` to check if files should be included.

There's a similar change for `flake8` so that it now only outputs errors once and correctly excludes the paths in `.flake8`.

This also adds a bunch of tests to ensure that `make lint` and `make quicklint` both work and that `make quicklint` is excluding and including what it should.

Fixes #57644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57752

Pulled By: driazati

Reviewed By: samestep

Differential Revision: D28259692

fbshipit-source-id: 233d355781230f11f98a6f61e2c07e9f5e737e24
2021-05-07 15:21:57 -07:00
18fed3dfbe Change name for namedtuple return of torch.linalg.svd (#57181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57181

Documentation for torch.linalg.svd says:
> The returned decomposition is a named tuple `(U, S, Vh)`

The documentation is correct while the implementation was wrong.
Renamed `V` -> `Vh`. `h` stands for hermitian.
This is a BC-breaking change but our linalg module is beta, therefore we can do it without a deprecation notice or aliases.
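For example:

```python
import torch

A = torch.randn(4, 3)
out = torch.linalg.svd(A, full_matrices=False)
U, S, Vh = out
assert out.Vh is Vh  # the field is now named Vh, as documented
# Vh is already the (conjugate-)transposed factor:
assert torch.allclose(U @ torch.diag(S) @ Vh, A, atol=1e-5)
```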

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28142162

Pulled By: mruberry

fbshipit-source-id: 5e6e0ae5a63300f2db1575ca3259df381f8e1a7e
2021-05-07 15:17:43 -07:00
58f32fa5fd Remove compute_uv flag from torch.linalg.svd (#57180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57180

We have now a separate function for computing only the singular values.
`compute_uv` argument is not needed and it was decided in the
offline discussion to remove it. This is a BC-breaking change but our
linalg module is beta, therefore we can do it without a deprecation
notice.
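The replacement for the removed flag is the dedicated singular-values function:

```python
import torch

A = torch.randn(4, 3)
# instead of the removed torch.linalg.svd(A, compute_uv=False):
S = torch.linalg.svdvals(A)
```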

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28142163

Pulled By: mruberry

fbshipit-source-id: 3fac1fcae414307ad5748c9d5ff50e0aa4e1b853
2021-05-07 15:16:42 -07:00
db412a6885 Avoid 2 extra copies when reducing sparse tensors and fix result() vs inplace output discrepancy (#57822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57822

* `AsyncSparseAllreduceWork` can avoid copying output tensors, since we keep all the results alive by modifying the input vector directly
* `AsyncSparseAllreduceWork` now returns the inputs back to the user instead of the former behavior where it returned copies of the inputs. This is consistent with other operations and process group implementations
* `AsyncSparseAllreduceCUDAWork` now copies tensors directly from CPU into the input tensors, avoiding the extra copies `output` -> `outputs` -> `inputs`. The inputs are returned back to the user. This is consistent with other operations and process group implementations.

Overall, `AsyncSparseAllreduceCUDAWork` now avoids 2 extra copies (as it uses `AsyncSparseAllreduceWork`'s implementation)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28298325

Pulled By: agolynski

fbshipit-source-id: 18e2104413cdf5e73a01aad464e2613807779297
2021-05-07 15:12:58 -07:00
2043093217 Add correction parameter to std/var (#50903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50903

First part of #50010. Also fixes #51127.
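A small illustration of the new parameter (the divisor becomes `N - correction`, generalizing the boolean `unbiased` flag):

```python
import torch

x = torch.randn(100)
# correction=1 <-> unbiased=True (Bessel's correction)
assert torch.allclose(torch.var(x, dim=0, correction=1),
                      torch.var(x, dim=0, unbiased=True))
# correction=0 <-> unbiased=False (population variance)
assert torch.allclose(torch.var(x, dim=0, correction=0),
                      torch.var(x, dim=0, unbiased=False))
```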

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27911345

Pulled By: mruberry

fbshipit-source-id: 7138fddc935802918ab9ff19f4bc1b9f4d745d41
2021-05-07 14:40:28 -07:00
3d2ce60539 [PyTorch] Remove dead get/setTLSCallbacks APIs (#56492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56492

These are documented as internal-only and aren't called.
ghstack-source-id: 128354112

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D27834789

fbshipit-source-id: 4a1aa320f952249db51945ff77563558fa884266
2021-05-07 14:00:59 -07:00
9234d7fc27 [PyTorch] Use MaybeOwned and avoid resize in bmm_cuda (#56115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56115

No reason to size it wrong and then resize it. Also, no reason to unconditionally go through the dispatcher.
ghstack-source-id: 128354110

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D27768757

fbshipit-source-id: 5dcb1fed5c5fa6707ee15359a26fde2a9a888b7f
2021-05-07 13:59:47 -07:00
96e1a83fb2 Add Gloo TCP_TLS transport (#56442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56442

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27896285

Pulled By: pbelevich

fbshipit-source-id: 589af59ca4c7c9bab2329f079382c09b71cfcf9e
2021-05-07 13:36:11 -07:00
96fce78ac4 [Vulkan] Add -Os flag to shader compilation (#57199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57199

Reduces the size of compiled shaders, and potentially provides some performance boost as well.

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D28293816

Pulled By: SS-JIA

fbshipit-source-id: 424dc0bce24d6115ba2bf8405027e967f6cb9497
2021-05-07 13:15:01 -07:00
731dcd75f5 [torch/elastic] Revise the note section of RendezvousHandler doc (#57723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57723

Updated the note section of `RendezvousHandler`:

- Removed the experimental API warning.
- Recommended using the C10d Store instead of etcd for most users.

Test Plan: N/A

Reviewed By: kiukchung

Differential Revision: D28253828

fbshipit-source-id: c4f34dffd1a3cc132977029fe449b6d63ddc877b
2021-05-07 13:10:22 -07:00
64dc10e268 [JIT] Also fold NaiveSyncBatchNorm when folding batch norm (#57823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57823

Some models use the `NaiveSyncBatchNorm` instead of `BatchNorm2d`, but during inference they behave the same. This change is to ensure that `NaiveSyncBatchNorm` gets folded into convs during optimization passes, particularly `FoldConvBatchNorm`.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28291709

Pulled By: SS-JIA

fbshipit-source-id: c494dc7698c3fa536146038808fedbb46c17a63b
2021-05-07 12:55:35 -07:00
0503105bc2 Port logaddexp and logaddexp2 to structured (#57629)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57629

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224830

Pulled By: ezyang

fbshipit-source-id: 356aa5683f254e77e0a77d76f6ef939631a3910c
2021-05-07 12:41:24 -07:00
223a362f63 Port lcm to structured (#57628)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57628

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224833

Pulled By: ezyang

fbshipit-source-id: 17ba9ea419a9fab5dcbdaefbe6d330e2e74c1e1f
2021-05-07 12:41:22 -07:00
470c7af749 Port hypot to structured (#57627)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57627

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224829

Pulled By: ezyang

fbshipit-source-id: d7b683618cc0c4df923ee771d6363478df677d7c
2021-05-07 12:41:21 -07:00
3dd88d6792 Port igamma and igammac to structured (#57626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57626

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224834

Pulled By: ezyang

fbshipit-source-id: 924b0e10428429715dca1b75dce40dcd67b59134
2021-05-07 12:41:19 -07:00
3a1dc60da5 Port nextafter to structured (#57625)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57625

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224828

Pulled By: ezyang

fbshipit-source-id: 560f1bf942ea14ea6d8b915753aec83d1168005e
2021-05-07 12:41:17 -07:00
7e51ac5ea7 Port gcd to structured (#57624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57624

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224832

Pulled By: ezyang

fbshipit-source-id: 30a8eba025c67d990103e49c03a396810f9d4006
2021-05-07 12:39:51 -07:00
5044d9dc51 Fixing quantize_per_tensor on cuda (#57703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57703

The .bzl files didn't have registerQuantizedCUDA listed for some reason, but after adding it, the previously broken commands (on CUDA) now work.

Note: these build files didn't affect OSS builds, which were working throughout.

The test_qtensor test was potentially misleading, since it would pass even if CUDA support wasn't working, as long as the build system wasn't CUDA-enabled. I broke it out into independent tests for each device so that a skip is reported, rather than a pass, on systems without CUDA enabled.
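One of the previously broken commands, now working on CUDA-enabled builds:

```python
import torch

x = torch.randn(4, device="cuda")
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
print(qx.dequantize())
```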

Test Plan:
buck test mode/dbg //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_qtensor_cpu (quantization.test_quantized_tensor.TestQuantizedTensor)'

buck test mode/dbg //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_qtensor_cuda (quantization.test_quantized_tensor.TestQuantizedTensor)'

Reviewed By: jerryzh168

Differential Revision: D28242797

fbshipit-source-id: 938ae86dcd605aedf26ac0bace9db77deaaf9c0f
2021-05-07 12:26:19 -07:00
c07babbcf1 [Gradient Compression] Divide by world size before all_reduce to avoid overflow (#57410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57410

FP16 gradient compression may run into an 'inf' issue. Switching to division before allreduce avoids this problem.
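A minimal sketch of the idea (the actual change is inside the fp16 compression comm hook, whose exact API differs):

```python
import torch
import torch.distributed as dist

def allreduce_fp16_mean(grad: torch.Tensor, world_size: int) -> torch.Tensor:
    # Divide BEFORE the all_reduce: summing fp16 gradients across many
    # ranks can overflow to inf; pre-scaling keeps values in range.
    buf = grad.to(torch.float16) / world_size
    dist.all_reduce(buf)  # sum of pre-scaled values == mean of originals
    return buf.to(grad.dtype)
```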
ghstack-source-id: 127877083

Test Plan:
before chage

f268909897

after change:
f270950609

If you still see 'grad_norm = inf' after enabling the fp16 hook, you can resume the training with the hook turned off.

Reviewed By: SciPioneer

Differential Revision: D28128628

fbshipit-source-id: 0b6648637713e4f321e39c9ccb645a6b6f1750a0
2021-05-07 12:23:21 -07:00
626ae7f036 Copy edit of TorchScript Language Reference (#57694)
Summary:
Initial copy edit of the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57694

Reviewed By: malfet, ngimel

Differential Revision: D28289209

Pulled By: holly1238

fbshipit-source-id: 7035d6790767a2f758e6019ae63df16537ef2725
2021-05-07 12:17:32 -07:00
b5b158a6c6 Be more lenient with network exceptions in trigger_azure_pipeline.py (#57714)
Summary:
Fixes possible failure caused by network instability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57714

Reviewed By: ngimel

Differential Revision: D28288555

Pulled By: malfet

fbshipit-source-id: 2deedf3fe1a95dae5a68d599d9603f3da4702e8e
2021-05-07 10:02:23 -07:00
161ea537f0 [reland] Remove unused code in windows_build_definition.py (#57107)
Summary:
I accidentally reverted https://github.com/pytorch/pytorch/issues/56230 in https://github.com/pytorch/pytorch/issues/56128 when resolving conflicts. This PR relands https://github.com/pytorch/pytorch/issues/56230

CC mszhanyi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57107

Reviewed By: astaff

Differential Revision: D28096003

Pulled By: seemethere

fbshipit-source-id: ea616d6b5cb0b04841d2f4cc30bd130ade4a364c
2021-05-07 09:27:53 -07:00
0dd0151c64 add torch.testing to docs (#57247)
Summary:
Redo of https://github.com/pytorch/pytorch/issues/56373 out of stack.

 ---

To reviewers: **please be nitpicky**. I've read this so often that I probably missed some typos and inconsistencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57247

Reviewed By: albanD

Differential Revision: D28247402

Pulled By: mruberry

fbshipit-source-id: 71142678ee5c82cc8c0ecc1dad6a0b2b9236d3e6
2021-05-07 09:16:39 -07:00
27f672a0fc Fix test reporting regression (#57795)
Summary:
Here is why another move of this single line is needed:
 - Regardless of whether the test run failed or succeeded, it's good to
   report the number of tests executed
 - `docker cp || echo` always succeeds, so it can safely be executed
   before any other step in "Report test results"
 - This command should not be part of the "Run tests" step, otherwise it would not get executed if any of the tests failed. (If it must be part of the "Run tests" step, it should be prefixed with the [trap](https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html) command and defined before the `docker exec` step.)

This fixes the "regression" introduced by https://github.com/pytorch/pytorch/pull/56725, although the real culprit here is a lack of documentation.

Here is an example of a PR where test results are not reported back due to
the failure: https://app.circleci.com/pipelines/github/pytorch/pytorch/317199/workflows/584a658b-c742-4cbb-8f81-6bb4718a0c04/jobs/13209736/steps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57795

Reviewed By: samestep

Differential Revision: D28275510

Pulled By: malfet

fbshipit-source-id: 622f3bfca96a1ee9b8959590b28a26046eb37ea3
2021-05-07 09:12:50 -07:00
2901d2e694 make quantizeable MHA work with torch.jit.script (#57774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57774

Makes `torch.nn.quantizable.MultiheadAttention`
work with `torch.jit.script`.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28268218

fbshipit-source-id: 422868d9d26cae015d3c691ea710d82ffac3fa7f
2021-05-07 08:40:49 -07:00
023ecc40ad Revert D28248766: Update internal code for torch.linalg.solve
Test Plan: revert-hammer

Differential Revision:
D28248766 (5f2925074b)

Original commit changeset: 300366605653

fbshipit-source-id: 316b97791e57f9017d4bf87898aea8dc869cba79
2021-05-07 07:49:16 -07:00
6f2c0cccdd New: sparse complex: add linear algebra, addmm (#57129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57129

Test Plan: Imported from OSS

Reviewed By: janeyx99, astaff

Differential Revision: D28112701

Pulled By: ezyang

fbshipit-source-id: 1b253453dc19e908fb18d0b1a83738243e0a8d59
2021-05-07 05:37:48 -07:00
a911c4fc1c New: Initial support for sparse complex tensors constructors for CPU/CUDA (#57125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57125

I'm opening this PR, solving the last issue reported before merging PR #54153:

https://github.com/pytorch/pytorch/pull/54153#issuecomment-827997616,

Solves gh-50690

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28112702

Pulled By: ezyang

fbshipit-source-id: 915681954edb14b7c19c3ffe641af2d2e6649576
2021-05-07 05:36:41 -07:00
8d363d37da [FX] Adds PyTree support to FX through concrete_args (#55888)
Summary:
```
class Foo(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y, x):
        for k in x:
            for v in x[k]:
                v += y
        return x

example_dict = {'x': {'a': [fx.HOLE], 'z': [fx.HOLE, fx.HOLE]}}
new_f = fx.symbolic_trace(Foo(), concrete_args=example_dict)
print(new_f.code)
new_f(torch.randn(5), {'x': {'a': [torch.randn(5)], 'z': [torch.randn(5), torch.randn(5)]}})

fx.symbolic_trace(new_f, concrete_args=example_dict)
```

prints out
```
def forward(self, y, x):
    y, tree_2, tree_3, tree_4 = pytree.tree_flatten([y, x])[0]
    add = tree_2 + y
    add_1 = tree_3 + y
    add_2 = tree_4 + y;  y = None
    return {'a': [tree_2], 'z': [tree_3, tree_4]}
```

Currently, I store `in_spec` as an extra attribute on `fx.Graph`, and then include it when we do the codegen. I'm not sure if this is the right approach - it introduces a divergence between what's in `fx.Graph` and what's in the python code.

Perhaps the best API is something explicit like `fx.Graph.flatten_args`, but that does make calling things a bit ... more verbose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55888

Reviewed By: jamesr66a

Differential Revision: D27884694

Pulled By: Chillee

fbshipit-source-id: f9e8a70c63a8df63c9f9bd0a6459255daa5a8df8
2021-05-07 04:48:35 -07:00
45012da298 Migrate from shared_ptr to intrusive_ptr for Future (#57636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636

The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.

At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.

This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
ghstack-source-id: 128296581

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D28187972

fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
2021-05-07 03:59:20 -07:00
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy, because the callback was invoked with the parent future as an argument and could thus inspect it. In C++ the callbacks didn't take any arguments, so in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't careful).
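The Python API already follows the convention this PR adopts for C++: the callback receives the completed parent future as its argument, so nothing needs to capture the future inside its own callback:

```python
import torch

fut = torch.futures.Future()
# The callback gets the parent as an argument, avoiding the risky
# self-capture pattern described above.
chained = fut.then(lambda parent: parent.value() + 1)
fut.set_result(41)
print(chained.wait())  # 42
```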
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
9aa1461a68 Make wrapPropagateTLSState more generic (#57634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57634

`wrapPropagateTLSState` was restricting its argument to be an argument-less function, and I need to relax this for later work.

Also, it was requiring its argument to be converted to `std::function`, and also returned a `std::function`. Each creation of a `std::function` could cause a heap allocation. It's not particularly expensive, but here we can easily avoid it by having `wrapPropagateTLSState` directly operate on generic callables (thus, possibly, raw lambdas).
ghstack-source-id: 128295264

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D28178782

fbshipit-source-id: d657f5751514974518606dd4fc4175e805dcb90a
2021-05-07 03:58:08 -07:00
5f2925074b Update internal code for torch.linalg.solve (#56613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56613

Replace linalg_solve_helper with `lu_stub` + `lu_solve_stub`.
Once `lu_stub` and `lu_solve_stub` have cuSOLVER-based codepath,
`torch.linalg.solve` will have it as well.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28248766

Pulled By: mruberry

fbshipit-source-id: 3003666056533d097d0ad659e0603f59fbfda9aa
2021-05-07 03:29:16 -07:00
adaf80bcbe Update internal code for at::_lu_with_info (#56612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56612

The goal of this refactoring is to make `torch.linalg.solve`
a composition of calls to `lu_stub` and `lu_solve_stub`.
Once `lu_stub` and `lu_solve_stub` have cuSOLVER-based codepath,
`torch.linalg.solve` will have it as well.

Replaced `lu_with_info_{cpu, cuda}` with one function that calls
into `lu_stub`.
Split MAGMA-based `apply_lu` into `apply_lu_looped_magma`
and `apply_lu_batched_magma`. This simplifies the future switch to
cuSOLVER and cuBLAS libraries.
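A Python-level analogue of the intended composition:

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 2)

x = torch.linalg.solve(A, b)

# the same result via an explicit factorize-then-substitute pipeline:
LU, pivots = torch.lu(A)
x2 = torch.lu_solve(b, LU, pivots)
assert torch.allclose(x, x2, atol=1e-5)
```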

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28248756

Pulled By: mruberry

fbshipit-source-id: 40e02b5be4ff5f78885bcc95685aba581043e096
2021-05-07 03:28:04 -07:00
9e6b7e6e6e OpInfo: expand and expand_as (#57606)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57606

Reviewed By: albanD

Differential Revision: D28249191

Pulled By: mruberry

fbshipit-source-id: d985ab4e8a99b116c45953e621092929a9a8028e
2021-05-07 02:50:00 -07:00
1f7309dfe3 [testing] clean-up test_unary_ufuncs.py (#57615)
Summary:
Some clean-ups

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57615

Reviewed By: albanD

Differential Revision: D28249173

Pulled By: mruberry

fbshipit-source-id: 19a300f6aa267932a7a92c2f5f377488f69bd822
2021-05-07 02:26:10 -07:00
4cb3c60c20 OpInfo: float_power (#57648)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54295 (`float_power`)

cc: mruberry kshitij12345 krshrimali

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57648

Reviewed By: albanD

Differential Revision: D28249489

Pulled By: mruberry

fbshipit-source-id: 0ae5ce0d8b154724ae59f5f5b4412e34b0128d0a
2021-05-07 02:09:47 -07:00
6eec730a73 [testing] atan2: Enable cases where self broadcasts (#57608)
Summary:
Just a follow-up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57608

Reviewed By: albanD

Differential Revision: D28249409

Pulled By: mruberry

fbshipit-source-id: a1ce2cd736ac5547cecb3e21aaa50637917284bc
2021-05-07 01:48:44 -07:00
159a2404bd fft: Increase tolerance for nd-fft tests (#57576)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56820

The test only fails for inverse n-dim functions with `norm="forward"`. The relative error isn't actually any bigger than for other norm modes, though. It's just that the magnitude of the result is bigger, so the absolute tolerance is smaller relative to each element. So, I just increase the relative tolerance to compensate.

This `precisionOverride` is already applied to `fftn` and `rfftn` for exactly the same reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57576

Reviewed By: albanD

Differential Revision: D28249222

Pulled By: mruberry

fbshipit-source-id: 734c7c1ae8236b253d6e3cd2218c05d21901c567
2021-05-07 01:30:32 -07:00
ee79413b6a [testing] change unaryufunc default dtypes (#57616)
Summary:
Reference: https://github.com/pytorch/pytorch/pull/56646#pullrequestreview-644839124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57616

Reviewed By: albanD

Differential Revision: D28249129

Pulled By: mruberry

fbshipit-source-id: 2cfc837fd49100d2b1b2a09d9ca6db93e089e099
2021-05-07 01:20:49 -07:00
319b08be59 Add call_kwargs(args, kwargs) method to torch::deploy api (#57748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57748

To be used by PyTorchPredictor integration for deploy.

Original commit changeset: 4d41efc733b2

Test Plan: tested via new unit tests

Reviewed By: suo

Differential Revision: D28258525

fbshipit-source-id: 8b9436e47501d7c1c16e79909e668100f825711e
2021-05-07 00:07:06 -07:00
f2fdb61e2d [iOS GPU][Perf][1/n] Use aten::contiguous instead of permuting weights manually (#57664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57664

Manually permuting weights is slower than using aten::contiguous. This will improve model loading time at runtime, especially on low-end devices. Some numbers from the Unet model are below (on average 6x faster); a small sketch of the idea follows the timings.

- iPhone 12
    - before - 26.252 ms
    - after - 4.727 ms
- iPhone 11
   - before - 29.638 ms
   - after - 5.012 ms
- iPhone X
     - before - 33.257 ms
     - after - 5.481 ms
- iPhone 8
     - before - 33.335 ms
     - after - 5.83 ms
- iPhone 7
     - before - 36.144 ms
     - after - 6.232 ms
- iPhone 6s
     - before - 47.977 ms
     - after - 6.998 ms
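A Python-level sketch of the repacking idea (illustrative; the actual change is in the iOS GPU codepath, and the target layout here is an assumption):

```python
import torch

w = torch.randn(64, 32, 3, 3)  # e.g. OIHW conv weights

# One fused copy into the permuted memory layout, instead of manually
# looping over elements to repack the weights:
w_packed = w.permute(0, 2, 3, 1).contiguous()  # OIHW -> OHWI
```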
ghstack-source-id: 128338534

Test Plan: - CI

Reviewed By: kimishpatel

Differential Revision: D28087911

fbshipit-source-id: ad0029436e59a0ecc02ce660ed1110dc0b82848c
2021-05-06 23:15:41 -07:00
ca8090f81b [Pytorch Edge] Enable eager symbolication in benchmarking binary (#57705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57705

This will enable module level debug info for benchmarking binary.

Test Plan: Run on AIBench

Reviewed By: larryliu0820

Differential Revision: D28230948

fbshipit-source-id: 5d06c6853d049ff678995a2ed4a86f4e6c85bdc7
2021-05-06 21:50:57 -07:00
e5e095cbe4 [torch/elastic] Rename etcd-/c10d-experimental to etcd-v2 and c10d (#57764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57764

As discussed offline this PR renames etcd-experimental backend to etcd-v2 and c10d-experimental backend to c10d.
ghstack-source-id: 128342523

Test Plan: Run the existing unit tests.

Reviewed By: kiukchung

Differential Revision: D28263739

fbshipit-source-id: c3409037ecea5a8ff6daadeeb1f2fb4205cc3852
2021-05-06 19:51:53 -07:00
cb95c9db9f Automated submodule update: FBGEMM (#57485)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 530356e16f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57485

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jiecaoyu

Differential Revision: D28158310

fbshipit-source-id: 2ea77956a6e1709569a587c671c0c08018b8a966
2021-05-06 17:57:33 -07:00
1f1e2dab6b Remove optional type for ord parameter in vector_norm (#57662)
Summary:
As per discussion here https://github.com/pytorch/pytorch/pull/57127#discussion_r624948215

Note that we cannot remove the optional type from the `dim` parameter, because the default is to flatten the input tensor, which cannot easily be captured by a value other than `None`.

### BC Breaking Note
This PR changes the `ord` parameter of `torch.linalg.vector_norm` so that it no longer accepts `None` arguments. The new default of `2` is equivalent to the previous default of `None`.
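For example:

```python
import torch

x = torch.randn(5)
# ord now defaults to 2 and no longer accepts None:
assert torch.allclose(torch.linalg.vector_norm(x),
                      torch.linalg.vector_norm(x, ord=2))
```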

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57662

Reviewed By: albanD, mruberry

Differential Revision: D28228870

Pulled By: heitorschueroff

fbshipit-source-id: 040fd8055bbe013f64d3c8409bbb4b2c87c99d13
2021-05-06 17:53:25 -07:00
cb1272a846 update doc in build section (#56686)
Summary:
Why:
To keep the VS version always updated in the README.
1. Update the VS version link in CI. It's more convenient for my PR robot to update the version in the README once the VS in CI is updated, and the permalink isn't stable.
2. Move `building on legacy code` to the development tips. The table is big and it makes the README look outdated at first sight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56686

Reviewed By: janeyx99

Differential Revision: D28272060

Pulled By: samestep

fbshipit-source-id: 4bb879ea2914cc8bcd68343a9ed230418e1f9268
2021-05-06 17:35:56 -07:00
8b38458011 [jit] Break interpreter.cpp into smaller files. (#56546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56546

A code move for CodeImpl and Frame to a subdirectory runtime/interpreter, so
that it's easier to reuse them and navigate the interpreter code.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28133580

fbshipit-source-id: 8de89a4e8e637836625e1ac1db95f0a3353da670
2021-05-06 16:43:57 -07:00
2787f01455 Catch KeyboardInterrupt in tools/test_history.py (#57780)
Summary:
Currently, interrupting `tools/test_history.py` with `^C` gives a very long traceback:
```
$ tools/test_history.py --mode=multiline --ref=594a66 --sha-length=8 --test=test_set_dir --job pytorch_linux_xenial_py3_6_gcc5_4_test --job pytorch_linux_xenial_py3_6_gcc7_test
2021-02-10 11:13:34Z 594a66d7 pytorch_linux_xenial_py3_6_gcc5_4_test 0.36s
2021-02-10 11:13:34Z 594a66d7 pytorch_linux_xenial_py3_6_gcc7_test 0.573s errored
2021-02-10 10:13:25Z 9c0caf03 pytorch_linux_xenial_py3_6_gcc5_4_test 0.819s
2021-02-10 10:13:25Z 9c0caf03 pytorch_linux_xenial_py3_6_gcc7_test 0.449s
2021-02-10 10:09:14Z 602434bc pytorch_linux_xenial_py3_6_gcc5_4_test 0.361s
2021-02-10 10:09:14Z 602434bc pytorch_linux_xenial_py3_6_gcc7_test 0.454s
2021-02-10 10:09:10Z 2e35fe95 (no reports in S3)
2021-02-10 10:09:07Z ff73be7e (no reports in S3)
2021-02-10 10:05:39Z 74082f0d (no reports in S3)
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc5_4_test 0.414s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc5_4_test 0.476s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc7_test 0.377s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc7_test 0.326s
2021-02-10 07:27:53Z 33afb5f1 pytorch_linux_xenial_py3_6_gcc5_4_test 0.381s
2021-02-10 07:27:53Z 33afb5f1 pytorch_linux_xenial_py3_6_gcc7_test 0.294s
^CTraceback (most recent call last):
  File "tools/test_history.py", line 344, in <module>
    main()
  File "tools/test_history.py", line 339, in main
    for line in run(sys.argv[1:]):
  File "tools/test_history.py", line 143, in history_lines
    summaries = get_test_stats_summaries(sha=sha, jobs=jobs)
  File "/Users/sestep/github/pytorch/pytorch/tools/stats_utils/s3_stat_parser.py", line 161, in get_test_stats_summaries
    return _parse_s3_summaries(summaries, jobs=list(jobs or []))
  File "/Users/sestep/github/pytorch/pytorch/tools/stats_utils/s3_stat_parser.py", line 147, in _parse_s3_summaries
    for summary in summaries:
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/client.py", line 662, in _make_api_call
    http, parsed_response = self._make_request(
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/client.py", line 682, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/endpoint.py", line 134, in _send_request
    success_response, exception = self._get_response(
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/endpoint.py", line 166, in _get_response
    success_response, exception = self._do_get_response(
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/endpoint.py", line 269, in _send
    return self.http_session.send(request)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/botocore/httpsession.py", line 308, in send
    urllib_response = conn.urlopen(
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/sestep/miniconda3/envs/pytorch/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
```
This PR eliminates that traceback using a technique from `tools/actions_local_runner.py`.
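One common shape for such a guard (the exact code in the PR may differ; `main` here is the script's existing entry point):

```python
import sys

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        # swallow the traceback and exit quietly on ^C
        sys.exit(1)
```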

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57780

Test Plan:
```
$ tools/test_history.py --mode=multiline --ref=594a66 --sha-length=8 --test=test_set_dir --job pytorch_linux_xenial_py3_6_gcc5_4_test --job pytorch_linux_xenial_py3_6_gcc7_test
2021-02-10 11:13:34Z 594a66d7 pytorch_linux_xenial_py3_6_gcc5_4_test 0.36s
2021-02-10 11:13:34Z 594a66d7 pytorch_linux_xenial_py3_6_gcc7_test 0.573s errored
2021-02-10 10:13:25Z 9c0caf03 pytorch_linux_xenial_py3_6_gcc5_4_test 0.819s
2021-02-10 10:13:25Z 9c0caf03 pytorch_linux_xenial_py3_6_gcc7_test 0.449s
2021-02-10 10:09:14Z 602434bc pytorch_linux_xenial_py3_6_gcc5_4_test 0.361s
2021-02-10 10:09:14Z 602434bc pytorch_linux_xenial_py3_6_gcc7_test 0.454s
2021-02-10 10:09:10Z 2e35fe95 (no reports in S3)
2021-02-10 10:09:07Z ff73be7e (no reports in S3)
2021-02-10 10:05:39Z 74082f0d (no reports in S3)
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc5_4_test 0.414s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc5_4_test 0.476s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc7_test 0.377s
2021-02-10 07:42:29Z 0620c96f pytorch_linux_xenial_py3_6_gcc7_test 0.326s
2021-02-10 07:27:53Z 33afb5f1 pytorch_linux_xenial_py3_6_gcc5_4_test 0.381s
2021-02-10 07:27:53Z 33afb5f1 pytorch_linux_xenial_py3_6_gcc7_test 0.294s
^C
```

Reviewed By: walterddr

Differential Revision: D28269719

Pulled By: samestep

fbshipit-source-id: e5b4f2677f90f745fb171f159cced03a4f1d4b0b
2021-05-06 16:19:28 -07:00
78fb9c2f5b Reorder gc.py imports (#57779)
Summary:
A tiny PR that reorder imports and run autopep8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57779

Reviewed By: malfet

Differential Revision: D28269455

Pulled By: samestep

fbshipit-source-id: 7d3176efad96e3a8ac1cdc76a5018c7ffa00c449
2021-05-06 16:10:58 -07:00
241c2f4496 Add Gelu To NNC (#57753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57753

I'm not adding symbolic gradient because that is being added in https://github.com/pytorch/pytorch/pull/46785.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28262765

Pulled By: eellison

fbshipit-source-id: be365a2d392d7ac4bcc099a184762249ec2e18a6
2021-05-06 16:04:50 -07:00
aedcff7275 fix codegen for lite_interpreter (#57761)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57761

Reviewed By: cccclai

Differential Revision: D28262513

Pulled By: walterddr

fbshipit-source-id: 40fe82de540791f19fdf349e71b05a12b9a57ad0
2021-05-06 16:01:01 -07:00
52d1b91d38 Give Python sub-version in GHA CUDA workflow name (#57770)
Summary:
Addresses part of https://github.com/pytorch/pytorch/issues/57686#issuecomment-833672132. Evidence that the Python version is indeed 3.6: https://github.com/pytorch/pytorch/runs/2520276328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57770

Test Plan: CI would be nice, but this workflow does not currently run on PRs.

Reviewed By: malfet

Differential Revision: D28265048

Pulled By: samestep

fbshipit-source-id: 513caf52a8f18d6e529e0934bf024f49e1571926
2021-05-06 15:16:37 -07:00
2992ff3fb8 Revert D28142447: Improve BatchNorm1d performance (CUDA)
Test Plan: revert-hammer

Differential Revision:
D28142447 (b2936ad8fa)

Original commit changeset: c70109780e20

fbshipit-source-id: e93f6d00d644697b106f5ea8ab79872f353b51c6
2021-05-06 15:01:19 -07:00
3948ce2fd9 [Caffe2] Introduce c10::CudaError for CUDA Exceptions (#57609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57609

Throw c10::CudaError for CUDA Exceptions for better classification of errors

Test Plan: Test locally by running some workflows

Reviewed By: dzhulgakov

Differential Revision: D28209356

fbshipit-source-id: 19a5fc8548433238dc224ea81a5f63a945fc5cc3
2021-05-06 14:28:45 -07:00
cb234e606d [package] fix corner case in PackageImporter.whichmodule (#57651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57651

We've gone back and forth on whether to emulate the `sys.modules` lookup
behavior in our own `whichmodule`; the provided test is a concrete case
for doing so.

An additional minor cleanup is to make the type of `self.modules` in
importers `Dict[str, ModuleType]`. Modules could only be `None` in the
dictionary in older versions of the import system.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28226536

Pulled By: suo

fbshipit-source-id: c2e6da91651ddaa4fbf7171555df9e5cbe1060fd
2021-05-06 14:15:21 -07:00
2370d8c41f [profiler] Add profiler fallback (#57612)
Summary:
Add an ability to use new profiler API even if Kineto is not compiled
in, by falling back to the legacy profiler.
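
Usage is unchanged either way; a minimal sketch (the fallback to the legacy profiler is transparent when Kineto is absent):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Works in builds with or without Kineto; without it, the legacy
# autograd profiler is used under the hood.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.mm(torch.randn(64, 64), torch.randn(64, 64))

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```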

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57612

Test Plan:
compiled
USE_KINETO=0 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python
setup.py develop install --cmake
and with USE_KINETO=1
and ran
python test/test_profiler.py -v

Reviewed By: gdankel

Differential Revision: D28217680

Pulled By: ilia-cher

fbshipit-source-id: ec81fb527eb69bb0a3e0bd6aad13592200d7fe70
2021-05-06 13:35:27 -07:00
da06ae73a3 [c2] Fix flaky test_spatial_bn_multi_batch_grad
Summary: Removed the deadline restriction since the first run can take longer than the deadline, while subsequent runs are shorter.

Reviewed By: ngimel

Differential Revision: D28260077

fbshipit-source-id: 8ed2f5c16bc184bf4fae0a59b662fa1da2d4dd0a
2021-05-06 12:50:53 -07:00
eb6445a92a [JIT] Lazily initialize aliasDb in DCE (#56649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56649

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27926043

Pulled By: desertfire

fbshipit-source-id: 6f6cca6a8d32ac26d780a41edba1e6e653050a1f
2021-05-06 12:19:32 -07:00
b2936ad8fa Improve BatchNorm1d performance (CUDA) (#57034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57034

Resolves gh-38915

For the example given in the issue, BatchNorm1d on cuDNN is around 12x slower
than BatchNorm2d. Internally, cuDNN expects at least a 4d tensor (N, C, H, W)
so these two modules actually call the same cuDNN code. My assumption is that
cuDNN just isn't optimized for H=W=1.

Instead, this disables cuDNN for 2d batch_norm inputs and improves the CUDA
implementation of `native_batch_norm` to be competitive with cuDNN. For the
example in the issue, `BatchNorm1d` now takes 335 us compared to 6.3 ms before,
an 18x speedup.
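
A rough timing sketch of the kind of measurement quoted above (the sizes here are illustrative, not taken from the issue):

```python
import torch
import torch.nn as nn

x = torch.randn(65536, 256, device="cuda")  # (N, C) input, the 2d batch_norm case
bn = nn.BatchNorm1d(256).cuda()

bn(x)  # warm-up
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    bn(x)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.3f} ms per forward")
```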

Before this change, nvprof shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.64%  630.95ms       100  6.3095ms  5.6427ms  8.8800ms  void cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>(cudnnTensorStruct, float const *, cudnn::bn_fw_tr_1C11_kernel_NCHW<float, float, int=512, bool=0, int=2>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
```

But after, it shows:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.76%  14.352ms       100  143.52us  123.52us  756.28us  _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07022batch_norm_elementwiseERKNS_6TensorES5_RKN3c108optionalIS3_EESA_S5_S5_ENKUlvE_clEvENKUlvE2_clEvEUlfffffE_NS_6detail5ArrayIPcLi6EEE16OffsetCalculatorILi5EjESI_ILi1EjENS0_6memory15LoadWithoutCastENSL_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_
                   35.09%  9.1951ms       100  91.950us  84.415us  362.17us  void at::native::reduce_kernel<int=256, int=2, at::native::ReduceOp<float, at::native::WelfordOps<float, float, int, float, thrust::pair<float, float>>, unsigned int, float, int=2>>(float)
                    0.71%  186.14us       100  1.8610us  1.8240us  1.9840us  _ZN2at6native72_GLOBAL__N__48_tmpxft_001e82d0_00000000_7_Normalization_cpp1_ii_db66e07045unrolled_elementwise_kernel_for_multi_outputsILi3EZZZNS1_34batch_norm_update_stats_and_invertERKNS_6TensorES5_S5_S5_ddlENKUlvE_clEvENKUlvE2_clEvEUlffffE_NS_6detail5ArrayIPcLi7EEE23TrivialOffsetCalculatorILi4EjESD_ILi3EjEEEviT0_T1_T2_T3_
                    0.59%  153.37us       100  1.5330us  1.4720us  2.6240us  void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::AddFunctor<long>>, at::detail::Array<char*, int=2>>(int, long, at::native::AddFunctor<long>)
```

I think there is similar scope to improve the backward implementation.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28142447

Pulled By: ngimel

fbshipit-source-id: c70109780e206fa85e50a31e90a1cb4c533199da
2021-05-06 12:14:02 -07:00
1101a5f6e9 [paramcomms] support for in and out split sizes (#57709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57709

NOTE: the initial commit was reverted in D28247764

Adding a way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723 (fc657b547a)

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28248333

fbshipit-source-id: cee523612667cb37170c94e3c40dab5fba432225
2021-05-06 12:04:34 -07:00
ebd2c0a4ed Port ceil to structured (#57589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57589

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224598

Pulled By: ezyang

fbshipit-source-id: 2c83d2b005004d783a394f1f7e3db828adf5d566
2021-05-06 10:05:30 -07:00
ccbaa5fbe5 Port sign to structured (#57588)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57588

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224600

Pulled By: ezyang

fbshipit-source-id: 71de5211617c1eba34192e23831136ae5c403e61
2021-05-06 10:05:28 -07:00
3c0d22fab3 Port floor to structured (#57587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57587

Use ghstack to reopen to reduce self-conflicts.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28224599

Pulled By: ezyang

fbshipit-source-id: 023f21bc976b90f8a5a409db4f3390aa4eaea446
2021-05-06 10:04:15 -07:00
d83d1d3741 TensorIterator: documentation on the order of creation (#57550)
Summary:
Adds documentation to TensorIterator and TensorIteratorConfig stating that outputs need to be added before inputs.

Fixes https://github.com/pytorch/pytorch/issues/57343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57550

Reviewed By: VitalyFedyunin

Differential Revision: D28198135

Pulled By: mrshenli

fbshipit-source-id: 363603cac968bf786a4a6a64e353307c54d541b1
2021-05-06 09:39:47 -07:00
72ebdd68e1 Revert D28242069: Add cuSOLVER path for torch.linalg.lstsq
Test Plan: revert-hammer

Differential Revision:
D28242069 (7b31d4262b)

Original commit changeset: 23979d19ccc7

fbshipit-source-id: edf26a78b3485790deb1a8f53e8c8d3989c28e1b
2021-05-06 09:28:15 -07:00
dc06f52480 Add result() to ProcessGroupGloo::AsyncWork's (#57565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57565

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28255120

Pulled By: agolynski

fbshipit-source-id: 1e904d4fe024d5b99cb642f8689ca32be0581e82
2021-05-06 08:48:48 -07:00
a7ba0f08f3 Update internal code for torch.lu_solve (#56611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56611

The goal of this refactoring is to make `torch.linalg.solve`
a composition of calls to `lu_stub` and `lu_solve_stub`.
Once `lu_stub` and `lu_solve_stub` have a cuSOLVER-based codepath,
`torch.linalg.solve` will have it as well.

Replaced lu_solve_helper with DECLARE_DISPATCH for lu_solve_stub.
Removed an unnecessary copy, improving performance (see https://github.com/pytorch/pytorch/pull/56611#issuecomment-824303673).
Split MAGMA-based `apply_lu_solve` into `apply_lu_solve_looped_magma`
and `apply_lu_solve_batched_magma`. This simplifies future dispatch to
cuSOLVER and cuBLAS.
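
The user-facing behavior is unchanged; for reference, a minimal use of the ops being refactored:

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 2)
LU, pivots = torch.lu(A)           # LU factorization
x = torch.lu_solve(b, LU, pivots)  # solve A x = b via lu_solve
print(torch.allclose(A @ x, b, atol=1e-5))  # True
```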

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28142279

Pulled By: mruberry

fbshipit-source-id: 9d4baf650ca7a40b800616794408b34342d8d68f
2021-05-06 08:26:13 -07:00
cb7197ce3f Torchelastic: populate __init__.py with failover documentation
Summary: Torchelastic: populate __init__.py with failover documentation

Test Plan: {F613772684}

Reviewed By: cbalioglu

Differential Revision: D28243715

fbshipit-source-id: aeed8d3ddd2d27ef86d837e7e3ebfa7a0b80a07d
2021-05-06 07:38:48 -07:00
ad31aa652c Fixed the error in conv1d example (#57356)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57356

Reviewed By: albanD

Differential Revision: D28173174

Pulled By: malfet

fbshipit-source-id: 5e813306f2e2f7e0412ffaa5d147441134739e00
2021-05-06 07:02:37 -07:00
52ac015d76 Add note about improved vmap prototype to vmap docstring (#57677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57677

This PR adds a note about the existence of the improved vmap prototype
to raise awareness of its existence. Eventually the plan is to delete
the in-core vmap prototype and replace it with the improved vmap
prototype but that might take a while.

Test Plan: - view docs

Reviewed By: Chillee

Differential Revision: D28231346

Pulled By: zou3519

fbshipit-source-id: 0a3b274df87ffd50333330e413e1a89634865403
2021-05-06 06:47:18 -07:00
f1a62264f3 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28250914

fbshipit-source-id: 8bec4e0806891a045becf59c2d2f44f12bc41926
2021-05-06 05:11:25 -07:00
40cb55f978 Revert D28154522: Add call_kwargs(args, kwargs) method to torch::deploy api
Test Plan: revert-hammer

Differential Revision:
D28154522 (ba500c5c90)

Original commit changeset: 5ba57a8d7f01

fbshipit-source-id: 4d41efc733b22bc8eb8d6b174f4531e7e87e38ee
2021-05-06 04:52:07 -07:00
7b31d4262b Add cuSOLVER path for torch.linalg.lstsq (#57317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57317

This PR implements QR-based least squares solver using geqrf, ormqr, and
triangular_solve operations.

Internal code of triangular_solve was fixed to correctly handle larger
rectangular arrays.
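
For reference, a minimal call that exercises this path (on CUDA it now goes through the QR-based route):

```python
import torch

A = torch.randn(5, 3, device="cuda")
b = torch.randn(5, 2, device="cuda")
result = torch.linalg.lstsq(A, b)  # QR-based least squares on CUDA
print(result.solution.shape)       # torch.Size([3, 2])
```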

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28242069

Pulled By: mruberry

fbshipit-source-id: 23979d19ccc7f591afa8df4435d0db847e2d0d97
2021-05-06 04:45:55 -07:00
35fab44eaf Add CUDA support for torch.ormqr (#57316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57316

CUDA support is implemented using cuSOLVER.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28242071

Pulled By: mruberry

fbshipit-source-id: 6f0a1c50c21c376d2ee2907bddb618c6a600db1f
2021-05-06 04:45:54 -07:00
59d794b2c3 Port CPU torch.ormqr to ATen (#57315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57315

This PR ports `torch.ormqr` from TH to ATen.
CUDA path will be implemented in a follow-up PR.
With the ATen port, support for complex and batched inputs is added.
The tests are rewritten and OpInfo entry is added.

We can implement the least squares solver with geqrf + ormqr +
triangular_solve, so it's useful to have this function modernized, at least
for internal use; see the sketch below.
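
A minimal sketch of that composition for a full-rank, tall `A` (illustrative only):

```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)
b = torch.randn(5, 2, dtype=torch.float64)

qr, tau = torch.geqrf(A)                          # A = Q R, packed form
qtb = torch.ormqr(qr, tau, b, transpose=True)     # Q^T b without forming Q
R = qr[:3].triu()                                 # upper-triangular R
x = torch.triangular_solve(qtb[:3], R).solution   # solve R x = Q^T b

# Least-squares residual satisfies A^T (A x - b) = 0:
print(torch.allclose(A.T @ (A @ x - b), torch.zeros(3, 2, dtype=torch.float64), atol=1e-8))
```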

Resolves https://github.com/pytorch/pytorch/issues/24748

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28242070

Pulled By: mruberry

fbshipit-source-id: f070bb6ac2f5a3269b163b22f7354e9089ed3061
2021-05-06 04:44:40 -07:00
b4a098f1fb [pytorch][nnc] mobile nnc backend skeleton (#56852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56852

This is part of the changes to enable NNC AOT compilation for mobile.
It introduced a custom backend for NNC, which uses the components defined in the stacked PRs to load and execute a NNC-compiled model.
ghstack-source-id: 128285801

Test Plan:
- On X86 host:
```
buck build //xplat/caffe2/fb/lite_predictor:lite_predictor_nnc
buck-out/last/lite_predictor_nnc --model xplat/pytorch_models/build/pytorch_dev_linear/v1/nnc/compiled.pt --print_output true --input_dims '4,4' --input_type float
```
- On Android:
```
buck build fbsource//fbandroid/mode/gnustl //xplat/caffe2/fb/lite_predictor:lite_predictor_nncAndroid#android-armv7
adb push buck-out/last/lite_predictor_nncAndroid#android-armv7 /data/local/tmp
adb push xplat/pytorch_models/build/pytorch_dev_linear/v1/nnc/compiled.pt /data/local/tmp
adb shell 'cd /data/local/tmp; ./lite_predictor_nncAndroid\#android-armv7 --model compiled.pt --print_output true --input_dims "4,4" --input_type float'
```

Reviewed By: kimishpatel, raziel

Differential Revision: D27897153

fbshipit-source-id: 8e039089d1602782582747adfd75b31496b525ca
2021-05-06 03:25:18 -07:00
d82333e92a [pytorch][nnc] protocol classes to persist the context for compiled functions (#56851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56851

This is part of the changes to enable NNC AOT compilation for mobile.
At the end of the ahead-of-time compilation the compiler produces two sets of artifacts:
1. "compiled assembly code" - kernel functions in assembly format optimized for target platforms;
2. "compiled model" - regular TorchScript model that contains serialized parameters (weights/bias/etc) and invokes kernel functions via "handles" (name/version id/input & output specs/etc of the kernel functions).

This PR introduces a set of classes to represent kernel functions (a.k.a "handles"), which can be serialized/deserialized into/from the "compiled model" as an IValue.
Also introduces APIs to register/look-up "compiled assembly code".
ghstack-source-id: 128285802

Test Plan:
- unit tests
- for FB build environment:
buck test //caffe2/test/mobile/nnc:mobile_nnc

Reviewed By: kimishpatel, raziel

Differential Revision: D27921866

fbshipit-source-id: 4c2a4d8a4d072fc259416ae674b3b494f0ca56f3
2021-05-06 03:24:15 -07:00
db7b31358f Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571)
Summary:
When the memory requested is huge, some internal logic in CUDA caching allocator could overflow. The result of the overflow is the caching allocator gives a confusing error message.

For example:

```python
import torch
import torch.nn as nn
from torch.utils import cpp_extension
cuda_source = """
#include <c10/cuda/CUDACachingAllocator.h>
void my_fun(void)
{
    size_t temp_storage_bytes = 18446744073708433663UL;
    auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();
    auto temp_storage = caching_allocator.allocate(temp_storage_bytes);
    return;
}
"""
cpp_source = """
    void my_fun(void);
"""
module = torch.utils.cpp_extension.load_inline(
    name="cuda_test_extension",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions="my_fun",
    extra_cuda_cflags=["--extended-lambda"],
    verbose=True,
)
module.my_fun()
print('done')
```

gives

```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module>
    module.my_fun()
RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch.
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first):
frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #5: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #9: <unknown function> + 0x2337a (0x7f424f0e337a in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()#1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
<omitting python frames>
frame #30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571

Reviewed By: VitalyFedyunin

Differential Revision: D28224574

Pulled By: ezyang

fbshipit-source-id: df440961f6eaf58048af36ae2a06c59f3c18baec
2021-05-06 01:36:58 -07:00
7d4121d1d2 Make RRefContext get devices from RPC agent when creating OwnerRRef (#57443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443

Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...).

Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events, however we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring to specify devices when creating a OwnerRRef by hand in Python.
ghstack-source-id: 128184665

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28144365

fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3
2021-05-06 01:12:56 -07:00
7ffadf6e46 Replace DeviceIndexes with Devices in RRefs (#57442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442

We did this for the RPC agents and for ivalue::Future, the last one (I think) is RRef.
ghstack-source-id: 128184664

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28144368

fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3
2021-05-06 01:12:54 -07:00
8e9bbd3113 Make DataPtr extraction in CUDAFuture faster for Python values (#56918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56918

Re-importing a Python module each time is a bit expensive, and it's unnecessary because this is a private module which won't change and thus we can cache the value once we first extract it.
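
The change itself is in the C++ CUDAFuture code; as a language-neutral sketch of the memoization pattern, in Python (the module name is just an example):

```python
# Memoize the (private, stable) module instead of re-importing it on every
# DataPtr extraction; re-import is cheap but not free on a hot path.
_extraction_module = None

def _get_extraction_module():
    global _extraction_module
    if _extraction_module is None:
        import importlib
        _extraction_module = importlib.import_module("torch._utils")  # illustrative
    return _extraction_module
```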

ghstack-source-id: 128184666

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27985910

fbshipit-source-id: be40ae9b67ab8ea6c07bc2cb9a78d2c2c30b35d3
2021-05-06 01:12:53 -07:00
69de4940f3 Ensure devices are preserved when forwarding between futures (#57432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57432

In a bunch of places we were creating a future and then "forwarding" the value of another future to it once that other future completed. (This was in order to convert the type of the value, or to "merge" multiple futures into one). However when doing so we often created a child future with an empty set of devices, which meant it didn't support CUDA, and thus would cause a silent synchronization/correctness bug if the parent future did actually contain CUDA tensors.

One way this could have been caught earlier would have been to have Future always extract the DataPtrs, even in CPU-only mode, in order to ensure they always reside on the expected set of devices. Unfortunately this might have some adverse perf effects, thus it should be done carefully.
ghstack-source-id: 128184667

Test Plan: eyes

Reviewed By: mrshenli

Differential Revision: D28143045

fbshipit-source-id: 9af1abf270366dc1df0d4857d6a8cc73668af9d1
2021-05-06 01:12:51 -07:00
1292602375 Avoid re-extracting DataPtrs when forwarding values between Futures (#57433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57433

In a bunch of cases we need to "forward" between one future and another, typically because we need to convert the type of the data (e.g., from Message to PyObject). In most of these cases the DataPtrs of the value don't change, and yet the new future must re-extract them from scratch. By allowing the user to obtain the vector of extracted DataPtrs from the old future, we can allow them to "shortcut" this step.

Also, this change is a requirement for the next PR to work, since the next PR would otherwise cause us to attempt extracting DataPtrs from Message instances, which doesn't work (because Message is a custom class), but thanks to this PR we actually skip that.

ghstack-source-id: 128184663

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118298

fbshipit-source-id: 70e333ea6a4f8d4d9a86514c350028d412469ee1
2021-05-06 01:11:38 -07:00
1f178de800 [NNC] Add support for computing conv with dynamic shapes (#57514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57514

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28226918

Pulled By: navahgar

fbshipit-source-id: 818ac8411b809033388d419c8f33db6aeece4b33
2021-05-06 01:08:25 -07:00
eef72f3f8a [NNC] Update Buf on mutation instead of creating new ones (#57513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57513

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28226917

Pulled By: navahgar

fbshipit-source-id: 4e74c56a85b7aadc285b872b8ef8f8e26f31c8ce
2021-05-06 01:08:23 -07:00
95fbc158d4 [NNC] Add a method to compute conv without bias (#57512)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57512

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28226919

Pulled By: navahgar

fbshipit-source-id: e84b944f7fdc84a77409d59218ceaa0862298f3c
2021-05-06 01:07:21 -07:00
3fb5be05ba [iOS GPU] Add Metal API availability check (#57663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57663

Detail error messages when shader compilation fails.
ghstack-source-id: 128282408

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28247966

fbshipit-source-id: 2c8ae575acbb197048c1edde28674ab69f008751
2021-05-06 00:17:55 -07:00
7870450706 [PyTorch] Use c10::ThreadLocal instead thread_local in record_function.cpp for specific __GLIBCXX__ on Android (#57689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57689
* Older versions of libgnustl have issues with the `thread_local` C++ qualifier on Android devices prior to NDK r17. Use the `c10::ThreadLocal<>` wrapper with smart-pointer semantics in such cases.
* A convenience macro, `C10_DEFINE_TLS_static`, was added as well:

```
  // Define static TLS variable str_tls_ of type std::string
  C10_DEFINE_TLS_static(std::string, str_tls_);

  //////// Exercise it ////////
  {
     *str_tls_ = "abc";
     assert(str_tls_->length() == 3);
  }
```
ghstack-source-id: 128233742

Test Plan: CI +

Reviewed By: ilia-cher

Differential Revision: D27875779

fbshipit-source-id: 7764f96ac1e121051c6ea66eabcedb9ef54d290e
2021-05-06 00:13:33 -07:00
fc657b547a [kineto] set the correct device id for GenericTraceActivity
Summary: While merging ClientTraceActivity and GenericTraceActivity, we accidentally adopted CTA's behavior of returning the process id instead of its `device`. This causes GTA entries to show up in the CPU timeline rather than on the associated GPU's.

Test Plan:
before

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620113910%2F127.0.0.1%2Flibkineto_activities_1270242.json.gz&bucket=gpu_traces

{F613233496}

after

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620115859%2F127.0.0.1%2Flibkineto_activities_1511899.json.gz&bucket=gpu_traces

{F613231643}

Reviewed By: gdankel

Differential Revision: D28196723

fbshipit-source-id: eb8330c14e7c43a470bb4df4811b80754d96535b
2021-05-05 23:54:08 -07:00
8bbe383877 [Static Runtime] Fix bugs in logit (#57578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57578

The original impl in SR assumes that eps is a constant, which is true most of the time. However, it could be a graph input as well. This diff fixes that issue; unit tests are added as well.
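
A minimal illustration of the case being fixed, where `eps` arrives as a graph input rather than a constant:

```python
import torch

@torch.jit.script
def logit_with_eps(x: torch.Tensor, eps: float):
    return torch.logit(x, eps)  # eps is a graph input here, not a constant

print(logit_with_eps(torch.rand(4), 1e-6))
```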

Reviewed By: edvgha

Differential Revision: D28207975

fbshipit-source-id: 9a10dec159f3804e43ef74aaa20c3ec6c79548c9
2021-05-05 23:38:15 -07:00
126ea1ccad relax type equality constraint for scalars (#57532)
Summary:
Currently we require type equality for `torch.testing.assert_(equal|close)`:

3db45bcb91/torch/testing/_asserts.py (L509-L513)

That means `assert_equal(1, 1.0)` will correctly fail. Although the type of a scalar is similar to the dtype of a tensor, `assert_equal(1, 1.0, check_dtype=False)` will also fail while `assert_equal(torch.as_tensor(1), torch.as_tensor(1.0), check_dtype=False)` will pass.

To make the interface more consistent, this PR relaxes the type equality constraint, by disabling it in case both inputs are scalars.
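
A sketch of the resulting behavior (assuming the public `torch.testing.assert_close` entry point):

```python
import torch

# With check_dtype=False, scalar inputs now mirror tensor behavior:
torch.testing.assert_close(1, 1.0, check_dtype=False)
torch.testing.assert_close(
    torch.as_tensor(1), torch.as_tensor(1.0), check_dtype=False)
# torch.testing.assert_close(1, 1.0)  # would still fail: int vs. float
```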

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57532

Reviewed By: ngimel

Differential Revision: D28242428

Pulled By: mruberry

fbshipit-source-id: b643c77f48b64fc2c8a43925120d2b634ec336b5
2021-05-05 22:42:51 -07:00
ba78bf1363 [standaloneRunner] fix another GIL mutithreading issue exposed by torch::jit::toIValue() (#57688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57688

P412982836 says that `torch::jit::toIValue()` will also touch the GIL through `torch::jit::createGenericDict()` (P412848640),
so we have to move `torch::jit::toIValue()` out of multithreaded execution.

Reviewed By: hyuen

Differential Revision: D28236527

fbshipit-source-id: 43a33dbcfc828cc42c5e1230c8f5cb415bf7bde4
2021-05-05 21:41:04 -07:00
ccbbb2d6f8 Revert D28052211: [paramcomms] support for in and out split sizes
Test Plan: revert-hammer

Differential Revision:
D28052211 (866b19e95d)

Original commit changeset: 4ab7d425fc72

fbshipit-source-id: 80c001ddcb3730f0487adddf66d9166f53c45a8c
2021-05-05 21:10:31 -07:00
86b061c80e [FX] Changes in order to move python key out of tree (#57427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57427

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28215322

Pulled By: Chillee

fbshipit-source-id: 94439376097c74f2004e6eca214d7940df20865d
2021-05-05 20:55:51 -07:00
c27428b5e9 [nnc] ported conv2d lowering over (#56875)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56875

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28213450

Pulled By: Chillee

fbshipit-source-id: bacdcec83ec61aba1d55f5e3a16f81d6ada3cff2
2021-05-05 20:54:43 -07:00
866b19e95d [paramcomms] support for in and out split sizes
Summary: Adding a way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28052211

fbshipit-source-id: 4ab7d425fc722907d9bbcfad7e364d031ff69b29
2021-05-05 20:46:11 -07:00
27af9b0462 Fix flaky test_rref_context_debug_info (#57526)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57526

This test would create an RRef, delete that rref and then create two
more RRefs and validate total rrefs were 2 in the end.

Due to the async nature of deletion, sometimes the RRef would not yet be
deleted by the time the assertion ran. I've fixed this by waiting for the
RRef to be deleted at the appropriate time.

Closes: https://github.com/pytorch/pytorch/issues/55382
ghstack-source-id: 128037566

Test Plan: waitforbuildbot

Reviewed By: H-Huang

Differential Revision: D28173151

fbshipit-source-id: e4f34ff4e49b72cfc9e67a72c482f5e05159eda5
2021-05-05 20:26:58 -07:00
ba500c5c90 Add call_kwargs(args, kwargs) method to torch::deploy api (#57484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57484

To be used by PyTorchPredictor integration for deploy.

Test Plan: tested via new unit tests

Reviewed By: suo

Differential Revision: D28154522

fbshipit-source-id: 5ba57a8d7f01686180e6fd47663635ec3ab2120d
2021-05-05 20:21:59 -07:00
8df9b88042 [kineto] Update Kineto submodule (#57700)
Summary:
Update Kineto submodule to fix an invalid json bug, also update and
move profiler json tracing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57700

Test Plan: python test/test_profiler.py -v

Reviewed By: gdankel, rohan-varma

Differential Revision: D28243256

Pulled By: ilia-cher

fbshipit-source-id: edfe9f26c66e967d610231be5fc22ba5ee1054fa
2021-05-05 20:09:38 -07:00
0d813bbca5 Revert D28177176: [iOS GPU] Add Metal API availability check
Test Plan: revert-hammer

Differential Revision:
D28177176 (30c96c9419)

Original commit changeset: b5913e5ed75d

fbshipit-source-id: e6ea545493b788b66b065a4c16daf31373c9eecc
2021-05-05 19:50:09 -07:00
44b021d21b [package] remove save_source_file API (#57340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57340

This API was only used within our own implementation. I couldn't find
any uses anywhere else. Removing it to reduce our overall surface area,
and also because the semantics are unclear in a world where
serialization is deferred to close() time.

Differential Revision: D28114188

Test Plan: Imported from OSS

Reviewed By: anjali411

Pulled By: suo

fbshipit-source-id: 6da53f20518885c7f4359e00e174f5e911906389
2021-05-05 17:57:05 -07:00
a3cba770b5 [package] remove PackageExporter.file_structure (#57339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57339

After the `intern` changes, we will no longer eagerly write to the package
archive, so `file_structure` as written doesn't make much sense.

Differential Revision: D28114187

Test Plan: Imported from OSS

Reviewed By: anjali411

Pulled By: suo

fbshipit-source-id: 875595db933e9d1b2fdde907b086889cc977e92f
2021-05-05 17:57:04 -07:00
f326f7dda8 [package] use digraph to back dependency visualization (#57338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57338

Differential Revision: D28114190

Test Plan: Imported from OSS

Reviewed By: astaff

Pulled By: suo

fbshipit-source-id: 78b15edae3b991307fd3656ac7b374d4d218b460
2021-05-05 17:57:02 -07:00
53c21172c0 [package] add simple graph data structure (#57337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57337

Add a really simple graph data structure for tracking dependencies. The API
is based on networkx, but I didn't want to require the dependency.

Differential Revision: D28114186

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Pulled By: suo

fbshipit-source-id: 802fd067017e493a48d6672538080e61d249accd
2021-05-05 17:57:00 -07:00
a39c685ace [package] make extern a dict (#57336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57336

Avoid a small n^2

Differential Revision: D28114189

Test Plan: Imported from OSS

Reviewed By: astaff

Pulled By: suo

fbshipit-source-id: 2672669ad0e23169d70c92f9d5ed61f66081f248
2021-05-05 17:56:59 -07:00
dedf9fbe81 [package] factor out PackageExporter._get_dependencies (#57335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57335

Mostly refactoring. The only behavioral change is that I have eliminated
the `orig_source_file` argument to `save_source_string`. I think it
doesn't provide enough marginal value (since if you have the module name
you can get the source file anyway).

Differential Revision: D28114184

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Pulled By: suo

fbshipit-source-id: b5e9eb4250dc84552befeef2dcf9e591b32899ae
2021-05-05 17:55:48 -07:00
7627dd568a hardswish reland (#57652)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57652

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D28226724

Pulled By: eellison

fbshipit-source-id: 585a91ffab7a855b5600e79130a37be25ef9b354
2021-05-05 17:21:43 -07:00
56211524a7 [NNC] ported over sum and softmax to new scheme (#56775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56775

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28173905

Pulled By: Chillee

fbshipit-source-id: 865ff71e5a428341d7225f534f7093ef2994fe5a
2021-05-05 17:09:34 -07:00
0b51ee311d Add missing return statement from 57057 (#57669)
Summary:
Fixes a bug introduced by https://github.com/pytorch/pytorch/issues/57057

cc ailzhang: while writing the tests, I realized that for these functions we don't properly set the CreationMeta in no-grad mode and inference mode. Added a todo there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57669

Reviewed By: soulitzer

Differential Revision: D28231005

Pulled By: albanD

fbshipit-source-id: 08a68d23ded87027476914bc87f3a0537f01fc33
2021-05-05 16:13:35 -07:00
cd22bdf236 [PyTorch] Autoformat c10, round 2 (#57645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57645

Second round of autoformatting changes since the first pass became too large.
ghstack-source-id: 128199695

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D28131430

fbshipit-source-id: 24b03e38b087f31e8cac2404bebcd401c55b6cab
2021-05-05 15:45:53 -07:00
e5179e960e Share VS Code settings/extensions nicely (#57671)
Summary:
This is a second attempt at https://github.com/pytorch/pytorch/issues/51214. It should achieve the same goals with (as far as I can tell) no disadvantages, but the advantages are a bit less pronounced than in the more dictatorial approach that https://github.com/pytorch/pytorch/issues/51214 took:

- Unfortunately, I was unable to figure out how to include [the `mypy` configuration given in the docstring of `tools.mypy_wrapper.main`](7115a4b870/tools/mypy_wrapper.py (L81-L89)), because as walterddr pointed out, `"${env:HOME}/miniconda3/envs/pytorch/bin/python"` is not guaranteed to be correct on everyone's machine:
  ```json
  {
    "python.linting.enabled": true,
    "python.linting.mypyEnabled": true,
    "python.linting.mypyPath": "${env:HOME}/miniconda3/envs/pytorch/bin/python",
    "python.linting.mypyArgs": [
      "${workspaceFolder}/tools/mypy_wrapper.py"
    ]
  }
  ```

  Importantly, this does not work:
  ```json
  "python.linting.mypyPath": "${workspaceFolder}/tools/mypy_wrapper.py"
  ```
  This is because VS Code does not run the given `mypy` command inside of the user's specified virtual environment, so for instance, on my system, setting the `mypy` command to directly call `tools/mypy_wrapper.py` results in using `mypy 0.782` instead of the correct `mypy 0.812`.

  Sadly, [this](https://code.visualstudio.com/docs/editor/variables-reference#_configuration-variables) does not work either, although I'm not sure why:
  ```json
  {
    "python.linting.mypyPath": "${config:python.pythonPath}",
    "python.linting.mypyArgs": [
      "${workspaceFolder}/tools/mypy_wrapper.py"
    ]
  }
  ```

- As a result, `git clean -fdx; tools/vscode_settings.py` still results in some loss of useful configuration.

One other thing to note: as `.vscode/settings_recommended.json` shows, there are some configuration sections that only take effect within the context of a `"[language]"`, so currently, if a dev already has one of those settings, it would be entirely overwritten by `tools/vscode_settings.py` rather than a graceful merge. This could probably be fixed by using a deep merge instead of the current shallow merge strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57671

Test Plan:
If you want, you can typecheck the small script added by this PR (no output is expected):
```sh
tools/mypy_wrapper.py $PWD/tools/vscode_settings.py
```
You can also try running it to update your own VS Code workspace settings:
```sh
tools/vscode_settings.py
```
This should have minimal impact on your existing `.vscode/settings.json` file other than enabling the few explicitly recommended settings (e.g. it should not reorder or remove any of your existing settings).

Reviewed By: malfet

Differential Revision: D28230390

Pulled By: samestep

fbshipit-source-id: 53a7907229e5807c77531cae4f9ab9d469fd7684
2021-05-05 15:19:59 -07:00
65fad0ebd2 Expand Kineto platform support (ci-all) (#56323)
Summary:
Expanding support to all builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56323

Test Plan: CI

Reviewed By: malfet

Differential Revision: D28171478

Pulled By: ilia-cher

fbshipit-source-id: 16bc752d1be3cbaeda5316f5d8a687ae05a83d22
2021-05-05 15:00:01 -07:00
30c96c9419 [iOS GPU] Add Metal API availability check (#57663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57663

Detail error messages when shader compilation fails.
ghstack-source-id: 128206967

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28177176

fbshipit-source-id: b5913e5ed75df96fda770c3f1a893f9bfd781ec0
2021-05-05 14:21:56 -07:00
69e64b2632 [Flaky tests] Fix flaky rpc profiling tests (#57517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57517

Fixes the flaky tests https://github.com/pytorch/pytorch/issues/45145
and https://github.com/pytorch/pytorch/issues/45067.

The root cause is that not all remote events will be children of the
record-function remote event, as other events can sometimes be profiled
under the hood, such as in the issue described in
https://github.com/pytorch/pytorch/issues/43868.

We fix this issue by verifying that the set of events that are children on the
remote end and children on the local end are the same, without necessarily
enforcing specific events to be logged.

Tested by running the test 1000+ times and verifying it passed. Will also test on CI box before landing
ghstack-source-id: 128200041

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D28166602

fbshipit-source-id: 8145857da4642aef31f360b20db00f4328abe2ca
2021-05-05 14:06:32 -07:00
c4bb6a5781 NNAPI: flex size support for upsample_nearest2d op (#57563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57563

Add flexible size support for upsample_nearest2d op in nnapi model conversion

Test Plan:
pytest test/test_nnapi.py

Imported from OSS

Reviewed By: dreiss

Differential Revision: D28200847

fbshipit-source-id: 901fe3f6e68e4c16ece730f3ffa68dc88c6ed6c3
2021-05-05 13:54:43 -07:00
4c609a9782 NNAPI: Add qadd flexible size support (#57562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57562

Add flexible size support for qadd op in nnapi model conversion

Test Plan:
pytest test/test_nnapi.py

Imported from OSS

Reviewed By: dreiss

Differential Revision: D28200849

fbshipit-source-id: d5b2ea8e9eb8ae405ff2c960f7549cef60bc0991
2021-05-05 13:54:41 -07:00
28cd04ea64 NNAPI: add flexible size support for conv2d (#57561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57561

Add flexible size support for conv2d op in nnapi model conversion

Test Plan:
pytest test/test_nnapi.py

Imported from OSS

Reviewed By: dreiss

Differential Revision: D28200848

fbshipit-source-id: d94ccf48a3d8453aa8e96c7cac02948c4cd870cc
2021-05-05 13:53:33 -07:00
049152faa9 Make torch.linalg.eigvalsh differentiable (#57189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57189

`torch.linalg.eigvalsh` now supports autograd. This is achieved by
computing the eigenvectors internally if input requires grad,
otherwise the eigenvectors are not computed and the operation is faster.
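
A minimal example of the newly supported autograd path:

```python
import torch

a = torch.randn(4, 4, dtype=torch.float64)
a = (a + a.transpose(-2, -1)).requires_grad_(True)  # symmetric input

w = torch.linalg.eigvalsh(a)  # eigenvectors computed internally for backward
w.sum().backward()
print(a.grad.shape)  # torch.Size([4, 4])
```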

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28199708

Pulled By: albanD

fbshipit-source-id: 12ac56f50137398613e186abd49f82c8ab83532e
2021-05-05 13:12:18 -07:00
babae61f2f Make torch.linalg.svdvals differentiable (#57188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57188

`torch.linalg.svdvals` now supports autograd. This is achieved by
computing the singular vectors internally if input requires grad,
otherwise the vectors are not computed and the operation is faster.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28199709

Pulled By: albanD

fbshipit-source-id: cf39cf40965c606927db5331ce16743178fa711f
2021-05-05 13:11:15 -07:00
534c457d3d add a standalone extra file loader for pytorch model (#57591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57591

According to dhruvbird, we should be able to read a file from a PyTorch model (which is a zip file) using miniz. This diff adds a standalone loader so users can load a JSON (or other) file from the extra folder of the model. The whole point is to avoid loading the PyTorch library first, which can be complex (voltron, dynamic loading, etc.).

With this, the hand tracking inference config (D27937516) no longer depends on PyTorch or uses dynamic_pytorch. Previously it used `torch::jit::_load_extra_only_for_mobile`, which requires PyTorch to be loaded first; we want to avoid doing that.

Test Plan: buck test caffe2/fb/dynamic_pytorch:extract_file_test

Reviewed By: dhruvbird

Differential Revision: D28140492

fbshipit-source-id: 2fd1570523841f4c35dc2ad8dfde5f1d396a74fa
2021-05-05 13:08:40 -07:00
15c092b888 Revert "Make grad mode error just a warning (#56401)" (#57640)
Summary:
This reverts commit 63dac82444cc522f177b801d9f0cd2e22417c2f4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57640

Reviewed By: soulitzer, yuguo68

Differential Revision: D28223946

Pulled By: albanD

fbshipit-source-id: 641b87cff1e2f08162ca8cacae333105e89438f1
2021-05-05 13:07:29 -07:00
7115a4b870 Clang format ProcessGroupNCCL.cpp (#56840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56840

Per comments in https://github.com/pytorch/pytorch/pull/56427/files
ghstack-source-id: 128142665

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27980768

fbshipit-source-id: 0158ae1cfd892ff3385ffa0084dd7ef9de014f8c
2021-05-05 10:17:09 -07:00
a948e279ac [c10d] Profiler support for nccl p2p collectives (#56427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56427

This PR enables support for nccl send/recv profiling similar to how we have it for MPI and Gloo.

The process to do so is similar to the NCCL collectives where we create the `recordingFunction` in `initWork` and then add a callback that runs the profiler end callbacks. Tests are added similar to send/recv tests with gloo/MPI.

We also test with both autograd profiler and torch.profiler.
ghstack-source-id: 128142666

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27866600

fbshipit-source-id: f29d9103e22b22f658632fece0df9ba36911fc62
2021-05-05 10:14:56 -07:00
17035f6aab Speedup render_junit (#57641)
Summary:
JUnitXml.__iadd__() is very slow.
But since testsuites are flattened anyway in
`convert_junit_to_testcases`, concatenate the flattened tests right away.

As a result, parsing a test-reports folder with 393 files and 25+ test cases
takes 0.5 sec instead of 193 sec.

Fix typing errors and add the script to mypy-strict.

Print a warning rather than aborting if the XML cannot be parsed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57641

Reviewed By: samestep

Differential Revision: D28224401

Pulled By: malfet

fbshipit-source-id: 3efc079c1c0deef8fff5ddf083268885b28418f9
2021-05-05 09:45:47 -07:00
fb9a32b7b4 [PyTorch][Edge] Add api to get bytecode model version (#56801)
Summary:
Add an API `_get_bytecode_version` to get the version number of a bytecode model, in both C++ and Python; the input can be either a file path or a buffer.
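
A sketch of the Python usage; the import location and file name below are assumptions, not part of this commit message:

```python
# Assumed Python entry point; treat this as a sketch, not the canonical API.
from torch.jit.mobile import _get_bytecode_version  # assumed import path

version = _get_bytecode_version("my_lite_model.ptl")  # hypothetical file
print(version)
```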
## Test
CI (new added unit test will run as part of `pytorch_core-buck`)

1. run test_lite_interpreter.cpp
2. `python test/mobile/test_bytecode.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56801

ghstack-source-id: 128169647

Test Plan:
CI (new added unit test will run as part of `pytorch_core-buck`)

1. run test_lite_interpreter.cpp
2. `python test/mobile/test_bytecode.py`

Reviewed By: iseeyuan

Differential Revision: D27961417

fbshipit-source-id: f786cc9573d855feecff0b4fe8e5363e25f5728c
2021-05-05 09:17:26 -07:00
dedaf4fad7 Reland: [TensorExpr] Add methods for inspecting generated code in TensorExprKernel. (#57560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57560

The new methods allow peeking into bufferArgs, which describe the parameters
that the codegen expects. This description includes whether a given
parameter is a scalar var or a buffer; if it's a buffer, it allows
getting the corresponding `Buf*` pointer, from which we can get the
expected sizes.

Relanding #57074  which was reverted because I forgot to guard a new
test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28199048

Pulled By: ZolotukhinM

fbshipit-source-id: 636e838e7e242a3c63e97ec453b8fae9b6380231
2021-05-05 09:11:40 -07:00
9e7814d539 Reland: [StaticRuntime] Use NNC's call_raw API to reduce call overheads. (#57553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57553

Relanding #57329 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195048

Pulled By: ZolotukhinM

fbshipit-source-id: 50052a2f20f84940b83d1dd1241c8659ff06e014
2021-05-05 09:11:38 -07:00
e686c66fe7 Reland: [TensorExpr] Add TensorExprKernel::runFast method. (#57552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57552

This method uses `CodeGen::call_raw` instead of `CodeGen::call`.

Relanding #57328 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195047

Pulled By: ZolotukhinM

fbshipit-source-id: bcfd3cb5b4f33a149b7549515ffd705e2c4f208f
2021-05-05 09:11:37 -07:00
0bf69278f7 Reland: [TensorExpr] Add CodeGen::call_raw method. (#57551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57551

The new method allows passing input and output arguments as `void*`
pointers instead of CallArgs, which helps reduce the invocation
overhead. Currently this is only supported in the LLVM codegen.

Relanding #55113 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195049

Pulled By: ZolotukhinM

fbshipit-source-id: 035b77ae996dbbcd542b4b0e4c011b41e8d7828b
2021-05-05 09:10:25 -07:00
da8cc355a3 Relax tp_new so that it is OK to call (#57544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57544

Instead of removing tp_new from the superclass (which causes
super().__new__ to not work), I now still install tp_new on the
superclass, but verify that you are not trying to directly
construct _TensorBase.

Fixes https://github.com/pytorch/pytorch/issues/57421
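
A sketch of the intended behavior (the exact error type below is an assumption):

```python
import torch

class MyTensor(torch._C._TensorBase):
    pass  # subclasses can rely on super().__new__ again

try:
    torch._C._TensorBase()  # direct construction of the base is rejected
except Exception as err:    # exact error type is an assumption
    print(type(err).__name__, err)
```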

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28189475

Pulled By: ezyang

fbshipit-source-id: 9397a3842a77f5428d182dd62244b42425bca827
2021-05-05 09:04:39 -07:00
c65a1da90a Fixed C++ linalg API (#57464)
Summary:
Previous reverted PR https://github.com/pytorch/pytorch/pull/57055. This PR leaves the deprecated signatures untouched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57464

Reviewed By: mruberry

Differential Revision: D28151048

Pulled By: heitorschueroff

fbshipit-source-id: bc89d6cf3d801819d37b3d19bf525f8abd816881
2021-05-05 08:05:10 -07:00
887d0e5657 Revert D28197820: [JIT][NNC] add hardswish symbolic gradient and NNC lowering
Test Plan: revert-hammer

Differential Revision:
D28197820 (0142fd0b57)

Original commit changeset: 05305d85c5bb

fbshipit-source-id: 2e1d9699515982ba2a9be06e83a2ce043ec857ee
2021-05-05 07:53:30 -07:00
0787d781c5 Fix compatibility problem with LSTMs and torch.save (#57558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57558

Fixes #53359

If someone directly saves an nn.LSTM in PyTorch 1.7 and then loads it in PyTorch
1.8, it errors out with the following:
```
(In PyTorch 1.7)
import torch
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')

(In PyTorch 1.8)
model = torch.load('lstm17.pt')
AttributeError: 'LSTM' object has no attribute 'proj_size'
```

Although we do not officially support this (directly saving modules via
torch.save), it used to work and the fix is very simple. This PR adds an
extra line to `__setstate__`: if the state we are passed does not have
a `proj_size` attribute, we assume it was saved from PyTorch 1.7 and
older and set `proj_size` equal to 0.
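
A minimal sketch of the guard described above (illustrative shape, not the literal diff):

```python
import torch.nn as nn

class CompatLSTM(nn.LSTM):
    def __setstate__(self, d):
        if "proj_size" not in d:
            d["proj_size"] = 0  # state saved by PyTorch <= 1.7 predates proj_size
        super().__setstate__(d)
```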

Test Plan:
I wrote a test that tests `__setstate__`. But also,

Run the following:
```
(In PyTorch 1.7)
import torch
x = torch.ones(32, 5, 2)
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')
y17 = model(x)

(Using this PR)
model = torch.load('lstm17.pt')
x = torch.ones(32, 5, 2)
y18 = model(x)
```
and finally compare y17 and y18.

Reviewed By: mrshenli

Differential Revision: D28198477

Pulled By: zou3519

fbshipit-source-id: e107d1ebdda23a195a1c3574de32a444eeb16191
2021-05-05 07:36:13 -07:00
49adac65c4 ns for fx: clean up manual string names of related ops (#57210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57210

Removes the manually specified string name for sets of
related ops, and replaces it with an automatically generated
index. The manual name was arbitrary and ok for an MVP, but
is not safe for wide usage.

Also, adds APIs for users to add custom functions to the
relatedness map by either pairing it to a known function
or creating a new relatedness set.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28077977

fbshipit-source-id: e64a1ad6cd063014d74cdad189b0a612b1143435
2021-05-05 06:30:32 -07:00
76f29d53bf ns for fx: change matching to only match known types (#57186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57186

Before this PR, we matched any pair of nodes with equal or related
types.

This PR changes the behavior to only match nodes whose type is in
the allowlist (the relatedness mappings). This will prevent matching
user defined modules, unless users add them to the mappings.

This is motivated by a couple of things:
1. if user defined types are matched, it can break scriptability of the
   model with loggers attached. This happens whenever the user module
   has a return type of anything other than a Tensor or a tuple of
   Tensors.
2. we tried the past behavior on a couple of models, and it hasn't been
   useful.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXGraphMatcherModels
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28077981

fbshipit-source-id: 0a698e52b807cda47e6923310448a985b26eb362
2021-05-05 06:30:30 -07:00
44bb15cfd3 ns for fx: add more type to relationship mapping (#57184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57184

Add remaining types to the relationship mapping to have full coverage
of ops quantization knows about, except binary ops and RNNs.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_op_relationship_mapping
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28077979

fbshipit-source-id: 0f6070c8a995032978702d088803f89ff25f2a7f
2021-05-05 06:30:29 -07:00
a9dc9535f6 ns for fx: move relatedness mapping to mappings file (#57171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57171

No logic change, just moving the mapping to a file where
the other mappings are.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28077978

fbshipit-source-id: 4049d6a498156a5dffe3a03d2f4abc79da7bf907
2021-05-05 06:29:11 -07:00
9ec6883442 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28216577

fbshipit-source-id: ce31fb98320a31eb947bdd31c68aaafed034df79
2021-05-05 04:41:21 -07:00
aeaa91bff6 mkldnn gelu (#53615)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53615

Reviewed By: anjali411

Differential Revision: D28154396

Pulled By: Krovatkin

fbshipit-source-id: 7a9d4d37dc06e54e3249c531a034667b5a2afc46
2021-05-05 02:03:52 -07:00
0142fd0b57 [JIT][NNC] add hardswish symbolic gradient and NNC lowering (#57383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57383

Notes: I picked up an activation from https://github.com/pytorch/pytorch/issues/56969. You can look at the [activations.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/Activation.cpp#L429) file which has both forward and backward kernel code to help you write the NNC lowering and the symbolic gradient.
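
For reference, hardswish(x) = x * relu6(x + 3) / 6, so its gradient is 0 for x < -3, 1 for x > 3, and (2x + 3) / 6 in between. A quick autograd check of the shipped operator (not the NNC lowering itself):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-4.0, 1.0, 4.0], requires_grad=True)
F.hardswish(x).sum().backward()
print(x.grad)  # ~[0.0000, 0.8333, 1.0000], i.e. 0, (2*1+3)/6, 1
```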

I added a test in test_jit_fuser_te for the fusion, and I added an OpInfo and asserted that we expect to see autodiffable nodes to test the symbolic gradient.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28197820

Pulled By: eellison

fbshipit-source-id: 05305d85c5bb0847c8f911b95ba47b137dca7e90
2021-05-04 23:39:59 -07:00
133d8abbfc Compute nvrtc during libtorch build (#57579)
Summary:
The warning is completely harmless, but it is still nice not to emit it
when it can be computed.

Fixes https://github.com/pytorch/pytorch/issues/53350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57579

Reviewed By: walterddr

Differential Revision: D28208938

Pulled By: malfet

fbshipit-source-id: 8dcc3f1bff7c5ed2c0157268c3063228d3c445b6
2021-05-04 22:51:24 -07:00
cd9995ae14 Update Gloo submodule (#57586)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57586

Reviewed By: mrshenli

Differential Revision: D28210701

Pulled By: pbelevich

fbshipit-source-id: 5edce939ee50bc0488e190a61b6fd10c635dff67
2021-05-04 22:06:18 -07:00
45a3231bb8 [codemod] Enforce proper use of emplacy functions
Summary: The goal of this diff is to enforce proper use of "emplacy" functions. In each case, this saves at worst a move constructor call, and at best a full copy of the object (in the case of a constructor call where the object does not have a move constructor).

Test Plan: CI.

Reviewed By: marksantaniello

Differential Revision: D27888714

fbshipit-source-id: 235d0b31066463588c7e4ab86e132c430a352500
2021-05-04 20:58:18 -07:00
d728491fc1 [RFC] [PyTorch Edge] Simplify error logging in mobile/import.cpp (#55711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55711

Currently, there is some complex logic that tries to handle all exceptions but re-throws them as a `c10::Error` so that it can log the error message. I'm looking for context on why this was added. The current logic (after talking with swolchok) seems equivalent, simpler, and also preserves the original stack trace from where the exception was originally thrown. This is useful when viewing the backtrace in logview. Re-throwing an exception using `TORCH_CHECK(false, message)` results in the original exception stack trace getting lost, so we want to avoid that.
ghstack-source-id: 128043281

Test Plan: Build.

Reviewed By: iseeyuan

Differential Revision: D27688352

fbshipit-source-id: b7b1a29b652b31da80d72f16d284e48b8623377b
2021-05-04 20:45:32 -07:00
eb39da6b52 Always run as many quick-checks steps as possible (#57572)
Summary:
This is essentially a continuation of https://github.com/pytorch/pytorch/issues/56700. Currently, some of the steps in **Lint / quick-checks** (such as the trailing newlines check) still don't always run if an earlier steps fail; example: https://github.com/pytorch/pytorch/runs/2504623867

This PR adds some more `if`s to remaining steps, so that they, too, can still run even when earlier steps fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57572

Test Plan:
- https://github.com/pytorch/pytorch/runs/2504706736 before this PR, many steps get skipped if an early step fails
- https://github.com/pytorch/pytorch/runs/2504778437 using this PR's technique, those steps still run
- https://github.com/pytorch/pytorch/runs/2504787234 if the requirements step doesn't run, steps still get skipped
- https://github.com/pytorch/pytorch/runs/2504796695 after this PR, `quick-checks` still succeeds

Reviewed By: driazati

Differential Revision: D28205900

Pulled By: samestep

fbshipit-source-id: bea856e15bdd17ee66e9ebba019ce91133b17bcd
2021-05-04 19:18:18 -07:00
7175d49122 [Dist profiling] Add is_async field (#57253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes tests such as #50840 and #56690, which have been flaky due to the profiling part (https://github.com/pytorch/pytorch/pull/56963 tried to do so as well, but this is a better approach).
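
As a rough illustration of point 3, assuming the flag is exposed on the profiler's Python events as this PR describes:

```python
import torch
from torch.autograd import profiler

with profiler.profile() as prof:
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))

# Instead of inferring asynchrony from mismatched thread ids alone, the
# profiler can now consult the explicit is_async flag on each event.
for evt in prof.function_events:
    print(evt.name, evt.is_async)
```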
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
2021-05-04 17:44:28 -07:00
151e81b7bc [nnc][tests] Skip long running tests when using TE interpreter (#57568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57568

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28202740

Pulled By: bertmaher

fbshipit-source-id: 3f88aed91cd92c270ea5e6b504ae5ebc6810aa2b
2021-05-04 16:57:48 -07:00
7c3a30fd79 fx quant: remove matching hack for binary qhandler (#57470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57470

Removes the earlier hack of matching patterns originally matched
to BinaryOpQuantizeHandler to switch to CopyHandler. After this PR,
each pattern can only be matched to one type of QuantizeHandler or
to nothing.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28152909

fbshipit-source-id: afc285e770bd7eb0518c90e3ee4874c421e78bbc
2021-05-04 16:38:56 -07:00
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
8c9e42baaf .github: Add render_test_results job (#57472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57472

This render should give us some feature parity with the CircleCI web UI renders for j(x)unit test reports, and should make it so you don't have to look through a long list of logs to see which tests failed for which job

Render should look somewhat similar to
![image](https://user-images.githubusercontent.com/1700823/116908744-1bb4b980-abf8-11eb-904c-e93ea4d2f805.png)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28154513

Pulled By: seemethere

fbshipit-source-id: 02d918b5c4cb6e236b806db48c3debe44de69660
2021-05-04 15:59:31 -07:00
00d6472b4d tools: Add render_junit script (#57327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57327

Renders junit results to the console
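
Conceptually, the script does something like the following standard-library-only sketch; the real script's argument handling and output format may differ:

```python
import sys
import xml.etree.ElementTree as ET

def render_junit(path: str) -> None:
    # Print every failed or errored <testcase> from a junit XML report.
    root = ET.parse(path).getroot()
    for case in root.iter("testcase"):
        for tag in ("failure", "error"):
            for problem in case.iter(tag):
                name = f"{case.get('classname')}.{case.get('name')}"
                print(f"{tag.upper()}: {name}: {problem.get('message')}")

if __name__ == "__main__":
    render_junit(sys.argv[1])
```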

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28154514

Pulled By: seemethere

fbshipit-source-id: 02e34930b4f0bd257b4e359623b06a4b8f8e996d
2021-05-04 15:58:23 -07:00
9c5478588e [iOS GPU] [easy] Rename APIs in MPSImageWrapper
Summary:
1. Clean up unused APIs on MPSImageWrapper.
2. Rename textures to images to avoid confusion.

Test Plan: CI

Reviewed By: husthyc

Differential Revision: D28176917

fbshipit-source-id: 3afb261d9e5a9a6145ca3067cf0d245f1bf04683
2021-05-04 15:37:51 -07:00
76d9070d10 Replace windows CUDA 11.2 CI with 11.3 (#57223)
Summary:
Testing 11.3 with current CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57223

Test Plan:
Relevant CI (11.3) pass!

Disclaimer: Skipped test_inverse_errors_large for CUDA 11.3 as it failed. Issue documented at https://github.com/pytorch/pytorch/issues/57482.

Reviewed By: malfet

Differential Revision: D28169393

Pulled By: janeyx99

fbshipit-source-id: 9f5cf7b6737ee6196de92bd80918a5bfbe5510ea
2021-05-04 14:23:23 -07:00
1fc89d9ffc Use proper Google Analytics id (#56578)
Summary:
This PR fixes the GA id and relies on `pytorch-sphinx-theme`  to set the GA script instead of hard-coding it (this is supported since https://github.com/pytorch/pytorch_sphinx_theme/pull/110 was merged).

Similar PRs were opened and merged in torchchvision/audio/text, e.g.: https://github.com/pytorch/vision/pull/3700

CC brianjo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56578

Reviewed By: mrshenli

Differential Revision: D28199244

Pulled By: ranman

fbshipit-source-id: a20b7fd1b1da3ebff491286c3eeb1410f3c80670
2021-05-04 13:23:16 -07:00
383e451036 Implement torch.sort with cub::DeviceSegmentedRadixSort (#56821)
Summary:
Benchmark:
```python
import torch
import itertools

for i in range(1000):
    torch.arange(100000, device='cuda')

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

for i, j in itertools.product([512, 4096, 8192], repeat=2):
    print(i,j)
    t = torch.randn(i, j, device='cuda')
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t))
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t, dim=0))
    print()
```

Before
```
512 512
4.02 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
40.7 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
33.9 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
71.7 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.4 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
27.6 ms ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.6 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
262 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
321 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
520 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
661 ms ± 853 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
54.6 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
83.2 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
521 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
645 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
1.04 s ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 541 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After
```
512 512
4.65 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.75 ms ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
30.3 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.4 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
59.7 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
77 ms ± 601 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
32.2 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
204 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
272 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
422 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
562 ms ± 4.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
63.1 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
72.7 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
401 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
573 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
831 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.2 s ± 9.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56821

Reviewed By: mrshenli

Differential Revision: D28172609

Pulled By: ngimel

fbshipit-source-id: 87314a6985a84d326304ff5220df5661ef00d710
2021-05-04 13:16:52 -07:00
bca1949dc9 [typing] suppress errors in fbcode/caffe2 - batch 2
Test Plan: Sandcastle

Differential Revision: D28191118

fbshipit-source-id: 59421c7346903597308b0fdf8a0984f56664fb4f
2021-05-04 12:44:27 -07:00
28c24ec3e8 [numpy] polygamma: int -> float promotion (#57462)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
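
A quick illustration of the promotion this enables, mirroring the NumPy behavior discussed in the referenced issue:

```python
import torch

x = torch.tensor([1, 2, 3])    # integer input
y = torch.polygamma(1, x)      # trigamma; result is promoted to float
print(y.dtype)                 # torch.float32 (the default dtype)
```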

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57462

Reviewed By: mrshenli

Differential Revision: D28187104

Pulled By: ezyang

fbshipit-source-id: 4072589ad1cb9766e7721d006d43701820922d56
2021-05-04 12:22:57 -07:00
1461859fde Revert D28048289: [TensorExpr] Add methods for inspecting generated code in TensorExprKernel.
Test Plan: revert-hammer

Differential Revision:
D28048289 (6b2cb939c5)

Original commit changeset: 3867e862a0ec

fbshipit-source-id: bdd45dcc4b229673efeb06da411bbf0c58d44026
2021-05-04 11:29:14 -07:00
b3c0ef4a40 Revert back to old assert behavior in as_view (#57499)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57499

Reviewed By: gmagogsfm

Differential Revision: D28162814

Pulled By: albanD

fbshipit-source-id: e3a970107ab59bb15794f0f82ee12c771caa93d5
2021-05-04 11:16:11 -07:00
42d073a7e9 Look for unqualified ignore in .pyi, not just .py (#57468)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57468
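
In other words, the lint now also rejects unqualified suppressions in stub files; the distinction it enforces looks like this (`some_untyped_call` is a placeholder):

```python
# Rejected by the lint: a blanket ignore that hides every mypy error.
x = some_untyped_call()  # type: ignore

# Accepted: a qualified ignore naming the specific error code.
y = some_untyped_call()  # type: ignore[no-untyped-call]
```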

Test Plan:
On the commit that expanded the lints but didn't remove the `# type: ignore` comment, the quick-checks job failed:

- https://github.com/pytorch/pytorch/runs/2493713340

In contrast, on the tip of this PR, both the quick-checks job and the mypy job succeed:

- https://github.com/pytorch/pytorch/runs/2493744907
- https://github.com/pytorch/pytorch/runs/2493746144

Reviewed By: driazati

Differential Revision: D28153020

Pulled By: samestep

fbshipit-source-id: 5e21bde38ab741e87b3e5f3d45e7e50456fd7ec9
2021-05-04 09:40:40 -07:00
34d853a524 [fx2trt] example for lowering model to trt with FX based tooling (#57298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57298

Some of the code is borrowed from NVIDIA-AI-IOT/torch2trt https://github.com/NVIDIA-AI-IOT/torch2trt/tree/master/torch2trt.

Move fx2trt stuff to fx/experimental/fx2trt.

Add an example in fx/experimental/fx2trt/example/fx2trt_example.py that shows how we lower resnet18 to TensorRT using FX.

TODO: Include license from NVIDIA-AI-IOT/torch2trt

Test Plan: CI

Reviewed By: jackm321

Differential Revision: D28102144

fbshipit-source-id: 1a7b03e45b8ab3fcc355d097d73afeec2efc3328
2021-05-04 09:24:43 -07:00
5326ec60e6 [Inlined Callstack Fix] Fix inlined callstack for blocks of the node. (#56562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56562

Earlier, inlined callstacks were annotated only on nodes. This left out nodes
such as If, which have blocks of nodes. These nodes should also be updated
similarly.

Test Plan:
Added test in test_misc

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27902516

fbshipit-source-id: 4e65c686fa6b4977e8719db45f71f7d2599d4d8e
2021-05-04 09:21:15 -07:00
bb3c6699a5 [Pytorch Mobile DebugInfo Serialization] Save debug handles for all instructions. (#55252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55252

Earlier, for bytecode serialization we were saving debug handles only for OPs and not for all
instructions. This PR makes changes to add them for all instructions.

Test Plan:
python test/mobile/test_lite_script_module.py TestLiteScriptModule

Imported from OSS

Reviewed By: dreiss

Differential Revision: D27542502

fbshipit-source-id: cff75118c721ce9f0c2f60d2c9471481f05264ca
2021-05-04 09:21:13 -07:00
e0fc473e47 [Pytorch, Mobile] Serialize inlined callstack pointer with debug handle. (#55062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55062

This diff introduces the following changes:
1. InlinedCallStack pickler/serializer is introduced. It is serialized
as a tuple of {module_instance_info, source range tag, callee:InlinedCallStack}
Module instance info is serialized as tuple of {class_type_name,
instance_name}.
Note that the callee of the serialized inlined callstack points to the tuple
of the already serialized callstack. This means the first callstack pointer to
be serialized will serialize the entire path of the tree, where some callee
nodes might be shared with callstack pointers that will be serialized
subsequently. The pickler supports memoization of pickled objects, where if
a tuple has already been serialized, its object id is emitted instead of
serializing the object again. Thus we still serialize the tree, and not every
path from the root separately. Furthermore, InlinedCallStackSerializer
also uses a cache to look up the pointer and return the serialized IValue.
Note also that we must serialize the source range of the
InlinedCallStack. To do this, the serializer requires a map from
source-range tags to source ranges. This was done in the previous
diff, where as part of source range serialization we also generate
unique tags. These are the tags that are serialized in the InlinedCallStack.
Thus, during deserialization we have to deserialize source ranges
before deserializing InlinedCallStacks.
2. Furthermore, each serialized InlinedCallStack is serialized with a
unique debug_handle and source range tag.
BackendDebugHandleManager manages generation of
unique debug handles and saves the map of
debug-handles-to-{source_range_tag, inlined-callstack-ptr}.
This map is then serialized as callstack_debug_map.pkl. Note that
inlined callstack is not sufficient to get all the source information
since it contains source information about the nodes which are inlined.
The top-of-the-stack (or bottom) node, which is the actual op node, is
not part of the inlined callstack pointer and thus the source range of
this node is serialized separately using source_range_tag. This is
similar to how JIT creates callstack in
torch/csrc/jit/runtime/interpreter.cpp

Unique debug handles facilitates exception throwing or profiling using
just the debug handle without any further qualifications, such as which
function or module the inlined-callstack belongs to.

Furthermore, this diff refactors the old mobile code for tracking
module hierarchy information per op. Mainly now bytecode serialization
will serialize debug handles corresponding to ops/nodes in graph and
have callstack_debug_map.pkl help generate:
1. Entire callstack and
2. Module hierarchy information.

Test Plan:
python test/mobile/test_lite_script_module.py TestLiteScriptModule
./build/bin/test_jit --gtest_filter=*ModuleInfo

Imported from OSS

Reviewed By: raziel

Differential Revision: D27468709

fbshipit-source-id: 53e2413e7703ead01c77718b7c333c7c6ff50a23
2021-05-04 09:21:12 -07:00
f4a921600a [PyTorch, Mobile] Serialization format change for source range (#54284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54284

In order to bring mobile deployment, via the lite interpreter, to feature
parity with JIT with respect to model-level debug information, we must make
that information available to the mobile runtime.
At the moment, model-level debug information is stored in SourceRange,
which associates nodes of the graph with where they come from in the
original Python source code.
This information is serialized as part of debug_pkl and deserialized
when JIT loads the model and reads the model code.
In the lite interpreter, we do not have access to all the functionality of
JIT, and hence we cannot load the model in the same way as JIT, by reading
code, constructing the module hierarchy, the graphs corresponding to module
methods, etc. Instead, in the lite interpreter only the bytecode corresponding
to the compiled graph, Code, is saved.
Thus in order to annotate OPs in the bytecode with equivalent
SourceRange information we do the following:
1. During model serialization, we create a unique tag for each source
range of the model.
2. Create a map of <SourceRange, tag>
3. During debug_pkl serialization we save tag along with SourceRange, on
top of byte offset.
4. During bytecode generation, the methods of the top module are
lowered. During this process methods are inlined. In the inlined graph,
when the node of a graph is lowered to bytecode, we query node's source
range and look it up against the map.
5. Resulting source range tag is serialized in module_debug_info.
6. During model deserialization, we read all the debug_pkl records in
the archive and create a map of <tag, SourceRange>
7. This map can be used to find source code information.

During mobile runtime:
1. We read all the debug_pkl records and create <tag=debug_handle,
SourceRange> map.
   1.1 This map, MobileDebugInfo, is a member of mobile Module.
2. Interpreter catches appropriate exceptions and sets the thread local
debug handle and rethrows the exception.
3. In Function's run method we catch exception and query current debug
handle where the exception happened.
4. Query MobileDebugInfo with debug handle to retrieve source range and
augment error with source range info.

This information is still incomplete as it does not contain entire
callstack.

In the following diffs we will serialize InlinedCallStack directly.

Note that compilation is gated by the SYMBOLICATE_MOBILE_DEBUG_HANDLE macro,
so that mobile builds can avoid building MobileDebugInfo, source ranges,
and the source range pickler/unpickler. Later we will add a path where, if
building without debug support, the stack trace will contain only debug
handles. They can be symbolicated later.

Test Plan:
Ported a bunch of source range tests from test_jit.py. Added one more test
in test_lite_interpreter.py

Imported from OSS

Reviewed By: raziel

Differential Revision: D27174722

fbshipit-source-id: a7b7c6088ce16dec37e823c7fefa4f0b61047e12
2021-05-04 09:19:27 -07:00
aa5ff7cc91 irange for Indexing.cu (#57479)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57479

Test Plan: Sandcastle

Reviewed By: walterddr, ngimel

Differential Revision: D28135714

fbshipit-source-id: 4fe4559b25165c59bd69180bfd439b74cedc0942
2021-05-04 08:52:32 -07:00
01e4444211 Tiny typo fix (#57113)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57113

Reviewed By: astaff

Differential Revision: D28122605

Pulled By: zou3519

fbshipit-source-id: dcf30ce38366d62befd784d7b3878c2ad1e3b86b
2021-05-04 08:42:20 -07:00
03b5d87980 fix(docs): torch.add and torch.mul (#54672)
Summary:
fixes https://github.com/pytorch/pytorch/issues/39425
https://11813267-65600975-gh.circle-artifacts.com/0/docs/generated/torch.add.html
https://11813267-65600975-gh.circle-artifacts.com/0/docs/generated/torch.mul.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54672

Reviewed By: ailzhang

Differential Revision: D27328523

Pulled By: zou3519

fbshipit-source-id: c804e3312b63ee209fef8bdfd8a92d46a345aa21
2021-05-04 08:38:06 -07:00
dc49299078 Allow passing cpu to CUDA RPC device maps (#57019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57019

Based on https://github.com/pytorch/pytorch/pull/56043

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28169796

Pulled By: beauby

fbshipit-source-id: 7fcf623de07c74c4f1ab415b7e20b518876a567a
2021-05-04 04:14:27 -07:00
5439977352 [Static Runtime] Revamp op schema check (#57521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521

When an op is added to Static Runtime, we manually check the schema (not with the JIT schema check, but with IValue.isTensor()/isInt() etc.) and make sure it's one we support. If the schema doesn't match, SR would throw an exception with TORCH_CHECK, which makes the entire graph invalid for SR.

This diff makes ops with unsupported schemas use the fallback path and go through the dispatcher instead:

```
  if (node->kind() != prim::ListConstruct &&
      node->kind() != prim::TupleConstruct &&
      node->kind() != prim::DictConstruct && node->kind() != prim::ListUnpack) {
    const Operator& op = node->getOperator();
    TORCH_CHECK(op.hasOperation());
    op_ = op.getOperation(node);
    VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
  }
```

The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg variants are supported), can now run in Static Runtime in fallback mode.

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D27531447

fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
2021-05-04 02:48:04 -07:00
a80b215a9a [1/n][torch/elastic] Move torchelastic docs *.rst (#148)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811

Moves the Sphinx docs `*.rst` files from the torchelastic repository to torch. Note: this only moves the rst files; the next step is to link them into the main pytorch `index.rst` and write a new `examples.rst`.

Reviewed By: H-Huang

Differential Revision: D27974751

fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
2021-05-04 00:57:56 -07:00
3db45bcb91 Compilation error fix for torch/csrc/distributed/rpc/init.cpp (#57500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57500

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28162887

Pulled By: agolynski

fbshipit-source-id: b6fafd64778fc09a5e832b0a557ae70f06951454
2021-05-03 23:15:02 -07:00
3cc733e451 fix for nvtxstring not printing name for aten kernels (#57407)
Summary:
aten kernels have a sequence number of -1

In order to ensure the names are properly printed in every case, we must change the >= 0 check to >= -1.

Example of bug:
![Capture](https://user-images.githubusercontent.com/20074092/116767312-45959280-a9e4-11eb-92a3-c2236a00d481.PNG)
Example of fix:
![image](https://user-images.githubusercontent.com/20074092/116919709-82d96a80-ac06-11eb-8b74-e34cf1214ea5.png)
Additionally, while fixing and investigating this issue another issue was detected and has now been filed:
https://github.com/pytorch/pytorch/issues/57476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57407

Reviewed By: anjali411

Differential Revision: D28165818

Pulled By: ngimel

fbshipit-source-id: dd3d245f1ea23c4b2edfcedbed3b47705ec1e966
2021-05-03 21:42:56 -07:00
67f874de8f [resubmit] Remove sync for randperm on small tensors. (#54113) (#57364)
Summary:
- [x] check MaskRCNN

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57364

Reviewed By: anjali411

Differential Revision: D28166385

Pulled By: ngimel

fbshipit-source-id: 42804b52cc837a95fc1d7ea49b430b55598be7bb
2021-05-03 20:48:04 -07:00
5c7e35c689 [RPC Framework] Clang-format remote_module.py and instantiator.py (#57414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57414

ghstack-source-id: 127927609

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D28138870

fbshipit-source-id: 04894abaf2e713dc559cd9795197f85539b25e17
2021-05-03 20:28:51 -07:00
6b2cb939c5 [TensorExpr] Add methods for inspecting generated code in TensorExprKernel. (#57074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57074

The new methods allow peeking into bufferArgs, which describe the parameters
that codegen expects. This description includes whether a given
parameter is a scalar var or a buffer, and in case it's a buffer, allows
getting the corresponding `Buf*` pointer from which we could get the
expected sizes.

Differential Revision: D28048289

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 3867e862a0ec3593906820826c2344bd8a8f5c0a
2021-05-03 20:02:28 -07:00
030692cf9e [TensorExpr] Remove dtype_ and add buf_ fields to CodeGen::BufferArg. (#57382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57382

`BufferArg` is used to describe parameters passed to the codegen: it
indicates whether the parameter is a var or a buf and holds a pointer to
the corresponding var/buf. Both var and buf contain dtype, and thus
duplicating it in BufferArg is unnecessary - we can always get it from
the var/buf. Hence we're removing dtype_ field from BufferArg in this
PR. We're also adding a `buf_` field here: this is done so that
BufferArg truly has all the info about the parameter.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28128329

Pulled By: ZolotukhinM

fbshipit-source-id: c03bff54bc6860f7ac6edfcb42ce6a82d8309589
2021-05-03 20:02:26 -07:00
839d549f8f [JIT] Add a pass for removing a first (self) argument from a graph if it is unused. (#57169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57169

The pass is planned to be used in the AOT pipeline, where we expect input
graphs to be functional. As such, these graphs should not use the 'self'
argument even if it is present, and thus it can be removed safely.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28128328

Pulled By: ZolotukhinM

fbshipit-source-id: a7dfbf7776682826100c8eb0fef982a2e81c2554
2021-05-03 20:02:25 -07:00
3ad3d8bd3f [JIT] Add a pass for annotating graph with input types derived from sample inputs. (#57076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57076

This pass is intended to be used in conjunction with shape propagation
pass: first we use sample inputs to specify shape info for graph inputs
and then we run shape-prop to infer shapes of intermediate values in the
graph.

Differential Revision: D28048290

Test Plan: Imported from OSS

Reviewed By: astaff

Pulled By: ZolotukhinM

fbshipit-source-id: 778d772e873d59d77af9f669f45dc44b9ee5e443
2021-05-03 20:01:15 -07:00
74a4868d9a Add docs for c10::InferenceMode. (#57480)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57480

Test Plan: Imported from OSS

Reviewed By: navahgar, pbelevich

Differential Revision: D28164070

Pulled By: ailzhang

fbshipit-source-id: a6b0658f3e65f76387095fbd8d66c762914d3bea
2021-05-03 19:36:10 -07:00
75f6dcf8b5 protect destructors of python bindings that can be kept alive by c++ objects (#57488)
Summary:
Such a deadlock was found for PyFunctionPreHook after adding https://github.com/pytorch/pytorch/pull/57057
This is fixing all occurrences in torch/csrc/autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57488

Reviewed By: malfet

Differential Revision: D28163321

Pulled By: albanD

fbshipit-source-id: 4daf1db69674e73967fc7c5ca2a240c61340e7ca
2021-05-03 19:32:37 -07:00
1d3a9bff3c Swap CUDA 10.1 and CPU CI for windows (#57493)
Summary:
This change temporarily disables CUDA testing on PRs, but keeps it on master.
This is likely to increase the number of reverts, but it is necessary as a stop-gap measure to cap CI cost growth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57493

Reviewed By: seemethere

Differential Revision: D28162697

Pulled By: janeyx99

fbshipit-source-id: 1bc529a405f7d63c07f4bd9f8ceca8da450743fc
2021-05-03 19:21:09 -07:00
4143483d95 [RPC Framework] Create a separate remote module template when moving CPU tensors to a cuda device is not enabled (#57413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57413

An internal test fails because somehow `Tuple[()]` is not considered compatible with `Tuple[Any]` in TorchScript, even if the code that involves this type of variable is not executed at all.
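
A distilled sketch of the kind of mismatch being described; the actual failure arises in generated RemoteModule template code, not in a hand-written function like this:

```python
from typing import Any, Tuple

import torch

@torch.jit.script
def takes_args(args: Tuple[Any]) -> int:
    return 0

# takes_args(())  # fails to type-check: Tuple[()] is not a Tuple[Any]
```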

Therefore, create separate templates for instantiation to avoid the type-check failure. This can address the FIXME left in https://github.com/pytorch/pytorch/pull/57288

Closes: https://github.com/pytorch/pytorch/issues/51670

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1

buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts

Reviewed By: wanchaol

Differential Revision: D28138864

fbshipit-source-id: 39e3e67b0c3979b607ff104d84b4fb1070ffefd6
2021-05-03 19:10:24 -07:00
15975cf6a6 To add priority of int/int? over int[] on signature matching and adding {h,v,d}split methods (#57346)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54555

It has been discussed in issue https://github.com/pytorch/pytorch/issues/54555 that the {h,v,d}split methods unexpectedly match a single int argument against the int[] signature when they are expected to match the int signature. The same unexpected behavior can happen in other functions/methods that accept both int[] and int? as single-argument signatures.

In this PR we solve this problem by giving higher priority to int/int? arguments over int[] while sorting signatures.

We also add the {h,v,d}split methods here, which helped us discover this unexpected behavior.
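
With the fixed overload priority, a single int and an int list now resolve to the intended signatures:

```python
import torch

t = torch.arange(16.0).reshape(4, 4)

# A lone int matches the int overload: split into 2 equal parts.
halves = t.hsplit(2)       # two tensors of shape (4, 2)

# A list matches the int[] overload: split before columns 1 and 3.
pieces = t.hsplit([1, 3])  # shapes (4, 1), (4, 2), (4, 1)
```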

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57346

Reviewed By: ezyang

Differential Revision: D28121234

Pulled By: iramazanli

fbshipit-source-id: 851cf40b370707be89298177b51ceb4527f4b2d6
2021-05-03 18:52:41 -07:00
c0309af1f3 Actually report mac stats (#57511)
Summary:
Give credentials to pytorch mac tests in CI so that test reports can be uploaded to S3.

Master runs have not been uploaded to S3 as the credentials were missing. https://app.circleci.com/pipelines/github/pytorch/pytorch/311990/workflows/2b2fbb72-b613-4986-8842-eccd93e7cdae/jobs/12945609/steps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57511

Reviewed By: samestep

Differential Revision: D28165041

Pulled By: janeyx99

fbshipit-source-id: a4a9c793029838bdab41af19dbce1c8c49f7122d
2021-05-03 18:35:30 -07:00
bf6e3425b0 [23/n] [torch/elastic] Introduce the implementation of DynamicRendezvousHandler (#57151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151

This PR introduces the implementation of `DynamicRendezvousHandler` that mostly facilitates the types introduced in previous PRs.
ghstack-source-id: 127685212

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28060531

fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
2021-05-03 18:32:43 -07:00
a357fc8a4b [22/n] [torch/elastic] Introduce a new from_backend static constructor for DynamicRendezvousHandler (#57150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150

This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28060336

fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
2021-05-03 18:32:42 -07:00
4a10bd3b58 [21/n] [torch/elastic] Introduce _RendezvousJoinOp (#57149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149

This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059785

fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
2021-05-03 18:32:40 -07:00
81ef683cb3 [20/n] [torch/elastic] Introduce _RendezvousExitOp (#57148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148

This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059764

fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
2021-05-03 18:32:38 -07:00
baf8f4c0a6 [19/n] [torch/elastic] Introduce _RendezvousKeepAliveOp (#57147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147

This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059733

fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
2021-05-03 18:32:37 -07:00
3e024fcfc9 [18/n] [torch/elastic] Introduce _RendezvousCloseOp (#57146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146

This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059693

fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
2021-05-03 18:32:35 -07:00
aa5d35e1d7 [17/n] [torch/elastic] Introduce _DistributedRendezvousOpExecutor (#57145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145

This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D28059417

fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
2021-05-03 18:31:23 -07:00
2a178d34cd [Redo] Add pybind interface to caffe2 quantization server (#57378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57378

Previous version got reverted due to some tests not running because I wasn't in the pytorch github org

Differential Revision: D28125562

fbshipit-source-id: 758c1c9a009e79febf6cbd062a47d2a3d94e3a78
2021-05-03 15:52:18 -07:00
6d3bb01b1a Sequence Blob NVM Reader to Selectively NVMify Ads Embeddings in A*
Summary:
This diff enables mapping a selected set of Ads embeddings to the T17 host on hierarchical memory (nvmify). To achieve that, the following is implemented:

- Allow for the OTHER net to be both onnxified and nvmified
  - For that, an allowlist placement policy is added to the nvmify stack
  - onnxifi_transform is lightly updated to accept a blocklist of operators based on name
  - The nvm transform is broken into two parts: op replacement and blob update.
  - A derived class `SeqBlobNVMReader` is defined, which adds the functionality to load blobs to the card or nvm.

Test Plan:
* Unit test
* Run the predictor replayer: selectively load the following ads embeddings to NVM as in `--caffe2_nvm_dram_placement_file=/home/hanli/nvm_allowlist`:
```
SPARSE_AD_ACCOUNT_ID
SPARSE_NEW_AD_ID_COARSE
SPARSE_NEW_AD_ID_REFINED
SPARSE_NEW_CAMPAIGN_ID
SPARSE_NEW_TARGET_ID
SPARSE_NEW_AD_CLUSTER_ID
SPARSE_NEW_PAGE_ID
SPARSE_NEW_STORY_ID
SPARSE_NEW_VIDEO_ID
SPARSE_ENTITY_EQUIVALENCE_KEY
SPARSE_ENTITY_EQUIVALENCE_KEY_NO_CREATIVE
```
major parameter change in sigrid_remote_predictor_glow_nnpi:
```
--caffe2_nets_to_nvmify=DISAGG_ACC_REMOTE_OTHER \
--caffe2_nvm_sls_ops=SparseLengthsSumFused8BitRowwise,SparseLengthsWeightedSumFused8BitRowwise,SparseLengthsSumFused4BitRowwise,SparseLengthsWeightedSumFused4BitRowwise,SparseLengthsSum4BitRowwiseSparse \
--caffe2_nvm_table_path=/home/hanli/tables/225412100_2870/ \
--caffe2_nvm_dram_placement_file=/home/hanli/nvm_allowlist \
--caffe2_nvm_dram_placement_policy=by_file_allowlist \
--caffe2_predictor_nets_to_load=DISAGG_ACC_REMOTE_OTHER
```
In the predictor log, observe that the blobs to be NVMified have their op types transformed, are skipped in the Onnxifi transform, and are deferred-loaded with the NVM net transform applied:
```
I0416 09:59:29.550690 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m
I0416 09:59:29.550701 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m
I0416 09:59:29.550705 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m
I0416 09:59:29.550712 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m
I0416 09:59:29.550715 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m
I0416 09:59:29.550721 662344 Nvmifier.cpp:142] ^[[92mReplacing SparseLengthsSumFused4BitRowwise with NVM variant.^[[0m

...
I0416 09:59:31.665369 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 770
I0416 09:59:31.667042 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 777
I0416 09:59:31.667294 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 779
I0416 09:59:31.668828 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 786
I0416 09:59:31.668843 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 787
I0416 09:59:31.669909 662344 onnxifi_transformer.cc:1097] Skipping blocklisted op SparseLengthsSumFused4BitRowwiseNVM at pos 792

...

I0416 10:01:09.087282 662344 Nvmifier.cpp:346]  found the name: table0
I0416 10:01:09.373975 662344 Nvmifier.cpp:374] ^[[96mSaved /home/hanli/tables/225412100_2870/table0^[[0m
I0416 10:01:09.376008 662344 Nvmifier.cpp:343]  filename: sparse_nn_sparse_arch_SPARSE_NEW_AD_ID_COARSE_dedicated_13_w_EmbeddingFusedUint4Quantization
..

I0416 10:11:05.310854 662344 Nvmifier.cpp:161] ^[[95mNVMifying the model.^[[0m
I0416 10:11:05.310887 662344 Nvmifier.cpp:185]  found the name: table0 for sparse_nn_sparse_arch_SPARSE_NEW_AD_ID_COARSE_dedicated_13_w_EmbeddingFusedUint4Quantization
I0416 10:11:07.580587 662344 Nvmifier.cpp:185]  found the name: table4 for sparse_nn_sparse_arch_SPARSE_AD_ACCOUNT_ID_dedicated_20_w_EmbeddingFusedUint4Quantization
I0416 10:11:07.580648 662344 Nvmifier.cpp:185]  found the name: table3 for sparse_nn_sparse_arch_SPARSE_ENTITY_EQUIVALENCE_KEY_dedicated_22_w_EmbeddingFusedUint4Quantization
I0416 10:11:07.580667 662344 Nvmifier.cpp:185]  found the name: table5 for sparse_nn_sparse_arch_SPARSE_NEW_TARGET_ID_dedicated_29_w_EmbeddingFusedUint4Quantization
I0416 10:11:07.580682 662344 Nvmifier.cpp:185]  found the name: table2 for sparse_nn_sparse_arch_SPARSE_NEW_AD_ID_REFINED_dedicated_30_w_EmbeddingFusedUint4Quantization
I0416 10:11:07.580695 662344 Nvmifier.cpp:185]  found the name: table1 for sparse_nn_sparse_arch_SPARSE_NEW_STORY_ID_dedicated_35_w_EmbeddingFusedUint4Quantization

```
Make sure model is properly loaded:
```
I0415 21:42:48.400249 873685 ModelManagerBase.cpp:806] Loaded 225412100_2870 in 730944 ms (63800 ms of IO)  memory used 8744167456 byte(s)
```
* Only load the user embeddings to NVM to make sure the baseline use case is not broken by this diff:
```
--caffe2_nets_to_nvmify=DISAGG_ACC_REMOTE_REQUEST_ONLY \
--caffe2_nvm_sls_ops=SparseLengthsSumFused8BitRowwise,SparseLengthsWeightedSumFused8BitRowwise,SparseLengthsSumFused4BitRowwise,SparseLengthsWeightedSumFused4BitRowwise,SparseLengthsSum4BitRowwiseSparse \
--caffe2_nvm_table_path=/home/hanli/tables/225412100_2870/
```
Make sure model is loaded:
```
Loaded 225412100_2870 in 381139 ms (56313 ms of IO)  memory used 7043933560 byte(s)
```
* Run feed replayer: `buck-out/gen/sigrid/feed/prediction_replayer/fully_remote_replayer_main --use_new_encoding_for_ads_services --use_new_encoding_from_model_id_to_shard_id --request_file_path /data/users/hanli/f266405843.requests --model_id=265540157_0 --replayer_thread_count=30 --sigrid_predictor_single_host=2401:db00:272c:602e:face:0:10:0 --sigrid_predictor_single_port=7444 --num_iterations=5 --qps=100 --client_name=predictor_v1` (load predictor as in P411172400)
Output:
```
I0428 21:20:25.106635 1396182 FullyRemoteReplayer.cpp:107] Loading requests from /data/users/hanli/f266405843.requests
I0428 21:20:25.547982 1396182 FullyRemoteReplayer.cpp:109] Requests size : 6699
I0428 21:20:25.548146 1396182 Client.cpp:274] V1 tier name:  V2 tier name: sigrid.predictor.fully_remote_test V2 fully remote tier name:
I0428 21:20:25.548153 1396182 Client.cpp:282] [MF] Migration Framework (traffic routing) enabled: false
I0428 21:20:25.548172 1396182 ModelRemoteStatus.cpp:206] Selection probabilities znode path: /configerator-gz/.prn
I0428 21:20:25.674162 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:20:25.674181 1396182 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:21:26.252820 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:21:26.252851 1396265 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:22:22.225976 1396182 PredictionReplayer.cpp:67] Previous request took too long, not reaching target QPS
I0428 21:22:26.252643 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:22:26.252678 1396265 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:23:26.252959 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:23:26.252987 1396265 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:24:26.253135 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:24:26.253166 1396265 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:25:27.252734 1396265 ModelRemoteStatus.cpp:612] Found 0 host, 0 shards in predictor tier
I0428 21:25:27.252763 1396265 ModelRemoteStatus.cpp:557] Refresh sigrid model succeeded: 1
I0428 21:26:03.172894 1396182 FullyRemoteReplayer.cpp:59] cpu time p25, p50, p75, p95, p99 9570 13011 16218 20788 24840
I0428 21:26:03.172927 1396182 FullyRemoteReplayer.cpp:61] wait time p25, p50, p75, p95, p99 11845 15958 19946 26579 31842
I0428 21:26:03.172940 1396182 FullyRemoteReplayer.cpp:63] wall time p25, p50, p75, p95, p99 16194 20888 25303 31692 37387
```

Reviewed By: ehsanardestani

Differential Revision: D27701121

fbshipit-source-id: e898abc6957c839e402a9763172cf85d9bb84cbd
2021-05-03 15:21:13 -07:00
589072afa1 Fix return type of getDeviceMap (#57487)
Summary:
https://github.com/pytorch/pytorch/pull/57294 changed the behaviour to
return `c10::Device` rather than `c10::DeviceIndex`, but missed the method binding:
1a6f827ae6/torch/csrc/distributed/rpc/init.cpp (L606-L611)
which cast the return type to a map of `c10::DeviceIndex` rather than
`c10::Device`.

Also, do not ignore cast errors when compiling this code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57487

Reviewed By: nikithamalgifb

Differential Revision: D28158750

Pulled By: malfet

fbshipit-source-id: d57d869cceca8b7ed06d4d638e2b911da8236ed4
2021-05-03 15:01:24 -07:00
d68ad3cb1e Add a shortcut to test all torchbench models. (#57311)
Summary:
This PR adds a shortcut of specifying all models in TorchBench CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57311

Test Plan:
CI

RUN_TORCHBENCH: ALL

Reviewed By: bitfort

Differential Revision: D28160198

Pulled By: xuzhao9

fbshipit-source-id: 67c292bc98868979d868d4cf1e599c38e0da94b5
2021-05-03 13:50:27 -07:00
33eea146ee torch.clamp with tensor min and max (#52695)
Summary:
Fixes gh-2793
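
With this change, min and max may themselves be tensors and are applied elementwise, broadcasting as usual:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0])
lo = torch.tensor([-1.0, 0.0, 0.0])
hi = torch.tensor([1.0, 1.0, 2.0])
print(torch.clamp(x, min=lo, max=hi))  # tensor([-1.0000, 0.5000, 2.0000])
```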

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52695

Reviewed By: mruberry

Differential Revision: D27395977

Pulled By: ezyang

fbshipit-source-id: f86aa240feb034d42e4c45447e72218f6a773c24
2021-05-03 12:56:16 -07:00
c328bb6d79 Port trunc to structured (#57350)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57350

Reviewed By: anjali411

Differential Revision: D28146473

Pulled By: ezyang

fbshipit-source-id: bf23fc0dda4afddd070ff1bb7aac2759be4002e6
2021-05-03 12:51:50 -07:00
1a6f827ae6 [16/n] [torch/elastic] Introduce _RendezvousOpExecutor (#57144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57144

This PR introduces the `_RendezvousOpExecutor` interface. Implementers of this interface are responsible for executing rendezvous operations in a state machine that outputs actions based on the current state of the rendezvous.
ghstack-source-id: 127684898

Test Plan: None beyond `flake8` and `mypy` as this is solely an interface definition.

Reviewed By: tierex

Differential Revision: D28059159

fbshipit-source-id: 8e7da33e02336206cddbe76d773681e98c28a98f
2021-05-03 12:18:27 -07:00
76bccfb2e0 [15/n] [torch/elastic] Introduce _RendezvousStateHolder (#56538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538

This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796

Test Plan: Run the existing and new unit tests.

Reviewed By: tierex

Differential Revision: D27892600

fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
2021-05-03 12:17:18 -07:00
160304a81d fix comments in ATenNVRTC.h (#57318)
Summary:
Adding a function in ATenNVRTC.h also requires changing LazyNVRTC.cpp, but this was missing in the comments.
Also fix a typo.

CC jjsjann123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57318

Reviewed By: anjali411

Differential Revision: D28146223

Pulled By: ezyang

fbshipit-source-id: be69241a4b41ac7361a8c9f978fa4c837f41fbd1
2021-05-03 11:59:10 -07:00
e841f335aa [RELAND] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#57373)
Summary:
https://github.com/pytorch/pytorch/pull/56433 was reverted because the test perceived internal dropout state creation as a memory leak. This PR resubmits with the leak check skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57373

Reviewed By: anjali411

Differential Revision: D28152186

Pulled By: ezyang

fbshipit-source-id: 9a593fcdbbabbb09dc4e4221191663e94b697503
2021-05-03 11:41:40 -07:00
1b745efbe8 [14/n] Introduce a name attribute to _PeriodicTimer (#57143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143

This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751

Test Plan: Run the new and updated unit tests.

Reviewed By: tierex

Differential Revision: D28059045

fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
2021-05-03 11:37:05 -07:00
233004b4c8 [13/n] Extend the return type of RendezvousBackend's set_state method (#57142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142

This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
ghstack-source-id: 127629538

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058980

fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
2021-05-03 11:37:03 -07:00
a6f60cf4f0 [12/n] Rename last_keep_alives to last_heartbeats in _RendezvousState (#57141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141

Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058948

fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
2021-05-03 11:37:01 -07:00
3209364724 [11/n] [torch/elastic] Add heartbeat timeout to RendezvousTimeout (#57140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140

This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815

Test Plan: Run the updated unit tests.

Reviewed By: tierex

Differential Revision: D28058908

fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
2021-05-03 11:37:00 -07:00
6876e15dbe [10/n] [torch/elastic] Add comparison operators to _NodeDesc (#57139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57139

This PR sets the `order` attribute of the `dataclass` annotation to `True` in order to introduce comparison operators for `_NodeDesc`.
ghstack-source-id: 127626783

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D28058851

fbshipit-source-id: 66313f84f507100e20acb687a3427b3dd51a6310
2021-05-03 11:36:58 -07:00
6bf8df6b3b [9/n] [torch/elastic] Introduce RendezvousSettings (#56537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537

This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738

Test Plan: Run the existing unit tests.

Reviewed By: tierex

Differential Revision: D27890155

fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
2021-05-03 11:36:04 -07:00
ac71432c54 [PyTorch][Edge] Add api to get bytecode version from runtime (#56948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56948

Add api to get runtime bytecode version

## Test
Both `caffe2/test/cpp/jit/test_lite_interpreter.cpp` and `caffe2/test/mobile/test_bytecode.py` pass
ghstack-source-id: 127939889

Test Plan: Both `caffe2/test/cpp/jit/test_lite_interpreter.cpp` and `caffe2/test/mobile/test_bytecode.py` pass

Reviewed By: raziel, iseeyuan

Differential Revision: D27987811

fbshipit-source-id: 35ed9bd626aecffc226f6dacfa046e6cdabfed51
2021-05-03 11:26:38 -07:00
945c93b8bd [quant][graphmode][fx] Skip observering boolean Tensors (#57375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57375

Skip observing the input for masked_fill. Currently we don't have a way to
query the type of a Proxy in GraphModule; once we have the functionality to annotate types,
we'll need to annotate a Proxy as a boolean Tensor to remove this hack.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_boolean_tensor

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28126003

fbshipit-source-id: 2989766370a607579b3ea07ca36cdc2ce35893cc
2021-05-03 11:20:33 -07:00
264d87985a Use ld.gold by default to link in CI (#57061)
Summary:
This adds an option to CMake to use `ld.gold` for linking rather than `ld` (which symlinks to `ld.bfd` on Ubuntu by default). This shouldn't change any functionality, only mildly improve link times during builds (shaving off about 1 minute) on CI.

Verify by searching for `ld.gold is available` in [the logs](https://circleci.com/api/v1.1/project/github/pytorch/pytorch/13046834/output/105/0?file=true&allocation-id=608c434338107e5b6cf938a1-0-build%2F7BDA2FF1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57061

Pulled By: driazati

Reviewed By: janeyx99

Differential Revision: D28123522

fbshipit-source-id: 5a60798ca4785427fd92bbf3b3aa5f63730e9b20
2021-05-03 10:05:36 -07:00
c0d39ba680 Replace 11.2 linux CI with 11.3 (#57222)
Summary:
Let's see how 11.3 holds up!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57222

Test Plan: CUDA 11.3 has passed build and test below.

Reviewed By: malfet

Differential Revision: D28152554

Pulled By: janeyx99

fbshipit-source-id: 84b687660b9a5b6337b65d6aaaaf003ea94b2864
2021-05-03 09:48:52 -07:00
375c8a81dc [DDP] Profile search_unused_parameters (#57376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57376

Having this in profiler/trace outputs will be useful when
investigating performance overhead of find_unused_parameters for certain
workloads, to determine whether it is a bottleneck or not.
ghstack-source-id: 127942159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28126233

fbshipit-source-id: 93082ae5b84e64351d59447a29f97eaf9b0bbd64
2021-05-03 09:41:18 -07:00
52b389259c Port max_pool2d_with_indices to structured kernel (#56459)
Summary:
Port max_pool2d_with_indices to structured kernel
https://github.com/pytorch/pytorch/issues/55070
also clean code for https://github.com/pytorch/pytorch/issues/56320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56459

Reviewed By: zou3519

Differential Revision: D27882473

Pulled By: ezyang

fbshipit-source-id: 9f502c3c89d57ee201db4a024465c4b79446c8c6
2021-05-03 09:36:09 -07:00
6bc3ad28a3 Revert D28143091: [pytorch][PR] Add cross OpInfo
Test Plan: revert-hammer

Differential Revision:
D28143091 (4a872f8539)

Original commit changeset: 0b98226a1811

fbshipit-source-id: eda38923f31ac5a79af5c78077ed0106d904f6da
2021-05-03 09:19:41 -07:00
c7d8d8f925 [BE] Improve has_bf16_support (#57408)
Summary:
- Use `functools.lru_cache` to avoid calling this function multiple times
- Check that we are running on a Linux platform before trying to open `/proc/cpuinfo`
- Do not spawn a new process; simply `open("/proc/cpuinfo").read()` and search the output for the keywords (sketched below)
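
A minimal sketch of the resulting pattern, with illustrative names (this is not the actual torch internals):

```python
import functools
import sys

@functools.lru_cache(maxsize=None)
def has_bf16_support() -> bool:
    # cached: the answer cannot change within a process
    if not sys.platform.startswith("linux"):
        return False  # /proc/cpuinfo only exists on Linux
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return False
    # search for the CPU flag directly instead of spawning a subprocess
    return "bf16" in cpuinfo
```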

Fixes https://github.com/pytorch/pytorch/issues/57360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57408

Reviewed By: driazati

Differential Revision: D28136769

Pulled By: malfet

fbshipit-source-id: ab476774c3be2913cb576d98d47a2f7ec03c19aa
2021-05-03 09:11:04 -07:00
f332a8bdff Implement result() function in MPI Work classes (#57168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57168

Implement result() for MPI which wasn't previously supported.

Some users rely on output args; however, in future use cases (e.g. the DDP comm hook) we need to return the result explicitly.
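
A hedged usage sketch (assumes an MPI-backed process group; `result()` mirrors the behavior of the other backends):

```python
import torch
import torch.distributed as dist

dist.init_process_group("mpi")
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
out = work.result()  # explicit result, previously unsupported for MPI
```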

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28129125

Pulled By: agolynski

fbshipit-source-id: d6abcd2114163471c045043534a0a3377f2579b4
2021-05-03 07:12:46 -07:00
0a0e024648 use importlib instead of imp as it supports python 3.5+ (#57160)
Summary:
Prevents an annoying deprecation warning when importing cpp_extensions.
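
A sketch of the replacement pattern; the module name and path are illustrative:

```python
import importlib.util

# old, deprecated since Python 3.4 (emits a DeprecationWarning):
#   import imp
#   module = imp.load_source("my_ext", "/path/to/my_ext.py")

# new, via importlib:
spec = importlib.util.spec_from_file_location("my_ext", "/path/to/my_ext.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```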

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57160

Reviewed By: astaff

Differential Revision: D28096751

Pulled By: albanD

fbshipit-source-id: f169ad4c4945b0fff54c0339052a29f95b9f1831
2021-05-03 05:56:25 -07:00
7e12c3e10a Automated submodule update: tensorpipe (#56916)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 12699ad388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56916

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27997920

Pulled By: beauby

fbshipit-source-id: 057dff1f28bf3a9d1d05522d3b60ee3530aecf22
2021-05-03 02:08:56 -07:00
87242d2393 Eliminate global usage of torch.set_default_dtype in test_autograd (#56446)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56446

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28000589

Pulled By: mruberry

fbshipit-source-id: c8fb2907d656138e72ecf8fb3e572591f8972900
2021-05-02 22:13:33 -07:00
154eca0309 OpInfo: ravel, view, view_as (#56910)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56910

Reviewed By: ngimel

Differential Revision: D28141867

Pulled By: mruberry

fbshipit-source-id: bff49d40d7e3bb36bc83d1405bd77f5529eeffe9
2021-05-02 22:10:36 -07:00
e845158b1a Assert that GIL is not held in blocking destructors (#57030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57030

PR #57029 is not perfect; there are still obscure situations in which
we might allocate a shared_ptr to an RpcAgent that doesn't have a
no-GIL constructor, so this PR adds the other half of the equation:
assert that we don't hold the GIL when running a blocking destructor.
This makes it possible to detect potential deadlocks even if the
code doesn't deadlock in practice (because you got lucky and none
of the threads you blocked on tried to also take out the GIL).

I considered whether or not to make this DEBUG_ONLY.  For now it's
not, so I can get better CI coverage, and because this test only
happens in destructors of objects that die rarely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28030582

Pulled By: ezyang

fbshipit-source-id: a7d7f6545223c4823c7f6036dfe29bd2edaf60a5
2021-05-02 22:06:02 -07:00
da51fd31a5 fx quant: remove find_quants from convert (#57402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57402

This is a cleanup; the value is not used by anything. It was
probably left behind after previous refactors.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28133622

fbshipit-source-id: 44a3f955d4af8d6dd15b4fb3038188568e4ee549
2021-05-02 20:13:13 -07:00
d6563bc153 fx quant: remove unnecessary quants arguments (#57399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57399

There were a couple of functions which took `quants` as arguments
without using them, probably left over from past refactors.
Cleaning this up to improve code readability.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28132413

fbshipit-source-id: 636b146c0b5ef0caea9c4b539e245de245d48c49
2021-05-02 20:13:12 -07:00
643f41be61 fx quant: remove FixedQParamsOpQuantizeHandler from quantize.py (#57393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57393

Moves the information on whether the output is quantized based on
the inputs to live on the qhandler object. This allows us to remove
FixedQParamsOpQuantizeHandler from quantize.py, further reducing
the coupling between handler objects and the quantization pass.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: astaff

Differential Revision: D28132414

fbshipit-source-id: 5c28524b47c00f618d3a38657376abae9e6ffe7c
2021-05-02 20:13:10 -07:00
2bd158386a fx quant: move input_output_observed to qhandler (#57388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57388

It's a bit confusing to have this be a decorator. It's simpler to
just expose it as a function on qhandler.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28129411

fbshipit-source-id: f7316f285e8546c67e8d8cf753462b2c2abb2636
2021-05-02 20:13:08 -07:00
1b20eeb138 fx quant: move output obs logic to QuantizeHandler (#57377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57377

Moves the logic which determines
1. whether a pattern instance's output should be observed
2. whether a pattern instance's output should be marked as observed based on its inputs
3. whether to override the activation specified in the qconfig

from `quantize.py` to `quantization_patterns.py`.  This makes
the code easier to read and reduces the coupling between `Quantizer`
and `QuantizeHandler` instances.

Note: there are some further cleanups which would be good after this one
- leaving those for future PRs.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28126896

fbshipit-source-id: 94c80a9c7307452783348d65b402acc84983e3f6
2021-05-02 20:13:07 -07:00
fe23881e76 fx quant: readability improvements on observer functions (#57368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57368

1. renames functions which only sometimes insert observers to start with `maybe_`,
to clarify the difference from functions which always insert observers
2. saves a level of indent in `maybe_insert_observer_for_output_of_the_node`

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28126897

fbshipit-source-id: 4cbc184dbf5e85954314cfbbcdd1551474175bf0
2021-05-02 20:13:05 -07:00
db6cd42434 fx quant: clean up nit in insert_observer (#57367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57367

This code is never hit (see insert_observer_for_output_of_the_node
which gates it out), so changing to an assert in order to
have `insert_observer` actually always insert an observer.
This helps code readability.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28126898

fbshipit-source-id: 411bc37769a6eacbebc463ed6c84cac85871bd5e
2021-05-02 20:12:10 -07:00
46a32e075c Improve BatchNorm1d training performance (CPU) (#57033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57033

CPU part of gh-38915

BatchNorm1d is implemented by looping over the channels, selecting one channel
at a time and performing cpu_serial_kernel loops per-channel. For (N, C)
contiguous layout this results in a sub-optimal strided memory access pattern, guaranteeing no elements will ever be in the same cache line.

I fix this by passing the entire input into one `TensorIterator` and letting
it decide which dimensions to iterate over and how to divide work among threads.

For statistic updates and the backward function, I use `at::mean` and `at::sum`
instead of the ad-hoc reductions there. Not only does this allow better memory
access patterns, it also enables vectorization and so performance improves for
BatchNorm2d as well. Unfortunately, `at::var` and `at::var_mean` don't perform
as well so I've left the other reductions as they were.

Overall, on my machine this takes the 1d example from 24 ms down to 4 ms and
the 2d example from 2.5 ms down to 2 ms.
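
A Python-level picture of the access-pattern change; illustrative only, since the actual change is in the C++ kernels:

```python
import torch

x = torch.randn(1024, 64)  # (N, C) contiguous input

# old: per-channel loop, each column is a strided slice of x
mean_old = torch.stack([x[:, c].mean() for c in range(x.size(1))])

# new: a single reduction over the batch dimension, contiguous and vectorized
mean_new = x.mean(dim=0)

assert torch.allclose(mean_old, mean_new)
```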

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28142333

Pulled By: ngimel

fbshipit-source-id: 066fe4f37f29b6458005e513e85faa398eeb9e2d
2021-05-02 17:47:55 -07:00
4a872f8539 Add cross OpInfo (#55483)
Summary:
One of the tasks in https://github.com/pytorch/pytorch/issues/54261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55483

Reviewed By: ngimel

Differential Revision: D28143091

Pulled By: mruberry

fbshipit-source-id: 0b98226a1811f61cb90d2248dd4425135a096551
2021-05-02 16:23:02 -07:00
5c68072ee8 add support for complex input to torch.testing.assert_(equal|close) (#57162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57162

Reviewed By: ngimel

Differential Revision: D28141902

Pulled By: mruberry

fbshipit-source-id: fd35e73e10167e3e44da4daf6582183bc4a0de7f
2021-05-02 16:13:12 -07:00
eaf00bf7d4 Skip linalg.qr saved mode check if compiled without LAPACK (#56284)
Summary:
This PR also removes qr and eig tests from test/test_torch.py. They were not skipped when PyTorch was compiled without LAPACK, and they are now replaced with OpInfos.

Fixes https://github.com/pytorch/pytorch/issues/55929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56284

Reviewed By: ejguan

Differential Revision: D27827077

Pulled By: mruberry

fbshipit-source-id: 1dceb955810a9fa34bb6baaccbaf0c8229444d3a
2021-05-02 16:07:07 -07:00
ce4449918a Port reverse binary ops to OpInfo (#56471)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54296
Tracking Issue https://github.com/pytorch/pytorch/issues/54261

**Summary:**
- `rsub` (aten function) was already ported
- Ported tests for its dunder version: `__rsub__`
- Ported tests for the other dunder functions: `__radd__`, `__rmul__`, `__rdiv__`, `__rpow__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56471

Reviewed By: ngimel

Differential Revision: D28142843

Pulled By: mruberry

fbshipit-source-id: 3d1bd88a4f124774f48d33a7ca7bfc7f796360df
2021-05-02 16:01:12 -07:00
57f72b8433 [DDP] Uneven inputs: option to throw early (#56755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755

Rehash of https://github.com/pytorch/pytorch/pull/47488

Adds a flag to the DDP join() context manager that enables throwing a
StopIteration across all ranks.

To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in the case that a joined rank needs to throw a StopIteration. In the non-joined ranks' forward pass, we match this allreduce, and if at least one rank tells us to throw, we raise a StopIteration.

Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.

Currently running perf benchmarks, will post when those are available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm.
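
A hedged usage sketch; this assumes an initialized process group, and the keyword name follows this PR's description:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes dist.init_process_group(...) already ran and each rank has a GPU
model = DDP(torch.nn.Linear(8, 8).cuda())
batches = [torch.randn(4, 8).cuda() for _ in range(dist.get_rank() + 1)]
try:
    with model.join(throw_on_early_termination=True):
        for inp in batches:  # deliberately uneven across ranks
            model(inp).sum().backward()
except StopIteration:
    pass  # raised on all ranks once any rank exhausts its inputs
```
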
ghstack-source-id: 127883115

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27958369

fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
2021-05-02 15:41:50 -07:00
7fe4c1d0e7 Torchelastic: add multiprocessing tests to ci/cd (#56842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56842

Add elastic multiprocessing test to ci/cd

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/elastic/multiprocessing/... -- --run-disabled

Reviewed By: wilson100hong

Differential Revision: D27982226

fbshipit-source-id: 1b4e6f1a20867a6aa7ca409e280fdb04e8db198b
2021-05-02 14:03:47 -07:00
bb640efa40 ns for fx: add missing add_relu and mul_relu patterns (#56927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56927

Adds the connection of `torch.add` to `toq.add_relu` and of `torch.mul`
to `toq.mul_relu`.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D28003475

fbshipit-source-id: a12871feacf84c5afb0e1cc47e708e285695ffeb
2021-05-02 08:34:49 -07:00
0ecdbfebff s/InplaceOrView/ADInplaceOrView/g (#57372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57324

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28121821

Pulled By: ailzhang

fbshipit-source-id: f568dd2505f6279da9ffb93ce1d22e0f98c606bb
2021-05-01 22:56:18 -07:00
41099ef71c OpInfo: mvlgamma (#56907)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56907

Reviewed By: astaff

Differential Revision: D28118669

Pulled By: mruberry

fbshipit-source-id: f54ad6dc64ddb6bcfca5c5c7fd8f395cd9761128
2021-05-01 20:51:01 -07:00
05b255c543 Revert D27487549: [TensorExpr] Add CodeGen::call_raw method.
Test Plan: revert-hammer

Differential Revision:
D27487549 (c9ab384af7)

Original commit changeset: d8f3d92262cd

fbshipit-source-id: ea8e71dbe2d632bc0fb557362c8bd899eb6aa83a
2021-05-01 19:48:07 -07:00
75a2a92b02 Add torch.linalg.cholesky_ex without checking for errors by default (#56724)
Summary:
The new function has the following signature: `cholesky_ex(Tensor input, *, bool check_errors=False) -> (Tensor L, Tensor infos)`. When `check_errors=True`, an error is thrown if the decomposition fails; when `check_errors=False`, responsibility for checking the decomposition is on the user.

When `check_errors=False`, we don't have host-device memory transfers for checking the values of the `info` tensor.
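
A usage sketch of the new function per the signature above:

```python
import torch

a = torch.randn(3, 3, dtype=torch.float64)
a = a @ a.t() + 3 * torch.eye(3, dtype=torch.float64)  # positive definite
L, info = torch.linalg.cholesky_ex(a)  # no error check, no device sync
assert info.item() == 0                # info == 0 signals success
assert torch.allclose(L @ L.t(), a)
```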

Rewrote the internal code for `torch.linalg.cholesky`. Added `cholesky_stub` dispatch. `linalg_cholesky` is implemented using calls to `linalg_cholesky_ex` now.

Resolves https://github.com/pytorch/pytorch/issues/57032.

Ref. https://github.com/pytorch/pytorch/issues/34272, https://github.com/pytorch/pytorch/issues/47608, https://github.com/pytorch/pytorch/issues/47953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56724

Reviewed By: ngimel

Differential Revision: D27960176

Pulled By: mruberry

fbshipit-source-id: f05f3d5d9b4aa444e41c4eec48ad9a9b6fd5dfa5
2021-05-01 18:48:27 -07:00
afe6b4c8ee [NNC] Add logical Operators '&&' and '||' (#56947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56947

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28007342

Pulled By: huiguoo

fbshipit-source-id: a2ad8d2e99d7c8d8c8bdcd8f65fa3f340bdd2bbc
2021-05-01 18:44:27 -07:00
2be115336b Fix torch.ormqr for non Fortran-contiguous inputs (#57314)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57314

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28118029

Pulled By: mruberry

fbshipit-source-id: e2ef65093cc5f77769adc7066c76f0607b5559a9
2021-05-01 17:50:06 -07:00
7c8d0069c4 grad_fn getter for optional strings (#55225)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55225

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28118113

Pulled By: mruberry

fbshipit-source-id: 711723922cff3afa220e03d926cee5884e167706
2021-05-01 17:39:17 -07:00
a5288a0244 Sparse support for division rounding_mode argument (#51989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51989

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28118114

Pulled By: mruberry

fbshipit-source-id: 2a76ee55c3845552e57e93d54628ce3c2fab3399
2021-05-01 17:37:25 -07:00
6d681d064f ROCM: Re-enable test_norm_fro_2_equivalence_old (#57170)
Summary:
This test was disabled for ROCm 3.9. With the latest updates, the test passes on ROCm 4.1, hence it is re-enabled in test/test_linalg.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57170

Reviewed By: astaff

Differential Revision: D28118217

Pulled By: mruberry

fbshipit-source-id: 1b830eed944a664c3b1b3e936b87096fef0c0ca2
2021-05-01 16:41:41 -07:00
4350d4af77 Immediately mark DLPack capsule as used after stealing the ownership (#56789)
Summary:
After stealing the ownership of the tensor passed via a DLPack capsule, PyTorch should immediately mark it as used (by changing its name to `used_dltensor`). This fix is needed because the following line may raise an exception:

```cpp
py::module::import("torch.cuda").attr("init")();
```

When an exception is raised, the Tensor created by `at::fromDLPack` calls the `deleter`. However, as the capsule is not consumed, the producer (the library that created the capsule) also calls the `deleter`, causing a double free.

Reproducer (I'm running this code on an A100 GPU with a PyTorch wheel that does not include `sm_80` support; in this configuration `torch.cuda.init` will raise a warning):
```py
$ python -Werror
>>> import torch.utils.dlpack
>>> import cupy
>>> tensor = torch.utils.dlpack.from_dlpack(cupy.arange(10).toDlpack())
free(): double free detected in tcache 2
zsh: abort (core dumped)  python -Werror
```

Once this fix is merged users can now see the exception correctly:

```
A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56789

Reviewed By: astaff

Differential Revision: D28118512

Pulled By: mruberry

fbshipit-source-id: 56992f7a3fc78d94c69513e864a473ae9587a9c8
2021-05-01 16:20:54 -07:00
3018093066 Revert D28110359: [TensorExpr] Add TensorExprKernel::runFast method.
Test Plan: revert-hammer

Differential Revision:
D28110359 (f219ed6627)

Original commit changeset: 4fdffc8196d2

fbshipit-source-id: 3c93a058b5dd7a3b71e399341a408ec74949ef56
2021-05-01 16:16:37 -07:00
82d245faef Inline hooks in ivalue::Future (#57354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57354

The ivalue::Future class used to have some hooks, defined as separate protected virtual methods, so that they could be overridden by the CUDAFuture subclass. Now that CUDAFuture has been merged into ivalue::Future those hooks can be "inlined" to where they're used, hopefully making the code more readable as it puts related things closer together.
ghstack-source-id: 127920096

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28117199

fbshipit-source-id: f749cd842c3bdc44a08f0a33bef972dfbf08afdd
2021-05-01 16:12:58 -07:00
fb7469fb7f Use Devices instead of DeviceIndexes in Future (#57353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57353

Even though we merged CUDAFuture into ivalue::Future, the resulting methods still had basically two distinct codepaths (i.e., an "early exit" if `impl_ == nullptr` for CPU, and then some code for CUDA). This works but it risks creating divergence and inconsistencies when the same class is used in those two modes. Ideally we should have the same codepath, and have the stream operations be no-ops for CPU. Luckily, this is exactly what happens when using a CPU DeviceGuardImplInterface!

Hence here I do that, and for convenience I also use c10::Devices instead of c10::DeviceIndexes (like we did in https://github.com/pytorch/pytorch/pull/57294 for RPC).
ghstack-source-id: 127920097

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28100525

fbshipit-source-id: cfac73894220ef5fa8a0389b5533c5d69ba1cf04
2021-05-01 16:12:56 -07:00
0422e67336 Use Devices instead of DeviceIndexes in TensorPipe agent (#57294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294

With the advent of CPUs in the device maps, and to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future and RRef and such, it's easier to use Devices instead of DeviceIndices. This started by just migrating the TensorPipe agent but the RPC layer is quite intertwined so I had to migrate a lot of stuff.
ghstack-source-id: 127916562

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092733

fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
2021-05-01 16:12:55 -07:00
0c3e79b5b9 Rename DeviceGuardImplInteface's getStreamFromPool method (#57345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57345

Already back in https://github.com/pytorch/pytorch/pull/57046 we realized that calling this method `getStreamFromPool` could cause issues because that name gets HIPified and thus in some callsites we'd end up calling a method that doesn't exist. In the end we got away with it because the places where we were calling that method weren't HIPified. However in the next PR we'll use this method inside RPC, and that will start causing problems, hence here I rename it to something that should not cause conflicts. This is a private API (since it's inside `impl`) thus there's no backwards compatibility concerns.
ghstack-source-id: 127916484

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28114923

fbshipit-source-id: e027ad08a8e02090c08c6407c2db5a7fde104812
2021-05-01 16:12:53 -07:00
6697ef51b2 Add device() method to c10::Event (#57293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57293

It's just a nice convenience.
ghstack-source-id: 127916265

Test Plan: It builds

Reviewed By: mrshenli

Differential Revision: D28092731

fbshipit-source-id: 99c8c33fd6e245915f2ed0c0482de132d7c75bf5
2021-05-01 16:12:51 -07:00
58bc003487 Add pybind type caster for c10::Device (#57292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57292

In Future (and soon in other places too) we need to receive a list of devices from Python-land. We don't want to just take their indices because we need full devices in order to infer the type from them. torch.device is not defined through pybind; it's defined through a plain `PyModule_AddObject` call with CPython, so pybind isn't naturally able to understand and convert it. However, we can provide a custom type caster which fixes that. We have this already for at::Tensor, at::Generator, ...
ghstack-source-id: 127916268

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092732

fbshipit-source-id: 1c31d0b85a4d5c9e7bde8161efbb7574d505157c
2021-05-01 16:11:10 -07:00
2dffa8cdf8 Fix CUDA Stream synchronization when arguments contains RRefs (#57394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57394

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28131325

Pulled By: mrshenli

fbshipit-source-id: 7174942d4c8dabe13f8eb1ba7fea599922a022c0
2021-05-01 16:04:11 -07:00
d536e6c684 Fix variable names in torch.fft examples (#57290)
Summary:
Apparently normal reST doctests aren't run in CI, because of this line in the `conf.py`:
ac86e0a0e5/docs/source/conf.py (L366)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57290

Reviewed By: astaff

Differential Revision: D28118198

Pulled By: mruberry

fbshipit-source-id: 7af621c4fef4e5d37e0fc62b9fd4382cc1698d89
2021-05-01 15:56:19 -07:00
3315f14280 Revert D28110358: [StaticRuntime] Use NNC's call_raw API to reduce call overheads.
Test Plan: revert-hammer

Differential Revision:
D28110358 (400ca7677c)

Original commit changeset: 94b87130a1ff

fbshipit-source-id: 246c0e54b02443c039105f48c4c419fe281150fc
2021-05-01 15:35:34 -07:00
20085f6d23 Support auto generation of device check (#56872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872

ghstack-source-id: 127914018

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986429

fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f
2021-05-01 12:02:09 -07:00
22ecb8885f Disable device check for foreach kernels (#56871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56871

foreach kernels fall back to the slow path when tensors are on different devices

Generated by codemod:
```
fastmod '(- func: _foreach.*)' '${1}
  device_check: NoCheck   # foreach kernels fall back to slow path when tensor are on different devices'   aten/src/ATen/native/native_functions.yaml
```
ghstack-source-id: 127914017

Test Plan: autotest

Reviewed By: ezyang

Differential Revision: D27986560

fbshipit-source-id: b0cd963cdba04b4e1589bbf369eb26b48d523968
2021-05-01 12:02:07 -07:00
183320df96 Add device_check place holder for functions (#56870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56870

Automatic generation of device check code will be supported in
the following PRs.

Changes are generated via:

1. Codemod
```
fastmod '  device_guard: False' '  device_check: NoCheck
  device_guard: False' aten/src/ATen/native/native_functions.yaml
```

2. Python script: https://gist.github.com/wenleix/be20c34bbbfcee0b289cdea2cf15b16c
ghstack-source-id: 127914016

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986427

fbshipit-source-id: 4e598a30306b80b5ade27af70d3e58770e401fc2
2021-05-01 12:02:05 -07:00
f7f8540794 Fix tensor device in test_kthvalue_overlap (#56869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56869

ghstack-source-id: 127914015

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986559

fbshipit-source-id: f4a638d737b06dd5f384b54e20490d76543d4e78
2021-05-01 12:01:09 -07:00
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
3c4d57c18b [pytorch][nnc] update external functions for mobile build (#56850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56850

This is part of the changes to enable NNC AOT compilation for mobile.
The generated kernels need to call these external functions, so the declarations are changed to use C linkage when building the mobile runtime.

Added nnc_aten_addmm external function.

ghstack-source-id: 127877411

Test Plan:
- build & CI;
- tested mobile build with stacked PRs;

Reviewed By: ZolotukhinM

Differential Revision: D27897154

fbshipit-source-id: 61d5499d7781a83bd2657859659fd1b5043d6b04
2021-04-30 19:07:19 -07:00
b11a24209f [PyTorch] Take advantage of string literals in TORCH_WARN (#54032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54032

Add a `const char*` override to c10::Warning::warn so that we can avoid wrapping plain C string literals in std::string.
ghstack-source-id: 125544720

Test Plan: Buildsizebot some iOS apps?

Reviewed By: ezyang

Differential Revision: D27061983

fbshipit-source-id: dc11150c911a4317a8edac75e50c5ba43511ff24
2021-04-30 19:02:42 -07:00
13dbb77b7a [RPC Framework] Enable RemoteModule to directly send GPU tensors over the wire on TensorPipe RPC backend if a device map is provided (#57288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288

If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can directly send GPU tensors over the wire.
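
A hedged sketch of providing a device map on the TensorPipe backend (assumes a two-worker RPC setup):

```python
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions()
opts.set_device_map("worker1", {0: 1})  # local cuda:0 -> remote cuda:1
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)
# a RemoteModule created under this setup can send CUDA tensors directly
```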

Also add pybind of `_get_device_map`.

The changes in unit test setup is separated out as a follow-up PR, as currently it breaks some tests in `distributed/rpc/test_faulty_agent.py`.

Still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1

CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts

Reviewed By: wanchaol

Differential Revision: D28021672

fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
2021-04-30 18:04:45 -07:00
20eac093a7 [torch][segment_reduce] Add support for initial value (#56923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56923

Next Steps in order:
- Add backward support for CUDA
- Add support for more aggregation types
- Benchmarking (for cuda mainly)/more testing/documentation
- Support for multi dimension

Test Plan: Updated unit test to include 0 length segment as well.

Reviewed By: ngimel

Differential Revision: D27992228

fbshipit-source-id: 28851811f8a784a63162721c511d69e617a93727
2021-04-30 18:01:31 -07:00
bd347012ec Added sm_75 support for CI Xenial CUDA 11.1 cuDNN 8 builds (#57320)
Summary:
This PR adds `sm_75` CUDA architecture support for the PR CI build Xenial CUDA 11.1 cuDNN 8, with build name:`pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build`, so that generated artifacts from these builds can be installed and run on machines with CUDA capability sm_75.

In PR https://github.com/pytorch/pytorch/issues/57207, the Xenial CUDA 10.2 cuDNN 7 build `pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build` was taken off the list of builds done for PRs to `master`. PR https://github.com/pytorch/pytorch/issues/56619 has added `sm_75` support for this build. This PR removes this support for the Xenial CUDA 10.2 cuDNN7 builds, and adds it for the current PR CI build Xenial CUDA 11.1 cuDNN 8 `pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57320

Reviewed By: astaff

Differential Revision: D28125542

Pulled By: malfet

fbshipit-source-id: f220b8f3279054c98cab9eef1e0d7e37161a946f
2021-04-30 17:51:42 -07:00
2b54cec7e8 Clean up naming and comments (#56964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56964

This PR does many things but does not update any logic:
 - Prefixes all function names that are not `gradcheck`, `gradgradcheck`, `get_numerical_jacobian`, and `get_analytical_jacobian` with underscore to indicate that they aren't part of the public API (https://github.com/pytorch/pytorch/issues/55714).
 - Improve naming to avoid referencing Jacobian rows or Jacobian cols when we really mean vjp and jvp as suggested by zou3519
 - Try to reduce comment line length so they are more consistent and easier to read
 - Other misc improvements to documentation

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28096571

Pulled By: soulitzer

fbshipit-source-id: d372b5f8ee080669e525a987402ded72810baa0c
2021-04-30 17:40:14 -07:00
bbdadab306 Refactor fast gradcheck (#55871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55871

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28096549

Pulled By: soulitzer

fbshipit-source-id: ee8b71fbd03ee581e71cdfcfd5e2258adefe15a6
2021-04-30 17:39:09 -07:00
47e9ec401a [nnc] ported some more ops + added vectors to argvalue (#56766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56766

Test Plan: Imported from OSS

Reviewed By: desertfire

Differential Revision: D28118331

Pulled By: Chillee

fbshipit-source-id: eb012943ad3b83e72a8cb17b594852164c3f0567
2021-04-30 17:34:49 -07:00
233f2cd29f Maintain submodule references during subgraph rewriting (#55463)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55463

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D27621650

Pulled By: ansley

fbshipit-source-id: e3558c64cdc2c1d846355fa58307a18c0714874b
2021-04-30 16:46:44 -07:00
3a5f85465b [pytorch] fewer cuda sync in unique by using cub instead of thrust (#57323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57323

Use the cub library instead of thrust to reduce the number of CUDA stream synchronizations.

Reviewed By: ngimel

Differential Revision: D28088029

fbshipit-source-id: b616294cd776aa5643c153e172258a0153a42b6a
2021-04-30 16:36:01 -07:00
208f81b787 [PyTorch] ifdef out ATen tests that fail with static dispatch (#57379)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57379

Reviewed By: cspanda

Differential Revision: D27576223

fbshipit-source-id: 6f77f1ac8b92f955d654231527eee2a8b7a1ff3d
2021-04-30 15:58:13 -07:00
293830bc19 Fix min() and max() for empty tensors (#52565)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52565

Reviewed By: anjali411

Differential Revision: D27999955

Pulled By: ezyang

fbshipit-source-id: 30e88cc8d84806198500e3001ecf58fa764536dd
2021-04-30 15:55:10 -07:00
c1a442248b [JIT] Enable conv-add-relu fusion as a part of frozen graph optimization (#56580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56580

Turn on conv-add-relu fusion as default for the frozen graph optimization.

Test Plan:
```
python test/test_jit.py -k test_freeze_conv_relu_fusion
```

Reviewed By: nikithamalgifb

Differential Revision: D27915515

Pulled By: desertfire

fbshipit-source-id: 9a68d2a6aba70e697258c02c4fd3f3fbfc9fb8f6
2021-04-30 15:29:38 -07:00
400ca7677c [StaticRuntime] Use NNC's call_raw API to reduce call overheads. (#57329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57329

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28110358

Pulled By: ZolotukhinM

fbshipit-source-id: 94b87130a1ffdb4acf171ddcea3895e8a75c34ac
2021-04-30 15:26:20 -07:00
f219ed6627 [TensorExpr] Add TensorExprKernel::runFast method. (#57328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57328

This method uses `CodeGen::call_raw` instead of `CodeGen::call`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28110359

Pulled By: ZolotukhinM

fbshipit-source-id: 4fdffc8196d24fc3300a9b4bc69f67562042a045
2021-04-30 15:26:18 -07:00
c9ab384af7 [TensorExpr] Add CodeGen::call_raw method. (#55113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55113

The new method allows to pass input and output arguments by `void*`
pointers instead of CallArgs. That helps to reduce the invocation
overhead. Currently this is only supported in LLVM codegen.

Differential Revision: D27487549

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: d8f3d92262cde1c155beefb629454370d9af2f89
2021-04-30 15:24:37 -07:00
4c3283da0d Fix binary_checkout to use master (#57389)
Summary:
These lines never should have been committed; remove them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57389

Pulled By: driazati

Reviewed By: seemethere, samestep

Differential Revision: D28129673

fbshipit-source-id: 2de4b4d94c569177fec0c9eac8b7e9a8e59b550b
2021-04-30 14:52:19 -07:00
5e422fa170 per_channel fake quant fp16 and fp64 support (#56894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56894

Used the dispatch type macro to add support for fp16 and fp64 tensors. Haven't tested on GPU yet; will do so once I can rebuild PyTorch with CUDA.

Test Plan:
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_half_precision_numerics
python test/test_quantization.py TestFakeQuantize
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28002955

fbshipit-source-id: c9cf17aa0f15f163bfcc8e5ef7b329ca754924fd
2021-04-30 13:52:45 -07:00
42b3fc29f4 Fix NVRTC versioning for CUDA 11.X (X>=3), CUDA 12 and later (#57204)
Summary:
NVRTC versioning has changed starting with CUDA 11.3, and will change again for CUDA 12.X. See the comment in the code for details. As a result, JIT on CUDA 11.3 is broken.

Also, the error message is misleading: when both `libname` and `alt_libname` are non-empty, the error message only reports `alt_libname`; it should report both.

To reproduce the error, you can use:

```python
import torch

torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_override_can_fuse_on_gpu(True)

@torch.jit.script
def jit_relu_dropout(x, prob):
    # type: (Tensor, float) -> Tensor
    x = torch.nn.functional.relu(x)
    x = torch.nn.functional.dropout(x, p=prob, training=True)
    return x

x = torch.randn((64, 40, 12, 1024), device="cuda:0", dtype=torch.float16, requires_grad=True)
y = jit_relu_dropout(x, 0.5)
```
with CUDA 11.3, and you will see
```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/nvrtc-failure.py", line 16, in <module>
    y = jit_relu_dropout(x, 0.5)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Error in dlopen or dlsym: libnvrtc-8aa72235.so.11.3: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57204

Reviewed By: ngimel

Differential Revision: D28122083

Pulled By: malfet

fbshipit-source-id: fd387cf79f33a6d5a5b93d54c9f21e9c23731045
2021-04-30 13:24:01 -07:00
72b1faa2d2 [8/n] [torch/elastic] Add unit tests for _RendezvousState (#56536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56536

This PR adds unit tests to ensure that the encoded byte length of `_RendezvousState` stays under a certain limit.
ghstack-source-id: 127626622

Test Plan: Run the newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27890704

fbshipit-source-id: 24905c8bc9d985d5ee90d370f28739eb137ce0f0
2021-04-30 13:14:52 -07:00
bbc3cc6718 [CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562)
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```

Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.
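
A minimal illustration of why the in-place update matters for capture; this is illustrative, not the actual amp kernel:

```python
import torch

scale = torch.full((), 65536.0)
print(scale.data_ptr())  # the address a captured graph would bake in
scale = scale * 2.0      # out of place: the result lives at a new address,
print(scale.data_ptr())  # so replay would keep reading the stale one
scale.mul_(2.0)          # in place: the address is unchanged,
print(scale.data_ptr())  # which is what graph replay requires
```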

I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.

Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562

Reviewed By: zou3519

Differential Revision: D28046159

Pulled By: ngimel

fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
2021-04-30 13:03:05 -07:00
3a777b6792 [PyTorch] Optimize intrusive_ptr(TTarget*) ctor (pybind) (#57053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57053

This ctor is intended for pybind use. It increments weakcount when creating a strong reference, which is only correct if you know that the value was previously zero. So, I consolidated make() with this ctor.
ghstack-source-id: 127537070

Test Plan: existing CI

Reviewed By: ezyang

Differential Revision: D28037206

fbshipit-source-id: eec57a99e3e032830f156c1e6258760f6465137b
2021-04-30 11:26:58 -07:00
b9b768c0e7 Revert D28011862: Add pybind interface to caffe2 quantization server
Test Plan: revert-hammer

Differential Revision:
D28011862 (81ef82e5f4)

Original commit changeset: 647383017c4f

fbshipit-source-id: 1e2dbaba7c5fdc98d75a3bcf3722b529e9109348
2021-04-30 11:20:38 -07:00
f54aa85a6c Fix MAGMA qr for empty batched inputs (#56257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56257

CPU and cuSOLVER path were fixed with refactoring of
`_linalg_qr_helper_default`.

Resolves https://github.com/pytorch/pytorch/issues/50576

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960157

Pulled By: mruberry

fbshipit-source-id: f923f3067a35e65218889e64c6a886364c3d1759
2021-04-30 11:15:03 -07:00
ff59039a24 Add cuSOLVER path for torch.linalg.qr (#56256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56256

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally. See https://github.com/pytorch/pytorch/pull/56256#issuecomment-821069086.

Performance comparison: https://github.com/pytorch/pytorch/pull/56256#issuecomment-821077712.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960154

Pulled By: mruberry

fbshipit-source-id: 5312330d82337dec2856ec5527156a3a547a0b50
2021-04-30 11:15:01 -07:00
6cb9abfd20 Remove size arguments for internal orgqr and geqrf calls (#56255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56255

With the refactored non-allocating `linalg_qr_out_helper` from the previous
commit, we don't need to specify the size arguments, because the inputs to
orgqr and geqrf are always of the correct size.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960153

Pulled By: mruberry

fbshipit-source-id: 0f9be25781371633378752b587da62b828816646
2021-04-30 11:14:59 -07:00
d5e1cac6e1 Add non-allocating helper function for torch.linalg.qr (#56254)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56254

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960151

Pulled By: mruberry

fbshipit-source-id: 4067afed0dcca3f32d0fa153e50a268a850817b2
2021-04-30 11:13:22 -07:00
e68c46bb3a Propagate information on torch_shm_manager execl failure to parent process (#57310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57310

If we fail to exec `torch_shm_manager`, write an appropriate error message to stdout so that the parent process can have some context on the failure.

Reviewed By: ejguan

Differential Revision: D28047917

fbshipit-source-id: 68bf357df7a6b318c036f4f62cbb428a62cb139e
2021-04-30 11:11:09 -07:00
2c2aa9e030 Address temp file/bind race condition in torch_shm_manager (#57309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57309

Addressing a race condition that can occur in `torch_shm_manager` between the time its temporary file is unlinked and when it `bind()`s the manager server socket to that same name. In that time window, other threads/processes can re-create another temporary file with the same name, causing `bind()` to fail with `EADDRINUSE`.

This diff introduces `c10::TempDir` and associated helper functions that mirror those of `c10::TempFile` and generates the manager socket name using a combination of a temporary directory, which will be valid for the lifetime of `torch_shm_manager`, and a well-known file name within that directory that will never be used outside of `bind()`.
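
A Python analogue of the fix; the actual change is C++ using `c10::TempDir`, and the names here are illustrative:

```python
import os
import socket
import tempfile

# old, racy: unlink a temp file name, then bind to it -- another process can
# recreate the same name in between, making bind() fail with EADDRINUSE.
# new: own a private directory for the manager's lifetime and bind to a
# well-known name inside it that nothing else ever creates.
mgr_dir = tempfile.mkdtemp(prefix="torch-shm-dir-")
addr = os.path.join(mgr_dir, "manager.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(addr)
```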

Reviewed By: ejguan

Differential Revision: D28047914

fbshipit-source-id: 148d54818add44159881d3afc2ffb31bd73bcabf
2021-04-30 11:11:07 -07:00
7eed5410cd Make c10::TempFile non-copyable but movable (#57308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57308

This diff makes `c10::TempFile` non-copyable but movable. `torch_shm_manager` was previously dependent upon some hidden behavior that was a result of copying `TempFile`s, which is also being made more explicit now that they can be moved but not copied.

Context:

`c10::TempFile` is currently copyable, which leads to surprising behavior. A seemingly valid `TempFile` may in fact be invalid if the original it was copied from has already been destroyed, resulting in the file descriptor to be closed and the filename being unlinked without the user knowing about it.

**In fact, both `c10::try_make_tempfile` and `c10::make_tempfile` cause copies of `TempFile` to be made**, which can easily be verified by explicitly deleting the copy constructor of `TempFile` and attempting to compile. This means that in practice, users of these functions are getting temporary files that have already been closed and unlinked.

This copying of `TempFile` is particularly interesting in the case of `torch_shm_manager`, which uses `try_make_tempfile` to generate the name of a Unix domain socket to communicate with clients. In order for `bind()` on the socket name to be successful, a file with that same name must not be linked in the filesystem, or `EADDRINUSE` will result. Happily, because `try_make_tempfile` previously created a copy of the `TempFile` while destroying the original, `torch_shm_manager` did not encounter this. With this change, however, `torch_shm_manager` must now explicitly destroy the `TempFile` before attempting to `bind()`. Unfortunately, this exposes a race condition--**other code can re-generate the same-named temporary file after the one created by `torch_shm_manager` is explicitly unlinked but before `torch_shm_manager` binds it to the server socket.** To be clear: this race condition already existed before this diff, but this makes things more explicit. The real fix will be in a follow-up change.

Reviewed By: ejguan

Differential Revision: D28047915

fbshipit-source-id: e8a1b6bb50419fe65620cfecdb67c566a4cf9056
2021-04-30 11:11:06 -07:00
788aefd7cc Propagate information on torch_shm_manager failures to parent process (#57307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57307

Extend the `"ERROR"` message that `torch_shm_manager` writes to the pipe when it encounters a fatal error with some extra context (specifically, the `what()` on a caught `std::exception`), allowing the parent process to gain some insight into the cause of the failure.

Also, simply return from `main()` with an error exit code when a fatal exception is caught rather than re-throwing, because re-throwing leads to premature process termination that may prevent standard output from being flushed (and therefore the parent process from being able to read the error context from the pipe).

Reviewed By: ejguan

Differential Revision: D28047916

fbshipit-source-id: d423ee8ed1b2bf7831db877e8f8515ec6d6aa169
2021-04-30 11:09:47 -07:00
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly mark the training graph as static; also set this flag in the logger.
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
5f2b9b1df9 refactor autograd_hook (#54981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54981

Move part of the code in autograd_hook into functions, so that it can be reused for static graph training later on.
ghstack-source-id: 127755405

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27439508

fbshipit-source-id: a02a4b029841f5e7f11cfc5496bb7972ef53d878
2021-04-30 11:06:04 -07:00
81ef82e5f4 Add pybind interface to caffe2 quantization server (#57330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57330

Differential Revision: D28011862

fbshipit-source-id: 647383017c4fbc9afc4fd5aa5c771fd6a4619e29
2021-04-30 10:53:34 -07:00
e62cdae469 Static Runtime support for aten::matmul (#57291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57291

aten::matmul support for static runtime

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_Binary_MatMul

Reviewed By: hlu1

Differential Revision: D28099671

fbshipit-source-id: 784035060c8c24953df47ca4227d2bca5094da22
2021-04-30 10:49:55 -07:00
0a9c9cc674 Update DLPack to 0.4 (#55365)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55090

I included the header directly, but I am not sure if we should add this as a git submodule; what do you guys think?
Also, regarding the implementation: ATen does not seem to support lanes, but CuPy exports complex types with 2 lanes, and I am not sure whether this is correct or not. However, in PyTorch this seems to work properly, so I allow 2 lanes for complex datatypes.

TODO: add tests for complex and bfloat

Easy test script against cupy

```python
import cupy
import torch

from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack

# Create a PyTorch tensor.
tx1 = torch.tensor(
    [2 + 1j, 3 + 2j, 4 + 3j, 5 + 4j], dtype=torch.complex128
).cuda()

# Convert it into a DLPack tensor.
dx = to_dlpack(tx1)

# Convert it into a CuPy array.
cx = cupy.fromDlpack(dx)

# Convert it back to a PyTorch tensor.
tx2 = from_dlpack(cx.toDlpack())
torch.testing.assert_allclose(tx1, tx2)
```

Thanks to leofang, who updated CuPy's DLPack version; his PR served as the guide for this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55365

Reviewed By: ngimel

Differential Revision: D27724923

Pulled By: mruberry

fbshipit-source-id: 481eadb882ff3dd31e7664e08e8908c60a960f66
2021-04-30 10:30:05 -07:00
b87d3fa432 [PyTorch][jit] Don't allow create() on singleton types (#56807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56807

If I understand correctly, there's no reason to create your own instance of these global singleton types.
ghstack-source-id: 127312270

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D27973447

fbshipit-source-id: f12df69d185f1baaa45f2ac6eac70570a7a65912
2021-04-30 10:28:50 -07:00
d896d1f4ce [fx splitter] Fix fusion group utility (#57280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57280

We've found an issue where a fusion group could result in a circular dependency. For example:
```
a -> b -> c -> d
|              ^
+--------------+
```
Only `a` has a non-tensor output, and currently we would create the fusion group (a, b, d). This results in a circular dependency, because the fusion group now depends on c while c depends on the fusion group as well.

This diff implements the solution discussed before: when we add a node to the fusion group, we also add all the nodes that lie between the fusion group and the newly added node, as sketched below.
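
A tiny runnable model of the example above; node names match the diagram, and the dictionary encoding is hypothetical:

```python
deps = {"b": {"a"}, "c": {"b"}, "d": {"c", "a"}}  # x depends on deps[x]
group = {"a", "b", "d"}

c_feeds_group = any("c" in deps.get(g, set()) for g in group)  # d needs c
c_needs_group = bool(deps["c"] & group)                        # c needs b
print(c_feeds_group and c_needs_group)  # True -> circular dependency

# the fix: when d joins the group, also absorb the nodes between the group
# and d (here just c), yielding the acyclic group {a, b, c, d}
group |= {"c"}
```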

Use the same logic in the minimizer to build the fusion group.

Test Plan: split_tests and net_min_tests

Reviewed By: khabinov

Differential Revision: D27917432

fbshipit-source-id: a3d99fe5929dbc9f8eb0f45bccd83fd7b173795a
2021-04-30 10:18:01 -07:00
7c8a7efe3f [nnc] Enable all fuser tests for cpu (#57332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57332

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28113481

Pulled By: bertmaher

fbshipit-source-id: b55e4bbcc25a09614b37985873b72337fdefc6b0
2021-04-30 10:11:06 -07:00
d50a969f2a reduce inline autodiff threshold so we can capture smaller fusions (#57062)
Summary:
This should let us fuse simpler expressions like

```python
@torch.jit.script
def foo(x):
    return torch.sigmoid(torch.sigmoid(x))
```

RUN_TORCHBENCH: alexnet attention_is_all_you_need_pytorch Background_Matting BERT_pytorch demucs densenet121 dlrm fastNLP gen_torchvision_benchmarks.py LearningToPaint maml mnasnet1_0 mobilenet_v2 mobilenet_v2_quantized_qat moco pyhpc_equation_of_state pyhpc_isoneutral_mixing pytorch_CycleGAN_and_pix2pix pytorch_mobilenet_v3 pytorch_stargan pytorch_struct resnet18 resnet50 resnext50_32x4d shufflenet_v2_x1_0 squeezenet1_1 Super_SloMo tacotron2 vgg16 yolov3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57062

Reviewed By: zou3519

Differential Revision: D28053608

Pulled By: Krovatkin

fbshipit-source-id: 6871c3d2a81dd326a481e7ecfaf2ffefffce4a89
2021-04-30 09:55:09 -07:00
e795f88d6b [NNC] Make flatten transform in-place (#56629)
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56157

This PR updates the `flatten` API in `LoopNest` to perform the flattening transformation in-place. After this transformation, the first loop in the input becomes the flattened loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56629

Reviewed By: H-Huang

Differential Revision: D28004787

Pulled By: navahgar

fbshipit-source-id: 7474ae237fae3fff0cd1c64a276a8831dc5b7db0
2021-04-30 09:51:45 -07:00
b49e079a2a Fix string_view::equals_ compilation by CUDA-11.3 (#57322)
Summary:
`__builtin_memcmp` is not constexpr for character arrays under the NVCC 11.3 compiler.
Attempting to compile this code results in the following error:
```
/opt/conda/lib/python3.6/site-packages/torch/include/c10/util/string_view.h(585): note: constexpr memory comparison is only supported for top-level integer or array-of-integer objects
/opt/conda/lib/python3.6/site-packages/torch/include/c10/util/string_view.h(340): note: called from:
/opt/conda/lib/python3.6/site-packages/torch/include/c10/util/string_view.h(369): note: called from:

```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/57322

Reviewed By: janeyx99

Differential Revision: D28119125

Pulled By: malfet

fbshipit-source-id: e5ff6ac7bb42022e86c9974919e055cf82c2ea83
2021-04-30 09:07:15 -07:00
52805a0f4f [PyTorch] Include hip_runtime.h in macros.h (#57070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57070

See code comment.
ghstack-source-id: 127564865

Test Plan: CI, should unbreak build of following formatting diff

Reviewed By: ngimel

Differential Revision: D28044331

fbshipit-source-id: f571e60b2534313fb9e7dd13dd98d2441b9ce8b8
2021-04-30 09:02:48 -07:00
c971401696 [JIT] Disable conv-add-relu fusion for cuDNN7 when model uses fp16 (#56579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56579

On earlier cuDNN versions, when a model uses fp16, the
performance after conv-add-relu fusion regresses. Let's just
disable the fusion for fp16 if cuDNN version is older than v8.
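
For illustration, a minimal sketch (not from the commit) of the version gate described above; the helper name is hypothetical, not the actual JIT pass internals:

```python
import torch

def should_fuse_conv_add_relu(dtype: torch.dtype) -> bool:
    # cudnn.version() returns e.g. 8005 for cuDNN 8.0.5, or None if absent.
    cudnn_version = torch.backends.cudnn.version() or 0
    # Disable the fusion for fp16 when cuDNN is older than v8 (< 8000),
    # where the fused kernel was observed to regress performance.
    return not (dtype == torch.half and cudnn_version < 8000)
```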

Test Plan: Tested for fp16 models on Nvidia Tesla T4

Reviewed By: ZolotukhinM

Differential Revision: D27915514

Pulled By: desertfire

fbshipit-source-id: 1c0081a80540c507e608216c90bc74c486c7008d
2021-04-30 08:57:50 -07:00
731cc472c5 refactor autocast to be extensible for devices (#57104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57104

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28094173

Pulled By: ezyang

fbshipit-source-id: a5fb62b9a4e58f30d2756bba4331d5fc88136b89
2021-04-30 08:46:40 -07:00
095c328d9f Add supported backward_dtype to OpInfo (#56156)
Summary:
Related to https://github.com/pytorch/pytorch/issues/55601.

- [x] removed complex autograd checker in `test_supported_backward`
- [x] created `backward_dtype[If<Device>]` that inherits from normal `dtype[If<Device>]` by default
- [x] removed all skips for the backward test; instead added backward dtypes
- [x] changed complex autograd to a function call: `support_complex_autograd(device_type)` that depends on `backward_dtype*`, since they essentially mean the same thing for complex types

TODO for next PR
- add `test_unsupported_backward` to verify they are actually unsupported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56156

Reviewed By: mruberry

Differential Revision: D27926717

Pulled By: walterddr

fbshipit-source-id: 9a4af8612278ca44a97b6f1510b6b175852c893b
2021-04-30 08:24:58 -07:00
e08303c740 Revert D27582224: [pytorch][PR] Automated submodule update: FBGEMM
Test Plan: revert-hammer

Differential Revision:
D27582224 (54469e157b)

Original commit changeset: 6670e96b21d8

fbshipit-source-id: fbc6ab0d35ff6168cb341477e7e86169ab1a43bf
2021-04-30 07:47:47 -07:00
0dddfbf346 Revert D28114231: [pytorch][PR] Automated submodule update: FBGEMM
Test Plan: revert-hammer

Differential Revision:
D28114231 (264db1959e)

Original commit changeset: 0a5883ebb2fc

fbshipit-source-id: edcb0d2ae1adfdea0999a6e410bdbe530bf61dda
2021-04-30 07:41:47 -07:00
95dc2b6e9b Remove unused forward AD flag (#57058)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57058

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28071504

Pulled By: albanD

fbshipit-source-id: df694ac6b9fbb4aed269d61cd9522f8602fdae0c
2021-04-30 07:32:56 -07:00
83f186717b Improve perf for forward AD view handling (#57057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57057

This PR performs optimization on the ViewInfo handling to remove the need for the "no forward AD mode".
- When the forward and backward ViewInfo are the same, create and store only one of them

Code for timing:
```python
timer = Timer(
    stmt='a.view(-1)',
    setup='''\
import torch
a = torch.rand(4)''')

res = timer.collect_callgrind(repeats=2, number=10)[1]
```

Difference between master and this PR:
```
# Benchmark at master
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fe33be83690>
a.view(-1)
setup:
  import torch
  a = torch.rand(4)

                           All          Noisy symbols removed
    Instructions:        69286                      68442
    Baseline:             1332                       1188
10 runs per measurement, 1 thread

# Benchmark at this branch
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fe33bd7ec30>
a.view(-1)
setup:
  import torch
  a = torch.rand(4)

                           All          Noisy symbols removed
    Instructions:        69437                      68562
    Baseline:             1363                       1188
10 runs per measurement, 1 thread

# Difference between the two
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7fe1216e9a00>
    160  ???:0x000000000a11c8d0
     60  torch::autograd::DifferentiableViewMeta::DifferentiableViewMeta
     60  ???:torch::autograd::as_view(at::Tensor const&, at::Tensor const&, bool, bool, std::function<at::Tensor (at::Tensor const&)>, torch::autograd::CreationMeta, bool)
     40  ???:0x0000000008e14f50
     40  ???:0x0000000008e05bd0
     40  ???:0x0000000008e05480
     40  ???:0x0000000008e036d0
     40  ???:0x0000000008e02720
     30  make_variable_differentiable_view
    ...
    -20  ???:0x0000000008e02060
    -20  ???:0x0000000008e01fd0
    -30  ???:torch::autograd::isForwardADEnabled()
    -40  ???:0x0000000008e14f90
    -40  ???:0x0000000008e05c00
    -40  ???:0x0000000008e054a0
    -40  ???:0x0000000008e036f0
    -40  ???:0x0000000008e02740
   -160  ???:0x000000000a11d8d0

Total: 120

```

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28071505

Pulled By: albanD

fbshipit-source-id: 672b1bdf87d516b6de4f2e36656819cfd6f4c9b9
2021-04-30 07:32:54 -07:00
b016bc1c91 fix InplaceOrView implementation for manual functions (#57152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57152

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28071506

Pulled By: albanD

fbshipit-source-id: ef015593dd81be11bc08714d07e0ac4f26e188ec
2021-04-30 07:32:53 -07:00
c91bd25e90 Fix use of allow_tensor_metadata in view variable creation (#57069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57069

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28071507

Pulled By: albanD

fbshipit-source-id: 44f0e09846fdc569cf1a62a6f80ca88911e7e45c
2021-04-30 07:31:54 -07:00
6fa1d880b6 make external codegen aware of autogen'd composite kernels (#56960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56960

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D28012667

Pulled By: bdhirsh

fbshipit-source-id: 56da050c5d46b8952ddecfa83ebd5fe8454acffe
2021-04-30 07:23:28 -07:00
d4ddb47719 [special] Add xlog1py (#55138)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/50345

* [x] Check Rendered Document (https://12494173-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.xlog1py)
* [x] Tests in Binary Ufunc
* [x] OpInfo
* [x] Structured Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55138

Reviewed By: ngimel

Differential Revision: D27961461

Pulled By: mruberry

fbshipit-source-id: 30a8f41970a829bf50254aadf5615e8ce4148c7e
2021-04-30 05:51:13 -07:00
5b3e7638ca Expand Kineto profiler support (part 1) (#57333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57333

Pull Request resolved: https://github.com/pytorch/kineto/pull/193

Expanding Kineto support to more platforms

Test Plan:
CI and OSS CI:
https://github.com/pytorch/pytorch/pull/56323

Reviewed By: gdankel

Differential Revision: D27873669

fbshipit-source-id: 4a72a589f958440cbfff247751b7f4e1910a10c7
2021-04-30 05:02:23 -07:00
db32b69591 quote str kwarg values in test_ops.py::TestCommon::test_jit_alias_remapping (#57120)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57119.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57120

Reviewed By: gchanan

Differential Revision: D28086601

Pulled By: mruberry

fbshipit-source-id: 566a53c2365f2d128da49ac58463e37b36455831
2021-04-30 04:29:12 -07:00
df69b0d060 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28115855

fbshipit-source-id: 20434a96dae636db53fae089042342000fc103c7
2021-04-30 04:18:28 -07:00
264db1959e Automated submodule update: FBGEMM (#57342)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 5ce0eed074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57342

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D28114231

fbshipit-source-id: 0a5883ebb2fcd45ff547d594928372a9a9c9b76c
2021-04-30 00:01:15 -07:00
54469e157b Automated submodule update: FBGEMM (#55347)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: c565348fdc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55347

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D27582224

fbshipit-source-id: 6670e96b21d84dc6464559bf179f74751927fdd4
2021-04-29 22:51:42 -07:00
b3e1802439 Static runtime support for fb::expand_dims (#57282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57282

Added support for fb::expand_dims for SR.

Test Plan:
buck test caffe2/torch/fb/sparsenn:gpu_test -- test_expand_dims

buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

Reviewed By: hlu1

Differential Revision: D28043049

fbshipit-source-id: 01f59db7b507f027b220f044d6ff23602adbdb06
2021-04-29 22:40:56 -07:00
e31b67f550 [torch/deploy] opt torch/csrc/deploy into autoformatting
Summary: One-time formatting change + editing fbsource-lint-engine.toml.

Test Plan:
```
arc lint --take CLANGFORMAT --apply-patches --paths-cmd 'hg files caffe2/torch/csrc/deploy'
```

Reviewed By: wconstab, Lilyjjo

Differential Revision: D28100954

fbshipit-source-id: 831e5796d23c99a2f92e7abd9983ac07b1cf6fbb
2021-04-29 22:29:24 -07:00
ac72881f3f Fix a numerical issue of CUDA channels-last SyncBatchNorm (#57077)
Summary:
Fix a numerical issue of CUDA channels-last SyncBatchNorm

The added test is a repro for the numerical issue. Thanks for the help from jjsjann123 who identified the root cause. Since pytorch SBN channels-last code was migrated from [nvidia/apex](https://github.com/nvidia/apex), apex SBN channels-last also has this issue. We will submit a fix there soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57077

Reviewed By: mruberry

Differential Revision: D28107672

Pulled By: ngimel

fbshipit-source-id: 0c80e79ddb48891058414ad8a9bedd80f0f7f8df
2021-04-29 21:38:52 -07:00
c44cbc63cc Ignore more compiler warnings, unify WERROR options (#56630)
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D28005063

fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
2021-04-29 21:20:29 -07:00
65968ab817 Revert "Remove sync for randperm on small tensors. (#54113)" (#57299)
Summary:
This reverts commit e8c268746b297efa988e03abc61ff22203bf3980.
It occasionally produces wrong results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57299

Reviewed By: wat3rBro

Differential Revision: D28102706

Pulled By: ngimel

fbshipit-source-id: d7618e104d854c3b96aa502fb4e30041b9aab5df
2021-04-29 17:52:21 -07:00
49dbe1798f [kineto] Deprecate ClientTraceActivity and merge it with GenericTraceActivity (#56743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56743

Pull Request resolved: https://github.com/pytorch/kineto/pull/184

As part of the migration from ClientTraceActivity to GenericTraceActivity, now that CTA fully mirrors GTA's data structure, we can safely swap out the symbol name.

Test Plan:
- `buck build kineto`
- sandcastle to catch any other breakage in dependees

Took before and after of `fastrnns` bench
`buck run mode/opt //caffe2/benchmarks/fastrnns:bench -- --cnns resnet50 --group cnns --nloops 1000`

Before
https://fburl.com/perfdoctor/9n0izgji

{F611729029}

After
https://fburl.com/perfdoctor/h9d9tlmp
{F611725475}

Sample ParamComms traces
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1619503816%2F127.0.0.1%2Flibkineto_activities_4003656.json.gz&bucket=gpu_traces

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1619503816%2F127.0.0.1%2Flibkineto_activities_4003657.json.gz&bucket=gpu_traces

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1619503816%2F127.0.0.1%2Flibkineto_activities_4003658.json.gz&bucket=gpu_traces

Reviewed By: gdankel

Differential Revision: D27353973

fbshipit-source-id: 7012c6524c3c75079029ac290c1dd722ac187ec5
2021-04-29 16:36:40 -07:00
16fc18bf82 port neg to structured kernel (#57212)
Summary:
The `negative` alias is not yet ported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57212

Reviewed By: driazati

Differential Revision: D28095043

Pulled By: walterddr

fbshipit-source-id: 6c7bcd727800bb1db7add43a152de7b58f4ccf43
2021-04-29 15:59:08 -07:00
995161203b Fix sort for slow gradcheck (#57192)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57166

Generate new inputs until we get one where we know that x + eps won't change its sorted order.
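
A minimal sketch (not from the commit) of the resampling idea, assuming double-precision inputs and a hypothetical helper name:

```python
import torch

def make_sort_input(shape, eps=1e-6):
    # Resample until adjacent sorted values differ by more than 2*eps,
    # so perturbing any element by +/- eps cannot change the sort order.
    while True:
        x = torch.randn(shape, dtype=torch.double)
        if (x.sort().values.diff() > 2 * eps).all():
            return x.requires_grad_()

x = make_sort_input((8,))
assert torch.autograd.gradcheck(lambda t: torch.sort(t).values, (x,), eps=1e-6)
```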

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57192

Reviewed By: albanD

Differential Revision: D28102361

Pulled By: soulitzer

fbshipit-source-id: a12377cc135b0bd92adf0914a100969317b97e8c
2021-04-29 15:48:02 -07:00
e27740b38e [torch] Add backward support for segment reduce (CPU only)
Summary:
This is to setup boiler plate code for backward and CPU implementation.

Next Steps in order:
- Add backward support for CUDA
- Add support for more aggregation types
- Benchmarking (for cuda mainly)/more testing/documentation
- Support for multi dimension

Test Plan:
Updated unit test to also check correctness of backward.

Wait for CI signal

Reviewed By: ngimel

Differential Revision: D27970340

fbshipit-source-id: 3e608c7fe3628b0a761dd8affc6aad8f65a6ef7f
2021-04-29 15:41:37 -07:00
d1def93166 [torch/debuggability] use log.info() in addition to print() in timeoutguard (#57296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57296

It seems many trainers disable print(), so we cannot see the thread dumps from CompleteInTimeOrDie(). Emit them via log.info() as well.
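
A sketch of the pattern, assuming a module-level logger (illustrative, not the exact timeoutguard code):

```python
import logging

log = logging.getLogger(__name__)

def dump(msg: str) -> None:
    # print() may be swallowed when trainers redirect stdout, so also
    # emit the thread dump through the logging framework.
    print(msg, flush=True)
    log.info(msg)
```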

Test Plan: sandcastle

Reviewed By: aalmah

Differential Revision: D28098738

fbshipit-source-id: dfdca8801bacf5c7bccecc2387cb7ef41dadfa46
2021-04-29 15:23:35 -07:00
c2fbd96735 [RPC Framework] Expose a Python API for device map getter (#57179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: https://github.com/pytorch/pytorch/pull/56854#issuecomment-827762398

Additionally, add a const decorator for the C++ getter.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127684266

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D28070160

fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
2021-04-29 14:29:10 -07:00
2c6f5e8a12 [package] PackageExporter __import__ logic to not parse dynamic cases (#57283)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57283

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28095858

Pulled By: Lilyjjo

fbshipit-source-id: c3cec074e6b2c48a09785fa0c02cd576b7ec94d9
2021-04-29 14:21:33 -07:00
6ed90ed1ac Added OpInfos for sub & mul (#56227)
Summary:
`OpInfo`s for `sub` & `mul` operators. Both of them will reuse the sample inputs function added for `add` via another PR.

A https://github.com/pytorch/pytorch/issues/54261 task.

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56227

Reviewed By: H-Huang

Differential Revision: D27993889

Pulled By: mruberry

fbshipit-source-id: 7b2da02b0edba3cc37b5b1b88ca32f7dd369ca60
2021-04-29 14:10:15 -07:00
149000c3f0 Update compare_set docs (#57203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57203

Update documentation to remove warning. Refactored arguments from `old_value` -> `expected_value` and `new_value` -> `desired_value`

Test Plan: Imported from OSS

Reviewed By: gchanan, cbalioglu

Differential Revision: D28076556

Pulled By: H-Huang

fbshipit-source-id: 5fcc5bcfff89cad51d8dc0b74a234964f1af20ed
2021-04-29 13:58:57 -07:00
e31265dfb3 Fix path handling on Win32 in rendezvous.py (#57000)
Summary:
Fixes test failure after https://github.com/pytorch/pytorch/issues/56598

Introduced by https://github.com/pytorch/pytorch/issues/45335.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57000

Reviewed By: zou3519

Differential Revision: D28030360

Pulled By: seemethere

fbshipit-source-id: 4871d51e6b80dceef8bf95c6c658441287575f63
2021-04-29 13:55:11 -07:00
a6fa6a6cda [fx minimizer] Add an option to minimizer to allow return all intermediate results (#57279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57279

Added an option "return_intermediate". If true, when building the submodule we want to run, we replace the output with all the nodes, so that the intermediate results of every node are returned as output.

This is recommended for use with the `run_node()` function.

Test Plan: `buck test glow/fb/nnpi/lowering:net_min_tests`

Reviewed By: khabinov

Differential Revision: D27913887

fbshipit-source-id: 5a3eab02da05214fb9adeb25656c267b58075b1d
2021-04-29 13:46:25 -07:00
95f393f212 Add compare_set to trampoline class, add typing and formatting (#57191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57191

Changed Store::compareSet() to a pure virtual function and added compareSet definition to PythonStore. Rest of changes are from clang-format.

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28076557

Pulled By: H-Huang

fbshipit-source-id: 379636cf8b031088341a032250ba410d84ccf692
2021-04-29 13:29:11 -07:00
be0ca00c5c [torch/deploy] Minor housekeeping in interpreter_impl
Summary:
1. Delete dead code relating to maskrcnn_benchmark extension module
2. Add some more commentary on why we define a meta path finder

isthisimpact

Test Plan: sandcastle

Reviewed By: wconstab

Differential Revision: D28078211

fbshipit-source-id: cfc6f47861c14ec7482b55ee585504271ae0f365
2021-04-29 12:51:56 -07:00
4b96fc060b Remove distutils (#57040)
Summary:
[distutils](https://docs.python.org/3/library/distutils.html) is on its way out and will be deprecated-on-import for Python 3.10+ and removed in Python 3.12 (see [PEP 632](https://www.python.org/dev/peps/pep-0632/)). There's no reason for us to keep it around since all the functionality we want from it can be found in `setuptools` / `sysconfig`. `setuptools` includes a copy of most of `distutils` (which is fine to use according to the PEP), that it uses under the hood, so this PR also uses that in some places.
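
A small sketch of the kind of swap this makes, using standard-library `sysconfig` in place of the deprecated `distutils.sysconfig`:

```python
import sysconfig

# Paths previously fetched via distutils.sysconfig are available here.
include_dir = sysconfig.get_paths()["include"]
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
print(include_dir, ext_suffix)
```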

Fixes #56527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57040

Pulled By: driazati

Reviewed By: nikithamalgifb

Differential Revision: D28051356

fbshipit-source-id: 1ca312219032540e755593e50da0c9e23c62d720
2021-04-29 12:10:11 -07:00
21be40b390 Add torch_cpu specific flag for debug info (#57190)
Summary:
Right now we are using `REL_WITH_DEB_INFO=1` on Linux CI binary builds. This is causing intermittent failures on CUDA builds, since the debug information increases the load on the linker. This adds a workaround: a flag to enable debug info only for the target we actually want it for (`libtorch_cpu.so`; all the other binaries are stripped of their debug info after building).

Example failures (from [the hud](https://ezyang.github.io/pytorch-ci-hud/build2/pytorch-nightly?mode=nightly)):
* https://app.circleci.com/pipelines/github/pytorch/pytorch/311785/workflows/df640957-54b0-4592-aeef-6d5baee503ae/jobs/12932229
* https://app.circleci.com/pipelines/github/pytorch/pytorch/311784/workflows/e3b487d6-fb46-4a5d-a2d5-22eec328b678/jobs/12932228
* https://app.circleci.com/pipelines/github/pytorch/pytorch/311784/workflows/e3b487d6-fb46-4a5d-a2d5-22eec328b678/jobs/12932227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57190

Pulled By: driazati

Reviewed By: janeyx99

Differential Revision: D28085550

fbshipit-source-id: 0fc5b3e769b10c0dd3811717f968d0c933667361
2021-04-29 12:06:15 -07:00
d3ffe9ab6b [PyTorch] Allocate correctly-sized output tensor in addmm_cuda (#56033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56033

There doesn't seem to be any reason not to size the output
correctly, and it avoids a round of dispatch for resize.
ghstack-source-id: 127409715

Test Plan:
Inspected GPU trace for simple nn.Linear in a loop. No more
resize operator invocation.

Existing CI should let us know if this is incorrect

Reviewed By: ngimel

Differential Revision: D27768311

fbshipit-source-id: fb48ec50f3cffc1015ef03d528e9007274b4dd3a
2021-04-29 11:59:51 -07:00
dd9f4c8cc9 [PyTorch] Reduce move overhead in inferExpandGeometry (#56032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56032

Profiling & assembly inspection showed that we weren't
getting NRVO with `inferExpandGeometry_dimvector` returning
`std::tuple`. I added a custom type with constructors so that, as the
comment says, we could be sure to get NRVO.
ghstack-source-id: 127409717

Test Plan:
Inspected new assembly, no more move construction (which is
a copy for on-stack DimVectors!) upon returning

Reviewed By: ezyang

Differential Revision: D27768312

fbshipit-source-id: d1d53a36508be92585802e1467d8a42d1ae05d80
2021-04-29 11:59:50 -07:00
fb2f3cd172 [PyTorch] Migrate copy_ to borrow input/output (#56031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56031

Copy kernels just immediately do the copy; borrowing should
be fine.
ghstack-source-id: 127409719

Test Plan: CI, review

Reviewed By: ezyang, walterddr

Differential Revision: D27768310

fbshipit-source-id: 7651731fd3dea14adbdb3fef95a6d67c02175508
2021-04-29 11:59:48 -07:00
a1d2bd56a0 [PyTorch] Make as_strided_ use_const_ref_for_mutable_tensors (#55875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55875

One less const-incorrect function.
ghstack-source-id: 127409720

Test Plan: fitsships

Reviewed By: ezyang

Differential Revision: D27686995

fbshipit-source-id: 6ba3fe86be9957770920177649f586da8134a09a
2021-04-29 11:58:38 -07:00
ac86e0a0e5 fix: index_fill_ formula to support duplicate indices (#57101)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57101

Reviewed By: gchanan

Differential Revision: D28076988

Pulled By: albanD

fbshipit-source-id: 1c1bd396282ca030b2445e4f3e1912f3c5a42b6c
2021-04-29 11:29:17 -07:00
ec86f96e91 Fix for derivative of sinc(x) when x is positive but very very small (#56986)
Summary:
The problem arises for sinc'(x) where x != 0 but x ** 2 == 0, which happens for some very small floats.

I realized that my solution from https://github.com/pytorch/pytorch/issues/56763 was incomplete when I did a quick implementation using `torch.autograd.Function` and still got a `NaN` from my derivative.
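
A minimal repro sketch (not from the commit) of the failure mode, assuming float32, where 1e-30 is nonzero but its square underflows to 0:

```python
import torch

x = torch.tensor(1e-30, dtype=torch.float32, requires_grad=True)
assert x != 0 and x * x == 0  # positive, but x**2 underflows to zero

torch.sinc(x).backward()
print(x.grad)  # was NaN before the fix; ~0 (the true derivative) after
```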

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56986

Reviewed By: gchanan

Differential Revision: D28093507

Pulled By: albanD

fbshipit-source-id: 2a30e1065b08c5c60de843a0778dedeb0fb295f4
2021-04-29 11:16:39 -07:00
fd67088a57 [Distributed test]Enable ddp_control_flow tests for ROCm (#57159)
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57159

Reviewed By: zou3519

Differential Revision: D28074244

Pulled By: rohan-varma

fbshipit-source-id: 03e66cf5f546987b3d6d1b9c5feafcdf8292573e
2021-04-29 11:10:47 -07:00
2e2c0099eb Support type inference of nn.Module methods using PDT (#57165)
Summary:
Adds support for type inference of nn.Module methods using monkeytype in JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57165

Reviewed By: gmagogsfm

Differential Revision: D28064983

Pulled By: nikithamalgifb

fbshipit-source-id: 303eaf8d7a27e74be09874f70f519b4c1081645b
2021-04-29 11:09:37 -07:00
8a949f9e51 [23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831

Rename torch.distributed.elastic_launch to torch.distributed.run

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
  buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
  flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow

Reviewed By: kiukchung

Differential Revision: D27921159

fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12
2021-04-29 11:06:20 -07:00
c72f01ab6b Add CI workflow and script to test torchbench. (#56957)
Summary:
This PR adds a TorchBench (pytorch/benchmark) CI workflow to pytorch. It tests PRs whose body contains a line starting with "RUN_TORCHBENCH: " followed by a list of TorchBench model names. For example, this PR will create a TorchBench job running the pytorch_mobilenet_v3 and yolov3 models.

For security reasons, only the branch on pytorch/pytorch will run. It will not work on forked repositories.

The model names have to match the exact names in pytorch/benchmark/torchbenchmark/models, separated by commas. Only the first line starting with "RUN_TORCHBENCH: " is respected. If nothing is specified after the magic word, no test will run.

Known issues:
1. Builds PyTorch from scratch and does not reuse build artifacts from other workflows. This is because the GHA migration is still in progress.
2. Currently there is only one worker, so jobs are serialized. We will review the capacity issue after this is deployed.
3. If the user would like to rerun the test, she has to push to the PR. Simply updating the PR body won't work.
4. Only supports environment CUDA 10.2 + python 3.7

RUN_TORCHBENCH: yolov3, pytorch_mobilenet_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56957

Reviewed By: janeyx99

Differential Revision: D28079077

Pulled By: xuzhao9

fbshipit-source-id: e9ea73bdd9f35e650b653009060d477b22174bba
2021-04-29 11:02:38 -07:00
ee71584236 Update compare_set implementation for FileStore and HashStore (#57175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57175

Update the other Store implementations to add the value when the current value is empty, matching the amendment made to TCPStore (#55636). Added a test to cover this case.
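
A sketch of the agreed semantics, assuming the Python `compare_set` binding on TCPStore (single-process setup with an illustrative port):

```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29510, 1, True, timedelta(seconds=30))

# Key absent + empty expected value: the desired value is stored.
store.compare_set("key", "", "v1")
# Key present and current matches expected: swap to the desired value.
store.compare_set("key", "v1", "v2")
assert store.get("key") == b"v2"
```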

Test:
`pytest -vs test/distributed/test_c10d_common.py -k compare_set`

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28069380

Pulled By: H-Huang

fbshipit-source-id: eac703edb41faee32a4e7cda61107e2a0e726326
2021-04-29 10:48:11 -07:00
ecacb8c78b [quant][graphmode][fx] Fix getitem for unmatched nodes (#57173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57173

If getitem is followed by an unmatched node, we'll remove the observer after it.

Test Plan:
python test/test_quantization.pyt TestQuantizeFxOps.test_getitem

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28068805

fbshipit-source-id: e79f8ec3e8fd61d348b8a7069ab0bb434d737c30
2021-04-29 10:16:44 -07:00
9486fc3229 [PyTorch][Edge] share readArchiveAndTensors between mobile and jit (#57098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57098

1. Separate `readArchiveAndTensors()` from `jit/import.cpp` to a new file `jit/import_read.cpp`.
2. Use `readArchiveAndTensors()` in `mobile/import.cpp`
3. Add a util function in cpp that can read .pkl files directly instead of loading the entire module
ghstack-source-id: 127703081

Test Plan: CI

Reviewed By: raziel, iseeyuan

Differential Revision: D28052193

fbshipit-source-id: c8d57f3270bdcf2e52a32f7c111899bd5da7cac2
2021-04-29 10:09:50 -07:00
2c8ea63cbb add a test for grad view with torch amp (#56730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730

Add a test to verify that DDP with torch.amp produces the same results when using grad_as_bucket_view=true and false.

The torch.amp scale factor does not have dependencies on old gradients, thus it is not affected by grad_as_bucket_view=true or false; see
how torch.amp is implemented here: https://github.com/pytorch/pytorch/pull/33366/files.

This diff verified DDP can work as expected with amp.GradScaler and amp.autocast when using grad_as_bucket_view=true and false.
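
A minimal sketch (not from the commit) of the pattern the test exercises, assuming a single-process NCCL setup with illustrative names:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("nccl", rank=0, world_size=1)

model = DDP(nn.Linear(8, 8).cuda(0), device_ids=[0],
            gradient_as_bucket_view=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(4, 8, device="cuda:0")).sum()
    scaler.scale(loss).backward()  # grads live in DDP's bucket views
    scaler.step(opt)
    scaler.update()

dist.destroy_process_group()
```
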
ghstack-source-id: 127526358

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27950132

fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
2021-04-29 10:06:07 -07:00
e96667175e .circleci: Switch libtorch builds to use smaller image (#56937)
Summary:
These weren't using the smaller images so we should probably let them
use the smaller images

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56937

Reviewed By: walterddr

Differential Revision: D28077747

Pulled By: seemethere

fbshipit-source-id: da0245bc3b4f564fcd392630542777b2b668b98f
2021-04-29 10:01:41 -07:00
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
71c2f88b90 Make CUDAFuture handle any kind of device type (#57051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57051

Make CUDAFuture autodetect the device type from its arguments (which thus change from DeviceIndices to full Devices). This in fact transforms CUDAFuture into an AnythingFuture, since it's not tied to CUDA in any way anymore. Having made it fully device-agnostic, we'll merge it into ivalue::Future in the next PR.
ghstack-source-id: 127713134

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28032711

fbshipit-source-id: 8ba23b1b0d97f61db8693cd5f3c7bae7989a9bcd
2021-04-29 09:31:50 -07:00
cf1595c48b Use only generic helpers in CUDAFuture (#57050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57050

Avoid (nearly*) any explicit mention of CUDA in CUDAFuture, and instead use "generic" classes like c10::Event, c10::Stream and most notably c10::impl::DeviceGuardImplInterface which allow us to indirectly manipulate CUDA entities. This is a preparation step to make CUDAFuture device-agnostic and thus become able to merge it into ivalue::Future.

* The one exception is when we construct the c10::impl::DeviceGuardImplInterface, where for now we still hardcode CUDA. This will be fixed in the very next PR
ghstack-source-id: 127713133

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28032710

fbshipit-source-id: a240ecc32bda481e8ecf85dab94933e24f832bb0
2021-04-29 09:31:48 -07:00
682476022f Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049

There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.

The new generic MultiStreamGuard class takes a vector of device-agnostic c10::Streams and supports any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience and partly for performance, as it avoids a vtable lookup.
ghstack-source-id: 127713139

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029158

fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
2021-04-29 09:31:47 -07:00
381698f900 Simplify CUDAMultiStreamGuard (#57048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57048

CUDAMultiStreamGuard had a default constructor and an `original_devices()` method which were only used in a test. I'm removing them here to simplify the API and make it easier to manipulate this class later. One extra benefit is that this class used to get and store the current stream of _all_ devices, whereas now it only does so for the relevant devices.
ghstack-source-id: 127713136

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029160

fbshipit-source-id: 185ef9a7ac909cd0ae6507dad9826fe978e67308
2021-04-29 09:31:45 -07:00
ea64c90ecc Add recordDataPtrOnStream to DeviceGuardImplInterface (#57047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57047

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to record a DataPtr onto a stream with the caching allocator.
ghstack-source-id: 127713135

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029161

fbshipit-source-id: ff337ab8ccc98437b5594b2f263476baa1ae93e7
2021-04-29 09:31:43 -07:00
6fdf092cad Add getStreamFromPool to DeviceGuardImplInterface (#57046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57046

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to get a stream from the global ATen pool.
ghstack-source-id: 127713137

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029159

fbshipit-source-id: 5055d84c1f3c2a4d86442f3149455c5ebd976dea
2021-04-29 09:30:41 -07:00
63533478bd Fix misleading messages in test_jit_c10d (#57256)
Summary:
TCPStore is now available on Windows.

Before: `TCPStore not available on Windows`
After:  `c10d was not compiled with the NCCL backend`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57256

Reviewed By: gchanan

Differential Revision: D28092539

Pulled By: H-Huang

fbshipit-source-id: 1e48cfe29b33b102bc97f51268ac1bbda596397d
2021-04-29 09:17:41 -07:00
b232659765 Replaced _lstsq_helper with internal dispatch (#54724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54724

Removed at::_lstsq_helper; it is replaced with DEFINE/DECLARE_DISPATCH.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27993747

Pulled By: mruberry

fbshipit-source-id: dc8b884fd33b3dd18d9a8e4c582b869ac5391de5
2021-04-29 09:11:14 -07:00
03962bc7f1 Updated linalg.lstsq with NumPy compatible kwarg rcond (#54723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54723

Renamed "cond" -> "rcond" to be NumPy compatible. The default value for
rcond was changed to match non-legacy NumPy behavior.
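
A short usage sketch of the renamed keyword:

```python
import torch

a = torch.randn(5, 3, dtype=torch.double)
b = torch.randn(5, 2, dtype=torch.double)

# The keyword is now `rcond`, matching numpy.linalg.lstsq; None picks
# the machine-precision-based default.
res = torch.linalg.lstsq(a, b, rcond=None)
print(res.solution.shape)  # torch.Size([3, 2])
```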

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27993741

Pulled By: mruberry

fbshipit-source-id: a4baf25aca6a8272f1af2f963600866bfda56fb3
2021-04-29 09:11:12 -07:00
5a02f72fcf Modified batched residuals return of torch.linalg.lstsq (#54722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54722

SciPy and NumPy operate only on non-batched input and return an empty array with shape (0,) if rank(a) != n.
The behavior for non-batched inputs is NumPy and SciPy compatible and the same result is computed.
For batched inputs, if any matrix in the batch has a rank less than `n`, then an empty tensor is returned.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27993736

Pulled By: mruberry

fbshipit-source-id: 0d7cff967b322a5e816a23f282b6ce383c4468ef
2021-04-29 09:10:12 -07:00
36ebd0f65d Improve LeftRight documentation (#57164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57164

Give some more indications about its performance characteristics
and when it is appropriate to use.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28064685

Pulled By: ezyang

fbshipit-source-id: dbf5e041088d7921db2111d287feb9079466f1b5
2021-04-29 08:44:27 -07:00
b8e1be1a13 Revert D28041140: [pytorch][PR] Adding vector_norm to the C++ API
Test Plan: revert-hammer

Differential Revision:
D28041140 (fda8561944)

Original commit changeset: 65ab32efbcf9

fbshipit-source-id: ce69c6c1f2076c24f96d1f678ace415b22b2332c
2021-04-29 08:20:10 -07:00
fda8561944 Adding vector_norm to the C++ API (#57055)
Summary:
## BC Breaking Note
This PR removes the redundant linalg_ prefix from torch::linalg::linalg_det and torch::linalg::linalg_norm C++ API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57055

Reviewed By: H-Huang

Differential Revision: D28041140

Pulled By: heitorschueroff

fbshipit-source-id: 65ab32efbcf92010439881bd8a292cdb5b39c579
2021-04-29 08:12:24 -07:00
82e50f4757 Update test_overrides for gradcheck (#57155)
Summary:
Run both fast and slow mode for test overrides and fix failure in slow_mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57155

Reviewed By: albanD

Differential Revision: D28076483

Pulled By: soulitzer

fbshipit-source-id: ef942d787d986ba881329e9515e5de6194f3782b
2021-04-29 07:43:18 -07:00
762b3aa7ba Revert D28078846: [pytorch][PR] Enable clang-tidy on master
Test Plan: revert-hammer

Differential Revision:
D28078846 (4049732811)

Original commit changeset: adffa292c9f5

fbshipit-source-id: 44cf37ba1aac57aa77abf045ae0deefa0048756f
2021-04-29 06:56:20 -07:00
17b961b8bc [PyTorch][Edge] Fix mypy error (#56999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56999

## Summary
Currently

## Test
![image](https://user-images.githubusercontent.com/16430979/116294682-19acaf80-a74d-11eb-9596-3a1d697ae835.png)
Note: there are still some other mypy failures for other functions in other repos.

Differential Revision: D28023671

Test Plan:
See the test image above
Also CI

Reviewed By: dhruvbird

Pulled By: cccclai

fbshipit-source-id: d59da32b8b5a12c3f13bc5f4e02794db01132be3
2021-04-29 06:50:06 -07:00
5c8ceefe46 Pytorch add agent api tests (#56985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56985

Pytorch add agent api tests

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28020485

fbshipit-source-id: e6acf095f26ce4b99cddfbf7641fb4fa885b0c86
2021-04-29 06:14:39 -07:00
3a923a555a [NNC] moved lowerings out of the TensorExprKernel and into independent functions (#56679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56679

moved lowerings out of the TensorExprKernel and into independent functions

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28082921

Pulled By: Chillee

fbshipit-source-id: af530510957ed4aa8b64dcc77ca36b69866d8000
2021-04-29 05:46:50 -07:00
ca814904b4 Handle error reporting when reply file already exists (#57217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57217

In the torch multiprocessing error handler, we try to remove the reply file if it already exists. Before removing, we try to log the contents of the file, on the assumption that the contents are valid JSON.
However, in some cases they aren't, and then we end up not clearing the file.
Let's handle this error and make sure that the file is cleaned up irrespective of its contents.
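
A sketch of the hardened flow, with hypothetical helper and path names:

```python
import json
import logging
import os

def log_and_remove_reply_file(path: str) -> None:
    try:
        with open(path) as f:
            logging.info("stale reply file contents: %s", json.load(f))
    except (json.JSONDecodeError, OSError):
        logging.warning("stale reply file %s did not contain valid JSON", path)
    finally:
        # Clean the file up irrespective of its contents.
        if os.path.exists(path):
            os.remove(path)
```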

Reviewed By: devashisht

Differential Revision: D28041470

fbshipit-source-id: da96d11b8f7091715cf0152cccd3ecc08b688eae
2021-04-29 04:57:35 -07:00
2aadeac0ff Remove duplicate entry for filter in language ref v2 (#57154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57154

Reviewed By: zou3519

Differential Revision: D28061690

Pulled By: gmagogsfm

fbshipit-source-id: b895238c0425cc6b60f5e19c67fc5bc6e0115d7f
2021-04-29 04:52:50 -07:00
e903e16d40 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28088724

fbshipit-source-id: 3a350580427b92719a3c300bec310aea78375996
2021-04-29 04:12:25 -07:00
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I've missed CUDA and distributed folders, fixing this now
This change is autogenerated by `python tool/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
565b034237 changed parametric type error in normalize to a warning (#57183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57183

Previously, if normalize was unable to match against a type, it would throw an error.

However, this exposes the user to arbitrary Torchscript schemas, which may or may not be problematic. Although we may support these in the future, for now we just return False (which will simply eliminate that schema from the candidates).

Test Plan: T89661626 and T89664016

Reviewed By: spaugh, khabinov

Differential Revision: D28072018

fbshipit-source-id: 83017d1e96d19912163edc74a5e43b2816783218
2021-04-28 22:33:44 -07:00
54eee04226 support discontiguous tensors only for contiguous output format (#57177)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57177

Reviewed By: zou3519

Differential Revision: D28072674

Pulled By: ngimel

fbshipit-source-id: 1f0b1d6916eb9739c35a5ac5aba33e70c1c43a34
2021-04-28 19:31:07 -07:00
d0ea3183c1 Remove debugging print in randperm (#57218)
Summary:
Sorry that I forgot to delete this. Thanks to xwang233 for finding it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57218

Reviewed By: mruberry

Differential Revision: D28081292

Pulled By: ngimel

fbshipit-source-id: a75867aa82d8644ef3a863d94f225c37babfe249
2021-04-28 19:16:43 -07:00
1ee54cc7b4 Add devices argument to RRef constructor (#57085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085

PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.

This commit covers the above gap by adding a `devices` argument
to RRef constructor. The RRef will then use this argument to
choose between `CUDAFutre` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28050001

Pulled By: mrshenli

fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
2021-04-28 19:11:10 -07:00
dd6b9665bf [profiler] Add sequenceNr and fwdThreadId to the trace (#57182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57182

Adding sequenceNr and fwdThreadId to the trace, to associate fwd ops with
backward ops

Test Plan: CI

Reviewed By: xuzhao9

Differential Revision: D28070725

fbshipit-source-id: aa4db580c9fd3ed061eaceb5239f4d9b2f8da3dc
2021-04-28 17:26:28 -07:00
2dc3dc2324 Enhance error message for Future.setErrorIfNeeded. (#56631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56631

`setErrorIfNeeded` did not mention whether the future was already
completed or there was some other exception. This particular change ensures
that we also print out the original exception as part of the error message.

This would help in debugging issues where this codepath is triggered.
ghstack-source-id: 127248844

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27919974

fbshipit-source-id: 2273a93f3475929b14f721c976f194f33a5aa746
2021-04-28 17:21:33 -07:00
6ff0002b12 Pytorch: enable many torchelastic tests (#56970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56970

The diff enables metrics, events, utils and timer tests on ci/cd pipeline

Test Plan: ci/cd

Reviewed By: cbalioglu

Differential Revision: D28015200

fbshipit-source-id: 6b419aaf9e62a10a747b6511bff90c82cfb7bcd6
2021-04-28 17:05:09 -07:00
4049732811 Enable clang-tidy on master (#57213)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57213

Reviewed By: seemethere

Differential Revision: D28078846

Pulled By: malfet

fbshipit-source-id: adffa292c9f5d75b5f4840f9129d0184763d96a6
2021-04-28 16:41:44 -07:00
73453f1de1 Swap CUDA-10.2 and CUDA-11.1 master-only status (#57207)
Summary:
CUDA-11.1 builds and tests will now run on PRs and master, while CUDA-10.2 will be master-only.

Also, delete remaining CUDA-10.1 build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57207

Reviewed By: ngimel

Differential Revision: D28077271

Pulled By: malfet

fbshipit-source-id: 633945bf85091575efa34280e04a6b9d68a53138
2021-04-28 16:23:05 -07:00
78736a72a5 Fix default dtype for randperm, triu/tril_indices inside TorchScript (#57105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56676
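
A small check of the behavior being fixed, assuming the post-fix semantics where TorchScript matches eager's int64 default:

```python
import torch

@torch.jit.script
def perm(n: int):
    return torch.randperm(n)

# Eager randperm defaults to int64; after the fix TorchScript agrees
# instead of falling back to the global default (float) dtype.
assert perm(4).dtype == torch.randperm(4).dtype == torch.int64
```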

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57105

Reviewed By: ezyang

Differential Revision: D28060969

Pulled By: gmagogsfm

fbshipit-source-id: 6b074418306377f5f906aafd121b614964972fc3
2021-04-28 16:18:33 -07:00
63d54874e7 [torch/deploy] smol cleanups to generate_packages
Summary: Remove some unnecessary args

Test Plan: sandcastle

Reviewed By: wconstab

Differential Revision: D28052626

fbshipit-source-id: f1b4d0555b4ab37dc9a245fbc1fa455f69a4db20
2021-04-28 15:54:46 -07:00
c69386ccee [torch/deploy] remove usage of fbcode_dir (#57102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57102

We don't actually need to peek into `--fbcode_dir` for this. There are two reasons we should avoid this:
1. The [`TARGETS` docs](https://fburl.com/wiki/zz1wh6uc) recommend against it, as it can break buck caching and dependency tracking. This doesn't seem to be a serious issue in our case (we declare our sources anyway) but worth respecting.
2. More seriously, if we want to use this script from outside fbcode (like `fbsource/third-party/pypi`), it will break since `fbcode_dir` gets set to something wild

The preferred method is apparently to use `$SRCDIR`, which represents a directory that all specified sources are copied to before executing the custom rule.
Found the suggestion here: https://fburl.com/w33wae2b. Seems less fragile, since it's publicly documented as well: https://buck.build/rule/genrule.html

Test Plan: sandcastle

Reviewed By: wconstab

Differential Revision: D28052570

fbshipit-source-id: cb4772b5dc07fbdc251249d6e0759e71730098af
2021-04-28 15:53:36 -07:00
3483049d58 Add xnnpack global average pool op (#55791)
Summary:
Adaptive average pool with output size (1, 1) is a global average pool.
For mobile, use XNNPACK to speed up that path.
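
The equivalence being exploited, as a quick sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 7, 7)
# Adaptive average pooling to (1, 1) is exactly a global average pool,
# the case routed to XNNPACK on mobile.
assert torch.allclose(F.adaptive_avg_pool2d(x, (1, 1)),
                      x.mean(dim=(2, 3), keepdim=True))
```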

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55791

Test Plan:
buck test //xplat/caffe2:pt_xnnpack_test

pytest test/test_xnnpack_integration.py::TestXNNPACKOps

Reviewed By: kimishpatel

Differential Revision: D27711082

Pulled By: axitkhurana

fbshipit-source-id: 8757042c4a31a60451d8ba5fb6bf8cfbaf0a8d10
2021-04-28 14:54:47 -07:00
aac2e68515 Add inplace hardswish xnnpack op (#56715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55801

Refactor to add inplace version of xnnpack hardswish op

Test Plan: buck test //xplat/caffe2:pt_xnnpack_test

Reviewed By: kimishpatel

Differential Revision: D27712305

fbshipit-source-id: ed1dba22b026251f891fe7b88fbaa9a42985ef2c
2021-04-28 14:54:45 -07:00
28fc59d13d Add xnnpack hardswish op (#56714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55800

For mobile, use the XNNPACK implementation of hardswish

Test Plan: buck test //xplat/caffe2:pt_xnnpack_test

Reviewed By: kimishpatel

Differential Revision: D27712306

fbshipit-source-id: c7f0b70482aeef2aaa1966e2c669f79ecd29caa7
2021-04-28 14:53:46 -07:00
0a30d64c83 Revert D27966444: [pytorch][PR] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout
Test Plan: revert-hammer

Differential Revision:
D27966444 (610c984d2e)

Original commit changeset: fe0df843c521

fbshipit-source-id: 8223b7f8b7183f0e7c9df6a7aa8f6b164e5634db
2021-04-28 14:51:10 -07:00
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
5a10ee71d6 [Reland] TCPStore add watchKey method and new listener thread (#56217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56217

Reland of https://github.com/pytorch/pytorch/pull/54264

Changes:
- Update socket send() to use flag MSG_NOSIGNAL to prevent SIGPIPE because error in return is already capturad
- Update watchKey to block until callback has been registered on master.
- Fix race condition in testWatchKeyCallback which caused flaky test failures.

Test:
Ran TCPStoreTest 100 times locally with no errors, running [ci-all tests](https://github.com/pytorch/pytorch/pull/56219)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27824802

Pulled By: H-Huang

fbshipit-source-id: c32230ce726d7d848b9896a63aa52b8eb04a0a2d
2021-04-28 13:46:02 -07:00
6ec01b1610 [DataLoader] Add mode to LoadFilesFromDisk (#57056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57056

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28059776

Pulled By: ejguan

fbshipit-source-id: 0be511f196bedf6eab3cd0bded35096c17a473bf
2021-04-28 13:13:30 -07:00
31e59c3869 torch.package change Folder to Directory and add doc strings (#56925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56925

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28002145

Pulled By: Lilyjjo

fbshipit-source-id: 6265970202d1530c4fb7ea10011b0e09094037d5
2021-04-28 13:03:12 -07:00
610c984d2e [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#56433)
Summary:
Cudnn rnn calls that use cudnn dropout maintain a "state" buffer across calls. [DropoutState](fe3f6f2da2/aten/src/ATen/native/cudnn/RNN.cpp (L1388-L1402))'s lock() and unlock() ensure the current call's use of the state buffer syncs with the end of the previous call's use of the state buffer (in case the previous call was on a different stream).

Telling a capturing stream to wait on an event recorded in a non-capturing stream is an error (1). Telling a non-capturing stream to wait on an event recorded during capture is also an error (2). So DropoutState's flow can error in either of two simple use cases:
```python
rnn = nn.LSTM(512, 512, 2, dropout=0.5).cuda()

out1 = rnn(in1)

# calling cudnn rnn with dropout in capture after calling it uncaptured triggers 1
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    graph.capture_begin()
    out2 = rnn(in2)
    graph.capture_end()
torch.cuda.current_stream().wait_stream(capture_stream)

# calling cudnn rnn with dropout uncaptured after calling it in capture triggers 2
out3 = rnn(in3)
```

This PR fixes both cases by telling `DropoutState::lock()`: "if the most recent end-of-usage event was in a different capture state (ie, we crossed a capturing<->noncapturing border) or in a different capture, don't sync on it." While considering the fix I had two assumptions in mind:
- only one capture using the RNN can be underway at a time in this process
- no noncapturing ops in this process are issuing RNN calls while the capture using the RNN is underway.

That second assumption seems brittle if, for example, someone wants to capture an internal region of the forward method of a model wrapped with DataParallel: multiple threads could be issuing RNN calls with some currently capturing and some not. We should talk about whether that use case seems realistic.

(Bigger-picture thoughts: I don't know if forcing calls to serialize on using the shared state buffer is the best design. And if we want to do it that way, we might as well run all cudnn rnns with dropout on a dedicated side stream synced with the surrounding stream (capturing or not), in which case I don't think this PR's event-handling diffs would be needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56433

Reviewed By: heitorschueroff

Differential Revision: D27966444

Pulled By: ezyang

fbshipit-source-id: fe0df843c521e0d48d7f2c81a17aff84c5497e20
2021-04-28 12:52:03 -07:00
efd451385c Add gzip format support for chrome tracing (#56554)
Summary:
add gzip format support when exporting chrome tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56554

Reviewed By: xuzhao9

Differential Revision: D28019111

Pulled By: ilia-cher

fbshipit-source-id: 7d522481912bc9e93b4b31b17f01b1b069c7d2b6
2021-04-28 12:40:33 -07:00
ce79bd255d Fix doc issues (#57153)
Summary:
Fixes inconsistencies in the TorchScript Language reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57153

Reviewed By: zou3519, gmagogsfm

Differential Revision: D28061449

Pulled By: nikithamalgifb

fbshipit-source-id: a055c7b1417391afe00ec0b35e1042acb049feed
2021-04-28 11:47:10 -07:00
911852ffe2 .github: Only add @generated on generated workflows (#57063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57063

Removes the generated tag from the original template so the diff shows
up correctly on internal Phab

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28040694

Pulled By: seemethere

fbshipit-source-id: c6ec0520fbc4ea169abefc7df2ff925ecc0474cc
2021-04-28 11:28:57 -07:00
18337fec7e Remove glaringlee from C++ frontend codeowners (#57130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57130

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28059800

Pulled By: ezyang

fbshipit-source-id: dc8a28761acaf19bc5620912c016c67bdd3a4e5b
2021-04-28 11:03:41 -07:00
4b8ccc6a0f .circleci: Add /opt/openssl to CI images (#57071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57071

Adds /opt/openssl v1.1.1 to cpu CI images to enable testing for Gloo
TCP_TLS

Similar to https://github.com/pytorch/builder/pull/712

Enables https://github.com/pytorch/pytorch/pull/56442

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28061203

Pulled By: seemethere

fbshipit-source-id: 222a824b30de96c1064da11ce8ce4dc6c851111e
2021-04-28 10:43:10 -07:00
ec0fa40f0f Release GIL before destructing RPCAgent subclasses. (#57029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57029

Partially addresses https://github.com/pytorch/pytorch/issues/56297

This fixes deadlocks when the threads that the RPCAgent is blocking
on try to take the GIL.  This also adds a general utility for
making shared_ptr run destructors without the GIL.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28030294

Pulled By: ezyang

fbshipit-source-id: 628c066eebbb70bda5b914645a109dce35d73c8d
2021-04-28 10:25:03 -07:00
fe09d54120 [c10d] Add debug level field in ProcessGroup (#56530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56530

For upcoming diffs, ProcessGroup will need to know about debug level
for e.g. logging collective operations.
ghstack-source-id: 127535775

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27849839

fbshipit-source-id: a9f016a27d30a242eced19929b3824ae68fe430f
2021-04-28 10:01:21 -07:00
6ee5e490d4 [BE][SyncBN] Avoid sync stats in eval mode (#56982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56982

SyncBatchNorm should behave like a regular BN layer in eval mode; this
change ensures that this is the case.

In particular, the bug was that when `track_running_stats=False`, `bn_training` would be set to True in eval mode, which triggered a collective sync in SyncBN.

However, in eval mode SyncBN should behave like a regular BN layer and not perform this sync.
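
A minimal sketch of the now-fixed scenario (assumes a CUDA build; the point is that no collective fires in eval mode):
```python
import torch
import torch.nn as nn

# track_running_stats=False previously flipped bn_training on even in eval
# mode, which triggered a collective sync inside SyncBatchNorm.
bn = nn.SyncBatchNorm(8, track_running_stats=False).cuda()
bn.eval()

x = torch.randn(4, 8, 16, 16, device="cuda")
out = bn(x)  # now behaves like a regular BN layer: local stats, no collective
```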

Closes https://github.com/pytorch/pytorch/issues/48988

Ensured with unittest that when used for inference on a single rank, stats sync is not triggered.
ghstack-source-id: 127544421

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27579297

fbshipit-source-id: 26406e2793f0be14f2daa46ae66f97a8494182ed
2021-04-28 09:53:30 -07:00
e362ee6f8a Make it illegal to directly construct _TensorBase (#56150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56150

See #56017 for full context; the short story is that by making
it illegal to directly construct _TensorBase, we need only
write a *single* tp_dealloc function which will work universally
for all _TensorBase subclasses, rather than having to write two
versions, one for _TensorBase itself, and others for Python subclasses
of _TensorBase.  This means simpler code.

The subtlety here is that we only install our custom `tp_new` for direct subclasses of TensorBase.  This is important, because overriding `tp_new` also overrides any user-defined constructor.  Fortunately class Tensor(_TensorBase) has no nontrivial constructors and doesn't mind, but other subclasses like Parameter definitely mind!
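
A sketch of the resulting behavior (the exact exception type and message are assumptions):
```python
import torch

# Direct construction of the C base type is now rejected:
try:
    torch._C._TensorBase()
except (TypeError, RuntimeError) as e:
    print(e)

# Subclasses with their own constructors, like Parameter, are unaffected:
p = torch.nn.Parameter(torch.randn(2, 2))
```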

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28028746

Pulled By: ezyang

fbshipit-source-id: 3c03a14666ad1ded1145fe676afb0a7623cdb9bb
2021-04-28 09:25:25 -07:00
4d72538f80 Give Tensor a trivial (for now) metaclass _TensorMeta (#56147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56147

This is in support of #55686; you can see the broader context of the metaclass in
a more complete PR #56017.  The short story is that in the future I want to
give Tensor a non-trivial metaclass, so to derisk the change first I give it a
trivial metaclass to shake out any bugs that might be caused by it.  The
metaclass shouldn't have any performance impact on Tensor as it only gets
invoked upon subclass creation.

By the way, it was totally not documented how to create metaclasses in the Python
C API, and it took a good bit of trial and error to figure it out (and the answer is
now immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting tp_basicsize
incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on the metaclass--you want
to leave it unset so that it inherits, and determining that tp_init is what
actually gets called when you construct a class, not tp_call as another
not-to-be-named StackOverflow question suggests).

Aside: Ordinarily, adding a metaclass to a class is a user-visible change, as
it means that it is no longer valid to mix in another class with a different
metaclass. However, because _C._TensorBase is a C extension object, it will
typically conflict with most other metaclasses, so this is not BC breaking.
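
To illustrate that aside with plain Python (not PyTorch-specific):
```python
# Two classes whose metaclasses are unrelated cannot be mixed in together.
class MetaA(type):
    pass

class MetaB(type):
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=MetaB):
    pass

try:
    class C(A, B):  # raises: metaclass conflict
        pass
except TypeError as e:
    print(e)
```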

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28028747

Pulled By: ezyang

fbshipit-source-id: c1e35a986aeb3db540c73d188f53dce951eeed33
2021-04-28 09:24:21 -07:00
5d7e48c9fc Disable one test in rocm (#56951)
Summary:
The test seems to be failing in ROCm 4.1 on the CI node. Disabling it for now. The test will be re-enabled for ROCm when CI transitions to 4.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56951

Reviewed By: zou3519

Differential Revision: D28059808

Pulled By: ezyang

fbshipit-source-id: a9b064b7525ae6dce89c51fe29ff07f37b7ac796
2021-04-28 08:58:51 -07:00
ef2bb784da Replace raw cudaMalloc calls with CUDACachingAllocator (#57083)
Summary:
Replace raw cudaMalloc calls with CUDACachingAllocator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57083

Reviewed By: zou3519

Differential Revision: D28058989

Pulled By: ezyang

fbshipit-source-id: 84e2d0937e3ad5e3db9ae5a5e584d8c90954e213
2021-04-28 08:52:46 -07:00
46321cb937 [static runtime] binding for aten::norm_out (#56636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56636

Test Plan:
Test that it runs on the aug_1x model, which has aten::norm, and verify that the JIT/SR results agree
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.53159 ms.    35.8619%. fb::sigrid_transforms_torch_bind (1 nodes)
         0.9481 ms.    22.1996%. aten::linear (6 nodes)
       0.704806 ms.    16.5029%. aten::argmin (1 nodes)
       0.252252 ms.    5.90643%. aten::matmul (1 nodes)
       0.140869 ms.    3.29842%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.100014 ms.    2.34181%. fb::clip_ranges_gather (263 nodes)
      0.0880838 ms.    2.06247%. aten::sub (1 nodes)
      0.0553556 ms.    1.29614%. aten::repeat (1 nodes)
      0.0438464 ms.    1.02665%. aten::norm (1 nodes)
      0.0395956 ms.   0.927124%. fb::batch_box_cox (1 nodes)
       0.035834 ms.   0.839045%. aten::__getitem__ (506 nodes)
      0.0345233 ms.   0.808357%. prim::TupleUnpack (254 nodes)
      0.0316876 ms.   0.741959%. aten::sigmoid (2 nodes)
      0.0293246 ms.   0.686629%. aten::mul (3 nodes)
      0.0287696 ms.   0.673635%. fb::offsets_to_ranges (253 nodes)
      0.0242373 ms.   0.567511%. aten::pow (1 nodes)
      0.0224204 ms.    0.52497%. fb::simple_embedding_bag_sum (3 nodes)
      0.0200074 ms.   0.468469%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0190264 ms.   0.445499%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0167253 ms.    0.39162%. prim::TupleConstruct (1 nodes)
      0.0164962 ms.   0.386255%. aten::sum (3 nodes)
      0.0158986 ms.   0.372262%. prim::DictConstruct (2 nodes)
      0.0109372 ms.   0.256093%. aten::div (1 nodes)
     0.00910563 ms.   0.213207%. prim::ListConstruct (4 nodes)
     0.00876917 ms.   0.205328%. static_runtime::to_copy (8 nodes)
     0.00822567 ms.   0.192603%. fb::sigrid_hash_precompute (1 nodes)
     0.00622559 ms.   0.145771%. aten::contiguous (1 nodes)
     0.00460064 ms.   0.107723%. aten::narrow (4 nodes)
     0.00297164 ms.  0.0695804%. static_runtime::reshape_copy (2 nodes)
     0.00287099 ms.  0.0672237%. aten::logit (1 nodes)
     0.00277557 ms.  0.0649894%. aten::add (1 nodes)
     0.00264978 ms.  0.0620441%. aten::clamp_min (1 nodes)
     0.00215832 ms.  0.0505366%. aten::relu (1 nodes)
     0.00213779 ms.   0.050056%. fb::gather_ranges (4 nodes)
     0.00195846 ms.  0.0458571%. aten::full (1 nodes)
     0.00177333 ms.  0.0415222%. aten::stack (1 nodes)
     0.00147449 ms.   0.034525%. aten::size (3 nodes)
    0.000762524 ms.  0.0178544%. aten::expand_as (1 nodes)
    0.000757406 ms.  0.0177345%. fb::clip_ranges (2 nodes)
    0.000614798 ms.  0.0143954%. fb::lengths_to_offsets (3 nodes)
    0.000407952 ms. 0.00955212%. static_runtime::flatten_copy (1 nodes)
    0.000159918 ms. 0.00374445%. prim::device (1 nodes)
         4.2708 ms. in Total
StaticRuntime setup time: 0.000407 ms
Memory allocation time: 0.0089714 ms
Memory deallocation time: 0.0592135 ms
Outputs deallocation time: 0.0458097 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 28
```

Reviewed By: hlu1

Differential Revision: D27922070

fbshipit-source-id: 538b39b7fff0638fc994b7983bf32d9e9f15d016
2021-04-28 08:44:10 -07:00
4638bd0f0f Fix ProcessGroupMPITest.cpp Gather, Scatter and SendRecv. Enable ProcessGroupMPITest (#56709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56709

Right now, ProcessGroupMPITest testGather() fails with

 ```
what():  Gather: number of output tensors should be 0 for non-root
[devgpu025:429730] *** Process received signal ***

```

There is a similar issue with testScatter(), where the number of input/output tensors on the source/destination respectively should be 0.

In addition, testSendRecv(true) fails with

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  src rank is wrong for recvAnysource

```

since we never populate `srcRanks`.
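
The same contract exists in the Python collective API, where only the root rank supplies output tensors (sketch; assumes an initialized process group):
```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called
t = torch.ones(2)
if dist.get_rank() == 0:
    gather_list = [torch.zeros(2) for _ in range(dist.get_world_size())]
    dist.gather(t, gather_list=gather_list, dst=0)
else:
    dist.gather(t, dst=0)  # non-root ranks pass no output tensors
```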

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28001963

Pulled By: agolynski

fbshipit-source-id: c381dfc6f417ee78fbbaf884e567b0485076dfc8
2021-04-28 08:39:08 -07:00
89377e3e45 model_dump tool for model inspection (#56868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56868

See __init__.py for a summary of the tool.
The following sections are present in this initial version:
- Model Size.  Show the total model size, as well as a breakdown by
  stored files, compressed files, and zip overhead.  (I expect this
  breakdown to be a bit more useful once data.pkl is compressed.)
- Model Structure.  This is basically the output of
  `show_pickle(data.pkl)`, but as a hierarchical structure.
  Some structures cause this view to crash right now, but it can be
  improved incrementally.
- Zip Contents.  This is basically the output of `zipinfo -l`.
- Code.  This is the TorchScript code.  It's integrated with a blame
  window at the bottom, so you can click "Blame Code", then click a bit
  of code to see where it came from (based on the debug_pkl).  This
  currently doesn't render properly if debug_pkl is missing or
  incomplete.
- Extra files (JSON).  JSON dumps of each json file under /extra/, up to
  a size limit.
- Extra Pickles.  For each .pkl file in the model, we safely unpickle it
  with `show_pickle`, then render it with `pprint` and include it here
  if the size is not too large.  We aren't able to install the pprint
  hack that the show_pickle CLI uses, so we get one-line rendering for
  custom objects, which is not very useful.  Built-in types look fine,
  though.  In particular, bytecode.pkl seems to look fine (and we
  hard-code that file to ignore the size limit).

I'm checking in the JS dependencies to avoid a network dependency at
runtime.  They were retrieved from the following URLs, then passed
through a JS minifier:
  https://unpkg.com/htm@3.0.4/dist/htm.module.js?module
  https://unpkg.com/preact@10.5.13/dist/preact.module.js?module

Test Plan:
Manually ran on a few models I had lying around.
Mostly tested in Chrome, but I also poked around in Firefox.

Reviewed By: dhruvbird

Differential Revision: D28020849

Pulled By: dreiss

fbshipit-source-id: 421c30ed7ca55244e9fda1a03b8aab830466536d
2021-04-28 07:33:10 -07:00
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This makes it inflexible to add or delete fields in the future, and it also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can easily access fields without depending on a fixed layout, and it is easier to add or remove fields.

Since C++ does not support maps whose values have different types, ddpLoggingData now contains two maps, one per value type.
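
As a conceptual sketch, the resulting shape is something like the following (key names are illustrative, not the actual fields):
```python
# One map per value type, since C++ maps are homogeneous.
ddp_logging_data = {
    "strs_map": {"backend_name": "nccl", "module_name": "MyModel"},
    "ints_map": {"world_size": 8, "bucket_cap_mb": 25},
}

# Consumers look fields up by name instead of depending on a struct layout:
print(ddp_logging_data["ints_map"]["world_size"])
```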
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
3115728cba [profiler] Support for trace metadata (#56575)
Summary:
Adding support for user defined trace metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56575

Test Plan: python test/test_profiler.py TestProfiler.test_profiler_metadata

Reviewed By: gdankel

Differential Revision: D27957876

Pulled By: ilia-cher

fbshipit-source-id: 8b6c254cca97eca23fc418e37e5772b207b0525a
2021-04-28 05:12:34 -07:00
5536cda19a Update floor_divide behavior in line with NumPy 1.20 (#56893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56893

Fixes gh-56814

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28025814

Pulled By: mruberry

fbshipit-source-id: 8654978ea1d5aa7c12bcf5a8c939966287a2d34e
2021-04-28 05:01:23 -07:00
77721ee318 [profiler] Add cuda synchronization point (ci-all) (#57036)
Summary:
Adding cuda synchronization when exiting the profiler
context manager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57036

Test Plan: CI

Reviewed By: xuzhao9

Differential Revision: D28040552

Pulled By: ilia-cher

fbshipit-source-id: 944c46a58f4c2b6d1a1c64c8d4012d662d0262d0
2021-04-28 01:17:28 -07:00
8134806e23 [iOS GPU][Kernel] Implement channel split in Metal shaders (#56074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56074

To run shufflenet we need to support at::chunk on GPU. The current implementation only splits the tensor into two on the channel dimension. We'll come back and fully implement it in Metal shaders.
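
For reference, the op semantics being implemented, shown with the regular PyTorch path (the Metal kernel currently covers only this two-way split):
```python
import torch

x = torch.randn(1, 4, 2, 2)        # NCHW
a, b = torch.chunk(x, 2, dim=1)    # split on the channel dimension
print(a.shape, b.shape)            # torch.Size([1, 2, 2, 2]) twice
```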
ghstack-source-id: 127522377

Test Plan:
```
2021-03-26 01:37:07.693411-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 2, 2, 2]
2021-03-26 01:37:07.693499-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 2, 2, 2]
2021-03-26 01:37:07.693544-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 4, 2, 2]
2021-03-26 01:37:07.695415-0700 PyTorchPlayground[2279:235793] [bool test_chunk()],[1 4 2 2 ],[SUCCEED]
2021-03-26 01:37:07.695862-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 4, 2, 2]
2021-03-26 01:37:07.695927-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 5, 2, 2]
2021-03-26 01:37:07.695971-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 9, 2, 2]
2021-03-26 01:37:07.698215-0700 PyTorchPlayground[2279:235793] [bool test_chunk2()],[1 9 2 2 ],[SUCCEED]
2021-03-26 01:37:07.699086-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 8, 2, 2]
2021-03-26 01:37:07.699154-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 16, 2, 2]
2021-03-26 01:37:07.699197-0700 PyTorchPlayground[2279:235793] [MPSImageWrapper] Found a temporary image: [1, 8, 2, 2]
2021-03-26 01:37:07.700842-0700 PyTorchPlayground[2279:235793] [bool test_chunk3()],[1 16 2 2 ],[SUCCEED]
```
- Sandcastle
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27357096

fbshipit-source-id: fd3908ad2c26466e4f714d531790be2f1ae24153
2021-04-28 00:51:58 -07:00
0df574017d Torchelastic: add support for the new error file format (#57084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57084

The diff adds support for new error message file format:

    {
        "message":"test",
        "timestamp": 12
    }

Test Plan:
fbcode buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

example job: tsm_aivanou-torchelastic_distributed_sum_77c0b147

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D28042764

fbshipit-source-id: 4d21c2319654f3460d551d91cbf48568356cf4e8
2021-04-28 00:04:45 -07:00
882e273663 [caffe2] fix bug when weight_decay is used with fused rowwise + SLWS grad (#57090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57090

We did loop-invariant code motion to avoid multiplying with in_weight_temp for each element, but this breaks down when weight decay is not zero.

Test Plan:
In devgpu
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- test_fuse_sparse_adagrad_with_sparse_lengths_weighted_sum_gradient --run-disabled

Reviewed By: jianyuh

Differential Revision: D28051026

fbshipit-source-id: f8906b72a41a87c2d43c447197b5fd695373ae23
2021-04-27 23:59:30 -07:00
51e6ebb5b7 Add missing vec256<>::isnan() for VSX float and double vectors (#56658)
Summary:
Obviously, have no way of testing it
Fixes https://github.com/pytorch/pytorch/issues/56650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56658

Reviewed By: walterddr

Differential Revision: D27929750

Pulled By: malfet

fbshipit-source-id: a4e3fe75cfeeb35f47590c940ef17b2ba4172cd5
2021-04-27 20:59:40 -07:00
c91ea7d488 [PyTorch][Edge] Add binarires for unittests (#57039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57039

## Summary
Add two models (v4 and v5) for testing runtime. (v5 will be introduced in https://github.com/pytorch/pytorch/pull/56002)

## Test plan
CI

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D28047615

Pulled By: cccclai

fbshipit-source-id: 47f7df3094dadb7e013ed57bc713cc8b3d1c8ce0
2021-04-27 20:46:34 -07:00
786b0a8091 [FX] fix normalization issues with lists of tensors (#57004)
Summary:
Fixes an issue with lists of tensors not being normalized correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57004

Reviewed By: jamesr66a

Differential Revision: D28034559

Pulled By: Chillee

fbshipit-source-id: f935f0b73a8356acd8a2ae93fcfc0417f0eab224
2021-04-27 20:02:00 -07:00
1c0617bb54 Fix clang-tidy for native CPU ops (#57037)
Summary:
Attempting to call clang-tidy on any source file in
`aten/src/ATen/native/cpu` would fail with a series of
```
/Users/nshulga/git/pytorch-worktree/aten/src/ATen/native/cpu/Activation.cpp:637:1: warning: variable 'REGISTER_DISPATCH' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
/Users/nshulga/git/pytorch-worktree/aten/src/ATen/native/cpu/Activation.cpp:638:1: error: C++ requires a type specifier for all declarations [clang-diagnostic-error]
REGISTER_DISPATCH(log_sigmoid_backward_cpu_stub, &log_sigmoid_backward_cpu_kernel);
```
because those macros are only defined for CPU-arch-specific compilation of the above-mentioned files.
Fix this by introducing a `map_filename` function that maps the source
file to its copy in the `build` folder, runs clang-tidy over the copy, and
then maps the results back.

Found while working on https://github.com/pytorch/pytorch/pull/56892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57037

Reviewed By: walterddr

Differential Revision: D28033760

Pulled By: malfet

fbshipit-source-id: b67cd007000574ecc165ab4b285c0c102cbceadd
2021-04-27 18:56:47 -07:00
808850b6de [ARM] Do not use depthwise3x3 conv in grad mode (#56889)
Summary:
cpu_depthwise3x3_winograd is not gradient-aware and therefore should not be used if a gradient is expected on the input.

Fixes https://github.com/pytorch/pytorch/issues/56145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56889

Reviewed By: ngimel

Differential Revision: D27990448

Pulled By: malfet

fbshipit-source-id: 9c649f14b8f514eb1dfb7f0eb8e3357c09ddb299
2021-04-27 18:45:29 -07:00
6e826cac67 To fix inconsistency of digamma with SciPy (#56689)
Summary:
Fixes {https://github.com/pytorch/pytorch/issues/49015}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56689

Reviewed By: mruberry

Differential Revision: D28014563

Pulled By: iramazanli

fbshipit-source-id: 4d311e6a32737e44ebfabfc1a4b9414b0de7b46e
2021-04-27 18:36:11 -07:00
0319b64ea0 [aten][simple] Optimize atrepeat (#56994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56994

- Use `DimVector` in place of `std::vector<int64_t>` to remove heap allocations for tensors with ndim <= 5
- Use `sizes()[i]` in place of `size(i)` where we know i is positive

Test Plan: CI

Reviewed By: edvgha, swolchok

Differential Revision: D28022355

fbshipit-source-id: ef20ac73c0a330192ebc41ab9c92374ed8e2484a
2021-04-27 18:17:29 -07:00
e8c268746b Remove sync for randperm on small tensors. (#54113)
Summary:
For small tensors, it is known that the GPU operates slower than the CPU. However, offloading to the CPU causes a host <--> device sync. As a result, although offloading to the CPU looks better in microbenchmarks, it often hurts rather than benefits end-to-end performance, and it could be a blocker for CUDA graphs. After discussion with mcarilli and ptrblck, we think it might be good to just remove this piece of code and let it be slow.

Microbenchmarks:

```python
def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

torch.cuda.synchronize()
%timeit run50_sync(lambda: torch.randperm(3, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(30, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(300, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(3000, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(30000, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(300000, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(3000000, device='cuda'))
%timeit run50_sync(lambda: torch.randperm(30000000, device='cuda'))
```

Before this PR:
```
5.79 ms ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.78 ms ± 92.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.17 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.65 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.6 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
21 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 880 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
944 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After this PR:
```
7.22 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.28 ms ± 9.03 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.25 ms ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.19 ms ± 5.83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.76 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.3 ms ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.3 ms ± 42.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
716 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54113

Reviewed By: ezyang

Differential Revision: D28017958

Pulled By: ngimel

fbshipit-source-id: 660992d43ca449e61ce0cb0aa1dae554c9560a4e
2021-04-27 16:47:41 -07:00
9fe2673d1c ns for fx: additional bugfix for user defined functions (#57028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57028

Adds a test case for wrapped sigmoid, and fixes the following issues
to make it pass in NS:
* allows comparing x.sigmoid() with torch.sigmoid(x) when they are related
* allows dtype cast from FP32_OR_INT8 to FP32, via dequantize (this will be improved later)

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Reviewed By: jerryzh168

Differential Revision: D28030089

Pulled By: vkuzo

fbshipit-source-id: b237353e2d564a4476f409df461746a259015a4b
2021-04-27 16:29:03 -07:00
da2cef6a40 ns for fx: allow comparing int8 to int8 for functionals (#57027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57027

Fixes a bug to allow shadowing of linear and conv functionals.
The fix is to only detach tensors, not all objects.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_int8_shadows_int8_fun
```

Reviewed By: jerryzh168

Differential Revision: D28030090

Pulled By: vkuzo

fbshipit-source-id: 0a38c4b232e007d7822eee818b0af99d98335d22
2021-04-27 16:29:01 -07:00
a359cfac22 ns for fx: add option to skip matching classes and functions (#57026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57026

Adds a config option to skip matching classes by class type
and functions by function type.

This is useful when users make custom modules which return
types other than tensors. With the current implementation of
Logger, these are not scriptable.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_module_scriptable
```

Reviewed By: jerryzh168

Differential Revision: D28030093

Pulled By: vkuzo

fbshipit-source-id: 71dc54dd935d2071c4b017260ea2a1e5c2298bfe
2021-04-27 16:29:00 -07:00
e8a5490c0a ns for fx: support binary ops when adding unshadowed loggers for inputs (#57025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57025

Adds the ability to log unshadowed inputs of binary ops such as `add`
and `mul`, when indices 0, 1, or 0 and 1 are tensors.

Note: making shadowing support this is saved for a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_mul_inputs_activations
```

Reviewed By: jerryzh168

Differential Revision: D28030098

Pulled By: vkuzo

fbshipit-source-id: fd46760faac153975cd7688e70c44991ec1d5dff
2021-04-27 16:28:58 -07:00
ddedeab66d ns for fx: bug fix for shadowing fp16 emulation patterns (#57024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57024

Enables shadow copies of fp16 emulation patterns where weights
are cast to fp16 before being passed to linear.  This previously
did not work because copying of `call_method` nodes was not implemented.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16_vs_linear_fp16_shadow_activations
```

Reviewed By: jerryzh168

Differential Revision: D28030096

Pulled By: vkuzo

fbshipit-source-id: 13a39ea6c106180df6d750246672286b58b4d04c
2021-04-27 16:28:56 -07:00
2acc19eca1 ns for fx: add fp16 function shadowing (#57023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57023

Adds functionality for shadowing user functions with fp16 I/O dtype.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Reviewed By: jerryzh168

Differential Revision: D28030092

Pulled By: vkuzo

fbshipit-source-id: 642792398a76bd62593fa439ab14901e9dbdf4f8
2021-04-27 16:28:54 -07:00
782a0a1469 ns for fx: allow user functions in shadowing (#57022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57022

Allows usage of user functions in NS shadow APIs. We expose the
i/o mapping to the user APIs, and thread them throughout the code.

Note: the format of the mapping is currently not the best. Improving
it is saved for a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Reviewed By: jerryzh168

Differential Revision: D28030095

Pulled By: vkuzo

fbshipit-source-id: 2863312362223ad276437e2aeeec4a3f71b691c7
2021-04-27 16:28:53 -07:00
c4bec76bec ns for fx: move node I/O dtype mapping to be local instead of global (#57021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57021

To support shadows of custom functions, we need to allow the user to
specify the I/O types of the custom functions.

This PR is a cleanup in preparation for making the above happen.
We make the I/O dtype mappings be generated by a function instead
of a global variable. In the next PR, we will add a hook so the user
can modify these mappings.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Reviewed By: jerryzh168

Differential Revision: D28030094

Pulled By: vkuzo

fbshipit-source-id: 3cbb617f034ef385c2875c4ec7fed13ca30bfc57
2021-04-27 16:27:40 -07:00
c307379170 Output tensor specified via out= must be on the same device as inputs for dot & vdot (#56334)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55561.

1. Added checks to ensure that the output tensor specified via `out=` is on the same device as the inputs for `dot` & `vdot` (see the sketch after this list).
2. Unskipped `test_out` for `dot` & `vdot`.
3. Also changed the `tensordot` implementation to check if both input tensors are on the same device as the output tensor.
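
A sketch of the pattern that check 1 now rejects (assumes a CUDA build):
```python
import torch

a = torch.randn(3, device="cuda")
b = torch.randn(3, device="cuda")
out = torch.empty((), device="cpu")  # wrong device on purpose

try:
    torch.dot(a, b, out=out)
except RuntimeError as e:
    print(e)  # out= must now be on the same device as the inputs
```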

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56334

Reviewed By: H-Huang

Differential Revision: D27993778

Pulled By: mruberry

fbshipit-source-id: 36dee41ceef123c29d0cc52d6b09c3c440e8e60e
2021-04-27 16:14:39 -07:00
7bcce2acb9 Revert D27765618: Initial support for sparse complex tensors constructors for CPU/CUDA
Test Plan: revert-hammer

Differential Revision:
D27765618 (daef60c3b7)

Original commit changeset: a9cdd31d5c7a

fbshipit-source-id: f700d5db7ff8930b9158460b5a77f68a35e212a4
2021-04-27 15:48:51 -07:00
fa57191b16 fix #56822 (#56967)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56822

There was an off-by-one error in CPU randperm when checking the limits of the requested range. It also shows up in the "CUDA" version, as it falls back to CPU for small input sizes.

CC zasdfgbnm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56967

Reviewed By: mruberry

Differential Revision: D28031819

Pulled By: ngimel

fbshipit-source-id: 4d25995628997f164aafe94e7eae6c54f018e4e5
2021-04-27 15:32:01 -07:00
0d41122e61 Eliminate global usage of torch.set_default_dtype in sparse test (#56393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56393

Fixes for gh-56369

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27913266

Pulled By: mruberry

fbshipit-source-id: 2c590d3a2188aae251184f08c1a6a2c4c570d150
2021-04-27 15:23:14 -07:00
18c89a904b Modernize test-suite in sparse tensor CSR (#56392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56392

Fixes for gh-56371 and gh-56369

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27913212

Pulled By: mruberry

fbshipit-source-id: 2c78fe9fa4b6c6b566d9eb01f71e6016d672a545
2021-04-27 15:22:17 -07:00
09feb5f579 Delete grandfathered Caffe2 dispatch keys. (#56939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56939

These never have kernels registered to them and are effectively useless.
What I am not so sure about is whether we allocate tensors to them or not;
if we do, I cannot use asserts and need to ensure we just return undefined
or something equivalent.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D28006160

Pulled By: ezyang

fbshipit-source-id: f8e2b61b8bd928fb2c0ac0b534bd4af076423f71
2021-04-27 14:58:35 -07:00
60a5ebfac2 [Pytorch Edge] Remove methods_to_optimize arg (#57045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57045

Went back and adjusted the previous optimizations to just be applied to every function.
Cleaned up the API to match.

ghstack-source-id: 127214412
ghstack-source-id: 127536155

Test Plan: unit test

Reviewed By: kimishpatel

Differential Revision: D27950859

fbshipit-source-id: 214e83d5a19b452747fe223615815c10fa4aee58
2021-04-27 14:54:13 -07:00
7b160e29a4 [DDP] remove backend constraints on uneven input tests (#56754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56754

These tests are backend-agnostic and shouldn't require a specific
backend to run properly, hence enabling them regardless of the backends that
are available.
ghstack-source-id: 127463147

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27954174

fbshipit-source-id: 24759486b0c0647a5c88da4721a9a78d78c0b1f6
2021-04-27 14:50:38 -07:00
522dca4ab0 Port topk from THC to ATen, migrate most of sort as well (#55392)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24648

The large-tensor codepath is ported, but there is a legacy codepath that depends on an in-place sort in THC that is not callable from `at::`. At first glance, THC `topk` seems to be the only function that uses this `sortKeyValueInplace`.
Is the correct change to wrap `sortKeyValueInplace` in legacy functions for visibility in the `at::` namespace?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55392

Reviewed By: ezyang

Differential Revision: D28014257

Pulled By: ngimel

fbshipit-source-id: e297423c763f0691151cb62a4f5eff4cb31fb2b3
2021-04-27 14:49:41 -07:00
ecaa208fd6 Fix: sparse_csr_tensor segfaults when crow_indices or col_indices are non-tensors (#56723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56723

WIP gh-56687

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27999919

Pulled By: ezyang

fbshipit-source-id: 7eb23c8f45f3c459efe65793caecaa6b67a187c9
2021-04-27 14:47:12 -07:00
4a899bb3c4 Fix: Incorrect example output in sparse_csr_tensor doc-string (#56722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56722

Fix: Incorrect example output in sparse_csr_tensor doc-string
closes gh-56685

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27999920

Pulled By: ezyang

fbshipit-source-id: 0b344f7ddab4be8aadde540ce010b75df4433f4b
2021-04-27 14:46:03 -07:00
daef60c3b7 Initial support for sparse complex tensors constructors for CPU/CUDA (#54153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54153

Currently, sparse tensors only support real floating-point dtypes. Complex support is added in this PR for CPU/CUDA.

- [x] add complex support (torch.cfloat and torch.cdouble) to torch.sparse_coo_tensor constructors
- [x] add complex support to coalesce function
- [x] add complex support to to_dense function
- [x] add complex support to to_sparse function
- [x] add complex support to sparse_add function
- [x] add unit tests

Note: This PR contains only complex support for the torch.sparse_coo_tensor forward function and the related ops used with this function (coalesce, to_dense, to_sparse, and sparse_add). The following PRs in the ghstack should cover other sparse operations to provide more complete complex sparse support, specifically related to the use of dedicated APIs for accelerated linear algebra.

Note: Before using ghstack the original PR  was  #50984

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27765618

Pulled By: ezyang

fbshipit-source-id: a9cdd31d5c7a7dafd790f6cc148f3df26e884c89
2021-04-27 14:39:13 -07:00
d16ed1ee8a Add first draft of gradcheck note (#55966)
Summary:
You can find the latest rendered version in the `python_doc_build` CI job, in the artifacts tab of that build on CircleCI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55966

Reviewed By: H-Huang

Differential Revision: D28032446

Pulled By: albanD

fbshipit-source-id: 227ad37b03d39894d736c19cae3195b4d56fc62f
2021-04-27 14:33:42 -07:00
dd84224edc .github: Switch alpine to ECR image instead (#57060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57060

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28040144

Pulled By: seemethere

fbshipit-source-id: f7590256c9f067add5d5e7b61a2c44beb2482d71
2021-04-27 14:18:13 -07:00
26ed4b4756 OpInfo : index_fill (port remaining method_tests) (#57009)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53237

Before PR (around 90s; most time-consuming tests shown in the details below)

<details>

```
pytest test/test_ops.py -k _index_fill --durations=20
========================================================================= test session starts ==========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
plugins: hypothesis-5.38.1
collected 19327 items / 19225 deselected / 102 selected

test/test_ops.py s..................ssssssssssssssssssss..................ss....ssssssssssssssss....sssss....ssssss....                                          [100%]

=========================================================================== warnings summary ===========================================================================
========================================================================= slowest 20 durations =========================================================================
44.14s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_index_fill_cuda_complex128
13.08s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_index_fill_cpu_complex128
7.36s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_index_fill_cuda_complex128
4.20s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_index_fill_cuda_float32
3.42s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_index_fill_cpu_float32
2.93s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_index_fill_cuda_complex64
2.32s call     test/test_ops.py::TestGradientsCPU::test_fn_grad_index_fill_cpu_complex128
2.18s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_index_fill_cpu_complex64
1.03s call     test/test_ops.py::TestOpInfoCUDA::test_duplicate_method_tests_index_fill_cuda_float32
0.84s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_index_fill_cuda_float64
0.64s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_index_fill_cuda_float64
0.41s call     test/test_ops.py::TestOpInfoCUDA::test_supported_backward_index_fill_cuda_complex128
0.41s call     test/test_ops.py::TestOpInfoCUDA::test_supported_backward_index_fill_cuda_bfloat16
0.39s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_index_fill_cuda_complex64
0.38s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_index_fill_cuda_float32
0.36s call     test/test_ops.py::TestOpInfoCUDA::test_supported_backward_index_fill_cuda_complex64
0.36s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_float16
0.35s call     test/test_ops.py::TestOpInfoCUDA::test_supported_backward_index_fill_cuda_float16
0.35s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_int16
0.35s call     test/test_ops.py::TestOpInfoCUDA::test_supported_backward_index_fill_cuda_float32
======================================================================= short test summary info ========================================================================
=============================================== 52 passed, 50 skipped, 19225 deselected, 8 warnings in 97.31s (0:01:37) ================================================
```
</details>

After PR (around 90s; most time-consuming tests shown in the details below)

<details>

```
pytest test/test_ops.py -k _index_fill --durations=20
========================================================================= test session starts ==========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
plugins: hypothesis-5.38.1
collected 19327 items / 19225 deselected / 102 selected

test/test_ops.py s..................ssssssssssssssssssss..................ss....ssssssssssssssss....sssss....ssssss....                                          [100%]

=========================================================================== warnings summary ===========================================================================
========================================================================= slowest 20 durations =========================================================================
40.88s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_index_fill_cuda_complex128
13.12s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_index_fill_cpu_complex128
7.03s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_index_fill_cuda_complex128
3.48s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_index_fill_cuda_complex64
3.01s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_index_fill_cuda_float32
2.55s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_index_fill_cpu_complex64
2.43s call     test/test_ops.py::TestGradientsCPU::test_fn_grad_index_fill_cpu_complex128
2.38s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_index_fill_cpu_float32
1.10s call     test/test_ops.py::TestOpInfoCUDA::test_duplicate_method_tests_index_fill_cuda_float32
0.76s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_index_fill_cuda_float64
0.67s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_index_fill_cuda_float64
0.50s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_bfloat16
0.50s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_uint8
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_float64
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_float16
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_complex128
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_bool
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_float32
0.49s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_int32
0.48s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_index_fill_cuda_complex64
======================================================================= short test summary info ========================================================================

=============================================== 52 passed, 50 skipped, 19225 deselected, 8 warnings in 93.31s (0:01:33) ================================================
```

</details>

TODO:
* [x] Add test timings (Before and After)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57009

Reviewed By: H-Huang

Differential Revision: D28027095

Pulled By: mruberry

fbshipit-source-id: 6509ff726c8d954171cc0921b803ba261091a0e9
2021-04-27 13:50:23 -07:00
092eeedcb7 [profier] Fix double printing of FLOPs (#56974)
Summary:
Calling table() shouldn't modify the events.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56974

Test Plan:
```
import torch
from torch import nn
from torch.profiler import profile, record_function

model = nn.Conv2d(8, 64, 3, padding=1)
input = torch.randn(1, 8, 272, 272)

with profile(record_shapes=True, with_flops=True) as prof:
    with record_function("model_inference"):
        model(input)

events = prof.key_averages(group_by_input_shape=True)
print(events.table())
print(events.table())
```

```
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes      GFLOPS/s
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                 aten::zeros         0.78%      68.000us         1.16%     101.000us     101.000us             1                           [[], [], [], [], []]            --
                 aten::empty         0.49%      43.000us         0.49%      43.000us      14.333us             3                       [[], [], [], [], [], []]            --
                 aten::zero_         0.23%      20.000us         0.23%      20.000us      20.000us             1                                          [[1]]            --
             model_inference        13.67%       1.195ms        98.84%       8.639ms       8.639ms             1                                             []            --
                aten::conv2d         0.42%      37.000us        85.13%       7.440ms       7.440ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [        91.645
           aten::convolution         0.15%      13.000us        84.70%       7.403ms       7.403ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
          aten::_convolution         0.48%      42.000us        84.55%       7.390ms       7.390ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
    aten::mkldnn_convolution        83.47%       7.295ms        84.07%       7.348ms       7.348ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
           aten::as_strided_         0.31%      27.000us         0.31%      27.000us      27.000us             1                [[1, 64, 272, 272], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 8.740ms

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes      GFLOPS/s
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                 aten::zeros         0.78%      68.000us         1.16%     101.000us     101.000us             1                           [[], [], [], [], []]            --
                 aten::empty         0.49%      43.000us         0.49%      43.000us      14.333us             3                       [[], [], [], [], [], []]            --
                 aten::zero_         0.23%      20.000us         0.23%      20.000us      20.000us             1                                          [[1]]            --
             model_inference        13.67%       1.195ms        98.84%       8.639ms       8.639ms             1                                             []            --
                aten::conv2d         0.42%      37.000us        85.13%       7.440ms       7.440ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [        91.645
           aten::convolution         0.15%      13.000us        84.70%       7.403ms       7.403ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
          aten::_convolution         0.48%      42.000us        84.55%       7.390ms       7.390ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
    aten::mkldnn_convolution        83.47%       7.295ms        84.07%       7.348ms       7.348ms             1  [[1, 8, 272, 272], [64, 8, 3, 3], [64], [], [            --
           aten::as_strided_         0.31%      27.000us         0.31%      27.000us      27.000us             1                [[1, 64, 272, 272], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 8.740ms
```

Fixes https://github.com/pytorch/pytorch/issues/55606

Reviewed By: xuzhao9

Differential Revision: D28019925

Pulled By: ilia-cher

fbshipit-source-id: 7e7d7ed496059caf917a3dd8dea2daaceb5db920
2021-04-27 13:46:25 -07:00
9da0f2e95e Support __pos__ and positive (#55891)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55604.

This PR implements `torch.Tensor.__pos__` and `torch.positive` for compatibility with NumPy’s interface. (cc: mruberry, rgommers, emcastillo and kmaehashi)
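
For example:
```python
import torch

t = torch.tensor([-1.0, 2.0, -3.0])
print(+t)                 # Tensor.__pos__
print(torch.positive(t))  # same values; mirrors NumPy's np.positive
```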

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55891

Reviewed By: H-Huang

Differential Revision: D28025928

Pulled By: mruberry

fbshipit-source-id: e43e329a802f31bf8805f6efab5c2c7ef34c88b9
2021-04-27 13:23:59 -07:00
5b3c0ae563 Use a FutureFactoryRegistry to allow libtorch_cpu files to create CUDAFuture (#56984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56984

This is a preparation PR before we can create CUDAFuture in rref_impl.cpp.

The solution is to add a `FutureFactoryRegistry` in `rpc/utils.*`. The
TensorPipe RPC agent is responsible for registering the `CUDAFuture` factory
and the `ivalue::Future` factory. The reason we need this change instead
of directly using the `USE_CUDA` macro in RRef files is as follows. There are
three build targets: `torch_cpu`, `torch_cuda`, and `torch_python`.
`torch_python` is built on top of the other two. `torch_cpu` is CPU-only:
it contains no CUDA-related code, and hence no `USE_CUDA` macro.
`tensorpipe_*` files are in `torch_python`, which does have access to CUDA.
However, RRef source files are in `torch_cpu`, which cannot contain CUDA
code. The recommended solution is to allow dynamic dispatching; hence
this PR.
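
A rough sketch of the registration pattern in illustrative Python (the real registry is C++ in `rpc/utils.*`; all names here are made up):
```python
# The CPU-only core knows the registry, not the concrete future types.
_future_factories = {}

def register_future_factory(device_type, factory):
    _future_factories[device_type] = factory

def create_future(device_type="cpu"):
    return _future_factories[device_type]()

# The CUDA-aware component registers its factory at load time, so the core
# never needs a USE_CUDA-style compile-time switch:
register_future_factory("cpu", lambda: "ivalue::Future")  # stand-ins
register_future_factory("cuda", lambda: "CUDAFuture")
```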

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28020917

Pulled By: mrshenli

fbshipit-source-id: e67c76a273074aebb61877185cc5e6bc0a1a5448
2021-04-27 12:34:15 -07:00
f9e7e2e20e Remove unnecessary noCuda arg from AtomicJitFuture (#56973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56973

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28020918

Pulled By: mrshenli

fbshipit-source-id: 99d0e4306f7650be97f73af00d89bdbb762595bc
2021-04-27 12:33:02 -07:00
cea265b8d8 Support layer_norm for static runtime (#56444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56444

Added an out variant for layer_norm.

Test Plan:
buck test caffe2/aten:math_kernel_test -- NativeLayerNorm

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D27873846

fbshipit-source-id: 53ee9fec4ff9a4e78198b031e86b5afd013626dd
2021-04-27 12:28:37 -07:00
3de86b951d Migrate thrust->cub for index put (#55693)
Summary:
64-bit indexing is not supported, because if `num_indices = 2^31`, then 4 long tensors of `num_indices` elements would take 64 GB of RAM. I don't think anybody will be interested in running `index_put` with 64 GB of GPU RAM.

Benchmark on CUDA 11.3 RTX3090:
```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M, L):
    a = torch.randn(M, device='cuda')
    i1 = torch.randint(M, (L,), dtype=torch.long, device='cuda')
    v = torch.randn(L, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.index_put_((i1,), v, True))

for M, L in itertools.product((100, 100000, 10000000), repeat=2):
    print(M, L)
    benchmark(M, L)
```

Before
```
100 100
5.13 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
30.2 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
3.17 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
5.19 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
11.9 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
712 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
5.07 ms ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
12.1 ms ± 76.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
627 ms ± 7.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After
```
100 100
3.75 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 100000
26.2 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
100 10000000
2.81 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 100
3.85 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 100000
9.74 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 10000000
444 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10000000 100
3.85 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 100000
10.7 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000 10000000
396 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55693

Reviewed By: albanD

Differential Revision: D27895967

Pulled By: ngimel

fbshipit-source-id: 0616ce33395ce46f1a4161dfd38940b8e54fedc2
2021-04-27 12:27:09 -07:00
6c602eb099 Don't hold ThreadPool lock when destructing task (#56817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56817

Fix https://github.com/pytorch/pytorch/issues/56701 and https://github.com/pytorch/pytorch/issues/56786

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D27975642

Pulled By: ezyang

fbshipit-source-id: b7f4a6c18a4fa65c38bacc7c46246f0865c95f86
2021-04-27 12:22:49 -07:00
a18f3aacee Vectorize floating point floor_divide (#55380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55380

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27993499

Pulled By: mruberry

fbshipit-source-id: 45ea9c3295e4d85316bae9487db20914e0cbe3ed
2021-04-27 12:10:06 -07:00
cf17fd6dd5 Fix multinomial CUDA misalignment and non-deterministic behavior (#55364)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46702

- fails on probability distributions with an odd number of items
  - trying to access an `acc_type` (`float`) in memory aligned for `scalar_t` (`float16`)
- produces unrepeatable results for large input tensors
  - the parallel cumsum is not monotonic at some positions

### Fixes
- computing the cumsum on `acc_type` (`float`) instead of `scalar_t` (`float16`) fixed both issues
- the non-monotonic behavior may still occur even with `float`, though
  - in these cases, deterministic behavior can be achieved by eliminating the race condition when writing the result, using the atomic function `atomicMax`
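
A minimal sketch of the first failure mode (shape and dtype are illustrative assumptions):
```python
import torch

# an odd number of categories in float16 could previously hit the
# misaligned access described above
probs = torch.rand(32769, device="cuda", dtype=torch.float16)
idx = torch.multinomial(probs, num_samples=1)
```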

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55364

Reviewed By: mruberry

Differential Revision: D28031666

Pulled By: ngimel

fbshipit-source-id: 0fc6289e0b9ea2d31ef3771e7ca370de8f5c02de
2021-04-27 12:04:32 -07:00
6e91e90b4d Use OpInfo for unsqueeze test (#56924)
Summary:
This PR is ready for https://github.com/pytorch/pytorch/issues/56774.

(cc: mruberry, emcastillo, kmaehashi)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56924

Reviewed By: H-Huang

Differential Revision: D28026529

Pulled By: mruberry

fbshipit-source-id: 3afb33bb2999110c565728cd761d3e7d9d3fc82b
2021-04-27 11:58:30 -07:00
6c37788cb1 [torch] Add cuda support for segment reduction 'max' (#56704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56704

This is a re-submit of PR: https://github.com/pytorch/pytorch/pull/54175

Main changes compared to the original PR:
- Switch to importing "<ATen/cuda/cub.cuh>"
- Use CUB_WRAPPER to reduce boilerplate code.

Test Plan:
Will check CI status to make sure all tests pass.

Added unit test.

Reviewed By: ngimel

Differential Revision: D27941257

fbshipit-source-id: 24a0e0c7f6c46126d2606fe42ed03dca15684415
2021-04-27 11:29:03 -07:00
d578e8cfa2 Improved docs for torch.linalg (#56265)
Summary:
This PR tries to make the docs of `torch.linalg` have/be:
- More uniform notation and structure for every function.
- More uniform use of back-quotes and the `:attr:` directive
- More readable for a non-specialised audience through explanations of the form that factorisations take and when it is beneficial to use which arguments in some solvers.
- More connected among the different functions through the use of  the `.. seealso::` directive.
- More information on when gradients explode / when a function silently returns a wrong result / when things do not work in general

I tried to follow the structure of "one short description and then the rest" to be able to format the docs like those of `torch.` or `torch.nn`. I did not do that yet, as I am waiting for the green light on this idea:
https://github.com/pytorch/pytorch/issues/54878#issuecomment-816636171

What this PR does not do:
- Clean the documentation of other functions that are not in the `linalg` module (although I started doing this for `torch.svd`, but then I realised that this PR would touch way too many functions).

Fixes https://github.com/pytorch/pytorch/issues/54878

cc mruberry IvanYashchuk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56265

Reviewed By: H-Huang

Differential Revision: D27993986

Pulled By: mruberry

fbshipit-source-id: adde7b7383387e1213cc0a6644331f0632b7392d
2021-04-27 11:16:09 -07:00
9d54475032 Hide module paths leaking in the documentation. (#54585)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54585

Reviewed By: H-Huang

Differential Revision: D28027037

Pulled By: mruberry

fbshipit-source-id: 219874e143221f5e8349d007f88464e0be1a6243
2021-04-27 10:58:01 -07:00
c203c921bc Revert D27926270: [pytorch][PR] [profiler] Add cuda synchronization points
Test Plan: revert-hammer

Differential Revision:
D27926270 (38bb0ac3e8)

Original commit changeset: 5cf30128590c

fbshipit-source-id: 940da27f5c921d8921191188230807f1708e3e1f
2021-04-27 09:27:35 -07:00
a93ceb333d Workaround intermittent gcc-7.5 ICE in cpp tests (#57016)
Summary:
gcc-7.5 optimizer can hit internal compiler error if both `-fopenmp` and
`-faligned-new` are passed:
```
/var/lib/jenkins/workspace/test/cpp/api/transformer.cpp: In function 'void transformer_decoder_test_helper(bool)':
/var/lib/jenkins/workspace/test/cpp/api/transformer.cpp:609:6: internal compiler error: in equal_mem_array_ref_p, at tree-ssa-scopedtables.c:429
 void transformer_decoder_test_helper(bool is_cuda) {
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Fixes https://github.com/pytorch/pytorch/issues/40941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57016

Reviewed By: walterddr

Differential Revision: D28027670

Pulled By: malfet

fbshipit-source-id: 834e34b95e09bcae39ada25e02749f479a7e9013
2021-04-27 09:21:23 -07:00
11d455fa8b .github: Enable Linux CPU GHA on PRs (#56942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56942

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28018455

Pulled By: seemethere

fbshipit-source-id: 2b4ba3d616c217b4e960871f1428dda03f2ad92a
2021-04-27 09:16:33 -07:00
ed617a61ce Adjust computeLRWorkDim() to work with Accelerate.framework (#56847)
Summary:
According to `vecLib.framework/Headers/clapack.h`, Accelerate.framework's LAPACK implementation is based on LAPACK 3.2.1, and so LRWORK should be computed using the following formula (from the LAPACK documentation):
```
*>          If JOBZ = 'N', LRWORK >= 7*min(M,N).
*>          Otherwise,
*>          LRWORK >= min(M,N)*max(5*min(M,N)+7,2*max(M,N)+2*min(M,N)+1)
```

Found while looking at test_linalg.py crashes on M1, but it would have happened on x86_64 as well if PyTorch + Accelerate.framework were tested there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56847

Reviewed By: albanD

Differential Revision: D27983352

Pulled By: malfet

fbshipit-source-id: f757c515c85b32c1e09d00a91bc20fe4b390a75a
2021-04-27 09:12:54 -07:00
338a600e78 Add dispatch keys for out-of-tree grad+vmap prototype (#56824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56824

This PR adds 6 dispatch keys to be used for prototyping.

I'm not sure what the best way to name these is; please let me know if
you think they should have the same prefix.

Test Plan: - wait for tests

Reviewed By: driazati

Differential Revision: D27999963

Pulled By: zou3519

fbshipit-source-id: 0c3ef4788854f7a93d077cc454b773a6eedbbc22
2021-04-27 09:02:49 -07:00
cfbd06d7a1 add all pools, Batchnorm and Tanh (i.e. all ideeped MKLDNN ops) to MKLDNNFuser (#56541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56541

Reviewed By: pbelevich

Differential Revision: D27930353

Pulled By: Krovatkin

fbshipit-source-id: 4d5b932bad4154e8bdd6e35498354e13b39c87a1
2021-04-27 08:59:30 -07:00
8d29ac2033 .github: Bump linux.2xlarge runners to 500 (#56945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56945

In preparation to turn these on for CI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28018454

Pulled By: seemethere

fbshipit-source-id: fa94d666499877f2cdd7b8fd3fc8b2d8127f61e8
2021-04-27 08:49:22 -07:00
e138987818 .github: Build test binaries in build/ directory (#56941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56941

Sets the custom test binaries we build in .jenkins/pytorch/build.sh to
be built in the `build` directory instead of the directory above the
workspace.

This should alleviate any weirdness we were seeing before with test
binaries having to be overwritten

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28018453

Pulled By: seemethere

fbshipit-source-id: 74add11037a622e011d00fb6292bfe20e1d55d9e
2021-04-27 08:48:09 -07:00
6bbd8ba658 [NNC] removed the second run of llvm passmanager - it is repeated and caused a slowdown in the generated code (#56837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56837

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27980073

Pulled By: huiguoo

fbshipit-source-id: 4bc821adb7bba67078f0a4cb3294143f701f5335
2021-04-27 08:36:04 -07:00
3b977a0d28 [DataLoader] Add generate_state for NumPy seeding (#56797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56797

After adding a default seeding strategy for the NumPy random module within each worker of DataLoader (#56488), two concerns were raised:
- We dropped support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide a seed for `numpy.random`?
  - The first option is to set the same seed as `random`. The problem is that `numpy.random` and `random` share the same underlying algorithm, so with the same seed they will produce exactly the same state sequence. Thanks to rkern, we noticed these so-called [bad things](https://github.com/PyTorchLightning/pytorch-lightning/pull/6960#issuecomment-818393659).
  - Considering that most users are not aware of this problem, we can provide a better default seed for `numpy.random` using the same `SeedSequence` algorithm as NumPy. This is just a workaround, with a hard-coded function that generates an array of four int32 values as the seed.

To better cope with this problem (a number of third-party libraries besides NumPy have their own random modules), we may eventually need to implement a `SeedSequence` within the `torch.random` module, so users can `spawn` a new `SeedSequence` for each library.
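
A minimal sketch of the per-worker NumPy seeding pattern this enables, assuming NumPy >= 1.17 (the hard-coded `generate_state` workaround in this PR targets older versions; the `worker_init_fn` below is illustrative, not the exact code in this PR):
```python
import numpy as np
import torch

def worker_init_fn(worker_id):
    # derive a well-mixed NumPy seed from the worker's torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(np.random.SeedSequence(worker_seed).generate_state(4))

dataset = torch.utils.data.TensorDataset(torch.arange(8))
loader = torch.utils.data.DataLoader(dataset, num_workers=2, worker_init_fn=worker_init_fn)
for batch in loader:
    pass
```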

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28000619

Pulled By: ejguan

fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
2021-04-27 08:14:02 -07:00
759cfb7495 add missing comma to run_test.py (#57010)
Summary:
Factored out from https://github.com/pytorch/pytorch/pull/57008#discussion_r621137121:

> Without this comma, the strings are concatenated to `test_binary_ufuncstest_numpy_interop`
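
A quick illustration of the pitfall (illustrative snippet, not the actual run_test.py code): Python implicitly concatenates adjacent string literals, so a missing comma silently merges two list entries.
```python
tests = [
    "test_binary_ufuncs"      # missing comma here...
    "test_numpy_interop",     # ...makes these ONE string
]
print(tests)  # ['test_binary_ufuncstest_numpy_interop']
```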

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57010

Reviewed By: malfet

Differential Revision: D28028061

Pulled By: walterddr

fbshipit-source-id: 97c64b79a6aaaf0242def03c8808c1a032537258
2021-04-27 08:00:13 -07:00
201ad938b2 Enable fixed fast_mode for complex (#55699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55699

Todo:
- error message should be updated to say whether the failure is for fn's real or imaginary component

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28007887

Pulled By: soulitzer

fbshipit-source-id: 1819201f59c8586a1d9631db05983969438bde66
2021-04-27 07:54:19 -07:00
7fe6e8e5a2 Refactor C->C to C->R twice (#55692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55692

### Release notes
get_numerical_jacobian and get_analytical_jacobian only support `grad_out=1` and `fn` no longer accepts functions that return complex output

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28004614

Pulled By: soulitzer

fbshipit-source-id: 9592c9c69584b4035b39be62252f138dce39d3b5
2021-04-27 07:53:13 -07:00
268cc117a8 Add OpInfos for torch.{complex, view_as_real, view_as_complex} (#56524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56524

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27909165

Pulled By: anjali411

fbshipit-source-id: 38592cdb357386549c8309792ef7c3218665d286
2021-04-27 07:40:46 -07:00
57e37080cd Added OpInfo for torch.einsum (#56276)
Summary:
Adds OpInfo testing for torch.einsum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56276

Reviewed By: mruberry

Differential Revision: D27967095

Pulled By: heitorschueroff

fbshipit-source-id: 60524273d2ca885e7eeb932db3e7fd697ae5ca8e
2021-04-27 07:39:38 -07:00
ab1457ad14 Remove C++17 only optional include (#56782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56782

Fixes #56749

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28000019

Pulled By: ezyang

fbshipit-source-id: 87f86a402dac87e6c101aef8c78a928ce7d21340
2021-04-27 07:35:15 -07:00
0d777a808c Make test_randperm work with meta device (#56976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56976

Band-aid fix for #54282

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28020401

Pulled By: ezyang

fbshipit-source-id: 50546d5275eade408d65e9c883999fb3b65ff55a
2021-04-27 07:26:58 -07:00
f7fba854bf Implement module.to_empty() (#56610)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56610

Reviewed By: malfet

Differential Revision: D27921653

Pulled By: jbschlosser

fbshipit-source-id: 10734b3eaa5b84bb4ba6eeba1043cfc8bb570a17
2021-04-27 06:19:54 -07:00
f2acdff73d DOC: Add note to mutating methods (#56877)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56243 by adding a note to mutating functions not following the trailing `_` convention in `torch/nn/modules/module.py`

I can also raise separate PRs for other files, if needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56877

Reviewed By: ezyang

Differential Revision: D28008856

Pulled By: jbschlosser

fbshipit-source-id: 63bfca0df05e49fceadd3167b1427dcb5542206a
2021-04-27 06:16:56 -07:00
1145e2c6e2 Revert D27831996: ns for fx: move node I/O dtype mapping to be local instead of global
Test Plan: revert-hammer

Differential Revision:
D27831996 (93de80203d)

Original commit changeset: 782f5e77de0e

fbshipit-source-id: 6637ef4e8ba76fc4f2b3836ad1ed8d37ce040576
2021-04-27 01:01:08 -07:00
45e96b5410 Revert D27833189: ns for fx: allow user functions in shadowing
Test Plan: revert-hammer

Differential Revision:
D27833189 (1917350977)

Original commit changeset: dac418e294d1

fbshipit-source-id: c6f58dac1a35806ea7d1dfb993d67e698196dee1
2021-04-27 01:01:06 -07:00
982c72ac33 Revert D27836064: ns for fx: add fp16 function shadowing
Test Plan: revert-hammer

Differential Revision:
D27836064 (96a9eafcfb)

Original commit changeset: 37a434a04e2b

fbshipit-source-id: e85088f5e301e14a0fc9ac1f7241c2baaf0a957e
2021-04-27 01:01:04 -07:00
90d554bd86 Revert D27857735: ns for fx: bug fix for shadowing fp16 emulation patterns
Test Plan: revert-hammer

Differential Revision:
D27857735 (f35540be38)

Original commit changeset: 7c1a067f035a

fbshipit-source-id: 6816223975b2e7b1f395e8894d17e3358fdb50ed
2021-04-27 01:01:02 -07:00
abb8b6c1c1 Revert D27864296: ns for fx: support binary ops when adding unshadowed loggers for inputs
Test Plan: revert-hammer

Differential Revision:
D27864296 (c004346c88)

Original commit changeset: 3cbeb728297a

fbshipit-source-id: bc87cb707b14a0965452e9a1aa0d4e37ffbe5bf1
2021-04-27 01:01:01 -07:00
cc8c5c1447 Revert D27886107: ns for fx: add option to skip matching classes and functions
Test Plan: revert-hammer

Differential Revision:
D27886107 (92c7aec5f5)

Original commit changeset: ec92c4f7ab71

fbshipit-source-id: 87d3b91c3d601f1706b61a2b2ce287a7b44f3d81
2021-04-27 01:00:59 -07:00
5dc7a6b050 Revert D27960767: ns for fx: allow comparing int8 to int8 for functionals
Test Plan: revert-hammer

Differential Revision:
D27960767 (502c58ad84)

Original commit changeset: abc911ca4b9e

fbshipit-source-id: 9bb1aa9d0e764bfd2dd6745af897d958c054ef3a
2021-04-27 01:00:57 -07:00
5db03b4109 Revert D27960766: ns for fx: additional bugfix for user defined functions
Test Plan: revert-hammer

Differential Revision:
D27960766 (9bd14da6e4)

Original commit changeset: 02935d2f400a

fbshipit-source-id: e7026c8637a591b6ffef288da8ef6306cdb9eb95
2021-04-27 00:59:57 -07:00
a0483cd06b Back out "fx: Fix type_matches for Optional[List[int]] arguments" (#56991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56991

Original commit changeset: c5aa5f61a215

Diff: D27987746 (267b554b6f)

Test Plan: `buck test` under the glow-buck target is the target that this reversion is intended to fix

Reviewed By: jfix71

Differential Revision: D28019659

fbshipit-source-id: 37584ff404fc9195b309a5a6afdb4edbc2b4f088
2021-04-27 00:15:15 -07:00
780f454297 Add some functions for manipulating mkldnn tensors to TORCH_API (#56954)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56954

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28010327

Pulled By: bertmaher

fbshipit-source-id: 59872a40c7bc06187f0d87046446dd39193a1d71
2021-04-26 23:52:49 -07:00
c42dd8b257 Revert "Use at::cpu in bench_approx (#56563)" (#56816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816

This doesn't actually work.  For some reason the linker can't find
at::cpu::logit_out, and it's not worth digging into why not.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27977406

Pulled By: bertmaher

fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
2021-04-26 23:51:49 -07:00
38bb0ac3e8 [profiler] Add cuda synchronization points (#56651)
Summary:
Adding cuda synchronization when entering and exiting the profiler
context manager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56651

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D27926270

Pulled By: ilia-cher

fbshipit-source-id: 5cf30128590c1c71a865f877578975c4a6e2cb48
2021-04-26 23:21:05 -07:00
dc8a8cea79 Move caffe2 signal_handler to c10. (#56717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717

The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.

I've fixed this by moving it to the c10 namespace, where both C2 and PyTorch
can now use it.

The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility with C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929

Test Plan: waitforbuildbot

Reviewed By: ezyang

Differential Revision: D27946738

fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
2021-04-26 23:08:12 -07:00
6ed5bbfb46 [TensorPipe] Give higher priority to CPU-only channels. (#56908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56908

CUDA channels might implement CPU-to-CPU transfers, but will usually be
less efficient for that purpose.

Test Plan: CI

Reviewed By: lw

Differential Revision: D27994069

fbshipit-source-id: fefa7f243eb43cf769864233df518f2a1819f949
2021-04-26 22:27:44 -07:00
a09bbe73fd static runtime support for fb::equally_split (#56812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
So fb::equally_split will have as many outputs as ListUnpack.

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27974999

fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
2021-04-26 20:18:09 -07:00
35f3feca28 [RPC Framework] Supporting reading the input from the remote worker (#56943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56943

If the module is placed on a CUDA device, then all the CPU tensors in `args` and `kwargs` will also be implicitly moved to the same CUDA device to run forward.

Currently we still need to move the forward output from the CUDA device back to CPU, until:
1) Process group RPC backend is completely deprecated, and we always use TensorPipe RPC backend;
2) A device map is explicitly provided to TensorPipe RPC backend.

These steps will be done in a separate PR.

Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127457584

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test - test_load_di_parts (caffe2.torch.fb.training_toolkit.applications.sparse_nn.batch_distributed_inference.tests.batch_distributed_inference_test.BatchDistributedInferenceTest)'

Reviewed By: wanchaol

Differential Revision: D27934791

fbshipit-source-id: de27e27b905db83cc52800e63684fc6c942e9dc7
2021-04-26 20:04:06 -07:00
3721e01d60 Port adaptive_max_pool3d_backward to structured kernel (#56800)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56800

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27984077

Pulled By: SplitInfinity

fbshipit-source-id: 1425ae741474128f3aacd032d7f926ce5ea81101
2021-04-26 20:01:09 -07:00
77e3f5d73d Port adaptive_max_pool2d_backward to structured kernel (#56799)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56799

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27984078

Pulled By: SplitInfinity

fbshipit-source-id: 6404513f413fc6966687d8f1e9ea2a423a332ec9
2021-04-26 20:00:07 -07:00
e7c79cb158 Add type annotations to nnapi (#48142)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48141

~Mypy is complaining about a missing arg in a function call.~
```bash
torch/backends/_nnapi/serializer.py:806: error: Too few arguments for "_do_add_binary"  [call-arg]
Found 1 error in 1 file (checked 1140 source files)
```

9392137dbe/torch/backends/_nnapi/serializer.py (L804-L806)

~dreiss, would you mind take a look when you have some cycles to spare and see what would be the appropriated value for `fuse_code` here? Thanks :)~

Edit: https://github.com/pytorch/pytorch/issues/48925 got merged a couple of days ago. The blocking part is now unblocked, and I just pushed the changes to make mypy happy again. This PR is ready for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48142

Reviewed By: ezyang

Differential Revision: D28006249

Pulled By: walterddr

fbshipit-source-id: 5e43eeba7143512a549efaad31541f86718add7c
2021-04-26 19:08:07 -07:00
8a0eb7fb2d [TensorExpr] Docs: checkin 'Conditionals in TE' doc. (#56949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56949

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28009077

Pulled By: ZolotukhinM

fbshipit-source-id: 8d72c38ede623c93c6bd982d75a8ef9b23ba3825
2021-04-26 18:22:55 -07:00
e909ad2dc4 [static runtime] binding for aten::argmin_out (#56638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56638

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.55901 ms.    35.3486%. fb::sigrid_transforms_torch_bind (1 nodes)
       0.986321 ms.    22.3636%. aten::linear (6 nodes)
       0.722277 ms.    16.3767%. aten::argmin (1 nodes)
       0.256231 ms.    5.80971%. aten::matmul (1 nodes)
       0.149653 ms.    3.39319%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.105381 ms.    2.38938%. fb::clip_ranges_gather (263 nodes)
      0.0911405 ms.    2.06649%. aten::sub (1 nodes)
      0.0605429 ms.    1.37273%. aten::repeat (1 nodes)
      0.0456569 ms.    1.03521%. aten::norm (1 nodes)
      0.0421855 ms.   0.956501%. fb::batch_box_cox (1 nodes)
      0.0370142 ms.   0.839249%. aten::__getitem__ (506 nodes)
      0.0359091 ms.   0.814193%. prim::TupleUnpack (254 nodes)
      0.0338332 ms.   0.767123%. aten::sigmoid (2 nodes)
      0.0315159 ms.   0.714582%. aten::mul (3 nodes)
      0.0297553 ms.   0.674662%. fb::offsets_to_ranges (253 nodes)
      0.0279913 ms.   0.634666%. fb::simple_embedding_bag_sum (3 nodes)
      0.0233521 ms.   0.529478%. aten::pow (1 nodes)
       0.021296 ms.    0.48286%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0208991 ms.   0.473861%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0183163 ms.   0.415298%. aten::sum (3 nodes)
      0.0164318 ms.   0.372571%. prim::DictConstruct (2 nodes)
      0.0160191 ms.   0.363211%. prim::TupleConstruct (1 nodes)
      0.0126953 ms.   0.287849%. aten::div (1 nodes)
      0.0106084 ms.   0.240532%. static_runtime::to_copy (8 nodes)
      0.0092846 ms.   0.210516%. prim::ListConstruct (4 nodes)
     0.00916175 ms.   0.207731%. fb::sigrid_hash_precompute (1 nodes)
     0.00707015 ms.   0.160307%. aten::contiguous (1 nodes)
     0.00621954 ms.    0.14102%. aten::narrow (4 nodes)
     0.00302307 ms.  0.0685441%. aten::add (1 nodes)
     0.00290759 ms.  0.0659259%. aten::full (1 nodes)
     0.00283369 ms.  0.0642503%. aten::logit (1 nodes)
     0.00239244 ms.  0.0542455%. fb::gather_ranges (4 nodes)
     0.00220181 ms.  0.0499232%. aten::relu (1 nodes)
     0.00211563 ms.  0.0479691%. static_runtime::reshape_copy (2 nodes)
      0.0020059 ms.  0.0454812%. aten::stack (1 nodes)
     0.00186682 ms.  0.0423276%. aten::clamp_min (1 nodes)
     0.00172548 ms.   0.039123%. aten::size (3 nodes)
      0.0011853 ms.  0.0268751%. aten::expand_as (1 nodes)
    0.000881784 ms.  0.0199933%. fb::clip_ranges (2 nodes)
    0.000835602 ms.  0.0189462%. fb::lengths_to_offsets (3 nodes)
    0.000444376 ms.  0.0100757%. static_runtime::flatten_copy (1 nodes)
    0.000197078 ms. 0.00446848%. prim::device (1 nodes)
         4.4104 ms. in Total
StaticRuntime setup time: 0.000702 ms
Memory allocation time: 0.00943333 ms
Memory deallocation time: 0.062704 ms
Outputs deallocation time: 0.0477171 ms
Total memory managed: 831744 bytes
Total number of reused tensors: 31
W0421 14:53:04.841202 929500 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 14:53:04.841315 929500 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 14:53:04.841341 929500 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 14:53:04.971776 929500 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 130.423. Iters per second: 7.66736
I0421 14:53:05.122830 929500 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27923172

fbshipit-source-id: 05cf5497fb6ac39dd3ff24f583607a3dff8cae95
2021-04-26 17:28:42 -07:00
9bd14da6e4 ns for fx: additional bugfix for user defined functions (#56762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56762

Adds a test case for wrapped sigmoid, and fixes the following issues
to make it pass in NS:
* allows comparing between x.sigmoid() and torch.sigmoid(x), if they are related
* allows dtype cast from FP32_OR_INT8 to FP32, via dequantize (this will be improved later)

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27960766

fbshipit-source-id: 02935d2f400aa0b8f3d51bbf664a6c8ca89aa811
2021-04-26 17:03:32 -07:00
502c58ad84 ns for fx: allow comparing int8 to int8 for functionals (#56742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56742

Fixes a bug to allow shadowing of linear and conv functionals.
The bug is to only detach tensors, not all objects.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_int8_shadows_int8_fun
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27960767

fbshipit-source-id: abc911ca4b9edafd1effb9dada7731981538c2df
2021-04-26 17:03:30 -07:00
92c7aec5f5 ns for fx: add option to skip matching classes and functions (#56493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56493

Adds a config option to skip matching classes by class type
and functions by function type.

This is useful when users make custom modules which return
types other than tensors. With the current implementation of
Logger, these are not scriptable.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_module_scriptable
```

needs more testing before land

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27886107

fbshipit-source-id: ec92c4f7ab7141021bc022f07b3b558b42bbb986
2021-04-26 17:03:28 -07:00
c004346c88 ns for fx: support binary ops when adding unshadowed loggers for inputs (#56408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56408

Adds the ability to log unshadowed inputs of binary ops such as `add`
and `mul`, when indices 0, 1, or 0 and 1 are tensors.

Note: making shadowing support this is saved for a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_mul_inputs_activations
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27864296

fbshipit-source-id: 3cbeb728297aa192d1ea17e815299709fd9db056
2021-04-26 17:03:26 -07:00
f35540be38 ns for fx: bug fix for shadowing fp16 emulation patterns (#56384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56384

Enables shadow copies of fp16 emulation patterns where weights
are cast to fp16 before being passed to linear.  This previously
did not work because copying of `call_method` nodes was not implemented.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16_vs_linear_fp16_shadow_activations
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27857735

fbshipit-source-id: 7c1a067f035acf7322175f8535876d0ead88a86a
2021-04-26 17:03:25 -07:00
96a9eafcfb ns for fx: add fp16 function shadowing (#56311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56311

Adds functionality for shadowing user functions with fp16 I/O dtype.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27836064

fbshipit-source-id: 37a434a04e2bd2593a892209bbae59f0f1f34319
2021-04-26 17:03:23 -07:00
1917350977 ns for fx: allow user functions in shadowing (#56301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56301

Allows usage of user functions in NS shadow APIs. We expose the
i/o mapping to the user APIs, and thread them throughout the code.

Note: the format of the mapping is currently not the best; improving
it is saved for a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27833189

fbshipit-source-id: dac418e294d1c9b204efbf4071d5cc12a9e784c0
2021-04-26 17:03:21 -07:00
93de80203d ns for fx: move node I/O dtype mapping to be local instead of global (#56296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56296

To support shadows of custom functions, we need to allow the user to
specify the I/O types of the custom functions.

This PR is a cleanup in preparation for making that happen.
We make the I/O dtype mappings be generated by a function instead
of a global variable. In the next PR, we will add a hook so users
can modify these mappings.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27831996

fbshipit-source-id: 782f5e77de0eef3899b9b7def0fdabd8dcafef12
2021-04-26 17:03:19 -07:00
8dbf6ae8fa ns for fx: handling for user functions in weight and unshadowed act APIs (#56292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56292

Adds hooks for specifying user defined functions to NS weight and
unshadowed activation APIs.

Adding it to shadowed activation APIs will be a bit more work, upcoming
in a separate PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27830409

fbshipit-source-id: 6bbddc3062c0b3e412a3147244795319c0785a92
2021-04-26 17:03:18 -07:00
d405d41a7c ns for fx: enable user defined functions for graph matching (#56283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56283

Exposes the `base_name_to_sets_of_related_ops` variable
to the graph matching API, so that users can add relationships
for custom functions. This is needed to enable full support of
external functions for custom backends.

The next PR will extend this to the NS APIs.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_user_defined_function
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27830410

fbshipit-source-id: 8688cf697d388c52e3d18f108765edfca3c3d3aa
2021-04-26 17:02:11 -07:00
f5c24cc891 add deterministic path for index_copy_cpu (#56900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56900

use serial copy with iter.serial_for_each in the deterministic mode

Test Plan:
buck test mode/opt //caffe2/test:torch -- test_index_copy_deterministic

    ✓ Pass: caffe2/test:torch - test_index_copy_deterministic_cpu (test_torch.TestTorchDeviceTypeCPU) (5.581)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert_index_copy

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (11.565)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_copy_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (29.172)
    ✓ Pass: caffe2/test:torch_cuda - main (29.172)

Reviewed By: ngimel

Differential Revision: D27992992

fbshipit-source-id: cebeefd8508553f9dbc4145819fe90dd625502f3
2021-04-26 16:57:47 -07:00
0888b8726a [static runtime] binding for aten::clamp_min_out (#56635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56635

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=0 --adsfinder_compatibility=1
```

```
Time per node type:
        1.50885 ms.    36.0064%. fb::sigrid_transforms_torch_bind (1 nodes)
        0.92296 ms.    22.0251%. aten::linear (6 nodes)
       0.695455 ms.     16.596%. aten::argmin (1 nodes)
       0.237931 ms.    5.67787%. aten::matmul (1 nodes)
       0.141634 ms.    3.37989%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
      0.0925469 ms.     2.2085%. fb::clip_ranges_gather (263 nodes)
      0.0886556 ms.    2.11563%. aten::sub (1 nodes)
      0.0549624 ms.     1.3116%. aten::repeat (1 nodes)
       0.043996 ms.     1.0499%. aten::norm (1 nodes)
      0.0403472 ms.   0.962826%. fb::batch_box_cox (1 nodes)
      0.0371137 ms.   0.885664%. aten::sigmoid (2 nodes)
       0.035054 ms.   0.836512%. aten::__getitem__ (506 nodes)
      0.0338771 ms.   0.808427%. prim::TupleUnpack (254 nodes)
      0.0288516 ms.   0.688502%. aten::mul (3 nodes)
       0.026195 ms.   0.625106%. fb::offsets_to_ranges (253 nodes)
      0.0243627 ms.   0.581381%. aten::pow (1 nodes)
      0.0210347 ms.   0.501962%. fb::simple_embedding_bag_sum (3 nodes)
      0.0195358 ms.   0.466192%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0193484 ms.   0.461722%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0164265 ms.   0.391995%. aten::sum (3 nodes)
      0.0157266 ms.   0.375291%. prim::TupleConstruct (1 nodes)
      0.0156512 ms.   0.373493%. prim::DictConstruct (2 nodes)
      0.0114427 ms.   0.273062%. aten::div (1 nodes)
     0.00884876 ms.   0.211163%. static_runtime::to_copy (8 nodes)
     0.00864496 ms.   0.206299%. prim::ListConstruct (4 nodes)
     0.00803458 ms.   0.191734%. fb::sigrid_hash_precompute (1 nodes)
     0.00619933 ms.   0.147938%. aten::contiguous (1 nodes)
     0.00462827 ms.   0.110447%. aten::narrow (4 nodes)
     0.00293105 ms.  0.0699452%. aten::logit (1 nodes)
     0.00287083 ms.  0.0685082%. static_runtime::reshape_copy (2 nodes)
     0.00250605 ms.  0.0598032%. aten::add (1 nodes)
     0.00217015 ms.  0.0517875%. fb::gather_ranges (4 nodes)
     0.00202655 ms.  0.0483607%. aten::full (1 nodes)
     0.00200812 ms.  0.0479208%. aten::relu (1 nodes)
     0.00175433 ms.  0.0418644%. aten::stack (1 nodes)
     0.00174899 ms.   0.041737%. aten::clamp_min (1 nodes)
     0.00134367 ms.  0.0320646%. aten::size (3 nodes)
    0.000811416 ms.  0.0193633%. fb::clip_ranges (2 nodes)
    0.000801096 ms.   0.019117%. aten::expand_as (1 nodes)
    0.000541452 ms.   0.012921%. fb::lengths_to_offsets (3 nodes)
    0.000477838 ms.  0.0114029%. static_runtime::flatten_copy (1 nodes)
    0.000192906 ms. 0.00460342%. prim::device (1 nodes)
        4.19049 ms. in Total
StaticRuntime setup time: 0.000408 ms
Memory allocation time: 0.00895982 ms
Memory deallocation time: 0.0587527 ms
Outputs deallocation time: 0.0430985 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 28
W0421 14:33:55.610956 836281 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 14:33:55.611043 836281 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 14:33:55.611063 836281 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 14:33:55.736069 836281 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 124.995. Iters per second: 8.0003
I0421 14:33:55.874794 836281 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27922570

fbshipit-source-id: 095aa9bd0c425bc73eb48841653441d5c9e45744
2021-04-26 16:39:12 -07:00
d221be6fb4 [iOS GPU] Use thread buffer to store indices for transpose (#56706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56706

We've seen the transpose op fail on iOS 12 devices. This is because the index buffer is allocated in the device address space, which is shared across multiple threads, and write operations are not guaranteed to be atomic. Using a thread buffer solves the issue.
ghstack-source-id: 127365795

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D27941353

fbshipit-source-id: 5f09f0a085081b7c5e8019ebe711e36394cdde92
2021-04-26 16:34:35 -07:00
16710e5d93 Add reasons in TODO for the unblocked AVNTM -> InferenceMode cases. (#56823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56823

Test Plan: CI

Reviewed By: bhosmer

Differential Revision: D27975596

fbshipit-source-id: 1d5681852163cd24ae245a6d90e44a34a0909145
2021-04-26 15:58:34 -07:00
e810bed63f [Static Runtime] Clean up op implementations (#56841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56841

- Move arg checks to outside the lambda so we can perform these checks at Static Runtime initialization time
- use `optional` where possible
- support the `to.other` overload, the 5-arg variant of `torch.to`.
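
An illustrative example of the `to.other` overload from the Python side (tensors here are made up): `other` supplies the target dtype and device, alongside the `non_blocking`, `copy`, and `memory_format` arguments.
```python
import torch

x = torch.randn(2, 2)
other = torch.randn(2, 2, dtype=torch.float64)
y = x.to(other, non_blocking=False, copy=False)  # takes other's dtype/device
assert y.dtype == torch.float64
```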

Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
```

Reviewed By: edvgha

Differential Revision: D27933176

fbshipit-source-id: 49d6249c8784c44146461e286e7a301596172d7c
2021-04-26 15:37:39 -07:00
9b46b6b37a Added sm_75 to CUDA Arch List for Linux CI GPU builds (#56619)
Summary:
This PR adds `sm_75` CUDA architecture support for CircleCI GPU builds, so that generated artifacts from these builds can be installed and run on machines with CUDA capability `sm_75`.

This PR is currently to see how much longer the PR CI GPU builds will take with `TORCH_CUDA_ARCH_LIST="7.5"` rather than `TORCH_CUDA_ARCH_LIST="5.2"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56619

Reviewed By: malfet

Differential Revision: D28012538

Pulled By: seemethere

fbshipit-source-id: 3959736721eab7389984234d89eadcf04d163c37
2021-04-26 15:32:14 -07:00
d1088de522 Let RRef getValue() synchronize CUDA streams (#56895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56895

PR #54932 fixes CUDA stream synchronization between RPC-created
OwnerRRef and UserRRef when `to_here()` is invoked. However, there
are two more gaps.

1. RRef value can be accessed on the owner directly through
    `local_value`, which bypasses the fix in #54932.
2. When RRef is created directly through RRef ctor instead of RPC,
    the OwnerRRef won't be able to correctly record CUDA events.

This PR fixes 1 by letting current streams wait for RRef recorded
CUDA events before returning the value in `RRef::getValue()`.

For 2, more discussion is needed to decide whether we should add
a `devices` argument to the RRef ctor, or whether the RRef ctor should
inspect the given values.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27992775

Pulled By: mrshenli

fbshipit-source-id: ed0e5bfbf715460208c85e46dd3317deef17f8fe
2021-04-26 15:27:26 -07:00
e1a7ec3c4f [caffe2] fix -Wrange-loop-construct
Test Plan:
```
% jf get -u D27943111
% buck build mode/dev-nosan admarket/adfinder:adfinder admarket/adindexer:adindexer \
  -c cxx.extra_cxxflags='-Wno-implicit-const-int-float-conversion -Wno-sign-compare -Wno-deprecated-copy -Wno-deprecated-declarations -Wno-pass-failed' \
  -c cxx.compiler_variant=clang-12 \
  -c cxx.modules=false
```

Reviewed By: hlu1

Differential Revision: D27988238

fbshipit-source-id: 304e44bfa141a1bcb291f9434fed514bbb568f8f
2021-04-26 13:27:59 -07:00
72c3ee073f add deterministic path for index_add_cuda (#56521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56521

index_add_cuda is non-deterministic due to CUDA atomicAdd. Here we add a deterministic code path using index_put(accumulate=True).
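
A minimal sketch of opting into the new path (assumes a CUDA device; shows the intended post-fix behavior):
```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(5, 3, device="cuda")
index = torch.tensor([0, 4, 2], device="cuda")
src = torch.ones(3, 3, device="cuda")
x.index_add_(0, index, src)  # routed through the deterministic index_put path
```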

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_add_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (12.289)
    ✓ Pass: caffe2/test:torch_cuda - test_index_add_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (27.190)
    ✓ Pass: caffe2/test:torch_cuda - main (27.190)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (16.088)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_index_copy_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (37.654)
    ✓ Pass: caffe2/test:torch_cuda - main (37.654)
Summary
  Pass: 32
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D27861072

fbshipit-source-id: c33731017b863751f3e3068a23135129c555b66f
2021-04-26 12:14:58 -07:00
cb1e78038f .github: Add options to force unzip artifacts (#56929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56929

Artifacts were failing to unzip since they already existed in the
current tree, so this just forces the unzip to go through no matter what.

Was observing that test phases would fail when attempting to unzip over an already existing directory, https://github.com/pytorch/pytorch/runs/2424525136?check_suite_focus=true

In the long run however it'd be good to have these binaries built out as part of the regular cmake process instead of being one off builds like they are now

**NOTE**: This wouldn't be an issue if `--ephemeral` workers was a thing, see: https://github.com/actions/runner/pull/660

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D28004271

Pulled By: seemethere

fbshipit-source-id: c138bc85caac5d411a0126d27cc42c60fe88de60
2021-04-26 11:40:48 -07:00
7989f2ac87 Clang format dist_utils.py and rpc/__init__.py (#56853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56853

ghstack-source-id: 127412640

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27984669

fbshipit-source-id: 8e89ba0c53107622b3ca29ea296226e260b251df
2021-04-26 11:33:42 -07:00
6155b0d9fa [reland] Trigger azure pipeline for multi gpu tests (#56128)
Summary:
Reland https://github.com/pytorch/pytorch/issues/52490 only for nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56128

Reviewed By: anjali411

Differential Revision: D28002257

Pulled By: seemethere

fbshipit-source-id: d32bf420fee13b809cee402362f98942234d380b
2021-04-26 10:47:58 -07:00
2639c4e6b3 fix bug in rocm device type (#56646)
Summary:
related to https://github.com/pytorch/pytorch/issues/56156.

https://github.com/pytorch/pytorch/issues/55808 effectively turned dtypeIfROCM off but left some legacy issues unfixed. Given that we still need to deal with discrepancies between the two platforms, this PR makes dtypeIfROCM default to dtypeIfCUDA and only overrides it when the user specifies otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56646

Reviewed By: mruberry

Differential Revision: D27968959

Pulled By: walterddr

fbshipit-source-id: 6a11987b8ddf4417577b3d0d5054eaab169de42c
2021-04-26 10:38:44 -07:00
2f598b53dd catch xml parser error during report test result phase in CI (#56864)
Summary:
Fixes the "Report test results" step failing the entire test suite, such as:
https://app.circleci.com/pipelines/github/pytorch/pytorch/307916/workflows/4144870c-d1cf-4567-a6f8-93bb436471a4/jobs/12732796
and
https://app.circleci.com/pipelines/github/pytorch/pytorch/307388/workflows/a945940f-3325-43b3-bc14-c9f885b21f50/jobs/12705944

and also, unrelated but potentially a problem, cases where only partial test reports are uploaded:
https://app.circleci.com/pipelines/github/pytorch/pytorch/308777/workflows/bdc37967-3863-448e-8264-311bf21ca381/jobs/12777741

This skips the parser error and moves on to the next file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56864

Test Plan: CI

Reviewed By: janeyx99

Differential Revision: D28002166

Pulled By: walterddr

fbshipit-source-id: 6fa48122ae9dd68e401daf3692821fb00082b3ae
2021-04-26 10:29:20 -07:00
28a9483e36 fix ddp logging test (#56640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56640

Reset performance stats for the current iteration; also fix ddp logging verification for sampled iterations.
ghstack-source-id: 127327708

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923414

fbshipit-source-id: aaa1b10f64a0c952ba345c789c864bcef5cf1ab0
2021-04-26 10:12:05 -07:00
5b1f0ef622 Add cuBLAS path for batched torch.geqrf (#56253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56253

`geqrfBatched` from cuBLAS is used if
```
(input.size(-2) <= 256 && batchCount(input) >= std::max<int64_t>(2, input.size(-2) / 16))
```
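
An illustrative call that satisfies this heuristic (shapes are made up: size(-2) = 32 <= 256 and the batch count 64 >= max(2, 32 / 16)), so on CUDA it would be dispatched to the cuBLAS batched path:
```python
import torch

a = torch.randn(64, 32, 32, device="cuda")  # batch of 64 matrices, 32 x 32
qr_factor, tau = torch.geqrf(a)
```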

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960156

Pulled By: mruberry

fbshipit-source-id: 3e438eff01cbf7c7e075fb7aef709b97698a4650
2021-04-26 09:52:42 -07:00
27a8ece805 Add cuSOLVER path for torch.geqrf (#56252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56252

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960152

Pulled By: mruberry

fbshipit-source-id: 0510a302aab50623d7490efaba0133f740cd57c3
2021-04-26 09:52:41 -07:00
f84f2063b4 Port CUDA torch.geqrf to ATen (#56251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56251

This PR ports `torch.geqrf` from TH to ATen for CUDA path.

Resolves https://github.com/pytorch/pytorch/issues/24569

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960155

Pulled By: mruberry

fbshipit-source-id: a8b010c41d703a5de4bf40b045c89e6b95b5a5ca
2021-04-26 09:50:41 -07:00
5854e93bc9 Fix derivative of sinc at x=0 (#56763)
Summary:
Attempting to fix https://github.com/pytorch/pytorch/issues/56760

The derivative of `sinc(x)` at `x=0` should be special cased to 0.
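
A quick sanity check of the intended post-fix behavior: since sinc(x) = sin(pi*x)/(pi*x) ~ 1 - (pi*x)^2/6 near zero, the derivative at x = 0 is 0.
```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.sinc(x).backward()
print(x.grad)  # expected: tensor(0.) after this fix
```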

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56763

Reviewed By: zhangguanheng66

Differential Revision: D27978135

Pulled By: albanD

fbshipit-source-id: ede5e734613cf60e720f6bcc7387c3cd9c6ec233
2021-04-26 09:43:42 -07:00
3e006fc57e Adding hsplit,vsplit and dsplit methods (#53536)
Summary:
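A minimal usage sketch of the new Tensor methods (mirroring the NumPy functions of the same names; shapes are illustrative):
```python
import torch

t = torch.arange(16.0).reshape(2, 2, 4)
print(t.vsplit(2))  # split along dim 0
print(t.hsplit(2))  # split along dim 1
print(t.dsplit(2))  # split along dim 2
```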

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53536

Reviewed By: albanD

Differential Revision: D27938880

Pulled By: iramazanli

fbshipit-source-id: f741119517783ec2bafa296622ee518b587dd127
2021-04-26 09:39:09 -07:00
6ba9fd5963 Added "Tensor tol" overload of torch.linalg.matrix_rank (#54157)
Summary:
Currently `torch.linalg.matrix_rank` accepts only a Python float for the `tol=` argument. This behavior is not NumPy compatible, and this PR adds the ability to pass a Tensor for matrix-wise tolerances.

Ref. https://github.com/pytorch/pytorch/issues/42666
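
A minimal sketch of the new overload (values are illustrative): one tolerance per matrix in a batched input.
```python
import torch

a = torch.randn(3, 5, 5)                 # batch of 3 matrices
tol = torch.tensor([1e-5, 1e-3, 1e-1])   # one tolerance per matrix
ranks = torch.linalg.matrix_rank(a, tol=tol)
print(ranks)
```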

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54157

Reviewed By: ezyang

Differential Revision: D27961548

Pulled By: mruberry

fbshipit-source-id: 47318eefa07a7876e6360dae089e5389b9939489
2021-04-26 09:35:40 -07:00
a90a3acbee Use JIT Plug-in for coverage to cover JIT'd functions and methods (#56310)
Summary:
This PR is step 2 (after https://github.com/pytorch/pytorch/issues/56708) toward having JIT coverage: it actually uses the plug-in in CI!

Disclaimer: note that this will mark the entire JIT'd function/method as covered without seeking proof that the
compiled code has been executed. This means that even if the code chunk is merely compiled and not run, it will get
marked as covered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56310

Test Plan:
We should see coverage improvements in CI afterwards. A file to look out for is `torch/jit/quantized.py`, which should have more coverage after this PR, and it does!
d3283ccd8c/torch/jit/quantized.py vs https://codecov.io/gh/pytorch/pytorch/src/master/torch/jit/quantized.py

More generally, the whole jit folder got ~3% increase in coverage, I believe.

Reviewed By: walterddr

Differential Revision: D28000672

Pulled By: janeyx99

fbshipit-source-id: 6712979d63a5e1224a92ee9bd9679ec62cf1cbba
2021-04-26 09:19:32 -07:00
1e51c05b71 Name .coverage.jit with timestamp to prevent loss of stats (#56829)
Summary:
The reason we were not seeing as many wins was that .coverage.jit would overwrite itself on every coverage run. (What a noob mistake, who wrote that code?!?!)

This should fix that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56829

Test Plan:
Coverage in CI should audibly increase. It does, somewhat:
Check out f8a475b056! New covered files include:
Classes in torch/distributed/optim
torch/utils/mkldnn.py

Reviewed By: walterddr

Differential Revision: D27984427

Pulled By: janeyx99

fbshipit-source-id: e82d074c2b4a60a5204a73efc2823824384c8bf5
2021-04-26 08:43:17 -07:00
689d3a70aa Fix broken link to fx graph quant guide in quantization.rst (#56776)
Summary:
No outstanding issue; can create one if needed.

Was looking for that resource and it was moved without fixing the documentation.

Cheers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56776

Reviewed By: heitorschueroff

Differential Revision: D27967020

Pulled By: ezyang

fbshipit-source-id: a5cd7d554da43a9c9e44966ccd0b0ad9eef2948c
2021-04-26 08:22:28 -07:00
ed9c7e187b Added OpInfo for addmm (#55920)
Summary:
Added an OpInfo for `addmm` & ported its `method_tests`

Skipping `test_variant_consistency_eager` on CPU, as it's blocked by https://github.com/pytorch/pytorch/issues/56233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55920

Reviewed By: agolynski

Differential Revision: D27800325

Pulled By: heitorschueroff

fbshipit-source-id: 311cd26c6b491b486f652cf64275c6901fea03c5
2021-04-26 06:20:00 -07:00
b3f56ec0e0 Automated submodule update: tensorpipe (#56495)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 87f7681286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56495

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: beauby

Differential Revision: D27886370

fbshipit-source-id: 2b6e2b38412694633517df2b0501e5da9e81656c
2021-04-26 04:53:41 -07:00
f27513e951 Fix bug in torch.sparse.addmm on CUDA when beta != 0 or 1 (#56160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55917, which caused `torch.sparse.addmm` to fail on CUDA whenever `beta` was different from 0 or 1
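
A repro sketch of the previously failing case (assumes a CUDA device; values are illustrative):
```python
import torch

s = torch.eye(3).to_sparse().cuda()
m1 = torch.randn(3, 3, device="cuda")
m2 = torch.randn(3, 3, device="cuda")
out = torch.sparse.addmm(m1, s, m2, beta=0.5, alpha=1.0)  # beta != 0 or 1 used to fail
```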

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56160

Reviewed By: ejguan

Differential Revision: D27825108

Pulled By: ngimel

fbshipit-source-id: 2ade5ea38c5322768dc4dffb40c65fcbb17ec201
2021-04-26 02:57:41 -07:00
f3743f097f [TensorExpr] Nuke tensorexpr::ScalarType and instead use c10::ScalarType directly. (#56825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56825

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27977461

Pulled By: ZolotukhinM

fbshipit-source-id: f8a72938ba395e426e2d9449627113abb1c9c34f
2021-04-26 01:51:21 -07:00
441c835733 [TensorExpr] Remove unused field from TensorExprKernel. (#56761)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56761

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27960594

Pulled By: ZolotukhinM

fbshipit-source-id: 8f2bf1d688422363b97f48045ff96601665301f5
2021-04-26 01:51:19 -07:00
1faf1f96aa [TensorExpr] Fuser: don't lift tensor constants from fusion groups. (#56756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56756

With #56319 TE kernel could handle tensor constants, so there is no more
need in lifting them out and passing as inputs.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27959258

Pulled By: ZolotukhinM

fbshipit-source-id: 00269cf1c4747c10dfc40cb4e330991d0bf1e2ee
2021-04-26 01:49:26 -07:00
7b31ba4708 Fix cudnn ctc loss backward (#56639)
Summary:
Fix cudnn ctc loss backward

Fix https://github.com/pytorch/pytorch/issues/49046; this was still working in PyTorch 1.1.

Originally modified in this PR in Oct 2019, https://github.com/pytorch/pytorch/pull/27039/files#diff-25ec2c1108ee03e2167622588ec31d167897ef1cccb12a4cfe77eb98777316daR2383-R2392

According to the original code

90ffab6e37/tools/autograd/derivatives.yaml (L1387-L1388)

and the code after PR

f461184505/tools/autograd/templates/Functions.cpp (L2456-L2465)

This `at::zeros({0}, raw_grad.options())` in line 2460 seems suspicious, and it causes an `infer_size` runtime error:

```
RuntimeError: The size of tensor a (0) must match the size of tensor b (177) at non-singleton dimension 2
Exception raised from infer_size at ..\aten\src\ATen\ExpandUtils.cpp:24 (most recent call first):
```

I've modified that to `at::zeros_like(raw_grad)`, which looks more accurate.
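
A hedged repro sketch of the failing backward (assumes a CUDA build with cuDNN and cuDNN-eligible inputs: int32 lengths, concatenated CPU targets, blank=0, all input lengths equal to T):

```python
import torch
import torch.nn.functional as F

T, N, C, S = 50, 16, 20, 30
log_probs = torch.randn(T, N, C, device="cuda").log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N * S,), dtype=torch.int32)   # concatenated, int32, on CPU
input_lengths = torch.full((N,), T, dtype=torch.int32)
target_lengths = torch.full((N,), S, dtype=torch.int32)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
loss.backward()   # previously raised the infer_size RuntimeError above
```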

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56639

Reviewed By: mruberry

Differential Revision: D27987860

Pulled By: ngimel

fbshipit-source-id: 5ad65e78d017c26894fb26318a5992b0878d04d5
2021-04-25 22:51:19 -07:00
9eee14704a OpInfo: roll and rot90 (#56770)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56770

Reviewed By: ngimel

Differential Revision: D27987820

Pulled By: mruberry

fbshipit-source-id: c6b86cdc1b89d91eeda2215020137582e7c20c65
2021-04-25 22:12:38 -07:00
9e027d7ea3 [OpInfo] Add opinfo for transpose and its aliases (#56122)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56122

Reviewed By: ezyang

Differential Revision: D27962878

Pulled By: mruberry

fbshipit-source-id: cfd84bb0dcedeb98233a10e2c9754281f7cb76af
2021-04-25 21:58:16 -07:00
298db67220 [OpInfo] Add Function Variant and Opinfo for permute (#56125)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56125

Reviewed By: ezyang

Differential Revision: D27960312

Pulled By: mruberry

fbshipit-source-id: b9dd89f7e69d7dff29f3b53828656c13df898fa5
2021-04-25 21:26:44 -07:00
267b554b6f fx: Fix type_matches for Optional[List[int]] arguments (#56790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56790

If the argument doesn't match `List[int]`, this code falls through to
`issubclass(argument_type, List[int])` which is invalid and raises a
`TypeError`. If this happens during the processing of a `Union` (e.g.
`Optional`), the other union types aren't given the chance to match against the
signature.

This also stops normalize_function from indiscriminately swallowing exceptions,
which had let this bug go unnoticed.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27987746

Pulled By: mruberry

fbshipit-source-id: c5aa5f61a215f0f39925e7053f33bff4b5d5acc2
2021-04-25 20:28:37 -07:00
dde2bc4818 Add OPENSSL_ROOT_DIR to cmake.py (#56846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56846

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27992923

Pulled By: pbelevich

fbshipit-source-id: dc2d26d4bc9d17a5da441ae4db8241609ca97c6e
2021-04-25 20:14:56 -07:00
7b74c3c70a Enable tests for dist profiling with torch.profiler (#56216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56216

Verifies that the newly added distributed profiling works as expected for torch.profiler.

Example trace from `test_ddp_profiling` (trace screenshot omitted):

Note that tests are disabled internally due to an unrelated hang issue but run in OSS.
ghstack-source-id: 127357993

Reviewed By: mrshenli

Differential Revision: D27645105

fbshipit-source-id: 7ddba271acd8f7fbce1f9c5370830d5310314736
2021-04-25 19:41:27 -07:00
2d2370bb61 [Dist profiling] Fix ProcessGroupNCCL collective profiling (#55204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204

Implements a fix discussed offline with pritamdamia87 to run end callbacks after `CUDAFuture`'s wrapCallback has ensured appropriate synchronization. Also enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.

Note that the profiling infrastructure has moved to primarily encourage the use of torch.profiler and CUPTI to trace CUDA kernels; supporting distributed collectives there will require further discussion with ilia-cher. However, this PR improves the usability of torch.autograd.profiler with respect to distributed collectives.

ghstack-source-id: 127357995

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27491711

fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0
2021-04-25 19:40:19 -07:00
70d9be0f42 Replace duplicative s with alpha (#56804)
Summary:
It is always easier to read a document when different objects/concepts are denoted by different variables/representations.
In this PR we make sure that in the [complex autograd](https://pytorch.org/docs/master/notes/autograd.html#autograd-for-complex-numbers) documentation, the variables for the output and the step size are distinct.

Fixes https://github.com/pytorch/pytorch/issues/53633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56804

Reviewed By: anjali411

Differential Revision: D27989959

Pulled By: iramazanli

fbshipit-source-id: c271590ee744c8aeeff62bfaa2295429765ef64e
2021-04-25 16:27:09 -07:00
d4707e260b Infer types (#56832)
Summary:
Addresses:  Infer argument types for functions in JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56832

Reviewed By: pbelevich

Differential Revision: D27979495

Pulled By: nikithamalgifb

fbshipit-source-id: 82156a516c7f96cdd3f7a067d41cb210a6d13a51
2021-04-25 13:01:55 -07:00
e97c17afa0 Update internal code for torch.geqrf (#56250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56250

Moved `apply_geqrf` to `BatchLinearAlgebraKernel.cpp`. Added
`geqrf_stub` dispatch.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27907362

Pulled By: mruberry

fbshipit-source-id: 6719464aef29dcf3bbbde060edf79f1e32fc8ad6
2021-04-25 03:46:59 -07:00
d5ff432615 Add torch.linalg.svdvals (#56684)
Summary:
This PR adds `torch.linalg.svdvals(input, out=None)` that computes only the singular values of `input`.

Resolves https://github.com/pytorch/pytorch/issues/54155.
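
A quick usage sketch:

```python
import torch

a = torch.randn(5, 3, dtype=torch.float64)
s = torch.linalg.svdvals(a)        # shape (3,), singular values in descending order

# Sanity check against the full decomposition:
u, s_full, vh = torch.linalg.svd(a)
assert torch.allclose(s, s_full)
```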

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56684

Reviewed By: albanD

Differential Revision: D27938229

Pulled By: mruberry

fbshipit-source-id: 5ea79ad9cccf818df0fbda1f431299ebf8de3798
2021-04-25 03:42:24 -07:00
58fcf77712 Port CPU torch.geqrf to ATen (#56249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56249

This PR ports `torch.geqrf` from TH to ATen. CUDA path will be
implemented in a follow-up PR.
With the ATen port, support for complex and batched inputs is added.
There were no correctness tests; they are added in this PR, along with
an OpInfo for this operation.

We can implement the QR decomposition as a composition of geqrf and
orgqr (torch.linalg.householder_product).
Also we can implement the least squares solver with geqrf + ormqr +
trtrs. So it's useful to have this function renewed at least for the
internal code.
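
A minimal sketch of that composition (reduced QR of a tall matrix):

```python
import torch

a = torch.randn(5, 3, dtype=torch.float64)
packed, tau = torch.geqrf(a)                        # R above the diagonal, reflectors below
q = torch.linalg.householder_product(packed, tau)   # 5 x 3 with orthonormal columns
r = torch.triu(packed[:3])                          # 3 x 3 upper-triangular factor
assert torch.allclose(q @ r, a)
```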

Resolves https://github.com/pytorch/pytorch/issues/24705

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27907357

Pulled By: mruberry

fbshipit-source-id: 94e1806078977417e7903db76eab9d578305f585
2021-04-25 01:17:00 -07:00
805129f957 enable support for custom error messages in torch.testing (#55890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55890

Proof-of-concept for https://github.com/pytorch/pytorch/pull/55145#issuecomment-817297273

With this the user is able to pass a custom error message to `assert_(equal|close)` which will be used in case the values mismatch. Optionally, a callable can be passed which will be called with mismatch diagnostics and should return an error message:

```python
def make_msg(a, b, info):
    return (
        f"Argh, we found {info.total_mismatches} mismatches! "
        f"That is {info.mismatch_ratio:.1%}!"
    )

torch.testing.assert_equal(torch.tensor(1), torch.tensor(2), msg=make_msg)
```

If you imagine `a` and `b` as the outputs of binary ufuncs, the error message could look like this:

```python
def make_msg(input, torch_output, numpy_output, info):
    return (
        f"For input {input} torch.binary_op() and np.binary_op() do not match: "
        f"{torch_output} != {numpy_output}"
    )

torch.testing.assert_equal(
    torch.binary_op(input),
    numpy.binary_op(input),
    msg=lambda a, b, info: make_msg(input, a, b, info),
)
```

This should make it much easier for developers to find out what is actually going wrong.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27903842

Pulled By: mruberry

fbshipit-source-id: 4c82e3d969e9a621789018018bec6399724cf388
2021-04-24 23:37:44 -07:00
edfbc989d1 add support for equal_nan in torch.testing.assert_close (#55788)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55788

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27903821

Pulled By: mruberry

fbshipit-source-id: c10254b2cdc7c1ae5a31b22913136013f0472b26
2021-04-24 23:37:43 -07:00
27148db5df Add support for scalars and numpy in torch.testing (#55786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55786

Add support to compare scalars as well as `np.ndarray`s with torch.testing. We reuse the matching functionality already in place for tensors by casting the inputs. The approach can easily be extended to other input types as long as they can be cast to a tensor.
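
A hedged sketch of the new behavior (function names per this stack; inputs are cast to tensors before the existing matching logic runs):

```python
import numpy as np
import torch

torch.testing.assert_close(1.0 + 1e-9, 1.0)                              # Python scalars
torch.testing.assert_close(np.array([1.0, 2.0]), np.array([1.0, 2.0]))   # np.ndarrays
```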

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27903814

Pulled By: mruberry

fbshipit-source-id: fe3d063d0c9513cbd8b3408a2023e94c490c817e
2021-04-24 23:37:41 -07:00
dbf3451c6e Add support for checking tensor containers in torch.testing (#55385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55385

This renames `assert_tensors_(equal|close)` to `_check_tensors_(equal|close)` and exposes two new functions: `assert_(equal|close)`. In addition to tensor pairs, the newly added functions also support the comparison of tensors in sequences or mappings. Otherwise their signature stays the same.
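
A hedged sketch of the container support:

```python
import torch

# Tensors inside sequences and mappings are compared pairwise.
actual = {"w": torch.tensor([1.0]), "b": torch.tensor([0.5])}
expected = {"w": torch.tensor([1.0]), "b": torch.tensor([0.5])}
torch.testing.assert_close(actual, expected)

torch.testing.assert_close([torch.zeros(2)], [torch.zeros(2)])
```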

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27903805

Pulled By: mruberry

fbshipit-source-id: 719d19a1d26de8d14cb25846e3d22a6ac828c80a
2021-04-24 23:36:36 -07:00
bcef7ebd60 [NNC] Added matmul for NNC lowering/unified dtypes (#56456)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56456

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27977532

Pulled By: Chillee

fbshipit-source-id: c04372d988c8ef795f27037348a155894c2eddad
2021-04-24 19:15:16 -07:00
710288e413 torch.fft: Document out argument (#56732)
Summary:
An oversight from https://github.com/pytorch/pytorch/issues/49335, the documentation was never updated to include `out` arguments.
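
For reference, a minimal example of the documented argument:

```python
import torch

x = torch.randn(8)
out = torch.empty(8, dtype=torch.complex64)
torch.fft.fft(x, out=out)   # result is written into `out`
```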

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56732

Reviewed By: ezyang

Differential Revision: D27960478

Pulled By: mruberry

fbshipit-source-id: a342a4f590369d6d2e17bed014fa64e49ee72936
2021-04-24 17:14:00 -07:00
6e5ce569bd DOC: add note for torch.clamp() special case min > max See #45664 (#56367)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45664

This PR adds a note to the documentation for `torch.clamp()` to alert users to a special case: If `min` is greater than `max`, all values are set to the `max` value.

Also, an example was added after the first code example, and it is referenced in the note.
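
A minimal illustration of the special case being documented:

```python
import torch

t = torch.tensor([-1.0, 0.0, 1.0])
torch.clamp(t, min=0.5, max=0.25)   # tensor([0.2500, 0.2500, 0.2500])
```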

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56367

Reviewed By: ezyang

Differential Revision: D27960553

Pulled By: mruberry

fbshipit-source-id: 9dc6016ccacebe87c809a0dd9f557b4aea0ae6f5
2021-04-24 17:09:22 -07:00
45692fbef0 [fx splitter][fx net_min] Move Splitter, Minimizer and necessary deps to OSS (#56201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56201

Refactor Splitter and Minimizer to superclass `_SplitterBase` and `_MinimizerBase` and move them to OSS. This is needed to create an OSS example of GPU lowering with those tools.

Test Plan: CI

Reviewed By: jackm321

Differential Revision: D27629598

fbshipit-source-id: 0d4da02105ca509b31f1a6c4a39b1122c2bc7bf0
2021-04-24 15:19:12 -07:00
51bca2ca4d [caffe2] fix -Wrange-loop-construct in onnx_exporter.cc (#56759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56759

```
 caffe2/caffe2/onnx/onnx_exporter.cc:415:21: error: loop variable 'it' creates a copy from type 'const std::pair<const std::basic_string<char>, int>' [-Werror,-Wrange-loop-construct]
    for (const auto it : blob_versions) {
                    ^
caffe2/caffe2/onnx/onnx_exporter.cc:415:10: note: use reference type 'const std::pair<const std::basic_string<char>, int> &' to prevent copying
    for (const auto it : blob_versions) {
         ^~~~~~~~~~~~~~~
                    &
```

Reviewed By: yfeldblum

Differential Revision: D27960126

fbshipit-source-id: fd46f37cf1aca9441209de8eb06add204046db95
2021-04-24 13:13:51 -07:00
4ef8205104 [fx][normalize] Allow for args to be left as args (#55995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55995

Normalization is currently somewhat broken, but making default arguments visible still appears to work and is useful functionality to rely on. This adds an option to `NormalizeArgs`'s `__init__` called `normalize_to_only_use_kwargs`, defaulting to true; when set to false, the provided signature is kept as-is, but default kwargs are additionally set in kwargs.
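
A hedged sketch of the option (assumes a module that fx can trace and normalize):

```python
import torch
import torch.fx
from torch.fx.experimental.normalize import NormalizeArgs

class M(torch.nn.Module):
    def forward(self, x):
        return torch.add(x, 1.0)

traced = torch.fx.symbolic_trace(M())
# False keeps positional args positional while still materializing default kwargs.
normalized = NormalizeArgs(traced, normalize_to_only_use_kwargs=False).transform()
```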

Test Plan: Added test to `test_fx_experimental`.

Reviewed By: 842974287

Differential Revision: D27759448

fbshipit-source-id: 620061fcf46d8549ac70b62aede8b6740aee3778
2021-04-24 08:15:17 -07:00
3fbc15410a Revert D27967517: [pytorch][PR] Use JIT Plug-in for coverage to cover JIT'd functions and methods
Test Plan: revert-hammer

Differential Revision:
D27967517 (88bd0510ef)

Original commit changeset: 53fd8431d772

fbshipit-source-id: 491841dcde629f1e9f8ee38be7366955c03b6e27
2021-04-24 07:53:49 -07:00
c416167fb7 Add tests for CUDAFuture (#56518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56518

I don't think we have any tests for CUDAFuture (I couldn't find any, and I didn't write any in the past). I think especially for the two latest features added by this stack we should have a test to ensure they properly work and to catch regressions. (These tests also add indirect coverage for the more "basic" features of CUDAFuture).

I didn't know how/where to add tests for C++ ATen stuff, so instead I added these tests to the Python RPC suite, using the torch.futures.Future wrapper. (It made sense in my mind because RPC is the main user of CUDAFuture). I'll gladly accept pointers to better ways of doing this.
ghstack-source-id: 127295022

Test Plan: The tests themselves.

Reviewed By: mrshenli

Differential Revision: D27887191

fbshipit-source-id: 4ad6d81e676fe486aa8d329591ee1a3818fea059
2021-04-24 07:07:31 -07:00
a688b29750 Support custom Python classes in CUDAFuture (#56516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56516

One problem with CUDAFuture's extraction of DataPtrs from IValues is that it only supported Python objects that could be converted to "regular" IValues (e.g., lists/dicts/tuples of ints/strings/tensors/...). One notable exception is custom Python classes, which are in fact a very common data type transferred over RPC. The only solution we found for those is to use the Python pickler to extract the tensors contained in them.

We can't insert a Python dependency directly into CUDAFuture, so instead I'm proposing to use the same indirection technique used to support `getSubValues` on Python objects: define some methods on the abstract class `PyObjectHolder` (which can be used by CUDAFuture) but only implement them in the concrete subclass `ConcretePyObjectHolder` (which is only built when Python support is enabled).

I am a bit worried about the performance toll of this (pickling isn't exactly known to be cheap) but I think we should start by providing a functionally complete API. We already have ideas on how to make this faster if needed, for example by having users provide a custom DataPtr extractor tailored to their class via a decorator. (Or just use TorchScript).
ghstack-source-id: 127295014

Test Plan: Added a test later in the stack

Reviewed By: mrshenli

Differential Revision: D27887189

fbshipit-source-id: 9d27e4e62390b836e5bb4f06f401cc002f0cf95b
2021-04-24 07:06:28 -07:00
e4efc0c948 [Static Runtime] Enable check_for_memory_leak in StaticRuntime::benchmark (#56839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56839

Enable check_for_memory_leak at the end of StaticRuntime::benchmark so this code is exercised more often.

Test Plan: Checked with adindexer merge net model

Reviewed By: edvgha

Differential Revision: D27417911

fbshipit-source-id: 5248942dc439fcc7301ffb0005da76374939fa96
2021-04-23 19:54:58 -07:00
34eb6c8589 [Caffe2] ScriptModuleOp support pass_inputs_as_tensor_list (#56813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56813

When the arg `pass_inputs_as_tensor_list` is True, the input tensors are wrapped into a TensorList and passed in as a single param.

Test Plan: buck test //caffe2/caffe2/python:workspace_test -- TestScriptModule

Reviewed By: dzhulgakov

Differential Revision: D27972928

fbshipit-source-id: 5a199649445b0306f3134086c85bd55da45e1a0b
2021-04-23 18:49:57 -07:00
b2b9efb33a .github: Add initial Linux CI for CUDA (#56494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56494

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27953781

Pulled By: seemethere

fbshipit-source-id: bce9298dc40d035bfbb5057e48b99d15c13733bc
2021-04-23 18:09:08 -07:00
060e4c96ee Torchelastic: forbid mp tests running with *san (#56827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56827

The diff makes sure that mp tests are not executed in modes that allow *san, since python mp does not behave well with tsan and asan.

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/launcher/... -- --run-disabled

Reviewed By: cbalioglu

Differential Revision: D27976626

fbshipit-source-id: 7747d67687fa0fd095f799b3708038f672119e73
2021-04-23 17:55:26 -07:00
bd3dda95fd Make old_gpu warning dynamic (#56621)
Summary:
Compute the minimum supported CUDA architecture as the oldest GPU arch in
the arch list supported by the current build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56621

Reviewed By: soumith

Differential Revision: D27920141

Pulled By: malfet

fbshipit-source-id: 71a42dd60c38a658ebad4544bcfb3d2d20e471b5
2021-04-23 17:52:07 -07:00
5d940e2fbc [TSAN] Fix PythonEngine data-race-on-vptr. (#56808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808

For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/

Engine::~Engine() was previously tasked with stopping the threads. This causes a data race on the object's vptr when PythonEngine is being destructed. This fixes the data race by making ~PythonEngine trigger the thread stopping before going down to the base class's destructor.

Test Plan:
Many tests are affected, but here's one example:

buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled

Reviewed By: walterddr, albanD

Differential Revision: D27972384

fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082
2021-04-23 17:39:27 -07:00
2041cd6707 Enable forward/backward compatibility in TS mobile (#56079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56079

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27828149

Pulled By: tugsbayasgalan

fbshipit-source-id: 9291ddbf01853354fca0fa0a58b8115d5d2294da
2021-04-23 16:55:18 -07:00
be7a943bb8 s/AutoDispatchBelowAutograd/AutoDispatchBelowInplaceOrView. (#56657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56657

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27931526

Pulled By: ailzhang

fbshipit-source-id: 3af718df3435e2b0b30bc62070dbdc5aeeecdfb4
2021-04-23 15:50:00 -07:00
375ebd634a [PyTorch] Break up generated tag in source (#56503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56503

The presence of `generated` causes Phabricator and hg to think the file is generated (e.g., hg won't prompt to resolve merge conflicts with an editor). Breaking up the tag is the traditional way to solve this.
ghstack-source-id: 126965382

Test Plan: Review, builds

Reviewed By: ailzhang

Differential Revision: D27887691

fbshipit-source-id: 394a38d50289d64f8801a13f9a28f6f0f37ca59d
2021-04-23 15:46:24 -07:00
5288d05cfd Revert D27958477: [PyTorch][Edge] Add v4 and v5 models and remove unused model
Test Plan: revert-hammer

Differential Revision:
D27958477 (2e4c68a727)

Original commit changeset: 2e6f985a988d

fbshipit-source-id: 520cb8a353d91cd26cb27880a0a8e27dbfcd2d99
2021-04-23 14:42:01 -07:00
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements all_gather_base in PyTorch distributed.

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
5b7317b562 [NNC] API for Buffer Compression (#55853)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54338

This PR adds the following API in NNC to implement "buffer compression".

```
static void compressBuffer(Buf* buf, Stmt* stmt);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55853

Reviewed By: ezyang

Differential Revision: D27960986

Pulled By: navahgar

fbshipit-source-id: a69988e607196f3e2db0212313ea5deefb9859ac
2021-04-23 14:12:03 -07:00
e098515b89 Fix cdist backward for empty inputs (#56606)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56606

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27939201

Pulled By: albanD

fbshipit-source-id: 7ac2b579577cc5b58e714935d791be26478eb83c
2021-04-23 14:08:20 -07:00
0d7e780eff Fix broadcasting of cdist backward (#56605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56605

Fix https://github.com/pytorch/pytorch/issues/55370
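
A sketch of the previously failing pattern (batch dimensions that broadcast):

```python
import torch

a = torch.randn(2, 1, 4, 3, dtype=torch.float64, requires_grad=True)
b = torch.randn(5, 4, 3, dtype=torch.float64)
d = torch.cdist(a, b)        # batch shape broadcasts to (2, 5)
d.sum().backward()           # backward now handles the broadcast correctly
```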

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27939202

Pulled By: albanD

fbshipit-source-id: a4ac50a7b504c24f47f5343414fb57523546a0c7
2021-04-23 14:08:18 -07:00
3ddcc8d833 Add more test cases for cdist OpInfo and TODOs (#56604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56604

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27939203

Pulled By: albanD

fbshipit-source-id: 197de148ba00d217eb0bfc5b5724d23cf6de0910
2021-04-23 14:08:17 -07:00
10fd7d8be6 Add option to OpInfo to skip gradgrad check and empty cdist OpInfo (#56603)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56603

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27939204

Pulled By: albanD

fbshipit-source-id: c7c80551ef3c34c822832891a99104440893ea4c
2021-04-23 14:06:33 -07:00
ed2104fe5c Fixing MAGMA with HIP issues (#56448)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55552

* Root-caused issue to MAGMA kernels
* Issue is fixed on master of MAGMA
MAGMA issue: https://bitbucket.org/icl/magma/issues/43/zgetrf_batched-shfl-kernel-failure-seen-on
* Changing PyTorch to use particular commit sha from master of MAGMA project
* ~~Reactivating skipped ROCm tests~~ : We will reactivate tests in a different PR

Corresponding PyTorch builder PR: https://github.com/pytorch/builder/pull/695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56448

Reviewed By: seemethere

Differential Revision: D27974563

Pulled By: janeyx99

fbshipit-source-id: 25e6f95a20a06d27a5199a623dd7c5db7ca8d6ea
2021-04-23 13:42:29 -07:00
0424f6af93 Local lint fixes - missing steps, pin to bash (#56752)
Summary:
Fixes #56738

* `setup_lint` now installs mypy / shellcheck
* the shell used to execute commands is pinned to `bash` (on Ubuntu the default is `dash`, which was causing the false positives in #56738)
* the emoji check marks don't always work, so use more basic ones instead
* adds `Run autogen` step for mypy (for the `lint` step only since it's pretty slow)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56752

Pulled By: driazati

Reviewed By: samestep

Differential Revision: D27972006

fbshipit-source-id: 624e6c1af2d4f7c8623f420516744922b6b829a5
2021-04-23 13:10:14 -07:00
6de1d9b2d0 Fix bug in emitUse to drop all values that are marked as drop (#56652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56652

The previous code didn't drop prim::Constant values even when they were marked as drop.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27927413

fbshipit-source-id: 67cd52cf292e111be2830ccf93b0e7b089e49001
2021-04-23 12:42:51 -07:00
2e4c68a727 [PyTorch][Edge] Add v4 and v5 models and remove unused model (#56751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56751

## Summary
1. Add two models (v4 and v5) for testing runtime. (v5 will be introduced in https://github.com/pytorch/pytorch/pull/56002)
2. Remove an unused model.

Side note: these binaries are part of the test in https://github.com/pytorch/pytorch/pull/56002, and there is currently an ongoing issue with `ghexport` and binaries (post: https://fb.workplace.com/groups/533197713799375/permalink/1130109004108240/). `ghimport` can work with binaries after checking out the temporary diff (D23336574).

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27958477

Pulled By: cccclai

fbshipit-source-id: 2e6f985a988da55ad08fb9a5037434a2b6db0776
2021-04-23 11:52:42 -07:00
798dd4665d Add a new API replace_input_with to node.py (#55887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55887

Reviewed By: jfix71

Differential Revision: D27731389

fbshipit-source-id: 754654e64c4f3a584dfea06322d833bc11bcc3cc
2021-04-23 11:37:41 -07:00
7d2a9f2dc9 Fix instance norm input size validation + test (#56659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45687

The fix changes the input size check for `InstanceNorm*d` to be more restrictive, correctly rejecting sizes with only a single spatial element regardless of batch size, to avoid infinite variance.
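
A minimal sketch of the newly rejected input:

```python
import torch

m = torch.nn.InstanceNorm1d(3)
m(torch.randn(4, 3, 1))   # one spatial element: now errors regardless of batch size
```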

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56659

Reviewed By: pbelevich

Differential Revision: D27948060

Pulled By: jbschlosser

fbshipit-source-id: 21cfea391a609c0774568b89fd241efea72516bb
2021-04-23 10:53:39 -07:00
7e9f7fb980 [Pytorch Edge] Prepack folding for functions besides forward (#56081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56081
ghstack-source-id: 127205799

Test Plan: unit test. Since I'm prepacking the weights of the same operators multiple times, I wonder if it's a just-works thing?

Reviewed By: kimishpatel

Differential Revision: D27777337

fbshipit-source-id: 909d2a667d9eb51e205536b478a6668c33b3fb15
2021-04-23 10:40:15 -07:00
7ff1990caf [c10d] Increment sequence numbers on collectives. (#55718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55718

Increments sequence numbers when ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective is run, which is a common call all collectives
make. The next step will be to log these along with other collective info in
debug mode as well as integrating them with the process group wrapper.
ghstack-source-id: 127215077

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27690690

fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
2021-04-23 10:06:56 -07:00
ed0a0c3578 Revert D27902824: static runtime support for fb::equally_split
Test Plan: revert-hammer

Differential Revision:
D27902824 (a4e47ea152)

Original commit changeset: 7855047c3bd4

fbshipit-source-id: a46834418ce98826871cd604d1a01f0ff8f23d7f
2021-04-23 10:03:12 -07:00
d1fe68e70b To add single and chained learning schedulers to docs (#56705)
Summary:
In the optimizer documentation, many of the learning rate scheduler [examples](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) are provided according to a generic template. In this PR we provide a precise, simple use-case example showing how to use learning rate schedulers. Moreover, a follow-up example shows how to chain two schedulers one after another.
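
A hedged sketch of chaining two schedulers (hyperparameters are illustrative, not from the docs):

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
s1 = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
s2 = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[2, 4], gamma=0.1)
for epoch in range(5):
    opt.step()
    s1.step()   # both schedulers act on the same optimizer, one after the other
    s2.step()
```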

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56705

Reviewed By: ezyang

Differential Revision: D27966704

Pulled By: iramazanli

fbshipit-source-id: f32b2d70d5cad7132335a9b13a2afa3ac3315a13
2021-04-23 09:36:00 -07:00
88bd0510ef Use JIT Plug-in for coverage to cover JIT'd functions and methods (#56310)
Summary:
This PR is step 2 (after https://github.com/pytorch/pytorch/issues/56708) to having JIT coverage--it actually uses the plug-in in CI!

Disclaimer: note that this will mark the entire JIT'd function/method as covered without seeking proof that the
compiled code has been executed. This means that even if the code chunk is merely compiled and not run, it will get
marked as covered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56310

Test Plan:
We should see coverage improvements in CI afterwards. A file to look out for is `torch/jit/quantized.py`, which should have more coverage after this PR, which it does!
d3283ccd8c/torch/jit/quantized.py vs https://codecov.io/gh/pytorch/pytorch/src/master/torch/jit/quantized.py

More generally, the whole jit folder got a ~3% increase in coverage, I believe.

Reviewed By: ezyang

Differential Revision: D27967517

Pulled By: janeyx99

fbshipit-source-id: 53fd8431d772c2447191135c29d1b166ecd42f50
2021-04-23 09:12:21 -07:00
22b151a3ba Make sure full backward hook fire when no input requires grad (#56693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56380

BC-breaking note:
This changes the behavior of full backward hooks: they will now fire properly even if no input to the Module requires gradients.
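
A hedged sketch of the new behavior:

```python
import torch

m = torch.nn.Linear(3, 3)
m.register_full_backward_hook(lambda mod, gin, gout: print("hook fired"))

x = torch.randn(2, 3)       # the input does not require grad
m(x).sum().backward()       # the hook now fires anyway (the parameters need grads)
```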

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56693

Reviewed By: ezyang

Differential Revision: D27947030

Pulled By: albanD

fbshipit-source-id: e8353d769ba5a2c1b6bdf3b64e2d61308cf624a2
2021-04-23 08:46:49 -07:00
acca89e25f Add more RRef CUDA RPC tests (#56757)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56757

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27959592

Pulled By: mrshenli

fbshipit-source-id: b72c873bcaef4515b0fc8d48ae539477e1850a40
2021-04-23 08:40:41 -07:00
369e8bc4bc Added support for uppercase letters in torch.einsum (#56475)
Summary:
This PR adds support for upper case letters in `torch.einsum` equation.

Addresses PR https://github.com/pytorch/pytorch/pull/55013 here.
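
A quick illustration:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.einsum("iK,Kj->ij", A, B)   # mixed-case subscripts now accepted
assert torch.allclose(C, A @ B)
```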

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56475

Reviewed By: ailzhang

Differential Revision: D27948362

Pulled By: heitorschueroff

fbshipit-source-id: 51cf57b17c4c23d88fab5343f17ba3bfbe3607a5
2021-04-23 08:13:58 -07:00
15ca379bde Add CUDA support to a user-created torch.futures.Future (#56517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56517

Currently a torch.futures.Future can wrap a CUDAFuture, but it cannot create one from scratch. This prevented users from using CUDAFutures in some situations, for example when using `rpc.functions.async_execution`, or in their own code. I don't see any reason for such a limitation, hence here I add support for this.
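
A hedged sketch of the new capability (assumes a CUDA build; the `devices` argument is per this stack):

```python
import torch

fut = torch.futures.Future(devices=["cuda:0"])
fut.set_result(torch.ones(2, device="cuda:0"))
print(fut.wait())
```
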
ghstack-source-id: 127261554

Test Plan: Added a test later in the stack

Reviewed By: mrshenli

Differential Revision: D27887190

fbshipit-source-id: ecbb39c1ad7cd189d478ded9c361448f05a270ad
2021-04-23 08:13:56 -07:00
58d12eb75e Allow to specify a set of device for CUDAFuture (#56515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515

In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution to support RPC remote user functions that created/used CUDA tensors on devices that were not used by their arguments, by defining a "bounding set" of devices when constructing the agent and allowing all functions to freely use any of those devices.

We had the same exact problem with the callbacks of CUDAFuture, and in this PR I'm adopting the same exact solution: I allow to specify a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices will also be propagated to child futures).

I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent, until #56405 lands.
ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
2021-04-23 08:12:41 -07:00
d6a25a58f5 add hardtanh(0,6) to the set of MKLDNN fusible ops for mobilenetv2 (#56203)
Summary:
TODO: post the numbers for mobilenetv2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56203

Reviewed By: malfet

Differential Revision: D27917557

Pulled By: Krovatkin

fbshipit-source-id: acea0f933a7e8c7a036a494295f68222c46a36f7
2021-04-23 08:08:17 -07:00
7b7a4750a9 [PyTorch] Migrate hacky wrapper removal to borrow_from_optional_tensor (#56648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56648

Generated with

```
fastmod -m "^((?P<indent>\s*)// See \[Note: hacky wrapper removal for optional tensor\])
 \s*const Tensor& (?P<varname>[A-Za-z_]+) = c10::value_or_else\((?P<optionalname>[A-Za-z_]+), \[\] \{return Tensor\(\);\}\);" \
 '${1}
 ${indent}c10::MaybeOwned<Tensor> ${varname}_maybe_owned = c10::borrow_from_optional_tensor(${optionalname});
 ${indent}const Tensor& ${varname} = *${varname}_maybe_owned;'
```
ghstack-source-id: 127112928

Test Plan: CI

Reviewed By: wenleix

Differential Revision: D27925837

fbshipit-source-id: 720a4f2e3b96e14c93466698c9c4a3b9c8446a69
2021-04-23 08:05:02 -07:00
f2fd91ccfd [PyTorch] Add & document borrow_from_optional_tensor (#56647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56647

This should be more efficient than the old hacky-wrapper-for-optional-Tensor pattern. Despite appearances, the old pattern did a reference count bump for non-empty optionals. The following diff contains an automated change to migrate callsites.
ghstack-source-id: 127112926

Test Plan: Review, CI on following change

Reviewed By: bhosmer

Differential Revision: D27925838

fbshipit-source-id: 2c6082c5930b1e71b853a75c52873088dbc48167
2021-04-23 08:05:00 -07:00
02c3e6d98a addmm CPU inplace implementation shouldn't resize an input tensor (#56452)
Summary:
`addmm` CPU inplace implementation shouldn't resize an input tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56452

Reviewed By: malfet

Differential Revision: D27925216

Pulled By: ngimel

fbshipit-source-id: 3a4cda62ea59774ddf89f2c0592e9faffa1afe43
2021-04-23 08:03:58 -07:00
e5fda07e80 Fix: Compare input against beta * threshold in softplus backwards (#56484)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55587

The fix converts the binary `TensorIterator` used by softplus backwards to a ternary one, adding in the original input for comparison against `beta * threshold`.
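
A quick check of the fixed region: for `input * beta > threshold` softplus is the identity, so its gradient there must be exactly 1.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([30.0], requires_grad=True)
F.softplus(x, beta=1.0, threshold=20.0).backward()
print(x.grad)   # tensor([1.])
```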

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56484

Reviewed By: malfet

Differential Revision: D27908372

Pulled By: jbschlosser

fbshipit-source-id: 73323880a5672e0242879690514a17886cbc29cd
2021-04-23 07:58:51 -07:00
cyy
83c23703b7 Some simple optimizations (#51831)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51831

Reviewed By: albanD

Differential Revision: D26379122

Pulled By: VitalyFedyunin

fbshipit-source-id: d3562232f8501f2ad0b291586bf7f828e9b47010
2021-04-23 07:55:15 -07:00
0a72904ab4 Torchelastic: make process failure init error non-fatal (#56739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739

The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

    https://fburl.com/tupperware/0nizb9z8

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D27952596

fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
2021-04-23 00:49:47 -07:00
a4e47ea152 static runtime support for fb::equally_split (#56565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
So fb::equally_split will have as many outputs as ListUnpack.

Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test

buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27902824

fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
2021-04-23 00:12:54 -07:00
7c50852a60 moved more lowerings over (#55372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55372

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27884601

Pulled By: Chillee

fbshipit-source-id: 91b00182abb5dcf60209425d2717fa0303cb4932
2021-04-23 00:08:26 -07:00
1f04494c0e Consolidate nondeterministic error tests (#55631)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55631

Reviewed By: malfet

Differential Revision: D27909953

Pulled By: mruberry

fbshipit-source-id: 9115b2433f9c276555be55bd51b270a7a2846829
2021-04-22 23:37:01 -07:00
88deea4e29 [torch.package] is_from_package check (#56729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56729

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27949829

Pulled By: Lilyjjo

fbshipit-source-id: 1159d66d51b8f187c43847a5d449b13683c39eeb
2021-04-22 22:28:07 -07:00
913f1f75b3 Revert "Revert [ONNX] Redesign inplace conversion" (#56675)
Summary:
Adjust how MutationRemover is used to avoid creating aliasDb multiple times for the same graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56675

Reviewed By: pbelevich

Differential Revision: D27945692

Pulled By: SplitInfinity

fbshipit-source-id: a6c548438e88ddee18ef03a6f0461ab9eaaaa829
2021-04-22 22:22:16 -07:00
461e887d92 CPU Convolution benchmark harness for some popular models (#56455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56455

CPU convolution performance is pretty important for inference, so
tracking performance for CNNs often boils down to finding shapes that have
either regressed or need optimization.  This diff adds a benchmark harness that
lets you pretty easily add new sets of convolution parameters to benchmark.

I've started with an exhaustive list of layers from MobileNetV3, ResNet-18 and
ResNet-50, which are fairly popular torchvision models.  More to come if these
prove useful.

I've also added four backend configurations:

- native: uses at::conv2d, which applies its own backend selection heuristics
- mkldnn_none: uses mkldnn but applies no prepacking; uses the NCHW default
- mkldnn_weight: prepacks weights in an mkldnn-friendly format
- mkldnn_input: also prepacks the inputs in NCHW16c
ghstack-source-id: 127027784

Test Plan: Ran this on my Skylake Xeon

Reviewed By: ngimel

Differential Revision: D27876139

fbshipit-source-id: 950e1dfa09a33cc3acc7efd579f56df8453af1f2
2021-04-22 22:14:36 -07:00
f84a50109f Move windows testers to previous image (#56626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56626

Reviewed By: seemethere

Differential Revision: D27920812

Pulled By: malfet

fbshipit-source-id: faa739ca8500654df18cf963707b31c3345132cf
2021-04-22 20:53:41 -07:00
29491f7954 [NNC] Add unroll and flatten APIs which not require return stmt pointer (#56420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56420

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27866118

Pulled By: huiguoo

fbshipit-source-id: f7e44fb20ef3a3c43b95d15f7b3b12e9e5cc89c9
2021-04-22 19:59:34 -07:00
2078836005 Clean up raise exception logic (#55656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55656

### For release notes
What:
 - All errors that are silenced by "raise_exception=False" are now GradcheckError (which inherits from RuntimeError).

Why:
 - Due to a refactor of gradcheck

Workaround:
 - If you catch `RuntimeError` with `except RuntimeError`, no changes are necessary, since GradcheckError inherits from RuntimeError. However, if you explicitly check the error's type via `type(error)`, you'll need to update your code to check for `GradcheckError` instead.

Factors out all the logic handling involving `fail_test`, `raise_exception` into 1) a wrapper around gradcheck that uses try/except 2) gradcheck_helper that always raises exception.
This allows us to avoid having to write the `if not x: return False` logic that is scattered throughout gradcheck currently.
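
A hedged sketch of the new exception type in action (the deliberately wrong backward is illustrative):

```python
import torch
from torch.autograd.gradcheck import GradcheckError

class BadDouble(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 3   # deliberately wrong gradient

x = torch.randn(3, dtype=torch.float64, requires_grad=True)
try:
    torch.autograd.gradcheck(BadDouble.apply, (x,))
except GradcheckError:        # `except RuntimeError` still catches it too
    print("gradcheck failed as expected")
```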

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27920809

Pulled By: soulitzer

fbshipit-source-id: 253aef6d9a3b147ee37a6e37a4ce06437981929a
2021-04-22 19:46:39 -07:00
d01302431c Enable fast gradcheck for real inputs and outputs (#55237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55237

In this PR, we reenable fast-gradcheck and resolve misc issues that arise:
Before landing this PR, land #55182 so that slow tests are still being run periodically.

Bolded indicates the issue is handled in this PR, otherwise it is handled in a previous PR.

**Non-determinism issues**:
- ops that do not have deterministic implementation (as documented https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms)
  - test_pad_cuda (replication_pad2d) (test_nn)
  - interpolate (test_nn)
  - cummin, cummax (scatter_add_cuda_kernel) (test_ops)
  - test_fn_gradgrad_prod_cpu_float64 (test_ops)

Randomness:
  - RRelu (new module tests) - we fix by using our own generator as to avoid messing with user RNG state (handled in #54480)

Numerical precision issues:
- jacobian mismatch: test_gelu (test_nn, float32, not able to replicate locally) - we fixed this by disabling for float32 (handled in previous  PR)
- cholesky_solve (test_linalg): #56235 handled in previous PR
- **cumprod** (test_ops) - #56275 disabled fast gradcheck

Not yet replicated:
 - test_relaxed_one_hot_categorical_2d (test_distributions)

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27920906

fbshipit-source-id: 894dd7bf20b74f1a91a5bc24fe56794b4ee24656
2021-04-22 19:46:37 -07:00
2ea3c24c06 Disable flaky tests (#56279)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56279

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27916606

Pulled By: soulitzer

fbshipit-source-id: 60c07024f6eb818f4aa6730a5f9ff90d7bc2b80f
2021-04-22 19:45:41 -07:00
5c752ead3e Print non-breaking space directly in lint.yml (#56726)
Summary:
After some fun investigating, samestep found that using `\u1234` to produce a unicode character is only supported in bash > 4.2, but macOS ships with bash/sh 3.2, so it was searching for the literal string `u1234`. This fixes the issue by printing out the char directly via its UTF-8 bytes and `printf`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56726

Pulled By: driazati

Reviewed By: SplitInfinity

Differential Revision: D27952866

fbshipit-source-id: 35871e959e250dfdbbdf8b121fc92212bc0614e8
2021-04-22 16:58:12 -07:00
08ce2300bf torch: Add cpython as a dependency for torch_python_obj (#56740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56740

Was running into a race condition where torch_python_obj was attempting
to build before cpython had actually finished installing; this should
resolve that issue.

Only applicable on builds that use the `USE_DEPLOY=ON` option

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27953782

Pulled By: seemethere

fbshipit-source-id: 76dd7c4218870eac97fc4c14e20b46128d264b30
2021-04-22 16:52:29 -07:00
bac4cfd54d Fix mp serialization for integer nn.Parameter on CUDA (#56529)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56529

Reviewed By: albanD

Differential Revision: D27896094

Pulled By: ngimel

fbshipit-source-id: fe817781eb7139ea57c78acfd56e7c11b61eb4ed
2021-04-22 16:21:04 -07:00
febff45900 Support factory kwargs in torch.nn modules (#54508)
Summary:
Continuation of https://github.com/pytorch/pytorch/pull/53144
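
A hedged sketch of the factory kwargs (assumes a CUDA build):

```python
import torch

# device/dtype are forwarded to the factory calls that create the parameters.
layer = torch.nn.Linear(4, 2, device="cuda", dtype=torch.float64)
print(layer.weight.device, layer.weight.dtype)   # cuda:0 torch.float64
```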

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54508

Reviewed By: albanD

Differential Revision: D27939544

Pulled By: jbschlosser

fbshipit-source-id: 4bf517e5f74f093e27ca38a85e732da65e44d805
2021-04-22 16:16:53 -07:00
3a4344a717 Create helper function for RPC profiling in _invoke_rpc and remote (#56643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56643

Refactor enabling rpc profiling logic in `_invoke_rpc` and `remote()` into `_rpc_profiling()` helper function.

Reviewed By: rohan-varma

Differential Revision: D27922286

fbshipit-source-id: 27cfe662a401756f0ee8a3cd45978d933377f78f
2021-04-22 15:15:49 -07:00
1719cb82f3 [quant][graphmode][fx] Support preserving attributes in deepcopy of observed/quantized graphmodule (#56550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56550

Add support for preserving a list of attributes on observed/quantized GraphModule

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_deepcopy_preserve_attributes

Imported from OSS

Reviewed By: vkuzo, kazhang

Differential Revision: D27899317

fbshipit-source-id: ebf21334715e5ab764aaa27eed534cc0cdf9f2b5
2021-04-22 15:02:44 -07:00
3a44d269ac Add periodic_ prefix to all jobs run by cron (#56695)
Summary:
To make them more easily distinguishable in the HUD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56695

Reviewed By: walterddr, samestep

Differential Revision: D27939938

Pulled By: malfet

fbshipit-source-id: e0abd1a6bc931a89f2aa5c6e2d8ebb471c461051
2021-04-22 14:25:17 -07:00
375687839e [sparsity] Moving the sparsity python files to OSS (#56617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56617

This migrates the sparsity to the open source

Test Plan: `buck test mode/opt //caffe2/test:ao`

Reviewed By: raghuramank100

Differential Revision: D27812207

fbshipit-source-id: cc87d9d2b486269901a4ad9b483615741a1cd712
2021-04-22 14:07:31 -07:00
31fe2bbb30 Remove extraneous variables in windows report stats step (#56596)
Summary:
Testing that the CIRCLE variables in the Windows test CI report stats step aren't needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56596

Test Plan: CI

Reviewed By: samestep

Differential Revision: D27948983

Pulled By: janeyx99

fbshipit-source-id: 71f2ca08246eea7580e31fb632612b205fb995fc
2021-04-22 13:45:09 -07:00
5b01b3e8e8 Introducing JitPlugin (#56708)
Summary:
This PR is step 1 to covering JIT'd methods and functions. Step 2 (using it in CI) is here: https://github.com/pytorch/pytorch/issues/56310.

1. This PR introduces a package `coverage_plugins` that hosts JITPlugin.
2. We also bring in a `.coveragerc` file that is used in CI to omit the files we don't want to report on (e.g., temporary directories, tests, or utils).

**Disclaimer: This PR does NOT use the plug-in. Nothing should change as a result.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56708

Test Plan:
CI. Coverage should not go down.

If you're interested in testing this plug-in locally, you should:
`pip install -e tools/coverage_plugins_package` from the root directory.
Add the following lines to `.coveragerc` under `[run]`
```
plugins =
    coverage_plugins.jit_plugin
```
And then try:
`coverage run test/test_jit.py TestAsync.test_async_script_no_script_mod`

You should see `.coverage.jit` show up at the end. You can then run `coverage combine --append` and `coverage debug data` to see that some files in `torch/jit` are covered.

Reviewed By: samestep

Differential Revision: D27945570

Pulled By: janeyx99

fbshipit-source-id: 78732940fcb498d5ec37d4075c4e7e08e96a8d55
2021-04-22 13:41:49 -07:00
2128a84a69 Fix grad_fn bindings when saved variable freed (#56499)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54472

Adds HANDLE_TH_ERRORS to python bindings for grad_fn attrs and updates tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56499

Reviewed By: albanD

Differential Revision: D27920742

Pulled By: soulitzer

fbshipit-source-id: d4f7ac8c0aa2173d25517277c393f8c66de68951
2021-04-22 13:40:40 -07:00
679cc7eb13 Re-enable fast winograd conv on IOS (#56021)
Summary:
This is the proper fix for https://github.com/pytorch/pytorch/issues/38186
husthyc tested locally that it indeed fixes the issue there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56021

Reviewed By: ailzhang

Differential Revision: D27940362

Pulled By: albanD

fbshipit-source-id: 020743315ce055633324ccd751c457e32ea3263d
2021-04-22 13:34:20 -07:00
2ee3f5f812 Copy over test reports before running "report results" for linux test jobs (#56725)
Summary:
This way, if the report-results step fails, the test reports are still saved as artifacts so we can use them to help debug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56725

Test Plan: CI linux test to pass + see that the test reports are copied in the Run tests step

Reviewed By: samestep

Differential Revision: D27948434

Pulled By: janeyx99

fbshipit-source-id: 597a2ba4fe1dca16c7b75a1399600b27f380f5cd
2021-04-22 13:27:14 -07:00
048087d942 make beg_size output deterministic for EmbeddingBag (#56661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56661

Under some conditions (requires_grad = false and mode=SUM), bag_size and max_indices are created via at::empty and never modified, which is why the corresponding outputs are not deterministic and cause tests to fail.

Test Plan: buck test mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --exact 'caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.EmbeddingBag' --run-disabled

Reviewed By: hlu1

Differential Revision: D27931445

fbshipit-source-id: fe9747094027e4e6f7c7b0771c1cd994f94fd554
2021-04-22 11:58:32 -07:00
8b3bf98cb8 Tell codegen that SparseCsrCUDA is cuda (#56602)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/50937, Fixes build failures in https://github.com/pytorch/pytorch/issues/56561

Currently SparseCsrCUDA is included in the CPU build and also doesn't get code-generated device guards. This fixes both issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56602

Reviewed By: albanD

Differential Revision: D27921001

Pulled By: ezyang

fbshipit-source-id: 2b3b0b66d0a7c5ef96e0817d8852d511dd954ae4
2021-04-22 11:57:10 -07:00
b85b89d246 Re-enable test_device_maps_gpu (#56415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56415

closes #53287

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27865438

Pulled By: mrshenli

fbshipit-source-id: 3f7fcba8b799966388cc98ffc349cb62f281c367
2021-04-22 11:50:23 -07:00
0c544ebd24 Revert to ANVTM in jni_lite due to Oculus failure.
Test Plan: FanW123 verified on her Oculus device

Reviewed By: FanW123

Differential Revision: D27943428

fbshipit-source-id: ac1c1ca6b47937f8839ba23c9e3af0843ea086a3
2021-04-22 11:49:01 -07:00
614dce54a6 [iOS GPU] Fix Shader compilation errors for Metal 1.2 (iOS 12) (#56670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56670

`int64_t` is only available for Metal 2.2 and above. `size_t` works fine in those situations. https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
ghstack-source-id: 127169610

Test Plan:
- AIBench
```
buck run mode/mac aibench:run_bench_macos -- -b aibench/specifications/models/pytorch/metal/metal_unet_1001_detection.json --platform ios --framework pytorch --remote --devices D201AP-12.0.1
```

Reviewed By: linbinyu

Differential Revision: D27933297

fbshipit-source-id: 474b1eb191c68101367c9623c855645684434bd7
2021-04-22 11:44:31 -07:00
187a524249 Re-order tests based on changed files (#56666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56666

Addresses some of #56557 by checking for changed files when running tests. This will help deliver signal faster when a failing test is run. It should always be safe to at least try to re-order the tests, so there's no option to turn it off, and any error ends up bailing out of the sorting process. Time saved will change between tests, with more improvement for things that are further down the static list here:

1e9c7ad4cb/test/run_test.py (L32)

The results vary from not much improvement ([before: 11m](https://app.circleci.com/pipelines/github/pytorch/pytorch/307580/workflows/6ab3def6-8d63-4f41-9b8d-9c2c50f6266b/jobs/12712819/steps), [after: 10m](https://app.circleci.com/pipelines/github/pytorch/pytorch/307578/workflows/157407b4-f850-431c-b641-d2ac97916a04/jobs/12712802/steps)) to a lot ([before: 75m](https://app.circleci.com/pipelines/github/pytorch/pytorch/307580/workflows/6ab3def6-8d63-4f41-9b8d-9c2c50f6266b/jobs/12712884/steps), [after: 8m](https://app.circleci.com/pipelines/github/pytorch/pytorch/307578/workflows/157407b4-f850-431c-b641-d2ac97916a04/jobs/12712865/steps)), but overall there shouldn't be any regression in test timing. These results are also probably a little confounded since the test sharding will be different after re-ordering.

As a follow up we can use the target determination logic to figure out which tests to bring to front based on the actual code instead of just edits to test files

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D27934076

Pulled By: driazati

fbshipit-source-id: 747d09ad732289d7693101803d46e9fa8e6d2f59
2021-04-22 10:27:07 -07:00
1dbbbbe904 [doc] FX Graph Mode Quantization - fix preamble (#52192)
Summary:
The preamble here is misformatted and hard to make sense of: https://pytorch.org/docs/master/quantization.html#prototype-fx-graph-mode-quantization

This PR is trying to make things easier to understand.

As I'm new to this, please verify that my modifications remain in line with the original intent.

Thanks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52192

Reviewed By: ailzhang

Differential Revision: D27941730

Pulled By: vkuzo

fbshipit-source-id: 6c4bbf7c87d8fb87ab5d588b690a72045752e47a
2021-04-22 10:20:31 -07:00
f0958f4748 [c10d] Add requires_gloo decorator to test_logging_init (#56682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56682

The process group is created on a gloo backend.

Context: https://github.com/pytorch/pytorch/pull/56598#discussion_r618206941
ghstack-source-id: 127150910

Test Plan: buck test caffe2/test/distributed:c10d -- test_logging_init

Reviewed By: pbelevich

Differential Revision: D27936805

fbshipit-source-id: 932efc638f94bdf78ddbae291e3720a20e43f2af
2021-04-22 10:14:41 -07:00
036becf29c Disable TestComplexity.test_nn_module_test in fbcode (#56677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56677

This has been failing with `RecursionError: maximum recursion depth
exceeded while calling a Python object` in fbcode for a while now.  Obviously
this isn't a fix, but the test works in OSS, so...
ghstack-source-id: 127146338

Test Plan:
```
buck test mode/dev //caffe2/test:jit -- --exact 'caffe2/test:jit - test_nn_module_tests (jit.test_complexity.TestComplexity)' --run-disabled
```

Reviewed By: Lilyjjo

Differential Revision: D27934963

fbshipit-source-id: 21d9858dab9ca1ebb5b67f286e788662dd24a988
2021-04-22 10:01:45 -07:00
c6d004125e Port all non-float unary operators to structured (and rsqrt) (#56151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56151

I missed rsqrt in the last PR. The native_functions.yaml changes
were generated with the following script:

```
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import CommentMark
from tools.codegen.model import *  # noqa: F403

with open("aten/src/ATen/native/native_functions.yaml", "r") as f:
    contents = f.read()

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)

convert = '''\
rsqrt
bitwise_not
frac
i0
round
'''.split()

for e in r:
    f = NativeFunction.from_yaml(e, Location("", 0))
    if f.structured or f.structured_delegate is not None:
        continue
    n = f.func.name.name.base
    if n not in convert:
        continue
    # mutate e to make changes
    if f.func.kind() == SchemaKind.out:
        e.insert(1, 'structured', True)
        e.insert(2, 'structured_inherits', 'TensorIteratorBase')
    else:
        # TODO: The .out overload assumption is not sound in general
        e.insert(1, 'structured_delegate', f'{n}.out')

        if 'dispatch' in e:
            e['dispatch'].pop('CPU', None)
            e['dispatch'].pop('CUDA', None)
            e['dispatch'].pop('CPU, CUDA', None)
            e['dispatch'].pop('CompositeExplicitAutograd', None)
        else:
            print(n)

        *_, last_k = e.keys()
        needs_fixup = False

        if 'dispatch' in e and not e['dispatch']:
            if last_k == 'dispatch':
                needs_fixup = True
            del e['dispatch']

        # Manually fix up newlines at the end, because ruamel
        # made some bad life choices about where to associate trailing
        # whitespace for nested dicts; see
        # https://stackoverflow.com/questions/42172399/modifying-yaml-using-ruamel-yaml-adds-extra-new-lines
        if needs_fixup:
            *_, last_k = e.keys()
            # post_key, pre_key, post_value, pre_value
            e.ca.items[last_k] = [None, None, CommentToken('\n\n', CommentMark(0), None), None]

with open("aten/src/ATen/native/native_functions.yaml.new", "w") as f:
    yaml.dump(r, f)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27795078

Pulled By: ezyang

fbshipit-source-id: c8961b58753c12f985d786eae73f776c39d30e6e
2021-04-22 09:57:23 -07:00
86ae22d85d [torch.Package] Folder has_file() method (#56584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56584

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27909314

Pulled By: Lilyjjo

fbshipit-source-id: facc89735ab67c87f0ec7653d8ccc359f98d4e0d
2021-04-22 09:52:35 -07:00
dfb65146e5 Add RELEASE.md (#56520)
Summary:
The purpose of this document is to outline our current release process
so that users coming into the project have a better idea of how it
works and how they can help contribute to it.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56520

Reviewed By: janeyx99

Differential Revision: D27890571

Pulled By: seemethere

fbshipit-source-id: 882a565ea8d9b9a46c9242be7cf79dede2bae63f
2021-04-22 09:43:29 -07:00
8cf85a1152 [DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56528

Searched across internal and external usage of DataLoader; people haven't started using the `generator` argument for `DataLoader` yet.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27908487

Pulled By: ejguan

fbshipit-source-id: 14c83ed40d4ba4dc988b121968a78c2732d8eb93
2021-04-22 09:40:45 -07:00
aec83ff45e [DataLoader] Add Numpy seeding to worker of DataLoader (#56488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56488

Considering the number of requests for this feature, this introduces NumPy seeding as the default within each DataLoader worker.

## BC-breaking Note:
- By introducing a default numpy.random seeding strategy for DataLoader workers, users don't need to manually set the seed for workers via `worker_init_fn`. This PR won't affect users who currently use `worker_init_fn` to set a customized seed for workers.
- DataLoader will preserve reproducibility for users who are using numpy.random within a Dataset.
- Multiprocessing (without `worker_init_fn` to define a seed for NumPy)
  - Start method `spawn`: Each worker will now have a proper seed for NumPy random, rather than a seed derived from the import time of the NumPy module, which previously made the DataLoader lose reproducibility.
  - Start method `fork`: Each worker not only gets the same benefit as with `spawn`, but also gets a different NumPy seed by default, rather than inheriting the same seed.

Using the following Dataset and script as an example:
```py
import numpy as np
import torch
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, ind):
        item = [ind, np.random.randint(1, 10000)]
        return item

    def __len__(self):
        return 20

if __name__ == '__main__':
    ctx = mp.get_context('fork')
    ds = RandomDataset()
    g = torch.Generator()
    g.manual_seed(0)
    dl = DataLoader(ds, 2, shuffle=False, num_workers=4, multiprocessing_context=ctx, generator=g)

    epochs = 2
    for _ in range(epochs):
        for batch in dl:
            print(batch)
        print("====" * 10)
```

### 1.8.1:
Each worker generates the same random result per iteration, and the seed is reset to the same value each epoch.
```py
tensor([[   0, 7449],
        [   1, 1519]])
tensor([[   2, 7449],
        [   3, 1519]])
tensor([[   4, 9645],
        [   5, 2387]])
tensor([[   6, 9645],
        [   7, 2387]])
tensor([[   8, 3118],
        [   9, 4552]])
=========================
tensor([[   0, 7449],
        [   1, 1519]])
tensor([[   2, 7449],
        [   3, 1519]])
tensor([[   4, 9645],
        [   5, 2387]])
tensor([[   6, 9645],
        [   7, 2387]])
tensor([[   8, 3118],
        [   9, 4552]])
=========================
```

### This PR:
Each worker has a different seed at the beginning and is re-seeded each epoch.
```py
tensor([[   0, 8715],
        [   1, 5555]])
tensor([[   2, 6379],
        [   3, 1432]])
tensor([[   4, 3271],
        [   5, 5132]])
tensor([[   6, 4287],
        [   7, 1104]])
tensor([[   8, 8682],
        [   9, 1699]])
=========================
tensor([[   0, 1374],
        [   1,  996]])
tensor([[   2,  143],
        [   3, 3507]])
tensor([[   4, 5887],
        [   5, 4730]])
tensor([[   6, 7274],
        [   7,  738]])
tensor([[   8, 6374],
        [   9, 1572]])
=========================
```
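
For reference, the kind of manual per-worker seeding that `worker_init_fn` users have been writing (and can keep writing for custom schemes) looks roughly like this sketch:

```py
import numpy as np
import torch

def seed_worker(worker_id):
    # Derive a per-worker NumPy seed from the torch seed that the
    # DataLoader already assigns to each worker process.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

# DataLoader(ds, batch_size=2, num_workers=4, worker_init_fn=seed_worker)
```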

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27908486

Pulled By: ejguan

fbshipit-source-id: 5f313a30563bedeb88be214fa4beca0cefe9e4f4
2021-04-22 09:39:33 -07:00
bc3d892c20 README: Minor improvements (#56193)
Summary:
* Visual studio versions: clarify and shorten.
* Remove obsolete note about a bug that has been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56193

Reviewed By: albanD

Differential Revision: D27939766

Pulled By: ezyang

fbshipit-source-id: e142ec04ba98d5468f28ddf2e8bba5d99d3cfc26
2021-04-22 09:30:23 -07:00
21fd5f4b79 Document current deploy cpython build #56490 (#56600)
Summary:
Call out the issues with cpython deps and suggest a workaround.

Fixes https://github.com/pytorch/pytorch/issues/56490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56600

Reviewed By: albanD

Differential Revision: D27920647

Pulled By: wconstab

fbshipit-source-id: 61a53a176eaf42a6166d649d3cb0fdfa2489e9d2
2021-04-22 09:02:29 -07:00
78022aa62c Add more model symbolic tracing tests from torchvision (#55744)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55398

Generates tests that call `symbolic_trace` on torchvision models and verify the parity of outputs from the eager model, `fx.GraphModule`, and `jit.ScriptModule`.

Test errors: GoogleNet and Inception models throw a type mismatch when scripting the traced `fx.GraphModule`.
```
Return value was annotated as having type __torch__.torchvision.models.googlenet.GoogLeNetOutputs but is actually of type Tensor:
    dropout = self.dropout(flatten);  flatten = None
    fc = self.fc(dropout);  dropout = None
    return fc
    ~~~~~~~~~ <--- HERE
```

Relevant type-inconsistency 512ea299d4/torchvision/models/googlenet.py (L200)
```
    @torch.jit.unused
    def eager_outputs(self, x: Tensor, aux2: Tensor, aux1: Optional[Tensor]) -> GoogLeNetOutputs:
        if self.training and self.aux_logits:
            return _GoogLeNetOutputs(x, aux2, aux1)
        else:
            return x   # type: ignore[return-value]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55744

Reviewed By: albanD

Differential Revision: D27920595

Pulled By: suraj813

fbshipit-source-id: 01f6f2aef7badbde29b5162a7787b5af9398090d
2021-04-22 08:54:06 -07:00
9be2cabc45 Pass contiguous weight to NNPACK convolution (#56569)
Summary:
Added TestNN.test_conv2d_discontiguous_weight to prevent further regressions

Fixes https://github.com/pytorch/pytorch/issues/55781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56569

Reviewed By: ngimel

Differential Revision: D27926509

Pulled By: malfet

fbshipit-source-id: fa5ce943c3e4db4aa4de1b1cba35bd399fb3c54d
2021-04-22 08:45:24 -07:00
690c8b434f [static runtime] binding for aten::sub_out (#56656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56656

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```
```
Time per node type:
        1.85766 ms.    35.7817%. fb::sigrid_transforms_torch_bind (1 nodes)
         1.1238 ms.    21.6464%. aten::linear (6 nodes)
       0.858116 ms.    16.5288%. aten::argmin (1 nodes)
       0.334183 ms.    6.43694%. aten::matmul (1 nodes)
       0.173697 ms.     3.3457%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.118827 ms.    2.28881%. fb::clip_ranges_gather (263 nodes)
       0.101348 ms.    1.95215%. aten::sub (1 nodes)
      0.0748209 ms.    1.44118%. aten::repeat (1 nodes)
      0.0582576 ms.    1.12214%. aten::norm (1 nodes)
      0.0474353 ms.   0.913686%. fb::batch_box_cox (1 nodes)
      0.0457588 ms.   0.881393%. aten::__getitem__ (506 nodes)
      0.0435175 ms.   0.838222%. prim::TupleUnpack (254 nodes)
      0.0425416 ms.   0.819425%. aten::sigmoid (2 nodes)
      0.0383822 ms.   0.739308%. fb::offsets_to_ranges (253 nodes)
      0.0330187 ms.   0.635996%. aten::mul (3 nodes)
       0.027534 ms.   0.530352%. fb::simple_embedding_bag_sum (3 nodes)
      0.0274914 ms.   0.529532%. aten::pow (1 nodes)
      0.0236733 ms.   0.455989%. fb::casted_batch_one_hot_lengths (1 nodes)
       0.023348 ms.   0.449723%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0193511 ms.   0.372735%. aten::sum (3 nodes)
      0.0188839 ms.   0.363737%. prim::DictConstruct (2 nodes)
      0.0183191 ms.   0.352858%. prim::TupleConstruct (1 nodes)
      0.0119029 ms.    0.22927%. aten::div (1 nodes)
      0.0103263 ms.   0.198902%. static_runtime::to_copy (8 nodes)
     0.00977658 ms.   0.188314%. prim::ListConstruct (4 nodes)
     0.00924042 ms.   0.177986%. fb::sigrid_hash_precompute (1 nodes)
     0.00692162 ms.   0.133322%. aten::contiguous (1 nodes)
     0.00567485 ms.   0.109307%. aten::narrow (4 nodes)
     0.00362285 ms.  0.0697823%. aten::logit (1 nodes)
     0.00329995 ms.  0.0635627%. aten::add (1 nodes)
     0.00285633 ms.  0.0550178%. aten::full (1 nodes)
     0.00268469 ms.  0.0517118%. fb::gather_ranges (4 nodes)
     0.00248577 ms.  0.0478803%. aten::stack (1 nodes)
     0.00241782 ms.  0.0465715%. aten::relu (1 nodes)
     0.00233674 ms.  0.0450096%. aten::clamp_min (1 nodes)
     0.00222238 ms.  0.0428068%. static_runtime::reshape_copy (2 nodes)
     0.00171177 ms.  0.0329716%. aten::size (3 nodes)
     0.00120008 ms.  0.0231155%. aten::expand_as (1 nodes)
     0.00112628 ms.  0.0216942%. fb::clip_ranges (2 nodes)
     0.00103193 ms.  0.0198768%. fb::lengths_to_offsets (3 nodes)
    0.000598624 ms.  0.0115305%. static_runtime::flatten_copy (1 nodes)
    0.000236196 ms. 0.00454954%. prim::device (1 nodes)
        5.19164 ms. in Total
StaticRuntime setup time: 0.000868 ms
Memory allocation time: 0.0109619 ms
Memory deallocation time: 0.071791 ms
Outputs deallocation time: 0.0560187 ms
Total memory managed: 1232320 bytes
Total number of reused tensors: 32
W0421 17:40:52.053653 1746499 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 17:40:52.053757 1746499 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 17:40:52.053779 1746499 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 17:40:52.185776 1746499 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 131.985. Iters per second: 7.57661
I0421 17:40:52.337853 1746499 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27929253

fbshipit-source-id: 5a7984ba3ce2d6d4bce0a0ab6c5e09e8c037b44e
2021-04-22 08:40:35 -07:00
3355c30f91 Always run all the grep-based quick-checks steps (#56700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56700

Reviewed By: walterddr

Differential Revision: D27940638

Pulled By: samestep

fbshipit-source-id: 54311ef45ec051ee29d934d501e83b3542bbb439
2021-04-22 08:35:43 -07:00
47d2edd597 Fix quick-checks for operator-schemas (#56692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56692

Reviewed By: heitorschueroff

Differential Revision: D27939830

Pulled By: malfet

fbshipit-source-id: 67a054de5c58832fcd7d0df0dd37faf1ea1406fd
2021-04-22 08:11:29 -07:00
bdb421895a Remove some wildcards from mypy configs (#56645)
Summary:
See https://github.com/pytorch/pytorch/pull/56523#issuecomment-823562134 for context. Basically the idea is that people (including myself) keep assuming that the single-asterisk `*` wildcard means "match in this directory and in its subdirectories", which is _not_ true. Removing the wildcards thus reduces confusion.
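
For intuition, `pathlib` globbing behaves the same way, assuming it is a fair analogy for the pattern matching here:

```py
from pathlib import Path

# '*' stops at directory boundaries:
direct = list(Path("tools").glob("autograd/*.py"))        # direct children only
recursive = list(Path("tools").glob("autograd/**/*.py"))  # also subdirectories
```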

Ideally I would like to remove _all_ of these wildcards and then add a lint to disallow them in the future (and also greatly simplify the pattern-matching logic in `tools/mypy_wrapper.py`; see https://github.com/pytorch/pytorch/issues/55702 for context), but currently this one can't be removed:

```
tools/autograd/*.py,
```

That is because there is a file called `tools/autograd/templates/annotated_fn_args.py` (added in https://github.com/pytorch/pytorch/issues/41575) which is not a valid Python file and thus cannot be checked by `mypy`. ezyang would it be possible to rename that file to use a suffix other than `.py`?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56645

Test Plan:
```
$ mypy
Success: no issues found in 1317 source files
$ mypy --config=mypy-strict.ini
Success: no issues found in 72 source files
```
The numbers of source files should be the same before and after this PR.

Reviewed By: ezyang

Differential Revision: D27925207

Pulled By: samestep

fbshipit-source-id: c17faf73665a75393d3109346a1138c2af023abb
2021-04-22 07:51:01 -07:00
1f0223d6bb Fix bug in gaussian_nll_loss (#56469)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53964. cc albanD almson

## Major changes:
- Overhauled the actual loss calculation so that the shapes are now correct (in functional.py)
- added the missing doc in nn.functional.rst

## Minor changes (in functional.py):
- I removed the previous check on whether input and target were the same shape. This is to allow for broadcasting, say when you have 10 predictions that all have the same target.
- I added some comments to explain each shape check in detail. Let me know if these should be shortened/cut.

Screenshots of updated docs attached.
Let me know what you think, thanks!

## Edit: Description of change of behaviour (affecting BC):
Backwards compatibility is only affected for the `reduction='none'` mode, which was the source of the bug. For tensors with size (N, D), the old returned loss had size (N), as incorrect summation was happening. It will now have size (N, D) as expected.

### Example
Define input tensors, all with size (2, 3).
`input = torch.tensor([[0., 1., 3.], [2., 4., 0.]], requires_grad=True)`
`target = torch.tensor([[1., 4., 2.], [-1., 2., 3.]])`
`var = 2*torch.ones(size=(2, 3), requires_grad=True)`

Initialise loss with reduction mode 'none'. We expect the returned loss to have the same size as the input tensors, (2, 3).
`loss = torch.nn.GaussianNLLLoss(reduction='none')`

Old behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([3.7897, 6.5397], grad_fn=<MulBackward0>). This has size (2).`

New behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([[0.5966, 2.5966, 0.5966], [2.5966, 1.3466, 2.5966]], grad_fn=<MulBackward0>)`
`# This has the expected size, (2, 3).`

To recover the old behaviour, sum along all dimensions except for the 0th:
`print(loss(input, target, var).sum(dim=1))`
`# Gives tensor([3.7897, 6.5397], grad_fn=<SumBackward1>).`
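
As a sanity check, the `reduction='none'` values above can be reproduced by hand from the usual Gaussian NLL formula `0.5 * (log(var) + (input - target)**2 / var)` (ignoring the eps clamp and the optional constant term):

```py
import torch

input = torch.tensor([[0., 1., 3.], [2., 4., 0.]])
target = torch.tensor([[1., 4., 2.], [-1., 2., 3.]])
var = 2 * torch.ones(2, 3)

loss = 0.5 * (torch.log(var) + (input - target) ** 2 / var)
print(loss)  # matches the (2, 3) output shown above
```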

![doc1](https://user-images.githubusercontent.com/26558092/115391089-f7f47b00-a1d6-11eb-8726-e4da9057aee0.png)
![doc2](https://user-images.githubusercontent.com/26558092/115391094-f925a800-a1d6-11eb-954b-afd187f42bc7.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56469

Reviewed By: jbschlosser, agolynski

Differential Revision: D27894170

Pulled By: albanD

fbshipit-source-id: 197890189c97c22109491c47f469336b5b03a23f
2021-04-22 07:43:48 -07:00
76214bb464 Add OpInfo for torch.baddbmm (#56502)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56502

Reviewed By: heitorschueroff

Differential Revision: D27890939

Pulled By: anjali411

fbshipit-source-id: 072647a05cf93aedb76df0367af71b534be77258
2021-04-22 07:00:52 -07:00
49df8993c4 Port scatter and scatter_add to OpInfo (#56140)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54302
Tracking Issue https://github.com/pytorch/pytorch/issues/54261

**Summary:**
- Port `scatter` and `scatter_add` tests to `OpInfo`
- `masked_scatter` was already ported to `OpInfo`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56140

Reviewed By: malfet

Differential Revision: D27918038

Pulled By: heitorschueroff

fbshipit-source-id: 80b507fe8761cd15c967c85e0c289b568b877573
2021-04-22 06:53:52 -07:00
0df239e550 [FX] Make arg normalization a method on Node and not a pass (also augment tests to be exhaustive) (#55992)
Summary:
Commandeered from https://github.com/pytorch/pytorch/pull/54563

Primary changes from first PR:
1. Refactored primary `normalize_function` logic into `operator_schemas.py` so that non-FX users can use it (see the sketch after this list).
2. Refactored tests a bit, and added a path to call `normalize_function` directly.
3. Moved check for `boolean_dispatch` so that `torch.lu` also gets properly handled.
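
A rough sketch of the direct `normalize_function` path; the module location follows change 1 above, but the exact call details are illustrative:

```py
import torch
from torch.fx.operator_schemas import normalize_function

x, y = torch.randn(3), torch.randn(3)
# Resolve positional args against the op's schema into explicit kwargs;
# returns None if no schema can be matched unambiguously.
norm = normalize_function(torch.add, (x, y), {},
                          normalize_to_only_use_kwargs=True)
print(norm)  # e.g. ArgsKwargsPair(args=(), kwargs={'input': ..., 'other': ...})
```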

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55992

Reviewed By: mruberry

Differential Revision: D27774396

Pulled By: Chillee

fbshipit-source-id: 7f65632e1d608e4abd55aec5ccbfdc3f67f52b8e
2021-04-22 03:53:41 -07:00
81b59211d4 [static runtime] binding for aten::div_out (#56653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56653

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.48563 ms.    35.9861%. fb::sigrid_transforms_torch_bind (1 nodes)
        0.92385 ms.    22.3783%. aten::linear (6 nodes)
       0.681066 ms.    16.4974%. aten::argmin (1 nodes)
       0.239311 ms.    5.79679%. aten::matmul (1 nodes)
       0.140157 ms.    3.39501%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
      0.0951568 ms.    2.30497%. fb::clip_ranges_gather (263 nodes)
      0.0835801 ms.    2.02455%. aten::sub (1 nodes)
       0.054081 ms.       1.31%. aten::repeat (1 nodes)
      0.0424465 ms.    1.02818%. aten::norm (1 nodes)
      0.0389049 ms.   0.942389%. fb::batch_box_cox (1 nodes)
      0.0346992 ms.   0.840514%. aten::__getitem__ (506 nodes)
      0.0341335 ms.    0.82681%. prim::TupleUnpack (254 nodes)
      0.0306839 ms.   0.743252%. aten::sigmoid (2 nodes)
      0.0280489 ms.   0.679426%. aten::mul (3 nodes)
      0.0265321 ms.   0.642684%. fb::offsets_to_ranges (253 nodes)
      0.0207622 ms.    0.50292%. aten::pow (1 nodes)
      0.0202067 ms.   0.489465%. fb::simple_embedding_bag_sum (3 nodes)
      0.0195497 ms.    0.47355%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0184351 ms.   0.446551%. fb::concat_add_mul_replacenan_clip (1 nodes)
       0.016382 ms.    0.39682%. aten::sum (3 nodes)
      0.0158651 ms.   0.384299%. prim::TupleConstruct (1 nodes)
      0.0150918 ms.   0.365567%. prim::DictConstruct (2 nodes)
     0.00858005 ms.   0.207833%. aten::div (1 nodes)
     0.00810684 ms.   0.196371%. fb::sigrid_hash_precompute (1 nodes)
     0.00796325 ms.   0.192893%. static_runtime::to_copy (8 nodes)
     0.00782038 ms.   0.189432%. prim::ListConstruct (4 nodes)
      0.0057504 ms.   0.139291%. aten::contiguous (1 nodes)
      0.0044688 ms.   0.108247%. aten::narrow (4 nodes)
     0.00284054 ms.   0.068806%. aten::logit (1 nodes)
     0.00265049 ms.  0.0642024%. aten::add (1 nodes)
     0.00216242 ms.    0.05238%. aten::full (1 nodes)
     0.00207732 ms.  0.0503187%. aten::relu (1 nodes)
     0.00198412 ms.   0.048061%. fb::gather_ranges (4 nodes)
     0.00176954 ms.  0.0428632%. aten::stack (1 nodes)
     0.00175913 ms.  0.0426112%. static_runtime::reshape_copy (2 nodes)
      0.0016996 ms.  0.0411692%. aten::clamp_min (1 nodes)
     0.00128528 ms.  0.0311331%. aten::size (3 nodes)
    0.000849156 ms.   0.020569%. aten::expand_as (1 nodes)
    0.000757672 ms.   0.018353%. fb::clip_ranges (2 nodes)
    0.000596224 ms.  0.0144423%. fb::lengths_to_offsets (3 nodes)
    0.000442632 ms.  0.0107218%. static_runtime::flatten_copy (1 nodes)
    0.000196158 ms. 0.00475151%. prim::device (1 nodes)
        4.12833 ms. in Total
StaticRuntime setup time: 0.000451 ms
Memory allocation time: 0.0089336 ms
Memory deallocation time: 0.0578358 ms
Outputs deallocation time: 0.0431742 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 31
W0421 16:56:34.220682 1522800 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 16:56:34.220772 1522800 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 16:56:34.220791 1522800 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 16:56:34.366667 1522800 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 145.863. Iters per second: 6.85573
I0421 16:56:34.514202 1522800 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27927731

fbshipit-source-id: 595883a31ba0cadf6449799d47bf2294a1d05b41
2021-04-22 01:38:24 -07:00
57cba8e601 Use at::cpu in bench_approx (#56563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56563

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27902737

Pulled By: bertmaher

fbshipit-source-id: 66962671afbb093d5ae0b9308a401536c06ce8f5
2021-04-21 22:56:07 -07:00
426852b4f0 Split test_c10d_spawn.py to test_c10d_spawn_gloo.py,test_c10d_spawn_nccl.py (#56599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56599

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913955

fbshipit-source-id: 7206e589fb7d08c55d08a58a3d57dc3d210a795e
2021-04-21 22:11:49 -07:00
5cc75e46fa Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
2021-04-21 22:10:41 -07:00
d24314bd2c Update Kineto submodule and use new metadata api (#56432)
Summary:
Update Kineto submodule and use new metadata api

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56432

Test Plan: CI

Reviewed By: chaekit

Differential Revision: D27871570

Pulled By: ilia-cher

fbshipit-source-id: 3556787f07a9c9e138666a62ee4cd23af6d7473b
2021-04-21 21:50:13 -07:00
1b87274460 [iOS GPU][Design] Support multiple tensors as outputs (#56072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56072

Currently, we don't support outputting more than one tensor on GPU. For example, if you do

```
auto x = at::rand({1, 4, 2, 2}).metal();
auto y = at::chunk(x, 2, 1);  // y is a vector of tensors
auto output1 = y[0].cpu();
auto output2 = y[1].cpu();
```
In the example above, when it hits `y[0].cpu()`, the command buffer is committed to move `y[0]` from GPU to CPU. By the time it hits `y[1].cpu()`, the command buffer has already become invalid, and the temporary image backing `y[1]` (the would-be `output2`) has been recycled. Thus, a runtime exception is thrown.

The way we address this is with the observer pattern:
1. Before we flush the command buffer, we'll notify the observers (a.k.a. `MPSImageWrapper` objects) that hold the temporary images.
2. When the observers receive the notification, they'll turn the current temporary images into static images.
3. Now, when `.cpu()` happens, the output tensor can just read the data directly from the static image generated in the above step.
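
A language-agnostic sketch of that flow, written here in Python with purely illustrative names:

```py
class CommandBuffer:
    """Stand-in for the Metal command buffer."""

    def __init__(self):
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def remove_observer(self, observer):
        self._observers.remove(observer)

    def flush(self):
        # Step 1: before committing, notify every holder of a temporary image
        for observer in list(self._observers):
            observer.on_commit()
        self._observers.clear()

class ImageWrapper:
    """Stand-in for MPSImageWrapper: holds a temporary image."""

    def __init__(self, buffer, image):
        self.image = image
        self.is_static = False
        buffer.add_observer(self)

    def on_commit(self):
        # Step 2: promote the temporary image to a static one (the real code
        # copies the texture), so a later .cpu() read (step 3) still works
        self.image = ("static", self.image)
        self.is_static = True
```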

You may be wondering whether this has a hidden cost where all the intermediate tensors hold unused static images. The answer is no. All intermediate tensors are released once their reference counts drop to zero. Since `MetalTensorImpl` subclasses `TensorImpl`, we override the `release_resource` method, which gives us a chance to release the underlying storage (textures and buffers) and remove observers from the command buffer. Therefore, once the intermediate tensors go away, the temporary images are recycled immediately.
ghstack-source-id: 127079751

Test Plan:
- We'll be using `at::chunk` to test this in the following diffs, as it returns a tuple that contains multiple tensors.
- Sandcastle CI
- CircleCI

Reviewed By: dreiss

Differential Revision: D27165886

fbshipit-source-id: 290b0d77b1dc74990b25cbd0abb775df1ab47ca0
2021-04-21 21:15:34 -07:00
36828aa0ff Revert D27866138: [ONNX] Redesign inplace conversion (#55033)
Test Plan: revert-hammer

Differential Revision:
D27866138 (24ff92f76d)

Original commit changeset: ab5c9188740c

fbshipit-source-id: b99bf5b12e109089ebd5748c1dc152c6af1cebdb
2021-04-21 21:11:06 -07:00
a1299a2802 Disable Windows GPU testing (#56655)
Summary:
Until https://github.com/pytorch/pytorch/issues/56654 is resolved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56655

Reviewed By: ilia-cher

Differential Revision: D27929002

Pulled By: malfet

fbshipit-source-id: af741d67e4c938f632afad29e675533e1fcb445d
2021-04-21 20:46:11 -07:00
df1dfd879e Fix errors when initializing Linear with 0 in_features (#56505)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48152
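
A minimal sketch of the now-working behavior (my reading of the issue; the exact pre-fix failure mode is per #48152):

```py
import torch
import torch.nn as nn

layer = nn.Linear(in_features=0, out_features=3)  # previously errored during init
out = layer(torch.empty(2, 0))
print(out.shape)  # torch.Size([2, 3]); with no input features, output is just the bias
```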

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56505

Reviewed By: malfet

Differential Revision: D27919590

Pulled By: jbschlosser

fbshipit-source-id: 462ca280051f63c31ff588c38a9e436116c0f336
2021-04-21 20:42:32 -07:00
76fbd755c1 Reland of "D27708346: generate xla codegen in-tree" (#56601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56601

Updating it to ensure that RegistrationDeclarations.yaml is completely
unchanged

This reverts commit 90e532f3ef17a9611e9e7a9f1f6189d4168bf084.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27915305

Pulled By: bdhirsh

fbshipit-source-id: 491a025c44221690dad849f9a2166934130c0fec
2021-04-21 19:36:31 -07:00
0cc42809ce Enable skipped test for c10::complex on CUDA >= 11.2 (#50227)
Summary:
That test was skipped due to a compiler bug. That bug should be fixed in 11.2, so we should enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50227

Reviewed By: malfet

Differential Revision: D27909195

Pulled By: anjali411

fbshipit-source-id: c802702079d0e521f53fc98cd0fc3ded0c12b455
2021-04-21 18:33:31 -07:00
24ff92f76d [ONNX] Redesign inplace conversion (#55033) (#56173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56173

* Create `InplaceConverter` and `ValueTracker` to keep track of aliases of values throughout the graph. For a given value, a new alias is created every time when there is an inplace operation, SetAttr, or through nested blocks owned by If/Loop nodes.
* Fix bug where controlflow node output types are not set, when the complete node is unable to run ONNX shape inference due to containing non-onnx node.
* Add symbolic for `__not__` ~~and `prim_min`~~(update: moved to a separate PR), and update `index_put` opset9 to support case of assignment without providing indices.
* Bump ORT version in CI test.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866138

Pulled By: SplitInfinity

fbshipit-source-id: ab5c9188740c50f783ceba4d54fda43c26e2fde7
2021-04-21 17:59:11 -07:00
818ce1d0d2 Add standardOps match more input type in ORT (#53813) (#56172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56172

Enable the standard ops, including **Add/Sub/Mul/Div/Gemm/Pow/Mod**, with low-precision input in ORT

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866136

Pulled By: SplitInfinity

fbshipit-source-id: f2cf5649fffefd68c0cc7b6dce94198751636727
2021-04-21 17:58:08 -07:00
43ad172c54 make ProcessGroupDefaultTimeout the same as python (#56549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549

This makes `kProcessGroupDefaultTimeout` the same as on the Python
side, and the Python side now uses the pybind value directly.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27899190

Pulled By: wanchaol

fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
2021-04-21 17:56:05 -07:00
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

per the discussion in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API doesn't confuse users by accepting a timeout both
as an argument and via `ProcessGroup.Options`. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in our
test utils. For both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.

This way we still preserve the single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.
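
From the user's side, the only public knob remains the `timeout` argument, e.g. (a sketch, not taken from this PR's tests):

```py
from datetime import timedelta
import torch.distributed as dist

# timeout stays a plain argument; ProcessGroup.Options.timeout is private
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=60),
)
```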

cc pritamdamania87 rohan-varma

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
6d7d36d255 s/“pad”/"pad"/ in files introduced by #56065 (#56618)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56618

Reviewed By: albanD

Differential Revision: D27919343

Pulled By: malfet

fbshipit-source-id: 2fac8ba5f399e050463141eba225da935c97a5ce
2021-04-21 17:40:29 -07:00
5dcc7ac35c Add new scheduled job to circle-ci workflow (#55182)
Summary:
Under this setting the job should run 3 times a day.

When the environment variable `PYTORCH_TEST_WITH_SLOW_GRADCHECK` is set to `ON`, set the default value for `fast_mode` in the gradcheck wrapper to False. This is overridden by whatever value the user explicitly passes in.
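
A sketch of how such an environment-driven default could be wired; the names besides the environment variable are illustrative:

```py
import os
import torch.autograd

SLOW_GRADCHECK = os.environ.get("PYTORCH_TEST_WITH_SLOW_GRADCHECK") == "ON"

def gradcheck(fn, inputs, **kwargs):
    # Default fast_mode to False under the slow-gradcheck job, but an
    # explicitly passed value always wins.
    kwargs.setdefault("fast_mode", not SLOW_GRADCHECK)
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
```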

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55182

Reviewed By: albanD

Differential Revision: D27919236

Pulled By: soulitzer

fbshipit-source-id: 3a55ec6edcfc6e65fbc3a8a09c63aaea1bd1c5bf
2021-04-21 17:05:10 -07:00
73eaa0a5f5 Fixing error in jit cuda on ROCm: non-constant-expression cannot be n… (#55243)
Summary:
On ROCm, compiling grid_reduction.cu failed with "non-constant-expression
cannot be narrowed from type 'int' to 'uint32_t'".

Added a typecast to fix the issue.

Also removed the test skip on ROCm, re-enabling the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55243

Reviewed By: malfet

Differential Revision: D27917066

Pulled By: ngimel

fbshipit-source-id: b0b7c5fc8ecd2624222b35fe060846f7d1670f07
2021-04-21 16:35:27 -07:00
e0be76fb9b [static_runtime] fix num args for to_copy (#56441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56441

Since aten::to is overloaded, match schema to replace it with static_runtime::to_copy

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --benchmark_c2_predictor=0 --do_benchmark=0
```

```
Time per node type:
       0.623426 ms.     55.337%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
       0.331633 ms.    29.4367%. quantized::embedding_bag_byte_rowwise_offsets (71 nodes)
       0.123163 ms.    10.9323%. aten::to (155 nodes)
       0.038479 ms.     3.4155%. fb::lengths_to_offsets (155 nodes)
       0.004169 ms.   0.370052%. aten::embedding_bag (2 nodes)
       0.002549 ms.   0.226256%. static_runtime::to_copy (2 nodes)
       0.002512 ms.   0.222972%. prim::TupleConstruct (1 nodes)
       0.000667 ms.  0.0592048%. prim::dtype (2 nodes)
         1.1266 ms. in Total
StaticRuntime setup time: 0.009605 ms
Memory allocation time: 0.001907 ms
Memory deallocation time: 0.032401 ms
Outputs deallocation time: 0.020876 ms
Total memory managed: 256 bytes
Total number of reused tensors: 159
```

I verified all of the aten::to matches for the local, local_ro, and remote_ro nets in opt and dev mode.

Only 2 of the calls are replaced because the other 155 have either the input or the output of the op returned as an external output. The same applies to the other instances of aten::to in the local and local_ro nets.

Reviewed By: hlu1

Differential Revision: D27872350

fbshipit-source-id: b72785ea2768be415faae2afcf9915aef07daec2
2021-04-21 16:31:36 -07:00
d83ae5d1b7 Add devices to TensorPipe options (#56405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56405

If not provided, the `devices` field will be initialized to the local
devices in the local `device_maps` and the corresponding devices in
peers' `device_maps`. When processing CUDA RPC requests, the agent will
use a dedicated stream for each device in the devices list to 1)
accept argument CUDA tensors, 2) run user functions, and 3) send return
value tensors.

closes #54017
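
Usage would look roughly like the following sketch; the option names match this PR, the rest is illustrative:

```py
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(
    # local cuda:0 maps to worker1's cuda:1
    device_maps={"worker1": {0: 1}},
    # devices the agent should manage dedicated streams for; if omitted,
    # derived from the local and peers' device_maps
    devices=[0],
)
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)
```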

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27863133

Pulled By: mrshenli

fbshipit-source-id: 5d078c3b6d1812f85d62b0eb0f89f2b6c82cb060
2021-04-21 16:16:48 -07:00
853112bbfc [7/n] [torch/elastic] Rename _Rendezvous to _RendezvousState (#56535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56535

This PR renames the `_Rendezvous` class to `_RendezvousState` in preparation of the upcoming changes.
ghstack-source-id: 126979138

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889894

fbshipit-source-id: 027d26aa5e1acd5bba3ad2e58b140428a4a176b2
2021-04-21 16:01:03 -07:00
21d9bc246b [6/n] [torch/elastic] Reorder type definitions in dynamic_rendezvous.py (#56534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534

This PR reorders the type definitions in dynamic_rendezvous.py to increase the readability.
ghstack-source-id: 126979087

Test Plan: Run the existing unit tests.

Reviewed By: H-Huang

Differential Revision: D27889817

fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
2021-04-21 16:01:02 -07:00
df91eb924c [5/n] [torch/elastic] Introduce the delay utility function (#56533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533

This PR introduces a small utility function to delay the execution of the current thread.
ghstack-source-id: 126979035

Test Plan: Run the associated unit tests.

Reviewed By: H-Huang

Differential Revision: D27889671

fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
2021-04-21 16:01:00 -07:00
76ca1eeeb8 [4/n] [torch/elastic] Fix the finalizer of PeriodicTimer (#56532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532

This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.

We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer as joining a daemon thread during the interpreter shutdown can cause deadlocks. The `weakref.finalize` is a superior alternative that provides a consistent behavior regardless of the GC implementation.
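
A generic sketch of the pattern (not the actual torch.elastic implementation):

```py
import threading
import weakref

def _run(stop_event, interval, fn):
    while not stop_event.wait(interval):
        fn()

def _stop(stop_event, thread):
    stop_event.set()
    if thread.is_alive():
        thread.join()

class PeriodicTimer:
    def __init__(self, interval, fn):
        self._stop_event = threading.Event()
        # The worker closes over the event and fn, not over self, so the
        # timer object stays collectible.
        self._thread = threading.Thread(
            target=_run, args=(self._stop_event, interval, fn), daemon=True)
        # weakref.finalize instead of __del__: it runs reliably at GC or
        # interpreter exit without resurrecting the object.
        self._finalizer = weakref.finalize(
            self, _stop, self._stop_event, self._thread)

    def start(self):
        self._thread.start()

    def cancel(self):
        self._finalizer()  # idempotent; detaches and stops the thread
```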
ghstack-source-id: 126978904

Test Plan: Run the existing unit tests as there is no behavioral change.

Reviewed By: H-Huang

Differential Revision: D27889289

fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5
2021-04-21 15:59:19 -07:00
c244d1c540 [package] resolve __import__ calls on export (#55153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55153

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27504536

Pulled By: Lilyjjo

fbshipit-source-id: 5e3e10f213c6e0cf1755d18eb19727515362f91a
2021-04-21 15:43:15 -07:00
28f52649d8 add dtype information for input (#55358)
Summary:
Add dtype information for all inputs, in addition to the input dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55358

Reviewed By: heitorschueroff

Differential Revision: D27862346

Pulled By: ilia-cher

fbshipit-source-id: 656c5d6c9f23d723b27b44f0afc1a249ce1f3e44
2021-04-21 15:25:08 -07:00
6032ea0313 [PyTorch] Migrate add operators to borrow in TensorIteratorBase (#55691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55691

Avoiding reference counting for these operations is
roughly a 5% CPU time win vs not supporting borrowing at all.
ghstack-source-id: 127092680

Test Plan:
Existing CI for correctness.

Continued perf stat experiment from previous diff. All results included below for reviewing convenience.

Baseline:
```
 Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):

          5,837.13 msec task-clock                #    1.000 CPUs utilized            ( +-  0.34% )
               442      context-switches          #    0.076 K/sec                    ( +-  3.54% )
                 5      cpu-migrations            #    0.001 K/sec                    ( +- 19.07% )
            13,144      page-faults               #    0.002 M/sec                    ( +-  0.39% )
    11,597,542,455      cycles                    #    1.987 GHz                      ( +-  0.32% )  (50.05%)
    30,687,118,071      instructions              #    2.65  insn per cycle           ( +-  0.03% )  (50.08%)
     6,247,677,215      branches                  # 1070.334 M/sec                    ( +-  0.04% )  (50.08%)
         1,705,403      branch-misses             #    0.03% of all branches          ( +-  2.16% )  (50.05%)

            # Table of individual measurements:
            5.9025 (+0.0663) #
            5.8276 (-0.0085) #
            5.8151 (-0.0210) #
            5.7842 (-0.0519) #
            5.8511 (+0.0150) #

            # Final result:
            5.8361 +- 0.0198 seconds time elapsed  ( +-  0.34% )

```

Add but don't use borrowing support:
```
 Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):

          5,947.20 msec task-clock                #    0.999 CPUs utilized            ( +-  0.15% )
               422      context-switches          #    0.071 K/sec                    ( +-  1.88% )
                 3      cpu-migrations            #    0.001 K/sec                    ( +- 47.14% )
            13,025      page-faults               #    0.002 M/sec                    ( +-  0.46% )
    11,814,216,945      cycles                    #    1.987 GHz                      ( +-  0.12% )  (50.08%)
    31,535,372,676      instructions              #    2.67  insn per cycle           ( +-  0.06% )  (50.09%)
     6,482,809,438      branches                  # 1090.060 M/sec                    ( +-  0.04% )  (50.07%)
         1,688,623      branch-misses             #    0.03% of all branches          ( +-  1.62% )  (50.07%)

           # Table of individual measurements:
           5.97105 (+0.01991) #
           5.93649 (-0.01466) #
           5.93568 (-0.01547) #
           5.95940 (+0.00825) #
           5.95310 (+0.00196) #

           # Final result:
           5.95114 +- 0.00679 seconds time elapsed  ( +-  0.11% )
```

Now, use the borrowing support (this diff):
```
 Performance counter stats for '/tmp/cpp_benchmark.MakeAddBorrow' (5 runs):

          5,528.58 msec task-clock                #    1.000 CPUs utilized            ( +-  0.33% )
               451      context-switches          #    0.082 K/sec                    ( +-  4.29% )
                 6      cpu-migrations            #    0.001 K/sec                    ( +- 34.65% )
            13,155      page-faults               #    0.002 M/sec                    ( +-  0.32% )
    10,985,806,260      cycles                    #    1.987 GHz                      ( +-  0.33% )  (50.09%)
    30,657,224,792      instructions              #    2.79  insn per cycle           ( +-  0.02% )  (50.07%)
     6,247,997,282      branches                  # 1130.127 M/sec                    ( +-  0.01% )  (50.04%)
         1,732,507      branch-misses             #    0.03% of all branches          ( +-  1.04% )  (50.06%)

            # Table of individual measurements:
            5.5626 (+0.0356) #
            5.4913 (-0.0357) #
            5.5007 (-0.0263) #
            5.5839 (+0.0569) #
            5.4965 (-0.0305) #

            # Final result:
            5.5270 +- 0.0192 seconds time elapsed  ( +-  0.35% )

```

7.02% cycles improvement vs previous diff
2.78% instructions improvement vs previous diff

5.28% cycles improvement vs baseline
0.1% instructions improvement vs baseline

Note that instructions per cycle improved. This makes sense because we are avoiding memory accesses, and memory accesses manifest as instructions which take 3 (or many more in the case of a cache miss) cycles. This is also a great example of an effect that instruction counting is blind to.

Reviewed By: bhosmer

Differential Revision: D27607295

fbshipit-source-id: 7a0205b4aba6b63febbb5966f0f5e2627815cbbe
2021-04-21 14:51:50 -07:00
01842d2bb0 [PyTorch] Support borrowing in/out Tensors in TensorIterator (#55690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55690

Just change `OperandInfo::tensor` and
`TensorIteratorConfig::tensors` to hold `c10::MaybeOwned<Tensor>`, and
deal with the consequent pointer syntax. Had to C10_ALWAYS_INLINE
OperandInfo to preserve existing inlining behavior for whatever
compiler-idiosyncratic reason.

This is a separate diff from usage to enable measuring the cost of
support, and because there is no reason not to send it separately.

We probably should not land this without a plan to migrate a lot of
TensorIterator use cases to use either borrowing or structured kernels
& borrowing.
ghstack-source-id: 127092681

Test Plan:
Existing CI for correctness.

Ran perf stat on existing add in-place C++ benchmark and compared to D27607270 (diff before last; previous diff is arguably part of supporting borrowing). This is a devbig with turbo off.

Baseline:
```
 Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):

          5,837.13 msec task-clock                #    1.000 CPUs utilized            ( +-  0.34% )
               442      context-switches          #    0.076 K/sec                    ( +-  3.54% )
                 5      cpu-migrations            #    0.001 K/sec                    ( +- 19.07% )
            13,144      page-faults               #    0.002 M/sec                    ( +-  0.39% )
    11,597,542,455      cycles                    #    1.987 GHz                      ( +-  0.32% )  (50.05%)
    30,687,118,071      instructions              #    2.65  insn per cycle           ( +-  0.03% )  (50.08%)
     6,247,677,215      branches                  # 1070.334 M/sec                    ( +-  0.04% )  (50.08%)
         1,705,403      branch-misses             #    0.03% of all branches          ( +-  2.16% )  (50.05%)

            # Table of individual measurements:
            5.9025 (+0.0663) #
            5.8276 (-0.0085) #
            5.8151 (-0.0210) #
            5.7842 (-0.0519) #
            5.8511 (+0.0150) #

            # Final result:
            5.8361 +- 0.0198 seconds time elapsed  ( +-  0.34% )

```

Add but don't use borrowing support:
```
 Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):

          5,947.20 msec task-clock                #    0.999 CPUs utilized            ( +-  0.15% )
               422      context-switches          #    0.071 K/sec                    ( +-  1.88% )
                 3      cpu-migrations            #    0.001 K/sec                    ( +- 47.14% )
            13,025      page-faults               #    0.002 M/sec                    ( +-  0.46% )
    11,814,216,945      cycles                    #    1.987 GHz                      ( +-  0.12% )  (50.08%)
    31,535,372,676      instructions              #    2.67  insn per cycle           ( +-  0.06% )  (50.09%)
     6,482,809,438      branches                  # 1090.060 M/sec                    ( +-  0.04% )  (50.07%)
         1,688,623      branch-misses             #    0.03% of all branches          ( +-  1.62% )  (50.07%)

           # Table of individual measurements:
           5.97105 (+0.01991) #
           5.93649 (-0.01466) #
           5.93568 (-0.01547) #
           5.95940 (+0.00825) #
           5.95310 (+0.00196) #

           # Final result:
           5.95114 +- 0.00679 seconds time elapsed  ( +-  0.11% )
```

1.87% cycles regression vs baseline
2.76% instructions regression vs baseline

Reviewed By: ezyang

Differential Revision: D27607293

fbshipit-source-id: 55b9873c15b0de689ae17f9c35eb4ba0d026cade
2021-04-21 14:51:48 -07:00
7e8f078a3d [PyTorch] Always update op.current_dtype in TensorIteratorBase::set_output (#55940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55940

Simpler way to keep current_dtype up to date than #55689.
ghstack-source-id: 127092676

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D27744064

fbshipit-source-id: 23fccb8b0375f5b790439a9a1c9ac07d5fae391b
2021-04-21 14:51:46 -07:00
b79901f932 [PyTorch] Remove non-const TensorIterator::tensor() method (#55420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55420

It doesn't seem to be necessary, and it blocks using `c10::MaybeOwned` to support borrowing.
ghstack-source-id: 127092679

Test Plan: fitsships

Reviewed By: ezyang

Differential Revision: D27607270

fbshipit-source-id: a007e9896785c8708f8cc02035cc6f4607a0a31b
2021-04-21 14:51:44 -07:00
26fc27cb4f [PyTorch] Format generated structured kernels code better (#55258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55258

Prior to this diff, the generated code contained whitespace-only lines and odd indenting. Now, it is easier to read.
ghstack-source-id: 127092678

Test Plan:
Inspect generated RegisterCPU.cpp.
Before: P372666985
After: P372665023

Reviewed By: ezyang

Differential Revision: D27544604

fbshipit-source-id: 03095aa0275e7e817951cf8b303e4ad5cbb486ca
2021-04-21 14:51:43 -07:00
1211bccc65 [PyTorch] Fix const correctness for resize native functions (#55351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55351

We incorrectly used `Tensor&` to mean "the underlying
TensorImpl cannot be changed", as explained in
https://github.com/zdevito/ATen/issues/27#issuecomment-330717839 .
This diff gets us on the path to fixing this problem: we have an
incremental way to fix individual native functions so that we can
apply any handwritten fixes a few at a time. It gets the migration
started with the `resize` family of native functions.
ghstack-source-id: 127092677

Test Plan: fitsships

Reviewed By: ezyang

Differential Revision: D27583983

fbshipit-source-id: 4eeeec85f5d268e9d0f1645eb9396914a9f9557f
2021-04-21 14:51:41 -07:00
5e695b1271 Use absolute path for local linter (#56633)
Summary:
In some cases the `__file__` here was relative, so in the linter script it ended up setting the repo root to `''`, which `asyncio` doesn't handle.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56633

Pulled By: driazati

Reviewed By: samestep

Differential Revision: D27922510

fbshipit-source-id: 7e406fa374ec0e5c4917b7c11742b9457dd52668
2021-04-21 14:50:28 -07:00
772ca1a2c3 [vulkan] Add Vulkan registrar for internal build (#56620)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56620

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D27919883

Pulled By: IvanKobzarev

fbshipit-source-id: af5eb7e2e16a31af80539dcbebc296857b45faff
2021-04-21 14:46:51 -07:00
27a0d6f1df AutoDispatchBelowAutograd takes no arguments. (#56424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56424

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27866607

Pulled By: ailzhang

fbshipit-source-id: b82cfb90af5bc7b4129266083fe31f8b335a5b41
2021-04-21 14:44:12 -07:00
3ec6bf5d26 Fix cuda launch error in reflection_pad2d (#56451)
Summary:
Fix https://github.com/pytorch/pytorch/issues/55222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56451

Reviewed By: malfet

Differential Revision: D27912184

Pulled By: ngimel

fbshipit-source-id: 3fc80273c30a68a247289d3fb698f99b92931731
2021-04-21 14:39:31 -07:00
eac082891f [package] Massage exporter docstrings (#56547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56547

**Summary**
This commit tweaks the docstrings of `PackageExporter` so that they look
nicer on the docs website.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27912965

Pulled By: SplitInfinity

fbshipit-source-id: 38c0a715365b8cfb9eecdd1b38ba525fa226a453
2021-04-21 14:06:54 -07:00
0911ee9108 Split CUDAFuture into a .h and a .cpp file (#56514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56514

rohan-varma mentioned that having CUDAFuture entirely defined in a header meant having to rebuild a whole lot of things whenever it changed. In fact there's no reason not to use a .cpp file, so here I do so.
ghstack-source-id: 127035765

Test Plan: Unit tests

Reviewed By: rohan-varma, mrshenli

Differential Revision: D27861071

fbshipit-source-id: c209d54af9b52d3ad781db1b61f6fca02c637f32
2021-04-21 13:58:45 -07:00
7dec14a491 Avoid defining RpcCUDAFuture subclass in TensorPipe agent (#56513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56513

The RpcCUDAFuture class existed solely to support extracting DataPtrs from a Message class. However, this can be done more simply by using a vanilla CUDAFuture and just extracting those DataPtrs before marking it complete, then passing them to markCompleted.

This allows making the DataPtr extraction logic of CUDAFuture private again.
ghstack-source-id: 127035771

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D27861064

fbshipit-source-id: b0b4df2cab7be6b4b16d5cfc888483c18fbce60e
2021-04-21 13:58:43 -07:00
5ddc2691d0 Merge ivalue::Future's markCompleted and markCompletedWithDataPtrs (#56512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56512

I don't know if there was a reason to keep them separate, but since the former deferred to the latter, it seems to me that we can get the exact same behavior by merging them and making the `data_ptrs` argument optional (by giving it a default value).
ghstack-source-id: 127035767

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D27861069

fbshipit-source-id: 93a49d6959b65a8d4ab9b31accce90bf30cd441e
2021-04-21 13:58:42 -07:00
af23822112 Gracefully handle failure of DataPtr extraction in CUDAFuture (#56511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56511

CUDAFuture needs to inspect the value in order to extract DataPtrs. Sometimes it's unable to do so. So far we've handled this by raising an error when `markCompleted` is called. In this PR I'm proposing a change, which makes `markCompleted` return successfully, but instead causes the Future to be set to an error if the DataPtr extraction fails.

The advantage I see is that user code calling `markCompleted` didn't expect it to throw, and thus wasn't catching and handling that error, which in the best case could lead to a crash, and in the worst case could leave the Future incomplete, never unblocking clients waiting on it. With this change those clients are woken up and see the error.
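
In Python terms, the behavioral change is roughly the following sketch, with a stand-in extraction function:

```py
from concurrent.futures import Future

def extract_data_ptrs(value):
    # Stand-in for CUDAFuture's value inspection; fails on opaque types
    if not hasattr(value, "data_ptr"):
        raise TypeError(f"cannot extract DataPtrs from {type(value).__name__}")
    return [value.data_ptr()]

def mark_completed(fut: Future, value):
    try:
        extract_data_ptrs(value)
    except TypeError as exc:
        # New behavior: complete the future with the error instead of
        # raising at the call site, so waiting clients are unblocked.
        fut.set_exception(exc)
        return
    fut.set_result(value)
```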
ghstack-source-id: 127035772

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D27861070

fbshipit-source-id: 4bb6100a488ab35fbe3c2bc3ac6f98d166c60a0b
2021-04-21 13:58:40 -07:00
3e0c226eed Raise TypeErrors when IValue::getSubValues fails (#56510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56510

The comment for `TORCH_INTERNAL_ASSERT` says to use it for "enforcement of internal invariants in code", meaning "assuming no bugs in PyTorch, the conditions tested by this macro should always be true". However, this wasn't the case here, at least for the RPC code: CUDAFuture calls the `getSubValues` method on a generic IValue whose type it doesn't know (or care about). It was thus sometimes triggering the internal assert when users provided non-inspectable types, producing an exception whose message contained "please report a bug to PyTorch", which was confusing to users.

It makes more sense to me to consider this a type error, which can thus be reported more clearly to the user (and, later on in this stack, to catch). Hence the difference introduced here is just the type and the message of the exception. I don't expect there to be any code depending on the old behavior (as it would mean depending on a violation of an internal invariant).
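
A Python-terms sketch of the principle, not the actual C++ API: failures that depend on user-provided values raise a typed, catchable error instead of tripping an internal assert.

```python
# The isinstance check below is a hypothetical stand-in for the real
# "is this IValue inspectable" condition.
def get_sub_values(value):
    if not isinstance(value, (list, tuple, dict)):
        raise TypeError(
            f"cannot extract sub-values from a value of type {type(value).__name__}"
        )
    return list(value)

try:
    get_sub_values(3.14)
except TypeError as exc:
    print(exc)  # clear and catchable; no "please report a bug" message
```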
ghstack-source-id: 127035768

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D27861066

fbshipit-source-id: 6d41c922257cba5f37c7a4614d8e5ab5c7c87b92
2021-04-21 13:57:34 -07:00
5e4dfd0140 Add quicklint make target (#56559)
Summary:
This queries the local git repo for changed files (any changed files, not just committed ones) and sends them to mypy/flake8 instead of the default (which is the whole repo, as defined in the .flake8 and mypy.ini files). This brings a good speedup (from 15 seconds with no cache to < 1 second in my local testing on this PR).

```bash
make quicklint -j 6
```

It should be noted that the results of this aren’t exactly what’s in the CI, since mypy and flake8 ignore the `include` and `exclude` parts of their config when an explicit list of files is passed in.
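
A rough Python sketch of what a quicklint-style target does, assuming git, flake8, and mypy are on PATH; the Makefile's exact commands and flags may differ.

```python
import subprocess

# collect every changed file relative to HEAD (staged and unstaged)
changed = subprocess.check_output(
    ["git", "diff", "--name-only", "HEAD"], text=True
).splitlines()
py_files = [f for f in changed if f.endswith(".py")]

if py_files:
    # passing explicit files skips the include/exclude sections of the
    # configs, which is why results can differ slightly from full-repo CI
    subprocess.run(["flake8", *py_files])
    subprocess.run(["mypy", *py_files])
```
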
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56559

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D27901577

fbshipit-source-id: 99f351cdfe5aba007948aea2b8a78f683c5d8583
2021-04-21 13:47:25 -07:00
12b2bc94d7 Revert D27909732: [pytorch][PR] Support factory kwargs in torch.nn modules
Test Plan: revert-hammer

Differential Revision:
D27909732 (5a09def9b0)

Original commit changeset: d8684b2403ab

fbshipit-source-id: d00d69fae4fa4ed58d9e97e70b27a06a0dcb39e4
2021-04-21 13:44:03 -07:00
284e735b3f Set show_error_codes = True in mypy-strict.ini (#56616)
Summary:
This should make it easier to resolve issues surfaced by https://github.com/pytorch/pytorch/issues/56290. Also see https://github.com/pytorch/pytorch/pull/56559#discussion_r617828152 for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56616

Test Plan:
You could add a type error in a strict-checked file like `tools/test_history.py`, and then run this command:
```
$ mypy --config=mypy-strict.ini tools/test_history.py
```

Output before this PR:
```
tools/test_history.py:13:1: error: Function is missing a type annotation for one or more arguments
Found 1 error in 1 file (checked 1 source file)
```

Output after this PR:
```
tools/test_history.py:13:1: error: Function is missing a type annotation for one or more arguments  [no-untyped-def]
Found 1 error in 1 file (checked 1 source file)
```

Reviewed By: driazati

Differential Revision: D27918753

Pulled By: samestep

fbshipit-source-id: 953926e019a7669da9004fd54498b414aec777a6
2021-04-21 13:23:36 -07:00
5a09def9b0 Support factory kwargs in torch.nn modules (#54508)
Summary:
Continuation of https://github.com/pytorch/pytorch/pull/53144
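
A short example of the feature, assuming a machine with CUDA available; the same `device`/`dtype` factory kwargs are accepted by other `torch.nn` modules.

```python
import torch
import torch.nn as nn

# parameters are created directly on the requested device with the requested dtype
layer = nn.Linear(4, 8, device="cuda", dtype=torch.float64)
print(layer.weight.device, layer.weight.dtype)  # cuda:0 torch.float64
```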

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54508

Reviewed By: malfet

Differential Revision: D27909732

Pulled By: jbschlosser

fbshipit-source-id: d8684b2403ab7eb336371d118799146a2520bd76
2021-04-21 13:20:11 -07:00
11e26e7246 [sparsity][refactor] Remove "Sparsity" from the function names (#56555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56555

Remove the "sparse" and "sparsity" from the function/variable names

Test Plan: `buck test mode/opt //caffe2/torch/fb/model_optimization:sparsity_test`

Reviewed By: raghuramank100

Differential Revision: D27812205

fbshipit-source-id: 1665253720467030b84b744f824fa7742a802542
2021-04-21 13:15:27 -07:00
8ee1347c3f Changes to support strides in addition to shape and dtype. (#56567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56567

This adds stride information to the serialized JSON.

This also adds shape, dtype and stride to the graph that is printed out.

Test Plan: Run unit tests.

Reviewed By: jfix71

Differential Revision: D27528988

fbshipit-source-id: f0be92055ad7c8e525625bfd1332c2db11ba612d
2021-04-21 12:43:52 -07:00
4230040470 torch: Fix flake8 errors from leftover import (#56614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56614

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27917831

Pulled By: seemethere

fbshipit-source-id: 90a3213080cc2c8da2bc63c8971e14f7823390a9
2021-04-21 12:39:53 -07:00
7660cb880f Rename job to be py2-setup-validate-errormsg (#56593)
Summary:
This should clarify its purpose, which is:

> to make sure that we give an appropriate error message when someone tries to use python2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56593

Test Plan: CI.

Reviewed By: gchanan

Differential Revision: D27913086

Pulled By: samestep

fbshipit-source-id: e7555d5cab5696b19a17824383c92f25f91da2cf
2021-04-21 12:32:26 -07:00
8a81c4dc27 Update padding_idx docs for EmbeddingBag to better match Embedding's (#56065)
Summary:
Match updated `Embedding` docs from https://github.com/pytorch/pytorch/pull/54026 as closely as possible. Additionally, update the C++ side `Embedding` docs, since those were missed in the previous PR.

There are 6 (!) places for docs:
1. Python module form in `sparse.py` - includes an additional line about newly constructed `Embedding`s / `EmbeddingBag`s
2. Python `from_pretrained()` in `sparse.py` (refers back to module docs)
3. Python functional form in `functional.py`
4. C++ module options - includes an additional line about newly constructed `Embedding`s / `EmbeddingBag`s
5. C++ `from_pretrained()` options
6. C++ functional options
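
A brief illustration of the documented behavior, assuming `EmbeddingBag`'s `padding_idx` works as these docs describe: the padding row neither contributes to the bag summary nor receives gradients.

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(10, 3, mode="sum", padding_idx=0)
inp = torch.tensor([[0, 2, 4]])   # index 0 is the padding entry
bag(inp).sum().backward()
print(bag.weight.grad[0])          # zeros: the padding row gets no gradient
```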

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56065

Reviewed By: malfet

Differential Revision: D27908383

Pulled By: jbschlosser

fbshipit-source-id: c5891fed1c9d33b4b8cd63500a14c1a77d92cc78
2021-04-21 12:10:37 -07:00
e691f24079 [sparsity] Moving only the C++ files from internal to OSS (#56553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56553

This splits the previous diff into multiple parts. This introduces only the C++ files.

The unit tests pass as part of the internal build; they will be added to OSS in later PRs.

Test Plan:
`buck test mode/opt //caffe2/torch/fb/model_optimization:sparsity_test`

```
Parsing buck files: finished in 2.0 sec
Creating action graph: finished in 16.4 sec
Building: finished in 55.0 sec (100%) 20264/20264 jobs, 16 updated
  Total time: 01:13.6 min
More details at https://www.internalfb.com/intern/buck/build/c9c5e69e-ce00-4560-adce-58b68bc43e47
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 1e678a07-0689-45b4-96f3-54d0a3181996
Trace available for this run at /tmp/tpx-20210415-161113.966600/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/3096224795029304
    ✓ ListingSuccess: caffe2/torch/fb/model_optimization:sparsity_test - main (4.186)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (1.752)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseKernels) (1.884)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear_serdes (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (2.013)
Summary
  Pass: 3
  ListingSuccess: 1
```

Reviewed By: ailzhang

Differential Revision: D27833226

fbshipit-source-id: a47707117de950a9794f79e50a544aa13542c1e1
2021-04-21 12:02:00 -07:00
02c9d2dc90 Release GIL before destructing ProcessGroup classes (#56381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56381

Part of fix for https://github.com/pytorch/pytorch/issues/56297
ghstack-source-id: 126943449

Test Plan: sandcastle

Reviewed By: zhaojuanmao

Differential Revision: D27855337

fbshipit-source-id: 88bc9234685a6637318e35b25fa68ccbdc3cbc12
2021-04-21 11:49:38 -07:00
3e55fc91fd [pet] Remove additional @record in elastic_launch to fix file existing error
Summary:
Since `launch_agent()` in api.py is already decorated with record, we can remove the usage in elastic_launch.
This also fixes the FileExistsError bug on MAST.

We ran an experiment in D27901961 counting how many times record is invoked, to verify this assumption.

Test Plan:
```
fbpkg build -E torchelastic_distributed_sum

buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:fde7879   --nnodes 1 --nproc_per_node 1 --resource T1 --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.par
```

https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_a92f97e7

Reviewed By: borovsky-d

Differential Revision: D27902034

fbshipit-source-id: e08b02d4b9c7a7c70fbb0dbcb24b95af55d2ea95
2021-04-21 11:32:09 -07:00
90e532f3ef Revert D27708346: generate xla codegen in-tree
Test Plan: revert-hammer

Differential Revision:
D27708346 (51d0212d0f)

Original commit changeset: 2289edd641f3

fbshipit-source-id: 86711c07db19833b9e772c558e12accba1432499
2021-04-21 11:07:45 -07:00
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debugability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
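
A conceptual Python sketch of the scheme (assumed names, and assuming an initialized default process group); the real integration lives in the C++ ProcessGroup code.

```python
import torch
import torch.distributed as dist

def init_sequence_number():
    # every rank allocates a tensor; rank 0's random value wins the broadcast
    seq = torch.randint(0, 2**31, (1,), dtype=torch.int64)
    dist.broadcast(seq, src=0)
    return int(seq)

# on every rank, bump the number once per collective:
#   seq += 1
# a rank whose seq disagrees with its peers at a checkpoint is desynchronized
```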
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
096089abcb [quant][graphmode][fx] Produce torch.cat instead of torch.ops.quantized.cat (#54924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54924

Previously we were producing torch.ops.quantized.cat, which takes inputs, dequantizes them,
and requantizes them with new qparams. This PR changes that to produce torch.cat directly; torch.cat
assumes all inputs share the same qparams and produces a quantized Tensor with
the same qparams as its inputs (because the previous PR makes sure all inputs and the output of cat share
the same observer/fakequant instance).

Using torch.cat is expected to be more efficient since it does not introduce extra quant/dequant.
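
A small example of the new lowering target, assuming both inputs share the same qparams (which the shared observer/fakequant instance from the previous PR guarantees):

```python
import torch

scale, zero_point = 0.1, 0
a = torch.quantize_per_tensor(torch.randn(2, 3), scale, zero_point, torch.quint8)
b = torch.quantize_per_tensor(torch.randn(2, 3), scale, zero_point, torch.quint8)

out = torch.cat([a, b], dim=0)              # stays quantized; no extra quant/dequant
print(out.q_scale(), out.q_zero_point())    # 0.1 0
```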

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_cat

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27416528

fbshipit-source-id: 896c280abec2903c29d597c655729666583ff0dd
2021-04-21 10:58:09 -07:00
2e8418025a [vulkan] safe_downcast for buck build (#56540)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56540

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D27894423

Pulled By: IvanKobzarev

fbshipit-source-id: 2302f9ef9d06c2e072a5e83ea7abecf754ce325d
2021-04-21 10:54:44 -07:00
a583b9cd86 Fixing "naive" forward of ModuleList and ModuleDict (#48785)
Summary:
**Goal:** Making sure "calling"/"forwarding" a `ModuleList` or `ModuleDict` produces the intended `NotImplementedError`.

**Current behavior:**
Currently, when naively calling `forward`, the user ends up with the confusing error message:
```python
TypeError: forward() takes 1 positional argument but 2 were given
```
Instead of the intended `NotImplementedError`.
This minor issue was brought up by vadimkantorov in issue https://github.com/pytorch/pytorch/issues/37718 [here][1], also by a confused stackoverflow user [here][2].

**What this PR includes:**
Remove `forward` altogether from `ModuleList` and `ModuleDict` to fall back on the `_forward_unimplemented` of `Module` that properly throws `NotImplementedError` regardless of input arguments.
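
A quick demonstration of the fixed behavior: calling the container raises the intended error, while iteration remains the supported pattern.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(4, 4), nn.ReLU()])
x = torch.randn(1, 4)

try:
    layers(x)                  # containers intentionally have no forward
except NotImplementedError:
    print("ModuleList has no forward; iterate over it instead")

for layer in layers:           # the supported usage
    x = layer(x)
```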

Appropriate test was added to `test_nn.py`

Fixes previous PR https://github.com/pytorch/pytorch/issues/48698 and PR https://github.com/pytorch/pytorch/issues/48783 (third time's a charm? I'm really sorry for the mess)

Test added according to ngimel [request][3].

[1]: https://github.com/pytorch/pytorch/issues/37718#issuecomment-736333345
[2]: https://stackoverflow.com/q/65096679/1714410
[3]: https://github.com/pytorch/pytorch/pull/48698#issuecomment-737398693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48785

Reviewed By: zhangguanheng66

Differential Revision: D25359759

Pulled By: jbschlosser

fbshipit-source-id: 28f82386f2e9a2a9b0b0b81b16dba6b79398bd34
2021-04-21 10:43:07 -07:00
e51f73a03e Report test stats for macos_10_13 tests (#56429)
Summary:
Run print_test_stats.py for macos_10_13 tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56429

Test Plan: Make sure CI passes, specifically for macos_10_13

Reviewed By: samestep

Differential Revision: D27911557

Pulled By: janeyx99

fbshipit-source-id: 178c0ff7786ab5c41dec9d8afa257eebda4f5a0f
2021-04-21 10:02:39 -07:00
d43d6593cd [NNC] Handling conditionals in reorderAxis (#56063)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56063

Reviewed By: huiguoo

Differential Revision: D27894772

Pulled By: navahgar

fbshipit-source-id: 403b65f20567c27eab73faf670087cfab9885f84
2021-04-21 09:35:17 -07:00
fe0e1c71a7 Add type ignore lint to Makefile (#56587)
Summary:
Followup to https://github.com/pytorch/pytorch/issues/56290 which adds the new lint to the local runner from https://github.com/pytorch/pytorch/issues/56439.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56587

Test Plan: Same as https://github.com/pytorch/pytorch/issues/56439.

Reviewed By: walterddr

Differential Revision: D27909889

Pulled By: samestep

fbshipit-source-id: 8b67f3bc36c9b5567fe5a9e49904f2cf23a9f135
2021-04-21 08:45:19 -07:00
51d0212d0f generate xla codegen in-tree (#55050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55050

not ready for review yet

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27708346

Pulled By: bdhirsh

fbshipit-source-id: 2289edd641f30277d7561cf2d48ec69c6a2137a9
2021-04-21 08:19:08 -07:00
744360ce52 Fix missing definitions in Vec256 for VSX (#56486)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/56474, although I have no PowerPC system to test on.

Sleef has `copysign` support for vsx, according to https://sleef.org/ppc64.xhtml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56486

Reviewed By: heitorschueroff

Differential Revision: D27890091

Pulled By: ezyang

fbshipit-source-id: be0221f33a12f66f30d49a4cdea858ffcce1061f
2021-04-21 08:13:07 -07:00
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
87a1ebc9cd fix RegistrationDeclarations.yaml, now that we codegen composite kernels for structured functional/inplace ops (#56307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56307

This should fix https://github.com/pytorch/pytorch/issues/56273. I tested these changes locally by making them directly on top of https://github.com/pytorch/pytorch/pull/56151, and running the xla tests (`xla/test/cpp/build/test_ptxla`).

**Current state:** For ops that are ported to structured, if external backends like XLA have implemented the `out` op but not the `functional` version, they will call into our code-generated `CompositeExplicitAutograd` kernel, which calls the structured operator's `meta()` function and then redispatches to the external backend's `out` function.

If XLA has registered its own kernel to the `functional` variant of the op, it'll override our codegen'd composite kernel. XLA has logic to code-generate "CPU fallback" kernels for "required" ops. It gets this information based on `RegistrationDeclarations.yaml`. That info was technically incorrect up until this PR, since we were code-generating `inplace/functional` composite kernels for structured ops, but not updating `RegistrationDeclarations.yaml` with that information.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27883950

Pulled By: bdhirsh

fbshipit-source-id: fe896b0d2bbd4369490dcdf7a87f227fd3d8b8b3
2021-04-21 07:41:09 -07:00
46a1ac40d9 fix meta() calls for non-storage tensors (i.e. xla) (#56306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56306

It turns out that TensorIteratorBase `meta()` calls don't work with XLA tensors, since the logic that builds up the `TensorIteratorBase` object also tries to grab/store the underlying tensors' data pointers. This doesn't work for XLA because they don't have storage.

I think it's fine to just skip this bit of logic for tensors that don't have storage, since the data_ptr information isn't important to the meta call, and TensorIterator isn't actually used in the implementation for non-native kernels, i.e. XLA.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27883949

Pulled By: bdhirsh

fbshipit-source-id: 7db4358b94b23c504a383f9673dc509c4020a708
2021-04-21 07:39:41 -07:00
d168eae114 make torch.testing error messages more expressive (#55145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55145

Repeating the discussion from https://github.com/pytorch/pytorch/pull/54784#issuecomment-811792089

The error messages for mismatched values are directly adapted from the old `_compare_tensors_internal`:

50cb75edce/torch/testing/__init__.py (L104-L111)

A sample error message right now looks like this

```
With rtol=1.3e-06 and atol=1e-05, found 1 different element(s) out of 12 (8.3%). The greatest difference of 4.0 (5.0 vs. 9.0) occurred at index (2, 3)
```

Using the same data with `numpy.testing.assert_equal` gives the following output:

```
Not equal to tolerance rtol=1.3e-06, atol=1e-05

Mismatched elements: 1 / 12 (8.33%)
Max absolute difference: 4.
Max relative difference: 0.44444445
 x: array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.]], dtype=float32)
 y: array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 9.]], dtype=float32)
```

Pros:

- The info is presented in a list instead of a sentence. IMO this makes it more readable
- The maximum relative difference is reported, which is beneficial in case a comparison fails due to the `rtol`

Cons:

- The values of the inputs are reported (this can be disabled by passing `verbose=False`, but let's face it: most users will use the default setting). In case the inputs are large, the output gets truncated with `...`. Not only is it hard to visually find the mismatching values, they could also live within the truncated part, making the output completely useless.
- Even after visually finding the offending values, it is hard to map them back to indices in the inputs.

This implements a mix of both to get a short but expressive message:

```
Tensors are not close according to rtol=1.3e-6 and atol=1e-05:

Mismatched elements: 1 / 12 (8.3%)
Max. rel. diff.: 4.44e-1 at (2, 3)
Max. abs. diff.: 4.0 at (2, 3)
```

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27877157

Pulled By: mruberry

fbshipit-source-id: 6898a995f116f127e3ae8ed0bcb1ada63eadc45a
2021-04-21 06:29:42 -07:00
b66a1e00a6 [NNC] added skeleton for refactoring (#55371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55371

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D27616418

Pulled By: Chillee

fbshipit-source-id: 8187a0cb2495b6bec07bb5992e352da3ffb299fb
2021-04-21 04:07:01 -07:00
7929bc76a0 [shape inference] Fix dim type for Cast
Summary: ATT

Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D27904584

fbshipit-source-id: b62d2eb5da0be79091c82e6300dd0c075a0bf2fe
2021-04-21 03:21:56 -07:00
4575028f6c Update script API to take example inputs (#55376)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55376

Test Plan: Imported from OSS

Reviewed By: driazati, gmagogsfm

Differential Revision: D27897350

Pulled By: nikithamalgifb

fbshipit-source-id: 4f63235b9eae898c8f4ccaec3fcf64b4b29c860e
2021-04-21 01:00:35 -07:00
c91c4a081d [NNC] Horizontally fuse all loops (#56324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56324

Inlining is great if LLVM's CSE kicks in; but if a kernel has multiple outputs
(and thus multiple loops), CSE has no chance.

So, this pass "horizontally" fuses the output loops together so that CSE can go
to town. Essentially we want to turn
```
for (...) {
  output_1[] = some_complicated_expr...
}
for (...) {
  output_2[] = some_complicated_expr...
}
```

Into:
```
for (...) {
  output_1[] = complicated_expr
  output_2[] = complicated_expr // llvm cse should take care of this
}
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27841194

Pulled By: bertmaher

fbshipit-source-id: 54153bb59786be87183c636d64f05963c4b1624a
2021-04-20 23:54:40 -07:00
33f206b865 [StaticRuntime] Replace StorageImpl with TensorImpl in MemoryPlanner (#56447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447

MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.

Test Plan: CI

Reviewed By: ajyu

Differential Revision: D27840361

fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
2021-04-20 23:04:01 -07:00
88fbbb4165 [ONNX] Fix ComputeShapeFromReshape when input_shape_size < reshape_size (#56171)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56171

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866146

Pulled By: SplitInfinity

fbshipit-source-id: 4f361e1a99e4aedd701c73aae97a440f98282086
2021-04-20 23:00:50 -07:00
1e449694a3 [ONNX] enable word_language_model GRU and LSTM scripting (#54310) (#56170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56170

* enable test_word_language_model_GRU

* add test_word_language_model_LSTM

* fix ci clang

* fix flake8 format

* remove the outer call to tuple()

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866142

Pulled By: SplitInfinity

fbshipit-source-id: ff71d6e1af49b01c6059592930dcfffae98675e8

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-04-20 23:00:48 -07:00
0b0fca3c59 [ONNX] Export mv op (#55470) (#56169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56169

Adding matrix-vector multiplication op
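
For reference, a minimal eager-mode example of the op this symbolic covers:

```python
import torch

mat = torch.randn(3, 4)
vec = torch.randn(4)
out = torch.mv(mat, vec)   # shape (3,): out[i] = mat[i] @ vec
```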

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866141

Pulled By: SplitInfinity

fbshipit-source-id: 40e8f65c590bc5354b764b51e0c3cd8386fdc33b
2021-04-20 23:00:46 -07:00
90e63cc41f [ONNX] Add support for prim::min (#55259) (#56168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56168

Add support for prim::min operator and update full_like

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866144

Pulled By: SplitInfinity

fbshipit-source-id: f4af4b8171ed8bd7980fa3141f5fc9811e2bc367
2021-04-20 23:00:44 -07:00
a31fd7f453 Fix onnx/constant_fold.cpp compilation on Windows (#55770) (#56167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56167

VC++ does not recognize `or` as a valid operator. This breaks the build under the `Debug` configuration.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866143

Pulled By: SplitInfinity

fbshipit-source-id: 490cee57b9762170ce02a6f73130772a3542e76d
2021-04-20 23:00:43 -07:00
5a455dc717 [ONNX] Enable tensordot symbolic function. (#55654) (#56166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56166

Support tensordot in symbolic function of opset 12, and add tests accordingly.
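
For reference, a minimal eager-mode example of the op now covered by the opset 12 symbolic:

```python
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(4, 5, 6)
out = torch.tensordot(a, b, dims=2)   # contracts the last 2 dims of a with
print(out.shape)                      # the first 2 dims of b: torch.Size([3, 6])
```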

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866140

Pulled By: SplitInfinity

fbshipit-source-id: 68e218cfbd630900fb92871fc7c0de3e7e8c8c3d
2021-04-20 23:00:41 -07:00
f804b65d4e [ONNX] Update repeat_interleave symbolic (#54312) (#56165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56165

Add an implementation for cases where:
- interleaving happens along a dim that consists of dynamic axes (see the example below)
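
An eager-mode example of the repeat_interleave pattern in question (the dynamic-axis aspect only matters at export time):

```python
import torch

x = torch.tensor([[1, 2], [3, 4]])
y = torch.repeat_interleave(x, repeats=2, dim=0)
# tensor([[1, 2],
#         [1, 2],
#         [3, 4],
#         [3, 4]])
```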

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866137

Pulled By: SplitInfinity

fbshipit-source-id: 7fef1b2c614f2e24a677b7ca0886bb37bd0ab479
2021-04-20 23:00:39 -07:00
9986b109d2 [ONNX] Fix assign input shape for tuple inputs & primitive type inputs (#54112) (#56164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56164

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866139

Pulled By: SplitInfinity

fbshipit-source-id: c59f5a07df685e1ccdc4860d603ec422ec80d188
2021-04-20 23:00:37 -07:00
75995e4bf6 [ONNX] Add support for hann_window operator. (#54587) (#56163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56163

* [ONNX] Improve index_put symbolic to handle singular Bool updates (#53690)

Adds support for cases where the update to the index_put node is a single Bool value, such as the case shown below

```
mask[indices] = True
```

Fixes #53507

* [ONNX] Support primitive type input/outputs and attributes (#53550)

Support primitive type attributes. Needed for Silero model.

* [ONNX] Fix if output shape mismatch error & Fix graph input directly used as output (#53219)

Fix if output shape mismatch error & Fix graph input directly used as output

* Add support for hann_window operator.

* [ONNX] Replace decomposeLinear pre process pass with a symbolic (#53077)

Replace decomposeLinear pre process pass with a symbolic

* Add a test case for dtype is None.

* Resolve flake8 issue.

* Remove one unused test case.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27866145

Pulled By: SplitInfinity

fbshipit-source-id: e0b43df9ecd1a95cd7ac297213aba453bbaf2913

Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Negin Raoof <neginmr@utexas.edu>
Co-authored-by: Bowen Bao <bowbao@microsoft.com>
Co-authored-by: Ksenija Stanojevic <KsenijaS@users.noreply.github.com>
2021-04-20 22:59:31 -07:00
19943aafe9 [caffe2] Speed up remote net loading
Summary:
Training recovery takes over 3 hours for DI models. See T88118480 for more details.

One likely source of the slowness is the linear search in ApplicationSpecificInfo. To improve that, we cache the app info in a dict so that lookups are much faster.
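
A sketch of the optimization with assumed names; the real change is in the predictor utils, but the pattern is the same: build a dict once instead of linearly scanning the entries on every lookup.

```python
app_info = [(f"remote_net_{i}", i) for i in range(100_000)]  # hypothetical entries

def find_slow(key):                # before: O(n) scan per query
    for k, v in app_info:
        if k == key:
            return v
    return None

app_info_by_key = dict(app_info)   # built once

def find_fast(key):                # after: O(1) per query
    return app_info_by_key.get(key)

assert find_slow("remote_net_99999") == find_fast("remote_net_99999")
```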

Test Plan:
Unit test
  buck test caffe2/caffe2/fb/predictor:predictor_py_dist_utils_test
```Building: finished in 6.2 sec (100%) 11023/11023 jobs, 2 updated
  Total time: 6.6 sec
More details at https://www.internalfb.com/intern/buck/build/95555464-b15f-44f2-a781-a712126aeaa1
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 3f4e4913-5802-4437-81bf-1e0a08c067da
Trace available for this run at /tmp/tpx-20210420-101444.394595/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5348024608951863
    ✓ ListingSuccess: caffe2/caffe2/fb/predictor:predictor_py_dist_utils_test - main (8.412)
    ✓ Pass: caffe2/caffe2/fb/predictor:predictor_py_dist_utils_test - test_empty_remote_net_in_app_into (caffe2.caffe2.fb.predictor.predictor_py_dist_utils_test.TestPredictorDistUtils) (7.844)
    ✓ Pass: caffe2/caffe2/fb/predictor:predictor_py_dist_utils_test - test_distributed_context_in_app_info (caffe2.caffe2.fb.predictor.predictor_py_dist_utils_test.TestPredictorDistUtils) (8.014)
    ✓ Pass: caffe2/caffe2/fb/predictor:predictor_py_dist_utils_test - test_remote_net_in_app_info (caffe2.caffe2.fb.predictor.predictor_py_dist_utils_test.TestPredictorDistUtils) (8.027)
Summary
  Pass: 3
  ListingSuccess: 1
If you need help debugging your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5348024608951863
```

Performance Test:
N557020 is the old way, which takes about 30~60 secs for every 1000 remote nets
N556897 is the new way, which takes 0.12 secs for every 1000 remote nets

N557020 output:
~~~
I0420 112047.755 <ipython-input-2-515f8ba1b5f6>:48] Start retrieving remote nets ...
I0420 112050.036 <ipython-input-2-515f8ba1b5f6>:27] Get 1000 remote nets
I0420 112052.750 <ipython-input-2-515f8ba1b5f6>:27] Get 2000 remote nets
I0420 112055.907 <ipython-input-2-515f8ba1b5f6>:27] Get 3000 remote nets
I0420 112059.542 <ipython-input-2-515f8ba1b5f6>:27] Get 4000 remote nets
I0420 112103.628 <ipython-input-2-515f8ba1b5f6>:27] Get 5000 remote nets
I0420 112108.309 <ipython-input-2-515f8ba1b5f6>:27] Get 6000 remote nets
I0420 112113.883 <ipython-input-2-515f8ba1b5f6>:27] Get 7000 remote nets
I0420 112119.564 <ipython-input-2-515f8ba1b5f6>:27] Get 8000 remote nets
I0420 112125.629 <ipython-input-2-515f8ba1b5f6>:27] Get 9000 remote nets
I0420 112132.057 <ipython-input-2-515f8ba1b5f6>:27] Get 10000 remote nets
I0420 112138.979 <ipython-input-2-515f8ba1b5f6>:27] Get 11000 remote nets
I0420 112146.198 <ipython-input-2-515f8ba1b5f6>:27] Get 12000 remote nets
I0420 112154.381 <ipython-input-2-515f8ba1b5f6>:27] Get 13000 remote nets
I0420 112202.881 <ipython-input-2-515f8ba1b5f6>:27] Get 14000 remote nets
I0420 112211.595 <ipython-input-2-515f8ba1b5f6>:27] Get 15000 remote nets
I0420 112221.341 <ipython-input-2-515f8ba1b5f6>:27] Get 16000 remote nets
I0420 112231.300 <ipython-input-2-515f8ba1b5f6>:27] Get 17000 remote nets
I0420 112242.615 <ipython-input-2-515f8ba1b5f6>:27] Get 18000 remote nets
I0420 112253.730 <ipython-input-2-515f8ba1b5f6>:27] Get 19000 remote nets
I0420 112305.044 <ipython-input-2-515f8ba1b5f6>:27] Get 20000 remote nets
I0420 112316.378 <ipython-input-2-515f8ba1b5f6>:27] Get 21000 remote nets
I0420 112328.176 <ipython-input-2-515f8ba1b5f6>:27] Get 22000 remote nets
I0420 112341.466 <ipython-input-2-515f8ba1b5f6>:27] Get 23000 remote nets
I0420 112355.653 <ipython-input-2-515f8ba1b5f6>:27] Get 24000 remote nets
I0420 112409.014 <ipython-input-2-515f8ba1b5f6>:27] Get 25000 remote nets
I0420 112422.924 <ipython-input-2-515f8ba1b5f6>:27] Get 26000 remote nets
I0420 112437.026 <ipython-input-2-515f8ba1b5f6>:27] Get 27000 remote nets
I0420 112451.413 <ipython-input-2-515f8ba1b5f6>:27] Get 28000 remote nets
I0420 112506.773 <ipython-input-2-515f8ba1b5f6>:27] Get 29000 remote nets
I0420 112522.614 <ipython-input-2-515f8ba1b5f6>:27] Get 30000 remote nets
I0420 112538.564 <ipython-input-2-515f8ba1b5f6>:27] Get 31000 remote nets
I0420 112555.075 <ipython-input-2-515f8ba1b5f6>:27] Get 32000 remote nets
I0420 112612.159 <ipython-input-2-515f8ba1b5f6>:27] Get 33000 remote nets
I0420 112629.656 <ipython-input-2-515f8ba1b5f6>:27] Get 34000 remote nets
I0420 112647.850 <ipython-input-2-515f8ba1b5f6>:27] Get 35000 remote nets
I0420 112705.807 <ipython-input-2-515f8ba1b5f6>:27] Get 36000 remote nets
I0420 112724.495 <ipython-input-2-515f8ba1b5f6>:27] Get 37000 remote nets
I0420 112744.072 <ipython-input-2-515f8ba1b5f6>:27] Get 38000 remote nets
I0420 112804.266 <ipython-input-2-515f8ba1b5f6>:27] Get 39000 remote nets
I0420 112824.954 <ipython-input-2-515f8ba1b5f6>:27] Get 40000 remote nets
I0420 112845.934 <ipython-input-2-515f8ba1b5f6>:27] Get 41000 remote nets
I0420 112908.721 <ipython-input-2-515f8ba1b5f6>:27] Get 42000 remote nets
I0420 112930.573 <ipython-input-2-515f8ba1b5f6>:27] Get 43000 remote nets
I0420 112952.775 <ipython-input-2-515f8ba1b5f6>:27] Get 44000 remote nets
I0420 113015.969 <ipython-input-2-515f8ba1b5f6>:27] Get 45000 remote nets
I0420 113041.214 <ipython-input-2-515f8ba1b5f6>:27] Get 46000 remote nets
I0420 113104.702 <ipython-input-2-515f8ba1b5f6>:27] Get 47000 remote nets
I0420 113128.730 <ipython-input-2-515f8ba1b5f6>:27] Get 48000 remote nets
I0420 113153.378 <ipython-input-2-515f8ba1b5f6>:27] Get 49000 remote nets
I0420 113218.021 <ipython-input-2-515f8ba1b5f6>:27] Get 50000 remote nets
I0420 113243.351 <ipython-input-2-515f8ba1b5f6>:27] Get 51000 remote nets
I0420 113309.279 <ipython-input-2-515f8ba1b5f6>:27] Get 52000 remote nets
I0420 113335.202 <ipython-input-2-515f8ba1b5f6>:27] Get 53000 remote nets
I0420 113402.367 <ipython-input-2-515f8ba1b5f6>:27] Get 54000 remote nets
I0420 113430.947 <ipython-input-2-515f8ba1b5f6>:27] Get 55000 remote nets
I0420 113458.127 <ipython-input-2-515f8ba1b5f6>:27] Get 56000 remote nets
I0420 113526.365 <ipython-input-2-515f8ba1b5f6>:27] Get 57000 remote nets
I0420 113554.709 <ipython-input-2-515f8ba1b5f6>:27] Get 58000 remote nets
I0420 113623.601 <ipython-input-2-515f8ba1b5f6>:27] Get 59000 remote nets
I0420 113653.264 <ipython-input-2-515f8ba1b5f6>:27] Get 60000 remote nets
I0420 113724.726 <ipython-input-2-515f8ba1b5f6>:27] Get 61000 remote nets
I0420 113755.080 <ipython-input-2-515f8ba1b5f6>:27] Get 62000 remote nets
I0420 113827.936 <ipython-input-2-515f8ba1b5f6>:27] Get 63000 remote nets
I0420 113859.362 <ipython-input-2-515f8ba1b5f6>:27] Get 64000 remote nets
I0420 113931.138 <ipython-input-2-515f8ba1b5f6>:27] Get 65000 remote nets
I0420 114003.229 <ipython-input-2-515f8ba1b5f6>:27] Get 66000 remote nets
I0420 114038.085 <ipython-input-2-515f8ba1b5f6>:27] Get 67000 remote nets
I0420 114111.300 <ipython-input-2-515f8ba1b5f6>:27] Get 68000 remote nets
I0420 114145.383 <ipython-input-2-515f8ba1b5f6>:27] Get 69000 remote nets
I0420 114219.571 <ipython-input-2-515f8ba1b5f6>:27] Get 70000 remote nets
I0420 114254.233 <ipython-input-2-515f8ba1b5f6>:27] Get 71000 remote nets
I0420 114329.326 <ipython-input-2-515f8ba1b5f6>:27] Get 72000 remote nets
I0420 114405.087 <ipython-input-2-515f8ba1b5f6>:27] Get 73000 remote nets
I0420 114440.979 <ipython-input-2-515f8ba1b5f6>:27] Get 74000 remote nets
I0420 114518.520 <ipython-input-2-515f8ba1b5f6>:27] Get 75000 remote nets
I0420 114556.013 <ipython-input-2-515f8ba1b5f6>:27] Get 76000 remote nets
I0420 114633.434 <ipython-input-2-515f8ba1b5f6>:27] Get 77000 remote nets
I0420 114711.834 <ipython-input-2-515f8ba1b5f6>:27] Get 78000 remote nets
I0420 114750.741 <ipython-input-2-515f8ba1b5f6>:27] Get 79000 remote nets
I0420 114829.749 <ipython-input-2-515f8ba1b5f6>:27] Get 80000 remote nets
I0420 114909.038 <ipython-input-2-515f8ba1b5f6>:27] Get 81000 remote nets
I0420 114948.711 <ipython-input-2-515f8ba1b5f6>:27] Get 82000 remote nets
I0420 115028.869 <ipython-input-2-515f8ba1b5f6>:27] Get 83000 remote nets
I0420 115109.094 <ipython-input-2-515f8ba1b5f6>:27] Get 84000 remote nets
I0420 115150.249 <ipython-input-2-515f8ba1b5f6>:27] Get 85000 remote nets
I0420 115231.601 <ipython-input-2-515f8ba1b5f6>:27] Get 86000 remote nets
I0420 115313.772 <ipython-input-2-515f8ba1b5f6>:27] Get 87000 remote nets
I0420 115356.035 <ipython-input-2-515f8ba1b5f6>:27] Get 88000 remote nets
I0420 115438.846 <ipython-input-2-515f8ba1b5f6>:27] Get 89000 remote nets
I0420 115522.213 <ipython-input-2-515f8ba1b5f6>:27] Get 90000 remote nets
I0420 115607.908 <ipython-input-2-515f8ba1b5f6>:27] Get 91000 remote nets
I0420 115652.009 <ipython-input-2-515f8ba1b5f6>:27] Get 92000 remote nets
I0420 115736.510 <ipython-input-2-515f8ba1b5f6>:27] Get 93000 remote nets
I0420 115822.303 <ipython-input-2-515f8ba1b5f6>:27] Get 94000 remote nets
I0420 115908.392 <ipython-input-2-515f8ba1b5f6>:27] Get 95000 remote nets
I0420 115954.912 <ipython-input-2-515f8ba1b5f6>:27] Get 96000 remote nets
I0420 120042.219 <ipython-input-2-515f8ba1b5f6>:27] Get 97000 remote nets
I0420 120129.969 <ipython-input-2-515f8ba1b5f6>:27] Get 98000 remote nets
I0420 120218.765 <ipython-input-2-515f8ba1b5f6>:27] Get 99000 remote nets
I0420 120306.883 <ipython-input-2-515f8ba1b5f6>:27] Get 100000 remote nets
I0420 120355.543 <ipython-input-2-515f8ba1b5f6>:27] Get 101000 remote nets
I0420 120444.976 <ipython-input-2-515f8ba1b5f6>:27] Get 102000 remote nets
I0420 120533.482 <ipython-input-2-515f8ba1b5f6>:27] Get 103000 remote nets
I0420 120622.351 <ipython-input-2-515f8ba1b5f6>:27] Get 104000 remote nets
I0420 120712.467 <ipython-input-2-515f8ba1b5f6>:27] Get 105000 remote nets
I0420 120802.660 <ipython-input-2-515f8ba1b5f6>:27] Get 106000 remote nets
I0420 120854.634 <ipython-input-2-515f8ba1b5f6>:27] Get 107000 remote nets
I0420 120945.786 <ipython-input-2-515f8ba1b5f6>:27] Get 108000 remote nets
~~~

N556897 output:
~~~
I0420 111502.516 <ipython-input-7-52640a51556f>:60] Start retrieving remote nets ...
I0420 111504.709 <ipython-input-7-52640a51556f>:40] Get 1000 remote nets
I0420 111504.825 <ipython-input-7-52640a51556f>:40] Get 2000 remote nets
I0420 111504.941 <ipython-input-7-52640a51556f>:40] Get 3000 remote nets
I0420 111505.056 <ipython-input-7-52640a51556f>:40] Get 4000 remote nets
I0420 111505.174 <ipython-input-7-52640a51556f>:40] Get 5000 remote nets
I0420 111505.286 <ipython-input-7-52640a51556f>:40] Get 6000 remote nets
I0420 111505.405 <ipython-input-7-52640a51556f>:40] Get 7000 remote nets
I0420 111505.522 <ipython-input-7-52640a51556f>:40] Get 8000 remote nets
I0420 111505.639 <ipython-input-7-52640a51556f>:40] Get 9000 remote nets
I0420 111505.756 <ipython-input-7-52640a51556f>:40] Get 10000 remote nets
I0420 111505.873 <ipython-input-7-52640a51556f>:40] Get 11000 remote nets
I0420 111505.990 <ipython-input-7-52640a51556f>:40] Get 12000 remote nets
I0420 111506.106 <ipython-input-7-52640a51556f>:40] Get 13000 remote nets
I0420 111506.223 <ipython-input-7-52640a51556f>:40] Get 14000 remote nets
I0420 111506.343 <ipython-input-7-52640a51556f>:40] Get 15000 remote nets
I0420 111506.457 <ipython-input-7-52640a51556f>:40] Get 16000 remote nets
I0420 111506.585 <ipython-input-7-52640a51556f>:40] Get 17000 remote nets
I0420 111508.930 <ipython-input-7-52640a51556f>:40] Get 18000 remote nets
I0420 111509.045 <ipython-input-7-52640a51556f>:40] Get 19000 remote nets
I0420 111509.154 <ipython-input-7-52640a51556f>:40] Get 20000 remote nets
I0420 111509.266 <ipython-input-7-52640a51556f>:40] Get 21000 remote nets
I0420 111509.382 <ipython-input-7-52640a51556f>:40] Get 22000 remote nets
I0420 111509.497 <ipython-input-7-52640a51556f>:40] Get 23000 remote nets
I0420 111509.614 <ipython-input-7-52640a51556f>:40] Get 24000 remote nets
I0420 111509.736 <ipython-input-7-52640a51556f>:40] Get 25000 remote nets
I0420 111509.854 <ipython-input-7-52640a51556f>:40] Get 26000 remote nets
I0420 111509.972 <ipython-input-7-52640a51556f>:40] Get 27000 remote nets
I0420 111510.090 <ipython-input-7-52640a51556f>:40] Get 28000 remote nets
I0420 111510.210 <ipython-input-7-52640a51556f>:40] Get 29000 remote nets
I0420 111510.329 <ipython-input-7-52640a51556f>:40] Get 30000 remote nets
I0420 111510.448 <ipython-input-7-52640a51556f>:40] Get 31000 remote nets
I0420 111510.572 <ipython-input-7-52640a51556f>:40] Get 32000 remote nets
I0420 111510.689 <ipython-input-7-52640a51556f>:40] Get 33000 remote nets
I0420 111510.821 <ipython-input-7-52640a51556f>:40] Get 34000 remote nets
I0420 111510.989 <ipython-input-7-52640a51556f>:40] Get 35000 remote nets
I0420 111511.110 <ipython-input-7-52640a51556f>:40] Get 36000 remote nets
I0420 111511.236 <ipython-input-7-52640a51556f>:40] Get 37000 remote nets
I0420 111511.357 <ipython-input-7-52640a51556f>:40] Get 38000 remote nets
I0420 111511.482 <ipython-input-7-52640a51556f>:40] Get 39000 remote nets
I0420 111511.607 <ipython-input-7-52640a51556f>:40] Get 40000 remote nets
I0420 111511.729 <ipython-input-7-52640a51556f>:40] Get 41000 remote nets
I0420 111511.855 <ipython-input-7-52640a51556f>:40] Get 42000 remote nets
I0420 111511.988 <ipython-input-7-52640a51556f>:40] Get 43000 remote nets
I0420 111512.112 <ipython-input-7-52640a51556f>:40] Get 44000 remote nets
I0420 111512.232 <ipython-input-7-52640a51556f>:40] Get 45000 remote nets
I0420 111512.353 <ipython-input-7-52640a51556f>:40] Get 46000 remote nets
I0420 111512.477 <ipython-input-7-52640a51556f>:40] Get 47000 remote nets
I0420 111512.597 <ipython-input-7-52640a51556f>:40] Get 48000 remote nets
I0420 111512.723 <ipython-input-7-52640a51556f>:40] Get 49000 remote nets
I0420 111512.839 <ipython-input-7-52640a51556f>:40] Get 50000 remote nets
I0420 111512.969 <ipython-input-7-52640a51556f>:40] Get 51000 remote nets
I0420 111513.085 <ipython-input-7-52640a51556f>:40] Get 52000 remote nets
I0420 111513.205 <ipython-input-7-52640a51556f>:40] Get 53000 remote nets
I0420 111513.322 <ipython-input-7-52640a51556f>:40] Get 54000 remote nets
I0420 111513.441 <ipython-input-7-52640a51556f>:40] Get 55000 remote nets
I0420 111513.559 <ipython-input-7-52640a51556f>:40] Get 56000 remote nets
I0420 111513.678 <ipython-input-7-52640a51556f>:40] Get 57000 remote nets
I0420 111513.796 <ipython-input-7-52640a51556f>:40] Get 58000 remote nets
I0420 111513.918 <ipython-input-7-52640a51556f>:40] Get 59000 remote nets
I0420 111514.038 <ipython-input-7-52640a51556f>:40] Get 60000 remote nets
I0420 111514.158 <ipython-input-7-52640a51556f>:40] Get 61000 remote nets
I0420 111514.273 <ipython-input-7-52640a51556f>:40] Get 62000 remote nets
I0420 111514.391 <ipython-input-7-52640a51556f>:40] Get 63000 remote nets
I0420 111514.512 <ipython-input-7-52640a51556f>:40] Get 64000 remote nets
I0420 111514.638 <ipython-input-7-52640a51556f>:40] Get 65000 remote nets
I0420 111514.759 <ipython-input-7-52640a51556f>:40] Get 66000 remote nets
I0420 111514.874 <ipython-input-7-52640a51556f>:40] Get 67000 remote nets
I0420 111515.000 <ipython-input-7-52640a51556f>:40] Get 68000 remote nets
I0420 111515.117 <ipython-input-7-52640a51556f>:40] Get 69000 remote nets
I0420 111515.235 <ipython-input-7-52640a51556f>:40] Get 70000 remote nets
I0420 111515.358 <ipython-input-7-52640a51556f>:40] Get 71000 remote nets
I0420 111515.481 <ipython-input-7-52640a51556f>:40] Get 72000 remote nets
I0420 111515.604 <ipython-input-7-52640a51556f>:40] Get 73000 remote nets
I0420 111515.725 <ipython-input-7-52640a51556f>:40] Get 74000 remote nets
I0420 111515.848 <ipython-input-7-52640a51556f>:40] Get 75000 remote nets
I0420 111515.979 <ipython-input-7-52640a51556f>:40] Get 76000 remote nets
I0420 111516.102 <ipython-input-7-52640a51556f>:40] Get 77000 remote nets
I0420 111516.226 <ipython-input-7-52640a51556f>:40] Get 78000 remote nets
I0420 111516.344 <ipython-input-7-52640a51556f>:40] Get 79000 remote nets
I0420 111516.472 <ipython-input-7-52640a51556f>:40] Get 80000 remote nets
I0420 111516.603 <ipython-input-7-52640a51556f>:40] Get 81000 remote nets
I0420 111516.751 <ipython-input-7-52640a51556f>:40] Get 82000 remote nets
I0420 111516.883 <ipython-input-7-52640a51556f>:40] Get 83000 remote nets
I0420 111517.025 <ipython-input-7-52640a51556f>:40] Get 84000 remote nets
I0420 111517.160 <ipython-input-7-52640a51556f>:40] Get 85000 remote nets
I0420 111517.290 <ipython-input-7-52640a51556f>:40] Get 86000 remote nets
I0420 111517.415 <ipython-input-7-52640a51556f>:40] Get 87000 remote nets
I0420 111517.541 <ipython-input-7-52640a51556f>:40] Get 88000 remote nets
I0420 111517.665 <ipython-input-7-52640a51556f>:40] Get 89000 remote nets
I0420 111517.790 <ipython-input-7-52640a51556f>:40] Get 90000 remote nets
I0420 111517.918 <ipython-input-7-52640a51556f>:40] Get 91000 remote nets
I0420 111518.044 <ipython-input-7-52640a51556f>:40] Get 92000 remote nets
I0420 111518.171 <ipython-input-7-52640a51556f>:40] Get 93000 remote nets
I0420 111518.292 <ipython-input-7-52640a51556f>:40] Get 94000 remote nets
I0420 111518.429 <ipython-input-7-52640a51556f>:40] Get 95000 remote nets
I0420 111520.024 <ipython-input-7-52640a51556f>:40] Get 96000 remote nets
I0420 111520.148 <ipython-input-7-52640a51556f>:40] Get 97000 remote nets
I0420 111520.271 <ipython-input-7-52640a51556f>:40] Get 98000 remote nets
I0420 111520.396 <ipython-input-7-52640a51556f>:40] Get 99000 remote nets
I0420 111520.522 <ipython-input-7-52640a51556f>:40] Get 100000 remote nets
I0420 111520.646 <ipython-input-7-52640a51556f>:40] Get 101000 remote nets
I0420 111520.770 <ipython-input-7-52640a51556f>:40] Get 102000 remote nets
I0420 111520.899 <ipython-input-7-52640a51556f>:40] Get 103000 remote nets
I0420 111521.023 <ipython-input-7-52640a51556f>:40] Get 104000 remote nets
I0420 111521.149 <ipython-input-7-52640a51556f>:40] Get 105000 remote nets
I0420 111521.274 <ipython-input-7-52640a51556f>:40] Get 106000 remote nets
I0420 111521.399 <ipython-input-7-52640a51556f>:40] Get 107000 remote nets
I0420 111521.526 <ipython-input-7-52640a51556f>:40] Get 108000 remote nets
I0420 111521.651 <ipython-input-7-52640a51556f>:40] Get 109000 remote nets
I0420 111521.778 <ipython-input-7-52640a51556f>:40] Get 110000 remote nets
I0420 111521.900 <ipython-input-7-52640a51556f>:40] Get 111000 remote nets
I0420 111522.055 <ipython-input-7-52640a51556f>:40] Get 112000 remote nets
I0420 111522.173 <ipython-input-7-52640a51556f>:40] Get 113000 remote nets
I0420 111522.297 <ipython-input-7-52640a51556f>:40] Get 114000 remote nets
I0420 111522.421 <ipython-input-7-52640a51556f>:40] Get 115000 remote nets
I0420 111522.545 <ipython-input-7-52640a51556f>:40] Get 116000 remote nets
I0420 111522.671 <ipython-input-7-52640a51556f>:40] Get 117000 remote nets
I0420 111522.795 <ipython-input-7-52640a51556f>:40] Get 118000 remote nets
I0420 111522.919 <ipython-input-7-52640a51556f>:40] Get 119000 remote nets
I0420 111523.048 <ipython-input-7-52640a51556f>:40] Get 120000 remote nets
I0420 111523.171 <ipython-input-7-52640a51556f>:40] Get 121000 remote nets
I0420 111523.298 <ipython-input-7-52640a51556f>:40] Get 122000 remote nets
I0420 111523.420 <ipython-input-7-52640a51556f>:40] Get 123000 remote nets
I0420 111523.544 <ipython-input-7-52640a51556f>:40] Get 124000 remote nets
I0420 111523.669 <ipython-input-7-52640a51556f>:40] Get 125000 remote nets
I0420 111523.794 <ipython-input-7-52640a51556f>:40] Get 126000 remote nets
I0420 111523.920 <ipython-input-7-52640a51556f>:40] Get 127000 remote nets
I0420 111524.041 <ipython-input-7-52640a51556f>:40] Get 128000 remote nets
I0420 111524.173 <ipython-input-7-52640a51556f>:40] Get 129000 remote nets
I0420 111524.293 <ipython-input-7-52640a51556f>:40] Get 130000 remote nets
I0420 111524.417 <ipython-input-7-52640a51556f>:40] Get 131000 remote nets
I0420 111524.542 <ipython-input-7-52640a51556f>:40] Get 132000 remote nets
I0420 111524.665 <ipython-input-7-52640a51556f>:40] Get 133000 remote nets
I0420 111524.790 <ipython-input-7-52640a51556f>:40] Get 134000 remote nets
I0420 111524.913 <ipython-input-7-52640a51556f>:40] Get 135000 remote nets
I0420 111525.038 <ipython-input-7-52640a51556f>:40] Get 136000 remote nets
I0420 111525.166 <ipython-input-7-52640a51556f>:40] Get 137000 remote nets
I0420 111525.289 <ipython-input-7-52640a51556f>:40] Get 138000 remote nets
I0420 111525.414 <ipython-input-7-52640a51556f>:40] Get 139000 remote nets
I0420 111525.536 <ipython-input-7-52640a51556f>:40] Get 140000 remote nets
I0420 111525.659 <ipython-input-7-52640a51556f>:40] Get 141000 remote nets
I0420 111525.782 <ipython-input-7-52640a51556f>:40] Get 142000 remote nets
I0420 111525.907 <ipython-input-7-52640a51556f>:40] Get 143000 remote nets
I0420 111526.035 <ipython-input-7-52640a51556f>:40] Get 144000 remote nets
I0420 111526.157 <ipython-input-7-52640a51556f>:40] Get 145000 remote nets
I0420 111526.287 <ipython-input-7-52640a51556f>:40] Get 146000 remote nets
I0420 111526.409 <ipython-input-7-52640a51556f>:40] Get 147000 remote nets
I0420 111526.533 <ipython-input-7-52640a51556f>:40] Get 148000 remote nets
I0420 111526.658 <ipython-input-7-52640a51556f>:40] Get 149000 remote nets
I0420 111526.781 <ipython-input-7-52640a51556f>:40] Get 150000 remote nets
I0420 111526.908 <ipython-input-7-52640a51556f>:40] Get 151000 remote nets
I0420 111527.033 <ipython-input-7-52640a51556f>:40] Get 152000 remote nets
I0420 111527.158 <ipython-input-7-52640a51556f>:40] Get 153000 remote nets
I0420 111527.289 <ipython-input-7-52640a51556f>:40] Get 154000 remote nets
I0420 111527.413 <ipython-input-7-52640a51556f>:40] Get 155000 remote nets
I0420 111527.544 <ipython-input-7-52640a51556f>:40] Get 156000 remote nets
I0420 111527.665 <ipython-input-7-52640a51556f>:40] Get 157000 remote nets
I0420 111527.790 <ipython-input-7-52640a51556f>:40] Get 158000 remote nets
I0420 111527.917 <ipython-input-7-52640a51556f>:40] Get 159000 remote nets
I0420 111528.046 <ipython-input-7-52640a51556f>:40] Get 160000 remote nets
I0420 111528.175 <ipython-input-7-52640a51556f>:40] Get 161000 remote nets
I0420 111528.297 <ipython-input-7-52640a51556f>:40] Get 162000 remote nets
I0420 111528.422 <ipython-input-7-52640a51556f>:40] Get 163000 remote nets
I0420 111528.548 <ipython-input-7-52640a51556f>:40] Get 164000 remote nets
I0420 111528.672 <ipython-input-7-52640a51556f>:40] Get 165000 remote nets
I0420 111528.796 <ipython-input-7-52640a51556f>:40] Get 166000 remote nets
I0420 111528.920 <ipython-input-7-52640a51556f>:40] Get 167000 remote nets
I0420 111529.045 <ipython-input-7-52640a51556f>:40] Get 168000 remote nets
I0420 111529.172 <ipython-input-7-52640a51556f>:40] Get 169000 remote nets
I0420 111529.300 <ipython-input-7-52640a51556f>:40] Get 170000 remote nets
I0420 111529.426 <ipython-input-7-52640a51556f>:40] Get 171000 remote nets
I0420 111529.547 <ipython-input-7-52640a51556f>:40] Get 172000 remote nets
I0420 111529.683 <ipython-input-7-52640a51556f>:40] Get 173000 remote nets
I0420 111529.800 <ipython-input-7-52640a51556f>:40] Get 174000 remote nets
I0420 111529.923 <ipython-input-7-52640a51556f>:40] Get 175000 remote nets
I0420 111530.080 <ipython-input-7-52640a51556f>:40] Get 176000 remote nets
I0420 111530.205 <ipython-input-7-52640a51556f>:40] Get 177000 remote nets
I0420 111530.331 <ipython-input-7-52640a51556f>:40] Get 178000 remote nets
I0420 111530.453 <ipython-input-7-52640a51556f>:40] Get 179000 remote nets
I0420 111530.577 <ipython-input-7-52640a51556f>:40] Get 180000 remote nets
I0420 111530.705 <ipython-input-7-52640a51556f>:40] Get 181000 remote nets
I0420 111530.829 <ipython-input-7-52640a51556f>:40] Get 182000 remote nets
I0420 111530.955 <ipython-input-7-52640a51556f>:40] Get 183000 remote nets
I0420 111531.082 <ipython-input-7-52640a51556f>:40] Get 184000 remote nets
I0420 111531.210 <ipython-input-7-52640a51556f>:40] Get 185000 remote nets
I0420 111531.338 <ipython-input-7-52640a51556f>:40] Get 186000 remote nets
I0420 111531.461 <ipython-input-7-52640a51556f>:40] Get 187000 remote nets
I0420 111531.588 <ipython-input-7-52640a51556f>:40] Get 188000 remote nets
I0420 111531.708 <ipython-input-7-52640a51556f>:40] Get 189000 remote nets
I0420 111531.845 <ipython-input-7-52640a51556f>:40] Get 190000 remote nets
I0420 111531.968 <ipython-input-7-52640a51556f>:40] Get 191000 remote nets
I0420 111532.096 <ipython-input-7-52640a51556f>:40] Get 192000 remote nets
I0420 111534.047 <ipython-input-7-52640a51556f>:40] Get 193000 remote nets
I0420 111534.172 <ipython-input-7-52640a51556f>:40] Get 194000 remote nets
I0420 111534.297 <ipython-input-7-52640a51556f>:40] Get 195000 remote nets
I0420 111534.420 <ipython-input-7-52640a51556f>:40] Get 196000 remote nets
I0420 111534.543 <ipython-input-7-52640a51556f>:40] Get 197000 remote nets
I0420 111534.671 <ipython-input-7-52640a51556f>:40] Get 198000 remote nets
I0420 111534.794 <ipython-input-7-52640a51556f>:40] Get 199000 remote nets
I0420 111534.920 <ipython-input-7-52640a51556f>:40] Get 200000 remote nets
I0420 111535.044 <ipython-input-7-52640a51556f>:40] Get 201000 remote nets
I0420 111535.167 <ipython-input-7-52640a51556f>:40] Get 202000 remote nets
I0420 111535.291 <ipython-input-7-52640a51556f>:40] Get 203000 remote nets
I0420 111537.169 <ipython-input-7-52640a51556f>:64] Finish retrieving remote nets. Starting processing ...
I0420 111537.201 <ipython-input-7-52640a51556f>:77] Finished processing remote nets
~~~

Reviewed By: heslami

Differential Revision: D27886217

fbshipit-source-id: cdc398d04bf963d4f495adc0a91c8ceb54466e58
2021-04-20 22:32:40 -07:00
a2422cc243 Add stricter check for function schemas with varargs (#56509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56509

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27889626

Pulled By: tugsbayasgalan

fbshipit-source-id: 5ff81a313ff53a9519d7dc9f3d6f7234d58af8e2
2021-04-20 20:04:38 -07:00
a4626348bc fix unqualified noqa lint (#56548)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56548

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27898933

Pulled By: suo

fbshipit-source-id: dc4dcd2ab8bb145e5a548566fc299fa6e7e1928e
2021-04-20 17:27:15 -07:00
594c546b69 [PyTorch Edge] Eliminate non-determinism when generating build YAML file (#56539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56539

It seems like a potential source of non-determinism when generating YAML files during the build stems from the fact that when we write out Python lists, they get written out in list order. This isn't a problem per se, but if you look at how these lists are generated, you'll see that they come from sets, which are inherently [not order preserving](https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set) in Python.

I can't guarantee that this removes all non-determinism, but it removes all non-determinism that I know of so far. The surface area of codegen isn't sprawling, and the YAML file is generated by converting the object with `toDict()` and passing it into the YAML serializer, so this should cover it (I think). Dictionaries are serialized in key order by pyyaml, so that's not a problem.
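
A minimal sketch of the fix pattern, assuming PyYAML: sort set-derived collections before serializing so repeated builds emit identical output.

```python
import yaml

selected_ops = {"aten::add", "aten::relu", "aten::mul"}  # sets are unordered
doc = {"operators": sorted(selected_ops)}                # sorted lists are stable
print(yaml.safe_dump(doc))   # byte-identical across runs
```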

This could be related to the elevated Android build times being seen [here](https://fb.workplace.com/groups/pytorch.edge.users/permalink/841622146708080/).
ghstack-source-id: 126987721

Test Plan: Build + Sandcastle.

Reviewed By: JacobSzwejbka

Differential Revision: D27893058

fbshipit-source-id: 6d7bcb09f34c05b71fbb4a0673bac1c4c33f23d7
2021-04-20 17:26:14 -07:00
7fff71eb9a Fix warnings in tensor_flatten.cpp (#55956)
Summary:
Switch to use `TensorOptions` instead of deprecated `.type()` to fix compiler warnings as part of #55952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55956

Pulled By: driazati

Reviewed By: pritamdamania87

Differential Revision: D27830504

fbshipit-source-id: f705818ddb7d8b17c0f5383f22dc431203a194d9
2021-04-20 17:22:05 -07:00
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
13ac0019ae [NNC] Update loop-carried dependence check to handle all known dependences (#56354)
Summary:
This PR includes:
 * Update to the loop-carried dependence check API to correctly ignore loop-independent dependences and handle all kinds of loop-carried dependences like RAW, WAR and WAW.
 * Fix for the overlap API to look only for conflicting buffer accesses where at least one of them is a Store.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56354

Reviewed By: bertmaher

Differential Revision: D27856202

Pulled By: navahgar

fbshipit-source-id: 206e4ec771fe0f7f2ccf4b11b29e35df7b9b18bc
2021-04-20 17:12:51 -07:00
1d8053655d Rename AutoNonVariableTypeMode to AutoDispatchBelowAutograd and add a warning. (#56422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56422

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866608

Pulled By: ailzhang

fbshipit-source-id: 507bbcaa4c25edf23e67162780efaa70f64ad14a
2021-04-20 17:04:08 -07:00
3cc4dbb66d Expose nbins and ratio (#50398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50398

Test Plan: fbcode/caffe2/test/quantization/test_quantized_op.py

Differential Revision: D25873541

fbshipit-source-id: 7c3cdbb38a1e943e7fa8943a4195dc65d9d95105
2021-04-20 16:28:00 -07:00
af7775ba26 Types for caffe2/torch/testing/_internal/common_distributed.py (#55338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55338

Test Plan: Sandcastle

Reviewed By: pritamdamania87, ngimel

Differential Revision: D27575367

fbshipit-source-id: ca8eb77967af71ce2734408b8e2e15bf64a5ab4a
2021-04-20 16:26:53 -07:00
8ae8fb7dd1 [iOS GPU][Stub] Move conv2d_prepack impl from MetalPrepackOpRegister.cpp to MetalConvolution.cpp (#56491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56491

Move the prepack convolution to the op file to get rid of the selective compilation.
ghstack-source-id: 126960054

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D27719539

fbshipit-source-id: 75fb3849858a31a915828a0f5f6f3d4066ff4c9b
2021-04-20 16:21:18 -07:00
15734f5b6f Ignore warnings for record_function_ops (#56543)
Summary:
This hides the warnings from #35026 until we can fix them for real by migrating to custom classes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56543

Pulled By: driazati

Reviewed By: rohan-varma

Differential Revision: D27895085

fbshipit-source-id: a325a5d8cefb20a5033c1a059e49c03c08514f18
2021-04-20 16:17:30 -07:00
20e88401db Add monkey type config for JIT (#54513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54513

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27881707

Pulled By: nikithamalgifb

fbshipit-source-id: d318a5f3fc2deb7d9b2364962ec709c6bbb68b2c
2021-04-20 16:11:53 -07:00
17b8a4db1c [nnc] Support pow on CPU (#56308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56308

But only for float tensors.  Even on CUDA, int tensors just have weird
behavior with pow, and I bet FP is so much more common that it's just not worth
trying to fuse ints here.
ghstack-source-id: 126769637

Test Plan: `pytest test_jit_fuser_te.py -k test_binary_pow`

Reviewed By: navahgar

Differential Revision: D27834694

fbshipit-source-id: 7274d72cf02ab95d63574b6c17995b8f34560810
2021-04-20 15:13:03 -07:00
1e03a2505f add channels last for MaxPool2d (#56361)
Summary:
Add channels-last support for MaxPool2d.
This is a replacement for https://github.com/pytorch/pytorch/pull/48917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56361

Reviewed By: heitorschueroff

Differential Revision: D27874142

Pulled By: VitalyFedyunin

fbshipit-source-id: bc9604def9c974d7b59621fc709a39948088b992
2021-04-20 15:02:18 -07:00
7d4e9bdba1 Add type hint for SequentialSampler (#56374)
Summary:
Add type hint for SequentialSampler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56374

Reviewed By: heitorschueroff

Differential Revision: D27884528

Pulled By: ejguan

fbshipit-source-id: 68eb900643098565743245c843e76e464f981458
2021-04-20 14:45:52 -07:00
c65284aa07 Remove caption for Lang Reference (#56526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56526

Test Plan: Imported from OSS

Reviewed By: navahgar, gmagogsfm

Differential Revision: D27891208

Pulled By: nikithamalgifb

fbshipit-source-id: 50da4f08a01b5407c9a1ead535539a5a26aea0f7
2021-04-20 14:33:42 -07:00
12b5e666b0 add codegen subdirectories to mypy-strict.ini (#56523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56523

Test Plan: Imported from OSS

Reviewed By: malfet, samestep

Differential Revision: D27890855

Pulled By: bdhirsh

fbshipit-source-id: 78cd725bcf534b8410bdfaf93d2eb681e8a56ff7
2021-04-20 14:00:46 -07:00
6e1fc5cef8 [quant] added dq->op->q quantization patterns for GELU and softmax ops (#56004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56004

Added reference pattern support for GELU, softmax, and bmm for int dtypes. For GELU and softmax, this consisted of adding reference patterns to the default node handler for int dtypes. Note that the GELU and softmax patterns are not registered, since they do not have a proper quantized kernel, which means they would either add unnecessary dequant and quant ops to the network or simply error. This can be circumvented with custom qconfig usage, as in test_gelu_reference.

bmm was added within binary ops, along with some significant changes to how that code is structured. Theoretically the reference pattern used for bmm could be applied to other dtypes. This was not enabled because of issues relating to Line 1323 in quantize.py. In essence, the prepare step does not know whether an op will use a reference pattern or not, so for ops that are supported with one dtype in reference mode and another dtype normally, this has the potential to cause issues. This is difficult to get around without the is_reference flag being available in the prepare step or the discussed changes around separating
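
For readers unfamiliar with the dq->op->q idea, a minimal sketch (hypothetical function, not the FX graph mode implementation):

```python
import torch

def reference_gelu(xq, scale, zero_point):
    x = xq.dequantize()              # dq: back to float
    y = torch.nn.functional.gelu(x)  # op: run the float kernel
    # q: re-quantize the result
    return torch.quantize_per_tensor(y, scale, zero_point, torch.quint8)
```
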

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_gelu_reference
python test/test_quantization.py TestQuantizeFxOps.test_gelu_normal
python test/test_quantization.py TestQuantizeFxOps.test_softmax_reference
python test/test_quantization.py TestQuantizeFxOps.test_softmax_normal
python test/test_quantization.py TestQuantizeFxOps.test_silu_reference
python test/test_quantization.py TestQuantizeFxOps.test_bmm_int_reference
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFuseFx
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxModels

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D27818340

fbshipit-source-id: de65be0797035463cd2d1b0e4677d1a87f69143c
2021-04-20 13:26:15 -07:00
ea4af1511c [Pytorch] Better error message for bundling inputs a second time (#56086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56086

ghstack-source-id: 126671245

Test Plan: unittest

Reviewed By: dhruvbird

Differential Revision: D27778582

fbshipit-source-id: 6b59aa7ddb25c1b3162bbffdf0dd212a96f22bd3
2021-04-20 12:28:27 -07:00
43eb21bff3 [skip ci] Add simple local actions runner (#56439)
Summary:
This pulls out shell scripts from an action and runs them locally as a first pass at https://github.com/pytorch/pytorch/issues/55847. A helper script extracts specific steps in some order and runs them:

```bash
$ time -p make lint -j 5  # run lint with 5 CPUs
python scripts/actions_local_runner.py \
        --file .github/workflows/lint.yml \
        --job 'flake8-py3' \
        --step 'Run flake8'
python scripts/actions_local_runner.py \
        --file .github/workflows/lint.yml \
        --job 'mypy' \
        --step 'Run mypy'
python scripts/actions_local_runner.py \
        --file .github/workflows/lint.yml \
        --job 'quick-checks' \
        --step 'Ensure no trailing spaces' \
        --step 'Ensure no tabs' \
        --step 'Ensure no non-breaking spaces' \
        --step 'Ensure canonical include' \
        --step 'Ensure no unqualified noqa' \
        --step 'Ensure no direct cub include' \
        --step 'Ensure correct trailing newlines'
python scripts/actions_local_runner.py \
        --file .github/workflows/lint.yml \
        --job 'cmakelint' \
        --step 'Run cmakelint'
quick-checks: Ensure no direct cub include
quick-checks: Ensure canonical include
quick-checks: Ensure no unqualified noqa
quick-checks: Ensure no non-breaking spaces
quick-checks: Ensure no tabs
quick-checks: Ensure correct trailing newlines
cmakelint: Run cmakelint
quick-checks: Ensure no trailing spaces
mypy: Run mypy
Success: no issues found in 1316 source files
Success: no issues found in 56 source files
flake8-py3: Run flake8
./test.py:1:1: F401 'torch' imported but unused
real 13.89
user 199.63
sys 6.08
```

Mypy/flake8 are by far the slowest, but that's mostly just because they're wasting a bunch of work linting the entire repo.

In followup, we could/should:
* Improve ergonomics (i.e. no output unless there are errors)
* Speed up lint by only linting files changed between origin and HEAD
* Add clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56439

Reviewed By: samestep

Differential Revision: D27888027

Pulled By: driazati

fbshipit-source-id: d6f2a59a45e9d725566688bdac8e909210175996
2021-04-20 12:17:55 -07:00
ab20ba4427 Fix issue with dispatch key: AutogradXPU (#56336)
Summary:
Automatically add the dispatch key "AutogradXPU" for "xpu" tensors, and set "fall through" for AutogradXPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56336

Reviewed By: heitorschueroff

Differential Revision: D27872125

Pulled By: ailzhang

fbshipit-source-id: c120c62becd577699f9aecb4c356c889bd37ad06
2021-04-20 12:09:59 -07:00
8868f9c8e3 [TensorPipe] Use targetDevice in tensorpipe_agent. (#56346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56346

Now that TensorPipe's API has `targetDevice`, use that instead of
manually writing the CUDA device index in `metadata`.

Test Plan: CI

Reviewed By: lw

Differential Revision: D27703235

fbshipit-source-id: c5b620e3b3ce619367412efdbe9fa3778f6b8869
2021-04-20 11:54:13 -07:00
a8ea490f67 Revert caffe2 print stack traces flag (#56496)
Summary:
This reverts the change in #56198 which broke some internal tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56496

Pulled By: driazati

Reviewed By: walterddr

Differential Revision: D27886611

fbshipit-source-id: b04de01b3bcf886294ff7ae45776b5955ce19858
2021-04-20 11:43:33 -07:00
5017c5fcad [SPMD] Remove _specify_ddp_gpu_num method (#56425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56425

As SPMD mode is gone, `_specify_ddp_gpu_num` becomes useless. It only checks whether the module is a GPU module, which is already checked by the callers of this function (in fairscale and some other codebases).

Additionally, remove the `enable_pytorch_sync_bn` wrapper, which only calls this function and does nothing else.
ghstack-source-id: 126885376

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27866440

fbshipit-source-id: d2fd5cf43eda25c0a2bd35f647848ec0dbd6ad0f
2021-04-20 11:17:47 -07:00
04de24d10a Separate profiling tests from p2p tests (#56412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56412

We are investigating some flaky profiling tests such as https://github.com/pytorch/pytorch/issues/56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal around the send/recv collective tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.
ghstack-source-id: 126920867

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27864845

fbshipit-source-id: 01f04a884482ec7741323218a7f8f4a8451eb4ae
2021-04-20 10:42:00 -07:00
59b61f912a Switch assertWarnsOnceRegex logic to check any instead of all. (#56434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56434

If we hit multiple TORCH_WARN calls from different sources when running the statement, it makes more sense to me to check that the regex is matched in any one of the warning messages instead of all of them.
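
A plain-Python sketch of the new "any" semantics (illustrative only, not the test-framework code):

```python
import re
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("warning from source A")
    warnings.warn("warning from source B")

# The regex only needs to match one captured message, not all of them.
assert any(re.search("source B", str(w.message)) for w in caught)
```
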

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27871946

Pulled By: ailzhang

fbshipit-source-id: 5940a8e43e4cc91aef213ef01e48d506fd9a1132
2021-04-20 10:37:36 -07:00
75651e3cc4 Add remaining ToCs to ToC lint (#56487)
Summary:
The lint was originally added in https://github.com/pytorch/pytorch/issues/54974, but at the time I didn't realize that these other Markdown files also each have a table of contents:

- `GLOSSARY.md`
- `torch/csrc/jit/OVERVIEW.md`
- `torch/csrc/jit/docs/serialization.md`
- `torch/fx/OVERVIEW.md`

This PR adds those files to the lint, and also changes the rule from using a fixed list of filenames to a `git grep` command that finds all Markdown files containing this magic comment:

```md
<!-- toc -->
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56487

Test Plan: The "Lint / toc" job in GitHub Actions.

Reviewed By: janeyx99

Differential Revision: D27884885

Pulled By: samestep

fbshipit-source-id: 5462437502b17fba93abf5098e21754bf566a4fe
2021-04-20 10:28:47 -07:00
062e70590c Add OpInfo tests for torch.{dot, vdot, bmm, mv} (#56409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56409

Reviewed By: nikithamalgifb

Differential Revision: D27870769

Pulled By: anjali411

fbshipit-source-id: a1a0e89856529a4739c7612c5b1e3c5ed2569126
2021-04-20 10:22:15 -07:00
e4faebca0d Automated submodule update: tensorpipe (#56259)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 416a9d8a4a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56259

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: pbelevich

Differential Revision: D27881993

Pulled By: beauby

fbshipit-source-id: e7d8cefe89c6fb09b59e3ef57da05a7ab0a3cb16
2021-04-20 09:38:05 -07:00
cyy
f74a346213 Fix torch.hub.load("pytorch/vision") fails to validate the master branch (#56138)
Summary:
We should iterate over all pages of the branches API. Otherwise, even "pytorch/vision" would fail to find its master branch.
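
A sketch of the paginated lookup (hypothetical helper; the actual torch.hub implementation differs):

```python
import json
import urllib.request

def iter_branches(repo):
    # Walk every page of the GitHub branches API instead of only the first.
    page = 1
    while True:
        url = f"https://api.github.com/repos/{repo}/branches?per_page=100&page={page}"
        with urllib.request.urlopen(url) as resp:
            batch = json.load(resp)
        if not batch:
            return
        yield from (branch["name"] for branch in batch)
        page += 1

print("master" in iter_branches("pytorch/vision"))  # found even past page 1
```
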

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56138

Reviewed By: heitorschueroff

Differential Revision: D27872346

Pulled By: ailzhang

fbshipit-source-id: 55881558f7980b1fb08b0d08ed6687a38df06edd
2021-04-20 09:33:25 -07:00
b2dae294b6 Fix distributed.test_jit_c10d flaky tests (#56410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56410

Changes:
- Move create_tcp_store() helper function to common file
- Update test_jit_c10d to retry TCPStore creation in case the allocated port becomes used (see the sketch below)

fixes https://github.com/pytorch/pytorch/issues/55053
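
A sketch of the retry shape described above (`find_free_port` is written out here for illustration; the real test utility differs):

```python
import socket

from torch.distributed import TCPStore

def find_free_port():
    # Ask the OS for a currently free port.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def create_tcp_store(addr="127.0.0.1", retries=3):
    for _ in range(retries):
        try:
            # The port can be taken between allocation and bind, so retry.
            return TCPStore(addr, find_free_port(), 1, True)
        except RuntimeError:
            continue
    raise RuntimeError("failed to create a TCPStore after retries")
```
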

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27869560

Pulled By: H-Huang

fbshipit-source-id: f4a6613049bb25e6f6f194214379a380968bb19c
2021-04-20 09:28:27 -07:00
0e0a5471ef Remove an unused variable in SoftmaxWithLossOp (#56321)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56321

Reviewed By: heitorschueroff

Differential Revision: D27854332

Pulled By: bdhirsh

fbshipit-source-id: 1a9dcfdc63412069cee4444a595c3460815d3c6c
2021-04-20 09:18:08 -07:00
4e0760f41a Remove is_variable from tests (#56305)
Summary:
`is_variable` spits out a deprecation warning during the build (if it's
still something that needs to be tested we can ignore deprecated
warnings for the whole test instead of this change).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56305

Pulled By: driazati

Reviewed By: ezyang

Differential Revision: D27834218

fbshipit-source-id: c7bbea7e9d8099bac232a3a732a27e4cd7c7b950
2021-04-20 09:03:53 -07:00
eacf6f1b51 Updated the tech docs to be consistent with other two descriptions (#56338)
Summary:
Updated the Beta channel description to be consistent with other two channels (Stable, Prototype)

The screenshot attached is for reference before changes.

![Screenshot 2021-04-18 12-36-55](https://user-images.githubusercontent.com/20245964/115137303-0c077380-a043-11eb-9532-c46486e8a75a.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56338

Reviewed By: heitorschueroff

Differential Revision: D27854350

Pulled By: bdhirsh

fbshipit-source-id: a21208c11242e84de313d5b11269264756bf9029
2021-04-20 09:00:42 -07:00
c61778355c Upgrade ShellCheck to v0.7.2 (#56445)
Summary:
[First ShellCheck release in over a year!](https://github.com/koalaman/shellcheck/releases/tag/v0.7.2) I'm thankful I did https://github.com/pytorch/pytorch/issues/55109 at the beginning of this month, because otherwise `master` would have just suddenly started failing a few hours ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56445

Test Plan:
CI. You can also run `shellcheck` locally; for instance, if you're on Mac and [installed it with Homebrew](https://github.com/koalaman/shellcheck/tree/v0.7.2#installing):
```sh
brew upgrade shellcheck
rm -r .extracted_scripts ; tools/extract_scripts.py --out=.extracted_scripts
tools/run_shellcheck.sh .jenkins/pytorch .extracted_scripts
```

Reviewed By: janeyx99

Differential Revision: D27874084

Pulled By: samestep

fbshipit-source-id: 3bd871a368fe03aecd559e2f55bce36af49cfa27
2021-04-20 07:58:22 -07:00
3d878dee45 Added out= variant for torch.linalg.lstsq (#54721)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54721

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27874711

Pulled By: mruberry

fbshipit-source-id: 696ebb6eb0bad81988e9cb7a081388a3a5ab3e2c
2021-04-20 07:09:06 -07:00
43c747859c Use c10 backtrace generation in caffe2 (#56198)
Summary:
This cuts out caffe2's old backtrace generation in favor of the one already in c10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56198

Pulled By: driazati

Reviewed By: nikithamalgifb

Differential Revision: D27868282

fbshipit-source-id: aa9b9691271eaa3f95baab48773ffefebd924ae2
2021-04-20 07:00:33 -07:00
63dac82444 Make grad mode error just a warning (#56401)
Summary:
Temporary fix to give people extra time to finish the deprecation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56401

Reviewed By: xw285cornell, drdarshan

Differential Revision: D27862196

Pulled By: albanD

fbshipit-source-id: ed460267f314a136941ba550b904dee0321eb0c6
2021-04-20 06:30:55 -07:00
0ea4eb745b [opinfo] torch.lerp: move remaining cases from tensor_methods to opinfo (#55665)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/54304
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55665

Reviewed By: bdhirsh

Differential Revision: D27845528

Pulled By: mruberry

fbshipit-source-id: 36bdf14c4923a83fb8e4f4d361467d9568784011
2021-04-20 02:01:34 -07:00
df8bb5a42b Add OpInfo for polygamma and remove torch_op_tests Infra (#51966)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

* OpInfo entry for Polygamma
* Removes infra of torch_op_tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51966

Reviewed By: bdhirsh

Differential Revision: D27851858

Pulled By: mruberry

fbshipit-source-id: 7f1d0273065e1df56a152f95a14513959af29a1b
2021-04-20 01:03:09 -07:00
a661e58731 Removed infos vector in torch.linalg.qr (#56248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56248

The `info` error code for QR decomposition only indicates wrong parameters; when everything is implemented correctly it will never be nonzero, so we don't need to check it on the CPU path.
For MAGMA, `checkMagmaInternalError` is added, which checks for failed memory allocations internal to MAGMA.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27850414

Pulled By: mruberry

fbshipit-source-id: ddda1209008f879f24c9ad08739e10c28b194d18
2021-04-20 00:08:31 -07:00
c5c5230890 Pytorch resolve bug around incorrect rdzv handler resolution (#56386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386

The diff resolves a bug around incorrect handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.
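
The bug class in a nutshell (hypothetical registry sketch, not the actual rendezvous code):

```python
def _create_static_handler(params):
    ...  # build a static rendezvous handler

def _create_etcd_handler(params):
    ...  # build an etcd rendezvous handler

# Buggy mapping: each scheme pointed at the other handler.
# handlers = {"static": _create_etcd_handler, "etcd": _create_static_handler}

# Fixed mapping: each scheme resolves to its own handler.
handlers = {"static": _create_static_handler, "etcd": _create_etcd_handler}
```
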

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher

Added test_launcher to the ci/cd tests

Reviewed By: cbalioglu

Differential Revision: D27858897

fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
2021-04-19 23:50:28 -07:00
7ae45403a1 [static runtime] support aten::__getitem__ natively (#55310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55310

Test Plan:
Run on the dper generated local/local_ro model
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local_ro.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=0 --do_profile=0 --adsfinder_compatibility=1
```

Reviewed By: hlu1

Differential Revision: D27569662

fbshipit-source-id: df68c2fdd95e39a30aec35ddbaf1f5df0bc3a3da
2021-04-19 23:08:19 -07:00
85f4025ad7 Port adaptive_max_pool3d to structured kernel (#56320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56320

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27862931

Pulled By: SplitInfinity

fbshipit-source-id: 99fc79611a95ce934ed879f8ae6b0c26a645813b
2021-04-19 22:42:12 -07:00
0d4394778e Port adaptive_max_pool2d to structured kernel (#56317)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56317

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27862930

Pulled By: SplitInfinity

fbshipit-source-id: e2d199df0ebaf585698f26fcfda5a0301ba67ade
2021-04-19 22:41:05 -07:00
513e9e0927 Fix cxx11 abi (#55984)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55984

Reviewed By: agolynski

Differential Revision: D27809478

Pulled By: seemethere

fbshipit-source-id: b00801e50c364b307009349594e396b934cc3a49
2021-04-19 22:20:10 -07:00
07653b7fe0 [SPMD] Remove ddp_gpu_size field from SyncBatchNorm (#55946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946

As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27746021

fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
2021-04-19 21:41:29 -07:00
023231a2ac [torch/distributed] Fix pydoc for torch.distributed.elastic.multiprocessing (replace Redirect with Std)
Summary: `Redirects` was renamed to `Std` in `torch.distributed.elastic.multiprocessing.api`. Pointed out by a user in https://github.com/pytorch/elastic/issues/147.

Test Plan: N/A just doc change

Reviewed By: tierex

Differential Revision: D27866614

fbshipit-source-id: 9fb901aae7ebe11cde13000a1c118de527f34400
2021-04-19 21:40:16 -07:00
94406f77f6 [quant][graphmode][fx] Add support for keeping output quantized for list and dict (#56391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56391

Previously we only supported keeping the output quantized for tensor outputs; this PR adds support for lists and dicts (values) as well.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27860327

fbshipit-source-id: e770160ced47a7173abff5505ec620bd2b1a0b01
2021-04-19 21:37:11 -07:00
eqy
42f0fe1fe3 fix misaligned access #56325 (#56403)
Summary:
CC ngimel ptrblck
ref: https://github.com/pytorch/pytorch/issues/56325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56403

Reviewed By: mruberry

Differential Revision: D27866625

Pulled By: ngimel

fbshipit-source-id: 9dff0e9749f8de57fac6a653f685c14854611a02
2021-04-19 20:12:03 -07:00
92d24e3060 Revert D27855386: [pytorch][PR] Support factory kwargs in torch.nn modules
Test Plan: revert-hammer

Differential Revision:
D27855386 (40483acc51)

Original commit changeset: dabd505d2a04

fbshipit-source-id: f5bf3120d87861b30a8e1bf11977ad7d27cd8500
2021-04-19 20:07:20 -07:00
b1282bc109 Use stack trace implementation in common/process on fbcode (#56400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56400

See https://github.com/pytorch/pytorch/issues/56399

I don't have time to fix this properly, so this is just to stem the
bleeding.  Someone should go and figure out what it is that common/process
is doing better.
ghstack-source-id: 126868405

Test Plan:
I manually patched this into D27765125, triggered an exception, and observed that everything symbolized correctly:

```
[9]   what():  new_refcount != 1INTERNAL ASSERT FAILED at "caffe2/c10/util/intrusive_ptr.h":234, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at caffe2/c10/util/intrusive_ptr.h:234 (most recent call first):
# 0  c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1  c10::(anonymous namespace)::GetFetchStackTrace[abi:cxx11]()::$_0::operator()[abi:cxx11]() const
# 2  std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_0>::_M_invoke(std::_Any_data const&)
# 3  std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const
# 4  c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 5  c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 6  c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*)
# 7  c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> >::retain_()
# 8  c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> >::intrusive_ptr(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&)
# 9  c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> >& c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> >::operator=<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> >(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&) &
```

Reviewed By: driazati

Differential Revision: D27861908

fbshipit-source-id: 84c1dfb1ef28c460b020646f836c153562ad5c44
2021-04-19 19:30:11 -07:00
f096245610 AutoNonVariableTypeMode->InferenceMode in OSS. (#56421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56421

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866609

Pulled By: ailzhang

fbshipit-source-id: 040991a031c5511501b03cfe21a4a636586e120e
2021-04-19 18:07:41 -07:00
5b4c3a9da1 record Torch DP and DDP modules forward (#55578)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55578

Reviewed By: gdankel

Differential Revision: D27862392

Pulled By: ilia-cher

fbshipit-source-id: 18545d23e35a97c8f760707fecb696a24d47dc0a
2021-04-19 17:52:59 -07:00
31677c5fcb [reland] .github: Add initial linux CI workflow (#56280)
Summary:
This reverts commit 6b5ed5ec454ecd8597ff0465305915dd1e09a805.

There'll also probably be fixes here, see diff from original PR: https://github.com/pytorch/pytorch/compare/f2abce0...ci-all/add-initial-linux-ci-gha

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56280

Reviewed By: walterddr

Differential Revision: D27826012

Pulled By: seemethere

fbshipit-source-id: 71cad1d7f840ede5025b1bb4a33d628aa74686d1
2021-04-19 17:36:09 -07:00
0917061f43 [vulkan][jit_pass] Add optimized_for_vulkan attribute on vulkan pass (#56414)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56414

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D27865144

Pulled By: IvanKobzarev

fbshipit-source-id: c59f0eb2722f3fce0a91d9bd0b7cae3e0436c496
2021-04-19 17:27:59 -07:00
7adc04d7b5 Add more logging to debug test_reduce_sum_cuda_twice (#56406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56406

It has been hard to reproduce https://github.com/pytorch/pytorch/issues/50840, so this adds some debug logging to get a better sense of the issue.
ghstack-source-id: 126874222

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27863328

fbshipit-source-id: e6f125b77cfb636b90598eb54395609654f5e139
2021-04-19 17:24:49 -07:00
0d94c04247 [NNC] Change fuseLoops API to return bool flag and not throw any exceptions (#56353)
Summary:
Partial fix for https://github.com/pytorch/pytorch/issues/56357

Changes the `fuseLoops` API to the following form:
```
static bool fuseLoops(const std::vector<For*>& loops, For** fused);
```

Also, adds a new API to check for loop-carried dependences:
```
static bool hasLoopCarriedDependence(For* loop);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56353

Reviewed By: bertmaher

Differential Revision: D27856214

Pulled By: navahgar

fbshipit-source-id: 443557088692585657faee296602c547a00117dd
2021-04-19 17:20:40 -07:00
fe3f6f2da2 [iOS GPU][Kernel] Implement mean.dim using MPSReduce kernel (#56073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56073

Implement the `mean.dim` operator for the Metal backend. Currently, we don't support reducing the batch dimension.
ghstack-source-id: 126802129

Test Plan:
- Sandcastle
- CircleCI
- Unit tests

```
2021-03-23 13:01:29.663842-0700 PyTorchPlayground[64572:9575354] [bool test_mean_dim()],[1 5 2 2 ],[SUCCEED]
2021-03-23 13:01:29.666230-0700 PyTorchPlayground[64572:9575354] [bool test_mean_dim2()],[1 5 2 2 ],[SUCCEED]
```

Reviewed By: dhruvbird

Differential Revision: D27269394

fbshipit-source-id: fafcdde50ac457a8488c6170d0a8d3db1871439b
2021-04-19 16:34:16 -07:00
fa7534788b Fix typo in gradcheck.py (#56368)
Summary:
betwen -> between

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56368

Reviewed By: bdhirsh

Differential Revision: D27860450

Pulled By: albanD

fbshipit-source-id: 86ef7b62e228c15319683a8d72b404b5f527666e
2021-04-19 15:53:02 -07:00
34d0bd5b1d Fix TestTypeHints.test_doc_examples (#56388)
Summary:
https://github.com/pytorch/pytorch/issues/54268 removed `test_run_mypy` since now we're running `mypy` as its own job in GitHub Actions, but previously we used this `set_cwd` context manager in that test to ensure that we picked up the `mypy` config correctly. However, for some reason, we have not been doing that in `test_doc_examples`, which has been succeeding in CI for a while despite being broken.

Specifically, [`run_test.py` changes the working directory to `test/` before running test files](48aaea3359/test/run_test.py (L534-L535)), which is contrary to [what `CONTRIBUTING.md` instructs developers to do](48aaea3359/CONTRIBUTING.md (python-unit-testing)). As a result, `test/test_type_hints.py` has been passing in CI, but if you run it locally from the root of the repo, you get this error:
```
F
======================================================================
FAIL: test_doc_examples (__main__.TestTypeHints)
Run documentation examples through mypy.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_type_hints.py", line 127, in test_doc_examples
    self.fail(f"mypy failed:\n{stdout}")
AssertionError: mypy failed:
test/generated_type_hints_smoketest.py:851: error: Name 'tensor' is not defined  [name-defined]
test/generated_type_hints_smoketest.py:853: error: Name 'tensor' is not defined  [name-defined]
Found 2 errors in 1 file (checked 1 source file)

----------------------------------------------------------------------
Ran 1 test in 1.416s

FAILED (failures=1)
```
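
For reference, a rough sketch of what a `set_cwd`-style context manager does (assumed behavior; the actual helper lives in the test suite):

```python
import os
from contextlib import contextmanager

@contextmanager
def set_cwd(path):
    # Temporarily switch the working directory (so mypy picks up its config
    # from the repo root), then restore the old directory on exit.
    old_cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_cwd)
```
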

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56388

Test Plan:
Before this PR, the first of the following two commands should succeed (since that is essentially what is run in CI), but the second should fail:
```
python test/run_test.py -i test_type_hints
python test/test_type_hints.py
```
After this PR, both commands should succeed.

Reviewed By: driazati

Differential Revision: D27860173

Pulled By: samestep

fbshipit-source-id: efb82fffd7ccb04d0331824b40bdef7bbc319c98
2021-04-19 15:27:09 -07:00
2f5c352162 Fix protobuf warnings in caffe2 (#56186)
Summary:
This guards some deprecated usages of the Protobuf API behind an `#ifdef` (this is how onnx does it as well)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56186

Pulled By: driazati

Reviewed By: bertmaher, dzhulgakov

Differential Revision: D27803121

fbshipit-source-id: 2d3a348ec1ab9879a0d8f2dff17c5444fd4baf2c
2021-04-19 15:19:53 -07:00
638617f9f8 Write mini dump on pybind exceptions (#55652)
Summary:
We register an [error handler](https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html#registering-custom-translators) with pybind so that C++ exceptions are passed to Python and raised as runtime errors that can be `try...except`ed etc. Since these don't terminate the program (until Python does), they never fire the signal handler to write a minidump out with the crash information. This PR adds some logic in the exception translator to write out a minidump if enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55652

Pulled By: driazati

Reviewed By: bertmaher

Differential Revision: D27830952

fbshipit-source-id: 26e8f913e99dff971a4eb09eb87221c66f759763
2021-04-19 14:53:43 -07:00
a14178ed5c Remove useless code (#56230)
Summary:
Since we're using a specific VS version, we don't need to specify the VC version.
In fact, the VC version is not used in CI now.

Why make this change now?
I'm writing a robot to update vs_install.ps1 (https://github.com/pytorch/pytorch/pull/56261/) every 2 weeks.
It will submit a PR to automatically check whether the latest VS is compatible with PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56230

Reviewed By: bdhirsh

Differential Revision: D27856647

Pulled By: ezyang

fbshipit-source-id: b46f2bdf35ab5841fded470e23bbf7a01d5f60f4
2021-04-19 14:22:18 -07:00
04607a58f1 [pytorch] Fix compiler warnings from conv.h (#56181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56181

Need to change to size_t vs size_t:

Reviewed By: ezyang

Differential Revision: D27800849

fbshipit-source-id: 25f744128eb8750c382dc967a99af3c9f16247d9
2021-04-19 14:13:02 -07:00
2c9972facf [iOS GPU][Kernel] Implement transpose in Metal shaders (#54522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54522

Implement the transpose operator in metal shaders using textures.
ghstack-source-id: 126802125

Test Plan:
- Metal operator tests
```
2021-03-22 02:25:53.941006-0700 PyTorchPlayground[57924:9047121] [bool test_transpose()],[1 2 2 5 ],[SUCCEED]
2021-03-22 02:25:53.949834-0700 PyTorchPlayground[57924:9047121] [bool test_transpose2()],[1 2 58 28 28 ],[SUCCEED]
2021-03-22 03:12:19.786584-0700 PyTorchPlayground[58230:9066223] [bool test_transpose3()],[4 5 6 ],[SUCCEED]
```
- Sandcastle CI
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27225940

fbshipit-source-id: 14bfb96435a39aecf4f14bc5e2f7232421014328
2021-04-19 13:52:54 -07:00
e3900d2ba5 Add lint for unqualified noqa (#56272)
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.

Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27:            print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28:            print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:

- If you change them to anything else, the warnings will still be suppressed.
- If you add the necessary colons then it is revealed that `E261` was also being suppressed, unintentionally:
  ```
  test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
  test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
  ```
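
For clarity, a minimal sketch of the wrong and right forms:

```python
# Wrong: no colon, so flake8 ignores "E401" and suppresses all errors here.
import os, sys  # noqa E401

# Right: the colon makes the suppression specific to E401.
import os, sys  # noqa: E401
```
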

I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2365189927

Reviewed By: janeyx99

Differential Revision: D27830127

Pulled By: samestep

fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
2021-04-19 13:16:18 -07:00
7bcf95bbb6 [iOS GPU] Move the definition of fp16_t to MetalUtils.h (#54521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54521

Move code around. No significant feature changes.
ghstack-source-id: 126802126

Test Plan:
- CircleCI
- Sandcastle

Reviewed By: SS-JIA

Differential Revision: D27225941

fbshipit-source-id: 8b5439b6bf5e24ea755cb1941d92fae26a8d5a06
2021-04-19 13:08:01 -07:00
40483acc51 Support factory kwargs in torch.nn modules (#54508)
Summary:
Continuation of https://github.com/pytorch/pytorch/pull/53144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54508

Reviewed By: bdhirsh

Differential Revision: D27855386

Pulled By: jbschlosser

fbshipit-source-id: dabd505d2a04208e74b158570fb2859c736eea2c
2021-04-19 12:24:58 -07:00
ca6e5c7fc9 [NNC] added more python bindings for loopnest (#56213)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56213

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27809000

Pulled By: huiguoo

fbshipit-source-id: 013a1ae74397650a958c4fdcd39b38a3c0ff17ef
2021-04-19 11:28:00 -07:00
d1b6383d65 Hide warnings for deprecated quantization APIs (#56291)
Summary:
These have a tracking task to actually fix them but in the meantime they
should not be clogging up everyone's build output (see #55952).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56291

Pulled By: driazati

Reviewed By: bertmaher

Differential Revision: D27830229

fbshipit-source-id: f1e5d6e9b2c63d4a4ad99a1744a520f8c681c22b
2021-04-19 11:11:33 -07:00
48aaea3359 unified GlooStore and c10d store API (#56222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55719

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27785267

Pulled By: msaroufim

fbshipit-source-id: ce247f9226ecc971af8e1f08adeb835f64973e12
2021-04-19 10:57:18 -07:00
5748cc0d11 [Mobile GPU] Ban mutations in JIT passes (#56070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56070

**Summary**

Currently, we're returning copies instead of aliases on mobile GPU (Metal/Vulkan). As suggested by ailzhang, we could use the JIT pass `RemoveTensorMutation` to ban mutations ahead of time. I've tested the two scenarios shown below. They both work fine on mobile.

- view

```
class Model (torch.nn.Module):
    def forward(self, x):
        y = x.view(-1)
        z = torch.tensor(2.0).float()
        y.add_(z)
        return x

m = Model()
x = torch.rand(2, 3)
y = m(x)
```
- transpose

```
class Model (torch.nn.Module):
    def forward(self, x):
        y = x.transpose(1, 2)
        z = torch.tensor(2.0).float()
        x.add_(z)
        return y

m = Model()
x = torch.rand(1, 2, 3)
y = m(x)
```

As we're adding more ops, we should add more tests to cover all the alias ops - https://github.com/pytorch/pytorch/blob/master/tools/autograd/gen_inplace_or_view_type.py#L31-L80

**Next step**

Synced offline with eellison. Since mutation removal is also being used in ONNX, Static runtime, some jit optimizations, Torch -> TVM, etc, instead of inventing something new, we would continue to make it better in cases where it fails.

Although this JIT pass could work for most of the mobile models, there are cases that it can't cover. What we're going to do next is to implement stub ops for GPU models to let them run on server side, such that users can compare results to see if there is any discrepancy.

ghstack-source-id: 126802123

Test Plan:
- Sandcastle
- CircleCI

Reviewed By: raziel

Differential Revision: D27692683

fbshipit-source-id: 9d1be8a6c0a276032b1907807a54fbe2afd882f9
2021-04-19 10:43:53 -07:00
98162cb0bb Enable AutoGradMode in InferenceMode. (#56107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56107

Test Plan: Imported from OSS

Reviewed By: pbelevich, driazati

Differential Revision: D27807137

Pulled By: ailzhang

fbshipit-source-id: bfacf11ec5a431589cec73d6371cac81b425a115
2021-04-19 10:24:20 -07:00
8881f504f1 Remove the unused maximum and minimum functions in vec256_base (#56313)
Summary:
They are unused, unrelated to vectorization, and confusing for code
readers (each of them have 2 overloads that are actually used).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56313

Reviewed By: bdhirsh

Differential Revision: D27854290

Pulled By: ezyang

fbshipit-source-id: 14945ceac39a3f19e5d0f8d762b17f8c2172b966
2021-04-19 10:00:52 -07:00
6409d34482 Sort glob of files to ensure it is deterministic (#55850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55850

ghstack-source-id: 126339587

Test Plan: diff on top builds successfully on Sandcastle

Reviewed By: wconstab

Differential Revision: D27722254

fbshipit-source-id: 181ae1a874dbfc73688dcc5b7e9264d79abd44d3
2021-04-19 09:50:13 -07:00
838d3079ad Lazily initialize alias db in remove_mutation opt (#55949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55949

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27793881

fbshipit-source-id: eebde5b5142d8fecfee4756604d313b0da809882
2021-04-19 09:45:33 -07:00
98ac6f7cbc Increase default rendezvous timeout to 15 minutes
Summary: Increase default rendezvous timeout to 15 minutes to address slow static initialization.

Test Plan: n/a

Reviewed By: wilson100hong

Differential Revision: D27725655

fbshipit-source-id: a1b8c49b225b61be0d13ff5e52bf6677bf72f792
2021-04-19 09:20:15 -07:00
d806b06167 Support int32 indices in torch.repeat_interleave (#55102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55102

To avoid casting a tensor to `.long()`, we introduce support for int32 in `torch.repeat_interleave`.
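
A quick sketch of what this enables (assumed post-change behavior; previously the repeats tensor had to be cast with `.long()`):

```python
import torch

x = torch.tensor([1, 2, 3])
repeats = torch.tensor([2, 1, 2], dtype=torch.int32)  # int32, no cast needed

print(torch.repeat_interleave(x, repeats))  # tensor([1, 1, 2, 3, 3])
```
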

Reviewed By: ezyang

Differential Revision: D27478235

fbshipit-source-id: 08b4cce65fe94ff10535ddc07e1ba2bacea6a2cf
2021-04-19 09:07:25 -07:00
b6b2fc7e3f Added OpInfos of add & mm (#55915)
Summary:
Added `OpInfo`s of `add` & `mm`.

cc anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55915

Reviewed By: agolynski

Differential Revision: D27800077

Pulled By: heitorschueroff

fbshipit-source-id: 84be4b0930f6ef472622e6721a516cc182ac76d1
2021-04-19 08:56:19 -07:00
92991d9533 Add OpInfo for (nan)quantile (#55548)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55548

Reviewed By: mruberry

Differential Revision: D27796638

Pulled By: heitorschueroff

fbshipit-source-id: d7f09c4ffa7d726cc8228a16f818b74fb9e1a93a
2021-04-19 08:41:32 -07:00
d05e7c163f Revert D27600457: [pytorch][PR] Support factory kwargs in torch.nn modules
Test Plan: revert-hammer

Differential Revision:
D27600457 (1077f87269)

Original commit changeset: b58bfee61c39

fbshipit-source-id: 19d5bfc5133a3880383731d0332503ca1f3bce0c
2021-04-19 07:47:24 -07:00
7d17559152 [special] OpInfo i0e: fix missing check (#56232)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/56274

Missed to check if scipy is installed or not (for reference tests).

Thanks mruberry !!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56232

Reviewed By: ejguan

Differential Revision: D27832397

Pulled By: ezyang

fbshipit-source-id: abc40ce7bf14d3c0f20877030880663ccb7fe375
2021-04-19 07:12:39 -07:00
1077f87269 Support factory kwargs in torch.nn modules (#54508)
Summary:
Continuation of https://github.com/pytorch/pytorch/pull/53144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54508

Reviewed By: mrshenli

Differential Revision: D27600457

Pulled By: jbschlosser

fbshipit-source-id: b58bfee61c3917524b4622f63ef216c27a588eb1
2021-04-19 06:58:40 -07:00
7513455c74 Make tensordot resize output tensor's size if out= argument is specified & make it safely cast & copy output (#56286)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56022.
Fixes https://github.com/pytorch/pytorch/issues/56316

For `torch.tensordot`,
1. `tensordot`'s out variant now resizes the output tensor provided as the `out` argument if necessary (see the sketch after this list).
2. Added a check to verify if the output tensor provided as the argument for `out` is on the same device as the input tensors.
3. Added a check to verify if the dtype of the result is castable to the dtype of the output tensor provided as an argument for `out`.
4. Because of (2) & (3), `tensordot`'s out variant now [safely casts & copies output](https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch).
5. `test_tensordot` in `test_linalg.py` had a bug - the output tensor wasn't being defined to be on the same device as the input tensors. It was fixed by simply using a `device` argument in its definition.
6. Added an `OpInfo` for `tensordot` and modified the `OpInfo` for `inner`.
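
A minimal sketch of the out= behavior described in (1)-(4) (assumed post-fix behavior):

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)

out = torch.empty(0)  # deliberately the wrong shape; should be resized
torch.tensordot(a, b, dims=1, out=out)  # contract a's last dim with b's first
print(out.shape)  # torch.Size([3, 5])
```
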

cc heitorschueroff mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56286

Reviewed By: ngimel

Differential Revision: D27845980

Pulled By: mruberry

fbshipit-source-id: 134ab163f05c31a6900dd65aefc745803019e037
2021-04-19 04:20:21 -07:00
0e106fce9c add tests for torch.testing (#54784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54784

* #54769 make torch.testing asserts importable

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27717422

Pulled By: mruberry

fbshipit-source-id: 7526af4f17d8ffcc4ea5e5a5d98f07ceac89df40
2021-04-19 03:47:31 -07:00
2219286de4 Updated internal code for orgqr function (#56247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56247

Moved `apply_orgqr` to `BatchLinearAlgebraKernel.cpp`.

Removed the `infos` tensor parameter. We don't need to expose LAPACK/cuSOLVER error codes because they do not contain any useful information about the input. Its value is checked only in debug mode now, removing the device synchronization from the cuSOLVER path of `torch.linalg.householder_product` / `torch.orgqr`.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27844339

Pulled By: mruberry

fbshipit-source-id: 47aa20dfe2c116951b968362ad55e837caece042
2021-04-19 00:44:52 -07:00
b387f7ca47 [NNC] Make normalization transformation in-place (#56158)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/56157

This PR changes `normalize` API in `LoopNest` to transform the given `For` statement and not create a new one.

New API:

```
static bool normalize(For* f);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56158

Reviewed By: agolynski

Differential Revision: D27798361

Pulled By: navahgar

fbshipit-source-id: 57626a5a367bdf94a0efbd9dc8538f5e4e410d6b
2021-04-18 23:54:13 -07:00
22d4d9f4a6 [pytorch][PR] Automated submodule update: tensorpipe (#56348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56348

This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: f88994cf33

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27820550

fbshipit-source-id: efde79955af9a902c2d2bf38ed2705282a9ae2f0
2021-04-18 23:29:25 -07:00
ffdecc1ac4 [CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860)
Summary:
Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream.

The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events().

This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small.
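
For context, a minimal sketch of the cross-stream pattern whose end-of-life events are at issue (requires a CUDA device):

```python
import torch

side_stream = torch.cuda.Stream()
x = torch.empty(1024, device="cuda")

with torch.cuda.stream(side_stream):
    y = x * 2  # x is used on a stream other than its allocation stream

# Tell the caching allocator about the cross-stream use so it defers
# repurposing x's memory until side_stream's pending work is done.
x.record_stream(side_stream)
```
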

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860

Reviewed By: ejguan

Differential Revision: D27822557

Pulled By: ezyang

fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e
2021-04-18 20:32:10 -07:00
3e42da09df Porting logcumsumexp tests to OpInfo (#56135)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56135

Reviewed By: mruberry

Differential Revision: D27844398

Pulled By: iramazanli

fbshipit-source-id: e0191314dc4e248501ad25170da0b77c0b799781
2021-04-18 18:49:06 -07:00
ce05b7a324 [c10d] Remove deprecated use of torch.LongTensor, torch.ByteTensor (#55861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861

APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
ghstack-source-id: 126777875

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D27726600

fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
2021-04-18 14:12:02 -07:00
a24b17248f Short circuits DistributedDataParallel._recursive_to's copy and stream syncs if input is already on the right device (#55624)
Summary:
^

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55624

Reviewed By: pbelevich, agolynski

Differential Revision: D27836170

Pulled By: rohan-varma

fbshipit-source-id: 954bf336d70f9e80c045a6a96c1d8843c7f1cf2c
2021-04-18 14:08:08 -07:00
29c5cb797d [NNC] Fuse loops that have the same bounds as expressions (#55997)
Summary:
This PR allows fusing loops whose bounds are specified as expressions that are equal.

For example:
```
   for (int j = 0; j < M + N; j++) {
     A[j] = 10 * j;
   }
   for (int k = 0; k < M + N; k++) {
     B[k] = 20 * k;
   }
```
`fuseLoops(j, k)` is possible since the stop bounds of the two loops are equal, even though they are different `Expr*` objects, and will result in:
```
   for (int j = 0; j < M + N; j++) {
     A[j] = 10 * j;
     B[j] = 20 * j;
   }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55997

Reviewed By: bertmaher

Differential Revision: D27841270

Pulled By: navahgar

fbshipit-source-id: a64e4503b7f8f28bc0c9823225bc923177bb4c2e
2021-04-18 11:14:26 -07:00
b0e0841f98 OpInfo porting for logsumexp operator (#55520)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55520

Reviewed By: mruberry

Differential Revision: D27844357

Pulled By: iramazanli

fbshipit-source-id: 6228041be9edc0a148fa34e965d2ff6423649b05
2021-04-18 06:57:14 -07:00
8c74e1b840 Vectorize copysign on CPU (#51792)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51792

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D27769007

Pulled By: mruberry

fbshipit-source-id: 65fceb9f59ed6afee4452278992340da104ed5fe
2021-04-18 02:14:18 -07:00
36b476ccdd Added OpInfos for eq, ne, ge, gt, le, and lt (#55709)
Summary:
A https://github.com/pytorch/pytorch/issues/54261 task
Added OpInfos for `eq`, `ne`, `ge`, `gt`, `le`, and `lt`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55709

Reviewed By: jbschlosser

Differential Revision: D27760382

Pulled By: mruberry

fbshipit-source-id: 30d8c9633c69a097c1e4a9daf4178c617c0a9093
2021-04-17 22:52:47 -07:00
85126629a5 [TensorExpr] Add support for constant tensors in tensorexpr kernel. (#56319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56319

With this change the TorchScript graph can have constant tensors in it
and we still will be able to lower it to TE. The constants are
registered (or bound) within the `TensorExprKernel` object and when the
codegen is called, they are passed along with usual inputs and outputs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27838747

Pulled By: ZolotukhinM

fbshipit-source-id: 4a519d66fcc07fe5fa53f5cf9af28d25611f8437
2021-04-17 11:15:35 -07:00
dd9ef529ba [TensorExpr] TensorExprKernel: switch type of tensors_ from Tensor to Buf. (#56318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56318

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27838748

Pulled By: ZolotukhinM

fbshipit-source-id: 371a454912be76889999eda79e60d8154b749134
2021-04-17 11:14:26 -07:00
50d4c63f46 Allow inlining of more Tensor methods (#53905)
Summary:
This `is_meta` call in `TensorIterator` shows up in profiling as around 4-5% of fast setup time:
49a5f99440/aten/src/ATen/TensorIterator.cpp (L886)

After inlining, `is_meta()` compiles to a single `test` instruction, saving 20-30 ns per operator call. The functions I'm moving into the header here are all similar, in that they inline away to almost nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53905

Reviewed By: gchanan

Differential Revision: D27513232

Pulled By: swolchok

fbshipit-source-id: 33ec9eefecd0ddebc285e1d830edb558818dc391
2021-04-17 09:21:23 -07:00
be2a0805d2 [TensorPipe] Update tensorpipe subodule + remove TP_NEW_API switch. (#56260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56260

Test Plan: CI

Reviewed By: lw

Differential Revision: D27693102

fbshipit-source-id: b682e88f818657065a478b5a90ca1a4ca8c52018
2021-04-17 07:31:07 -07:00
928a4733af [nnc] Only lower float conv2d's (#56289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56289

While there's no reason to think non-float32 conv2d's *don't* work,
they're only tested in float32 now.  Since that's the most important use case,
I'd rather restrict the dtypes than spend time testing all the weird dtype
combinations that could possibly happen.
ghstack-source-id: 126755549

Test Plan: unit tests

Reviewed By: navahgar

Differential Revision: D27828495

fbshipit-source-id: fcf179207f2c9b20e0e86eb2b85687517d87063c
2021-04-17 05:12:54 -07:00
04e7891aab Add adaptive_avgpool2d to the set of fusible ops (#56180)
Summary:
This improves mobilenetv3 perf from ~1300ms to 1147ms (~12%)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56180

Reviewed By: Chillee

Differential Revision: D27840860

Pulled By: Krovatkin

fbshipit-source-id: 6ce38e93fd2f55e68a69c34b45271743f84a13b8
2021-04-17 02:04:17 -07:00
7636cb6bab clean up unused reduction functions in THC (#56293)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56293

Reviewed By: mruberry

Differential Revision: D27833949

Pulled By: ngimel

fbshipit-source-id: b9bf03c783b41c35890249902ea9bf1c34c9c13d
2021-04-16 22:37:17 -07:00
a43483586d A heuristic to avoid perf incompatible MKLDNN formats for binary ops (#56089)
Summary:
After adding new ops to a set of fusible ops, mobilenetv3 slows down to **9000ms from 1200ms** without this fix.

This happens because one of the inputs was expanded and converted to nchw/nhwc.
We might end up in a very bad spot if the second argument
is in a blocked format: in that case, MKLDNN uses its
reference implementation for the binary operation that follows
these broadcasts, and it can be up to ~100x slower.
We use a very simple heuristic to convert an arg in nchw
to the blocked format of the other argument.

* MKLDNN_VERBOSE without the issue:
[test_mobilenet_nopool.txt](https://github.com/pytorch/pytorch/files/6319528/test_mobilenet_nopool.txt)
* MKLDNN_VERBOSE with the issue (Note the times for `ref` operations)
[test_mobilenet_pool.txt](https://github.com/pytorch/pytorch/files/6319529/test_mobilenet_pool.txt)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56089

Reviewed By: eellison

Differential Revision: D27796688

Pulled By: Krovatkin

fbshipit-source-id: fc34d76358ce899e3b1f2b69efb9b5c38f5af1ad
2021-04-16 20:58:17 -07:00
d02919dd50 [FX] Make shape_prop handle targets with aggregate outputs (#56221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56221

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D27810693

Pulled By: jamesr66a

fbshipit-source-id: 17c6ad671786b3bacb5026bd88b8f5b7b4b96a1a
2021-04-16 18:58:25 -07:00
72a93a6337 Fix warnings in ivalue test (#56303)
Summary:
Simple conversion of `std::unordered_map` -> `c10::Dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56303

Pulled By: driazati

Reviewed By: swolchok

Differential Revision: D27833970

fbshipit-source-id: e08b852a20b1cabc1cef890cdcbacbd0d40a3a8a
2021-04-16 18:33:28 -07:00
48e675ac75 fx quant: fix subtle bug in BinaryOpQuantizeHandler logic in matching (#56294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56294

When matching a pattern to `BinaryOpQuantizeHandler`, we need to make
sure we check for dtype support on the base node, instead of the current
node.  This is important in cases such as `add-relu` and `mul-relu`,
when the current node is `relu`, but the base node is `add|mul`.
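
A minimal sketch of the kind of model in question (module name hypothetical): the matched pattern ends at `relu`, but the dtype check must run against the base `add` node.
```
import torch
import torch.nn.functional as F

class AddRelu(torch.nn.Module):
    def forward(self, x, y):
        # FX quantization matches this as an (add, relu) pattern; `relu` is
        # the current node, `add` is the base node used for the dtype check.
        return F.relu(x + y)
```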

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

There is no good test case to check this in current logic.  Created an
add-relu model manually, and verified with pdb that the add node was
being used to match against dtypes.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27831070

fbshipit-source-id: 3697f1328dff9fec3eb910bae49a73793ef36d63
2021-04-16 18:19:22 -07:00
5eadc243f3 Preserve node meta info in split_module (#56212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56212

The current design doesn't make it easy to use `node.copy()`. Explicitly copy over the node's meta.

Test Plan: Updated `test_subgraph_creation` in `test_fx_experimental`

Reviewed By: jamesr66a

Differential Revision: D27808477

fbshipit-source-id: 7fe7b6428c830307dbd1e395f16fa2774936d3b3
2021-04-16 18:02:50 -07:00
98933866a9 [quant][graphmode][fx] Optimize cat (#54813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54813

Previously we had a cat that takes a list of Tensors with different qparams, dequantizes them,
concatenates them, and requantizes with the output qparams. This adds some unnecessary overhead
in dequantizing and quantizing Tensors.

This PR adds an optimization for the cat operator: we make sure the inputs and output of cat
use the same observer/fake_quant and produce a cat that does not do rescaling.
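
As an eager-mode analogy (a minimal sketch, not the FX pass itself): when the inputs already share qparams, `cat` needs no rescaling.
```
import torch

# Two quantized tensors sharing the same scale/zero_point.
a = torch.quantize_per_tensor(torch.randn(2, 2), scale=0.1, zero_point=128, dtype=torch.quint8)
b = torch.quantize_per_tensor(torch.randn(2, 2), scale=0.1, zero_point=128, dtype=torch.quint8)

out = torch.cat([a, b], dim=0)  # concatenates without dequantize/requantize
print(out.q_scale(), out.q_zero_point())
```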

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27408377

fbshipit-source-id: 6a4bdcfd15e57ea1fe0f7e72d1e1288eb3ece4db
2021-04-16 16:00:43 -07:00
cd780e1c6e Move graph iterator to separate utility file (#56211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56211

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27828150

Pulled By: tugsbayasgalan

fbshipit-source-id: f91747fabde9caf864a62e4028fdc7bbbab7ee66
2021-04-16 15:51:26 -07:00
8176ab6ca0 [JIT] Put explicit error message on class attribute accesses. (#55723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55723

Resolving https://github.com/pytorch/pytorch/issues/51139

Test Plan:
python test/test_jit.py TestClassType.test_unresolved_attributes

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27691960

fbshipit-source-id: 1d078a4ab25af1a73109ca6ef0333a67a634bff6
2021-04-16 15:47:10 -07:00
d312aeb6ac Implement faster gradcheck but not enabled for most things (#54480)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54480

This PR shouldn't really change the behavior of gradcheck for most ops. However, the changes in test_autograd allow us to run basic checks for both fast and slow (instead of previously just slow). All it should be doing is wrapping the preexisting tests we introduced in prior PRs in a function which takes `fast_mode` as a param. We then call this function twice, once with `fast_mode=True` and once with `fast_mode=False`.
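
A minimal sketch of that test pattern (names hypothetical; assumes the `fast_mode` kwarg described in this PR):
```
import torch
from torch.autograd import gradcheck

def check(fast_mode):
    x = torch.randn(4, dtype=torch.double, requires_grad=True)
    assert gradcheck(torch.sin, (x,), fast_mode=fast_mode)

check(fast_mode=True)   # new fast path
check(fast_mode=False)  # preexisting slow path
```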

Plan for rollout:
 - This PR should only land the code (and runs some basic checks as described above).
   - This should help us verify that a) slow is still working as expected b) basic functionality of fast works
   - After we land this, but before we run the next PR in the stack, we should land https://github.com/pytorch/pytorch/pull/55182. This is to ensure that there is no gap where the slow tests aren't running.
 - The next PR is responsible for enabling the fast_mode=True flag on all tests (where the function has real inputs/outputs), and selectively disabling for the cases the fail.
 - Finally in a later PR, we reenable fast-gradcheck for functions w/ complex inputs/outputs

TODOs and open questions (not necessarily blocking this PR):
 - ~How do we think about atol/rtol~ (scale atol, keep rtol as-is)
 - ~reenable fast-gradcheck for complex numbers~
 - ~when inputs are uncoalesced we don't truly test this case because we coalesce the inputs before calling function. Revisit this when https://github.com/pytorch/pytorch/pull/52874/files is landed~

### Developer Experience
Sample output when jacobian mismatch occurs:
```
Traceback (most recent call last):
  File "/home/s/local/pytorch4/test/test_autograd.py", line 4220, in test_gradcheck_jacobian_mismatch
    check(fast_mode=True)
  File "/home/s/local/pytorch4/test/test_autograd.py", line 4196, in check
    gradcheck(fn, (x,), fast_mode=fast_mode)
  File "/home/s/local/pytorch4/torch/testing/_internal/common_utils.py", line 2067, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/home/s/local/pytorch4/torch/autograd/gradcheck.py", line 1020, in gradcheck
    if not fast_gradcheck(fail_test, seeded_func, func_out, tupled_inputs, outputs, eps, rtol,
  File "/home/s/local/pytorch4/torch/autograd/gradcheck.py", line 915, in fast_gradcheck
    return fail_test(get_notallclose_msg(a, n, i, j, prefix) + jacobians_str)
  File "/home/s/local/pytorch4/torch/autograd/gradcheck.py", line 996, in fail_test
    raise RuntimeError(msg)
RuntimeError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0.9195)
analytical:tensor(0.9389)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 1.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 1.0000]])
Analytical:
tensor([[1.0100, 0.0100, 0.0100, 0.0100],
        [0.0100, 1.0100, 0.0100, 0.0100],
        [0.0100, 0.0100, 1.0100, 0.0100],
        [0.0100, 0.0100, 0.0100, 1.0100]])

The max per-element difference (slow mode) is: 0.010000000000054632.
```
Additionally, if the per-element difference is small, i.e., `allclose(analytical_slow, numerical_slow, rtol, atol) is True`, we follow up with this message:
```
Fast gradcheck failed but element-wise differences are small. This means that the
test might've passed in slow_mode!

If you are adding a new operator, please file an issue and then use one of the
workarounds. The workaround depends on how your test invokes gradcheck/gradgradcheck.

If the test
- manually invokes gradcheck/gradgradcheck, then call gradcheck/gradgradcheck
  with `fast_mode=False` as a keyword argument.
- is OpInfo-based (e.g., in test_ops.py), then modify the OpInfo for the test
  to have `gradcheck_fast_mode=False`
- is a Module test (e.g., in common_nn.py), then modify the corresponding
  module_test entry to have `gradcheck_fast_mode=False`
```

Test Plan: Imported from OSS

Reviewed By: walterddr, ejguan

Differential Revision: D27825160

Pulled By: soulitzer

fbshipit-source-id: 1fe60569d8b697c213b0d262a832622a4e9cf0c7
2021-04-16 15:03:18 -07:00
83cfaf1a12 [kineto] deprecate pthreadid (#56209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56209

Pull Request resolved: https://github.com/pytorch/kineto/pull/172

In this diff of the stack, we remove the threadId field from ClientTraceActivity as a step towards the deprecation

Test Plan: sandcastle builds to cover all the dependent targets

Reviewed By: ilia-cher

Differential Revision: D27662747

fbshipit-source-id: 040ba040390680a0fc63ddc8149c6fad940439fc
2021-04-16 14:45:48 -07:00
643dd26389 Fix formatting for the new language reference (#56042)
Summary:
This PR fixes the formatting issues in the new language reference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56042

Reviewed By: gmagogsfm

Differential Revision: D27830179

Pulled By: nikithamalgifb

fbshipit-source-id: bce3397d4de3f1536a1a8f0a16f10a703e7d4406
2021-04-16 14:18:09 -07:00
ce1380f9b5 fixing Optional[Tensor] type in autodiff (#55565)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54783

We need to be extra careful with the pattern to legitimately use `unchecked_unwrap_optional` in autodiff.
This would at least allow us to start supporting `Optional[Tensor]` in autodiff, which is quite common in composite layers.
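
For illustration, a hypothetical composite layer using the `Optional[Tensor]` pattern that this unlocks for autodiff:
```
import torch
from typing import Optional

@torch.jit.script
def linear_like(x: torch.Tensor, w: torch.Tensor, b: Optional[torch.Tensor]) -> torch.Tensor:
    out = x.matmul(w)
    if b is not None:  # Optional[Tensor] refined to Tensor here
        out = out + b
    return out
```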

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55565

Reviewed By: ejguan

Differential Revision: D27825336

Pulled By: Krovatkin

fbshipit-source-id: a8562eb10ea741effff430d7417d313b1eb53dfe
2021-04-16 14:06:49 -07:00
d30e31cfe6 [20/n][torch/elastic][upstream] Move torchelastic.distributed.tests to pytorch.distributed (#56215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56077

Move torchelastic.distributed.tests to pytorch.distributed

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: H-Huang

Differential Revision: D27808887

fbshipit-source-id: 6c9e2cba0bb202d8a5497697773d48e215e555f8
2021-04-16 14:00:10 -07:00
1360980659 Remove duplicate test due to rebasing mistake (#56287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56287

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27828430

Pulled By: tugsbayasgalan

fbshipit-source-id: a5d846871ce78399409113fd5dbf2c43a4e46296
2021-04-16 13:57:13 -07:00
d4fad109e8 Add OpInfo tests for torch.inner (#55536)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55536

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27650199

Pulled By: ejguan

fbshipit-source-id: 5805f1ca25019fc57971e31659fac345646368b6
2021-04-16 13:52:22 -07:00
a6940aae37 [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces a new `torch.distributed.elastic_launch` and removes the internals of `torch.distributed.launch`, keeping backwards compatibility.

Since torchelastic and `torch.distributed.launch` are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.

The diff leaves the `torchelastic.distributed.launch` module in place, and follow-up diffs will migrate users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: H-Huang

Differential Revision: D27805799

fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
2021-04-16 13:38:23 -07:00
5a9b1ddf3b fix the readme link (#56269)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56269

Reviewed By: VitalyFedyunin

Differential Revision: D27824048

Pulled By: ejguan

fbshipit-source-id: 8d5ecbdf502ae8bf8e807b55f6daeb3ff234aa62
2021-04-16 13:35:53 -07:00
164de39a11 Fix build failure due to namespace change for log_out and tanh_out (#56278)
Summary:
There is a build failure in `bench_approx.cpp` due to namespace change for log_out and tanh_out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56278

Reviewed By: bertmaher, nikithamalgifb

Differential Revision: D27825621

Pulled By: navahgar

fbshipit-source-id: 0bccd324af92a3460610bf475514449f0223de2b
2021-04-16 13:34:32 -07:00
8d4e6c9570 [package] make GlobGroup a public concept (#56238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56238

It's already functionally public due to `extern` and `mock`, but
exposing the underlying implementation makes extending PackageExporter
easier.

Changed the underscores, expose on `torch.package`, add docs, etc.

Differential Revision: D27817013

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Pulled By: suo

fbshipit-source-id: e39199e7cb5242a8bfb815777e4bb82462864027
2021-04-16 13:31:48 -07:00
1ec12fd491 Add minidump collection via breakpad (#55647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55647

This adds [breakpad](https://github.com/google/breakpad) which comes with out-of-the-box utilities to register a signal handler that writes out a minidump on an unhandled exception. Right now this is gated behind a flag in `torch.utils`, but in the future it could be on by default. Size-wise this adds about 500k to `libtorch_cpu.so` (187275968 B to 187810016 B).

```bash
$ cat <<EOF > test.py
import torch

torch.utils.enable_minidump_collection()

# temporary util that just segfaults
torch._C._crash()
EOF

$ python test.py
Wrote minidump to /tmp/pytorch_crashes/6a829041-50e9-4247-ea992f99-a74cf47a.dmp
fish: “python test.py” terminated by signal SIGSEGV (Address boundary error)
$ minidump-2-core /tmp/pytorch_crashes/6a829041-50e9-4247-ea992f99-a74cf47a.dmp -o core.dmp
$ gdb python core.dmp
... commence debugging ...
```

Right now all exceptions that get passed up to Python don't trigger the signal handler (which by default only
handles [these](https://github.com/google/breakpad/blob/main/src/client/linux/handler/exception_handler.cc#L115)). It would be possible for PyTorch exceptions to explicitly write a minidump when passed up to Python (maybe only when the exception is unhandled or something).

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27679767

Pulled By: driazati

fbshipit-source-id: 1ab3b5160b6dc405f5097eb25acc644d533358d7
2021-04-16 13:05:01 -07:00
5f19385588 [TensorExpr] Add aten::matmuls to TE fuser. (#54605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54605

For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27298364

Pulled By: ZolotukhinM

fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
2021-04-16 12:54:38 -07:00
8d7faa2af8 Update _torch_docs.py to close #56240. (#56242)
Summary:
Update _torch_docs.py to close https://github.com/pytorch/pytorch/issues/56240.
Added the "generator" argument to the docs of torch.rand and torch.randn.
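
For reference, a small example of the now-documented argument (the generator makes sampling reproducible):
```
import torch

g = torch.Generator().manual_seed(0)
a = torch.rand(3, generator=g)
g.manual_seed(0)
b = torch.rand(3, generator=g)
assert torch.equal(a, b)  # same generator state, same samples
```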

Fixes https://github.com/pytorch/pytorch/issues/56240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56242

Reviewed By: ejguan

Differential Revision: D27821513

Pulled By: agolynski

fbshipit-source-id: e42c431eddc7a83bd1c1ea368a2effbe3f10e92e
2021-04-16 12:09:49 -07:00
0dc6e7ae38 Move grad_mode.h/cpp to c10. (#56204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56204

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27807139

Pulled By: ailzhang

fbshipit-source-id: 2b693eb0a1034138d8bd68836528078ea5f38145
2021-04-16 11:50:08 -07:00
d79326ce7a Revert D27812204: [sparsity] Moving only the C++ files from internal to OSS
Test Plan: revert-hammer

Differential Revision:
D27812204 (3e0744a1ae)

Original commit changeset: 6becaba3ab9c

fbshipit-source-id: 335fdd37f6cfd8a65aae749354cfe52590be5043
2021-04-16 11:46:35 -07:00
eca98fedb5 split out NamedCType from CType. Remove direct string comparison from autograd codegen (#55334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55334

The goal of this PR is to clean up some of the autograd codegen to compare C++ types using `CType` objects instead of raw strings. My last PR in the stack made that string comparison a little more fragile, since the raw C++ strings needed to be namespace-aware.

I confirmed byte-for-byte no codegen changes vs. the last PR (which added namespaces to the codegen) by running `diff -qr ../pytorch-common_test/torch/csrc/autograd/generated/ ../pytorch-callgrind_test_after2/torch/csrc/autograd/generated/` and `diff -qr ../pytorch-common_test/build/aten/src/ATen/ ../pytorch-callgrind_test_after2/build/aten/src/ATen/`

Note that a better end-state for the autograd codegen would be to do all of its type pattern matching directly off of JIT types, instead of off of CTypes (which are really just generated from JIT types, incorporating C++ specific semantics). That looks like it’ll require a pretty substantial change though, so I’m not doing it in this PR.

As part of this change (and after talking with ezyang), I split off the `CType` data class into a separate `NamedCType` class, which holds a name and a `CType`. This way, `CType` only knows about actual C++ types, making it easier to compare CTypes to each other in the codegen when we only care about the type. The core change is in `types.py`, but it required a bunch of downstream changes to update all of the places where we create `CType`s to create `NamedCType`s instead.

The main change in the autograd codegen was that I updated `SavedAttribute` to store a `NamedCType`. The other autograd changes all pretty much came from that change.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708347

Pulled By: bdhirsh

fbshipit-source-id: 3e07c80569c7b229c638f389e76e319bff6315f9
2021-04-16 11:43:08 -07:00
947c7a8215 add C++ namespacing logic to ctypes (#55047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55047

Added namespaces to all of the `CTypes` printed in the codegen. This is pretty much required if we want to use codegen externally, since we can no longer assume that we're inside of the `at::` namespace.

Important changes are in `types.py`.

How do we add the notion of namespaces to C++ types without people having to write "at::Tensor" everywhere? Before this PR, `CType` held a raw string representing the type, i.e. `BaseCType("Tensor", binds)`. This PR introduces a set of singleton base C++ types in `types.py`, that know how to print their namespace. Instead, we'd write `BaseCType(tensorT, binds)`, where printing `tensorT` will properly print out "at::Tensor".
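
A simplified sketch of the idea (the real definitions live in `tools/codegen`; field names here are illustrative):
```
from dataclasses import dataclass

@dataclass(frozen=True)
class BaseCppType:
    ns: str
    name: str

    def __str__(self) -> str:
        return f"{self.ns}::{self.name}" if self.ns else self.name

# A singleton instance like this replaces the raw "Tensor" string.
tensorT = BaseCppType("at", "Tensor")
print(str(tensorT))  # at::Tensor
```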

This also means that you can't create arbitrary `CTypes`. If we need a new C++ type in the codegen, we need to add it to the list in `types.py`.

One blip in the design: we don't want to change `RegistrationDeclarations.yaml`, since that'll break external backends that ingest it. I added separate functions to display types without the namespace that are used to create `RegistrationDeclarations.yaml`. With an external codegen API though, we can eventually kill it :)

I also didn't realize until this PR that `Declarations.yaml` is still directly in use, by some python/autograd codegen. Rather than keep that yaml byte-for-byte compatible, I just updated the callsites in the autograd codegen to work with namespaces. In the NEXT pr, I try to clean up some of the autograd codegen to stop using raw strings to match against C++ types.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708349

Pulled By: bdhirsh

fbshipit-source-id: 56a4f81fc101795bcb9ee1f722121480fb2356ad
2021-04-16 11:43:06 -07:00
164bee1d09 Return a CType instead of a string for returns, beef up CType (#55046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55046

Updating `returns` in the codegen to return a CType instead of a raw string.

This has benefit of putting all stringifying logic through CType, which is useful in the followup PR when I add namespaces.

I also added new CTypes for other templated C++ types: array, vector and tuple. Mostly because it makes the namespacing logic in the next PR significantly easier. It also seems more natural to me that `BaseCType` shouldn't represent specializations of templated types.

There's a little bit of weirdness with types that are currently *only* used for returns, i.e. `TupleCType`. Returns aren't named, so I opted not to give it one; we can add it in later if we discover that we need it.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708348

Pulled By: bdhirsh

fbshipit-source-id: 230b210c3e53be1bd362105fbea8451055dc59a8
2021-04-16 11:41:46 -07:00
26046b9110 [caffe2][publish] Optimize metanetdef load
Summary:
When loading optional blobs from a large file to the workspace, for instance: https://fburl.com/diffusion/l0mcnofg, we are currently loading the file multiple times. https://fburl.com/diffusion/qhbpyq0e

This diff optimizes the load time by loading the large model file only once and using the allow_incomplete arg to LoadOp. The implementation of LoadOp with this arg previously did not delete the blobs that were not found, which is also fixed in this diff.

Test Plan:
Existing unit tests:
```
buck test //caffe2/caffe2/fb/distribute/tests:meta_net_def_storage_utils_test
```
Many sandcastle integration tests.

scuba logs: https://fburl.com/scuba/dai_modelstore/txdf3pjt

Reviewed By: TailofJune

Differential Revision: D27575622

fbshipit-source-id: 7c2b25ef603a378e87ebdbe349c94c2f1952493c
2021-04-16 11:35:53 -07:00
7629477ff7 Filter out more expected errors from sccache log (#56281)
Summary:
This PR extends `.jenkins/pytorch/print_sccache_log.py` to filter out a distracting "error" message that walterddr came across while debugging failures in https://github.com/pytorch/pytorch/issues/55176:

```
=================== sccache compilation log ===================
ERROR 2021-04-05T15:44:18Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected ‘;’ before ‘}’ token\n int main() { return 0 }\n                       ^\n" }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56281

Test Plan: TODO (reviewers: is there an easy way to test this?)

Reviewed By: walterddr

Differential Revision: D27826064

Pulled By: samestep

fbshipit-source-id: 7322a830c1246820a5b2b7bbeaa4697ebd13b617
2021-04-16 11:27:41 -07:00
3e0744a1ae [sparsity] Moving only the C++ files from internal to OSS
Summary:
This splits the previous diff into multiple parts; this one introduces only the C++ files.

The unit tests pass as part of the internal build and will be put into OSS in later PRs.

Test Plan:
`buck test mode/opt //caffe2/torch/fb/model_optimization:sparsity_test`

```
Parsing buck files: finished in 2.0 sec
Creating action graph: finished in 16.4 sec
Building: finished in 55.0 sec (100%) 20264/20264 jobs, 16 updated
  Total time: 01:13.6 min
More details at https://www.internalfb.com/intern/buck/build/c9c5e69e-ce00-4560-adce-58b68bc43e47
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 1e678a07-0689-45b4-96f3-54d0a3181996
Trace available for this run at /tmp/tpx-20210415-161113.966600/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/3096224795029304
    ✓ ListingSuccess: caffe2/torch/fb/model_optimization:sparsity_test - main (4.186)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (1.752)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseKernels) (1.884)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear_serdes (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (2.013)
Summary
  Pass: 3
  ListingSuccess: 1
```

Reviewed By: raghuramank100

Differential Revision: D27812204

fbshipit-source-id: 6becaba3ab9cd054caf8b9bbae53af6d01347809
2021-04-16 11:18:39 -07:00
bb35b066af Put env before run or with in GHA workflows (#56268)
Summary:
Addresses seemethere's [comment](https://github.com/pytorch/pytorch/pull/56071#discussion_r614469633) on https://github.com/pytorch/pytorch/issues/56071.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56268

Test Plan: CI.

Reviewed By: seemethere, ejguan

Differential Revision: D27823149

Pulled By: samestep

fbshipit-source-id: 44d816abd85372b58c70bd81b189a0659a4079a4
2021-04-16 11:06:39 -07:00
03cc9fabd4 Added complex datatype support to sigmoid on cuda (#55975)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55359
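
For illustration (requires a CUDA build), this is what the change enables:
```
import torch

if torch.cuda.is_available():
    z = torch.randn(4, dtype=torch.complex64, device="cuda")
    print(torch.sigmoid(z))  # previously unsupported for complex dtypes on CUDA
```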

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55975

Reviewed By: ezyang

Differential Revision: D27770438

Pulled By: prabhat00155

fbshipit-source-id: 730193950805ce28d8672104fe446a647194e8cb
2021-04-16 10:48:38 -07:00
dd8bfe2b93 Finish deprecation cycle for inplace view error checks (#56093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50617

Also updates the relevant tests to expect errors instead of warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56093

Reviewed By: agolynski

Differential Revision: D27806795

Pulled By: soulitzer

fbshipit-source-id: 93c5c28edb1f97fa4457332c2ef4711f050ac81f
2021-04-16 10:44:58 -07:00
9f216b9499 ns for fx: enable shadowing int8 to int8 (#56205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56205

Allows int8 modules to shadow int8 modules. This is useful when
comparing quantized models with different qconfigs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_int8_shadows_int8
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27807405

fbshipit-source-id: 10c3bc7ab9bb1e6808aa1af23a34c7cf380465fd
2021-04-16 10:34:47 -07:00
ae0af8bb51 ns for fx: move unmatchable mod/fun/meth mapping to mappings file (#56197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56197

No logic change, just moving code around.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_op_io_dtype_coverage
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27805332

fbshipit-source-id: 0a63cf6ef7e5c4f655cdd5a18d54cc988424ac80
2021-04-16 10:34:46 -07:00
6de5d13e0f ns for fx: make call_method nodes work in NS APIs (#56196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56196

Enables `call_method` nodes to work in NS APIs for unshadowed
and shadowed activations.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_op_io_dtype_coverage
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_match_activations_meth_ptq
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_meth_ptq
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27805335

fbshipit-source-id: 39b9c02c5c5faf098f2dd4f36d1ea8296d51a63c
2021-04-16 10:34:44 -07:00
07f3eaa716 ns for fx: remove deprecated code (#56195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56195

This is outdated, removing (forgot to clean up in a previous PR).

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27805334

fbshipit-source-id: 3b035945b4928a3c727e96e0f7fe0efe201f42c0
2021-04-16 10:34:42 -07:00
0fbc2be234 ns for fx: enable call_method nodes in graph matching (#56194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56194

Enables the NS graph matcher to also match `call_method` nodes.
These are useful for ops such as `torch.sigmoid`.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_methods
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27805333

fbshipit-source-id: 509ae283db6b245671f11e3eb6b7fcb3a5735ef5
2021-04-16 10:34:41 -07:00
2380cc7d65 ns for fx: fill out coverage for node I/O types (#55918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55918

Adds coverage for determining I/O dtype for various ops. This will
enable shadowing of these ops.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_op_io_dtype_coverage
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27740661

fbshipit-source-id: c5ce873ec56bffa50ca46d2fe134c70ed677e37e
2021-04-16 10:34:39 -07:00
430fc03e3f ns for fx: add category for ops which accept fp32 or int8 input (#55859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55859

Adds mappings for ops which can accept either fp32 or int8 input,
such as `F.relu`.  A future PR will fill out the op coverage.
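
For example, `F.relu` runs on either an fp32 tensor or a quantized one without an explicit dequantize step (a minimal illustration):
```
import torch
import torch.nn.functional as F

x = torch.randn(4)
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)
F.relu(x)  # fp32 input
F.relu(q)  # int8 (quantized) input
```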

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_op_with_either_fp32_or_int8_input
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D27740659

fbshipit-source-id: cfc3dd58319b7161ca7f1fe05cd22d9a3ff11141
2021-04-16 10:34:37 -07:00
5ec6434945 ns for fx: move op dtype category mapping to separate file (#55858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55858

Moves the mappings of input and output dtypes of various ops
into its own file, and makes the variable names more clear. No logic
change.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D27740662

fbshipit-source-id: d384e7e542d9cc868d9cee9c53c2ac2f74a15a48
2021-04-16 10:33:05 -07:00
fe18144618 Generalize HIP-specific launch bounds to apply to CUDA as well (#56143)
Summary:
Launch bounds for HIP were added along the way, but the smaller CUDA devices (like Jetson) also benefit from them.
So here I go over the HIP-specific launch bounds and try to generalize them to cover CUDA, too.

The long term goal is to eventually not need to resort to somewhat ad-hoc adaptations like the reduction of block size discussed in https://github.com/pytorch/pytorch/issues/8103, but have good coverage of our kernels with launch bound annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56143

Reviewed By: agolynski

Differential Revision: D27804640

Pulled By: ngimel

fbshipit-source-id: d4c345f9f7503e050a46361bfe2625865d0a42ba
2021-04-16 10:29:28 -07:00
48c6f0c25e Add OpInfo for torch.mean (#55525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55525

Reviewed By: agolynski

Differential Revision: D27796651

Pulled By: heitorschueroff

fbshipit-source-id: 6473d854f090ff62c856b404870f226f46569449
2021-04-16 10:10:24 -07:00
119b3eccda Revert "Revert D27598681: Add OpInfo tests for torch.addbmm" (#55908)
Summary:
This reverts commit fd450ff1b93e4c498e7326cb35c7c26760c5ddbf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55908

Reviewed By: agolynski

Differential Revision: D27800571

Pulled By: anjali411

fbshipit-source-id: f04144afe7768872acb3fc2f5f242bb0093abc5e
2021-04-16 10:01:43 -07:00
f9b3dcba0d Store coverage.xml as artifact for windows test jobs (#56179)
Summary:
Currently, coverage stats are getting overwritten for sharded Windows tests. This PR attempts to store the coverage.xml file as an artifact.

I wonder what CircleCI will do when the artifacts don't exist (for non-sharded tests), and if we could conditionally store artifacts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56179

Reviewed By: samestep

Differential Revision: D27800628

Pulled By: janeyx99

fbshipit-source-id: 919f5696c0d7b4ee0d99969f35797f5be644c364
2021-04-16 07:53:57 -07:00
c5e80d30bf Harden "Add annotations" workflow (#56071)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/55810 by closing some possible security holes due to using [GitHub Actions `${{ <expressions> }}`](https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#about-contexts-and-expressions) in `.github/workflows/add_annotations.yml` and also patching a few other possible scenarios that could cause the workflow to fail by a PR passing a malformed artifact.

- [x] flag and remove GitHub Actions expressions in JS scripts
- [x] don't fail the workflow if the artifact doesn't look as expected
- [x] write unit tests for `tools/extract_scripts.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56071

Test Plan:
I tested the end-to-end "Lint" and "Add annotations" system in a separate sandbox repo, including the following cases:

- well-formed artifact
- missing artifact
- artifact containing a file named `linter-output.zip` (name clash)
- artifact whose `commit-sha.txt` doesn't contain a 40-digit hex string
- artifact whose `commit-sha.txt` contains a 40-digit hex string that isn't a valid Git hash for the current repo
  - in this last case, the workflow does fail, but handling that is the responsibility of [pytorch/add-annotations-github-action](https://github.com/pytorch/add-annotations-github-action), not pytorch/pytorch

To run the new unit tests added in this PR:
```
python tools/test/test_extract_scripts.py
```

Reviewed By: seemethere

Differential Revision: D27807074

Pulled By: samestep

fbshipit-source-id: e2d3cc5437fe80ff03d46237ebba289901bc567c
2021-04-16 07:46:20 -07:00
e387bd780e Ignore envrc files (#56199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56199

Reviewed By: ejguan

Differential Revision: D27821439

Pulled By: agolynski

fbshipit-source-id: 4be7158d723c58f82b6ec56b3817932899e1b196
2021-04-16 07:36:51 -07:00
f236c27819 Update Gloo submodule (#56189)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56189

Reviewed By: rohan-varma

Differential Revision: D27814124

Pulled By: pbelevich

fbshipit-source-id: cdea2db24634d9d171cac60709ef5135c099aabe
2021-04-16 07:20:31 -07:00
b96cc9ab20 [FX][testing] Test tracing into all the standard torch.nn.functional (#55550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55550

Add a test for `symbolic_trace` into `torch.nn.functional`

Test against all `functional`s with `torch.Tensor` argument and `functional`s from `FUNCTIONALS_WITHOUT_ANNOTATION`.
```py
FUNCTIONALS_WITHOUT_ANNOTATION = (
        "adaptive_max_pool1d",
        "adaptive_max_pool2d",
        "adaptive_max_pool3d",
        "fractional_max_pool2d",
        "fractional_max_pool3d",
        "max_pool1d",
        "max_pool2d",
        "max_pool3d",
        "gaussian_nll_loss",
        "upsample",
        "upsample_bilinear",
        "upsample_nearest",
    )
```

`UNTRACEABLE_FUNCTIONALS` lists the 110 currently untraceable `functional`s with their expected `Error`s.
- `BUILT_IN_FUNC`: built-in functions or built-in methods can not be traced.
- `PROXY_ITERATED`: Proxy object cannot be iterated. This can be attempted when used in a for loop or as a *args or **kwargs function argument
- `LEN_ERROR`: 'len' is not supported in symbolic tracing by default. If you want this call to be recorded, please call torch.fx.wrap('len') at module scope
- `ARG_TYPE_MISMATCH`: `functional()`: argument <name> (position <n>) must be <type>, not Proxy
- `CONTROL_FLOW`: symbolically traced variables cannot be used as inputs to control flow
- `INTERPOLATE_ARGS_CONFLICT`: When tracing the functional by calling `interpolate(input, size, scale_factor, mode="bilinear", align_corners=True)`, `ValueError("only one of size or scale_factor should be defined")` is raised

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D27659367

Pulled By: ejguan

fbshipit-source-id: d0d05e4d94e0b85f47e6c171a31f0d41b1387373
2021-04-16 06:48:02 -07:00
1a1b23f00c [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27818592

fbshipit-source-id: dc9d12a747464bb3c3d88bead606de6e9233b80c
2021-04-16 04:16:20 -07:00
6b5ed5ec45 Revert D27803529: [pytorch][PR] .github: Add initial linux CI workflow
Test Plan: revert-hammer

Differential Revision:
D27803529 (7d410bc3c8)

Original commit changeset: 52a65ec8f7a8

fbshipit-source-id: ce968654f2aecd8b36b5f86e0fe5ed6056f0fb8a
2021-04-16 02:53:31 -07:00
0a541e23e1 [nn] Add allow_duplicate option for named_modules (#54812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812

Needed for quantization since different attributes might refer to the same module instance
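
A minimal illustration of the aliasing case (the flag name below follows the released PyTorch API, `remove_duplicate`, which may differ from this title's `allow_duplicate`):
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(2, 2)
        self.b = self.a  # two attributes, one module instance

m = M()
print([n for n, _ in m.named_modules()])                        # deduplicated
print([n for n, _ in m.named_modules(remove_duplicate=False)])  # keeps both names
```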

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27408376

fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
2021-04-16 01:26:16 -07:00
b405e2ce12 Implicit conversion from null tensor to NoneType (#55823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55823

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27717324

Pulled By: tugsbayasgalan

fbshipit-source-id: a071b90bcea9e8f2b5da633a8dadd11772fb5101
2021-04-16 00:05:52 -07:00
d2d1112513 Set ThreadLocalState correctly in the autograd engine (#56174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56174

evaluate_function:
1. calls the autograd function (call_function)
2. accumulates gradients into buffers

Previously, ThreadLocalStateGuard only covered part of `call_function`.
However, it should cover all Tensor operations in `evaluate_function`,
so this PR moves it to do so.

One alternative would have been to move ThreadLocalStateGuard to here:
71f9e99e29/torch/csrc/autograd/engine.cpp (L394)

Unfortunately that adds 2% additional instructions according to the
instruction count benchmark in the next section. This is because
`evaluate_function` does an early return:
71f9e99e29/torch/csrc/autograd/engine.cpp (L732-L735)
If this is preferred, please let me know.

Test Plan:
- run existing tests. It's hard to actually come up with a test case for
this.

Benchmark plan:

TL;DR: Instruction count decreases by a little after this PR.
```
import torch
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="""\
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);""",
    setup="""\
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;""",
    language="cpp")

stats = timer.collect_callgrind()
print(stats)
```
This gave the following:
```
Before:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f4b28ce6a90>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
  auto x = torch::ones({}, torch::requires_grad());
  auto y = x * 2;

                           All          Noisy symbols removed
    Instructions:      3514184                    3514184
    Baseline:                0                          0
100 runs per measurement, 1 thread

After:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fdbc9d187d0>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
  auto x = torch::ones({}, torch::requires_grad());
  auto y = x * 2;

                           All          Noisy symbols removed
    Instructions:      3513884                    3513884
    Baseline:                0                          0
100 runs per measurement, 1 thread
```

Reviewed By: albanD

Differential Revision: D27799283

Pulled By: zou3519

fbshipit-source-id: 0a8213824e08c04748d38e66604c73f395285d63
2021-04-15 20:57:27 -07:00
8f68396462 [package] fix error handling with allow_empty (#56190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56190

Previously, if we had some code that did the following:
```
- pattern A, allow_empty=False
- save module B, but throws an exception for whatever reason
- save module that causes match against A
```

Then the resulting behavior would be:
1. exception thrown, which triggers `__close__` on `PackageExporter`
2. `PackageExporter` checks that all patterns are matched against, and sees that A was not matched.
3. Error is raised that we didn't match against pattern A.

This is confusing, since the *real* error that caused packaging to fail
occurred when trying to package module B, but it's being hidden by the
error about module A (even though if packaging module B had succeeded,
there would be no error).

Change it so that the behavior looks like:
1. exception thrown, which triggers `__close__` on `PackageExporter`
2. `PackageExporter` recognizes that an exception is happening and
immediately just returns control flow to the caller to handle the "real"
exception (see the sketch below).
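
A hedged sketch of the corrected flow (module and file names hypothetical):
```
from torch.package import PackageExporter

try:
    with PackageExporter("out.pkg") as exporter:
        exporter.extern("a.*", allow_empty=False)
        exporter.save_module("b")  # suppose this raises
except Exception as e:
    # We now see the "real" error from saving `b`, not a misleading
    # unmatched-pattern error about "a.*".
    print(e)
```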

Differential Revision: D27803988

Test Plan: Imported from OSS

Reviewed By: guangyuwang

Pulled By: suo

fbshipit-source-id: f67b2e96165a0547c194a8bef1af1c185452173e
2021-04-15 20:16:43 -07:00
4611387608 [optim] take kw-only argument for functional optim APIs (#56185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56185

ghstack-source-id: 126670123

Reviewed By: albanD

Differential Revision: D27802169

fbshipit-source-id: f5e1cb2046dcdeecf5f6b0f70892828bf0adb22f
2021-04-15 20:08:04 -07:00
bd3c63aeeb [PyTorch Edge] Move torch::jit::mobile::_export_operator_list() from serialization/export_module.cpp to mobile/import.cpp (#56044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56044

We want to be able to drop the dependency on full-jit in the auto-generated unit tests for 2 reasons:

1. Running bloaty on the auto-generated unit tests should be somewhat representative of the actual size.
2. The runtime environment of the auto-generated unit tests should be as close to the production environment as possible to ensure that we are running the tests in a production-like runtime.

Due to the dependence on full-jit, we aren't there yet. For the auto-generated tests, we probably don't need to depend on `_export_operator_list()` eventually, but for now we do since it is used to decide whether the model being run is a Metal GPU model or a CPU model, and gates whether the test runs that model or not.

Eventually, we can stop doing this in the test and do it in the codegen from PTM-CLI instead (by fetching the operators from that tool, and writing out to the BUCK file which backend(s) this model is targeting). However, that will take some time to land, so in the spirit of expediency, this change is being proposed.

Discussed this offline with iseeyuan
ghstack-source-id: 126656877

Test Plan: Build + BSB.

Reviewed By: iseeyuan

Differential Revision: D27694781

fbshipit-source-id: f31a2dfd40803c02f4fd19c45a3cc6fb9bdf9697
2021-04-15 17:53:36 -07:00
94ce10f732 [iOS GPU] Use setTexture() rather than copyTexture() (#56069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56069

It's more efficient to capture an MPSImage object than to copy one from outside.
ghstack-source-id: 126552396

Test Plan:
- All operator tests pass
- Sandcastle
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27694542

fbshipit-source-id: e1bbbffc3f8c109816cb117aebd0aae8576c6c5c
2021-04-15 17:35:29 -07:00
42f5d66080 [DDP] Fixes flaky tests caused by incorrect floating-point comparison (#56192)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50699.

The root cause was that some floating-point assertions had a "greater than or **equal to**" condition. The "equal to" part was causing flakiness due to the strict equality check (`==`) in `TestCase.assertGreaterEqual()`. This PR introduces a new assertion method called `assertGreaterAlmostEqual()` in `common_utils.py` that mitigates the problem by behaving similarly to `TestCase.assertAlmostEqual()`.
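
A minimal sketch of the intended semantics (not the actual implementation): pass if strictly greater, or if the two values are almost equal within a rounding tolerance.
```
def assert_greater_almost_equal(first, second, places=7):
    # Pass if first > second, or if first and second agree to `places` decimals.
    if first > second or round(first - second, places) == 0:
        return
    raise AssertionError(f"{first} not greater than or almost equal to {second}")

assert_greater_almost_equal(1.0, 1.0 - 1e-9)  # passes despite float noise
```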

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56192

Reviewed By: zhaojuanmao

Differential Revision: D27804724

Pulled By: cbalioglu

fbshipit-source-id: bc44a41ca4ce45dfee62fb3769fb47bfd9028831
2021-04-15 17:15:42 -07:00
7d410bc3c8 .github: Add initial linux CI workflow (#55176)
Summary:
This is a commandeer of https://github.com/pytorch/pytorch/issues/54091.

TODO:

- [x] understand why the build is [failing](https://github.com/pytorch/pytorch/pull/55176/checks?check_run_id=2254742265) here when it was [succeeding](https://github.com/pytorch/pytorch/pull/54091/checks?check_run_id=2177844748) on https://github.com/pytorch/pytorch/issues/54091
- [x] fix the build failure
- [x] fix the test failure(s)
- [x] add CI check to generate YAML workflows from templates, similar to https://github.com/pytorch/pytorch/issues/55171
- [ ] uncomment the rest of the matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55176

Reviewed By: walterddr

Differential Revision: D27803529

Pulled By: seemethere

fbshipit-source-id: 52a65ec8f7a83b929fed47f0bbdca544210ec9c2
2021-04-15 16:54:04 -07:00
400398006f [PARAM] Param comms debug info (#55976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55976

- Define a concrete `DebugInfo` to collect Param comms.
- Add a macro to easily log `DebugInfo`

Test Plan:
Tested on `ads:simplified_launcher` with `dyno gputrace`;
locally tested in libkinetoObserver that it can collect the debug info

Reviewed By: kingchc, ilia-cher

Differential Revision: D26773447

fbshipit-source-id: a8eeede2d6dbf34d7a1b3614843b4a1baba94448
2021-04-15 16:22:01 -07:00
bde53cfd9a [tensorexpr] Add missing python bindings for NNC Stmts (#55570)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55570

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27634987

Pulled By: huiguoo

fbshipit-source-id: 220a00b1dcc4d42d93b6600b730d35432316eff6
2021-04-15 16:13:59 -07:00
f59244ec16 ns for fx: add test for op relationship coverage (#55837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55837

Adds a test that checks that all of the relevant op pairs defined in
`quantization_mappings.py` are also defined as related by Numerical
Suite.

Note: this does not cover all the ops, just the ones in
`quantization_mappings.py`.  A future PR will fill out the remainder.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_op_relationship_mapping
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27719979

fbshipit-source-id: 9e852ef94da5f7a653ea15ba52c68a89c8e30208
2021-04-15 16:11:26 -07:00
c8209a7336 ns for fx: move pattern utils to separate file (#55805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55805

No logic change, just moving util functions to separate file.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27719982

fbshipit-source-id: c80d5397c1efeb9fc83eacaa532ecbde557cca3f
2021-04-15 16:11:24 -07:00
b461104554 ns for fx: make get_reversed_fusions reuse quantization fusions (#55803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55803

Makes the NS `graph_matcher.get_reversed_fusions` use the fusions
defined the FX quantization code instead of duplicating them.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27719980

fbshipit-source-id: 12e3183405181bb9001f10e765cfb4d2ffdfdd88
2021-04-15 16:11:23 -07:00
84b5f67d9b ns for fx: add qat tests cases for shadowed activations (#55614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55614

Adds testing for shadowed activations APIs and QAT.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_mod_ptq
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_mod_qat
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_fun_ptq
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_add_shadow_loggers_qat_qat
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27650405

fbshipit-source-id: c5138d98aa072e2927a54329c87e755413adeb5d
2021-04-15 16:11:21 -07:00
37fbc069f1 ns for fx: qat test cases for unshadowed activations (#55508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55508

Adds QAT test cases for unshadowed activation APIs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27650406

fbshipit-source-id: bcbbdf1d32b8f8627c30d6aaf22607f34d1e2e08
2021-04-15 16:11:19 -07:00
f6a3936ab3 ns for fx: extend functional weight extraction testing to QAT (#55507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55507

As titled, extends the test cases for weight extraction from
functionals to cover QAT.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27650408

fbshipit-source-id: 8ce87d56bbc0da7c2330ece71a897d6d8c5110a0
2021-04-15 16:11:17 -07:00
1cbc4023e9 ns for fx: add qat handling for weight extraction (#55506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55506

Makes the NS weight extraction tests also test QAT, and fixes
the mappings where necessary to cover all the fusions and make
the tests pass.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_mod_ptq
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_mod_qat
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27650409

fbshipit-source-id: c5bd9268d1bc559afc27d4c5109effd77bf1538a
2021-04-15 16:11:16 -07:00
3786c2719d ns for fx: make NSTracer inherit from QuantizationTracer (#55505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55505

This necessary to add support in NS for QAT modules, to avoid
duplicating logic between NSTracer and QuantizationTracer.

The eng work to expose the custom module and class names to
the user will be in a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27650407

fbshipit-source-id: 431f47c5353b41c11371c5efa79657bfd085459a
2021-04-15 16:11:14 -07:00
5ad3bc715c ns for fx: change node I/O determination to strict allowlist (#55434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55434

Before this PR, there was some hacky logic which determined
the input and output types of nodes based on heuristics such
as inspecting `__module__`, or assuming that an op has an
I/O dtype of `torch.float` when the heuristics did not find
any matches.  This is problematic because the heuristics were not exact,
and this could result in non-sensical shadow graphs when the heuristics
would return an incorrect dtype.

This PR switches the dtype determination to an allowlist system,
where we specify exactly what the dtypes are for the nodes or modules
which are in an allowlist, and we add an `UNKNOWN` type for everything
else.  The shadow logic is changed to skip inserting shadows on any
function or module where the I/O dtype is unknown.

The current allowlist only contains functions necessary for the
currently existing tests.  Filling out the allowlist with all necessary
torch functions is left for a future PR.

As a result of this, we can do the following (also implemented in this PR):
1. enable graph matching on nodes with equal types (for example,
F.linear and F.linear). The restriction that only nodes with equal types
was in the code as a placeholder, it's better to allow comparisons of
nodes of equal types. One case where this is useful is unshadowed
activations.
2. enable models with user defined modules to be passed to Numeric Suite
APIs without errors.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXGraphMatcherModels
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27622418

fbshipit-source-id: 40dcba0222c01154c141467640c1eb89725f33a7
2021-04-15 16:09:51 -07:00
1ca51f0fba [kineto] deprecate metdata args from ClientTraceActivity (#55988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55988

Pull Request resolved: https://github.com/pytorch/kineto/pull/165

as part of the ClientTraceActivity -> GenericTraceActivity migration, move all the metadata fields to JSON encoded string

Test Plan:
- `buck build`
- tested with subsequent diffs

Reviewed By: gdankel

Differential Revision: D27340314

fbshipit-source-id: f55b77a779e4bda1fb8667cb4e0f4252b93af5ea
2021-04-15 16:06:45 -07:00
52f1a07b63 Python API for Vitals (#53238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53238

There is a tension in the Vitals design: (1) we want a macro-based logging API for C++ and (2) we want a clean python API. Furthermore, we want this to work with "print on destruction" semantics.

The unfortunate resolution is that there are (2) ways to define vitals:
(1) Use the macros for local use only within C++ - this keeps the semantics people enjoy
(2) For vitals to be used through either C++ or Python, we use a global VitalsAPI object.

Both these go to the same place for the user: printing to stdout as the globals are destructed.

The long history on this diff shows many different ways to try to avoid having 2 different paths... we tried weak pointers & shared pointers, verbose switch cases, etc. Ultimately each ran into an ugly trade-off, and this cuts the difference better than the alternatives.

Test Plan:
buck test mode/dev caffe2/test:torch -- --regex vital
buck test //caffe2/aten:vitals

Reviewed By: orionr

Differential Revision: D26736443

fbshipit-source-id: ccab464224913edd07c1e8532093f673cdcb789f
2021-04-15 16:06:43 -07:00
f17c9ea2ed Port all unary float functions to structured (#56082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56082

The native_functions.yaml changes were done by codemod using the
following script:

```
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import CommentMark
from tools.codegen.model import *  # noqa: F403

with open("aten/src/ATen/native/native_functions.yaml", "r") as f:
    contents = f.read()

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)

convert = '''\
acos
acosh
asin
asinh
atan
atanh
cos
cosh
digamma
erf
erfc
erfinv
exp
expm1
exp2
lgamma
log
log10
log1p
log2
reciprocal
sigmoid
sin
sinc
sinh
special_entr
sqrt
tan
tanh'''.split()

for e in r:
    f = NativeFunction.from_yaml(e, Location("", 0))
    if f.structured or f.structured_delegate is not None:
        continue
    n = f.func.name.name.base
    if n not in convert:
        continue
    # mutate e to make changes
    if f.func.kind() == SchemaKind.out:
        e.insert(1, 'structured', True)
        e.insert(2, 'structured_inherits', 'TensorIteratorBase')
    else:
        # TODO: The .out overload assumption is not sound in general
        e.insert(1, 'structured_delegate', f'{n}.out')

        e['dispatch'].pop('CPU', None)
        e['dispatch'].pop('CUDA', None)
        e['dispatch'].pop('CPU, CUDA', None)
        e['dispatch'].pop('CompositeExplicitAutograd', None)

        *_, last_k = e.keys()
        needs_fixup = False

        if not e['dispatch']:
            if last_k == 'dispatch':
                needs_fixup = True
            del e['dispatch']

        # Manually fix up newlines at the end, because ruamel
        # made some bad life choices about where to associate trailing
        # whitespace for nested dicts; see
        # https://stackoverflow.com/questions/42172399/modifying-yaml-using-ruamel-yaml-adds-extra-new-lines
        if needs_fixup:
            *_, last_k = e.keys()
            # post_key, pre_key, post_value, pre_value
            e.ca.items[last_k] = [None, None, CommentToken('\n\n', CommentMark(0), None), None]

with open("aten/src/ATen/native/native_functions.yaml.new", "w") as f:
    yaml.dump(r, f)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27777769

Pulled By: ezyang

fbshipit-source-id: 1ecbac7cb3e0093167bb61c7d2b1ecb95b8ae17c
2021-04-15 16:06:42 -07:00
cfc9716246 Change all unary functions stubs to use TensorIteratorBase& (#56078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56078

This is in preparation for making all unary functions structured.
I don't actually have to make them structured yet, as TensorIterator&
casts to TensorIteratorBase&.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27777768

Pulled By: ezyang

fbshipit-source-id: 05a3a95f200698eef72c5c74fff85fe881e1c4a3
2021-04-15 16:04:58 -07:00
3c4e1cd141 remove annoying warnings from common_nn.py (#55982)
Summary:
^^

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55982

Reviewed By: mruberry

Differential Revision: D27776380

Pulled By: Chillee

fbshipit-source-id: 22b3a8de73416821bed56b75b68dca1c33a21250
2021-04-15 16:03:00 -07:00
ff1498e668 Add cost inference for MulGradient operator
Summary: Add cost inference for MulGradient operator; also whitelist MulGradient in COMPUTE_OP_TYPES in dense_perf_estimation

Test Plan: buck run //caffe2/caffe2/python/operator_test:elementwise_ops_test

Reviewed By: CrazySherman

Differential Revision: D27614003

fbshipit-source-id: 30901e5e2b6ce7e2183c2362d1bf9f895046cf55
2021-04-15 16:02:06 -07:00
3fbca31be3 port addmv to structured kernels (#55746)
Summary:
Per title
I've revamped the size checks a bit to provide a better error message if `self` is of the wrong size, and also added a check that the inplace variant has a correctly sized `self`.

Ref: https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55746

Reviewed By: ezyang

Differential Revision: D27782980

Pulled By: ngimel

fbshipit-source-id: 6ba949b682b8fd1170d0304da0ed348dd1a7b8c7
2021-04-15 15:57:46 -07:00
8e82e932f3 Reland: D27652485: [nnc] Enable CPU fusion only when num_threads == 1" (#56120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56120

This reverts commit ad17fadbfc786dc1ccb42e822208ff03c2a2b72c (D27786457).

The big annoyance here is that depending on the threading mode you may not be
able to toggle num_threads at will, so the fusion tests won't fail.

I hate this solution, but I'm adding a secondary override for the TE fuser.
Now you need to turn on fusion (`_jit_override_can_fuse_on_cpu`); you're
OK if you're running with 1 thread, or you can additionally call
`_jit_set_texpr_parallel_cpu_enabled` to enable it anyway.
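
For reference, the two knobs would be used together roughly like this (a sketch; the binding location of the new function is assumed from its name in this diff):

```python
import torch

# Existing override: opt in to TE fusion on CPU.
torch._C._jit_override_can_fuse_on_cpu(True)

# Secondary override from this diff: also allow fusion with
# more than one thread (mainly for tests).
torch._C._jit_set_texpr_parallel_cpu_enabled(True)
```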

This is (a) mainly for tests, since a real user probably won't fiddle aimlessly
with the thread count, and (b) will go away once NNC's threading support is
fully baked.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D27788199

Pulled By: bertmaher

fbshipit-source-id: 070d04474f15e9689dbdf8cc1fde43050c6506b1
2021-04-15 15:50:18 -07:00
e1752ffa04 [reland][ROCm] use hiprtc precompiled header (#55965)
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header".  Reland PR https://github.com/pytorch/pytorch/issues/54350.

This reverts commit 204ac21bf1457022caab197001788239720b96d6.

The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965

Reviewed By: jbschlosser

Differential Revision: D27755907

Pulled By: malfet

fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
2021-04-15 15:47:56 -07:00
f02454f957 Fix ChannelShuffle named tensor warnings (#55911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55911

Reviewed By: agolynski

Differential Revision: D27798078

Pulled By: jbschlosser

fbshipit-source-id: 1ebd325ac8a21f82c395d2eafac7ef2ecd1f32b1
2021-04-15 15:36:35 -07:00
dd090e72b2 [dist_optim] add distributed functional rprop optimizer (#55834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55834

ghstack-source-id: 126325536

Reviewed By: rohan-varma

Differential Revision: D27703878

fbshipit-source-id: 5c8ec9a4ccb4442b2b51d48d75ea5cd506179f14
2021-04-15 15:19:44 -07:00
4e9e7200f2 [dist_optim] Add distributed functional Adamax optimizer (#55833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55833

Add distributed functional Adamax optimizer, to support in TorchScript
ghstack-source-id: 126325538

Reviewed By: rohan-varma

Differential Revision: D26696540

fbshipit-source-id: 6242faebd2476847831a05df7f8b0d616f2b5355
2021-04-15 15:19:43 -07:00
8ef13cf976 [optim] refactor rprop to use functional API (#55832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55832

ghstack-source-id: 126325541

Reviewed By: driazati

Differential Revision: D27703877

fbshipit-source-id: 34d4ce7b7d124c0cd75e2f6d0bc8f836713b7301
2021-04-15 15:19:41 -07:00
bb245b6444 [optim] refactor adamax to use functional API (#55830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55830

ghstack-source-id: 126325537

Reviewed By: driazati

Differential Revision: D26561017

fbshipit-source-id: 41273d200e546d4ac08d39b57865d63c624f143a
2021-04-15 15:19:39 -07:00
f26a6cb372 [quantization] Fix deepcopy on quantized ConvNd (#56154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56154

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27796268

Pulled By: jamesr66a

fbshipit-source-id: cb693dc16582a9334c93f46201c42eb0f4b794b3
2021-04-15 15:18:22 -07:00
a3a75bd35e Add complex autograd support for torch.cross (#55854)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55854

Reviewed By: nikithamalgifb

Differential Revision: D27737571

Pulled By: anjali411

fbshipit-source-id: 38165b952cc4c9213d61c7d98b549b984c154927
2021-04-15 15:07:25 -07:00
90e103ddfe Revert D27753803: [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher
Test Plan: revert-hammer

Differential Revision:
D27753803 (7c708ef4ea)

Original commit changeset: 5f24bcfdcb70

fbshipit-source-id: 650e229b788d046450615364e5cba65065a95e3b
2021-04-15 15:03:14 -07:00
512c744f2e [torch/elastic] Introduce PeriodicTimer (#55919)
Summary:
This PR introduces a basic timer type that periodically calls a specified function. Its main use in the upcoming `DynamicRendezvousHandler` implementation will be to send periodic keep-alive updates in a background thread.
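
A minimal sketch of such a timer in Python, assuming "call `fn` every `interval` seconds until cancelled" semantics (not the actual implementation in this PR):

```python
import threading

class PeriodicTimer:
    def __init__(self, interval, fn, *args, **kwargs):
        self._interval = interval
        self._fn = lambda: fn(*args, **kwargs)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait doubles as a cancellable sleep: it returns True
        # (and the loop exits) as soon as cancel() sets the event.
        while not self._stop.wait(self._interval):
            self._fn()

    def cancel(self):
        self._stop.set()
```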

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55919

Reviewed By: tierex

Differential Revision: D27740823

Pulled By: cbalioglu

fbshipit-source-id: e46fc848ab033995946a38a29c01d67d387a4cf5
2021-04-15 14:51:14 -07:00
e2036ea342 Revert D27758303: [20/n][torch/elastic][upstream] Move torchelastic.distributed.tests to pytorch.distributed
Test Plan: revert-hammer

Differential Revision:
D27758303 (9f6fed8a15)

Original commit changeset: c987d4764f47

fbshipit-source-id: 90846dcd5c8512dd615c7f44dc3663f124cf4a25
2021-04-15 14:51:13 -07:00
9bfe16a308 should_check_autodiff is now should_autodiff_node (#56013)
Summary:
The name `should_check_autodiff` became `should_autodiff_node`, but the documentation did not change. The identifier is used in `test/test_jit.py`. It seems the file is too big for GitHub to link to the line, but it is the return value of `normalize_check_ad`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56013

Reviewed By: agolynski

Differential Revision: D27800008

Pulled By: Lilyjjo

fbshipit-source-id: 88a43c14c0f48fb3f94792e3fd6de2bd6a59a1a2
2021-04-15 14:49:49 -07:00
aae1023bed [caffe2] allow passing options to the DB in Save operations (#55935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55935

Add a new `DB::SetOptions()` method to allow passing options to the DB as part
of Save operations.  This can be used for passing in options to control the
serialization behavior, such as rate limits or other parameters.  The
serialization options are passed is an opaque string, so that different DB
implementations may choose their own options and options format.

This also adds a new `db_options` parameter to the `Save` operator.
This allows users to pass in the DB options when saving data.
ghstack-source-id: 126589771

Test Plan:
I don't have any tests in this diff since no DB implements options yet.  The
next diff in the stack includes an options implementation, along with unit
tests that verify the options are passed in correctly.

Differential Revision: D27729461

fbshipit-source-id: 4d03250c389c66a049cdee1d05e082f5649ac0f0
2021-04-15 14:45:47 -07:00
14d529a368 Add support for refinement for torch.jit.Future (#56148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56148

Fixes issue: #55787

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27796830

Pulled By: nikithamalgifb

fbshipit-source-id: b7a60218010793a54eb52d6b7602d333dc5a1c9e
2021-04-15 14:08:58 -07:00
33159b68a3 Revert "Deprecate legacy constructor torch.Tensor() (#54414)" (#55831)
Summary:
This PR reverts https://github.com/pytorch/pytorch/pull/54414 because of https://github.com/pytorch/pytorch/issues/55780

cc ysiraichi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55831

Reviewed By: agolynski

Differential Revision: D27762264

Pulled By: heitorschueroff

fbshipit-source-id: 8079a660cc440cafb9d22aa031d36dde121e13b3
2021-04-15 14:06:10 -07:00
b940516061 [nnc] Don't fuse fp16 on CPU (#56119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56119

There are apparently still more issues with fp16 on LLVM so let's just
nuke it from orbit while we develop a robust workaround.
ghstack-source-id: 126619411

Test Plan: compile

Reviewed By: ZolotukhinM

Differential Revision: D27787080

fbshipit-source-id: 9e771211fe48266f50fca1de8d40295922da5bca
2021-04-15 14:01:29 -07:00
16820bba5a [nnc][trivial] Trailing underscore style for llvmCode, asmCode members (#56118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56118

that's it
ghstack-source-id: 126592595

Test Plan: compile

Reviewed By: huiguoo

Differential Revision: D27781682

fbshipit-source-id: 12728c279d0e02eb007093e18d9fc989456bea77
2021-04-15 14:01:28 -07:00
d56f451820 [nnc] Separate printing of optimized llvm bitcode from assembly (#56117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56117

I was debugging an issue during instruction selection and wanted to
see the input bitcode.  This way we always print it before going into the asm
generation pass.
ghstack-source-id: 126592596

Test Plan: Run with `PYTORCH_JIT_LOG_LEVEL=">>llvm_codegen"`

Reviewed By: huiguoo

Differential Revision: D27781683

fbshipit-source-id: 84635d0ca2a1318ae7a9a73cc1d2df450d8b6a08
2021-04-15 13:59:35 -07:00
06ea73942a [easy] Rename fb::jpeg_decode_to_NCHW to fb::image_decode_to_NCHW (#55857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55857

Since OpenCV supports more than just the JPEG file format.

ghstack-source-id: 126528422

Test Plan: Build

Reviewed By: JacobSzwejbka

Differential Revision: D27722865

fbshipit-source-id: 6cf83bf187bb1fb3a28e3aa2a011959ef8925449
2021-04-15 13:44:13 -07:00
63f83edcfb OpInfo porting for torch.real & torch.imag (#55134)
Summary:
Related https://github.com/pytorch/pytorch/issues/54298

This PR ports the method_tests() entries of torch.real & torch.imag to OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55134

Reviewed By: agolynski

Differential Revision: D27793242

Pulled By: anjali411

fbshipit-source-id: 0e9a987bfef16e78a1cda81ce14970993a59e467
2021-04-15 13:28:21 -07:00
5ed3be799d skip test_filtering_env_var for rocm (#56178)
Summary:
ROCM doesn't report the correct number of expected test device type. Skipping for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56178

Reviewed By: seemethere

Differential Revision: D27802139

Pulled By: walterddr

fbshipit-source-id: 2e58df1a3ba2411e690be52babf946e284c4efcc
2021-04-15 13:20:03 -07:00
6c327ef9d4 matches_jit_signatures is dead (#53637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53637

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26920687

Pulled By: ezyang

fbshipit-source-id: 288bd9dca63da04ccc633d939833066a3305a68a
2021-04-15 12:31:19 -07:00
6366658fbf Add OpInfo for torch.nansum (#55523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55523

Reviewed By: agolynski

Differential Revision: D27796660

Pulled By: heitorschueroff

fbshipit-source-id: fea4d9dcccb7f4b9ba1b00079fb3899a8d20ba4b
2021-04-15 12:11:34 -07:00
9f6fed8a15 [20/n][torch/elastic][upstream] Move torchelastic.distributed.tests to pytorch.distributed (#56077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56077

Move torchelastic.distributed.tests to pytorch.distributed

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: cbalioglu

Differential Revision: D27758303

fbshipit-source-id: c987d4764f4776f55306988b02eae2306db06c2b
2021-04-15 11:58:05 -07:00
857d8264a7 Skip RPC's CPU-only tests on CircleCI GPU jobs (#55778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55778

The RPC suite takes very long to run, and most of it is CPU-only. As long as we run the CPU-only part on some CPU worker on CircleCI, we can skip it on the GPU workers (which are expensive, and we shouldn't waste their time).

ghstack-source-id: 126270873

Test Plan: Exported to CircleCI and checked that the CPU-only part still runs on the CPU workers but doesn't on the GPU workers.

Reviewed By: mrshenli

Differential Revision: D27705941

fbshipit-source-id: a0a509d6e72cf69e417f4b48336df534b070a66d
2021-04-15 11:20:44 -07:00
0a06d054d0 Revert "Only allow hub.load() from original repo. (#54451)" (#56048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56048

This reverts commit c411017a41988e9c5184279c1ec7dd7ef4e1a6fe.

This implementation broke CI in pytorch/vision and it's not handling
tags properly. So I want to revert it first to unblock vision CI and
send out a proper fix later.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D27771701

Pulled By: ailzhang

fbshipit-source-id: 932f9be72a1ae1816f4032643b3c2dde0cb7ae4c
2021-04-15 11:16:56 -07:00
71f9e99e29 [torch/elastic] Introduce aux types required by DynamicRendezvousHandler (#55932)
Summary:
This PR includes the auxiliary types used by the upcoming implementation of the `DynamicRendezvousHandler`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55932

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27742329

Pulled By: cbalioglu

fbshipit-source-id: cf2e0d88042909739e7c37c25b4b90192c26e198
2021-04-15 11:12:20 -07:00
7c708ef4ea [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces a new `torch.distributed.elastic_launch` and removes the internals of `torch.distributed.launch`, keeping backwards compatibility.

Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.

The diff leaves the `torchelastic.distributed.launch` module in place, and follow-up diffs will migrate users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.
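
For illustration, the `--use_env` difference boils down to how a worker script learns its local rank (a sketch; `LOCAL_RANK` is the environment variable set when `--use_env` is passed):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# torch.distributed.launch without --use_env passes the rank as a CLI arg:
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# With --use_env (and with elastic-style launchers) it comes from the env:
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
```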

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: cbalioglu

Differential Revision: D27753803

fbshipit-source-id: 5f24bcfdcb70356f0787b11f6cb9479f3515fb47
2021-04-15 11:09:12 -07:00
728d2e4e0f [BE] Speed up runtime of test_ddp_model_diff_across_ranks (#55659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55659

As per https://github.com/pytorch/pytorch/issues/55583, this is the most expensive distributed test.

Instead of waiting for process 0 in this test to be taken down by
nccl_async_error_handling, just remove the barrier and let the process exit
when the backend is NCCL.

A slight downside here is that the test no longer verifies that the process
would be brought down by nccl_async_error_handling, but
nccl_async_error_handling is already well tested in other tests. If we feel we
need to ensure this for this test, then we can pass in a process group with a
smaller timeout as an alternative solution.

The test now runs in 4-6s as opposed to ~70s. Ran the test 1000 times to verify
no flakiness.
ghstack-source-id: 126590904

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27672161

fbshipit-source-id: 38fb518606daac9b0390ca4c3ce1a72dc2da36fc
2021-04-15 10:30:14 -07:00
7eed077406 [android] Fix headers publishing in aar (#56068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56068

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D27776655

Pulled By: IvanKobzarev

fbshipit-source-id: 75e07b56dab8f7ff2ab501d0ddc4566ef2378fcf
2021-04-15 09:54:08 -07:00
49e5e284ea Additional annotations in fbcode/caffe2/torch/_jit_internal.py (#55855)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55855

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D27715202

fbshipit-source-id: 99d59345a1915030f12441de91a6b7d4250a1f43
2021-04-15 09:47:17 -07:00
a60dca8e80 Make the script generate cancel_redundant_workflows.yml (#56092)
Summary:
This way, the user would just have to run the `regenerate_cancel_redundant_workflow.py` script to fix the inconsistency (instead of manual stuff).

Lots of the indentation changes were caused by regenerating the file, which I don't think is terrible, and ruamel.yaml did great at preserving comments and order and such!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56092

Reviewed By: samestep

Differential Revision: D27780877

Pulled By: janeyx99

fbshipit-source-id: dd2996a88cd70a83d8daac33ba6659f93add8b92
2021-04-15 09:36:49 -07:00
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs a mapping from parameter index to parameter name and passes it into Reducer, so that error messages about unused parameters / not all parameters getting gradient can name the offending parameters.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs the param index -> param name mapping, where the name is the fully qualified name f"{module_name}:{param_name}", and passes it into Reducer (see the sketch after this list)
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unit tests for the logged unused params, including for nested modules, are added
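
A minimal sketch of how such an index -> name mapping can be built (helper name hypothetical; `named_parameters()` yields dotted names, which the PR formats as f"{module_name}:{param_name}"):

```python
import torch.nn as nn

def build_param_index_to_name(model: nn.Module) -> dict:
    # Store only indices and names, never the parameter tensors
    # themselves, per the memory-saving choice in point 1 above.
    return {
        idx: fqn
        for idx, (fqn, _param) in enumerate(model.named_parameters())
    }
```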
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
1934725875 Use cascade summation in nll_loss on CPU (#55841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55657

This also avoids summing `total_weight_val` when weights aren't supplied, avoiding accumulated error completely.
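
For illustration, cascade (pairwise) summation recursively splits the sum in half, so rounding error grows like O(log n) rather than the O(n) of a sequential left-to-right sum (a pure-Python sketch, not the vectorized kernel):

```python
def cascade_sum(xs):
    # Sum each half recursively, then combine; the recursion depth,
    # and hence the error growth, is O(log n).
    n = len(xs)
    if n == 0:
        return 0.0
    if n == 1:
        return xs[0]
    mid = n // 2
    return cascade_sum(xs[:mid]) + cascade_sum(xs[mid:])
```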

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55841

Reviewed By: jbschlosser

Differential Revision: D27751492

Pulled By: ngimel

fbshipit-source-id: 2c2dc48f31c25dfa9db48693e3f765b179771a3c
2021-04-15 09:10:35 -07:00
6c65ce8ee1 Use THPVariable_Unpack in python_nccl (#56016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56016

Missed these because I don't build on CUDA

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27765124

Pulled By: ezyang

fbshipit-source-id: aa202f594659d53c903b88c9d4a4cbb0e1c0b40a
2021-04-15 08:57:06 -07:00
6ec71ed4f9 Replace all direct cdata access with THPVariable_Unpack (#55799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55799

I'm going to change the implementation of cdata soon so I need to
abstract over cdata access with a function.  Additionally, many
users are manually casting to THPVariable to access
the member, so I can remove these unsafe casts in the client code
(the implementation, of course, is still doing an unsafe cast.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27712130

Pulled By: ezyang

fbshipit-source-id: 95fcc013bf3913d67f2c634068eb5b3aab144cb3
2021-04-15 08:57:04 -07:00
61418aa069 Make THPVariable_Unpack work on THPVariable too (#55798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55798

I'm going to change how cdata is implemented internally, so I want to
make all callsites call through THPVariable_Unpack even if they
actually have a THPVariable in hand

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27712131

Pulled By: ezyang

fbshipit-source-id: bd2eb1e43c52c6b7a776ff3a45350a23934e643c
2021-04-15 08:57:02 -07:00
82a7fff3cd Modify a few APIs to take/return const Tensor& instead of Tensor& (#55797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55797

In all of these cases, the inside of the function didn't make use
of the fact that the tensor was a mutable reference

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27712132

Pulled By: ezyang

fbshipit-source-id: 99e0bb1d783f63d2d42ab53d3d406b2064405ef4
2021-04-15 08:57:00 -07:00
e8faf69739 fix torch.pow type promotion issue (#54085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085

Fixes https://github.com/pytorch/pytorch/issues/50121.

This fixes two similar issues pointed out with the dtype that `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!

Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```

After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```

In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int64`.

There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, The `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.

As an aside: The CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is because TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there- I can imagine that being something we're more okay doing for cpu than for cuda, but it also leads to hard-to-track-down inconsistencies between the two like in this case.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27096330

Pulled By: bdhirsh

fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
2021-04-15 08:55:53 -07:00
9d3d169d2d Implement hardswish/hardsigmoid on MKLDNN tensors (#55218)
Summary:
Adding hardswish and hardsigmoid improves mobilenetv3 by ~13%

  | hardswish | base | ratio
-- | -- | -- | --
run 1 | 1305.032 | 1486.013 |
run 2 | 1290.142 | 1491.001 |
run 3 | 1305.51 | 1491.66 |
run 4 | 1308.788 | 1495.577 |
avg | 1302.368 | 1491.063 | 0.873449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55218

Reviewed By: albanD

Differential Revision: D27701276

Pulled By: Krovatkin

fbshipit-source-id: cde78da71d327e65461e80fbb6c3bb3429505410
2021-04-15 08:51:30 -07:00
71a5314591 Fix ScriptMethod dispatch on __torch_function__ (#56103)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56103

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27784142

Pulled By: jamesr66a

fbshipit-source-id: 555dcb7c3a98b8fb9e9ca9b499cafad54e819aa7
2021-04-15 08:46:43 -07:00
61725f15c0 cleanup unused implicit argument of expand function (#56101)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56101

Reviewed By: mruberry

Differential Revision: D27783771

Pulled By: ngimel

fbshipit-source-id: 73044461fc2d7bfab5e84eef87ff381f40a46bad
2021-04-15 08:43:15 -07:00
6daa1760d7 Skip geqrf test if compiled without LAPACK (#56105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56105

Reviewed By: walterddr

Differential Revision: D27785443

Pulled By: malfet

fbshipit-source-id: 9701f693a71f77259c0a6371106e7185cc49a803
2021-04-15 08:07:51 -07:00
e0f9a5fed8 [BE] add test selector to test_testing (#55931)
Summary:
This is a reflection of recent failures in https://github.com/pytorch/pytorch/issues/55753 and https://github.com/pytorch/pytorch/issues/55522.
We are lacking a test to safeguard these test env var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55931

Test Plan:
1. CI
2. Run locally using `python test/test_testing.py -k test_filtering_env_var -v`
  - gives failure on 2ca45cb9e8 and d0cd16899f
  - passes on 159e1100bf and current master

Reviewed By: jbschlosser

Differential Revision: D27747537

Pulled By: walterddr

fbshipit-source-id: c88e1c818199c7838866037d702d4012cacf510e
2021-04-15 08:00:46 -07:00
1d49fd31c4 [reland] Add formulas and basic tests (#56083)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/49098
See original issue for details.

The only difference from the previous PR is the fix of the _embedding_bag_dense_backward formula to stop declaring a backward formula for an argument that does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56083

Reviewed By: samestep

Differential Revision: D27778221

Pulled By: albanD

fbshipit-source-id: 159ef91ca931ef2ccfbc3d1c46c7880c32919dc9
2021-04-15 07:52:43 -07:00
b383b63550 [ROCm] Updating ROCM_HOME handling for >ROCm 4.0 (#55968)
Summary:
- This change is required to handle the case when hipcc is
  updated to the latest using update-alternatives.
- Update-alternatives support for few ROCm binaries is available
  from ROCm 4.1 onwards.
- This change does not affect any previous versions of ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55968

Reviewed By: mruberry

Differential Revision: D27785123

Pulled By: ezyang

fbshipit-source-id: 8467e468d8d51277fab9b0c8cbd57e80bbcfc7f7
2021-04-15 07:48:36 -07:00
5cab3b9cf6 Revert D27709912: TCPStore add watchKey method and new listener thread
Test Plan: revert-hammer

Differential Revision:
D27709912 (f8f756efb2)

Original commit changeset: 619aa3b2a8eb

fbshipit-source-id: 3ef96ccaa76c702d7e5427dfc263531fb1c274ab
2021-04-15 07:43:48 -07:00
6350fcef83 [testing] add broadcasts_input and verifies the behaviour for inplace_variant. (#55771)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55595

* Add `broadcasts_input` attribute to SampleInput
* Update test_variant_consistency_eager to verify that a sample with `broadcasts_input==True` raises an error with the inplace variant (illustrated below).
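
The behavior being verified is the standard in-place broadcasting rule: if `self` would have to be broadcast to a larger shape, the in-place op must raise. For example:

```python
import torch

a = torch.zeros(2, 3)
b = torch.zeros(3)

a.add(b)   # fine out-of-place: b broadcasts up to a's shape (2, 3)
b.add_(a)  # raises RuntimeError: the in-place output (shape [3])
           # cannot hold the broadcast result (shape [2, 3])
```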

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55771

Reviewed By: jbschlosser, ngimel

Differential Revision: D27760530

Pulled By: mruberry

fbshipit-source-id: feb0658730d4cff483848a5ade9512837a65c24c
2021-04-15 07:39:50 -07:00
50057e560b [special] Add i0e (#54409)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Changes:
* Add `i0e`
* Move some kernels from `UnaryOpsKernel.cu` to `UnarySpecialOpsKernel.cu` to decrease compilation time per file.

Time taken by the i0e_vs_scipy tests: around 6.33s
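
For illustration, the new operator can be spot-checked against SciPy the same way the tests do (a minimal sketch; assumes SciPy is installed):

```python
import torch
from scipy import special

x = torch.linspace(-5, 5, steps=11, dtype=torch.float64)
ours = torch.special.i0e(x)
ref = torch.from_numpy(special.i0e(x.numpy()))
assert torch.allclose(ours, ref)
```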

<details>

<summary>Test Run Log</summary>

```
(pytorch-cuda-dev) kshiteej@qgpu1:~/Pytorch/pytorch_module_special$ pytest test/test_unary_ufuncs.py -k _i0e_vs
======================================================================= test session starts ========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/kshiteej/Pytorch/pytorch_module_special, configfile: pytest.ini
plugins: hypothesis-5.38.1
collected 8843 items / 8833 deselected / 10 selected

test/test_unary_ufuncs.py ...sss....                                                                                                                         [100%]

========================================================================= warnings summary =========================================================================
../../.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73
test/test_unary_ufuncs.py::TestUnaryUfuncsCUDA::test_special_i0e_vs_scipy_cuda_bfloat16
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
===================================================================== short test summary info ======================================================================
SKIPPED [3] test/test_unary_ufuncs.py:1182: not implemented: Could not run 'aten::_copy_from' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_copy_from' is only available for these backends: [BackendSelect, Named, InplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
InplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:56 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_4.cpp:9348 [kernel]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:250 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
==================================================== 7 passed, 3 skipped, 8833 deselected, 2 warnings in 6.33s =====================================================
```

</details>

TODO:
* [x] Check rendered docs (https://11743402-65600975-gh.circle-artifacts.com/0/docs/special.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54409

Reviewed By: jbschlosser

Differential Revision: D27760472

Pulled By: mruberry

fbshipit-source-id: bdfbcaa798b00c51dc9513c34626246c8fc10548
2021-04-15 06:06:11 -07:00
2f895f790a [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27789747

fbshipit-source-id: ef4882e92d7755669083573c43ae6c5088bf01ab
2021-04-15 04:27:27 -07:00
84e6580b5f Use cusolver potrs as the backend of cholesky_inverse for batch_size == 1 on CUDA (#54676)
Summary:
This PR adds the functionality to use cusolver potrs as the backend of cholesky_inverse for batch_size == 1 on CUDA.

Cusolver `potri` is **not** used, because

- it only returns the upper or lower triangular matrix as a result. Although the other half is zero, we may still need extra kernels to get the full Hermitian matrix
- it's no faster than cusolver potrs in most cases
- it doesn't have a batched version or 64-bit version

`cholesky_inverse` dispatch heuristics (a sketch follows the list):

- If magma is not installed, or batch_size is 1, dispatch to `cusolverDnXpotrs` (64 bit) and `cusolverDn<T>potrs` (legacy).
- Otherwise, use magma.
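
Restated as a sketch (the function and backend labels here are illustrative only, not actual identifiers):

```python
def choose_cholesky_inverse_backend(batch_size: int, has_magma: bool) -> str:
    # Single matrices, or builds without MAGMA, go to cuSOLVER potrs;
    # batched inputs go to MAGMA.
    if not has_magma or batch_size == 1:
        return "cusolver_potrs"  # cusolverDnXpotrs (64-bit) / cusolverDn<T>potrs
    return "magma"
```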

See also https://github.com/pytorch/pytorch/issues/42666 #47953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54676

Reviewed By: ngimel

Differential Revision: D27723805

Pulled By: mruberry

fbshipit-source-id: f65122812c9e56a781aabe4d87ed28b309abf93f
2021-04-15 04:16:18 -07:00
699b47cd2c Update use_deterministic_algorithms docs (#55413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55413

Reviewed By: jbschlosser

Differential Revision: D27759069

Pulled By: mruberry

fbshipit-source-id: 16c0dc1dc6f80ddd4f131e5e91729bbda8850878
2021-04-15 04:04:27 -07:00
3802e577fb [TensorPipe] Use Descriptor::Tensor::sourceDevice in tensorpipe_agent. (#55821)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55821

Test Plan: CI

Reviewed By: lw

Differential Revision: D27661608

fbshipit-source-id: fd241f073d8928528a749758c7d0f570dfeb677b
2021-04-15 03:21:26 -07:00
047164437e [TensorPipe] Prepare for new Pipe API. (#55820)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55820

Test Plan: CI

Reviewed By: lw

Differential Revision: D27648291

fbshipit-source-id: e08db6e8c1f5f333ec355de29e25fbe552904b25
2021-04-15 03:20:32 -07:00
6eeffc64f1 Port NumPy typing testing style to PyTorch (#54234)
Summary:
This is a follow-up PR of https://github.com/pytorch/pytorch/issues/52408 and includes the `pass/` and `fail/` directories.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54234

Reviewed By: walterddr

Differential Revision: D27681410

Pulled By: malfet

fbshipit-source-id: e6817df77c758f4c1295ea62613106c71cfd3fc3
2021-04-15 01:25:16 -07:00
a128938a75 [ROCm] add MAGMA_HOME env var hint to cmake, centos-rocm Dockerfile (#54511)
Summary:
MAGMA_HOME was previously set for the ubuntu-rocm/Dockerfile.  However, this missed centos builds as well as any builds that do not use the CI image environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54511

Reviewed By: jbschlosser

Differential Revision: D27755983

Pulled By: malfet

fbshipit-source-id: 1ffd2cd100f4221c2bb64e6915fa3372ee1f6247
2021-04-15 01:06:44 -07:00
1e9c7ad4cb Add a test to measure import torch time (#56041)
Summary:
This PR adds a couple very simple tests which (as the code comment says) measure the time it takes to `import torch` and ask for the CUDA device count.
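
The idea, as a standalone sketch: time a cold `import torch` in a fresh interpreter so that imports already cached in the measuring process don't skew the result (this is not the test's actual code):

```python
import subprocess
import sys
import time

start = time.time()
subprocess.check_call([sys.executable, "-c", "import torch"])
print(f"cold 'import torch' took {time.time() - start:.2f}s")
```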

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56041

Test Plan:
```
$ rm -r /tmp/reports ; python3 test/test_import_time.py --save-xml=/tmp/reports

Running tests...
----------------------------------------------------------------------
..
----------------------------------------------------------------------
Ran 2 tests in 1.855s

OK

Generating XML reports...
```
```
$ tools/print_test_stats.py /tmp/reports
No scribe access token provided, skip sending report!
class TestImportTime:
    tests: 2 failed: 0 skipped: 0 errored: 0
    run_time: 1.85 seconds
    avg_time: 0.93 seconds
    median_time: 0.93 seconds
    2 longest tests:
        test_time_cuda_device_count time: 1.10 seconds
        test_time_import_torch time: 0.75 seconds

Total runtime is 0:00:01
2 longest tests of entire run:
    TestImportTime.test_time_cuda_device_count  time: 1.10 seconds
    TestImportTime.test_time_import_torch  time: 0.75 seconds
```

Reviewed By: driazati

Differential Revision: D27770908

Pulled By: samestep

fbshipit-source-id: 01bbf5a339f41d3a1f493e6fa8c946ff7567daec
2021-04-15 00:53:30 -07:00
75b6644a4c Add USE_NUMPY define only if PyTorch is compiled with Numpy (#56102)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56102

Reviewed By: driazati, mruberry

Differential Revision: D27784057

Pulled By: malfet

fbshipit-source-id: 636a005e7f74a58d47188f18b74f7deb4afe5fcb
2021-04-15 00:48:03 -07:00
81f181567a Add USE_MAGMA build flag (#55994)
Summary:
Many model pipelines/workflows don't use MAGMA even though it is included in the build by default. Leaving MAGMA kernels out of the build can save 60+MB of GPU memory when loading `libtorch_cuda.so` (tested on V100, current upstream master).

A current sharp corner of this flag is that toggling it when rebuilding requires `torch/include/THC/THCGeneral.h` to be *manually* deleted by the user, as even running `make clean` or `setup.py` with `--cmake` does not properly regenerate it with the appropriate substitution for `#cmakedefine USE_MAGMA`. Is there a way to force the regeneration of the header during a rebuild?

CC malfet ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55994

Reviewed By: mruberry

Differential Revision: D27766287

Pulled By: malfet

fbshipit-source-id: 93deca57befa0febb9c5b7875ecf0015c547d421
2021-04-15 00:43:12 -07:00
1995640d86 Fix compiler warnings in mkldnn Pooling (#56095)
Summary:
Also, add the `-Werror` flag to prevent these regressions from happening in
the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56095

Reviewed By: walterddr

Differential Revision: D27781603

Pulled By: malfet

fbshipit-source-id: 2a404788a965c380ff9feb72d0b2d967b131371f
2021-04-15 00:33:21 -07:00
f5a7b2e641 Put llvmMathExtras in c10 namespace (#55886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55886

We've imported llvm's MathExtras header, but now that we want to also
include LLVM (which includes its own MathExtras), we need to guard the c10
version appropriately (or intertwine llvm more deeply with our build than just
the CPU fuser, which I'm not super excited about doing just yet).
ghstack-source-id: 126375067

Test Plan: build

Reviewed By: ZolotukhinM

Differential Revision: D27731038

fbshipit-source-id: 7c136341d6b433b3876ee983820016df75c14dec
2021-04-14 21:56:57 -07:00
556dfcb0db [TensorExpr] Re-enable "LoopNest.VectorizeUse" test. (#56094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56094

Now that FunctionCalls are merged with Loads, vectorization for
intermediate values automatically started to work.

Fixes #53553.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27781519

Pulled By: ZolotukhinM

fbshipit-source-id: 1ed68ca2399e9bd4598639bd6dd8f369365f0ef0
2021-04-14 21:39:03 -07:00
ad17fadbfc Revert D27652485: [nnc] Enable CPU fusion only when num_threads == 1
Test Plan: revert-hammer

Differential Revision:
D27652485 (e7e164f9e6)

Original commit changeset: 182580cf758d

fbshipit-source-id: e3c95b06d1eef668095f3cf461485395179d94af
2021-04-14 20:23:15 -07:00
506eca24b9 Revert D27752279: [nnc] Do not try to vectorize kernels that use float16
Test Plan: revert-hammer

Differential Revision:
D27752279 (8df5e61fd6)

Original commit changeset: ac115080bf2a

fbshipit-source-id: cbc0aa2dcb7691d9fc9d081c6169dea711cd9fac
2021-04-14 20:21:40 -07:00
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can pass through the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0

The diff modifies caffe2/rendezvous: if the worker process is launched with the torchelastic agent, the worker processes will create a PrefixStore("worker/") from a TCPStore without a listener.

The diff adds macro functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
c5f9e043e9 Collect instruction counts (and wall times) for CI (#55428)
Summary:
This PR adds a `--mode` flag and a script to collect microbenchmarks in a single JSON file. I also added a version check since benchmarks are expected to evolve; this also turned up a determinism bug in `init_from_variants` (`set` is not ordered, unlike `dict`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55428

Test Plan:
Run in CI

CC: ngimel wconstab ezyang bhosmer

Reviewed By: mruberry

Differential Revision: D27775284

Pulled By: robieta

fbshipit-source-id: c8c338fedbfb2860df207fe204212a0121ecb006
2021-04-14 17:53:13 -07:00
92a09fb87a Manual revert of D27369251 (#56080)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56080

Reviewed By: hansonw

Differential Revision: D27777498

Pulled By: Krovatkin

fbshipit-source-id: f72ca725ceba3c1fbd54c30014ac001d4b35b9eb
2021-04-14 17:25:59 -07:00
f8d331b33b PyTorch Execution Graph Observers (#55957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55957

This diff adds an execution graph observer that tracks all operators (dispatcher, autograd, jit, user defined, etc.) and their inputs and outputs. The results are written to a temp JSON file which can be used for further analysis. This supports various use cases, such as dependency analysis, performance optimizations, etc.

Some minor refactoring of existing code for clarity and completeness.

Test Plan:
Example output:

{F603167736}

```
=> buck build caffe2/torch/fb/observers:execution_graph_observer_runner --show-output

=> buck-out/gen/caffe2/torch/fb/observers/execution_graph_observer_runner --pytorch_enable_execution_graph_observer=true --pytorch_execution_graph_observer_iter_label="## START ##" --pytorch_execution_graph_observer_iter_target=3
I0414 01:26:55.834039 1038798 ExecutionGraphObserver.cpp:408] Enabled PyTorch execution graph observer
I0414 01:26:55.834717 1038798 ExecutionGraphObserver.cpp:411] Matching iteration start label: "## START ##"
I0414 01:26:55.834940 1038798 ExecutionGraphObserver.cpp:423] Target iteration: 3
I0414 01:26:55.835962 1038798 ExecutionGraphObserverRunner.cpp:50] Running test execution graph observer runner.
I0414 01:26:55.836180 1038798 ExecutionGraphObserverRunner.cpp:51] iterations: 10
I0414 01:26:55.836419 1038798 ExecutionGraphObserverRunner.cpp:52] output file name: /tmp/pytorch_execution_graph_1618388815_1038798_3.json
I0414 01:26:56.246432 1038798 ExecutionGraphObserver.cpp:137] Writing PyTorch execution graph to: /tmp/pytorch_execution_graph_1618388815_1038798_3.json
I0414 01:26:56.278715 1038798 ExecutionGraphObserver.cpp:314] PyTorch execution graph is written to file: /tmp/pytorch_execution_graph_1618388815_1038798_3.json
```

see `/tmp/pytorch_execution_graph_[timestamp]_[process_id]_[iter_target].json`

Reviewed By: albanD

Differential Revision: D27238906

fbshipit-source-id: 3eb717d7d512e2d51d3162e9995b1ccd18e5a725
2021-04-14 17:13:37 -07:00
55432982d2 [OpInfo][take2] move matmul to OpInfo (#55947)
Summary:
This is a reland of https://github.com/pytorch/pytorch/issues/55543 after fixing bfloat16 issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55947

Reviewed By: mruberry

Differential Revision: D27765035

Pulled By: walterddr

fbshipit-source-id: b27a769de7686777012194ebbc1f38fc5d4acb67
2021-04-14 16:01:40 -07:00
669a8acc54 [package] Allow save_module to accept module as arg (#55996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55996

**Summary**
This commit modifies `PackageExporter.save_module` so that the `module`
argument can be either a string (`str`) or a module
(`types.ModuleType`).
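
A minimal sketch of the argument normalization this implies (helper name hypothetical):

```python
import types

def _normalize_module_name(module) -> str:
    # save_module now accepts either a dotted name or the module itself.
    if isinstance(module, types.ModuleType):
        return module.__name__
    return module
```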

**Test Plan**
This commit adds a unit test similar to `TestSaveLoad.test_save_module`
that tests that calling `save_module` with a module object works.

**Fixes**
This commit fixes #55939.

Test Plan: Imported from OSS

Reviewed By: jamesr66a, huiguoo

Differential Revision: D27771781

Pulled By: SplitInfinity

fbshipit-source-id: 57c8cf45575bb8dcfca711759fadfff72efb35e7
2021-04-14 15:52:55 -07:00
1a116a9332 [Static runtime] Add optimize_graph_output_memory flag (#55811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811

- Added manage_graph_output_memory flag to opts (default false)
- Added checks for the flag dependencies among enable_out_variant, optimize_graph_output_memory, and optimize_memory
- Minor refactoring for readability

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime

Reviewed By: hlu1

Differential Revision: D27573780

fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
2021-04-14 15:41:18 -07:00
44e2c2cdfb Add a lint for native_functions.yaml (#56059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56059

The lint doesn't do very much; mostly it enforces that indentation
is consistent.  The real point of the lint is just to make sure
that we can still do codemod surgery with tools like ruamel,
by reusing the configuration in this script.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27774590

Pulled By: ezyang

fbshipit-source-id: c26bc6c95a478bd9b86387b18de7e906e7d13193
2021-04-14 15:20:29 -07:00
6b8696172f Fixed some Clang-Tidy checks in Aten Context class (#55942)
Summary:
Clang-Tidy showed that it's possible to make some methods in the Context class static and const, so I made them so.
It also shows that some unused headers from standard libraries are included, which I will fix in a next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55942

Reviewed By: mruberry

Differential Revision: D27766213

Pulled By: bdhirsh

fbshipit-source-id: 4bd9b92c0b8e5c540ac94fbd2bdace64949946e3
2021-04-14 14:55:44 -07:00
817fd932ac Revert D25607505: Add formulas and basic tests
Test Plan: revert-hammer

Differential Revision:
D25607505 (70f5905565)

Original commit changeset: fe2315d58768

fbshipit-source-id: 519d7426a6f32f0db51c4f360e5d5a79dbaac99d
2021-04-14 14:50:43 -07:00
ed03a0791e Change MessageType values from decimals to hexadecimals for readability (#55985)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55985

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27758101

Pulled By: pbelevich

fbshipit-source-id: 45a7c4d1c4fea874bca7b96e7f2b699ce3a199e5
2021-04-14 14:32:02 -07:00
50bd6a3640 ci: Remove CUDA 10.1 builds (#56056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56056

Since internal systems as well as colab have updated beyond CUDA 10.1,
I'd say it's safe to remove CUDA 10.1 builds entirely.

As mentioned in https://github.com/pytorch/pytorch/issues/55829#issuecomment-818236019

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27772826

Pulled By: seemethere

fbshipit-source-id: 1599bba26b73b909b2575130219e2708ade5654c
2021-04-14 14:24:56 -07:00
2e7e4d0795 ci: Add job to ensure python2 setup.py compat (#56057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56057

Just to make sure we don't add anything there that'd prevent python 2 users from receiving the correct error message

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27774120

Pulled By: seemethere

fbshipit-source-id: e40a1a2672a69eed3b6e834b1acbb7a04c0adec1
2021-04-14 14:17:14 -07:00
70f5905565 Add formulas and basic tests (#49098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49098

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds:
- Codegen support to define forward grad formulas and few manual formulas
- Codegen support to automatically generate formulas, as well as a few usages
- Tests for basic forward grad components

Codegen generated examples.
For each of them, the only part that changes is the `if` statement before the return that checks whether the forward grad is defined.

- For manual entry:
```yaml
- name: max(Tensor self) -> Tensor
  self: evenly_distribute_backward(grad, self, result)
  result: max_forward(self_fw_grad, self, result)
```

```cpp
Tensor max(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<MaxBackward1> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<MaxBackward1>(new MaxBackward1(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::max(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "max");
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto self_primal = toLegacyPrimal(self);
      auto result_new_fw_grad = max_forward(self_fw_grad, self_primal, result);
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}
```

- For element wise entry:
```yaml
- name: abs(Tensor self) -> Tensor
  self: grad * self.sgn()
  result: auto_element_wise
```

```cpp
Tensor abs(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<AbsBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AbsBackward>(new AbsBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::abs(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "abs");
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto self_primal = toLegacyPrimal(self);
      auto result_new_fw_grad = self_fw_grad * self_primal.sgn();
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  return result;
}
```
- For linear entry:
```yaml
- name: clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor
  self: grad
  result: auto_linear
```

```cpp
Tensor clone(const Tensor & self, c10::optional<MemoryFormat> memory_format) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<CloneBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<CloneBackward>(new CloneBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::clone(self_, memory_format);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto result_new_fw_grad = at::clone(self_fw_grad, memory_format);
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  return result;
}
```

- For no entry:
```yaml
- name: angle(Tensor self) -> Tensor
  self: angle_backward(grad, self)
```

```cpp
Tensor angle(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<AngleBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AngleBackward>(new AngleBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::angle(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "angle");
  TORCH_CHECK(!(isFwGradDefined(self)), "Trying to use forward prop with angle that does not support it.");
  return result;
}
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25607505

Pulled By: albanD

fbshipit-source-id: fe2315d587689af1cd5968536fa26c680b8b8829
2021-04-14 14:13:30 -07:00
1e225a5187 Add a few InferenceMode test cases to the wall. (#55993)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55993

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D27769478

Pulled By: ailzhang

fbshipit-source-id: 009592d64ef24e1cf7e977d02acf662eb841ca58
2021-04-14 13:48:37 -07:00
cc7fab6e9c Update pthreadpool (#55950)
Summary:
This updates pthreadpool to include [this commit](a134dd5d4c), which removes a bunch of deprecation warnings in the build.

Fixes https://github.com/pytorch/pytorch/issues/33760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55950

Reviewed By: mruberry

Differential Revision: D27773417

Pulled By: driazati

fbshipit-source-id: b4397787d882228bae47ddd8ccf628047466b904
2021-04-14 13:38:28 -07:00
0b8bd22614 Fix bug with rebuilding extensions every import (#56015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56015

Reviewed By: mruberry

Differential Revision: D27765934

Pulled By: ezyang

fbshipit-source-id: 65cace951fce5f2284ab91d8bd687ac89a2311fb
2021-04-14 13:25:01 -07:00
f8f756efb2 TCPStore add watchKey method and new listener thread (#54264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54264

**Changes**

- Creates new listener thread on each client to run the callback
- Creates a new class from which the listener thread and master thread derive; this class handles shutdown and cleanup of the thread on Windows and Linux
- Adds a watchKey method and updates any functions that change the key value.

**Background**
This PR adds functionality to TCPStore to allow users to watch a key and execute a callback on key change.

It introduces a new watchKey() API:
`TCPStore::watchKey(const std::string& key, std::function<void(std::string, std::string)> callback)`, which has parameters `key` and `callback(old_key, new_key)` to run on key change. Since current methods are blocking (for example, in `TCPStore::get()` a worker will send a "get key" request to the master -> wait for a response back -> then exit the function and return the value to the user), we need a non-blocking, asynchronous way to execute the callback whenever a key changes. This is done by creating a new listener thread on each client which the master can communicate with.

Right now the API is C++ only and only for TCPStore; the internal use case is elastic RPC. We will have an internal key such as `_NumNodes`, and all nodes in the elastic RPC group will watch this key. When a node leaves, this key will be updated and each node will execute a callback to clean up the Autograd context and RRef context.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27709912

Pulled By: H-Huang

fbshipit-source-id: 619aa3b2a8eb23f4be5f5736efdcca6c175aadf3
2021-04-14 13:23:12 -07:00
bc86358cf5 Make run_test.py work even if s3_stat_parser fails to import (#56039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56039

Python will try to eagerly resolve the name references even if
the import failed.  Quote them so that it doesn't.
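
A minimal sketch of the pattern (the `Report` name and module path are illustrative, not the exact symbols in run_test.py):

```py
try:
    from s3_stat_parser import Report  # may fail, e.g. without boto3
except ImportError:
    pass

# Unquoted annotation: `Report` is evaluated when the function is
# defined, so a failed import raises NameError right here.
# def summarize(r: Report) -> None: ...

# Quoted annotation: stays a string at definition time, so this is safe.
def summarize(r: "Report") -> None:
    print(r)
```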

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D27770536

Pulled By: ezyang

fbshipit-source-id: b111739289498f9bab856fb9424f3080efee4ee0
2021-04-14 13:21:50 -07:00
48a7d69946 Catch and ignore tracebacks for compilation errors (#55986)
Summary:
The Python traceback on a cmake invocation is meaningless to most developers, so this PR wraps it in a `try`/`except` so we can ignore it and save scrolling through the 20-or-so lines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55986

Pulled By: driazati

Reviewed By: wanchaol

Differential Revision: D27769304

fbshipit-source-id: 5889eea03db098d10576290abeeb4600029fb3f2
2021-04-14 13:05:27 -07:00
40d74e6f71 breakup optim, cuda documentation (#55673)
Summary:
Related to https://github.com/pytorch/pytorch/issues/52256

Use autosummary instead of autofunction to create subpages for optim and cuda functions/classes.

Also fix some minor formatting issues in the optim.LBFGS and cuda.stream docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55673

Reviewed By: jbschlosser

Differential Revision: D27747741

Pulled By: zou3519

fbshipit-source-id: 070681f840cdf4433a44af75be3483f16e5acf7d
2021-04-14 12:44:00 -07:00
fd15557ccc breakup autograd documentation (#55672)
Summary:
Related to https://github.com/pytorch/pytorch/issues/52256

Use autosummary instead of autofunction to create subpages for autograd functions. I left the autoclass parts intact but manually laid out their members.

Also, the LaTeX formatting of the special page emitted a warning (solved by adding `\begin{align}...\end{align}`), and the alignment of equations was fixed (by using `&=` instead of `=`).

zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55672

Reviewed By: jbschlosser

Differential Revision: D27736855

Pulled By: zou3519

fbshipit-source-id: addb56f4f81c82d8537884e0ff243c1e34969a6e
2021-04-14 12:40:00 -07:00
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.

Disabled these tests for Windows, similar to how they are disabled on macOS. The reason for disabling is that they use the libuv transport, which does not have as robust error handling as TCP on Linux. The result is that non-zero ranks that were healthy don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
752f5b1030 [reland][c10d] Log API usage of monitored barrier (#55989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55989

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
ghstack-source-id: 126477554

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758425

fbshipit-source-id: ebca8b6baf0019879bc4b16639d6cccf27dc6b1c
2021-04-14 12:25:35 -07:00
c8cf9114bf Include short test suites ln total_seconds stat (#56040)
Summary:
Up until this PR, the top-level `total_seconds` stat we've been uploading to S3 has only included suites longer than one second. This PR corrects that issue, and also clarifies the script's textual output for "longest tests of entire run".

(Note that the `total_time` local variable is passed as the `total_seconds` parameter in the call to `assemble_s3_object`.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56040

Test Plan:
Create a simple test file (call it `test_quick_maths.py`) with these contents:

```py
from torch.testing._internal.common_utils import TestCase, run_tests

class TestQuickMaths(TestCase):
    def test_two_plus_two(self):
        self.assertEqual(2 + 2, 4)

if __name__ == '__main__':
    run_tests()
```

Run it and save the test results:

```sh
rm -r /tmp/reports ; python3 test_quick_maths.py --save-xml=/tmp/reports
```

Then display them using the script:

```sh
tools/print_test_stats.py /tmp/reports
```

- Before this PR:

  ```
  No scribe access token provided, skip sending report!
  Total runtime is 0:00:00
  0 longest tests of entire run:
  ```

- With this PR:

  ```
  No scribe access token provided, skip sending report!
  Total runtime is 0:00:00.108000
  0 longest tests of entire run (ignoring suites totaling less than 1.0 seconds):
  ```

If you were to upload this to S3 (see https://github.com/pytorch/pytorch/issues/49190 for an example of how to do this manually), the top-level `total_seconds` field should also change from `0` to `0.108`.

Reviewed By: janeyx99

Differential Revision: D27770666

Pulled By: samestep

fbshipit-source-id: 8255a4726ab3a692bbeff4c48974fbb3c6375142
2021-04-14 11:53:55 -07:00
8df5e61fd6 [nnc] Do not try to vectorize kernels that use float16 (#55970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55970

LLVM's support for float16 is not great, and we were seeing assertion
failures trying to generate code for vectorized uses.  I note that clang
doesn't even try to vectorize operations involving half:
https://gcc.godbolt.org/z/86MW4xr17, so that's a good sign we shouldn't either.

Fixes #55905
ghstack-source-id: 126511474

Test Plan: pytest test_jit_fuser_te.py -k test_isnan

Reviewed By: asuhan

Differential Revision: D27752279

Pulled By: bertmaher

fbshipit-source-id: ac115080bf2a4a73d52b396d64a5bce0cf13abfe
2021-04-14 11:28:34 -07:00
087049000b Make c10 clang-tidy clean (#55870)
Summary:
This change was autogenerated by running:
```
% find c10 -iname "*.cpp" -exec python3 tools/clang_tidy.py -c build -x {} -s \;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55870

Reviewed By: janeyx99

Differential Revision: D27728617

Pulled By: malfet

fbshipit-source-id: bede4d7f0c106d51394d1e9efddf01bf894421c5
2021-04-14 11:23:28 -07:00
416c18b7c9 Add a batch_first arg to Transformer / MHA modules (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: pardon my inexperience since this is my first PR here; I did not realize the doc should not have any trailing white spaces, or that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.
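
A minimal sketch of the new flag (assuming `nn.MultiheadAttention` accepts it as described in this PR):

```py
import torch
import torch.nn as nn

# batch_first=True accepts (batch, seq, feature) inputs directly,
# avoiding the usual transpose to (seq, batch, feature).
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)  # (batch, seq, feature)
out, attn_weights = mha(x, x, x)
print(out.shape)  # torch.Size([2, 5, 16])
```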

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: mruberry

Differential Revision: D27765694

Pulled By: jbschlosser

fbshipit-source-id: c34774fa065d67c0ac130de20a54e66e608bdbf4
2021-04-14 11:18:42 -07:00
ba320cec6b Prepare for Azure Pipeline for multi-gpu tests (#55600)
Summary:
Previous PR: https://github.com/pytorch/pytorch/issues/52490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55600

Reviewed By: albanD, seemethere

Differential Revision: D27667544

Pulled By: malfet

fbshipit-source-id: f5843379807d8c95f3791d19ac0ab2d1973fa087
2021-04-14 10:02:21 -07:00
1127bab828 Make GHA for consistency cancel_redundant_workflow return useful err msg (#55961)
Summary:
This way, the user gets more useful actionable results from the GHA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55961

Test Plan: CI

Reviewed By: samestep

Differential Revision: D27749013

Pulled By: janeyx99

fbshipit-source-id: bb0edbcdab29ba8ef99005f6fcf52de6782b468d
2021-04-14 09:54:08 -07:00
3fe4718d16 Add padding_idx argument to EmbeddingBag (#49237)
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.
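
A minimal usage sketch (assuming the semantics described above):

```py
import torch
import torch.nn as nn

# Entries equal to padding_idx are skipped, so they do not
# contribute to the per-bag reduction.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3,
                      mode='mean', padding_idx=0)
inp = torch.tensor([[1, 2, 0],    # trailing 0 is padding
                    [4, 0, 0]])   # only index 4 counts here
print(bag(inp))  # means are taken over the non-padding entries only
```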

This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.

Fixes https://github.com/pytorch/pytorch/issues/3194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237

Reviewed By: walterddr, VitalyFedyunin

Differential Revision: D26948258

Pulled By: jbschlosser

fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
2021-04-14 09:38:01 -07:00
f94c95a2dd Revert D23752058: [pytorch][PR] Don't split oversize cached blocks
Test Plan: revert-hammer

Differential Revision:
D23752058 (67dcd62310)

Original commit changeset: ccb7c13e3cf8

fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27
2021-04-14 09:24:08 -07:00
e7e164f9e6 [nnc] Enable CPU fusion only when num_threads == 1 (#55621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55621

Fuser support for thread-level parallelism is a work in progress, so
only fuse when the program is running single-threaded.
ghstack-source-id: 126069259

Test Plan: observe fusion groups formed when torch.get_num_threads==1 vs not

Reviewed By: ZolotukhinM

Differential Revision: D27652485

fbshipit-source-id: 182580cf758d99dd499cc4591eb9d080884aa7ef
2021-04-14 09:16:54 -07:00
88c06d9dfc Add cuda device synchronization support in JIT (#55469)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55469

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27749077

Pulled By: nikithamalgifb

fbshipit-source-id: bce3d331ab781cf3232b47b4f02ef504b9eadc7e
2021-04-14 09:13:07 -07:00
1688a5d31a Cleanup since FEATURE_TORCH_MOBILE is always true. (#55835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55835

Now that https://github.com/pytorch/pytorch/pull/55238 has been landed for a
week with no complaints, it seems safe to say FEATURE_TORCH_MOBILE is
always true and we can do some cleanup.

Test Plan: Imported from OSS

Reviewed By: ezyang, walterddr

Differential Revision: D27721284

Pulled By: ailzhang

fbshipit-source-id: 4896bc5f736373d0922cfbe8eed0d16df62f0fa1
2021-04-14 09:08:18 -07:00
8188d18f8d ns for fx: add functional conv-relu fusion support (#55433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55433

Makes `F.conv{n}d -> F.relu` patterns work for NS weight
extraction.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_conv_fun
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27622417

fbshipit-source-id: d3ee08bd19865874cff3776c3b69e232fdfc5912
2021-04-14 09:04:37 -07:00
1ea95fa5b2 ns for fx: add test case for linear dynamic (#55432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55432

As titled.

Test Plan:
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_dynamic

Imported from OSS

Reviewed By: hx89

Differential Revision: D27622416

fbshipit-source-id: 319cfc0401e843006cafe4c6a272cb4d7462db18
2021-04-14 09:04:34 -07:00
784ae23d43 ns for fx: fix bug in weight extraction testing (#55431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55431

Fixes a bug in the test cases: returning early resulted
in some tests not being run. Adds logic for `nni.LinearReLU`,
which was unmasked by making the tests run.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_mod
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27622415

fbshipit-source-id: 79d9e3125e5d881d9d13645abbe4bd007a5e1d44
2021-04-14 09:04:32 -07:00
8b992ab0e4 ns for fx: add conv1d weight extraction (#55327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55327

Adds NS functionality for extracting weights from `F.conv1d` nodes.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_conv_fun
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27575425

fbshipit-source-id: 65fa194802ac7a9fb75b7616d962c5c2e71321ff
2021-04-14 09:04:30 -07:00
8fc1ca0d22 fx quant: fix prepacking for F.conv1d (#55311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55311

Before this PR, `F.conv1d` was matched by FX graph mode quant patterns
but the prepacking was happening inline.  There was also a bug with
argument type mismatch.

This PR fixes both issues and adds a test. Thanks jerryzh168 for the
code tip.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_functional_not_reference
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27575422

fbshipit-source-id: 42301e23cb101a9e64e46800813bc771317e233e
2021-04-14 09:04:28 -07:00
457fac0a33 ns for fx: move more weight matching logic to weight_utils.py (#55288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55288

No logic change, just moving util-like code to the utils file.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27575423

fbshipit-source-id: cd5188a0940bb664be7d0275faa7df8ea18401a8
2021-04-14 09:04:26 -07:00
13d7b40ea0 ns for fx: add F.conv2d and F.conv3d weight extraction (#55287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55287

Adds support for extracting weights from F.conv2d and F.conv3d.
F.conv1d and the fused variants are saved for future PRs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_conv_fun
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27575424

fbshipit-source-id: e945912d7d0ab320f47cab30d00d60ddb7497158
2021-04-14 09:04:24 -07:00
1fb2abc7ad ns for fx: rename SugraphTypeRelationship to SubgraphTypeRelationship (#55155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55155

Fixes typo in enum name, no logic change

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27504625

fbshipit-source-id: 21605dadb48225987f1da5ad5f6c30b0183278f2
2021-04-14 09:04:22 -07:00
37a404610f ns for fx: add allowlist for ops with same signature across dtypes (#55154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55154

Adds functionality to NS to allow matching nodes which have the
same signature across dtypes.  For now, only the skeleton is added,
we can fill out the rest of the ops later.  This is to unblock
the work to change `cat` to have the same signature for fp32 and int8,
and keep the testing we have for `cat` in NS.

For context, the main reason we are not matching nodes with equal types,
for now, is user-defined types for which we do not know the signature.
For now, the design is strictly an allowlist of everything. In the future,
we may adjust the design to safely match user-defined types.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_ops_with_same_fp32_and_int8_signature
python test/test_quantization.py TestFXGraphMatcher.test_nodes_with_equal_types_do_not_get_matched
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D27504624

fbshipit-source-id: 4f8eb4f3258caf6f99aa373ca7ba516ebbcf4779
2021-04-14 09:04:20 -07:00
444b318a90 ns for fx: add linear-relu mod weight extraction (#55080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55080

Adds support for extracting weights of linear-relu module pattern.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D27474701

fbshipit-source-id: 69ceaadc28d7fdcebd16d519367274d348b0dd29
2021-04-14 09:02:51 -07:00
2587a28bbd Improve the instructions on how to build the docs (#56018)
Summary:
This PR includes:

- A formatting change to make katex installation instructions more visible for Facebook employees.
- A short tip about how to start a lightweight HTTP server on a remote machine to browse the doc build artifacts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56018

Reviewed By: H-Huang

Differential Revision: D27765157

Pulled By: cbalioglu

fbshipit-source-id: 67663de0ba7b742e0deb5358d1e45eea9edd840f
2021-04-14 08:47:43 -07:00
b1d17bc55f Added OpInfo for torch.sum (#55406)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55406

Reviewed By: mrshenli

Differential Revision: D27620593

Pulled By: heitorschueroff

fbshipit-source-id: 73f0a1890d3a92c5374470610dce086a868763b3
2021-04-14 03:32:13 -07:00
67dcd62310 Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator. Permissive block splitting in the allocator allows very large blocks to be split into many pieces. Once split too finely, it is unlikely all pieces will be 'free' at the same time, so the original allocation can never be returned. Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator was holding 13 GB of 'split free blocks'.

Approach:

- Large blocks above a certain size are designated "oversize". This limit is currently set one decade above "large", at 200 MB
- Oversize blocks cannot be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriately sized block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering

Initial performance tests show this is similar to or quicker than the original strategy.  Additional tests are ongoing.
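
A toy sketch of the matching rule described above (the slack constant is illustrative, not the allocator's actual code):

```py
OVERSIZE = 200 * 1024 * 1024   # blocks above this size are never split
SLACK = 20 * 1024 * 1024       # how closely an oversize block must match

def block_satisfies(request_bytes, block_bytes):
    if block_bytes < OVERSIZE:
        return block_bytes >= request_bytes  # normal blocks may be split
    # Oversize: must closely match, e.g. a 205 MB block serves a 200 MB
    # request, but a 300 MB block does not.
    return request_bytes <= block_bytes <= request_bytes + SLACK
```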

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: ngimel

Differential Revision: D23752058

Pulled By: ezyang

fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
2021-04-14 03:04:41 -07:00
09c0bb4fb9 Make replication_pad2d structured (#55511)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55511

Reviewed By: albanD

Differential Revision: D27681155

Pulled By: asuhan

fbshipit-source-id: 9851d856b601337faa39a242904b8e4e696aeb61
2021-04-14 01:08:10 -07:00
7985753421 [package] Add dependency tracing function (#55167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55167

**Summary**
This commit adds a function that uses `sys.setprofile` to trace the
execution of a callable in order to determine which modules it really
uses. The result of this trace can inform packaging decisions.
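
A minimal sketch of the technique (the helper name and return format here are illustrative; the actual torch.package API may differ):

```py
import sys

def trace_used_modules(fn, *args, **kwargs):
    """Run fn and collect the modules whose code actually executes."""
    seen = set()

    def profiler(frame, event, arg):
        if event == "call":
            module = frame.f_globals.get("__name__")
            if module is not None:
                seen.add(module)

    sys.setprofile(profiler)
    try:
        fn(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always restore profiling state
    return seen
```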

**Test Plan**
This commit adds a unit test to `test_analyze.py` that tests this
feature.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27730805

Pulled By: SplitInfinity

fbshipit-source-id: 11802625564513da9a0144904be0d34dbae0f601
2021-04-14 00:06:40 -07:00
9f89b53d7d Synchronize RRef.to_here() CUDA Streams properly (#54932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54932

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27684022

Pulled By: pbelevich

fbshipit-source-id: 2bae51ab6649258d0219ca4e9dbbf45ac6a76c28
2021-04-13 23:24:38 -07:00
c96b5b2a20 [quant][graphmode][fx][fix] Fix fp16 reference patterns for linear (#55727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55727

The number of dequantize ops for the fp16 reference pattern was incorrect before; this
PR fixes the problem.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27713390

fbshipit-source-id: 72b8d4cda0bdcea74abe27a76f918d1b47819b01
2021-04-13 23:19:45 -07:00
2236f43da0 [FX] Put tensor metadata into a NamedTuple in ShapeProp (#55930)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55930

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27741730

Pulled By: jamesr66a

fbshipit-source-id: 0a0a1b94beed6c482add9e9551f316f3b4220ab2
2021-04-13 22:21:50 -07:00
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
c7aa1026a8 Revert D27548433: [c10d] Log API usage of monitored barrier
Test Plan: revert-hammer

Differential Revision:
D27548433 (09231b5db1)

Original commit changeset: 7520ad0948b8

fbshipit-source-id: aa946d8d27472d19c0fe855952ec58d1266ee35a
2021-04-13 21:31:49 -07:00
3646fa3621 Fix tensorpipe test (#55979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55979

Fix name used for this test
ghstack-source-id: 126465107

Test Plan: CI

Reviewed By: pbelevich, H-Huang

Differential Revision: D27755320

fbshipit-source-id: fead989041d703d473b6847ee0cee1deebe12571
2021-04-13 19:10:03 -07:00
09231b5db1 [c10d] Log API usage of monitored barrier (#55265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55265

Logs API usage of monitored barrier for better tracking and use case
understanding.
ghstack-source-id: 126413087

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27548433

fbshipit-source-id: 7520ad0948b8dc9d44fa3118d5ea953d52f9f1c5
2021-04-13 19:02:52 -07:00
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:

Nonzero ranks call send()
Nonzero ranks call recv()

Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
Once all ranks are acknowledged as healthy:
Rank 0 calls send() to all nonzero ranks to unblock them

Modified unittests to ensure the all-or-nothing failure behavior
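
From the Python side, usage looks roughly like this (a sketch assuming an initialized gloo process group and the torch.distributed.monitored_barrier binding):

```py
from datetime import timedelta

import torch.distributed as dist

# With this change, if any rank fails to reach the barrier within the
# timeout, every participating rank raises instead of only rank 0.
dist.monitored_barrier(timeout=timedelta(seconds=30))
```
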
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
86368700e8 [PyTorch] Change MaybeOwned tests to use intrusive_ptr and Tensor (#55684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55684

Upcoming changes to `MaybeOwned<T>` will require that T is
one of these two types and will have custom code for both.

This diff updates the tests to continue to build under these new
requirements; it is being sent separately to demonstrate that the
tests continue to work on the current implementation.
ghstack-source-id: 126405918

Test Plan: CI will run the rewritten tests.

Reviewed By: bhosmer

Differential Revision: D27630289

fbshipit-source-id: e38097d9ca04f3337cfa543ebcc8fb5d6916fcf3
2021-04-13 18:53:43 -07:00
cf7c5dcae3 [PyTorch] Avoid double indirection in MaybeOwned's borrowed state (#55685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55685

This diff introduces a traits class that tells `MaybeOwned` how
to borrow a specific type. While it is still capable of handling a
generic `T` by storing `const T*` and how to do so is shown in a
comment, it is not committed in live code because it is not needed.

Instead, we have specific traits implementations for
`c10::intrusive_ptr<T>` and `Tensor` that implement the borrowed state
as just a plain old `c10::intrusive_ptr<T>` or `Tensor` (respectively)
that we manipulate to avoid reference counting operations. We do this
entirely with public API to `c10::intrusive_ptr<T>` and could do
likewise with `Tensor`, but (as comments in the code explain) adding a
private constructor to `MaybeOwnedTraits<Tensor>` allowed additional
performance optimization.

This representation of `MaybeOwned` seems to be more efficient than
the generic `T-or-pointer-to-const-T` representation. Intuitively, we
avoid a double indirection at minimal cost vs the previous
implementation. It *also* seems to be more efficient than the pointer
tagging representation I sent out as #55555; apparently, having the
extra word for a flag is much cheaper than the masking operations for
pointer tagging and the same double indirection as the generic
representation.

In particular, this seems to have the same *effect* as the
`TensorHandle` idea we've discussed internally (a hypothetical class
like `Tensor` that wraps a raw `TensorImpl*` and shares the generated
methods of `Tensor` so that everything still works), but you have to
be explicit about borrowing and use pointer syntax to get the
effect. Unlike `TensorHandle`, you can use it as internal state in a
class and "upgrade" from a borrow to an owned `Tensor` derived from
your original borrow if necessary.

Note that this is just a representational change and it still has the
same semantics: you need to keep the T you borrowed from around!
ghstack-source-id: 126405920

Test Plan:
Previous diff changes the MaybeOwned tests to cover
both `intrusive_ptr` and `Tensor`, which we need in order to ensure
that our trait implementations are correct.

Further diffs in this stack will use this type to hold operand tensors
in `TensorIteratorBase` to allow borrowing at relatively small cost
(very roughly, a 6% win in the successful borrowing case for our
add-in-place benchmark at the cost of a 2.5% regression in the
legacy non-borrowing case, and we know that we will be able to borrow
in structured kernels and probably most unstructured operands as
well).

Reviewed By: ezyang

Differential Revision: D27679723

fbshipit-source-id: 57104f4edabc545ff83657233fde9eb40b969826
2021-04-13 18:48:41 -07:00
d398a705c6 Clang-format batchnorm.py and distributed.py (#55971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55971

Per title
ghstack-source-id: 126454339

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752315

fbshipit-source-id: 64ca5dea7b2689037594a6bd9a75641a9bb817c1
2021-04-13 18:40:23 -07:00
132f5c1f36 Clang-format ProcessGroupMPI.cpp (#55969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55969

Per title
ghstack-source-id: 126453717

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752173

fbshipit-source-id: e5069b91d699b9d02b12e5dab5e62007dbcee9f0
2021-04-13 17:11:19 -07:00
8bdea14cd3 [FX] Add memory_format to shape_prop (#55815)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55815

Test Plan: Imported from OSS

Reviewed By: pbelevich, ansley

Differential Revision: D27716342

Pulled By: jamesr66a

fbshipit-source-id: f7c22dd77a4f48650700fc4c3c44b1c59196282e
2021-04-13 16:37:54 -07:00
2bf26965e7 Revert D27710107: [pytorch][PR] Update a batch_first arg for transformers like GRU and LSTM.
Test Plan: revert-hammer

Differential Revision:
D27710107 (2237754b13)

Original commit changeset: c4363a460454

fbshipit-source-id: 5387b5deae6db43f17a7d5e0408a7d24e463d73a
2021-04-13 16:22:23 -07:00
a61d91e803 Port reflection_pad1d to structured kernel (#55531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55531

Following https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md

Test Plan: unittests

Reviewed By: ezyang

Differential Revision: D27628059

fbshipit-source-id: 885a10b766db39f8f8df4dcaaf0769fcf2ff9751
2021-04-13 15:33:29 -07:00
de5e3b5eb0 Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it just creates a subgroup communicator, mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all. Cannot dig further into `MPI_Comm_create`, which is in the MPI codebase.

Also checked the commit history, and no commit on `ProcessGroupMPI.cpp` can be found within a few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests triggered by a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27
2021-04-13 15:28:51 -07:00
c218ac3bc0 [NCCL] Join work clean up thread before aborting communicators (#55444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55444

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators. This is because if we abort NCCL communicators first on destruction, outstanding work objects in workMetaList can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is also to reduce log spam since we added some logging when an exception is set on WorkNCCL, but this unexpectedly resulted in a lot of false-positive errors being logged even after pg shutdown. An example is below:

I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3

With this change, we no longer see these false-positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
2021-04-13 15:25:22 -07:00
8596ac186b deterministic code path for gather_backward for dim = 1 (#55573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55573

Provide a deterministic code path for gather_backward when dim = 1.
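
A minimal sketch of the path being exercised (assuming a CUDA build; under deterministic mode the dim=1 backward now takes the deterministic route):

```py
import torch

torch.use_deterministic_algorithms(True)
src = torch.randn(4, 6, device='cuda', requires_grad=True)
idx = torch.randint(0, 6, (4, 6), device='cuda')
out = torch.gather(src, 1, idx)   # dim = 1
out.sum().backward()              # backward takes the deterministic path
```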

Test Plan:
buck test //caffe2/test:torch -- test_gather_backward
    ✓ Pass: caffe2/test:torch - test_gather_backward_one_dim (test_torch.TestTorch) (1.099)
    ✓ Pass: caffe2/test:torch - test_gather_backward_deterministic_path (test_torch.TestTorch) (1.166)

test on GPU

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward_deterministic

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1407375070421778
    ✓ ListingSuccess: caffe2/test:torch_cuda - main (7.484)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (26.145)
    ✓ Pass: caffe2/test:torch_cuda - main (26.145)
Summary
  Pass: 2
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D27632008

fbshipit-source-id: ec27475332a3b36360cc014193256c21cba77d63
2021-04-13 15:18:00 -07:00
2237754b13 Update a batch_first arg for transformers like GRU and LSTM. (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: pardon my inexperience since this is my first PR here; I did not realize the doc should not have any trailing white spaces, or that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: ngimel

Differential Revision: D27710107

Pulled By: jbschlosser

fbshipit-source-id: c4363a4604548c0d84628c4997dd23d6b3afb4d9
2021-04-13 14:54:50 -07:00
b98f011cd4 cmake: Enable (s)ccache for nccl builds (#55814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55814

I don't really know if the original issue is resolved but let's just
check and see if this passes CI so that we can potentially get some
speed up on our builds

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27715734

Pulled By: seemethere

fbshipit-source-id: a8f90774dfd25b0abf8e57283fe3591a8d8f3c4b
2021-04-13 14:49:25 -07:00
c47cc30bf5 Skip testing torch.float16 in test_isnan (#55906)
Summary:
See https://github.com/pytorch/pytorch/issues/55905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55906

Reviewed By: walterddr

Differential Revision: D27737356

Pulled By: malfet

fbshipit-source-id: 39571cfe6f078af8bb7387ed459a5d0f2410bad1
2021-04-13 14:44:43 -07:00
bf8b790ba7 .github: Bump disk size for auto-scaled workers (#55955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55955

Was experiencing build failures related to disk size issues; let's bump
to 150 to see if that resolves them.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27747958

Pulled By: seemethere

fbshipit-source-id: 9222475d2298cf942479650567616489387bf552
2021-04-13 14:40:35 -07:00
5a45b1b2f2 Add nondeterministic alert for index_put_ when accumulate=False (#55827)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55516
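
A minimal sketch of the call being flagged (assuming a CUDA build):

```py
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(5, device='cuda')
idx = (torch.tensor([0, 0], device='cuda'),)   # duplicate indices
vals = torch.tensor([1., 2.], device='cuda')
# accumulate=False is nondeterministic on CUDA, so this now raises a
# RuntimeError under deterministic mode.
x.index_put_(idx, vals, accumulate=False)
```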

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55827

Reviewed By: yinghai

Differential Revision: D27725794

Pulled By: ngimel

fbshipit-source-id: f6b5b3e635170524fdb5a0141ebd27925c37e8d9
2021-04-13 14:28:16 -07:00
5ffc4e3b0f refactor prepare_for_backward (#54977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54977

Put part of the code in prepare_for_backward into functions, so that those functions can be used in static graph training and delayed all-reduce later on.
ghstack-source-id: 126366714

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27439195

fbshipit-source-id: 8899eda621260232d774cb145f9c6d683c47e188
2021-04-13 14:25:29 -07:00
6dd1978d4b print average duration for caffe2 benchmark
Summary: print average duration for caffe2 benchmark

Test Plan:
buck run //xplat/caffe2:caffe2_benchmarkAppleMac -- --init_net ~/track_init_net.pb --net ~/track_predict_net.pb --warmup 10 --input 'data' --input_dims '1,4,128,256' --input_type float --iter 20
Using additional configuration options from .buckconfig.local
Building: finished in 0.6 sec (100%) 247/2137 jobs, 0 updated
  Total time: 0.6 sec
Average Duration: 18111 us

Reviewed By: larryliu0820

Differential Revision: D27745416

fbshipit-source-id: a5d20b8ef0ba4a9547d396738d5ddd1aca57684d
2021-04-13 14:19:34 -07:00
d1fac54f13 [Pytorch] Only print gradient of a tensor if it requires_grad (#54446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54446

Several people have now run into an issue with printing tensors using the lite interpreter in xplat. https://fb.workplace.com/groups/2148543255442743/?multi_permalinks=2620088118288252&notif_id=1616432955971055&notif_t=work_group_activity&ref=notif This is due to the fallback of _fw_grad also requiring autograd to exist. Introduce a new function that can be used to guard against calling _fw_grad if autograd isn't built.

ghstack-source-id: 126334787

Test Plan: CI; tested the guard by printing a tensor in a situation where autograd isn't built.

Reviewed By: albanD

Differential Revision: D27239164

fbshipit-source-id: 4b98b4b7770b153bc2c13c95f7d256425e09ef39
2021-04-13 13:47:41 -07:00
aceceb3d5c Reland #50999 (Added pow() on CPU for float16 & bfloat16) (#55280)
Summary:
#### Reason for relanding
Line 1607 of `torch/testing/_internal/common_methods_invocations.py` of https://github.com/pytorch/pytorch/issues/50999  had `dtype` instead of `dtype=torch.bool`, so 4 of the 9 sample inputs for `bool` had incorrect dtype. This bug was caught by https://github.com/pytorch/pytorch/issues/54949.

1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types (see the sketch after this list).
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it).  It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `tan` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
7. Removed redundant `dtypesIfCPU` and `dtypesIfCUDA` from `OpInfo`s where they are equal to `dtypes`.
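
A minimal check of item 1 (assuming a build that includes this change):

```py
import torch

t = torch.tensor([0., 1., 2., 3.], dtype=torch.bfloat16)  # CPU tensor
print(t.pow(2))          # pow(Tensor, Scalar)
print(torch.pow(t, t))   # pow(Tensor, Tensor)
```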

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55280

Reviewed By: jbschlosser

Differential Revision: D27591772

Pulled By: heitorschueroff

fbshipit-source-id: c7420811b32595bb3353149a61e54a73f2eb352b
2021-04-13 13:23:29 -07:00
de53de39d7 [PyTorch] Mark borrowed case as C10_LIKELY in MaybeOwned (#55553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55553

If this case isn't likely, user code would have been better off with a regular T.
ghstack-source-id: 126369326

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D27630287

fbshipit-source-id: b074af3a65c61dfe9e0246df046cc8c49e8efb03
2021-04-13 13:23:27 -07:00
ea446ed600 [PyTorch] Allow copy operations on MaybeOwned (#55419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55419

Turns out it's useful to have these. I chose to implement them in the straightforward safe way, rather than always borrowing.
ghstack-source-id: 126369328

Test Plan: Added more automated tests.

Reviewed By: hlu1

Differential Revision: D27545805

fbshipit-source-id: 84bb4458b86672ad340cc1f0aa18b80ca7ee13f1
2021-04-13 13:21:45 -07:00
bbdb37b93d [JIT] Use type cache in erasing shape information (#55828)
Summary:
`unshapedType` can be very slow on a graph with many modules and recursively contained classes, because for each Type you have to recursively descend and map over it. Speed it up with a type cache.
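
The idea, sketched in Python terms (a toy illustration of memoizing a recursive mapping; the real change lives in the C++ type mapper):

```py
def erase_shapes(ty, cache=None):
    # Strip "shape" entries from a nested type description, memoizing on
    # object identity so shared subtrees are only processed once.
    if cache is None:
        cache = {}
    key = id(ty)
    if key in cache:
        return cache[key]
    if isinstance(ty, dict):   # e.g. {"kind": "Tensor", "shape": [2, 3]}
        out = {k: erase_shapes(v, cache) for k, v in ty.items() if k != "shape"}
    elif isinstance(ty, (list, tuple)):
        out = type(ty)(erase_shapes(e, cache) for e in ty)
    else:
        out = ty
    cache[key] = out
    return out
```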

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55828

Reviewed By: ngimel

Differential Revision: D27717995

Pulled By: eellison

fbshipit-source-id: f1d502bef0356e78100c27bf00f6caf08a75d68c
2021-04-13 12:35:09 -07:00
8f953ef544 Increase token count threshold for calling thrust sort in embedding backward (#49913)
Summary:
Increases the token count threshold to expand the span of the custom CUDA kernel implementation of embedding backward. The speedup of the embedding backward implementation on DGXV100-128GB and DGXA100-640GB is given below. I picked 6144 as the new threshold since anything below it mostly results in faster execution with the custom CUDA kernel. One important advantage of the custom CUDA kernel is that it allows CUDA graph capture, whereas the thrust code path results in CPU syncs, prohibiting graph capture (times below are collected without graph capture). For reference, the MLPerf BERT benchmark uses num_features=1024.

  | num_tokens | num_features | thrust path(ms) | custom kernel(ms) | speedup
-- | -- | -- | -- | -- | --
DGXV100 |   |   |   |   |
  | 1024 | 64 | 0.36 | 0.18 | 2.04
  | 1024 | 256 | 0.43 | 0.30 | 1.46
  | 1024 | 1024 | 0.89 | 0.74 | 1.20
  | 1024 | 2048 | 1.50 | 1.33 | 1.12
  | 1024 | 4096 | 2.71 | 2.50 | 1.08
  | 1024 | 8192 | 5.07 | 4.89 | 1.04
  | 2048 | 64 | 0.33 | 0.23 | 1.46
  | 2048 | 256 | 0.41 | 0.33 | 1.26
  | 2048 | 1024 | 0.92 | 0.79 | 1.17
  | 2048 | 2048 | 1.54 | 1.38 | 1.11
  | 2048 | 4096 | 2.80 | 2.54 | 1.10
  | 2048 | 8192 | 5.29 | 4.98 | 1.06
  | 4096 | 64 | 0.46 | 0.32 | 1.43
  | 4096 | 256 | 0.50 | 0.47 | 1.07
  | 4096 | 1024 | 1.02 | 0.88 | 1.15
  | 4096 | 2048 | 1.70 | 1.59 | 1.07
  | 4096 | 4096 | 3.06 | 2.68 | 1.14
  | 4096 | 8192 | 5.79 | 5.28 | 1.10
  | 5120 | 64 | 0.42 | 0.33 | 1.28
  | 5120 | 256 | 0.51 | 0.46 | 1.11
  | 5120 | 1024 | 1.06 | 0.93 | 1.14
  | 5120 | 2048 | 1.77 | 1.55 | 1.14
  | 5120 | 4096 | 3.18 | 2.76 | 1.15
  | 5120 | 8192 | 6.24 | 5.46 | 1.14
  | 6144 | 64 | 0.42 | 0.36 | 1.17
  | 6144 | 256 | 0.52 | 0.50 | 1.05
  | 6144 | 1024 | 1.10 | 0.98 | 1.13
  | 6144 | 2048 | 1.85 | 1.61 | 1.15
  | 6144 | 4096 | 3.34 | 2.84 | 1.17
  | 6144 | 8192 | 6.19 | 5.69 | 1.09
  | 8192 | 64 | 0.42 | 0.48 | 0.88
  | 8192 | 256 | 0.51 | 0.65 | 0.78
  | 8192 | 1024 | 1.14 | 1.12 | 1.01
  | 8192 | 2048 | 1.92 | 1.77 | 1.09
  | 8192 | 4096 | 3.49 | 3.03 | 1.15
  | 8192 | 8192 | 6.59 | 5.96 | 1.11
  | 16384 | 64 | 0.46 | 0.82 | 0.56
  | 16384 | 256 | 0.59 | 0.99 | 0.60
  | 16384 | 1024 | 1.35 | 1.54 | 0.88
  | 16384 | 2048 | 2.31 | 2.24 | 1.03
  | 16384 | 4096 | 4.20 | 3.63 | 1.16
  | 16384 | 8192 | 8.26 | 7.51 | 1.10
  | 32768 | 64 | 0.47 | 1.48 | 0.32
  | 32768 | 256 | 0.68 | 1.70 | 0.40
  | 32768 | 1024 | 1.63 | 2.35 | 0.69
  | 32768 | 2048 | 2.87 | 3.19 | 0.90
  | 32768 | 4096 | 5.26 | 4.86 | 1.08
  | 32768 | 8192 | 10.17 | 9.92 | 1.03
  | 65536 | 64 | 0.50 | 2.81 | 0.18
  | 65536 | 256 | 0.78 | 3.12 | 0.25
  | 65536 | 1024 | 2.02 | 3.99 | 0.51
  | 65536 | 2048 | 3.58 | 5.06 | 0.71
  | 65536 | 4096 | 6.68 | 7.40 | 0.90
  | 65536 | 8192 | 13.08 | 15.35 | 0.85
DGXA100 |   |   |   |   |
  | 1024 | 64 | 0.28 | 0.09 | 3.05
  | 1024 | 256 | 0.30 | 0.17 | 1.71
  | 1024 | 1024 | 0.51 | 0.39 | 1.31
  | 1024 | 2048 | 0.81 | 0.68 | 1.20
  | 1024 | 4096 | 1.43 | 1.24 | 1.16
  | 1024 | 8192 | 2.63 | 2.42 | 1.09
  | 2048 | 64 | 0.25 | 0.12 | 2.15
  | 2048 | 256 | 0.29 | 0.22 | 1.36
  | 2048 | 1024 | 0.53 | 0.44 | 1.20
  | 2048 | 2048 | 0.86 | 0.73 | 1.18
  | 2048 | 4096 | 1.51 | 1.30 | 1.16
  | 2048 | 8192 | 2.81 | 2.55 | 1.10
  | 4096 | 64 | 0.31 | 0.20 | 1.57
  | 4096 | 256 | 0.35 | 0.33 | 1.08
  | 4096 | 1024 | 0.63 | 0.57 | 1.10
  | 4096 | 2048 | 1.08 | 0.86 | 1.26
  | 4096 | 4096 | 2.11 | 1.44 | 1.46
  | 4096 | 8192 | 3.33 | 2.81 | 1.19
  | 5120 | 64 | 0.36 | 0.22 | 1.63
  | 5120 | 256 | 0.37 | 0.37 | 0.98
  | 5120 | 1024 | 0.66 | 0.62 | 1.07
  | 5120 | 2048 | 1.05 | 0.92 | 1.15
  | 5120 | 4096 | 1.83 | 1.51 | 1.21
  | 5120 | 8192 | 3.35 | 2.94 | 1.14
  | 6144 | 64 | 0.29 | 0.25 | 1.18
  | 6144 | 256 | 0.37 | 0.43 | 0.86
  | 6144 | 1024 | 0.70 | 0.68 | 1.03
  | 6144 | 2048 | 1.08 | 0.98 | 1.11
  | 6144 | 4096 | 1.89 | 1.57 | 1.20
  | 6144 | 8192 | 3.49 | 3.07 | 1.14
  | 8192 | 64 | 0.29 | 0.31 | 0.95
  | 8192 | 256 | 0.37 | 0.52 | 0.70
  | 8192 | 1024 | 0.71 | 0.79 | 0.90
  | 8192 | 2048 | 1.16 | 1.10 | 1.06
  | 8192 | 4096 | 2.04 | 1.70 | 1.20
  | 8192 | 8192 | 3.86 | 3.32 | 1.16
  | 16384 | 64 | 0.31 | 0.55 | 0.56
  | 16384 | 256 | 0.42 | 0.93 | 0.45
  | 16384 | 1024 | 0.87 | 1.24 | 0.70
  | 16384 | 2048 | 1.46 | 1.57 | 0.93
  | 16384 | 4096 | 2.60 | 2.23 | 1.17
  | 16384 | 8192 | 5.15 | 4.69 | 1.10
  | 32768 | 64 | 0.33 | 1.03 | 0.32
  | 32768 | 256 | 0.49 | 1.78 | 0.28
  | 32768 | 1024 | 1.11 | 2.18 | 0.51
  | 32768 | 2048 | 1.90 | 2.54 | 0.75
  | 32768 | 4096 | 3.45 | 3.31 | 1.04
  | 32768 | 8192 | 6.46 | 6.43 | 1.00
  | 65536 | 64 | 0.36 | 2.19 | 0.16
  | 65536 | 256 | 0.56 | 3.41 | 0.17
  | 65536 | 1024 | 1.39 | 4.01 | 0.35
  | 65536 | 2048 | 2.48 | 4.45 | 0.56
  | 65536 | 4096 | 4.50 | 5.44 | 0.83
  | 65536 | 8192 | 8.49 | 10.55 | 0.80

Here is the script used to generate the times (30522 is used in the BERT MLPerf benchmark as the vocabulary size, hence it is used in this example):

```
import torch
import torch.nn as nn
import time

vocabulary_size = 30522
for num_tokens in [512,1024,2048,4096,5120,6144,8192,16384,32768,65536]:
    for hidden_dim in [64,256,1024,2048,4096,8192]:
        fprop_time_avg = 0.0
        bprop_time_avg = 0.0
        emb = nn.Embedding(vocabulary_size, hidden_dim).cuda()
        for trial in range(0,10):
            inds = torch.round(torch.rand(num_tokens) * (vocabulary_size-1)).to(dtype=torch.int64).cuda()
            y = emb(inds)
            dy = torch.randn_like(y)
            torch.cuda.synchronize()
            t_start_bwd = time.time()
            y.backward(dy)
            torch.cuda.synchronize()
            t_stop_bwd = time.time()
            bprop_time_avg += t_stop_bwd - t_start_bwd
        bprop_time_avg /= 10.0
        print("bprop num_tokens %5d, num_features %5d, time %2.6f" %(num_tokens, hidden_dim, bprop_time_avg))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49913

Reviewed By: jbschlosser

Differential Revision: D27727738

Pulled By: ngimel

fbshipit-source-id: fa497b6745b6d20bb11352579ed9eb5b66a8b1e2
2021-04-13 12:16:28 -07:00
72b8864b34 [caffe2] constexpr const
Summary:
To fix warning:

```
xplat\\caffe2\\torch\\csrc\\jit\\runtime\\instruction.cpp(59,20): warning: ISO C++11 does not allow conversion from string literal to 'char *const' [-Wwritable-strings]
```

which can be seen in Windows CI logs.

Test Plan: Eyes; did not run it.

Reviewed By: iseeyuan

Differential Revision: D27717057

fbshipit-source-id: f365405663b5adfbc0c87dc26a9921b6d03f1f5a
2021-04-13 12:11:12 -07:00
7ab654afd7 [TensorExpr] Rename Tensor::call to Tensor::load to be consistent with Buf and Placeholder. (#55826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55826

It's a mechanical change.

Differential Revision: D27717777

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: fbc1bb99602250c706cf2c8c2684119c323e4d51
2021-04-13 12:08:53 -07:00
1263448cb2 [TensorExpr] Remove mask field from Load and Store classes. (#55825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825

The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.

Differential Revision: D27717776

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
2021-04-13 12:08:51 -07:00
754b0d073a [TensorExpr] Unbreak benchmarks. (#55824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55824

Seemingly, some of my recent changes (namely, removing the dep-tracker) broke
the TE benchmarks. This PR fixes them.

Differential Revision: D27717778

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 48584bc0cfd4879a3e44cb45ee1f0d5c91b5afbc
2021-04-13 12:08:50 -07:00
b01a15d3d3 [TensorExpr] Redesign Rfactor loopnest transformation. (#55324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324

With this change, `rfactor` only affects the passed loop and its body,
never touching anything outside (that was the root cause of a bug in the
previous implementation). Also, we don't have an `insertion_point`
parameter anymore - its meaning was vague, and its effect
should've been achievable with other transformations anyway.

The new `rfactor` semantics is as follows:

```
Requirements:
 * S is the reduction store
 * S is the only statement in the innermost loop
 * There are at least two reduction arguments in S
 * OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
 used in the store and all other reduction variables are index variables of
 children loops of OUTER_REDUCTION_FOR
 * OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
 corresponding to the other reduction variables and the store, nested into
 each other

What it does:
  * Introduce a new buffer with an extra dimension of a size equal to the
  span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
  RFAC_BUF_PTR)
  * Insert an initialization store for the new buffer in
  OUTER_REDUCTION_FOR before its nested loop
  * Replace the reduction store to the original buffer with the reduction
  store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
  from reduction arguments
  * Insert a final reduction store over the extra dimension of the new
  buffer to the original buffer
  * Returns TRUE if the transformation succeeded and FALSE otherwise

Example:
Original IR:
S1: for i        # normal axis
S2:   X[i] = 0
S3:   for j      # reduction axis
S4:     for k    # reduction axis
S5:       X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})

After RFACTOR(S5, S3)
S1: for i               # normal axis
S2:   X[i] = 0
S3:   for j             # reduction axis for X, normal axis for X_rfac
        X_rfac[i,j] = 0
S4:     for k           # reduction axis
          X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
        X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```

Differential Revision: D27694960

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
2021-04-13 12:08:48 -07:00
57f795c27b [TensorExpr] Remove unused LoopNest::hasLoopBodyFor method. (#55323)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55323

Differential Revision: D27694961

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, gmagogsfm

Pulled By: ZolotukhinM

fbshipit-source-id: 367ae212054c3516409a568facc19a19671df488
2021-04-13 12:07:31 -07:00
f61556a7ce Use autosummary on torch.fft, torch.linalg (#55748)
Summary:
Related to https://github.com/pytorch/pytorch/issues/52256

Use autosummary instead of autofunction to create subpages for `torch.fft` and `torch.linalg` functions.

zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55748

Reviewed By: jbschlosser

Differential Revision: D27739282

Pulled By: heitorschueroff

fbshipit-source-id: 37aa06cb8959721894ffadc15ae8c3b83481a319
2021-04-13 12:02:36 -07:00
657b66e87d [NCCL] Log when barrier guesses device to use (#54991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54991

The actual proposed fix is in
https://github.com/pytorch/pytorch/pull/53934; in the meantime, it is useful
to include this LOG when barrier does not know which devices to use, and to
suggest the workaround of passing device_ids into barrier().
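For reference, the suggested workaround looks roughly like this (a minimal sketch; it assumes one GPU per rank and an already-initialized NCCL process group):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already run, one GPU per rank.
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
dist.barrier(device_ids=[local_rank])  # tell NCCL which device to use, no guessing
```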
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
2021-04-13 11:53:55 -07:00
0517222dc8 [package] Correct usage of miniz API in PyTorchStreamReader (#55725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55725

We were previously checking m_last_error on the miniz struct directly,
which fails to preserve internal invariants and can leave the reader
broken in specific situations (reading a non-existent file).

Using the provided error checking API fixes this.

Differential Revision: D27693105

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: suo

fbshipit-source-id: 20c520bb1d590fb75751bca1e970df4f2b7eb043
2021-04-13 11:50:08 -07:00
c3a49cb30c Better types in fbcode/caffe2/torch/jit/_script.py (#55856)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55856

Test Plan: Sandcastle

Reviewed By: SplitInfinity

Differential Revision: D27715495

fbshipit-source-id: 9804e2d432fda302117f05a0d21cbb7f0dd3ae38
2021-04-13 11:46:23 -07:00
85b97e449d [RFC]fix test_ddp_logging_data_cpu with tsan (#54465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54465

It is reported that there is a data race issue when the test runs with tsan. The root cause is the 'model.frc1.double()' call. This is not caused by DistributedDataParallel() interacting with 'model.frc1.double()': if we remove DistributedDataParallel() and just call 'model.frc1.double(); model.frc2.double();', tsan complains about the same data race.

I'm not sure how to do the data type cast in this test without tsan complaining, so this removes that line of code and the mixed data type logging check.

Please let me know if you have a better suggestion on how to do the data type cast correctly.

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D27249821

fbshipit-source-id: 0368157e11cbe7d15828dccca78271d89d502ec4
2021-04-13 11:20:43 -07:00
2eebd9fdce fix ddp logging flaky test (#55414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55414

close #55384

backward_compute_comm_overlap_time may not be larger than 1; we should instead check that backward_compute_time and backward_comm_time are larger than 1.
ghstack-source-id: 126360517

Test Plan: unit tests

Reviewed By: H-Huang, SciPioneer

Differential Revision: D27606132

fbshipit-source-id: 418fe9f958287779d637856e355cc36cab384c68
2021-04-13 11:14:04 -07:00
800fa5f369 [ROCM] Enable more dtypes in common_method_invocations (#55808)
Summary:
The PR enables additional dtypes in common_method_invocations for ROCm.
This turns on around 4k new tests for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55808

Reviewed By: jbschlosser

Differential Revision: D27729885

Pulled By: ngimel

fbshipit-source-id: 061b88901bbe7128d51e49803f64295037b09b8d
2021-04-13 11:10:43 -07:00
18662d4321 [Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning (#55809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55809

[Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: bwasti

Differential Revision: D27411416

fbshipit-source-id: 7dae7c2586ce3b4ebacf6169017140166c30e99c
2021-04-13 11:04:47 -07:00
6269efde91 Add stricter typing to caffe2/torch/distributed/elastic/multiprocessing/errors/__init__.py (#55848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55848

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D27714781

fbshipit-source-id: cff651e04c1e8363a249c7de9de01c33db47f003
2021-04-13 10:47:08 -07:00
70a09d97d1 Use nodes instead of node
Summary: `networkx 2.4+` renamed the `node` attribute to `nodes` on the graph object. This caused failures in `caffe2`'s `topological_sort_traversal_longest_path` function, which uses the networkx library for topological sorting.
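A minimal sketch of the rename (the graph and attribute here are hypothetical; only the attribute access changed):

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("op1", weight=3)

print(G.nodes["op1"])   # networkx 2.4+: {'weight': 3}
# print(G.node["op1"])  # pre-2.4 spelling; raises AttributeError on 2.4+
```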

Differential Revision: D27718857

fbshipit-source-id: 812fbb613946565d089cc84a20f3cdf7df046e19
2021-04-13 10:45:35 -07:00
2bb58a06ef move logic to skip a redispatch directly inside of resize_output (#55162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55162

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27506253

Pulled By: bdhirsh

fbshipit-source-id: 02fddb1926de49cd8c915c549eb99d92e58e75e1
2021-04-13 10:25:02 -07:00
87fcf3072e Fix overflow issue in quantized instance_norm/layer_norm/group_norm (#54872)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54837
`hsum_sq` has an overflow issue when the input image size is large, e.g. (H,W,D) = (224,224,160). `hsum_sq` is used in the quantized instance_norm/layer_norm/group_norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54872

Reviewed By: dskhudia

Differential Revision: D27690767

Pulled By: vkuzo

fbshipit-source-id: 9b9ac3e76220d42a3b48f8bf4e20823f775789a2
2021-04-13 10:21:38 -07:00
8c8f8829f0 Factor out numerical logic (#54479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54479

This change is similar to #54049 in that it helps us factor out some code that can be used in both fast and slow versions of gradcheck.
 - `compute_gradient` and `compute_numerical_jacobian_cols` have  fewer responsibilities:
   - compute_numerical_jacobian_cols essentially only handles the complexity of complex derivatives
   - compute_gradient handles only finite differencing (and doesn't worry about different layouts and indexing into the input tensor)
  - we have two stages again where we first compute the columns separately, then combine them

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27728727

Pulled By: soulitzer

fbshipit-source-id: fad3d5c1a91882621039beae3d0ecf633c19c28c
2021-04-13 10:08:09 -07:00
381b3d8f4b Refactor get numerical jacobian to calculate wrt all outputs at once (#54378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54378

### For release notes
`torch.autograd.gradcheck.get_numerical_jacobian` (not part of the public api) is being deprecated.

In the future, user code relying on this function will break because, among other changes, `get_numerical_jacobian` now returns `List[Tuple[torch.Tensor]]` instead of `List[torch.Tensor]`.

(more details if necessary)
For a `fn` that takes M inputs and returns N outputs, we now return a list of M N-tuples of jacobians, where `output[i][j]` represents the numerical jacobian w.r.t. the ith input and the jth output. Previously `get_numerical_jacobian` returned a list of tensors where each tensor represented the jacobian w.r.t. each of the M inputs and a specific output. Finally, the function passed in as the parameter `fn` should expect to handle individual parameters, whereas previously `fn` was required to expect its parameters wrapped in a tuple.

 --- end --

This PR addresses the comment here https://github.com/pytorch/pytorch/pull/53857#discussion_r595429639, to reduce the run-time of old gradcheck's get numerical jacobian by a factor of num_outputs. However, because very few ops actually return multiple outputs, there is not too much real speed up here.

The main benefit of doing this change as part of the refactor is that it helps us isolate the possible bugs that are specific to switching `get numerical jacobian` to run in a per output way vs all outputs at once. Much of the logic implemented here will be the same for the fast gradcheck case, so knowing for certain that everything should pass after this stage will make the next step much simpler.

The get_numerical_jacobian api is also being used in common_nn. So we update the callsite there as well.
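As a sketch of the new return shape only (M, N, and the tensor sizes below are made up, and this does not call the deprecated function itself):

```python
from typing import List, Tuple

import torch

# For a fn with M inputs and N outputs, the new return value is a list of
# M N-tuples, where jacobians[i][j] is the numerical jacobian w.r.t. the
# i-th input and the j-th output.
M, N = 2, 3
jacobians: List[Tuple[torch.Tensor, ...]] = [
    tuple(torch.zeros(4, 5) for _ in range(N)) for _ in range(M)
]
print(jacobians[1][2].shape)  # jacobian w.r.t. input 1 and output 2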

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27728720

Pulled By: soulitzer

fbshipit-source-id: ee0f90b4f26ddc5fdbe949c4965eaa91c9ed0bb8
2021-04-13 10:06:20 -07:00
fc6985eceb [package] Minor fixes to PackageExporter docstrings (#55817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55817

**Summary**
This commit makes minor edits to the docstrings of `PackageExporter` so
that they render properly in the `torch.package` API reference.

**Test Plan**
Continuous integration (especially the docs tests).

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27726817

Pulled By: SplitInfinity

fbshipit-source-id: b81276d7278f586fceded83d23cb4d0532f7c629
2021-04-13 10:00:38 -07:00
6a738196af [package] Create API reference (#55812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55812

**Summary**
This commit creates a barebones API reference doc for `torch.package`.
The content is sourced from the docstrings in the source for the
`torch.package`.

**Test Plan**
Continuous integration (specifically the docs tests).

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27726816

Pulled By: SplitInfinity

fbshipit-source-id: 5e9194536f80507e337b81c5ec3b5635d7121818
2021-04-13 09:58:45 -07:00
5e625906e9 Fix lint for redundant-workflows list (#55916)
Summary:
Currently this lint is [passing](https://github.com/pytorch/pytorch/runs/2335195975) on https://github.com/pytorch/pytorch/issues/55176 when it should be failing, because it is using [`l.sort()` instead of `sorted(l)`](https://docs.python.org/3/howto/sorting.html).
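To illustrate the pitfall (a hypothetical list; this is the general pattern, not the lint's exact code):

```python
items = ["b", "a"]

# list.sort() mutates in place and returns None, so any check built on it
# ends up comparing an already-sorted list (or comparing against None):
print(items.sort())  # None
print(items)         # ['a', 'b'] -- the original order is gone

# sorted() returns a new list and leaves its argument alone, so a
# comparison can actually detect an unsorted input:
items = ["b", "a"]
print(items == sorted(items))  # False -- the lint now (correctly) fails
```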

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55916

Test Plan: Check out 0c29aa1679, start a `python3` shell, and run the steps from this lint. The final `assert` from before this PR should succeed, and the `assert` from this PR should fail. The lint should succeed on this PR's CI, though, since the list of workflows is correct on `master`.

Reviewed By: janeyx99

Differential Revision: D27739792

Pulled By: samestep

fbshipit-source-id: 068fa846569eb83b98088215d8a1b63d12560633
2021-04-13 09:47:26 -07:00
4753100a3b Un-ignore F403 in .flake8 (#55838)
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html

This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).

This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
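A small sketch of the two outcomes, using `torch.nn.functional` as a stand-in module:

```python
# Flagged by F403: the set of names pulled into scope is invisible to
# readers and to linters.
# from torch.nn.functional import *

# Preferred where possible: an explicit import list.
from torch.nn.functional import relu, softmax

# For intentional re-exports (e.g. in an __init__.py), keep the wildcard
# and silence the rule instead:
# from torch.nn.functional import *  # noqa: F403
print(relu, softmax)
```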

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838

Test Plan: CI. You can also run `flake8` locally.

Reviewed By: jbschlosser

Differential Revision: D27724232

Pulled By: samestep

fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
2021-04-13 09:24:07 -07:00
75eb026e07 migrate matrix_exp to opInfo tests (#55533)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55533

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27628966

Pulled By: bdhirsh

fbshipit-source-id: 87dd1858a1ebe22dcca9bd90b8cdca8c3d67d0e9
2021-04-13 08:32:34 -07:00
99d77c55dd Automated submodule update: tensorpipe (#55881)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: e5e974b6cd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55881

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27730207

fbshipit-source-id: 7d2901e676645f3da6e5ca8f9d8c1b55d63cc1c7
2021-04-13 08:04:54 -07:00
24f9a446c9 Fix wrong detection of depthwise conv on neon (#55794)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54136

tl;dr: depthwise conv requires that the number of output channels is 1.

The code here only handles this case; previously, all but the first output channel contained uninitialized memory. The nans from the issue were random because the torch.empty() allocation sometimes returned non-nan memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55794

Reviewed By: ngimel

Differential Revision: D27711717

Pulled By: albanD

fbshipit-source-id: 00eac3fd59db1d09fe7bab89427b105a019e7a5d
2021-04-13 07:52:11 -07:00
d7d7556f17 Move tensor implicit conversions to test_builtins.py (#55532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55532

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27729682

Pulled By: nikithamalgifb

fbshipit-source-id: d2517ee68b83e59cde87b8fb7d5bf7203f02cbc6
2021-04-13 07:13:20 -07:00
5dba4ff786 move topk to use OpInfo (#55547)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55547

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin, mruberry

Differential Revision: D27649412

Pulled By: albanD

fbshipit-source-id: e36a5bb5703681b7f7647ca30d6f4a72faf5ed0e
2021-04-13 06:21:13 -07:00
192df16a4d move logaddexp{2} to opinfo (#55535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55535

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27649410

Pulled By: albanD

fbshipit-source-id: 4453da3853e2ac8e2e625ae9bdb9f717336bb3ec
2021-04-13 06:21:12 -07:00
505f6f325f port addcdiv to opinfo (#55518)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55518

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27649411

Pulled By: albanD

fbshipit-source-id: cfb0a235d94ef62589acbeb9bf11d2ea17248484
2021-04-13 06:21:10 -07:00
9ccae89102 port addcmul to OpInfo (#55517)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55517

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27649413

Pulled By: albanD

fbshipit-source-id: e1faf25cf7f9c3636f62db1512aee78fd7c4f9b6
2021-04-13 06:19:33 -07:00
00737efdb2 [shape inference] Add shape inference func for Bucketize
Summary: ATT, to ensure the output has the same dim type as the input. We need to find a more generic way though...

Test Plan: unit test

Reviewed By: ipiszy, khabinov

Differential Revision: D27690748

fbshipit-source-id: e53832c67b8ac86973c288d2d6b76ef8e5db14b9
2021-04-13 05:59:40 -07:00
4b09756d26 [SPMD] Move a comment (#55877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55877

Address a comment in: 10bc1dae40 (r610930244)
ghstack-source-id: 126369525

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27729567

fbshipit-source-id: 5509ebfba2b741cd3532c69044227e5af0fb54fc
2021-04-13 05:53:31 -07:00
56212daf7e allow tests to run locally without setting environment variables (#55880)
Summary:
Fixes breakage caused by https://github.com/pytorch/pytorch/issues/55753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55880

Reviewed By: nikithamalgifb

Differential Revision: D27735299

Pulled By: mruberry

fbshipit-source-id: f8f927f95e4f7fe5f00448ed25d23dac3b7104a4
2021-04-13 05:29:51 -07:00
37ac271089 [AutoAccept][Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D27731676

fbshipit-source-id: 9402fa9f19b9186a2f38e56c110800254a8e8d91
2021-04-13 04:15:35 -07:00
b4cb020c0f [Gradient Compression] Make orthogonalization_epsilon configurable in PowerSGDState (#55738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55738

Per title, and use 0 as the default value.

It turns out that setting this epsilon to 0 can accelerate convergence and improve accuracy for some use cases.
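A hedged sketch of using the new knob (the other `PowerSGDState` arguments shown here are assumptions and may differ):

```python
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

# process_group=None falls back to the default group; rank 1 is just an example.
state = powerSGD.PowerSGDState(
    process_group=None,
    matrix_approximation_rank=1,
    orthogonalization_epsilon=0,  # the new default per this diff
)
```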

Test Plan:
unit tests
f264687105
f264675194

Reviewed By: shuyingsunshine21

Differential Revision: D27694971

fbshipit-source-id: b61528c6c817127974acdc4635bccf607532287f
2021-04-13 02:52:56 -07:00
4cfbb2401f [ROCM] Re-enable 3 previously faling tests in test_cuda.py (#55813)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53190
The following tests are passing in ROCM 4.1. Hence re-enabling them.
test_grad_scaling_multigpu
test_streaming_backwards_device_transfer
test_streaming_backwards_multiple_streams

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55813

Reviewed By: yinghai

Differential Revision: D27725547

Pulled By: ngimel

fbshipit-source-id: d8b3ed69fa44c2086f0666b4db0fabb30ad59439
2021-04-13 01:09:11 -07:00
5a4e5db9ad docs: fix profiler docstring (#55750)
Summary:
Description:
- change the docstrings for profiler module as per google docstring
- add link to `torch.autograd` module
- document `ProfilerAction` and `ProfilerActivity`

https://12292060-65600975-gh.circle-artifacts.com/0/docs/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55750

Reviewed By: yinghai

Differential Revision: D27725494

Pulled By: ngimel

fbshipit-source-id: 32d0a18e274a871ac712b28b61ba63eb08299a03
2021-04-13 00:23:14 -07:00
e61b4fa691 [3/n] [torch/elastic] Introduce EtcdRendezvousBackend. (#55637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55637

This diff introduces the `EtcdRendezvousBackend` type that will serve as an experimental alternative to the existing `EtcdRendezvousHandler`.

The major advantage of `EtcdRendezvousBackend` is that it delegates the bulk of the rendezvous handling logic to `DynamicRendezvousHandler` which is shared with `C10dRendezvousBackend` (see D27654492) and any other potential future rendezvous backend (e.g. Amazon S3).
ghstack-source-id: 126312209

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654498

fbshipit-source-id: f3259adfc9068b7e323b947a7d8d52fcd0b8ada1
2021-04-12 22:20:29 -07:00
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-baked rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
b3dd8cde61 [1/n] [torch/elastic] Introduce DynamicRendezvousHandler and RendezvousBackend. (#55635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55635

This diff introduces the `DynamicRendezvousHandler` type as a stub implementation and its accompanying `RendezvousBackend` interface.

`DynamicRendezvousHandler` is intended to be a backend-agnostic type that will contain the core (bulk) logic of rendezvous handling. Any backend specific operation will be delegated to a concrete subclass of `RendezvousBackend` (e.g. `C10dRendezvousBackend` - see D27654492) that is passed as a constructor argument to `DynamicRendezvousHandler`.
ghstack-source-id: 126304697

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654478

fbshipit-source-id: 9fc89a6e4cb308971c65b29a7c5af7ae191f70c5
2021-04-12 22:18:49 -07:00
da01f4398b Add InferenceMode TLS to ThreadLocalState. (#55822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55822

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27721285

Pulled By: ailzhang

fbshipit-source-id: c978927f8cb3a91de45635b8279e166a3d5652ab
2021-04-12 21:37:27 -07:00
8fc16da649 [Hackathon]Move tests for slice to test_slice.py (#55524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55524

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D27686738

Pulled By: nikithamalgifb

fbshipit-source-id: f1896d739c3a3a7ece987f6eea4072477626231b
2021-04-12 21:02:19 -07:00
5cd73df8f8 [Hackathon]Move complex tests to test_complex.py (#55514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55514

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27679881

Pulled By: nikithamalgifb

fbshipit-source-id: 8a4f4ab8f375187b72ede6feaea37ab546da6d3e
2021-04-12 20:35:36 -07:00
bbcb12614e Sort slow tests json by test name (#55862)
Summary:
This will make https://github.com/pytorch/test-infra/commits/master more readable in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55862

Reviewed By: ngimel

Differential Revision: D27728462

Pulled By: malfet

fbshipit-source-id: 2f10dd7ace49f343c4b91fc02be9d955fdbf67cc
2021-04-12 20:08:56 -07:00
a756a9e553 Add device id to ConvolutionParams (#50892)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50892

Reviewed By: mruberry

Differential Revision: D27703874

Pulled By: ngimel

fbshipit-source-id: aefa4f44ca3387c2f7aa06136e5c62d66a4ac6ab
2021-04-12 19:22:18 -07:00
5ba4cfb7bf Minor typo fixes in _script.py (#55818)
Summary:
I was reading through this file to get a better understanding of torch.jit.script and just fixed these along the way.

The only functional change is [here](https://github.com/pytorch/pytorch/compare/master...janeyx99:minor-jit-nits?expand=1#diff-c05f6af41a2d9c7ec7a2b15088259fb74763f7d1406da70f324fc6b20af47427R824). Everything else is documentation only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55818

Reviewed By: walterddr

Differential Revision: D27718853

Pulled By: janeyx99

fbshipit-source-id: a08f5451a904ef7a440be418f11ec083dd14766d
2021-04-12 18:48:26 -07:00
e7bb00cb49 Add a warning message to retire ProcessGroup RPC backend (#55616)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55616

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D27650627

Pulled By: mrshenli

fbshipit-source-id: ecf06f3b77c7e66b32822dfabf2ef88864b0e5bd
2021-04-12 18:31:57 -07:00
d805908c34 [NNC] API to reorder multiple loops (#55568)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52690

This PR adds the following APIs:

```
static bool areLoopsPerfectlyNested(const std::vector<For*>& loops);

static std::vector<For*> reorder(
      const std::vector<For*>& loops,
      const std::vector<size_t>& permutation);
```

The first API checks if the given list of loops are perfectly nested. The second API reorders the given list of loops according to the permutation specified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55568

Reviewed By: albanD

Differential Revision: D27689734

Pulled By: navahgar

fbshipit-source-id: dc1bffdbee068c3f401188035772b41847cbc7c6
2021-04-12 18:12:24 -07:00
48ddc9762b Upgrade mypy to version 0.812 (#55712)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54211

This was a little more annoying than expected, because the `exclude = ` key in `mypy.ini` is weird. I'll file an upstream issue about that.

I ignored one file, `torch/distributed/elastic/agent/server/api.py` that had ~8 errors that were hard to figure out. This can be done in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55712

Reviewed By: walterddr

Differential Revision: D27694976

Pulled By: malfet

fbshipit-source-id: 228d8be6af040343ce46595dabaca212e69ccc68
2021-04-12 18:08:28 -07:00
68e0796466 [JIT][write path] Make NoneType annotation_str emit NoneType instead of None (#54746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54746

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27350331

Pulled By: jamesr66a

fbshipit-source-id: 3f44d6589c29f39378432d0b6b281d96bb4829e7
2021-04-12 17:36:45 -07:00
a3c06e69aa [JIT][write path] Fix TupleType.annotation_str to conform to typing module syntax for empty tuple type (#54745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54745

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27350332

Pulled By: jamesr66a

fbshipit-source-id: 62af3b2b53561bb8e4adbdc0aec54520e08a5bf7
2021-04-12 17:29:33 -07:00
d0cd16899f rework device type filter rule (#55753)
Summary:
Currently common_device_type generates device-specific tests based on vague rules; see https://github.com/pytorch/pytorch/issues/55707.
This should fix https://github.com/pytorch/pytorch/issues/55707

# Changes included
This PR changes the rule:
1. First, user-provided args (`except_for` and `only_for`) are processed to filter the ALL_AVAILABLE_LIST down to the desired device types
2. Then environment variables are processed the exact same way.

Tests are generated based on the final filtered list, as sketched below.
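A toy model of that filtering order (all names here are hypothetical, not the actual implementation):

```python
ALL_AVAILABLE = ["cpu", "cuda", "meta"]

def apply_filters(devices, only_for=None, except_for=None):
    if only_for is not None:
        devices = [d for d in devices if d in only_for]
    if except_for is not None:
        devices = [d for d in devices if d not in except_for]
    return devices

# 1) user-provided args first, 2) then environment variables, the same way:
devices = apply_filters(ALL_AVAILABLE, only_for=["cpu", "cuda"])
devices = apply_filters(devices, except_for=["cuda"])
print(devices)  # ['cpu'] -- tests are generated for this final list
```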

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55753

Test Plan: CI

Reviewed By: seemethere, ngimel

Differential Revision: D27709192

Pulled By: walterddr

fbshipit-source-id: 1d5378ef013b22a7fb9fdae24b486730b2e67401
2021-04-12 16:07:27 -07:00
dab1cdf7cb Revert D27708944: [pytorch][PR] [OpInfo] move matmul to OpInfo
Test Plan: revert-hammer

Differential Revision:
D27708944 (08561cad10)

Original commit changeset: c200ded15082

fbshipit-source-id: 5bb75aa19c1ca761d1f118aafc483746ae813e2a
2021-04-12 15:56:55 -07:00
561b507843 Eliminate device guard in generic dispatch key kernel wrappers (#55131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55131

Benchmark `zeros_out`:

```python
from torch.utils.benchmark import Timer
counts = Timer(
    stmt="""at::zeros_out(t, {1});""",
    setup="auto t = at::empty({1});",
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```

With device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f834f095ca0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1396022                    1396022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```

Without device guard:
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f25e48927c0>
at::zeros_out(t, {1});
setup: auto t = at::empty({1});
                           All          Noisy symbols removed
    Instructions:      1296022                    1296022
    Baseline:                0                          0
1000 runs per measurement, 1 thread
```

We see about `7.7%` improvement.

ghstack-source-id: 126295368

Test Plan:
```
buck build //caffe2/aten/...
buck test mode/dev mode/no-gpu //caffe2/test:torch  -- 'caffe2/test:torch - test_msnpu_error (test_torch.TestTorch)'
```

Reviewed By: ezyang

Differential Revision: D27496584

fbshipit-source-id: 97f783a809b77b28f77a93096d69b3da9ee69df7
2021-04-12 15:42:19 -07:00
69b7b011dc [JIT] Add cond-add-relu matching pattern to cover in-place ops (#55458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55458

Previously the cond-add-relu pass blindly turned all in-place add and relu ops into their non-mutating versions; when those ops are not part of the fusion pattern, this can actually hurt performance, as shown with densenet on some platforms.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D27620415

fbshipit-source-id: 8302c0c85f3a064dfd8ac994e92416dde927e348
2021-04-12 15:23:34 -07:00
566e06eb9b Use _WeakTensorRef over weakref in test_autograd.py (#55726)
Summary:
There are a few autograd tests checking for tensors leaked by reference cycles. This changes them to use `_WeakTensorRef` over `weakref`. `_WeakTensorRef`, added in https://github.com/pytorch/pytorch/issues/52874, accesses the C++-level `TensorImpl` reference count, whereas `weakref` accesses Python refcounts and so can only tell whether the Python wrapper object gets deallocated. Not only is this less code, it also more accurately detects that the Tensor itself is deallocated.

I didn't touch `weakref` usage in [test_anomaly_assign_parent_cleanup](fc349cbcde/test/test_autograd.py (L3733)) and [test_nested_anomaly_printstack_cleanup](fc349cbcde/test/test_autograd.py (L3772)) because these are intentionally testing for python object cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55726

Reviewed By: ngimel

Differential Revision: D27718526

Pulled By: albanD

fbshipit-source-id: 37a4914360e35dd4ae8db06b29525cebec4d4b84
2021-04-12 14:16:02 -07:00
af1a772876 Disable overloading of std::max & std::min for inputs of distinct types (#55638)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55613

### Problem
By default, `std::max` and `std::min` only operate on inputs of the same type.
In [`c10/util/BFloat16-math.h`](https://github.com/pytorch/pytorch/blob/master/c10/util/BFloat16-math.h), `std::max` & `std::min` have been overloaded:
305abde976/c10/util/BFloat16-math.h (L32-L33)
ezyang [observed](https://github.com/pytorch/pytorch/pull/55586#issuecomment-815862373) & [illustrated](https://godbolt.org/z/bjTjPMMco) that calls to `std::max` & `std::min` with distinct input types (e.g. `std::max(int, float)`) are being resolved to `BFloat16`'s aforementioned overloads via implicit conversion to `BFloat16` (I haven't looked into why yet).

### Solution implemented
1. Disabled overloading of `std::max` & `std::min` for inputs of distinct types by removing these overloads for `BFloat16`.
2. Instead, `<` and `>` operators are now being overloaded for `BFloat16` now (for comparison with another `BFloat16`), since `std::max` and `std::min` use these operators.
3. Calls to `std::max` and `std::min` with inputs of distinct types are only present at 3 places in the codebase, where they can either be handled by a `static_cast`, or by changing the type:
a. [`aten/src/ATen/native/quantized/fake_quant_per_tensor_affine.cpp`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/fake_quant_per_tensor_affine.cpp#L111)
b. [`aten/src/ATen/native/cpu/BinaryOpsKernel.cpp`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp#L74)
c. [`aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp#L2998-L2999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55638

Reviewed By: albanD

Differential Revision: D27669702

Pulled By: ezyang

fbshipit-source-id: 790a67b76f86c25fad2c7ed0345b7f35ab5eca68
2021-04-12 12:49:34 -07:00
c00b9dc599 Small typo in comment (#55485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55485

Reviewed By: albanD

Differential Revision: D27641537

Pulled By: mrshenli

fbshipit-source-id: 1dc0d2d77c47a66dcf10866801a1e0f495422149
2021-04-12 12:43:14 -07:00
f7a51b2ab9 Don't set version_counter on inference tensor for unsafe_ ops. (#55819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55819

Test Plan:
On devserver: `buck run //xplat/langtech/tuna/cli:tuclix -- --model-dir ~/workspace/portal_en_US/ --audio-file ~/fbsource/fbcode/shortwave/test/data/audio_unittest.wav.to.raw` on top of Rittzz's D27691649
On device:

Reviewed By: Rittzz

Differential Revision: D27716745

fbshipit-source-id: 1921f18ee6b06990f71b86b9c4b3e1f3ce531001
2021-04-12 12:37:48 -07:00
08561cad10 [OpInfo] move matmul to OpInfo (#55543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55543

Reviewed By: samestep

Differential Revision: D27708944

Pulled By: walterddr

fbshipit-source-id: c200ded15082eaeed7ba3077a0c8629fed0db505
2021-04-12 11:35:10 -07:00
008ec544f4 [p2c2][operators] Self binning histogram op error msg
Summary: Change error msg to include the min max values when failing.

Test Plan:
Existing unit tests:
```
buck test //caffe2/caffe2/python/operator_test:self_binning_histogram_test
```
Failing wf with error msg:
f264505545

Reviewed By: TailofJune

Differential Revision: D27630820

fbshipit-source-id: c490ce8c8c40414403634979c9beaf9c08569a96
2021-04-12 11:33:39 -07:00
01441af763 Use mypy internals instead of fnmatch for mypy wrapper (#55702)
Summary:
I noticed that https://github.com/pytorch/pytorch/issues/53296 added these two lines to the `files` list in `mypy-strict.ini`:
```
    benchmarks/instruction_counts/*.py,
    benchmarks/instruction_counts/*/*.py,
```
I opened https://github.com/pytorch/pytorch/issues/55700 to simplify them into one line, but I was also curious whether `tools/mypy_wrapper.py` correctly handles those patterns, so I added the `test_glob_wildcards_dont_expand_or_collapse` case shown in this PR. Turns out, it doesn't!

I believe this is because [`mypy` uses `glob`](https://github.com/python/mypy/blob/v0.770/mypy/config_parser.py#L45-L63) to parse these patterns, and for some reason, [`fnmatch`](https://docs.python.org/3/library/fnmatch.html) and [`glob`](https://docs.python.org/3/library/glob.html) don't agree with each other on what `*` means:

- according to `fnmatch`, `*` seems to mean `.*`
- according to `glob`, `*` seems to mean `[^/]*`
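The disagreement is easy to demonstrate (the paths below are hypothetical):

```python
import fnmatch

path = "benchmarks/instruction_counts/sub/file.py"
pattern = "benchmarks/instruction_counts/*.py"

# fnmatch translates "*" to ".*", so it happily matches across "/":
print(fnmatch.fnmatch(path, pattern))  # True

# glob-style matching stops "*" at path separators, so the same pattern
# only matches files directly inside benchmarks/instruction_counts/.
```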

[This SO answer](https://stackoverflow.com/a/60174071) suggests using the [`glob.globmatch` function from the `wcmatch` library](https://facelessuser.github.io/wcmatch/glob/#globmatch) to solve the issue, but [we didn't want to add another external dependency](https://github.com/pytorch/pytorch/pull/55702#discussion_r610868623), so instead I simply modified our matching function to just directly call `mypy`'s own internal function that does the globbing (linked above).

One possible downside of this approach is that now the tests in `tools/test/test_mypy_wrapper.py` could break if the directory structure of PyTorch is changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55702

Test Plan:
```
python tools/test/test_mypy_wrapper.py
```

Reviewed By: malfet, seemethere

Differential Revision: D27684499

Pulled By: samestep

fbshipit-source-id: d99387a579c21eee73d1714e3e815ab7155f9646
2021-04-12 11:30:16 -07:00
9593af305c Automated submodule update: tensorpipe (#55137)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: fddc3aa75b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55137

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: beauby

Differential Revision: D27499763

fbshipit-source-id: d96538009be7824f2ef600e9816239188ddd991a
2021-04-12 11:27:16 -07:00
684589e8e0 [codemod][fbcode][1/n] Apply buildifier
Test Plan: Manual inspection. Sandcastle.

Reviewed By: karlodwyer, zsol

Differential Revision: D27702434

fbshipit-source-id: ee7498331c51daf44a29f2de452e3b02488b9af3
2021-04-12 11:04:32 -07:00
db394efbb9 Support batched embeddings for 8Bit embedding bag quantization (#55343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55343

Add support for N-dimensioned batches of 2D embedding bags to qembeddingbag_byte_prepack and qembeddingbag_byte_unpack.

This is currently supported in C2 via caffe2::Fused8BitRowwiseQuantizedToFloat and caffe2::FloatToFused8BitRowwiseQuantized, and is now supported in PyTorch operators via this change.
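A minimal round-trip sketch (the op names are assumed from the `quantized` namespace exposed via `torch.ops`; the shapes are made up):

```python
import torch

# A batch of four 10-row, 16-dim embedding tables, i.e. an N-dim batch of 2D bags.
weights = torch.randn(4, 10, 16, dtype=torch.float32)

packed = torch.ops.quantized.embedding_bag_byte_prepack(weights)
restored = torch.ops.quantized.embedding_bag_byte_unpack(packed)

print(packed.shape)    # trailing dim grows to hold per-row scale/zero point
print(restored.shape)  # torch.Size([4, 10, 16]) -- batch dims round-trip
```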

Test Plan: buck test //caffe2/test:quantization  -- test_embedding_bag_byte

Reviewed By: radkris-git

Differential Revision: D27480917

fbshipit-source-id: 9878751c6cee8a55909fe58a3e8c222ea31c20bb
2021-04-12 11:00:44 -07:00
80d04f910c fix typo in argmax docstring (#55239)
Summary:
The argmax docstring previously said that it returns the indices of the first 'minimal' value; this fixes the typo in that line to 'maximal'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55239

Reviewed By: albanD

Differential Revision: D27641562

Pulled By: mrshenli

fbshipit-source-id: f8b5c579400088b5210c83a05da6c4c106fbf95d
2021-04-12 10:39:36 -07:00
c91cf1e7a9 Add support for multiple outputs in structured kernels, port fractional_max_pool2d (#55581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55581

Multiple outputs now OK, as long as their all Tensor.  Ported
fractional_max_pool2d to make sure the shindig all works.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27641267

Pulled By: ezyang

fbshipit-source-id: f88bfcd2b11e9ae90b023c9310c033d12637a53e
2021-04-12 10:17:40 -07:00
8dd7e1528f Port replication_pad1d_backward to structured (#55537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55537

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27641208

Pulled By: ezyang

fbshipit-source-id: 56303b103e3ebc651bb64d11a9b19647f9affe53
2021-04-12 10:17:37 -07:00
3b96a7965a Port replication_padding3d to structured (#55499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55499

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27641207

Pulled By: ezyang

fbshipit-source-id: 0a0ef15151ae5de09b08ee09c623f9f38df3bec0
2021-04-12 10:17:35 -07:00
b9b103ff94 Port replication_padding1d to structured (#55481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55481

Tracking issue #55070

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27641209

Pulled By: ezyang

fbshipit-source-id: f1fc5588aaeb91d974aee74a8740ab47b7383baf
2021-04-12 10:16:04 -07:00
5fb1142702 Add CSR (compressed sparse row) layout for sparse tensors (#50937)
Summary:
Implement compressed sparse row format. Derived from the GCS implementation at https://github.com/pytorch/pytorch/pull/44190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50937

Reviewed By: mrshenli

Differential Revision: D27439865

Pulled By: ezyang

fbshipit-source-id: 3ba3dcb9679505b980ff6a5f513e913bbae2fb1d
2021-04-12 10:09:12 -07:00
c6d9ca0c2b [reland]Replace AutoNonVariableTypeMode with InferenceMode in static runtime. (#55731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55731

Forgot to export the diff in my last one. Retry...

Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27694660

fbshipit-source-id: b351338fa789b9e9c7337df9b1bc1bc0fc387f5d
2021-04-12 09:48:20 -07:00
211d31afc9 symeig supports complex backward (#55085)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53651
I did not put much effort in improving the docs, as I will go over all these docs in future PRs
cc anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55085

Reviewed By: nikithamalgifb

Differential Revision: D27493604

Pulled By: anjali411

fbshipit-source-id: 413363013e188bc869c404b2d54ce1f87eef4425
2021-04-12 09:45:50 -07:00
e05ca753bf Fix nightly tool for python 3.6 (#55776)
Summary:
Given that the minimal required Python version for using PyTorch is 3.6, the development tools should also be able to handle it. `./tools/nightly.py` currently uses the `capture_output` and `text` parameters of `subprocess.run`, which were only added in [Python 3.7](https://docs.python.org/3/library/subprocess.html#subprocess.run).
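A sketch of the 3.6-compatible spelling (the command here is just an example):

```python
import subprocess

# Python 3.7+: subprocess.run(cmd, capture_output=True, text=True)
# Python 3.6 equivalent:
result = subprocess.run(
    ["git", "--version"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,  # what text=True aliases in 3.7+
)
print(result.stdout)
```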

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55776

Reviewed By: ngimel

Differential Revision: D27709124

Pulled By: ezyang

fbshipit-source-id: aeea15a891ba792f3cd5fa602f0d7b746007e30c
2021-04-12 09:34:29 -07:00
13153924cc OpInfo porting for msort operator (#55488)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55488

Reviewed By: ngimel

Differential Revision: D27708648

Pulled By: iramazanli

fbshipit-source-id: 62b6bc5bd6e54c593b9afac56cb2511411683416
2021-04-12 09:22:30 -07:00
1a8ec9c447 Add breakpad to Docker image (#55439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55439

This adds [breakpad](https://github.com/google/breakpad) to the build in CI (just on one image for now). I attempted in #54739 to build it from source as a normal third_party submodule but it uses autotools and has some weird build steps that made it hacky to integrate. We really only need it for release builds anyways since its use is moot if built with anything but `RELEASE=1`.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27679766

Pulled By: driazati

fbshipit-source-id: 8211444df49b219c722137b9243d16d649a1f1ae
2021-04-12 09:20:57 -07:00
3c6b52ae62 Cache slow/disabled test files (#55682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55682

Fixes #55648

For now it downloads and writes the relevant files to the system's temp dir and marks it as valid for 3 hours.

Test Plan: Imported from OSS

Reviewed By: malfet, nikithamalgifb

Differential Revision: D27685616

Pulled By: driazati

fbshipit-source-id: 27469b85fe4b6b4addde6b22bf795bca3d4990ef
2021-04-12 09:17:07 -07:00
ec9b20ddc0 fx quant: fix edge case with copynode after user function (#55710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55710

In the current code, there is an edge case which leads to an error
after the prepare step:

1. have a pattern like this:

```
user_func_unmatched_to_qhandler -> node_matched_to_copy_node_qhandler
```

2. the user function returns a type which is not observable (i.e. not a
Tensor)

3. if this is run through `prepare_fx`, calibrating it with data leads
to a runtime error, because observers cannot observe non-tensor types.

This PR fixes the issue.  If a node matched to `CopyNodeQuantizeHandler`
is after an unmatched node, we delete the observer.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_no_obs_between_unmatched_node_and_copy_node
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27686811

fbshipit-source-id: 320be41b1f383c6352ff89fb39a9f480822a3bb2
2021-04-12 08:47:44 -07:00
3f8d476857 Split out CUDA RPC tests (#55695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55695

In order to be able to run CUDA tests on their own (e.g., to avoid running CPU tests on GPU machines).

Done by moving test methods to a separate class (and sometimes introducing a "common" base class for utils), and then providing new entry points inside a `cuda/` subdirectory.

Test Plan: Checked they are run on Sandcastle.

Reviewed By: mrshenli

Differential Revision: D27618198

fbshipit-source-id: 8f671657f79c8ae115748ab7752fe0066705893b
2021-04-12 07:48:08 -07:00
399b66c813 Ports logdet from method_tests() to op_db (#55743)
Summary:
Per title. Also updates some tensor construction helpers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55743

Reviewed By: ngimel

Differential Revision: D27702060

Pulled By: mruberry

fbshipit-source-id: f64b7bee855733ad1f4fd182819ceec5831d9878
2021-04-11 20:39:16 -07:00
Jie
66289673f7 patching requires_grad on DifferentiableGraph (#55701)
Summary:
The retrieval of profile node is much easier prior to inserting guard node.
test cases updated to reflect the patch on a previously failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55701

Reviewed By: pbelevich

Differential Revision: D27701216

Pulled By: Krovatkin

fbshipit-source-id: e2e6b64b682377e622b75c762e85ff7967e45118
2021-04-11 17:04:13 -07:00
19f15317a0 [BE][Docs] Improve dist.new_group doc (#55660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660

Noticed this doc was missing clarification on nccl env vars that
init_process_group docs have. Also, specify default behavior when backend=None
is passed in.
ghstack-source-id: 126251116

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27672208

fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
2021-04-11 16:16:18 -07:00
a3c062d4f5 docs: improve torch.matrix_exp() (#55626)
Summary:
Add a signature and make the mathematical expression related to the signature

Fixes https://github.com/pytorch/pytorch/issues/55599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55626

Reviewed By: ngimel

Differential Revision: D27699518

Pulled By: mruberry

fbshipit-source-id: e61d76e99eb8fc36114c1c2ee90990740d78beea
2021-04-11 16:03:03 -07:00
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
fa29a647db [JIT] Allow unpacking tuple and assign their values to SELECT-type expressions (#55268)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55268

Reviewed By: pbelevich, izdeby

Differential Revision: D27551950

Pulled By: gmagogsfm

fbshipit-source-id: 35324b728649bb1e6c5410a1004d2f6964f98304
2021-04-11 14:21:48 -07:00
b80c6f863f Disambiguate error message for working with not fully refined tuple types (#55745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55745

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27698691

Pulled By: tugsbayasgalan

fbshipit-source-id: 7855042d37290f19d53adfc0b4da606430501663
2021-04-11 02:30:05 -07:00
facbcec298 Make leak_corrupted_threadpool non-atomic (#55341)
Summary:
Following up on https://github.com/pytorch/pytorch/pull/54895#discussion_r606402656.

A race condition wouldn't arise because `leak_corrupted_threadpool` can be set to true only after fork via the `pthread_atfork` handler, when a (child) process is single-threaded. It's also set back to false while the process is still single-threaded (`pthreadpool` is called during an invocation of `set_num_threads`, prior to which a child process remains single-threaded). If and when multiple threads are created, all of them would always see `leak_corrupted_threadpool` as false if it were accessed concurrently.

Since no reader threads can exist while a writer thread changes its value (false->true and true->false), `leak_corrupted_threadpool` might as well be a non-atomic bool.

### Pros
1. No thread-synchronization is required for `leak_corrupted_threadpool`, as it's a non-atomic bool.
2. The call to `compare_exchange_strong` has been removed.

cc: malfet VitalyFedyunin ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55341

Reviewed By: albanD

Differential Revision: D27669442

Pulled By: ezyang

fbshipit-source-id: 926cb5c1b0a537c1c2ab164b0d51d37c1f1b67f0
2021-04-10 19:25:33 -07:00
84a7ab250b Optimize constructing tensors from external data (#55705)
Summary:
This PR optimizes the way tensors are constructed from external data. It avoids allocating an empty tensor beforehand and directly constructs the target tensor by passing the newly-initialized `DataPtr`. Running some Facebook-internal benchmarks showed that combined with https://github.com/pytorch/pytorch/issues/54530 this PR achieves performance parity with Caffe2 tensor construction. (Overall ~2x speed improvement over the original `at::from_blob()` implementation.)

Testing is done with the existing unit and integration tests as there is no user-observable API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55705

Reviewed By: ezyang

Differential Revision: D27686043

Pulled By: cbalioglu

fbshipit-source-id: b365c614476bcf0567797dfaf2add1b76fb6c272
2021-04-10 17:54:10 -07:00
255494c2aa torch.testing allclose -> close (#54781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54781

Right now the functions have divergent names, with one postfixed `_equal` and the other `_allclose`. I've opted to use `_(equal|close)` over `_all(equal|close)`; I think it is a reasonable assumption that all values need to be equal or close for this to pass, even without explicitly naming the function that way.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27438957

Pulled By: mruberry

fbshipit-source-id: 2951dac06d1430e15119ae94eafa234f3eb02f09
2021-04-10 13:35:38 -07:00
c9b94a85e9 change torch.testing helper asserts to checks (#54780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54780

- In #53152 we opted to use `tb=native`. Thus, regardless of whether we use `pytest` to run the tests, `__tracebackhide__` is not honored, and additional layers of helper functions make the traceback harder to parse. To overcome this, we change the internal helpers to return `ok: bool, msg: Optional[str]` and only raise the error in the top-level function. We do that already in the current implementation that we are trying to replace:
    36ce673f16/torch/testing/__init__.py (L92-L93)
    36ce673f16/torch/testing/__init__.py (L112)

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27438849

Pulled By: mruberry

fbshipit-source-id: 3e7a33dabb45463c29e8b9736fad09efb523f18d
2021-04-10 13:12:09 -07:00
548765d9a5 [PyTorch] Add & use inferExpandGeometry_dimvector (#55316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55316

No need for heap allocations in the common case here.
ghstack-source-id: 126170054

Test Plan: Existing CI

Reviewed By: hlu1

Differential Revision: D27571942

fbshipit-source-id: 11fbf077c583c80ea63e024d2b9e1599785fff71
2021-04-09 22:15:20 -07:00
151869aca6 [PyTorch][easy] Use sizes()[x] instead of size(x) in addr (#55247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55247

When x is known to be in-bounds, sizes() is faster.
ghstack-source-id: 126170048

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D27523681

fbshipit-source-id: 021c82a8a6b770802f4cd51cf6ff77046d71c938
2021-04-09 22:15:15 -07:00
12c19c398c [PyTorch] Update expand_size API to match expand_inplace (#55246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55246

c10::MaybeOwned<Tensor> and no more unary tuples.
ghstack-source-id: 126170051

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D27523682

fbshipit-source-id: 2590993cfc62136e65fd9a791e4ab68b2c366556
2021-04-09 22:15:10 -07:00
16a9141e2c [PyTorch] Update expand_outplace API to match expand_inplace (#55245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55245

Like `expand_inplace`, `expand_outplace` now returns
`MaybeOwned<Tensor>` in most cases. I wasn't confident around the
ownership semantics of the `TensorList` -> `std::vector<Tensor>` case, so I
left that one alone.
ghstack-source-id: 126170052

Test Plan: Existing CI.

Reviewed By: ezyang

Differential Revision: D27522811

fbshipit-source-id: 28c5a626b65681e361f4006a0aaa7dc23ba9612a
2021-04-09 22:15:04 -07:00
6fd875923e [PyTorch] Add MaybeOwned::operator*() && (#55244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55244

Add the ability to move from the underlying object in a `MaybeOwned`.

FWIW, `MaybeOwned` is new territory for me personally and this move-and-dereference operation is even more so, but I think it makes sense and the tests pass.
ghstack-source-id: 126170046

Test Plan: Added automated tests.

Reviewed By: bhosmer

Differential Revision: D27522809

fbshipit-source-id: 82b180031e93d725209b6328f656315c232e5237
2021-04-09 22:14:59 -07:00
e8dd65102b [PyTorch] Use infer_size_dimvector in ExpandUtils (#55180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55180

Even if we're expanding a Tensor's dimensions, DimVector's size is still a good guess at the rank of a Tensor in general. None of these sites actually seem to need a std::vector.
ghstack-source-id: 126170045

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D27520127

fbshipit-source-id: 4064764fad1b3782b379f04627b48331c3ee011f
2021-04-09 22:14:55 -07:00
fa19b6dd4d [PyTorch] New expand_inplace API with MaybeOwned<Tensor> and no unary tuples (#55065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55065

expand_inplace may give you the same Tensor(s) back, and it unnecessarily wrapped single-Tensor results in a tuple. Further diffs will deprecate and replace the rest of the similar APIs in ExpandUtils.
ghstack-source-id: 126170049

Test Plan: beyonce_test

Reviewed By: ezyang

Differential Revision: D27469297

fbshipit-source-id: 56cf14bc5603355f399fef2e5b02b97afa504428
2021-04-09 22:13:21 -07:00
2496a09314 [Gradient Compression] Fix PowerSGD docstring by removing an extra whitespace (#55666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55666

{F590513307}

Some code is not properly displayed due to an extra whitespace ahead of `(num_rows + num_cols)`.
ghstack-source-id: 126148569

Test Plan: Locally viewed

Reviewed By: rohan-varma

Differential Revision: D27673663

fbshipit-source-id: 603ae4ddbe86ceaefc311885b82b0f6b48b57b27
2021-04-09 21:11:40 -07:00
5a8cdc2fdb Revert D27691509: Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan: revert-hammer

Differential Revision:
D27691509 (d695ba94f6)

Original commit changeset: d43db028a399

fbshipit-source-id: 8cfa2f821ef3251b323483691672ed70858d9d68
2021-04-09 20:36:20 -07:00
d695ba94f6 Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27691509

fbshipit-source-id: d43db028a399bb02166a539577f6922237145f83
2021-04-09 20:04:00 -07:00
263a15c5aa [tensorexpr] Add PYTORCH_TENSOREXPR_DONT_FUSE env variable to disable fusion on specified operators - fixed #50757 (#55650)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55650

Test Plan:
Imported from OSS

$ python local/fusion.py
```
graph(%a.1 : Tensor,
      %b.1 : Tensor,
      %c.1 : Tensor):
  %33 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %34 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %35 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %36 : bool = prim::TypeCheck[types=[Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu)]](%c.1, %a.1, %b.1)
  %37 : Tensor = prim::If(%36)
    block0():
      %18 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = prim::TensorExprGroup_0(%33, %34, %35)
      -> (%18)
    block1():
      %44 : Function = prim::Constant[name="fallback_function", fallback=1]()
      %45 : (Tensor) = prim::CallFunction(%44, %c.1, %a.1, %b.1)
      %46 : Tensor = prim::TupleUnpack(%45)
      -> (%46)
  return (%37)
with prim::TensorExprGroup_0 = graph(%c.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu),
      %a.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu),
      %b.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu)):
  %10 : int = prim::Constant[value=1]()
  %11 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::add(%a.1, %b.1, %10) # local/fusion.py:13:15
  %9 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::mul(%a.1, %b.1) # local/fusion.py:13:19
  %6 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::mul(%9, %c.1) # local/fusion.py:13:19
  %3 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::add(%11, %6, %10) # local/fusion.py:13:15
  return (%3)

```
$ PYTORCH_TENSOREXPR_DONT_FUSE="add" python local/fusion.py
```
graph(%a.1 : Tensor,
      %b.1 : Tensor,
      %c.1 : Tensor):
  %3 : int = prim::Constant[value=1]()
  %6 : Tensor = aten::add(%a.1, %b.1, %3) # local/fusion.py:13:15
  %27 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %28 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %29 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), %30 : bool = prim::TypeCheck[types=[Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu), Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu)]](%c.1, %a.1, %b.1)
  %31 : Tensor = prim::If(%30)
    block0():
      %18 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = prim::TensorExprGroup_0(%27, %28, %29)
      -> (%18)
    block1():
      %35 : Function = prim::Constant[name="fallback_function", fallback=1]()
      %36 : (Tensor) = prim::CallFunction(%35, %c.1, %a.1, %b.1)
      %37 : Tensor = prim::TupleUnpack(%36)
      -> (%37)
  %15 : Tensor = aten::add(%6, %31, %3) # local/fusion.py:13:15
  return (%15)
with prim::TensorExprGroup_0 = graph(%c.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu),
      %a.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu),
      %b.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu)):
  %5 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::mul(%a.1, %b.1) # local/fusion.py:13:19
  %2 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::mul(%5, %c.1) # local/fusion.py:13:19
  return (%2)
```

Reviewed By: navahgar

Differential Revision: D27667232

Pulled By: huiguoo

fbshipit-source-id: 002ddbb49760b42d52e0605ca3967f4fa36f4e3f
2021-04-09 18:57:40 -07:00
3e8ebb17aa [reland][quant][graphmode][fx][refactor] Factor out insert_observers_for_model to a separate function (#54733) (#55307)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55307

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27567475

fbshipit-source-id: 74b7db63f7e1e795e7ac7ed6027cf786d922e7bf
2021-04-09 17:56:55 -07:00
d33829f844 Fix type annotations for state_dict() override (#55704)
Summary:
Change the annotation to OrderedDict, but stringify it to stay compatible with Python 3.6, as sketched below.
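
A minimal sketch (module name hypothetical) of why the string form matters: a bare `OrderedDict[str, Tensor]` annotation is evaluated at class-definition time and fails on Python 3.6, while the quoted form is only read by type checkers.

```python
from collections import OrderedDict

import torch
from torch import Tensor


class MyModule(torch.nn.Module):
    # Quoted so the subscripted OrderedDict is never evaluated at runtime
    def state_dict(self, *args, **kwargs) -> 'OrderedDict[str, Tensor]':
        return super().state_dict(*args, **kwargs)
```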

Fixes https://github.com/pytorch/pytorch/issues/55302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55704

Reviewed By: walterddr

Differential Revision: D27686011

Pulled By: malfet

fbshipit-source-id: 3a8dedf33f38d86767ebd4e8a1a8abfe850b375a
2021-04-09 17:48:12 -07:00
fc349cbcde OpInfo for kron (#55546)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55546

Test Plan: pytest test/test_ops.py -v -k "_kron_"

Reviewed By: albanD

Differential Revision: D27681131

Pulled By: asuhan

fbshipit-source-id: e480d8f163d73b9ca5353b2320ccb0631a5f06c5
2021-04-09 17:36:26 -07:00
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1. `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating the `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
717d54bc2b [Hackathon] Add source highlighting check to test_unsupported_ops (#55501)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55501

Reviewed By: janeyx99

Differential Revision: D27627517

Pulled By: gmagogsfm

fbshipit-source-id: 2542473425d10f1e3eb926f9e0fb6cd40679bd82
2021-04-09 16:40:29 -07:00
7485818a3f Revert D27670883: [pytorch][PR] Added an OpInfo for mm & ported its method_tests
Test Plan: revert-hammer

Differential Revision:
D27670883 (fc1d7a85bb)

Original commit changeset: 51232f44ab01

fbshipit-source-id: c372b578541626de3871ef94c97b5766c8412580
2021-04-09 16:32:39 -07:00
846c8d94c7 mark embedding backward non-deterministic for max mode rather than all reducing modes (#55574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55574

nn.EmbeddingBag backward is non-deterministic on GPU only when the reducing mode is Max; the Mean and Sum reducing modes are deterministic, as illustrated below.
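
A minimal sketch of the observable effect, assuming a CUDA build and that `torch.use_deterministic_algorithms` is available:

```python
import torch

torch.use_deterministic_algorithms(True)

bag = torch.nn.EmbeddingBag(10, 3, mode='max').cuda()
out = bag(torch.tensor([[1, 2, 4, 5]], device='cuda'))
# Backward through the max-mode path should trip the non-deterministic
# alert; with mode='mean' or mode='sum' it should run without complaint.
out.sum().backward()
```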

Test Plan: NA

Reviewed By: ngimel

Differential Revision: D27633832

fbshipit-source-id: 50786ed8522f1aae27442f5f244a65eab8000b06
2021-04-09 16:19:01 -07:00
7671c15d4f Make VariableVersion::DISABLED the default constructor for VariableVersion. (#55572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55572

We used to have the VariableVersion default constructor
`VariableVersion(uint32_t version=0)`. But sometimes
we override the version_counter right after it's constructed,
e.g. in SavedVariable/TensorImpl.
Thus we should make DISABLED the default constructor and elsewhere
use the explicit `VariableVersion(uint32_t)` constructor.
Note this PR effectively changes the SavedVariable constructor (which overrides
version_counter_ internally) to use the DISABLED constructor, and we
can see the gains in reduced instruction counts.

```
# benchmark code (imports added; Language is assumed to be re-exported here)
import torch
from torch.utils.benchmark import Language, Timer

timer = Timer(
    "y = x * x",
    """
    x = torch.rand((3, 3)).requires_grad_()
    """,
    language=Language.PYTHON,
)

 λ ~ python compare.py
No CUDA runtime is found, using CUDA_HOME='/public/apps/cuda/10.2'
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts
object at 0x7f06c48b3a50>
     7236  lookdict_unicode_nodummy
     2600  torch::autograd::VariableType::(...)
      100  0x0000000017751750
       -5  unlink_chunk.isra.0
     -100  0x000000001773e750
     -402  _int_malloc
    -1600  operator delete(...)
    -1600  c10::intrusive_ptr_target::release_resources()
    -2400  c10::VariableVersion::VersionCounter::~VersionCounter()
    -3600  torch::autograd::SavedVariable::operator=(...)
    -4800  operator new(...)
    -6400  torch::autograd::SavedVariable::SavedVariable(...)
    -7200  torch::autograd::SavedVariable::SavedVariable()
    -8400  free
   -16800  malloc
   -24400  _int_free

Total: -67771
```
Note that for other call sites (esp. view-related) we keep the behavior
unchanged by explicitly calling `VariableVersion(uint32_t)`, but we should be
able to optimize those in follow-up PRs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27669074

Pulled By: ailzhang

fbshipit-source-id: a4deb297cc89142ae8bd683284516c881ddf3c87
2021-04-09 15:55:02 -07:00
6e4e3a1159 Fix annotations in _autograd.pyi (#55706)
Summary:
`str` is a Python built-in name; besides, the parameter name in `profiler_kineto.h` is `path`:
6ee333cdb5/torch/csrc/autograd/profiler_kineto.h (L209)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55706

Reviewed By: janeyx99

Differential Revision: D27686286

Pulled By: malfet

fbshipit-source-id: b27e8e3812214218054be0e69493177bb728d8d7
2021-04-09 14:54:04 -07:00
ee2de8ae3a [android] Module load extraFiles (#55644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55644

Testing:
Prepare the model with extra files:

```
    extra_files = {}
    extra_files["model/live.spec.json"] = '{ "spec": "spec_value"}'
    torch.jit.save(script_model, "extra.pt", _extra_files=extra_files)
    script_model._save_for_lite_interpreter("extra.ptl", _extra_files=extra_files)
```

Change android/test_app/app/src/main/java/org/pytorch/testapp/MainActivity.java
1. Full jit
```
Map<String, String> map = new HashMap<>();
map.put("model/live.spec.json", "");
mModule = Module.load(assetFilePath(this, BuildConfig.MODULE_ASSET_NAME), map, Device.CPU);
android.util.Log.i("XXX", "map:" + map);
```
`gradle -p android test_app:installMnetLocalBaseDebug -PABI_FILTERS=arm64-v8a`

2. Lite
```
Map<String, String> map = new HashMap<>();
map.put("model/live.spec.json", "");
mModule = LiteModuleLoader.load(assetFilePath(this, BuildConfig.MODULE_ASSET_NAME), map, Device.CPU);
android.util.Log.i("XXX", "map:" + map);
```
`BUILD_LITE_INTERPRETER=1 gradle -p android test_app:installMnetLocalBaseDebug -PABI_FILTERS=arm64-v8a`

Check logcat

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D27663624

Pulled By: IvanKobzarev

fbshipit-source-id: 0dcd93b2fddaacd221db0306d18afee2584fcb85
2021-04-09 14:52:21 -07:00
9f519d2d2d Simplify benchmark patterns in mypy-strict.ini (#55700)
Summary:
These two lines were added in https://github.com/pytorch/pytorch/issues/53296, but they are needlessly complicated; this PR consolidates them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55700

Test Plan:
Run this command, and verify that the same number of files is given both before and after this PR:
```
mypy --config=mypy-strict.ini
```

Reviewed By: robieta

Differential Revision: D27684278

Pulled By: samestep

fbshipit-source-id: a34968cdff29cb8ad83813b277114224b5e37569
2021-04-09 14:48:45 -07:00
6842da6251 [WIP]Relax some limitations of InferenceMode. (#54403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54403

A few important points about InferenceMode behavior:
1. All tensors created in InferenceMode are inference tensors, except for the outputs of view ops.
   - View ops produce outputs with the same is_inference_tensor property as their inputs.
     Namely, a view of a normal tensor inside InferenceMode produces a normal tensor, which is
     exactly the same as creating a view inside NoGradMode, and a view of an
     inference tensor outside InferenceMode produces an inference tensor as output.
2. All ops are allowed inside InferenceMode, and they run faster than in normal mode.
3. Inference tensors cannot be saved for backward (see the sketch below).
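
A minimal sketch of these rules, assuming the Python bindings `torch.inference_mode` and `Tensor.is_inference` expose the same behavior as the C++ guard:

```python
import torch

x = torch.ones(2, 2)
with torch.inference_mode():
    y = x * 2      # created inside: an inference tensor
    v = x.view(4)  # view of a normal tensor: stays a normal tensor
    assert y.is_inference() and not v.is_inference()

w = y.view(2, 2)   # view of an inference tensor outside the mode
assert w.is_inference()

a = torch.ones(2, 2, requires_grad=True)
# (a * y).sum().backward()  # would raise: y cannot be saved for backward
```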

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27316483

Pulled By: ailzhang

fbshipit-source-id: e03248a66d42e2d43cfe7ccb61e49cc4afb2923b
2021-04-09 14:40:37 -07:00
91ab0d9680 [hackathon] port addmv to OpInfo (#55545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55545

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27629053

Pulled By: Lilyjjo

fbshipit-source-id: d7a114e21d3b90c2563a26d7103703988114353d
2021-04-09 14:25:14 -07:00
162e1003c9 [package] fix whichmodule for OrderedImporter (#55646)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55646

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27664238

Pulled By: suo

fbshipit-source-id: 752ba568ade2dbd268a7c1d5b3a12f5c396fcfbb
2021-04-09 13:26:44 -07:00
6ee333cdb5 modernize test_sparse (#54572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54572

Adding device-generic tests to `test_sparse`.
Follow-up PR: #54153

I think this is ready to review.
Looking forward to your comments. cc mruberry

Thanks

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27562663

Pulled By: mruberry

fbshipit-source-id: c48973e707f779b529bc7f61b75103194b428987
2021-04-09 12:19:29 -07:00
fc1d7a85bb Added an OpInfo for mm & ported its method_tests (#55446)
Summary:
Added an `OpInfo` for `mm` & ported its `method_tests` entry (it only had one).

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55446

Reviewed By: ngimel

Differential Revision: D27670883

Pulled By: mruberry

fbshipit-source-id: 51232f44ab01ad0454113992f80a4cfc730f8800
2021-04-09 12:15:33 -07:00
53f9fc1802 Port hypot method_tests() to OpInfo (#55140)
Summary:
Related https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55140

Reviewed By: ngimel

Differential Revision: D27562164

Pulled By: mruberry

fbshipit-source-id: fc698ddc624d2abf5d540aac76baa5d398993f1f
2021-04-09 11:40:27 -07:00
f3367f917e Translate annotation line numbers from merge to head (#55569)
Summary:
This PR

- adds a `tools/translate_annotations.py` script that
  - parses annotations into JSON using the regexes that we were previously passing to [`pytorch/add-annotations-github-action`](https://github.com/pytorch/add-annotations-github-action) and
  - uses `git diff-index` to translate the line numbers for those annotations from the PR `merge` onto the PR `head`, since (as of https://github.com/pytorch/pytorch/issues/54967) we now run CI on the former instead of the latter;
- modifies the `flake8-py3` and `clang-tidy` jobs to use that script and thus upload JSON in their artifacts instead of raw text; and
- modifies the "Add annotations" workflow to specify `mode: json` to allow it to use those preprocessed annotations.

Depends on https://github.com/pytorch/add-annotations-github-action/pull/18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55569

Test Plan:
You can run the unit tests with this command:
```
python tools/test/test_translate_annotations.py
```
I also tested the entire system together in my personal sandbox repo.

Reviewed By: malfet

Differential Revision: D27662161

Pulled By: samestep

fbshipit-source-id: ecca51b79b9cf00c90fd89f0d41d0c7b89d69c63
2021-04-09 11:12:40 -07:00
11dd6d3dbb Mycontrib Added Example for is_tensor API (#55052)
Summary:
[Added an example for the is_tensor API](https://github.com/harishsdev/practice/blob/master/changes_to_opensource/is_tensor_example_added.jpg)
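
A minimal sketch of the kind of example the docstring gained:

```python
import torch

x = torch.tensor([1, 2, 3])
print(torch.is_tensor(x))       # True
print(torch.is_tensor([1, 2]))  # False
```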

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55052

Reviewed By: ezyang

Differential Revision: D27523833

Pulled By: gchanan

fbshipit-source-id: 06036342223454856d4cfec46b40a72b311d261f
2021-04-09 11:02:59 -07:00
c0379ac83f Simplify device guard code generation (#55112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55112

Based on https://github.com/pytorch/pytorch/pull/47765
ghstack-source-id: 126114775

Test Plan: buck build //caffe2/aten/...

Reviewed By: ezyang

Differential Revision: D27487085

fbshipit-source-id: 157fcd19f538ce0c1e053e3e974b48bdb93a0226
2021-04-09 10:53:38 -07:00
43ede4c2e3 Add Per Tensor Quantization Support to FXIRImporter (#55405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55405

Pull Request resolved: https://github.com/pytorch/glow/pull/5516

Allows FXIRImporter to import quantized models.

This diff doesn't include support for per-channel weights, linear, and conv. Those will be addressed in the next diff.

Test Plan: buck test glow/fb/fx/nnpi_importer:test_importer

Reviewed By: jackm321, jfix71

Differential Revision: D27313543

fbshipit-source-id: bf5c96ef5f2ff1835c09db981e0ceefaec56dd5b
2021-04-09 10:49:48 -07:00
076961e8b5 Add tuple add operator (#52292)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52292

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26792416

Pulled By: tugsbayasgalan

fbshipit-source-id: 882325b171c1ff53ec40243d3f9334049c03fe57
2021-04-09 10:24:48 -07:00
159e1100bf [fix][tests] fix logic if env variables not present (#55664)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/55670
Reference: https://github.com/pytorch/pytorch/pull/55522

**Can't run tests locally without setting the env variables**

<details>

```
(pytorch-cuda-dev) kshiteej@qgpu1:~/Pytorch/pytorch_opinfo$ pytest test/test_ops.py
======================================================================= test session starts ========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/kshiteej/Pytorch/pytorch_opinfo, configfile: pytest.ini
plugins: hypothesis-5.38.1
collected 0 items

========================================================================= warnings summary =========================================================================
../../.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
    warnings.warn(

../../.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py:1195
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/testing/_internal/common_nn.py:1195: UserWarning: Legacy tensor constructor is deprecated. Use: torch.tensor(...) for creating tensors from tensor-like objects; or torch.empty(...) for creating an uninitialized tensor with specific sizes. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:474.)
    random_samples = torch.DoubleTensor(1, 3, 2).uniform_()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================================= 2 warnings in 2.85s ========================================================================
```

</details>

c7312f5271/torch/testing/_internal/common_device_type.py (L479-L486)

(When running locally where the environment variable is not set)

When the env variable is not present, `os.getenv` returns `''`, which is split into `['']` for both `only_for` and `except_for`.

c7312f5271/torch/testing/_internal/common_device_type.py (L496-L497)

At this point, we take the branch and skip all the tests.
```python
>>> if [''] and 'cuda' not in ['']:
...     print("TRUE")
...
TRUE
```
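
A hedged sketch of the safer parsing (the helper below is illustrative, not the exact patch):

```python
import os


def _env_device_list(name):
    value = os.getenv(name)
    # An unset or empty variable means "no filter", not ['']
    return value.split(',') if value else None


only_for = _env_device_list('PYTORCH_TESTING_DEVICE_ONLY_FOR')
if only_for and 'cuda' not in only_for:
    print('skipping')  # now only reached when a filter is really set
```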

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55664

Reviewed By: albanD

Differential Revision: D27677752

Pulled By: malfet

fbshipit-source-id: 071486e3b6b5113c56f0f956b8d99a5ab24068fe
2021-04-09 10:22:58 -07:00
defc649eca Update to short forms of splitWithTail / splitWithMask (#55542)
Summary:
Switched to short forms of `splitWithTail` / `splitWithMask` for all tests in `test/cpp/tensorexpr/test_*.cpp` (except test_loopnest.cpp)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55542

Reviewed By: mrshenli

Differential Revision: D27632033

Pulled By: jbschlosser

fbshipit-source-id: dc2ba134f99bff8951ae61e564cd1daea92c41df
2021-04-09 10:15:20 -07:00
35a66db774 Fix complex mean and reduction tests not being run (#55640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55640

Mean is broken for complex types: since #53218 it allocates the result
as a real tensor, which discards the imaginary component. This wasn't picked up
in testing because the `_test_dim_ops` tests are defined as closures inside
`_test_dim_ops` instead of as methods on the test class, so they
never get run.
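
A minimal illustration of the discovery pitfall (class and names hypothetical): unittest only collects `test_*` methods defined on the class, so a test defined as a closure inside a helper silently never runs unless it is attached to the class or called:

```python
import unittest


class TestDimOps(unittest.TestCase):
    def _test_dim_ops(self):
        def test_mean_dim(self):           # a closure: never collected
            self.fail("this never runs")
        # forgetting to attach it to the class makes it dead code

    def test_visible(self):                # collected and run
        self.assertTrue(True)


if __name__ == '__main__':
    unittest.main()  # runs 1 test, not 2
```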

For best results, view diff with "Hide whitespace changes".

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27671127

Pulled By: mruberry

fbshipit-source-id: 4a1f6fea1048919fda7339c867ee78e88f2d7bd2
2021-04-09 10:03:44 -07:00
2a24a2418a common_utils.py use new file names for disabled/slow tests (#55620)
Summary:
Following up on these file-renaming changes:
https://github.com/pytorch/pytorch/pull/55618
https://github.com/pytorch/test-infra/pull/3

We should update the use sites in common_utils.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55620

Reviewed By: samestep

Differential Revision: D27651884

Pulled By: janeyx99

fbshipit-source-id: 298a981e55e0b7c95202294d9bc4b3fcce359590
2021-04-09 09:25:20 -07:00
55d45458bd [cuDNN] Enable Conv3d channels_last_3d (#48430)
Summary:
This PR adds the functionality to use channels_last_3d, a.k.a. NDHWC, in Conv3d. It's only enabled when the cuDNN version is greater than or equal to 8.0.5.
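
A minimal sketch of opting in from Python (shapes illustrative; requires a CUDA build with cuDNN >= 8.0.5):

```python
import torch

conv = torch.nn.Conv3d(8, 16, kernel_size=3).cuda()
conv = conv.to(memory_format=torch.channels_last_3d)
x = torch.randn(2, 8, 16, 32, 32, device='cuda').to(
    memory_format=torch.channels_last_3d)
y = conv(x)
print(y.is_contiguous(memory_format=torch.channels_last_3d))
```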

Todo:

- [x] add memory_format test
- [x]  add random shapes functionality test

Close https://github.com/pytorch/pytorch/pull/52547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48430

Reviewed By: mrshenli

Differential Revision: D27641452

Pulled By: ezyang

fbshipit-source-id: 0e98957cf30c50c3390903d307dd43bdafd28880
2021-04-09 07:56:49 -07:00
c7312f5271 Enabled xla device in CI. (#55658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55658

Fix #55522.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27671867

Pulled By: ailzhang

fbshipit-source-id: af8cc5bfe540af6d33d839bf2f2f254290c95da2
2021-04-08 23:41:03 -07:00
bbd2b1bd3c [quant][graphmode][fx] Add shape to nontensor op list (#55529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55529

`x.shape` outputs a non-Tensor; add it to the `all_node_args_have_no_tensors` function
to avoid inserting an observer for the getattr "shape" node.
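
A small sketch of where such a node shows up in an FX graph (module hypothetical):

```python
import torch
import torch.fx


class M(torch.nn.Module):
    def forward(self, x):
        return x.view(x.shape[0], -1)


gm = torch.fx.symbolic_trace(M())
# The graph contains a getattr node for "shape" whose output is a
# torch.Size, i.e. a non-Tensor that needs no observer.
print(gm.graph)
```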

Test Plan: Imported from OSS

Reviewed By: wat3rBro

Differential Revision: D27628145

fbshipit-source-id: 4729294ab80c0a1e72440396d31e7e82257b1092
2021-04-08 23:27:05 -07:00
0910363e8f adds data_ptr checks to in-place OpInfo variant tests and out OpInfo tests (#55527)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55088.
Unfortunately, this test wouldn't have caught the index_add_ breakage, because that breakage appears only in a particular type promotion situation.
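
A minimal sketch of the invariant being checked (op choice illustrative):

```python
import torch

a = torch.rand(4)

out = torch.empty(4)
ptr = out.data_ptr()
torch.sin(a, out=out)
assert out.data_ptr() == ptr  # out= must reuse the existing storage

b = torch.rand(4)
ptr = b.data_ptr()
b.sin_()
assert b.data_ptr() == ptr    # in-place must not reallocate
```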

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55527

Reviewed By: mruberry

Differential Revision: D27671138

Pulled By: ngimel

fbshipit-source-id: b52411f5a6d81098b706dfda4d0c9a16716414d7
2021-04-08 23:09:28 -07:00
d2784c233b Partially migrate sort from THC to ATen, replace the thrust path with cub (#54626)
Summary:
The thrust path of `torch.sort` in THC is rewritten and replaced with cub in ATen. The original algorithm is followed, but since cub does not offer a custom compare operator, I had to change it slightly to two sorts plus a gather.
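
A rough Python illustration of the gather step (the intuition, not the cub kernel): sorting yields indices, and gathering with those indices reproduces the sorted values:

```python
import torch

t = torch.randn(4, 8)
values, indices = torch.sort(t, dim=1)
assert torch.equal(torch.gather(t, 1, indices), values)
```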

Note: tensors larger than 2^31 elements are supported, but the dimension being sorted cannot exceed 2^31 elements.

Related: https://github.com/pytorch/pytorch/pull/50887 https://github.com/pytorch/pytorch/issues/24637

Benchmark:

```python
import torch
import itertools

for i in range(1000):
    torch.arange(100000, device='cuda')

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

for i, j in itertools.product([512, 4096, 8192], repeat=2):
    print(i,j)
    t = torch.randn(i, j, device='cuda')
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t))
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t, dim=0))
    print()
```

Before
```
512 512
3.91 ms ± 8.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.87 ms ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
70.5 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.7 ms ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
142 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
64.4 ms ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
26.8 ms ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
82.2 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
606 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
722 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
1.28 s ± 157 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.54 s ± 500 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
53.5 ms ± 73.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
168 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
1.28 s ± 236 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.54 s ± 272 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
2.69 s ± 741 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.28 s ± 549 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After
```
512 512
4.02 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
40.7 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
33.9 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
71.7 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.4 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
27.6 ms ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.6 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
262 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
321 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
520 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
661 ms ± 853 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
54.6 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
83.2 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
521 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
645 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
1.04 s ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 541 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54626

Reviewed By: VitalyFedyunin

Differential Revision: D27396078

Pulled By: ngimel

fbshipit-source-id: 4a23b9355e3542e49233b4b4328e43947ec17efd
2021-04-08 23:04:33 -07:00
5b149a0d4a Migrate cos to structured kernel (#55564)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55564

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27653764

Pulled By: SplitInfinity

fbshipit-source-id: e13d07b2bc76d11e635de63a5d6d1a835da79e47
2021-04-08 22:42:11 -07:00
19e43eaaf4 Migrate cosh to structured kernel (#55563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55563

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27653767

Pulled By: SplitInfinity

fbshipit-source-id: 11cd631679b9b5a88443a714a56f4178f5bf41b0
2021-04-08 22:42:09 -07:00
a699cda846 Migrate acosh to structured kernel (#55540)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55540

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27653766

Pulled By: SplitInfinity

fbshipit-source-id: 311087befcfaa4bd36d2539b3bfe1d5149922ca3
2021-04-08 22:42:06 -07:00
6bdf7ef2a3 Migrate sinh to structured kernel (#55538)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55538

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27653765

Pulled By: SplitInfinity

fbshipit-source-id: ca708fa20cd95e525827c0834e135c61aff56298
2021-04-08 22:40:21 -07:00
4d449f915f [quant][graphmode][fx] Separate handling Copy operator to a helper function (#54644) (#55429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55429

Previously we special case copy operator in normal insert observer code, this PR tries to split the
special case logic to a separate function and keep the rest of the code clean.

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27609972

fbshipit-source-id: 378f6aa70f18c0b477b62b6efe236648748aae7e
2021-04-08 22:12:24 -07:00
42486963b2 Integrate NNC conv2d with fuser (#55213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55213

Adds the integration of conv2d with the TE fuser.  A few things of interest:

- I'm *super* selective about which convs get lowered.  Only 3x3 depthwise, because
  I've benchmarked those to death and I'm pretty sure it's a good change.

- I'm allowing single-node "fusion" groups for supported convs.  (Maybe this is
  a sign that conv2d codegen should go through a different path entirely, but
  it seems to basically work).

I'll share full benchmark results once I clean them up a little.  To
summarize, I tested the following torchvision models containing depthwise
convolutions.  Results are single-core on a skylake-avx512:

mobilenet_v2: 8% improvement
mobilenet_v3: 9% improvement
mnasnet: 10% improvement
shufflenet: 18% improvement

Note these are comparing against a baseline with a fast-but-buggy grouped
convolution implementation in MKLDNN.  So perf results will be better if
compared on master, but I'm going to assume the MKLDNN bug will be fixed and
re-enabled.

Perf results are more complicated when comparing to freezing plus conversion to
mkldnn layout; mobilenet v2/v3 are still faster, but mnasnet and shufflenet are
not.  Landing this doesn't prevent MKLDNN freezing from kicking in though, so
there's no harm (although landing mkldnn freezing will regress mobilenet, but
c'est la vie).
ghstack-source-id: 126076112

Test Plan: New unit test, plus torchvision

Reviewed By: ZolotukhinM

Differential Revision: D27530272

fbshipit-source-id: 92153fad234bc9f1eaa4f7624c543168d1294a87
2021-04-08 21:58:27 -07:00
cb4b3b04a8 [nnc] Move device type checks from isSupported to typesAreSupported (#55025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55025

Something needs to be fixed about the names of these functions,
because they are confusing.

The profiling infrastructure calls `isSupported` to see if it should insert
profiling nodes.

The fuser calls `isSupported` but also `typesAreSupported` to determine if it
can actually fuse the node.

At profiling time, we don't know device types yet, so we can't use device type
checks in `isSupported` or else we'll never profile the node.  So we want to
move those checks into `typesAreSupported`, where we actually have profiling
info available.
ghstack-source-id: 126076111

Test Plan: sandcastle

Reviewed By: ngimel

Differential Revision: D27454968

fbshipit-source-id: 4ffb142ea7a0086842a034c9e202f9cb1065fc95
2021-04-08 21:58:25 -07:00
90f848572c NNC depthwise conv2d implementation (#54920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54920

Add a depthwise convolution implementation and reasonably good
schedules for 3x3 stride=1,2.
ghstack-source-id: 126076113

Test Plan: new tensorexpr test: Conv.DepthwiseConv2D

Reviewed By: ZolotukhinM

Differential Revision: D27413745

fbshipit-source-id: 833da6072b655fbe2b679704e9d56a08e1bf7e7e
2021-04-08 21:56:53 -07:00
6a39613f35 [BE] Make torch/csrc/jit/tensorexpr/ clang-tidy clean (#55628)
Summary:
Mostly auto-generated changes using
```
 python3 tools/clang_tidy.py -c build -x torch/csrc/jit/tensorexpr/eval.cpp -s
```
With following common patterns manually fixed
- Use ` = default` instead of `{}`
- deleted methods should be public
- Use pass-by-value + std::move instead of pass-by-reference+copy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55628

Reviewed By: walterddr

Differential Revision: D27655378

Pulled By: malfet

fbshipit-source-id: 92be87a08113435d820711103ea9b0364182c71a
2021-04-08 19:44:14 -07:00
c998f3573c [Hackathon]Move tests related to containers in typing to test_typing.py (#55504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55504

Test Plan: Imported from OSS

Reviewed By: navahgar, pbelevich

Differential Revision: D27666760

Pulled By: nikithamalgifb

fbshipit-source-id: c1a7904f33855efa4f60f8f54c029a95a5fd529c
2021-04-08 18:40:37 -07:00
2ca45cb9e8 [hackathon] ci: Only generate cuda tests for cuda configurations (#55522)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55522

Reviewed By: walterddr

Differential Revision: D27634951

Pulled By: seemethere

fbshipit-source-id: 1dccaeb4bc8d0d53d61e467ba676c5c538fd4cf2
2021-04-08 17:31:26 -07:00
3498fde20e Add AccumulateType in AdaptiveAveragePooling3d.cu (#53607)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52719

- Changed the type (`scalar_t`) of intermediate results to `at::acc_type<scalar_t, true>`

This issue is caused by the limited precision of the half-precision format.

Following the test cases of the issue above: the value range of the input tensors is [0, 1] because they are initialized by `rand`.
When the kernel size is 1, the kernel sums all the target values and divides by the numel of the kernel:
34d9278c19/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu (L94-L95)

When adding values in [0, 1], once the running `sum` exceeds 2048 in half precision, further additions no longer change it, since the spacing between representable half values there is 2. (Even when smaller values are accumulated more exactly, there are still precision issues.)
(https://en.wikipedia.org/wiki/Half-precision_floating-point_format)

Benchmarks
- On V100 32GB, driver 450.80, CUDA 10.1
- Faster than the previous implementation

<details><summary>Script</summary><p>

```
import torch
from torch.utils.benchmark import Timer

torch.manual_seed(0)

kernel_sizes = [1, 3, 5, 7, 9, 11, 13]
shapes = [(12, 12, 12), (16, 16, 16), (16, 32, 32), (16, 56, 56), (16, 112, 112)]

def run(batch, channel):
    print(f"Batch : {batch}, Channel : {channel} / (diff, diff / numel, time)")

    head = "\t".join(f"{str(s):30s}" for s in ["k \ shape"] + shapes)
    print(head)
    for kernel_size in kernel_sizes:
        kernel_size = (kernel_size, kernel_size, kernel_size)
        pool = torch.nn.AdaptiveAvgPool3d(kernel_size)

        print(f"{str(kernel_size):30s}", end="\t")
        for shape in shapes:
            x_half = torch.rand([batch, channel, *shape], dtype=torch.half, device="cuda")
            x_float = x_half.float()

            y_half = pool(x_half)
            y_float = pool(x_float)

            timer = Timer("pool(x_half)", globals={"pool": pool, "x_half": x_half})
            measurement = timer.blocked_autorange(min_run_time=5)

            diff = (y_float - y_half).abs().sum().item()
            diff = f"{diff:.4f}, {diff / y_half.numel():.6f}, {measurement.median * 1e6 :3.2f}us"
            print(f"{diff:30s}", end="\t")
        print("")

run(1, 1)
run(1, 3)
run(1, 54)
run(1, 16)

run(8, 1)
run(8, 16)
run(8, 54)

import torch
m = torch.nn.AdaptiveAvgPool3d((1,1,1))

inputs = torch.rand([8,54,16,56,56])
inputs = inputs.cuda()
inputs_2 = inputs.half()

print("Float")
out = m(inputs).float()
print("half")
out2 = m(inputs_2).float()

print('Discepancies', torch.sum(torch.abs(out2- out)).item(), torch.sum(torch.abs(out2- out)).item() / out.numel() , out.numel())

print("Sum : ", torch.sum(inputs, dim=(2,3,4))[0, 0], torch.sum(inputs_2, dim=(2,3,4))[0, 0])
```
</p>
</details>

<details><summary>This commit</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0001, 0.000078, 55.73us       0.0001, 0.000079, 117.51us       0.0000, 0.000003, 379.60us      0.0000, 0.000046, 1046.21us      0.0001, 0.000139, 3897.17us
(3, 3, 3)                       0.0021, 0.000076, 22.04us       0.0031, 0.000115, 21.47us        0.0022, 0.000080, 41.63us       0.0030, 0.000111, 100.59us       0.0025, 0.000091, 295.04us
(5, 5, 5)                       0.0103, 0.000083, 21.65us       0.0097, 0.000078, 21.37us        0.0103, 0.000083, 21.60us       0.0114, 0.000091, 25.69us        0.0107, 0.000085, 97.06us
(7, 7, 7)                       0.0312, 0.000091, 21.52us       0.0290, 0.000084, 21.61us        0.0311, 0.000091, 21.60us       0.0309, 0.000090, 21.44us        0.0334, 0.000097, 33.60us
(9, 9, 9)                       0.0646, 0.000089, 21.57us       0.0672, 0.000092, 21.89us        0.0662, 0.000091, 21.89us       0.0684, 0.000094, 27.64us        0.0660, 0.000091, 54.85us
(11, 11, 11)                    0.1251, 0.000094, 21.68us       0.1194, 0.000090, 21.70us        0.1202, 0.000090, 21.72us       0.1233, 0.000093, 22.25us        0.1229, 0.000092, 41.39us
(13, 13, 13)                    0.2038, 0.000093, 21.57us       0.2047, 0.000093, 21.58us        0.1964, 0.000089, 21.54us       0.2021, 0.000092, 21.94us        0.1989, 0.000091, 40.01us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0003, 0.000110, 55.74us       0.0003, 0.000093, 118.62us       0.0003, 0.000093, 382.12us      0.0001, 0.000040, 1052.33us      0.0003, 0.000114, 3917.90us
(3, 3, 3)                       0.0073, 0.000090, 21.84us       0.0075, 0.000093, 22.25us        0.0072, 0.000089, 41.78us       0.0070, 0.000087, 100.27us       0.0069, 0.000086, 293.96us
(5, 5, 5)                       0.0353, 0.000094, 22.57us       0.0325, 0.000087, 21.64us        0.0343, 0.000092, 22.63us       0.0338, 0.000090, 25.82us        0.0332, 0.000089, 97.16us
(7, 7, 7)                       0.0937, 0.000091, 22.50us       0.0910, 0.000088, 21.92us        0.0933, 0.000091, 21.99us       0.0948, 0.000092, 21.56us        0.0928, 0.000090, 34.17us
(9, 9, 9)                       0.1957, 0.000089, 21.68us       0.1984, 0.000091, 21.57us        0.2025, 0.000093, 22.10us       0.1986, 0.000091, 27.66us        0.2020, 0.000092, 55.32us
(11, 11, 11)                    0.3585, 0.000090, 21.75us       0.3684, 0.000092, 22.70us        0.3706, 0.000093, 21.67us       0.3752, 0.000094, 21.86us        0.3663, 0.000092, 41.22us
(13, 13, 13)                    0.5931, 0.000090, 21.67us       0.6056, 0.000092, 21.79us        0.6005, 0.000091, 21.79us       0.6112, 0.000093, 21.69us        0.6034, 0.000092, 40.02us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0051, 0.000095, 55.76us       0.0060, 0.000112, 118.60us       0.0036, 0.000067, 381.50us      0.0054, 0.000100, 1054.03us      0.0048, 0.000089, 4888.68us
(3, 3, 3)                       0.1332, 0.000091, 21.66us       0.1344, 0.000092, 22.62us        0.1354, 0.000093, 45.72us       0.1364, 0.000094, 106.63us       0.1324, 0.000091, 448.31us
(5, 5, 5)                       0.6221, 0.000092, 22.48us       0.6220, 0.000092, 21.71us        0.6053, 0.000090, 27.65us       0.6137, 0.000091, 31.40us        0.6209, 0.000092, 172.78us
(7, 7, 7)                       1.6859, 0.000091, 22.42us       1.6972, 0.000092, 21.96us        1.6849, 0.000091, 23.14us       1.7012, 0.000092, 26.25us        1.6920, 0.000091, 75.58us
(9, 9, 9)                       3.5811, 0.000091, 21.73us       3.5746, 0.000091, 22.55us        3.6237, 0.000092, 27.66us       3.6046, 0.000092, 59.71us        3.6392, 0.000092, 168.15us
(11, 11, 11)                    6.5582, 0.000091, 22.05us       6.5746, 0.000091, 21.74us        6.5955, 0.000092, 32.91us       6.5644, 0.000091, 45.57us        6.5697, 0.000091, 114.01us
(13, 13, 13)                    10.6384, 0.000090, 21.81us      10.8608, 0.000092, 21.79us       10.8375, 0.000091, 37.01us      10.8662, 0.000092, 51.80us       10.8593, 0.000092, 123.19us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0015, 0.000093, 55.75us       0.0012, 0.000075, 118.10us           0.0013, 0.000079, 379.25us      0.0012, 0.000075, 1047.21us     0.0013, 0.000079, 4451.57us
(3, 3, 3)                       0.0407, 0.000094, 21.82us       0.0395, 0.000091, 21.69us            0.0385, 0.000089, 42.07us       0.0397, 0.000092, 100.33us      0.0384, 0.000089, 363.31us
(5, 5, 5)                       0.1858, 0.000093, 21.76us       0.1799, 0.000090, 21.63us            0.1834, 0.000092, 21.76us       0.1890, 0.000095, 26.04us       0.1814, 0.000091, 135.32us
(7, 7, 7)                       0.4937, 0.000090, 21.65us       0.5076, 0.000092, 21.69us            0.5001, 0.000091, 22.31us       0.4988, 0.000091, 21.59us       0.5123, 0.000093, 50.03us
(9, 9, 9)                       1.0678, 0.000092, 21.73us       1.0752, 0.000092, 21.75us            1.0673, 0.000091, 21.75us       1.0649, 0.000091, 30.01us       1.0786, 0.000092, 70.92us
(11, 11, 11)                    1.9591, 0.000092, 21.57us       1.9522, 0.000092, 21.60us            1.9566, 0.000092, 21.73us       1.9475, 0.000091, 23.46us       1.9323, 0.000091, 55.02us
(13, 13, 13)                    3.1784, 0.000090, 22.02us       3.2165, 0.000092, 21.95us            3.1969, 0.000091, 21.92us       3.2061, 0.000091, 24.40us       3.2578, 0.000093, 56.00us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0010, 0.000122, 55.74us       0.0009, 0.000114, 118.82us           0.0006, 0.000074, 379.80us      0.0009, 0.000107, 1047.31us     0.0008, 0.000102, 3900.36us
(3, 3, 3)                       0.0219, 0.000101, 21.57us       0.0200, 0.000093, 21.61us            0.0194, 0.000090, 41.74us       0.0208, 0.000096, 99.91us       0.0212, 0.000098, 293.03us
(5, 5, 5)                       0.0906, 0.000091, 21.46us       0.0911, 0.000091, 21.60us            0.0934, 0.000093, 21.93us       0.0927, 0.000093, 25.74us       0.0913, 0.000091, 96.85us
(7, 7, 7)                       0.2530, 0.000092, 22.53us       0.2526, 0.000092, 22.46us            0.2558, 0.000093, 22.03us       0.2542, 0.000093, 22.29us       0.2475, 0.000090, 34.44us
(9, 9, 9)                       0.5305, 0.000091, 22.34us       0.5368, 0.000092, 22.42us            0.5265, 0.000090, 21.74us       0.5370, 0.000092, 27.81us       0.5416, 0.000093, 55.65us
(11, 11, 11)                    0.9887, 0.000093, 21.80us       0.9660, 0.000091, 21.61us            0.9793, 0.000092, 22.11us       0.9719, 0.000091, 21.80us       0.9650, 0.000091, 43.90us
(13, 13, 13)                    1.6024, 0.000091, 21.87us       1.6198, 0.000092, 22.65us            1.6242, 0.000092, 21.73us       1.6236, 0.000092, 22.59us       1.6025, 0.000091, 42.77us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0113, 0.000088, 56.66us       0.0117, 0.000091, 119.57us           0.0130, 0.000102, 389.57us      0.0110, 0.000086, 1433.78us     0.0119, 0.000093, 5217.61us
(3, 3, 3)                       0.3209, 0.000093, 21.54us       0.3184, 0.000092, 22.87us            0.3115, 0.000090, 51.00us       0.3171, 0.000092, 164.17us      0.3182, 0.000092, 500.60us
(5, 5, 5)                       1.4391, 0.000090, 22.39us       1.4577, 0.000091, 21.69us            1.4601, 0.000091, 53.87us       1.4626, 0.000091, 93.65us       1.4567, 0.000091, 370.11us
(7, 7, 7)                       4.0501, 0.000092, 22.34us       4.0230, 0.000092, 31.45us            4.0381, 0.000092, 45.19us       4.0171, 0.000091, 65.35us       4.0108, 0.000091, 164.76us
(9, 9, 9)                       8.5360, 0.000091, 22.80us       8.5456, 0.000092, 27.24us            8.5461, 0.000092, 50.23us       8.5677, 0.000092, 117.63us      8.5645, 0.000092, 270.46us
(11, 11, 11)                    15.5521, 0.000091, 26.56us      15.5826, 0.000091, 32.81us           15.6014, 0.000092, 63.82us      15.5620, 0.000091, 96.87us      15.5722, 0.000091, 220.24us
(13, 13, 13)                    25.4146, 0.000090, 32.91us      25.7898, 0.000092, 38.48us           25.6698, 0.000091, 72.02us      25.8193, 0.000092, 121.73us     25.7718, 0.000092, 249.71us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0377, 0.000087, 109.07us      0.0405, 0.000094, 233.17us           0.0392, 0.000091, 998.97us      0.0393, 0.000091, 2960.68us     0.0408, 0.000094, 11879.53us
(3, 3, 3)                       1.0660, 0.000091, 25.68us       1.0761, 0.000092, 64.12us            1.0725, 0.000092, 182.50us      1.0801, 0.000093, 505.82us      1.0736, 0.000092, 1650.21us
(5, 5, 5)                       4.9587, 0.000092, 50.84us       4.9336, 0.000091, 47.38us            4.9696, 0.000092, 158.49us      4.9347, 0.000091, 237.39us      4.9303, 0.000091, 965.13us
(7, 7, 7)                       13.5409, 0.000091, 45.60us      13.5736, 0.000092, 87.45us           13.5012, 0.000091, 141.63us     13.6111, 0.000092, 181.51us     13.5296, 0.000091, 469.77us
(9, 9, 9)                       28.7817, 0.000091, 58.01us      28.7969, 0.000091, 77.61us           28.8761, 0.000092, 159.33us     28.8786, 0.000092, 334.47us     28.8093, 0.000091, 786.72us
(11, 11, 11)                    52.4453, 0.000091, 78.19us      52.7265, 0.000092, 95.12us           52.7322, 0.000092, 200.38us     52.6342, 0.000092, 282.41us     52.6467, 0.000092, 652.54us
(13, 13, 13)                    85.7411, 0.000090, 98.85us      86.7183, 0.000091, 115.28us          86.8545, 0.000092, 232.34us     86.9997, 0.000092, 367.32us     86.9083, 0.000092, 757.73us
Float
half
Discepancies 0.03963914513587952 9.175728040712852e-05 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

<details><summary>1.8.0</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0023, 0.002275, 74.35us       0.0040, 0.003985, 159.73us        0.3740, 0.374021, 546.59us      0.4587, 0.458663, 1543.16us       0.4906, 0.490637, 5945.97us
(3, 3, 3)                       0.0100, 0.000370, 20.37us       0.0230, 0.000852, 22.12us         0.0309, 0.001143, 54.75us       0.0520, 0.001926, 129.78us        7.1219, 0.263775, 377.11us
(5, 5, 5)                       0.0441, 0.000352, 20.06us       0.0394, 0.000316, 20.50us         0.0759, 0.000607, 26.43us       0.1499, 0.001199, 32.01us         0.2707, 0.002166, 128.15us
(7, 7, 7)                       0.0791, 0.000231, 20.10us       0.1002, 0.000292, 20.56us         0.1812, 0.000528, 20.48us       0.2424, 0.000707, 20.83us         0.4994, 0.001456, 43.97us
(9, 9, 9)                       0.1122, 0.000154, 20.55us       0.1778, 0.000244, 20.44us         0.2572, 0.000353, 20.15us       0.4149, 0.000569, 35.64us         0.7208, 0.000989, 68.46us
(11, 11, 11)                    0.2044, 0.000154, 20.47us       0.2647, 0.000199, 20.62us         0.3867, 0.000291, 20.61us       0.6059, 0.000455, 23.54us         1.0902, 0.000819, 53.32us
(13, 13, 13)                    0.3094, 0.000141, 20.53us       0.3843, 0.000175, 20.60us         0.5756, 0.000262, 20.80us       0.8598, 0.000391, 24.52us         1.4853, 0.000676, 47.70us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0054, 0.001801, 74.36us       0.0108, 0.003614, 158.94us        1.1183, 0.372768, 547.67us      1.3782, 0.459387, 1545.27us       1.4685, 0.489505, 5949.17us
(3, 3, 3)                       0.0308, 0.000380, 20.14us       0.0502, 0.000619, 22.11us         0.1210, 0.001493, 54.80us       0.1900, 0.002345, 130.47us        21.3483, 0.263560, 375.68us
(5, 5, 5)                       0.1179, 0.000314, 20.68us       0.1326, 0.000354, 20.53us         0.2662, 0.000710, 26.51us       0.4116, 0.001098, 31.85us         0.8369, 0.002232, 128.19us
(7, 7, 7)                       0.2335, 0.000227, 20.40us       0.3057, 0.000297, 20.43us         0.4954, 0.000481, 20.31us       0.7339, 0.000713, 20.74us         1.4208, 0.001381, 44.55us
(9, 9, 9)                       0.3326, 0.000152, 20.63us       0.5353, 0.000245, 20.42us         0.8025, 0.000367, 20.13us       1.2693, 0.000580, 35.64us         2.2096, 0.001010, 68.88us
(11, 11, 11)                    0.6121, 0.000153, 20.59us       0.8086, 0.000202, 20.42us         1.1700, 0.000293, 20.71us       1.8170, 0.000455, 23.54us         3.2117, 0.000804, 53.36us
(13, 13, 13)                    0.9165, 0.000139, 20.51us       1.1395, 0.000173, 20.56us         1.7343, 0.000263, 20.80us       2.5868, 0.000392, 24.59us         4.5823, 0.000695, 47.77us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.1092, 0.002023, 75.45us       0.1709, 0.003165, 160.44us        20.2452, 0.374911, 548.61us     24.7990, 0.459240, 1550.34us      26.4494, 0.489804, 6957.79us
(3, 3, 3)                       0.5352, 0.000367, 20.58us       1.0281, 0.000705, 24.14us         2.0150, 0.001382, 59.12us       3.3069, 0.002268, 138.23us        384.5216, 0.263732, 529.71us
(5, 5, 5)                       2.0739, 0.000307, 20.60us       2.5199, 0.000373, 20.44us         4.6916, 0.000695, 33.89us       7.9482, 0.001178, 37.74us         14.2553, 0.002112, 200.54us
(7, 7, 7)                       4.2236, 0.000228, 20.61us       5.5605, 0.000300, 20.97us         9.0440, 0.000488, 26.40us       12.7847, 0.000690, 30.64us        25.3050, 0.001366, 88.05us
(9, 9, 9)                       6.0817, 0.000154, 20.63us       9.5416, 0.000242, 20.84us         14.2416, 0.000362, 32.47us      22.8452, 0.000580, 78.57us        40.3246, 0.001024, 194.50us
(11, 11, 11)                    11.1144, 0.000155, 20.56us      14.5581, 0.000203, 20.91us        20.8263, 0.000290, 38.07us      33.0004, 0.000459, 52.74us        57.3275, 0.000798, 137.19us
(13, 13, 13)                    16.5176, 0.000139, 21.26us      20.8089, 0.000175, 22.33us        31.3433, 0.000264, 42.93us      45.9733, 0.000388, 59.84us        82.8301, 0.000698, 138.42us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0274, 0.001715, 74.99us       0.0485, 0.003034, 159.92us    5.9925, 0.374529, 546.35us      7.3389, 0.458679, 1544.53us     7.8354, 0.489714, 6677.00us
(3, 3, 3)                       0.1560, 0.000361, 20.72us       0.3043, 0.000704, 22.37us     0.5838, 0.001352, 54.97us       1.0455, 0.002420, 130.57us      113.9739, 0.263828, 463.43us
(5, 5, 5)                       0.6121, 0.000306, 20.12us       0.7247, 0.000362, 20.73us     1.3740, 0.000687, 26.59us       2.3794, 0.001190, 32.12us       4.1929, 0.002096, 165.81us
(7, 7, 7)                       1.2389, 0.000226, 20.59us       1.6311, 0.000297, 20.53us     2.6732, 0.000487, 20.37us       3.7501, 0.000683, 20.71us       7.4575, 0.001359, 59.16us
(9, 9, 9)                       1.7983, 0.000154, 20.64us       2.8075, 0.000241, 20.59us     4.2165, 0.000361, 20.38us       6.7153, 0.000576, 38.29us       12.0530, 0.001033, 86.33us
(11, 11, 11)                    3.3326, 0.000156, 20.56us       4.3061, 0.000202, 20.67us     6.2235, 0.000292, 20.47us       9.8009, 0.000460, 27.41us       16.9994, 0.000798, 68.49us
(13, 13, 13)                    4.9016, 0.000139, 20.63us       6.1261, 0.000174, 20.65us     9.2106, 0.000262, 20.93us       13.5843, 0.000386, 27.95us      24.6476, 0.000701, 64.88us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0170, 0.002122, 74.99us       0.0316, 0.003946, 160.66us    3.0013, 0.375158, 546.94us      3.6780, 0.459753, 1544.58us     3.9197, 0.489966, 5948.43us
(3, 3, 3)                       0.0821, 0.000380, 20.27us       0.1559, 0.000722, 22.29us     0.3133, 0.001450, 54.72us       0.5100, 0.002361, 130.12us      57.0481, 0.264111, 376.71us
(5, 5, 5)                       0.3075, 0.000307, 20.57us       0.3680, 0.000368, 20.69us     0.6786, 0.000679, 26.61us       1.1744, 0.001174, 31.77us       2.0654, 0.002065, 128.31us
(7, 7, 7)                       0.6512, 0.000237, 20.60us       0.8359, 0.000305, 20.50us     1.3712, 0.000500, 20.75us       1.9472, 0.000710, 20.92us       3.7586, 0.001370, 44.59us
(9, 9, 9)                       0.9138, 0.000157, 20.43us       1.4198, 0.000243, 20.58us     2.1018, 0.000360, 20.52us       3.3691, 0.000578, 35.90us       5.9491, 0.001020, 69.16us
(11, 11, 11)                    1.6606, 0.000156, 20.63us       2.1599, 0.000203, 20.57us     3.1240, 0.000293, 20.98us       4.8874, 0.000459, 24.65us       8.4780, 0.000796, 56.47us
(13, 13, 13)                    2.4987, 0.000142, 20.71us       3.0667, 0.000174, 20.45us     4.6387, 0.000264, 20.76us       6.8187, 0.000388, 25.95us       12.2077, 0.000695, 50.46us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.2635, 0.002059, 75.66us       0.4030, 0.003149, 161.78us    48.0296, 0.375231, 550.46us     58.7787, 0.459209, 1902.41us    62.6966, 0.489817, 7817.48us
(3, 3, 3)                       1.2271, 0.000355, 20.72us       2.4185, 0.000700, 26.44us     4.6933, 0.001358, 64.66us       7.7016, 0.002228, 192.69us      912.0736, 0.263910, 593.69us
(5, 5, 5)                       4.8716, 0.000304, 24.75us       5.8624, 0.000366, 21.39us     11.0705, 0.000692, 66.94us      18.9280, 0.001183, 104.93us     34.0512, 0.002128, 441.81us
(7, 7, 7)                       10.1713, 0.000232, 20.98us      13.2273, 0.000301, 36.26us    21.5426, 0.000491, 52.18us      30.1910, 0.000688, 72.94us      59.8381, 0.001363, 191.52us
(9, 9, 9)                       14.4542, 0.000155, 23.85us      22.6579, 0.000243, 30.59us    33.8839, 0.000363, 57.40us      54.3563, 0.000583, 142.53us     95.8123, 0.001027, 309.24us
(11, 11, 11)                    26.3348, 0.000155, 30.07us      34.3043, 0.000201, 37.01us    49.8093, 0.000292, 74.04us      78.3720, 0.000460, 110.53us     136.5404, 0.000801, 264.14us
(13, 13, 13)                    39.3550, 0.000140, 37.38us      49.3207, 0.000175, 43.51us    74.1139, 0.000264, 83.70us      108.7627, 0.000387, 136.09us    196.5412, 0.000699, 280.16us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.8467, 0.001960, 147.36us      1.3993, 0.003239, 314.95us    162.0182, 0.375042, 1327.22us   198.3226, 0.459080, 3921.79us   211.6123, 0.489843, 15646.94us
(3, 3, 3)                       4.3146, 0.000370, 29.23us       8.1125, 0.000696, 74.94us     15.8886, 0.001362, 223.69us     26.2404, 0.002250, 601.33us     3076.5354, 0.263763, 1974.06us
(5, 5, 5)                       16.5032, 0.000306, 58.79us      19.6887, 0.000365, 53.79us    37.2731, 0.000690, 192.34us     63.3076, 0.001172, 270.01us     114.8880, 0.002128, 1148.56us
(7, 7, 7)                       34.0802, 0.000230, 51.12us      44.4087, 0.000300, 100.93us   72.4613, 0.000489, 161.48us     101.9317, 0.000688, 202.91us    201.8955, 0.001363, 545.33us
(9, 9, 9)                       48.8179, 0.000155, 65.78us      76.3465, 0.000242, 87.48us    114.0228, 0.000362, 179.11us    182.9805, 0.000581, 403.66us    322.7040, 0.001025, 894.86us
(11, 11, 11)                    88.9993, 0.000155, 88.69us      116.4213, 0.000202, 107.55us  168.3363, 0.000293, 228.71us    264.2232, 0.000460, 322.84us    459.1324, 0.000799, 784.25us
(13, 13, 13)                    132.7447, 0.000140, 112.91us    165.4525, 0.000174, 131.08us  249.7127, 0.000263, 266.43us    367.0824, 0.000387, 410.17us    663.1367, 0.000699, 847.87us
Float
half
Discrepancies 198.37625122070312 0.4592042852331091 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

ngimel malfet anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53607

Reviewed By: mruberry

Differential Revision: D27652337

Pulled By: ngimel

fbshipit-source-id: 6439c0cafe6ca3f761a3f5d058050a55e9a0abd8
2021-04-08 15:48:08 -07:00
cc11aaaa60 Disallow non-breaking spaces (#55465)
Summary:
malfet found a couple of these in https://github.com/pytorch/pytorch/issues/55346; this PR removes the rest and adds a lint that prevents them from being accidentally added again in the future. It also removes the `-o` flag added in https://github.com/pytorch/pytorch/issues/53733 (which was unnecessarily hiding context without reducing the number of lines of output), and updates the lint error messages to reflect that the individual line numbers are shown in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55465

Test Plan:
The "Lint / quick-checks" job in GitHub Actions should succeed on this PR. To verify that the lint does correctly find and error on non-breaking spaces, checkout ece075195d49c25213c96b9d53fcf7077215f44a and run it locally:
```sh
(! git --no-pager grep -In $'\u00a0' -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
```
It should print over a hundred lines of output and exit with status 1.

Reviewed By: janeyx99

Differential Revision: D27622136

Pulled By: samestep

fbshipit-source-id: e7ffd5a9519093e7a0ffdf55e9291f63e21ce841
2021-04-08 15:44:44 -07:00
bf882929f1 [skip ci] Add explanation for why we split TORCH_CUDA_API (#55641)
Summary:
Provide explanation for why we have (and use) the BUILD_SPLIT_CUDA option as a result of PR https://github.com/pytorch/pytorch/pull/49050.

This should hopefully clarify why there is both TORCH_CUDA_CU_API and TORCH_CUDA_CPP_API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55641

Reviewed By: samestep

Differential Revision: D27661729

Pulled By: janeyx99

fbshipit-source-id: a68b44df2b45ce10590b9b0229558a1ad40ce485
2021-04-08 15:40:48 -07:00
fc45ff8177 [skip ci] Document '[skip ci]' (#55418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55418

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D27609269

Pulled By: driazati

fbshipit-source-id: 6ce562950ee35e029f0bfa3d0fffbbcc28265a7a
2021-04-08 15:36:29 -07:00
364639041f Revert D27121170: [torch] Add cuda support for segment reduction 'max'
Test Plan: revert-hammer

Differential Revision:
D27121170 (eb5e1fc713)

Original commit changeset: 1c2565f42e29

fbshipit-source-id: 3dd394edcf5ef53c27098b4d0a1dd6fbbabdd506
2021-04-08 15:30:58 -07:00
55db156229 remove test_jit_py3.py entirely (#55560)
Summary:
1. move module related stuff to test_module_container
2. created test_types for types and annotation
3. created test_misc for the rest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55560

Reviewed By: VitalyFedyunin

Differential Revision: D27650911

Pulled By: walterddr

fbshipit-source-id: d895a7da9e9c3d25a662a37faf4daabc276b9c1a
2021-04-08 14:28:54 -07:00
305abde976 Fix nvcc warnings (#55367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55367

During compilation, nvcc emits several warnings about unused variables and static functions:

```
caffe2/aten/src/ATen/native/cuda/SpectralOps.cu(231): warning: function "at::native::_run_cufft" was declared but never referenced

caffe2/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(60): warning: function "at::native::<unnamed>::confirm_mult_size" was declared but never referenced

caffe2/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(112): warning: function "at::native::nearbyint_wrapper(c10::complex<double>)" was declared but never referenced

caffe2/aten/src/ATen/native/cuda/TensorFactories.cu(106): warning: variable "d_temp_storage" was declared but never referenced

caffe2/torch/fb/sparsenn/sparsenn_operators_gpu.cu(2325): warning: variable "kMaxThreads" was declared but never referenced
```

To reproduce, run the following build command on remote/master:
```
buck build mode/dev-nosan caffe2/torch/fb/sparsenn:sparsenn_operators_gpu
```

Warnings about unused variables are fixed by removing the variable declaration. However, I don't want to remove the unused static functions. They were probably used before some other part of the code was refactored, and they might be useful again in the future. So, I added #pragma directives to disable warnings for such functions.

Test Plan: Compilation does not produce warnings any more.

Reviewed By: r-barnes

Differential Revision: D27577342

fbshipit-source-id: e6a6e5ec513996337d904985dd27c60601c74803
2021-04-08 13:42:44 -07:00
eb5e1fc713 [torch] Add cuda support for segment reduction 'max' (#54175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54175

Building on top of previous PR. This PR adds cuda support for 1D max reduction.
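
A minimal usage sketch, assuming the prototype `torch.segment_reduce` entry point this stack introduces (the exact name and signature may still change at this stage):

```python
import torch

data = torch.tensor([1., 2., 3., 4., 5., 6.], device="cuda")
lengths = torch.tensor([2, 3, 1], device="cuda")

# "max" within each consecutive segment of `data` delimited by `lengths`:
# [max(1, 2), max(3, 4, 5), max(6)]
out = torch.segment_reduce(data, "max", lengths=lengths)
print(out)  # tensor([2., 5., 6.], device='cuda:0')
```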

Next steps:
- Add support for other major reduction types (e.g. min, sum) for 1D tensor
- Documentation for the op
- Perf optimizations and benchmark util
- Backward support  (not high priority)
- Support for multi dimensional tensors (on data and lengths) (not high priority)
- Support for 'indices' (not high priority)

Test Plan: Added unit test

Reviewed By: ngimel

Differential Revision: D27121170

fbshipit-source-id: 1c2565f42e2903e6fc089d56983ce8857efbfa3c
2021-04-08 13:25:55 -07:00
778f9eab6c Don't switch streams when running Caffe2 ops from c10. (#55121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55121

This is done by allowing -1 as a stream ID, meaning "don't change
the stream", in SwitchToDevice

Fixes #54830

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27527544

Pulled By: ezyang

fbshipit-source-id: c54983d6fc79a8fa1c65a71559a57425e40ba717
2021-04-08 13:21:11 -07:00
adc65974b2 Run ShellCheck on scripts in GitHub Actions workflows (#55486)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/55314.

- [x] Extract shell scripts from `.github/workflows/*.yml` into `.shellcheck_generated` dir
- [x] Run ShellCheck on `.shellcheck_generated`
- [x] Fail if any of the extracted scripts contain [GitHub Actions expressions][]: `${{ <expression> }}`
- [x] Fix the newly-surfaced warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55486

Test Plan:
Locally run the "ShellCheck" step from "Lint / quick-checks".

[github actions expressions]: https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#about-contexts-and-expressions

Reviewed By: malfet

Differential Revision: D27627590

Pulled By: samestep

fbshipit-source-id: 8a22c6743e11b3059506043735f100efdd7c5a26
2021-04-08 13:15:00 -07:00
960b40156c [6/n][torch/elastic][upstream] Move torchelastic/distributed/api to torch/distributed/elastic/launchers/api (#55471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55471

Move torchelastic/distributed/api to torch/distributed/elastic/launchers/api

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
    buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...

SyncSGD: tsm_aivanou-SparseNNApplication_432fc009

f263322216

Reviewed By: wilson100hong

Differential Revision: D27614353

fbshipit-source-id: a3b58fac2ebf803b8da5852ae2be0851b1cca695
2021-04-08 12:30:25 -07:00
fd450ff1b9 Revert D27598681: Add OpInfo tests for torch.addbmm
Test Plan: revert-hammer

Differential Revision:
D27598681 (b5647dd52b)

Original commit changeset: 24082f54b12e

fbshipit-source-id: 43d5713829fbaa00353bb7b054b66f537d768cd1
2021-04-08 11:38:49 -07:00
2564c0c889 avoid CPU std::copysign segfault when compiling on arm64 (take-2) (#55608)
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/51834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55608

Reviewed By: ngimel

Differential Revision: D27649077

Pulled By: malfet

fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
2021-04-08 11:34:09 -07:00
11add8f45f Add --suppress-diagnostics option (#55612)
Summary:
Add option to add //NOLINTNEXTLINE for every detected violation

Series of automated huge diffs are coming after this one to make large chunks of code clang-tidy

PR generated by new option: https://github.com/pytorch/pytorch/pull/55628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55612

Reviewed By: samestep

Differential Revision: D27649473

Pulled By: malfet

fbshipit-source-id: 251a68fcc50bf0fd69c6566293d4a516c0ab24c8
2021-04-08 11:32:32 -07:00
ad823888a1 [FX] Speed up _Namespace.create_name (#55580)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55580

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27641156

Pulled By: jamesr66a

fbshipit-source-id: d2443d41c8d84dddb1794a7901e2d09ae3639846
2021-04-08 10:59:42 -07:00
60263e0f5a OpInfo porting for torch.maximum / torch.minimum / torch.fmax / torch.fmin (#55129)
Summary:
Related https://github.com/pytorch/pytorch/issues/54261

This PR ports the method_tests() entries of following operators to OpInfo.
- torch.maximum
- torch.minimum
- torch.fmax
- torch.fmin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55129

Reviewed By: ngimel

Differential Revision: D27562189

Pulled By: mruberry

fbshipit-source-id: 9f25aeb09eb353080af43f25ea2e931474510aca
2021-04-08 10:14:38 -07:00
f665a7f8a1 [pet] Set error code in reply file when child process is terminated by signals.
Summary: Fill the reply file's error code with ProcessFailure's exit code. This is necessary when the child process is terminated by a signal (e.g. SIGSEGV).

Test Plan:
- Buck test
```
buck test mode/dev-nosan pytorch/elastic/torchelastic/distributed/fb/test:launch_test
buck test mode/dev-nosan caffe2/torch/distributed/elastic/multiprocessing/errors/fb/test:error_handler_fb_test_needed_coverage
```

- TSM
```
fbpkg build -E torchelastic_distributed_sum

buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:ecdf31f --nnodes 2 --nproc_per_node 2 --resource T1  --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.pa
```
https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_ef3fd8d3

- classy_vision
```
flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
```
https://our.intern.facebook.com/intern/fblearner/details/263970380/?notif_channel=cli

Reviewed By: tierex

Differential Revision: D27512554

fbshipit-source-id: 903d25d96655085685f874113826d4627d9a79e4
2021-04-08 09:58:20 -07:00
8b5da2f48d rename .pytorch-disabled-tests to disabled-tests.json (#55618)
Summary:
We shouldn't store it as a hidden file, and having the json extension is useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55618

Test Plan: https://github.com/janeyx99/gha-experiments/runs/2298470065?check_suite_focus=true in my own repo

Reviewed By: samestep

Differential Revision: D27651467

Pulled By: janeyx99

fbshipit-source-id: cd9b6c8d065f1ffdcabb0844d375ae8be7177e13
2021-04-08 09:28:52 -07:00
3f9492c8b3 [Hackathon] Modernize API used in NNC C++ tests (1/3) (#55512)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/55203

Fixes issues (1) and (2) in the following tests:
tests in test/cpp/tensorexpr/test_loopnest.cpp from the beginning to LoopNestReorderLongStringFull (including)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55512

Reviewed By: mrshenli

Differential Revision: D27630679

Pulled By: soulitzer

fbshipit-source-id: b581aaea4f5f54b3285f0348aa76e99779418f80
2021-04-08 08:34:25 -07:00
432df40d83 [Hackathon] Move python builtins to test_python_builtins.py (#55479)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55479

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27642098

Pulled By: nikithamalgifb

fbshipit-source-id: 8d92a7d0f6db63f3cc3f439cb75a8d809af9106d
2021-04-08 08:06:54 -07:00
7d56de1834 DOC: use autosummary on tensors.rst (#55042)
Summary:
Related to https://github.com/pytorch/pytorch/issues/52256

Splits tensors into a table-of-contents page and many sub-pages, one for each function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55042

Reviewed By: mrshenli

Differential Revision: D27628688

Pulled By: zou3519

fbshipit-source-id: 08e87700a8e7d5b3fba3f1949e29e988a42bf2c6
2021-04-08 06:44:23 -07:00
d3d7f57c2c Fix a problem when removing parametrizations (#55456)
Summary:
There was an error when removing a parametrization with `leave_parametrized=True`. It had escaped the previous tests. This PR should fix that.
**Edit.**
I also took this chance to fix a few mistakes in the documentation, and to write `set_original_` in a more compact way.
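
For reference, a minimal sketch of the code path being fixed, assuming the `torch.nn.utils.parametrize` API from this feature's stack:

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # parametrize the weight as a symmetric matrix
        return X.triu() + X.triu(1).transpose(-1, -2)

m = nn.Linear(3, 3)
parametrize.register_parametrization(m, "weight", Symmetric())
# leave_parametrized=True bakes the current parametrized value into m.weight
parametrize.remove_parametrizations(m, "weight", leave_parametrized=True)
assert torch.allclose(m.weight, m.weight.T)  # the baked value stays symmetric
```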

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55456

Reviewed By: mrshenli

Differential Revision: D27620481

Pulled By: albanD

fbshipit-source-id: f1298ddbcf24566ef48850c62a1eb4d8a3576152
2021-04-08 06:39:28 -07:00
473d193966 Use mkldnn copy for copy_ when self and src are Mkldnn layout (#54248)
Summary:
Currently, when copy_ is called with Mkldnn layout, a RuntimeError is raised.

**Environment**
- CPU : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
- PyTorch master(1772e26f6380d1)
- build with USE_MKLDNN=1

**Sample code to reproduce:**
```python
import torch

x = torch.randn(4, 5, dtype=torch.float32)
mkldnn_x = x.to_mkldnn()
mkldnn_y = torch.randn(4, 5, dtype=torch.float32).to_mkldnn()
mkldnn_y.copy_(mkldnn_x)

print(x)
print(mkldnn_y.to_dense())
```

**Results:**
Actual:
```sh
Traceback (most recent call last):
  File "mkldnn_copy.py", line 6, in <module>
    mkldnn_y.copy_(mkldnn_x)
RuntimeError: unsupported tensor layout: Mkldnn
```

Expected:
```sh
# x
tensor([[ 0.1276, -0.1179,  1.1970,  2.4836,  1.9059],
        [-1.9647,  0.8613, -0.5060,  0.1555,  0.3661],
        [-0.1560, -0.2133,  0.3414, -1.7095, -2.3431],
        [ 1.3291,  0.3083,  0.5523, -2.0577, -0.4740]])
# mkldnn_y
tensor([[ 0.1276, -0.1179,  1.1970,  2.4836,  1.9059],
        [-1.9647,  0.8613, -0.5060,  0.1555,  0.3661],
        [-0.1560, -0.2133,  0.3414, -1.7095, -2.3431],
        [ 1.3291,  0.3083,  0.5523, -2.0577, -0.4740]])
```

This is because `copy_` does not support Mkldnn layout.
So I modified `copy_` to call `copy_mkldnn_` when both `self` and `src` have Mkldnn layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54248

Reviewed By: mrshenli

Differential Revision: D27641352

Pulled By: ezyang

fbshipit-source-id: 70a37cdacb4a40b250ca16f2f6ddb6b71ff52d90
2021-04-08 06:35:39 -07:00
b5647dd52b Add OpInfo tests for torch.addbmm (#55378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55378

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27598681

Pulled By: anjali411

fbshipit-source-id: 24082f54b12e6346b81c9b6a6e20714e8fd94a9b
2021-04-08 05:48:23 -07:00
f1a0b817f0 [pthreadpool] Apply cap for macos builds (#55435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55435

We've seen reports from the macOS Skylight app that PyTorch is very slow due to the lack of a thread-count cap in pthreadpool. For Mac builds, we cap the thread count at `#threads/2`.
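
For illustration only, the analogous cap expressed through the public Python API (the actual change lives in the native pthreadpool setup):

```python
import multiprocessing
import torch

# cap intra-op parallelism at half the available cores
torch.set_num_threads(max(1, multiprocessing.cpu_count() // 2))
print(torch.get_num_threads())
```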
ghstack-source-id: 125900852

Test Plan:
- Sandcastle CI
- CircleCI

Reviewed By: kimishpatel

Differential Revision: D27578871

fbshipit-source-id: 7b947bc5d6cf289378abf5f479575e112325d02b
2021-04-08 03:56:12 -07:00
f88a3fff65 Set requires_gradient to help autodiff to prune unneeded gradients (#54374)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54040
`prim::RequiresGradCheck` guarantees that the requires_grad properties
of the input tensors match the profiled ones; otherwise a fallback path
is triggered. This allows us to prune gradients in the backward
graph for inputs that don't need them. We transfer the requires_grad
properties from the inputs of the `prim::DifferentiableGraph` node onto
the inputs of the differentiable graph. Autodiff inspects these
properties and prunes the gradients that aren't required.
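
A minimal sketch of the effect on a scripted function (profiling executor assumed):

```python
import torch

@torch.jit.script
def fn(x, y):
    return (x * y).sum()

x = torch.randn(4, requires_grad=True)
y = torch.randn(4)  # requires_grad=False: its gradient can be pruned

# warm-up runs let the profiling executor record requires_grad info
for _ in range(3):
    out = fn(x, y)
out.backward()
print(x.grad is not None)  # True; no gradient needs to be computed for y
```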

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54374

Reviewed By: H-Huang

Differential Revision: D27369251

Pulled By: Krovatkin

fbshipit-source-id: 2bce7a2d7f2ec091db9bf4c4b91d8b29edd5be11
2021-04-08 03:15:40 -07:00
37d1b39413 OpInfo: atan2 (#55132)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55132

Reviewed By: mrshenli

Differential Revision: D27615135

Pulled By: mruberry

fbshipit-source-id: 22fa1a225b9a75eb478797316e4462d4af4e8826
2021-04-08 01:21:46 -07:00
902bf0bbbe [special] Alias for sigmoid and logit & follow-up (#54759)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Changes:
* Alias for sigmoid and logit (usage sketch below)
* Adds out variants for the C++ API
* Updates docs to link back to the `special` documentation
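
A quick usage sketch of the new aliases, assuming they land as `torch.special.expit` and `torch.special.logit`:

```python
import torch

x = torch.randn(4)
print(torch.allclose(torch.special.expit(x), torch.sigmoid(x)))  # True

p = torch.rand(4)
print(torch.allclose(torch.special.logit(p), torch.logit(p)))    # True
```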

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54759

Reviewed By: mrshenli

Differential Revision: D27615208

Pulled By: mruberry

fbshipit-source-id: 8bba908d1bea246e4aa9dbadb6951339af353556
2021-04-08 00:56:59 -07:00
f4967d68f5 make torch.testing asserts importable (#54769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54769

Follow-up to #53820. This

- makes the `asserts.py` module private as per suggestion from rgommers in https://github.com/pytorch/pytorch/pull/53820#issuecomment-802661387. With this the functions should only be accessible through `torch.testing`, giving us the option the change the underlying structure later.
- moves the code from `torch/testing/__init__.py` to `torch/testing/_core.py` (happy to accept other name suggestions). Otherwise we can't import the new `_asserts.py` in `torch/testing/__init__.py` due to circular imports.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27438451

Pulled By: mruberry

fbshipit-source-id: c7292b4d5709185b42b4aac8016648562688040e
2021-04-07 23:53:02 -07:00
ffe301846b [Hackathon] Add error source range highlighting check in test_hash and test_list_dict (#55490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55490

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27628697

Pulled By: tugsbayasgalan

fbshipit-source-id: 694226f0b083606f665569e6a84d547026c7f19f
2021-04-07 23:48:51 -07:00
3517ee1bcb Fix ordered_dict.h for CUDA on Windows (#55275)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55275

Reviewed By: mrshenli

Differential Revision: D27623887

Pulled By: malfet

fbshipit-source-id: 6dac357e21179a259ac95f0e1b7399b03dacc81d
2021-04-07 23:43:35 -07:00
0dff0d1537 [ROCM] Disable few tests for Magma (#55534)
Summary:
After MAGMA has been enabled, around 5k new tests are running now.
Out of these 5 tests (each having 4 datatypes) are failing on the latest ROCM
CI with Rocm 4.1.  Disabling these tests for now so the ROCM CI does not fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55534

Reviewed By: ZolotukhinM

Differential Revision: D27630085

Pulled By: malfet

fbshipit-source-id: c48d124e6a2b4a4f3c6c4b6ac2bdf6c214f325c7
2021-04-07 22:22:43 -07:00
ec38dda1cc Remove extra close bracket in extending.rst (#55409)
Summary:
Small typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55409

Reviewed By: pbelevich

Differential Revision: D27611177

Pulled By: jamesr66a

fbshipit-source-id: 8a5ff702e4ab8a7eb2403432889f8b7a5a69484b
2021-04-07 21:15:46 -07:00
493a233c04 [torch/elastic] Revise the rendezvous handler registry logic. (#55466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466

Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27623215

fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
2021-04-07 20:43:20 -07:00
8ac0619784 Avoid infinite recursion in __torch_function__ example (#55391)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55284

This gets the example to run but probably doesn't help the readability of the example.

Thoughts?
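
For context, a minimal sketch of the non-recursive pattern, loosely following the `ScalarTensor` example in extending.rst (details here are illustrative): the fallback path must convert wrapper arguments to plain tensors instead of calling `func` on them again, which is what recursed.

```python
import torch

class ScalarTensor:
    def __init__(self, N, value):
        self._N = N
        self._value = value

    def tensor(self):
        return self._value * torch.eye(self._N)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        # unwrap to plain Tensors before falling back; calling func on
        # ScalarTensor args again would re-enter __torch_function__ forever
        args = [a.tensor() if isinstance(a, ScalarTensor) else a for a in args]
        return func(*args, **kwargs)

print(torch.mm(ScalarTensor(2, 3.0), ScalarTensor(2, 2.0)))
```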

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55391

Reviewed By: mrshenli

Differential Revision: D27621096

Pulled By: ezyang

fbshipit-source-id: d02c4fb0001e54139a167b477fd3b4a229e4dc8c
2021-04-07 20:31:46 -07:00
b39eeb07ed Revert D27622277: [pytorch][PR] avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA
Test Plan: revert-hammer

Differential Revision:
D27622277 (3bb1f59a9c)

Original commit changeset: a1dc4c3a67f9

fbshipit-source-id: 352443cec6ae0ba794e559f92578192cefbe2ab4
2021-04-07 18:25:32 -07:00
d6cbecbbb6 [PyTorch] Reapply D27404164: Devirtualize is_contiguous (#55333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55333

Reapplying without using enum class in a bitfield. See new
comments about gcc bug.
ghstack-source-id: 125776904

Test Plan: Carefully review OSS test failure logs this time

Reviewed By: kimishpatel, bhosmer

Differential Revision: D27576623

fbshipit-source-id: 68fb00e5ff5215e56c8b9bc02717e1e7b2fedf9b
2021-04-07 18:20:33 -07:00
e359842f23 Strict typecheck all files in tools/codegen (#55227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55227

This seems to increase the number of typechecked files.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D27535373

Pulled By: ezyang

fbshipit-source-id: b36f6f8ce52c76848ed600ca9dd6b0c1de5813ff
2021-04-07 18:06:41 -07:00
384daacd1e [Hackathon] Add source range info for tests in test_module_containers (#55500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55500

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27628752

Pulled By: tugsbayasgalan

fbshipit-source-id: 3b0a1a1daae4d701be2358f912cba839844b2a44
2021-04-07 16:46:30 -07:00
b3f1fece1b [Hackathon] add highlight to test_module_interface.py (#55530)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55530

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27628729

Pulled By: tugsbayasgalan

fbshipit-source-id: 4d7d2d56f0475c4311fe68ff336c073b564e02fa
2021-04-07 16:42:19 -07:00
524dbe1fa1 [Easy] Fix typo in package_exporter.py (#55551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55551

Simple typo, it should be `OrderedImporter`

Test Plan: ci

Differential Revision: D27629463

fbshipit-source-id: 745527a8339f03a8fd38d0a4491811b3c9ca9b1e
2021-04-07 16:30:07 -07:00
f0ce8593db [Hackathon] Add source highlight check in test_torchbind (#55495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55495

Reviewed By: janeyx99

Differential Revision: D27627499

Pulled By: gmagogsfm

fbshipit-source-id: 6749d7f58a98f40d6f301c6f37321ec85707242e
2021-04-07 16:17:22 -07:00
0f1350055b [Hackathon] Add source range highlight check to test_with (#55513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55513

Reviewed By: janeyx99

Differential Revision: D27627520

Pulled By: gmagogsfm

fbshipit-source-id: 132f4dd2e99d2b5981fdd1522dbf7727b6abf7ab
2021-04-07 16:14:10 -07:00
94a3bad343 [Hackathon] Add source highlighting check in test_type_sharing (#55498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55498

Reviewed By: janeyx99

Differential Revision: D27627506

Pulled By: gmagogsfm

fbshipit-source-id: abdd2a505099d3976762b4851d1024eb791e9204
2021-04-07 16:12:17 -07:00
5f90ed550c [Hackathon] Add source range highligh check to test_string_formatting (#55491)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55491

Reviewed By: janeyx99

Differential Revision: D27627477

Pulled By: gmagogsfm

fbshipit-source-id: 4586d7c96eae762be53c1155c6c724c6d65f1e7f
2021-04-07 16:08:30 -07:00
b9326d418d [Hackathon] Add error source range highlighting check in test_scriptmod_ann (#55482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55482

Reviewed By: janeyx99

Differential Revision: D27627460

Pulled By: gmagogsfm

fbshipit-source-id: 099e36f36561c9252b027c8f89b301108133b0a7
2021-04-07 15:46:43 -07:00
5d78c4f701 Use assertRaisesRegexWithHighlight test_class_type.py (#55510)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55510

Reviewed By: gmagogsfm

Differential Revision: D27627208

Pulled By: janeyx99

fbshipit-source-id: 6cfbd4523ebd9803496fbdc5128b91110e87e07a
2021-04-07 15:06:06 -07:00
11889a51ed Use assertRaisesRegexWithHighlight test_enum.py (#55521)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55521

Reviewed By: gmagogsfm

Differential Revision: D27627616

Pulled By: janeyx99

fbshipit-source-id: fabdec52729087b9ae693b14a0bc11c596003035
2021-04-07 15:00:45 -07:00
b665298dc8 Use assertRaisesRegexWithHighlight test_custom_operators.py (#55519)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55519

Reviewed By: gmagogsfm

Differential Revision: D27627487

Pulled By: janeyx99

fbshipit-source-id: 6bd54433617180c56153785b69c2e49faf19ef34
2021-04-07 14:39:52 -07:00
469734ae54 Replace assertRaisesRegex w/ assertRaisesRegexWithHighlight test_builtins (#55496)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55496

Reviewed By: gmagogsfm

Differential Revision: D27627271

Pulled By: janeyx99

fbshipit-source-id: c59c93018dbb5051e1e49b66298e9caf779b438b
2021-04-07 14:37:01 -07:00
b1bae01e0c Replace raiseRegex with raiseRegexWithHighlight test_async.py (#55489)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55489

Reviewed By: gmagogsfm

Differential Revision: D27625872

Pulled By: janeyx99

fbshipit-source-id: 1ee606a30b13d041d8d107e6cc23be16c076d072
2021-04-07 14:30:23 -07:00
a20a72d41b Replace assertRaisesRegex w/ assertRaisesRegexWithHighlight test_backends.py (#55493)
Summary:
Step to resolving https://github.com/pytorch/pytorch/issues/55072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55493

Reviewed By: gmagogsfm

Differential Revision: D27626192

Pulled By: janeyx99

fbshipit-source-id: 047b7b6754e21388f52489160d712858b7aa0288
2021-04-07 14:30:20 -07:00
e6bfff679d [ONNX] Add hardsigmoid symbolic in opset 9 #49649 (#54193)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49649
Adds support for torch.nn.Hardsigmoid operator in torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54193

Reviewed By: anjali411

Differential Revision: D27522969

Pulled By: SplitInfinity

fbshipit-source-id: 33abcec578f4bc3cf5c3ee1c1bed7d94816bee96
2021-04-07 14:28:31 -07:00
2dd7dafb62 [Hackathon][take2] jit py3 move list dict tuple to jit/ (#55515)
Summary:
moved to jit/test_list_dict.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55515

Reviewed By: mrshenli

Differential Revision: D27627615

Pulled By: walterddr

fbshipit-source-id: 6b17a4d6535ae2d6d848532a4df2278d3aaefa7b
2021-04-07 14:25:04 -07:00
afd549bb8f [Doc] fix profiler doc (#55449)
Summary:
fix profiler doc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55449

Reviewed By: bdhirsh

Differential Revision: D27626301

Pulled By: mrshenli

fbshipit-source-id: e9540fa0022c764371c785ca4079797d17532417
2021-04-07 14:16:47 -07:00
1e70d217e7 Add error message for complex alpha and non-complex inputs (#54964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54964

Previously, the following would error out with a strange error message:
```
import torch
x=torch.randn(2)
torch.rsub(x, 1, alpha=2j)

Traceback (most recent call last)
<ipython-input-2-caf2a1c03d0b> in <module>
      1 import torch
      2 x=torch.randn(2)
----> 3 torch.rsub(x, 1, alpha=2j)

RuntimeError: value cannot be converted to type float without overflow: (-0,-2)
```

This happens because the alpha check doesn't cover the case where `x` is not complex but `alpha` is.
The error gets thrown further along, in the implementation of torch.sub,
when it coerces `alpha` to the same dtype as the input tensor:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp#L53

This PR fixes the bad error message by adding a new check to the alpha check.

Test Plan:
- pytest test/test_binary_ufuncs.py
- NB: add, sub, and rsub all share the same alpha check. The test only tests it for torch.add, but that should be sufficient.

Reviewed By: gchanan

Differential Revision: D27504017

Pulled By: zou3519

fbshipit-source-id: 70b9aa75a7a4faaaa93f6ba235cae85998a91697
2021-04-07 14:12:34 -07:00
dd2bccafc5 nnc hackathon - use new APIs in tests (#55497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55497

Migrating some of the NNC API's used in testing, from this issue: https://github.com/pytorch/pytorch/issues/55203

I covered the second half of `test_loopnest.cpp`, and migrated (1) and (2) in the above issue: `LoopNest::getLoopStmtsFor`, `splitWithTail`, and `splitWithMask`

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27628625

Pulled By: bdhirsh

fbshipit-source-id: ec15efba45fae0bbb442ac3577fb9ca2f8023c2d
2021-04-07 13:03:25 -07:00
10abbb812a Support tensor subclasses in Torchscript (#54817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54817

Test Plan:
python test case

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27407723

fbshipit-source-id: 459b9067f07908026f94620c1cfa3e00e8b50a4e
2021-04-07 12:10:27 -07:00
b91d48877d Reland Fix reference cycle in sparse coalesce graph (#55404)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/52874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55404

Reviewed By: bdhirsh

Differential Revision: D27600438

Pulled By: albanD

fbshipit-source-id: f5c286638b324ad59be65657a016028af5e2b303
2021-04-07 12:02:42 -07:00
797d0c4c68 [Hackathon] Add error source range highlighting check in test_recursive_script.py (#55475)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55475

Reviewed By: janeyx99

Differential Revision: D27625464

Pulled By: gmagogsfm

fbshipit-source-id: 27cf508593f02a26ba63e58b9bbb125b9e90e1ea
2021-04-07 11:51:15 -07:00
0d1058fbc7 Revert D27625646: [pytorch][PR] move list dict and named tuple tests out of py3 and into test_list_dict.py
Test Plan: revert-hammer

Differential Revision:
D27625646 (1c78a4a733)

Original commit changeset: 2d68f0e24df2

fbshipit-source-id: 8ccae798e1c38b7df1320d767bcf281d2d466758
2021-04-07 11:47:22 -07:00
85fcadc059 [lite-interpreter] speed_benchmark_torch support BUILD_LITE_INTERPRETER (#55402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55402

Test Plan: Imported from OSS

Reviewed By: cccclai

Differential Revision: D27599824

Pulled By: IvanKobzarev

fbshipit-source-id: 3adbb8a16a785d3610404d71ef2d895904b1a8ef
2021-04-07 11:39:32 -07:00
1c78a4a733 move list dict and named tuple tests out of py3 and into test_list_dict.py (#55476)
Summary:
Hackathon: Split test_jit_py3 into jit/ individual tests

part 1: move Dict, List, NamedTuple

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55476

Reviewed By: nikithamalgifb

Differential Revision: D27625646

Pulled By: walterddr

fbshipit-source-id: 2d68f0e24df2c26ea73860b9d36669e2a6e4ff44
2021-04-07 11:29:44 -07:00
bfee8d0464 [Pytorch Edge] Dont cache inflated bundled inputs (#55181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55181

There can be a dramatic model-size delta between saving a model after calling generate_bundled_inputs_for_* and saving it before, due to the caching of the inflated tensor.

Removing the cache increases latency when asking for the bundled inputs multiple times. I don't think this matters, but it might for something like benchmarking?
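
A rough sketch of the API involved (names from `torch.utils.bundled_inputs`; the behavior described is the post-change one):

```python
import torch
import torch.utils.bundled_inputs

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = torch.jit.script(M())
torch.utils.bundled_inputs.augment_model_with_bundled_inputs(
    m, [(torch.zeros(2, 2),)])

# inputs are re-inflated on each call instead of being cached, so saving
# the model after this call no longer embeds the inflated tensors
(inp,) = m.get_all_bundled_inputs()[0]
print(inp.shape)
```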
ghstack-source-id: 125746773

Test Plan: unit tests.

Reviewed By: dreiss

Differential Revision: D27519487

fbshipit-source-id: 6ba22bff9c4e3a8d86c04627b7cbf47ca2d141b9
2021-04-07 10:46:43 -07:00
56cd1d366e [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27617241

fbshipit-source-id: a5f695a6ee34daf0acd970720565296d785e9eb1
2021-04-07 10:37:27 -07:00
f9a0bbbeb8 [DataPipe] Remove duplicate dataset (#54553)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54553

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27279301

Pulled By: ejguan

fbshipit-source-id: 112a83e7061e3f35dc517eb623bd9ca93c2f034c
2021-04-07 10:11:22 -07:00
f5675f8306 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412

The diff resolves a bug where worker processes could exit before the torchelastic process reads the return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently the worker process finishes its job after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (for mp.SimpleQueue the channel is a file descriptor, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted and the torchelastic process cannot find it).

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu

Differential Revision: D27602838

fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
2021-04-07 09:39:24 -07:00
3bb1f59a9c avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA (#51834)
Summary:
It seems that the std::copysign code introduced in https://github.com/pytorch/pytorch/issues/51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on a Jetson with the latest JetPack) and causes an internal compiler error (segfault) during compilation. This works around the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX {emoji:1f381} thank you {emoji:2764}.

After https://github.com/pytorch/pytorch/issues/51900 fixed this for CPU-only arm64 (eg Raspberry), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51834

Reviewed By: mrshenli

Differential Revision: D27622277

Pulled By: malfet

fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
2021-04-07 09:31:13 -07:00
af2beaf675 [profiler] Fix time discrepancy between legacy and kineto events (#55226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55226

Fixes a bug caused by using different clocks in legacy events. Also fixes two
small issues: memory events not using relative time, and a CUDA-side
discrepancy between the start and stop profiling events.

Test Plan: CI

Reviewed By: xuzhao9

Differential Revision: D27534920

fbshipit-source-id: 7a877367b3031660516c9c4fdda1bf47e77bcb3e
2021-04-07 09:20:19 -07:00
8e6e7dca09 [ROCm] if TEST_WITH_ROCM, only instantiate GPU device tests (#55069)
Summary:
Improves ROCm CI throughput by instantiating only for device tests that exercise the AMD GPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55069

Reviewed By: mrshenli

Differential Revision: D27610877

Pulled By: malfet

fbshipit-source-id: aa2b42b9f7611dca37cbb922790d7fe0f4be8dbd
2021-04-07 09:12:09 -07:00
17e5ba44f1 [testing] Support input samples where self is broadcasted. (#53014)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50747

Reference https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53014

Reviewed By: SplitInfinity, ngimel

Differential Revision: D27615320

Pulled By: mruberry

fbshipit-source-id: a48bccf06aef1ee8f66a89e6b2bbe79736700b2b
2021-04-07 08:20:27 -07:00
2e9eb5afa2 Use slow tests stats in common_utils (#55190)
Summary:
This is a step in adding automatic slowTest detection to our testing infrastructure. This uses stats (updated daily) in https://github.com/pytorch/test-infra/blob/master/stats/.pytorch-slow-tests to determine whether more tests need to be marked as slow as they are run.

More details in previous PR draft/proposal [here](https://github.com/pytorch/pytorch/pull/54456#issue-598388491), though I no longer think we need the third step as using a raw git file does not require much processing.

Upon looking at [logs](https://circleci.com/api/v1.1/project/github/pytorch/pytorch/12060292/output/107/0?file=true&allocation-id=606660dbd8e5857bcc2b2e0f-0-build%2F60DCA8CD) for the coverage tests as of the first commit [when I had not skipped the tests so we could see their actual times], here are some slow tests that weren't marked as slow before:
```
test_fn_gradgrad_unfold_cpu_complex128 (__main__.TestGradientsCPU) (172.554s)
test_matmul_4d_4d_complex_cpu (__main__.TestAutogradDeviceTypeCPU) (180.057s)
test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass) (94.737s)
```

And here is a test that wasn't actually slow but was still marked as slow based on stats:
```
test_trunc_normal (__main__.TestNNInit) ... ok (1.208s)
```

The new logs show the above tests as skipped (as they should be):
[Coverage Test 1](https://app.circleci.com/pipelines/github/pytorch/pytorch/296224/workflows/ba6c2917-51f8-4fb8-be57-90151c2e5c25/jobs/12126156) and [Coverage Test 2](https://app.circleci.com/pipelines/github/pytorch/pytorch/296224/workflows/ba6c2917-51f8-4fb8-be57-90151c2e5c25/jobs/12126155)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55190

Reviewed By: samestep

Differential Revision: D27566663

Pulled By: janeyx99

fbshipit-source-id: c13f8c676bb8eb15d9d697d224dbaef7df98aef3
2021-04-07 08:04:39 -07:00
b9a02128bc split nn.functional (#55038)
Summary:
Related to https://github.com/pytorch/pytorch/issues/52256

Splits torch.nn.functional into a table-of-contents page and many sub-pages, one for each function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55038

Reviewed By: gchanan

Differential Revision: D27502677

Pulled By: zou3519

fbshipit-source-id: 38e450a0fee41c901eb56f94aee8a32f4eefc807
2021-04-07 06:35:47 -07:00
263d8ef4ef docs: fix formatting for embedding_bag (#54666)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54666

Reviewed By: H-Huang

Differential Revision: D27411027

Pulled By: jbschlosser

fbshipit-source-id: a84cc174155bd725e108d8f953a21bb8de8d9d23
2021-04-07 06:32:16 -07:00
6fd20a8dea Back out "[pytorch][PR] [fix] torch.frac : Handle inf correctly"
Summary: Original commit changeset: 92c7309558ee

Test Plan: reverting D27566407 (ece075195d)

Differential Revision: D27618949

fbshipit-source-id: 7930251f4bc88e7991805d77a617a181d68a4880
2021-04-07 04:34:07 -07:00
c96f076248 Fix typo in extending.rst (#55408)
Summary:
Small typo in docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55408

Reviewed By: pbelevich

Differential Revision: D27611175

Pulled By: jamesr66a

fbshipit-source-id: a83a6220054c0411329792c7ac6afceb2b699f44
2021-04-07 03:46:01 -07:00
ece075195d [fix] torch.frac : Handle inf correctly (#52678)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/51948
Fixes : https://github.com/pytorch/pytorch/issues/52027

Depends On: https://github.com/pytorch/pytorch/issues/52660

TODO
* [x] Benchmark

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52678

Reviewed By: anjali411

Differential Revision: D27566407

Pulled By: heitorschueroff

fbshipit-source-id: 92c7309558ee41f8b9c641f791e8f84819c333e2
2021-04-07 02:27:56 -07:00
bc05867618 Separate TLS for InferenceMode (#55424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55238

I tried to avoid creating a new TLS entry, but InferenceMode::is_enabled()
is on a perf-critical path (the TensorImpl constructor), so it seems
worth adding one for it.
This PR removes one source of the instruction-count increase introduced by
https://github.com/pytorch/pytorch/pull/55008.
```
 λ ~ python compare.py
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f59097ef310>
     100  0x0000000004854750
    -100  0x0000000004854760
   -4400  c10::impl::tls_is_dispatch_key_included(...)
```
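
For context, the mode this TLS flag backs, as exposed on the Python side (a usage sketch; the perf-sensitive check itself lives in C++):

```python
import torch

x = torch.ones(3, requires_grad=True)
with torch.inference_mode():
    # tensors created here skip version counters and autograd tracking
    y = x * 2
print(y.requires_grad)  # False
```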

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27539230

Pulled By: ailzhang

fbshipit-source-id: e040877faef966dca3c2c3d5f9e9a80496c81415
2021-04-06 22:17:26 -07:00
82006ba460 [PyTorch Edge] Implement fb::jpeg_decode_to_NCHW (#55251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55251

Based on a discussion with dreiss and JacobSzwejbka, we decided to implement a flexible operator for decoding a bundled JPEG image that can return the image in BGR format with scaling and offsets applied, for the MaskRCNN operators, without calling `conv2d()` and pulling in a ton of additional operators and kernel functions. Please see the previous diff in the stack for the new operators that the change would have pulled in without this diff, since the inflatable-arg string is non-trivial.

This change implements that operator. Please see the comments in the code for detail regarding what the operator does.
ghstack-source-id: 125641068

Test Plan:
I re-implemented the existing operator in terms of the new operator and used the existing unit test to ensure that the same (or comparable) tensor is produced.

```
cd fbsource/fbcode/
buck test caffe2/test:test_bundled_images
```

Ran this bento notebook https://www.internalfb.com/intern/anp/view/?id=476100 with the new operator `fb::jpeg_decode_to_NCHW` and saw that it is able to generate proposals.

Ran the generated hand tracking model with tracer and observed just the 2 new operators and 0 new dtypes copy kernel function, which to me seems like an acceptable set of new ops to pull in since they are relatively simple operators: {P383858691}

Reviewed By: dreiss

Differential Revision: D27531423

fbshipit-source-id: 2dc6c41029236bb71922e51cbfd14a46c5651149
2021-04-06 21:37:53 -07:00
8c1a70a7c9 [A*][Gen-1.5] Add shape inference func for PredictorCall.
Summary:
ATT, so that the shape inference works for a model with only distributed parts.

Previously, we relied on a full_predictor net to do shape inference. For very large models, the full_predictor net won't be generated, so we have to do shape inference based on the distributed parts. Surprisingly, the PredictorCall op does tensor-name mapping, so it needs a shape inference function as well.

Test Plan: Added unittests.

Reviewed By: khabinov

Differential Revision: D27250956

fbshipit-source-id: 3ebd36ba1eb020bb5d00358cffb8f038a6a996e8
2021-04-06 21:18:40 -07:00
87cf277bd7 Don't allocate result Tensors in out overloads: _linalg_solve_out_helper_cuda (#55321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55321

We have some operators that previously allowed you to pass in an undefined tensor to the out argument,
and then would go on to allocate that for you. This behavior is broken and doesn't work in JIT when things
are converted to/from IValues. Because of this, it blocks backend fallbacks because they force going
through IValue.

This PR is one in a series to remove that behavior and forces out arguments to be defined tensors.

It only looks at at::_linalg_solve_out_helper_cuda(), but there are more PRs for other ops.
ghstack-source-id: 125886984

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ngimel

Differential Revision: D27572759

fbshipit-source-id: 5bca60b39c513b8d85fe282ebd4d66607d54774f
2021-04-06 19:55:39 -07:00
acfb05ff43 Boxing logic forwards arguments to stack (#53624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53624

Previously, the boxing logic didn't correctly forward arguments to the stack but called copy constructors.
This PR fixes that.
ghstack-source-id: 125886983

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D26852856

fbshipit-source-id: d2463eeca2f3fce1bbe117611be200fda59c880b
2021-04-06 19:55:37 -07:00
34b46359e3 Fix forwarding/move bug (#53556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53556

When packing a `Tensor&` (mutable lvalue reference) into an IValue, we accidentally didn't increase the refcount.
This wasn't triggered anywhere, until I tried to enable backend fallbacks. Backend fallbacks for ops that
have out arguments (i.e. ops that take `Tensor&` arguments and return `Tensor&` arguments) pack those returns
into an IValue stack (and accidentally don't increase the refcount), then later that stack gets destructed
(which decreases the refcount and possibly destroys the Tensor), and the `Tensor&` passed in as an out argument
is suddenty freed memory.

This PR fixes that by forwarding instead of moving when wrapping Tensors into IValues.
ghstack-source-id: 125886986

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: swolchok

Differential Revision: D26896507

fbshipit-source-id: 62102fa89e522699b5174c33279a2b1a775066a4
2021-04-06 19:55:34 -07:00
4757d4c077 Don't allocate result Tensors in out overloads: at::kron_out() (#53640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53640

We have some operators that previously allowed you to pass in an undefined tensor to the out argument,
and then would go on to allocate that for you. This behavior is broken and doesn't work in JIT when things
are converted to/from IValues. Because of this, it blocks backend fallbacks because they force going
through IValue.

This PR is one in a series to remove that behavior and forces out arguments to be defined tensors.

It only looks at at::kron_out(), but there are more PRs for other ops.
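
An illustration of the newly-enforced contract through the Python-facing op (a sketch):

```python
import torch

a, b = torch.eye(2), torch.ones(2, 2)
out = torch.empty(4, 4)        # `out` must now be a defined, allocated tensor
torch.kron(a, b, out=out)      # writes the 4x4 Kronecker product into `out`
print(out)
```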

BC Breaking: This breaks BC since those ops previously allowed calling with undefined tensors and that isn't allowed anymore.
ghstack-source-id: 125886981

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: bhosmer, ngimel

Differential Revision: D26921165

fbshipit-source-id: e61411226c12d33cb196a1e010ff733fe9fa6b7b
2021-04-06 19:55:31 -07:00
db75ebac4a Don't allocate result Tensors in out overloads: Reduction Ops (#53218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53218

We have some operators that previously allowed you to pass in an undefined tensor to the out argument,
and then would go on to allocate that for you. This behavior is broken and doesn't work in JIT when things
are converted to/from IValues. Because of this, it blocks backend fallbacks because they force going
through IValue.

This PR removes that behavior and forces out arguments to be defined tensors.

It only looks at reduction ops for now; more PRs are likely coming for other ops.

BC Breaking: This breaks BC since those ops previously allowed calling with undefined tensors and that isn't allowed anymore.
ghstack-source-id: 125886980

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D26795461

fbshipit-source-id: 158465260fe59deb7d4b2081e810a7434cfba722
2021-04-06 19:55:29 -07:00
73aeea648e Fix Scalar formatting (#53229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53229

Scalar formatting was assuming that everything non-float was integral. This would output bools as ints, and even worse, it would crash for complex.
This PR fixes that.
ghstack-source-id: 125886979

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D26800345

fbshipit-source-id: 1a9efd085276b40d6fb399d255a6bbd7d5f3619f
2021-04-06 19:55:26 -07:00
35caae6045 Fix boxing/unboxing for Scalar bool values (#53228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53228

Previously, if a Scalar value contained a bool and was put into and then out of an IValue, it would magically transform to an int.
This PR fixes that and preserves the bool-ness.
ghstack-source-id: 125886985

(Note: this ignores all push blocking failures!)

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D26800346

fbshipit-source-id: f170a5b8419bde9d3155042f9126e377714ec3ba
2021-04-06 19:53:29 -07:00
add49e7e4e Enforce PEP263 for PyTorch python codebase (#55346)
Summary:
All Python files containing non-ASCII characters should be correctly annotated with the `# -*- coding: utf-8 -*-` comment

Delete a number of superfluous UTF-8 characters, most commonly the right single quotation mark U+2019 (’) used instead of the ASCII apostrophe ', for example `Module’s` -> `Module's`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346

Reviewed By: samestep

Differential Revision: D27582044

Pulled By: malfet

fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44
2021-04-06 18:31:38 -07:00
34a7b4aabb [tools] Remove newline from clang-format reference hashes (#55328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55328

**Summary**
The `clang-format` reference hashes committed in #54737 have newlines at
the end but the locally computed ones do not. This commit removes these
newlines so that the `clang-format` binary verification step doesn't
fail.

**Test Plan**
`./tools/clang_format_all.py`, ran successfully.

**Fixes**
This commit fixes #54790.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27577398

Pulled By: SplitInfinity

fbshipit-source-id: e30bee58c2eb5ea96ed0a503480dea4f67b86aca
2021-04-06 17:17:19 -07:00
96655e2b81 Re-enable disabled tests workflow with GHA (#55417)
Summary:
Replace the old (and disabled) workflow to update disabled tests with a GitHub Action that would gather a list of disabled tests and export them to our test-infra repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55417

Test Plan: This [workflow](https://github.com/janeyx99/gha-experiments/runs/2282792158?check_suite_focus=true) has successfully pushed, resulting in this file: https://github.com/pytorch/test-infra/blob/master/stats/.pytorch-disabled-tests

Reviewed By: walterddr

Differential Revision: D27608584

Pulled By: janeyx99

fbshipit-source-id: b9dc184712484ef4806f24a34670390f723824bc
2021-04-06 16:40:41 -07:00
79fe5b7897 [Doc]fix torch.ceil formula issue(pytorch#54948) (#55039)
Summary:
Fixes the wrong formula reported in https://github.com/pytorch/pytorch/issues/54948.
The new one is the standard ceiling definition, `out_i = ⌈input_i⌉` (screenshot of the rendered docs omitted).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55039

Reviewed By: ngimel

Differential Revision: D27562484

Pulled By: mruberry

fbshipit-source-id: e01d9bfc0cf04558ecff3336a055037e6c3df028
2021-04-06 15:33:23 -07:00
5c402d9026 STFT: Clarify output shape in documentation (#54877)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54631

I removed the phrase "When `onesided` is the default value `True`". It's not always the default, and it's also confusing because it doesn't seem to relate to the bullet points it introduces. It makes more sense in the sentence before, i.e. these frequencies are included "when the output is onesided". So, I've rewritten it with that meaning and included the correct formula for the frequencies.
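
A small sketch of the onesided frequency count in question:

```python
import torch

x = torch.randn(400)
n_fft = 64
spec = torch.stft(x, n_fft=n_fft, return_complex=True)
# onesided output keeps n_fft // 2 + 1 frequency bins
print(spec.shape[0] == n_fft // 2 + 1)  # True
```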

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54877

Reviewed By: ngimel

Differential Revision: D27562785

Pulled By: mruberry

fbshipit-source-id: d7f36382611e8e176e3370393d1b371d577d46bb
2021-04-06 15:28:57 -07:00
933bbbbed6 [PyTorch] Fix waste in unfold() (#55339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55339

Use DimVector. Avoid calling size()/stride() when we know argument is in bounds.
ghstack-source-id: 125839415

Test Plan: Existing CI

Reviewed By: hlu1

Differential Revision: D27577647

fbshipit-source-id: b33057c383037dd0865de3a944ebf225ad8d9169
2021-04-06 14:38:06 -07:00
4cf42fc62f [PyTorch] Cache self.size(dim) in TensorShape functions (#55336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55336

The compiler cannot optimize this away because it does not know that size() has no side effects and that its result isn't changed by anything else that goes on in the function.
ghstack-source-id: 125775704

Test Plan: Spot-check assembly to verify assertion I made in the summary

Reviewed By: ngimel

Differential Revision: D27577299

fbshipit-source-id: 7b7ce1044c4c0b437d95103a5d149acb5d86c1bd
2021-04-06 14:36:36 -07:00
8eaa4a97b7 Back out "[quant][graphmode][fx] Separate handling Copy operator to a helper function" (#55388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55388

Temporarily revert D27314678 (c57541ce06); it appears to cause a perf regression that makes quantization of some models take too long to complete tests.

Reviewed By: houseroad

Differential Revision: D27583809

fbshipit-source-id: e9c088ccbfd3bfb3a1d4c7eafee3eca29ee7717b
2021-04-06 14:20:36 -07:00
84d18727bd Added linalg.eig, linalg.eigvals (#52491)
Summary:
This PR adds `torch.linalg.eig`, and `torch.linalg.eigvals` for NumPy compatibility.

MAGMA uses a hybrid CPU-GPU algorithm and doesn't have a GPU interface for the non-symmetric eigendecomposition. This forces us to transfer inputs living in GPU memory to the CPU before calling MAGMA, and then transfer the results back to GPU memory. That is rather slow for smaller matrices, and MAGMA is faster than the CPU path only for matrices larger than 3000x3000.
Unfortunately, there is no cuSOLVER function for this operation.

Autograd support for `torch.linalg.eig` will be added in a follow-up PR.

Ref https://github.com/pytorch/pytorch/issues/42666
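
A minimal usage sketch of the two new functions:

```python
import torch

a = torch.randn(4, 4)
# Both functions return complex results even for real inputs, matching
# numpy.linalg.eig / numpy.linalg.eigvals.
w, v = torch.linalg.eig(a)        # eigenvalues and right eigenvectors
w_only = torch.linalg.eigvals(a)  # eigenvalues only
# Sanity check: A @ v == v @ diag(w) up to numerical error.
assert torch.allclose(a.to(w.dtype) @ v, v @ torch.diag(w), atol=1e-4)
```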

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52491

Reviewed By: anjali411

Differential Revision: D27563616

Pulled By: mruberry

fbshipit-source-id: b42bb98afcd2ed7625d30bdd71cfc74a7ea57bb5
2021-04-06 13:53:26 -07:00
da7a27b847 [NNAPI] Initial flexible size support (#54701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54701

We need NNAPI models to support inputs (and, by extension, intermediate
values and outputs) whose shape is only determined at load time.  For
example, a vision model's input shape might depend on the aspect
ratio of the device camera.  While NNAPI has full support for variable
shapes (by setting components of the operand shape to 0), the guidance
we have received is that vendor-provided drivers for real hardware are
not able to support this efficiently.  Therefore, we take a hybrid
approach where shapes are calculated at model load time to
semi-dynamically construct our NNAPI model.  While this doesn't let us
have truly dynamic input shapes, it does allow us to ensure that the
vendor driver only sees fixed shapes, so we get maximum performance.

In this initial commit, only PReLU supports dynamic shapes.  Additional
operators will be converted in separate diffs.

- In order to convert a flexible-shape model, the user supplies inputs
  with shapes containing dimensions of size 0 for the flexible
  dimensions.
- During conversion, we generate code to compute the shapes of all
  intermediates and outputs as a function of the input shapes.
- We no longer run the input model to produce the output templates.
  Instead, we generate code to return properly-sized templates, given
  the input shapes.
- All of this generated code goes into a "ShapeComputeModule" that is
  used by the NnapiModule during initialization.
- The ShapeComputeModule mutates the serialized model to fill in the
  computed sizes for each operand.  This requires us to change the dtype
  for the serialized model to int32, but this should be fine because
  everything in it is already 4-byte aligned.
- NnapiInitWrapper no longer exists.  Instead, initialization is
  performed on the first run, based on the real arguments.  We plan to
  provide an API for doing eager initialization.
- Unit test updated to allow separate arguments to be given for trace,
  conversion, and inference.  A flexible-shape test case was added for
  PReLU.
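
A rough sketch of the flexible-shape convention described above (the converter entry point and exact call signature are assumptions here; see the NNAPI unit tests for the canonical invocation):

```python
import torch
# Assumed entry point for the converter:
from torch.backends._nnapi.prepare import convert_model_to_nnapi

model = torch.jit.trace(torch.nn.PReLU(), torch.zeros(1, 16, 64, 64))
# Zero-sized dimensions mark H and W as flexible; concrete values are
# filled in by the generated ShapeComputeModule at load time.
flexible_input = torch.zeros(1, 16, 0, 0)
nnapi_model = convert_model_to_nnapi(model, flexible_input)
```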

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536796

Pulled By: dreiss

fbshipit-source-id: 105585f247987b1e6ec6946a6fe44401237cb0a0
2021-04-06 13:49:43 -07:00
1e3b3a4714 [NNAPI] Create get_next_operand_id (#54700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54700

This is an internal method just to make it more clear what
len(self.operands) is doing.

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536794

Pulled By: dreiss

fbshipit-source-id: 678cee8a47df6757dd2e6feabf2560fd82d32e26
2021-04-06 13:49:41 -07:00
ca67c17e46 [NNAPI] Add fixed-size assertions (#54699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54699

We'll soon be adding support for flexible-size tensors to the NNAPI
converter, but it won't be added to all ops at once.  Create
get_tensor_operand_by_jitval_fixed_size as a wrapper for
get_tensor_operand_by_jitval that verifies that the argument has a fixed
shape.  Update all call sites.  As flexible size support is added to
each op, the call sites can be converted back and proper size checks
added.

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536791

Pulled By: dreiss

fbshipit-source-id: 6fb1fea814d767b6ff263fd8b88240a51be74777
2021-04-06 13:49:38 -07:00
5936faee7e [NNAPI] Rename local variable (#54698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54698

"mf" was short for memory format, but the concept that this variable
represents was renamed to "dim_order", so rename the variable.

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536793

Pulled By: dreiss

fbshipit-source-id: 2b31c70da1ff221a7833e67486690fa606f01dea
2021-04-06 13:49:35 -07:00
1f1d26137b [NNAPI] Use code generation to better support list input/output (#54697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54697

Previously, models being converted to NNAPI were expected to take inputs
as separate arguments, but the generated NNAPI model could only take
multiple inputs as a list.  Now the generated model always takes inputs
(single or multiple) as separate tensor arguments.

Previously, models being converted to NNAPI were expected to return
outputs as a single tensor or tuple of tensors, but the generated NNAPI
model would return multiple outputs as a list. Now the generated model
returns a tuple as well (or single tensor).

Internally, we decide what output format to use (single tensor or tuple)
based on the conversion process, rather than by running the model.
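
A stand-in illustrating the new calling convention (plain Python, not a real converted model):

```python
from typing import Tuple
import torch

def nnapi_model(x1: torch.Tensor, x2: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    # Inputs are now separate tensor arguments (no longer a single list),
    # and multiple outputs come back as a tuple (no longer a list).
    return x1 + x2, x1 - x2

y1, y2 = nnapi_model(torch.ones(2), torch.zeros(2))
```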

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536790

Pulled By: dreiss

fbshipit-source-id: c0f93c85d450757e568985947cc2f32043795859
2021-04-06 13:49:33 -07:00
d34d6244e7 [NNAPI] Use array instead of struct for serializing ints (#54696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54696

This was originally developed for a Python version where array was not
available.

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536792

Pulled By: dreiss

fbshipit-source-id: 39e5507e37d4f91871113439fe752a4d5373eaba
2021-04-06 13:49:30 -07:00
1d1db42340 Fix NNAPI for internal fbcode build (#48925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48925

The internal build has different header visibility than the CMake build.

Test Plan: Ran unit tests on dev server.

Reviewed By: axitkhurana

Differential Revision: D25365246

Pulled By: dreiss

fbshipit-source-id: 6b66f972b75874596b5b0e7fef34475950d8f611
2021-04-06 13:49:27 -07:00
476c597ae6 [NNAPI] Handle binary ops combining NHWC+NCHW in some cases (#48812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48812

This came up in a squeeze-and-excitation model.  Starting with an NHWC
tensor T, we perform a mean operation across H and W, giving an NxC
tensor, which (after some fully connected layers) is reshaped to
NxCx1x1, then multiplied with T.  To handle this, we detect the specific
case of a binary op with one NHWC input and one contiguous input with
H,W == 1,1 and allow the op to be applied (after transposing the
contiguous input).
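
The pattern in question, sketched in eager mode (shapes chosen for illustration):

```python
import torch

t = torch.randn(1, 8, 16, 16).contiguous(memory_format=torch.channels_last)
s = t.mean(dim=(2, 3))     # N x C, after global average over H and W
s = s.reshape(1, 8, 1, 1)  # N x C x 1 x 1, contiguous
out = t * s                # binary op mixing NHWC and NCHW layouts
```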

Test Plan: Unit test.

Reviewed By: axitkhurana

Differential Revision: D25317939

Pulled By: dreiss

fbshipit-source-id: b4c17ab3b874d1a7defa04664010ba82115f1c20
2021-04-06 13:49:25 -07:00
b057d27b0b [NNAPI] Add support for unsqueeze, cat, and mean (#48811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48811

Test Plan: Unit tests.

Reviewed By: axitkhurana

Differential Revision: D25317936

Pulled By: dreiss

fbshipit-source-id: 9b3a0a75b8157ae35ac13d52293a67800bad0ded
2021-04-06 13:49:22 -07:00
3802edd9ab [NNAPI] Add unit test (#47521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47521

This mostly goes op-by-op.  We construct a simple model containing the
op (in various configurations for complex ops) and verify that it can be
converted to NNAPI.  Additionally, if libneuralnetworks is available, we
also run both the eager model and NNAPI model and ensure that their
outputs are equal (allowing for some slight numerical differences).

serializer.py has 94% coverage, and most of the uncovered lines are
error cases, defensive code, or dead code that I might want to use
later.  prepare.py has 56% coverage, but probably closer to 75-80% if we
could collect coverage from TorchScript.

Test Plan:
Ran tests with NNAPI available.  Made various tweaks to the codebase to
make sure tests properly detected bugs.

Reviewed By: axitkhurana

Differential Revision: D25317940

Pulled By: dreiss

fbshipit-source-id: 709125af820440bfa7a73bab3304395f115f717f
2021-04-06 13:49:19 -07:00
8fcf9ca341 [NNAPI] Update support for Linear (#54695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54695

Previously, torch.nn.Linear was calling aten::addmm internally.  Now
it's calling aten::linear, so add support for that.

Test Plan: Unit test

Reviewed By: axitkhurana

Differential Revision: D27536795

Pulled By: dreiss

fbshipit-source-id: 42c8d2a80b20ac12ed9bba599c5e0e874256bb13
2021-04-06 13:49:17 -07:00
8d960f7043 [NNAPI] Fix hardtanh (#47520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47520

NNAPI defines "RELU1" as clamping from [-1, 1], not [0, 1] as I
previously assumed.  Fix our implementation to match.
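
For reference, NNAPI's RELU1 corresponds to hardtanh with its default bounds:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-2.0, 2.0, steps=5)
# F.hardtanh clamps to [-1, 1] by default, matching NNAPI's RELU1.
print(F.hardtanh(x))                             # tensor([-1., -1., 0., 1., 1.])
print(F.hardtanh(x, min_val=-1.0, max_val=1.0))  # explicit equivalent
```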

Test Plan: Upcoming unit test.

Reviewed By: axitkhurana

Differential Revision: D25317934

Pulled By: dreiss

fbshipit-source-id: 70efd5bb6092b0628ff6b765ce6f6274ef28d741
2021-04-06 13:49:14 -07:00
beca1fdbec [NNAPI] Fix MUL op (#47519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47519

This wasn't updated when _do_add_binary was refactored.

Test Plan: Upcoming unit test.

Reviewed By: axitkhurana

Differential Revision: D25317938

Pulled By: dreiss

fbshipit-source-id: 99212404c189481cfa692dd77d8f7c7865b6872b
2021-04-06 13:49:12 -07:00
38a3c28f17 [NNAPI] Remove solid weights support (#47518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47518

This was left over from an old version of the code.  The idea was that
instead of indexing into separate tensors for each weight, you could
bundle them all into a single file and use different offsets into that
file.  With the current design, this is nontrivial to support, so drop
the code for now.

Test Plan: CI

Reviewed By: axitkhurana

Differential Revision: D25317935

Pulled By: dreiss

fbshipit-source-id: e26ab3a8d437cb1bbb50319209fa56d9c571ce61
2021-04-06 13:49:09 -07:00
1be909f074 [NNAPI] Fix models with no weights (#47517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47517

While we're unlikely to see this in practice, it comes up in unit tests.
This type annotation is necessary for `torch.jit.script` to figure out
the type of the list if it is empty.
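
A minimal sketch of the pattern (module and attribute names are illustrative, not the real NNAPI code):

```python
from typing import List
import torch

class NoWeightsModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # The explicit annotation tells torch.jit.script the element type
        # even when the list is empty (a model with no weights).
        self.weights: List[torch.Tensor] = []

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for w in self.weights:
            x = x + w.sum()
        return x

scripted = torch.jit.script(NoWeightsModule())
```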

Test Plan: Unit tests in a later diff.

Reviewed By: axitkhurana

Differential Revision: D25317937

Pulled By: dreiss

fbshipit-source-id: de8b6665c6fcd3cd2b39e3c696a39336c064e4c1
2021-04-06 13:49:06 -07:00
0e7af36acd Make bundled inputs work with quantized zero inputs (#47407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47407

Previously, the code for bundling contiguous single-valued tensors (like
torch.zeros) wasn't working for quantized tensors because it was calling
the `torch.tensor` constructor without passing in the quantizer.
Instead, skip the constructor entirely, which makes this use case work
and also simplifies the code.  (Originally, I forgot that
`arg.flatten()[0]` would return a tensor, not a scalar.)
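
A small sketch of why the constructor path loses information (quantization parameters are illustrative):

```python
import torch

qzeros = torch.quantize_per_tensor(
    torch.zeros(2, 3), scale=0.1, zero_point=0, dtype=torch.quint8
)
elem = qzeros.flatten()[0]             # a 0-d quantized tensor, not a scalar
print(elem.is_quantized)               # True
print(torch.tensor(0.0).is_quantized)  # False: the quantizer is lost
```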

Test Plan: Bundled a quantized zero input and saw it run properly.

Reviewed By: dhruvbird

Differential Revision: D24752890

Pulled By: dreiss

fbshipit-source-id: 26bc4873a71dd44660cc0fcb74c227b754e31663
2021-04-06 13:47:35 -07:00
ad5dc84ed3 [vulkan] Add Winograd convolutions (#54639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54639

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D27514882

Pulled By: SS-JIA

fbshipit-source-id: 35cae338cf1e2e753bc66d27e1318168573ecb1d
2021-04-06 13:40:53 -07:00
20d7916a6a [Pytorch Mobile] Fold Conv BatchNorm for functions besides forward (#54619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54619

Minor refactor to conv batchnorm folding to work on other functions besides forward
ghstack-source-id: 125767010

Test Plan: unit test and {P339453712}

Reviewed By: kimishpatel

Differential Revision: D27301452

fbshipit-source-id: 4e0cc544a171a970583979a496b2908935124497
2021-04-06 13:07:12 -07:00
a9bcab46ff Revert back changes in test_custom_ops.cpp. (#55350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55350

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27600413

Pulled By: ailzhang

fbshipit-source-id: 5e0d5f13fe3a51fcdccaad8af4d46cbe82795174
2021-04-06 12:41:31 -07:00
a0d9776104 [JIT] Include conv3d in the conv-add-relu fusion (#54772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54772

conv3d-add-relu fusion does not work on some platforms when TF32 is enabled, so set allow_tf32 to false.

Test Plan:
```
python test/test_jit.py -k test_freeze_conv_relu_fusion
```

Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27435560

fbshipit-source-id: e35e2297dce85acfbe988deea97c3f5e68f1e1c7
2021-04-06 12:08:13 -07:00
ec80981d28 Revert D27246997: [pytorch][PR] Fix reference cycle in sparse coalesce graph
Test Plan: revert-hammer

Differential Revision:
D27246997 (815bfad28c)

Original commit changeset: 0fe6c1104350

fbshipit-source-id: 4d345718589a642d3c65474b266342285205ccdf
2021-04-06 11:45:27 -07:00
ae3a876c9c Revert D27572158: [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Test Plan: revert-hammer

Differential Revision:
D27572158 (e9c6a51100)

Original commit changeset: 9a360468acc9

fbshipit-source-id: 29f7e2cba3e134bc81fb31b7e1dfceb7c1f9d734
2021-04-06 11:41:55 -07:00
8e78a1b084 [Resubmit] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#52757)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51739
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52757

Reviewed By: cbalioglu

Differential Revision: D26646843

fbshipit-source-id: df4962ef86ea465307e39878860b9fbbcc958d52
2021-04-06 11:32:26 -07:00
e9c6a51100 [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process reads the return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently the worker process finishes its job after it writes output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (in the case of mp.SimpleQueue the channel is a file descriptor, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor got deleted, and the torchelastic process cannot find it).
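
A stripped-down illustration of the race (not the torchelastic code itself; the fix amounts to keeping the worker alive until the queue is drained):

```python
import torch
import torch.multiprocessing as mp

def worker(q):
    # Tensors travel over shared-memory file descriptors; if this process
    # exits before the parent reads, the descriptor can disappear and the
    # parent sees FileNotFoundError.
    q.put(torch.zeros(2))

if __name__ == "__main__":
    q = mp.SimpleQueue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    out = q.get()  # drain the queue before the worker finishes
    p.join()
```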

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27572158

fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b
2021-04-06 11:03:00 -07:00
f8788d5188 Upgrade onednn to v2.1.2 (#54956)
Summary:
This PR upgrades onednn to v2.1.2, which has the following main CPU changes:

- Improved performance of forward convolution with plain activations for processors with Intel AVX-512 support.

- Improved performance of fp32 depthwise convolution with plain activations on CPU.

More changes can be found at https://github.com/oneapi-src/oneDNN/releases.

The ideep version used is [pytorch-rls-v2.1.2](https://github.com/intel/ideep/tree/pytorch-rls-v2.1.2).
The oneDNN version used is [v2.1.2](https://github.com/oneapi-src/oneDNN/tree/v2.1.2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54956

Reviewed By: ejguan

Differential Revision: D27466741

Pulled By: VitalyFedyunin

fbshipit-source-id: ff96e2cbda4b6bf04d299b9978e9125a013ce32f
2021-04-06 10:51:57 -07:00
8243ba7205 Add MonkeyType dependency for testing on Linux (#55305)
Summary:
Install MonkeyType as part of our testing on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55305

Reviewed By: ailzhang

Differential Revision: D27592679

Pulled By: nikithamalgifb

fbshipit-source-id: c92b786e45fc16288d658228a5f96aca53a3da6b
2021-04-06 09:14:11 -07:00
158cdece65 Correct many OpInfos "test_out" skips. (#55141)
Summary:
Partially solves https://github.com/pytorch/pytorch/issues/54061

This PR solves many of the "easy to solve" problems with `out=` not notifying when it resizes a tensor. It also reports the cause of some failures of the `out=` operation in the tests. Hopefully this way we will be able to catch some errors that do not come simply from not using `resize_output`.
cc mruberry anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55141

Reviewed By: anjali411

Differential Revision: D27568755

Pulled By: mruberry

fbshipit-source-id: a32546555fef8d241de2ef635a99e5615461ed09
2021-04-06 08:41:25 -07:00
815bfad28c Fix reference cycle in sparse coalesce graph (#52874)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52253

In the issue reproducer we can replace `torch.sparse.sum(S)` with `S.coalesce()` and get the same memory leak. The reason is that calling `coalesce()` on an already coalesced tensor returns `self`. With autograd, the result gets its `grad_fn` set to a node that contains a reference to the input tensor, creating a reference cycle. Cloning the tensor fixes this, so `coalesce` always returns a new tensor.

As an aside, `torch.sparse.sum(S)` doesn't need to coalesce. The result should be the same either way.
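
A minimal reproduction of the cycle described above:

```python
import torch

i = torch.tensor([[0, 1], [0, 1]])
v = torch.tensor([1.0, 2.0])
S = torch.sparse_coo_tensor(i, v, (2, 2), requires_grad=True).coalesce()
# Before this fix, coalescing an already-coalesced tensor returned `self`,
# so the result's grad_fn referenced the result itself, forming a cycle.
S2 = S.coalesce()
```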

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52874

Reviewed By: bdhirsh

Differential Revision: D27246997

Pulled By: albanD

fbshipit-source-id: 0fe6c11043501a7874a50982afd42964f47470d3
2021-04-06 08:32:19 -07:00
e5f66f0059 Optimized generic interpolation using TensorIterator (keeps original 2d/3d channels last impl) (#54500)
Summary:
Related to https://github.com/pytorch/pytorch/issues/10482

A follow-up PR to https://github.com/pytorch/pytorch/pull/51653/

Description:
- Replaces the nearest/linear/cubic implementations with a generic interpolation implementation
- Retains the 2d/3d channels-last implementation due to a perf slowdown for 1 thread (see the appendix note below)

Speed-ups for cases:
- upsample_nearest channels first
- upsample_bicubic channels first/last
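
The numbers below come from the linked benchmark repo; a minimal sketch of how a single entry can be reproduced with `torch.utils.benchmark` (shapes taken from the tables):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1, 3, 320, 320)
timer = benchmark.Timer(
    stmt="F.interpolate(x, size=(256, 256), mode='bicubic', align_corners=False)",
    setup="import torch.nn.functional as F",
    globals={"x": x},
    num_threads=1,
)
print(timer.blocked_autorange())
```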

### Results for this PR

<details>
<summary>

Benchmark results between 8518b0e (master) and 73137d8 (this PR)

</summary>

```
Description:
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.6
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.1
- 20210331-092940_pr_results_1.9.0a0+git73137d8.6
- 20210331-092940_pr_results_1.9.0a0+git73137d8.1

[---------- upsample_bilinear2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          331.8       |          334.6
      [1, 3, 320, 320] -> (512, 512)   |         1261.7       |         1271.5
      [32, 128, 64, 64] -> (32, 32)    |        10164.6       |        10251.4
      [32, 128, 64, 64] -> (128, 128)  |       195966.1       |       197141.8
      [1, 3, 500, 500] -> (256, 256)   |          347.7       |          348.3
      [1, 3, 500, 500] -> (800, 800)   |         3044.9       |         3071.4
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |           76.1       |           77.0
      [1, 3, 320, 320] -> (512, 512)   |          244.8       |          247.6
      [32, 128, 64, 64] -> (32, 32)    |         2329.4       |         2315.8
      [32, 128, 64, 64] -> (128, 128)  |        47855.3       |        49047.7
      [1, 3, 500, 500] -> (256, 256)   |           78.1       |           78.7
      [1, 3, 500, 500] -> (800, 800)   |          569.3       |          575.6

Times are in microseconds (us).

[------- upsample_bilinear2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         339.0        |         340.3
      [1, 3, 320, 320] -> (512, 512)  |        1266.1        |        1277.3
      [1, 3, 500, 500] -> (256, 256)  |         348.8        |         351.3
      [1, 3, 500, 500] -> (800, 800)  |        3054.5        |        3077.3
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |          76.6        |          77.4
      [1, 3, 320, 320] -> (512, 512)  |         246.0        |         248.1
      [1, 3, 500, 500] -> (256, 256)  |          78.3        |          79.5
      [1, 3, 500, 500] -> (800, 800)  |         572.2        |         580.0

Times are in microseconds (us).

[--------- upsample_bilinear2d channels_last non-contiguous torch.float32 --------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         965.4        |         964.9
      [1, 3, 320, 320] -> (512, 512)   |        3856.2        |        3866.8
      [32, 128, 64, 64] -> (32, 32)    |        5808.3        |        5812.8
      [32, 128, 64, 64] -> (128, 128)  |       99575.2        |       97226.2
      [2, 128, 64, 46] -> (32, 32)     |         110.5        |         109.0
      [2, 128, 64, 46] -> (128, 128)   |        1662.3        |        1612.0
      [1, 128, 64, 46] -> (32, 32)     |          55.6        |          55.5
      [1, 128, 64, 46] -> (128, 128)   |         467.0        |         463.9
      [1, 3, 500, 500] -> (256, 256)   |         967.7        |         966.7
      [1, 3, 500, 500] -> (800, 800)   |        9394.7        |        9436.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         962.2        |         965.4
      [1, 3, 320, 320] -> (512, 512)   |        3844.3        |        3844.3
      [32, 128, 64, 64] -> (32, 32)    |        2270.0        |        2267.6
      [32, 128, 64, 64] -> (128, 128)  |       31909.7        |       32106.5
      [2, 128, 64, 46] -> (32, 32)     |          61.3        |          59.9
      [2, 128, 64, 46] -> (128, 128)   |         912.3        |         893.5
      [1, 128, 64, 46] -> (32, 32)     |          55.5        |          55.3
      [1, 128, 64, 46] -> (128, 128)   |         467.0        |         466.4
      [1, 3, 500, 500] -> (256, 256)   |         967.2        |         971.1
      [1, 3, 500, 500] -> (800, 800)   |        9383.2        |        9417.4

Times are in microseconds (us).

[------ upsample_linear1d channels_first contiguous torch.float32 -------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        513.5         |         521.8
      [4, 512, 320] -> [512]  |        999.0         |        1011.8
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        103.7         |         104.9
      [4, 512, 320] -> [512]  |        192.2         |         194.9

Times are in microseconds (us).

[------------- upsample_trilinear3d channels_first contiguous torch.float32 -------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          5.4         |          5.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        111.2         |        111.1
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          1.1         |          1.0
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |         23.4         |         23.2

Times are in milliseconds (ms).

[----------- upsample_trilinear3d channels_last non-contiguous torch.float32 ------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        13521.9       |        12939.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       244561.3       |       236595.6
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          362.2       |          365.5
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        38141.4       |        37957.7
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        12980.4       |        12962.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       236256.4       |       236364.5
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          367.9       |          393.2
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        38222.5       |        38198.3

Times are in microseconds (us).

[----------- upsample_nearest2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         1205.7       |          107.2
      [1, 3, 320, 320] -> (512, 512)   |         4793.5       |          357.7
      [32, 128, 64, 64] -> (32, 32)    |        26550.0       |         6227.1
      [32, 128, 64, 64] -> (128, 128)  |       341140.3       |       116404.4
      [1, 3, 500, 500] -> (256, 256)   |         1208.6       |          122.9
      [1, 3, 500, 500] -> (800, 800)   |        11648.0       |          848.1
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          220.5       |           32.6
      [1, 3, 320, 320] -> (512, 512)   |          865.4       |           78.1
      [32, 128, 64, 64] -> (32, 32)    |         4890.9       |         2201.2
      [32, 128, 64, 64] -> (128, 128)  |        73533.8       |        32315.4
      [1, 3, 500, 500] -> (256, 256)   |          222.3       |           35.0
      [1, 3, 500, 500] -> (800, 800)   |         2107.5       |          170.7

Times are in microseconds (us).

[----------- upsample_nearest2d channels_first contiguous torch.uint8 -----------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1457.0        |         310.7
      [1, 3, 320, 320] -> (512, 512)  |        5808.0        |        1196.6
      [1, 3, 500, 500] -> (256, 256)  |        1460.9        |         312.7
      [1, 3, 500, 500] -> (800, 800)  |       14094.3        |        2903.5
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         264.8        |          66.8
      [1, 3, 320, 320] -> (512, 512)  |        1046.0        |         228.9
      [1, 3, 500, 500] -> (256, 256)  |         266.0        |          68.0
      [1, 3, 500, 500] -> (800, 800)  |        2546.6        |         535.8

Times are in microseconds (us).

[-------- upsample_nearest2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1284.3        |        109.9
      [1, 3, 320, 320] -> (512, 512)  |        4870.0        |        361.6
      [1, 3, 500, 500] -> (256, 256)  |        1482.8        |        123.3
      [1, 3, 500, 500] -> (800, 800)  |       12050.3        |        858.8
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         240.2        |         32.8
      [1, 3, 320, 320] -> (512, 512)  |         886.1        |         78.4
      [1, 3, 500, 500] -> (256, 256)  |         274.9        |         34.9
      [1, 3, 500, 500] -> (800, 800)  |        2188.8        |        174.0

Times are in microseconds (us).

[--------- upsample_nearest2d channels_first non-contiguous torch.uint8 ---------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1501.9        |         312.2
      [1, 3, 320, 320] -> (512, 512)  |        5853.4        |        1202.1
      [1, 3, 500, 500] -> (256, 256)  |        1574.0        |         313.9
      [1, 3, 500, 500] -> (800, 800)  |       14210.2        |        2904.5
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         277.2        |          67.2
      [1, 3, 320, 320] -> (512, 512)  |        1059.8        |         228.9
      [1, 3, 500, 500] -> (256, 256)  |         292.2        |          68.1
      [1, 3, 500, 500] -> (800, 800)  |        2574.4        |         536.2

Times are in microseconds (us).

[--------- upsample_nearest2d channels_last non-contiguous torch.float32 ---------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         746.0        |         751.1
      [1, 3, 320, 320] -> (512, 512)   |        2967.6        |        2979.2
      [32, 128, 64, 64] -> (32, 32)    |        3408.5        |        3379.0
      [32, 128, 64, 64] -> (128, 128)  |       90166.4        |       90023.0
      [2, 128, 64, 46] -> (32, 32)     |          74.8        |          74.5
      [2, 128, 64, 46] -> (128, 128)   |        1591.2        |        1594.3
      [1, 128, 64, 46] -> (32, 32)     |          39.3        |          39.2
      [1, 128, 64, 46] -> (128, 128)   |         420.3        |         419.1
      [1, 3, 500, 500] -> (256, 256)   |         751.6        |         756.3
      [1, 3, 500, 500] -> (800, 800)   |        7222.2        |        7268.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         144.9        |         140.1
      [1, 3, 320, 320] -> (512, 512)   |         560.7        |         540.6
      [32, 128, 64, 64] -> (32, 32)    |        1418.1        |        1418.6
      [32, 128, 64, 64] -> (128, 128)  |       28158.4        |       26411.4
      [2, 128, 64, 46] -> (32, 32)     |          18.4        |          17.8
      [2, 128, 64, 46] -> (128, 128)   |         532.3        |         552.0
      [1, 128, 64, 46] -> (32, 32)     |          13.9        |          13.6
      [1, 128, 64, 46] -> (128, 128)   |          81.3        |          82.9
      [1, 3, 500, 500] -> (256, 256)   |         145.9        |         141.6
      [1, 3, 500, 500] -> (800, 800)   |        1363.4        |        1316.2

Times are in microseconds (us).

[---------- upsample_nearest2d channels_last non-contiguous torch.uint8 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         795.7        |         824.1
      [1, 3, 320, 320] -> (512, 512)   |        3163.4        |        3274.8
      [32, 128, 64, 64] -> (32, 32)    |         798.8        |         812.2
      [32, 128, 64, 64] -> (128, 128)  |       25259.6        |       25453.1
      [2, 128, 64, 46] -> (32, 32)     |          39.3        |          39.9
      [2, 128, 64, 46] -> (128, 128)   |         493.7        |         499.9
      [1, 128, 64, 46] -> (32, 32)     |          22.6        |          22.9
      [1, 128, 64, 46] -> (128, 128)   |         249.7        |         254.0
      [32, 64, 128, 64] -> (32, 32)    |         475.3        |         507.4
      [32, 64, 128, 64] -> (128, 128)  |       13709.7        |       13767.5
      [1, 3, 500, 500] -> (256, 256)   |         804.0        |         827.6
      [1, 3, 500, 500] -> (800, 800)   |        7764.9        |        7982.7
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         150.1        |         151.4
      [1, 3, 320, 320] -> (512, 512)   |         589.5        |         592.6
      [32, 128, 64, 64] -> (32, 32)    |         141.3        |         194.5
      [32, 128, 64, 64] -> (128, 128)  |        6916.5        |        7445.0
      [2, 128, 64, 46] -> (32, 32)     |          10.0        |          12.5
      [2, 128, 64, 46] -> (128, 128)   |          95.8        |         141.1
      [1, 128, 64, 46] -> (32, 32)     |           8.1        |          10.0
      [1, 128, 64, 46] -> (128, 128)   |          52.5        |          74.3
      [32, 64, 128, 64] -> (32, 32)    |          79.8        |         123.7
      [32, 64, 128, 64] -> (128, 128)  |        3639.9        |        4087.9
      [1, 3, 500, 500] -> (256, 256)   |         150.7        |         152.2
      [1, 3, 500, 500] -> (800, 800)   |        1430.9        |        1440.7

Times are in microseconds (us).

[------ upsample_nearest1d channels_first contiguous torch.float32 ------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        1601.7        |        241.7
      [4, 512, 320] -> [512]  |        3188.5        |        435.7
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         291.9        |         53.3
      [4, 512, 320] -> [512]  |         577.8        |         88.1

Times are in microseconds (us).

[------- upsample_nearest1d channels_first contiguous torch.uint8 -------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        2010.1        |         532.3
      [4, 512, 320] -> [512]  |        3999.7        |        1011.4
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         364.2        |         104.6
      [4, 512, 320] -> [512]  |         722.8        |         193.5

Times are in microseconds (us).

[-------------- upsample_nearest3d channels_first contiguous torch.float32 --------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        14801.0       |         977.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       217368.5       |       41577.3
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         2670.3       |         210.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        42023.6       |       10971.6

Times are in microseconds (us).

[--------------- upsample_nearest3d channels_first contiguous torch.uint8 ---------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        17151.7       |        3195.8
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       221221.0       |       50524.5
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         3085.3       |         588.6
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        39842.0       |        9141.0

Times are in microseconds (us).

[------------ upsample_nearest3d channels_last non-contiguous torch.float32 -------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         7694.1       |         7729.0
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       138104.6       |       138158.0
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          251.1       |          252.4
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        28991.5       |        28882.8
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1398.3       |         1402.6
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        28056.5       |        28123.2
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |           50.8       |           51.1
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |         7595.7       |         7540.7

Times are in microseconds (us).

[------------- upsample_nearest3d channels_last non-contiguous torch.uint8 --------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         8147.8       |         8176.2
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       114658.1       |       114992.7
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          364.3       |          356.0
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        17276.0       |        16331.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1469.4       |         1476.1
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        20647.1       |        20722.6
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |           69.7       |           68.4
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |         3125.7       |         2948.2

Times are in microseconds (us).

[----------- upsample_bicubic2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          5961.0      |         1680.2
      [1, 3, 320, 320] -> (512, 512)   |         23803.7      |         6591.0
      [32, 128, 64, 64] -> (32, 32)    |        620609.4      |        37981.6
      [32, 128, 64, 64] -> (128, 128)  |      10120286.1      |       646305.5
      [1, 3, 500, 500] -> (256, 256)   |          6005.4      |         1694.6
      [1, 3, 500, 500] -> (800, 800)   |         58271.9      |        16047.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6218.5      |          347.1
      [1, 3, 320, 320] -> (512, 512)   |         24144.6      |         1253.4
      [32, 128, 64, 64] -> (32, 32)    |        612762.5      |         6934.8
      [32, 128, 64, 64] -> (128, 128)  |       9906221.2      |       127411.1
      [1, 3, 500, 500] -> (256, 256)   |          6241.9      |          350.2
      [1, 3, 500, 500] -> (800, 800)   |         59052.2      |         2984.8

Times are in microseconds (us).

[-------- upsample_bicubic2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6050.9        |        1694.3
      [1, 3, 320, 320] -> (512, 512)  |       23897.1        |        6607.9
      [1, 3, 500, 500] -> (256, 256)  |        6282.8        |        1693.9
      [1, 3, 500, 500] -> (800, 800)  |       58608.1        |       16061.0
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6243.7        |         347.6
      [1, 3, 320, 320] -> (512, 512)  |       24779.9        |        1253.8
      [1, 3, 500, 500] -> (256, 256)  |        6348.0        |         350.7
      [1, 3, 500, 500] -> (800, 800)  |       59255.6        |        2983.8

Times are in microseconds (us).

[--------- upsample_bicubic2d channels_last non-contiguous torch.float32 ---------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6117.0      |         1688.2
      [1, 3, 320, 320] -> (512, 512)   |         23967.4      |         6644.8
      [32, 128, 64, 64] -> (32, 32)    |        679574.0      |        78477.4
      [32, 128, 64, 64] -> (128, 128)  |      10334325.5      |       817649.0
      [2, 128, 64, 46] -> (32, 32)     |          9828.0      |         4449.2
      [2, 128, 64, 46] -> (128, 128)   |        134989.3      |        42817.4
      [1, 128, 64, 46] -> (32, 32)     |          4508.2      |         2228.6
      [1, 128, 64, 46] -> (128, 128)   |         59404.9      |        21400.4
      [1, 3, 500, 500] -> (256, 256)   |          6359.0      |         1712.7
      [1, 3, 500, 500] -> (800, 800)   |         58717.6      |        16086.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6922.0      |          349.5
      [1, 3, 320, 320] -> (512, 512)   |         24916.5      |         1260.2
      [32, 128, 64, 64] -> (32, 32)    |        454240.4      |        16491.4
      [32, 128, 64, 64] -> (128, 128)  |       7198101.5      |       159921.9
      [2, 128, 64, 46] -> (32, 32)     |         10082.8      |          891.1
      [2, 128, 64, 46] -> (128, 128)   |        151037.0      |         7704.2
      [1, 128, 64, 46] -> (32, 32)     |          4325.5      |          633.9
      [1, 128, 64, 46] -> (128, 128)   |         62400.4      |         3853.5
      [1, 3, 500, 500] -> (256, 256)   |          6374.9      |          354.9
      [1, 3, 500, 500] -> (800, 800)   |         58638.8      |         2992.0

Times are in microseconds (us).

Intermediate benchmark sources:

- results/20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.log.save
- results/20210331-092940_pr_results_1.9.0a0+git73137d8.log.save
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20210326-061238_pr_1.9.0a0%2Bgita17040a_vs_pth_1.9.0a0%2Bgit8518b0e_results.md)

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_seven).

Joint work with Francisco Massa (fmassa).

 ---

Appendix: Results without original 2d/3d channels last implementation

<details>
<summary>

Quick benchmark results between 8518b0e (master) and [this branch](https://github.com/pytorch/pytorch/compare/master...Quansight:vfdev-5/generic-upsample-tensor-iterator)

</summary>

```
Description:
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.6
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.1
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.6
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.1

[----------------- upsample_bilinear2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          348.5       |          331.7
      [1, 3, 320, 320] -> (512, 512)   |         1254.0       |         1178.1
      [32, 128, 64, 64] -> (32, 32)    |        10409.4       |        10009.1
      [32, 128, 64, 64] -> (128, 128)  |       210175.8       |       204542.5
      [1, 3, 500, 500] -> (256, 256)   |          348.5       |          329.5
      [1, 3, 500, 500] -> (800, 800)   |         3079.8       |         2890.1
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |           76.4       |           73.4
      [1, 3, 320, 320] -> (512, 512)   |          247.1       |          232.0
      [32, 128, 64, 64] -> (32, 32)    |         2371.1       |         2340.5
      [32, 128, 64, 64] -> (128, 128)  |        62182.6       |        54089.9
      [1, 3, 500, 500] -> (256, 256)   |           78.2       |           75.8
      [1, 3, 500, 500] -> (800, 800)   |          569.0       |          541.3

Times are in microseconds (us).

[-------------- upsample_bilinear2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         340.5        |         321.9
      [1, 3, 320, 320] -> (512, 512)  |        1256.1        |        1179.0
      [1, 3, 500, 500] -> (256, 256)  |         351.4        |         332.0
      [1, 3, 500, 500] -> (800, 800)  |        3089.1        |        2898.6
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |          77.2        |          75.0
      [1, 3, 320, 320] -> (512, 512)  |         246.6        |         232.7
      [1, 3, 500, 500] -> (256, 256)  |          78.6        |          75.4
      [1, 3, 500, 500] -> (800, 800)  |         576.3        |         539.6

Times are in microseconds (us).

[------------------------ upsample_bilinear2d channels_last non-contiguous ------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          971.9       |         1324.6       |       99.6
      [1, 3, 320, 320] -> (512, 512)   |         3867.8       |         5329.9       |      271.5
      [32, 128, 64, 64] -> (32, 32)    |         6010.6       |         6304.3       |
      [32, 128, 64, 64] -> (128, 128)  |       112299.9       |       116956.8       |
      [2, 128, 64, 46] -> (32, 32)     |          110.1       |          133.2       |
      [2, 128, 64, 46] -> (128, 128)   |         1690.1       |         1838.6       |
      [1, 128, 64, 46] -> (32, 32)     |           55.8       |           73.4       |      185.8
      [1, 128, 64, 46] -> (128, 128)   |          474.5       |          684.9       |     1445.7
      [1, 3, 500, 500] -> (256, 256)   |          972.9       |         1343.0       |      149.5
      [1, 3, 500, 500] -> (800, 800)   |         9460.2       |        12925.8       |      685.1
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          956.6       |          260.1       |       27.1
      [1, 3, 320, 320] -> (512, 512)   |         3867.3       |          967.1       |       63.6
      [32, 128, 64, 64] -> (32, 32)    |         2489.4       |         2427.0       |
      [32, 128, 64, 64] -> (128, 128)  |        37462.1       |        41329.8       |
      [2, 128, 64, 46] -> (32, 32)     |           61.2       |           38.9       |
      [2, 128, 64, 46] -> (128, 128)   |          904.2       |          652.0       |
      [1, 128, 64, 46] -> (32, 32)     |           57.1       |           32.0       |      191.1
      [1, 128, 64, 46] -> (128, 128)   |          491.4       |          138.1       |     1485.8
      [1, 3, 500, 500] -> (256, 256)   |          977.0       |          257.8       |       36.6
      [1, 3, 500, 500] -> (800, 800)   |         9470.0       |         2696.0       |      142.8

Times are in microseconds (us).

[------------- upsample_linear1d channels_first contiguous --------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        516.5         |         524.7
      [4, 512, 320] -> [512]  |        993.8         |        1008.0
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        104.3         |         105.4
      [4, 512, 320] -> [512]  |        193.5         |         195.6

Times are in microseconds (us).

[-------------------- upsample_trilinear3d channels_first contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          5.5         |         11.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        116.3         |        213.1
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          1.1         |          2.1
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |         36.1         |         47.2

Times are in milliseconds (ms).

[------------------ upsample_trilinear3d channels_last non-contiguous -------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         13.1         |         19.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        242.3         |        349.4
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         13.1         |          4.4
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        242.4         |         87.2

Times are in milliseconds (ms).

[------------------ upsample_nearest2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         1194.5       |          107.8
      [1, 3, 320, 320] -> (512, 512)   |         4813.8       |          365.5
      [32, 128, 64, 64] -> (32, 32)    |        26745.6       |         6280.6
      [32, 128, 64, 64] -> (128, 128)  |       357686.7       |       129032.9
      [1, 3, 500, 500] -> (256, 256)   |         1205.9       |          123.8
      [1, 3, 500, 500] -> (800, 800)   |        11770.3       |          879.2
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          220.2       |           32.7
      [1, 3, 320, 320] -> (512, 512)   |          867.2       |           78.7
      [32, 128, 64, 64] -> (32, 32)    |         5789.6       |         2241.8
      [32, 128, 64, 64] -> (128, 128)  |        89125.3       |        41881.3
      [1, 3, 500, 500] -> (256, 256)   |          224.3       |           34.8
      [1, 3, 500, 500] -> (800, 800)   |         2182.8       |          176.6

Times are in microseconds (us).

[--------------- upsample_nearest2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1279.5        |        110.2
      [1, 3, 320, 320] -> (512, 512)  |        4908.1        |        367.1
      [1, 3, 500, 500] -> (256, 256)  |        1488.1        |        123.4
      [1, 3, 500, 500] -> (800, 800)  |       12186.4        |        879.3
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         241.8        |         32.6
      [1, 3, 320, 320] -> (512, 512)  |         889.0        |         79.2
      [1, 3, 500, 500] -> (256, 256)  |         279.2        |         35.6
      [1, 3, 500, 500] -> (800, 800)  |        2226.5        |        174.3

Times are in microseconds (us).

[------------------------ upsample_nearest2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          752.1       |          487.2       |      75.5
      [1, 3, 320, 320] -> (512, 512)   |         2992.6       |         1880.0       |     251.4
      [32, 128, 64, 64] -> (32, 32)    |         3458.6       |         3466.5       |
      [32, 128, 64, 64] -> (128, 128)  |       102350.7       |       103919.4       |
      [2, 128, 64, 46] -> (32, 32)     |           75.2       |           85.2       |
      [2, 128, 64, 46] -> (128, 128)   |         1637.0       |         1690.4       |
      [1, 128, 64, 46] -> (32, 32)     |           39.6       |           47.2       |      37.6
      [1, 128, 64, 46] -> (128, 128)   |          426.3       |          449.0       |     412.4
      [1, 3, 500, 500] -> (256, 256)   |          757.5       |          495.5       |      85.0
      [1, 3, 500, 500] -> (800, 800)   |         7281.4       |         4532.6       |     622.8
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          139.3       |          104.1       |      75.7
      [1, 3, 320, 320] -> (512, 512)   |          535.5       |          361.2       |      73.0
      [32, 128, 64, 64] -> (32, 32)    |         1518.6       |         1458.2       |
      [32, 128, 64, 64] -> (128, 128)  |        37117.7       |        40142.4       |
      [2, 128, 64, 46] -> (32, 32)     |           17.6       |           26.6       |
      [2, 128, 64, 46] -> (128, 128)   |          537.6       |          629.4       |
      [1, 128, 64, 46] -> (32, 32)     |           13.7       |           22.1       |      38.8
      [1, 128, 64, 46] -> (128, 128)   |           83.6       |           94.5       |     420.2
      [1, 3, 500, 500] -> (256, 256)   |          140.8       |          104.9       |      87.8
      [1, 3, 500, 500] -> (800, 800)   |         1317.8       |          853.8       |     139.7

Times are in microseconds (us).

[------------- upsample_nearest1d channels_first contiguous -------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        1594.3        |        247.4
      [4, 512, 320] -> [512]  |        3222.6        |        440.4
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         294.4        |         53.7
      [4, 512, 320] -> [512]  |         575.0        |         88.5

Times are in microseconds (us).

[--------------------- upsample_nearest3d channels_first contiguous ---------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        14952.7       |        1005.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       224955.6       |       46228.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         2887.2       |         206.2
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        56872.0       |       13566.3

Times are in microseconds (us).

[------------------- upsample_nearest3d channels_last non-contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         7772.3       |         4770.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       144655.1       |       108605.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1401.9       |          877.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        35939.6       |        28621.5

Times are in microseconds (us).

[------------------ upsample_bicubic2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6038.7       |         2340.4
      [1, 3, 320, 320] -> (512, 512)   |        24040.6       |         9205.9
      [32, 128, 64, 64] -> (32, 32)    |       471016.3       |        52059.1
      [32, 128, 64, 64] -> (128, 128)  |      7705594.5       |       884743.9
      [1, 3, 500, 500] -> (256, 256)   |         6061.5       |         2361.9
      [1, 3, 500, 500] -> (800, 800)   |        58940.7       |        22401.8
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6594.3       |          466.5
      [1, 3, 320, 320] -> (512, 512)   |        25361.5       |         1729.1
      [32, 128, 64, 64] -> (32, 32)    |       487783.5       |        11550.0
      [32, 128, 64, 64] -> (128, 128)  |      7963636.6       |       196017.3
      [1, 3, 500, 500] -> (256, 256)   |         6443.8       |          464.1
      [1, 3, 500, 500] -> (800, 800)   |        61891.9       |         4257.2

Times are in microseconds (us).

[--------------- upsample_bicubic2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6116.7        |        2357.0
      [1, 3, 320, 320] -> (512, 512)  |       24182.0        |        9213.9
      [1, 3, 500, 500] -> (256, 256)  |        6349.6        |        2358.5
      [1, 3, 500, 500] -> (800, 800)  |       59365.2        |       22431.2
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        7155.1        |         464.6
      [1, 3, 320, 320] -> (512, 512)  |       24566.8        |        1712.4
      [1, 3, 500, 500] -> (256, 256)  |        7217.5        |         466.6
      [1, 3, 500, 500] -> (800, 800)  |       59880.2        |        4148.8

Times are in microseconds (us).

[------------------------ upsample_bicubic2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6184.3       |         2360.0       |      215.0
      [1, 3, 320, 320] -> (512, 512)   |        24499.7       |         9231.1       |      510.7
      [32, 128, 64, 64] -> (32, 32)    |       548304.5       |        93517.8       |
      [32, 128, 64, 64] -> (128, 128)  |      7810958.3       |      1086334.6       |
      [2, 128, 64, 46] -> (32, 32)     |        10883.4       |         5594.9       |
      [2, 128, 64, 46] -> (128, 128)   |       153253.2       |        57071.2       |
      [1, 128, 64, 46] -> (32, 32)     |         4519.4       |         2826.5       |      619.7
      [1, 128, 64, 46] -> (128, 128)   |        61339.7       |        28470.7       |     3654.5
      [1, 3, 500, 500] -> (256, 256)   |         6444.8       |         2389.9       |      292.9
      [1, 3, 500, 500] -> (800, 800)   |        59448.0       |        22479.1       |     1316.9
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6370.1       |          464.9       |       61.3
      [1, 3, 320, 320] -> (512, 512)   |        25365.6       |         1767.5       |      145.7
      [32, 128, 64, 64] -> (32, 32)    |       502888.7       |        22016.3       |
      [32, 128, 64, 64] -> (128, 128)  |      8072918.9       |       234567.0       |
      [2, 128, 64, 46] -> (32, 32)     |        11171.4       |         1049.5       |
      [2, 128, 64, 46] -> (128, 128)   |       152612.5       |        11264.8       |
      [1, 128, 64, 46] -> (32, 32)     |         4359.3       |          791.4       |      651.1
      [1, 128, 64, 46] -> (128, 128)   |        61346.5       |         7563.9       |     3765.2
      [1, 3, 500, 500] -> (256, 256)   |         6644.4       |          469.7       |       77.4
      [1, 3, 500, 500] -> (800, 800)   |        59947.2       |         4154.3       |      313.2

Times are in microseconds (us).

Intermediate benchmark sources:

- results/20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.log.save.opencv
- results/20212303-061238_pr_results_1.9.0a0+gite3a9544.log.save.opencv

```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20212303-061238_pr_1.9.0a0%2Bgite3a9544_vs_pth_1.9.0a0%2Bgit8518b0e_results.opencv.md)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54500

Reviewed By: glaringlee

Differential Revision: D27463566

Pulled By: fmassa

fbshipit-source-id: ceac3a8cee0eeb1a4ddd9344accffcc65449a49a
2021-04-06 08:21:10 -07:00
87d55058f1 Fix the clang-tidy diff SHA for using PR merge (#55318)
Summary:
Since https://github.com/pytorch/pytorch/issues/54967, our clang-tidy CI job has been giving warnings on files that PRs don't touch (see the screenshot below for an example). This PR should fix the issue by comparing against the merge-base of the `merge` commit with `master`, which is just `master` itself.

![clang-tidy](https://user-images.githubusercontent.com/8246041/113618718-eb83f600-960c-11eb-9375-8b88158eb566.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55318

Test Plan: CI.

Reviewed By: janeyx99

Differential Revision: D27572553

Pulled By: samestep

fbshipit-source-id: 9a833aaeecc2ab22462b3fa99fa3353490c3de85
2021-04-06 07:40:39 -07:00
bf70fe69ae Revert D27442325: [torch/elastic] Revise the rendezvous handler registry logic.
Test Plan: revert-hammer

Differential Revision:
D27442325 (df299dbd7d)

Original commit changeset: 8519a2caacbe

fbshipit-source-id: f10452567f592c23ae79ca31556a2a77546726b1
2021-04-06 06:17:14 -07:00
c3d0607ffa [Static Runtime] Make sure the copy version of the op exist in ReplaceWithCopy (#55337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337

`static_runtime::permute_copy` is in an fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library. The workaround is to check at runtime whether the op is registered.
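
A minimal Python sketch of that workaround, using only the public `torch.ops` lookup behavior (this is not the PR's actual code):

```python
import torch

def op_is_registered(namespace: str, name: str) -> bool:
    """True if an op such as static_runtime::permute_copy exists in this build."""
    try:
        getattr(getattr(torch.ops, namespace), name)  # raises if unregistered
        return True
    except (AttributeError, RuntimeError):
        return False

if op_is_registered("static_runtime", "permute_copy"):
    pass  # safe to enable the copy variant in ReplaceWithCopy
```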

Test Plan:
This fixed two of the broken tests:
```
    ✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
    ✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```

Reviewed By: ajyu

Differential Revision: D27577066

fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
2021-04-06 04:25:04 -07:00
1b4bb3691c [Gradient Compression] Update _powerSGD_comm_hook_wrapper to only expose 2 most critical hyperparameters (#55295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55295

Update `_powerSGD_comm_hook_wrapper` to expose only the 2 most critical hyperparameters, to make this API clearer to any future user (although the second hyperparameter `start_powerSGD_iter` is not in use yet).

Test Plan: waitforbuildbot

Reviewed By: shuyingsunshine21

Differential Revision: D27561734

fbshipit-source-id: b661981cc033b109f4f2fc92b435567a184a7fb5
2021-04-06 01:29:10 -07:00
cc4036905c [Gradient Compression] Update the default value of start_powerSGD_iter and update the docstring (#55272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55272

1. Set 1K as the default value of `start_powerSGD_iter` for practicality. The original default value of 10 is usually too small for real use cases. The new default value of 1K is also consistent with PyTorch Lightning.
2. Update the docstring of `start_powerSGD_iter` to remind users to set a value no smaller than the number of warm-up steps, if any (see the usage sketch after this list).
3. Update some unit tests to start PowerSGD early.
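
A hedged usage sketch of these knobs; `ddp_model` is assumed to be an existing `DistributedDataParallel` instance, and the module path reflects the API at the time:

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,           # use the default process group
    matrix_approximation_rank=1,  # the other critical hyperparameter
    start_powerSGD_iter=1000,     # new default: start compressing after warm-up
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```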

ghstack-source-id: 125707662

Test Plan: waitforbuildbot

Reviewed By: shuyingsunshine21

Differential Revision: D27553388

fbshipit-source-id: 40076419bc85755c0c0b64b79ba914b241085fcc
2021-04-06 01:27:29 -07:00
3551bd31be [PyTorch] Lite interpreter with a backend delegate (#54462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54462

Unclean files during sync - Sat Mar 20 04:00:02 PDT 2021

Unclean files during sync - Sun Mar 21 04:00:01 PDT 2021
ghstack-source-id: 124585992

Test Plan:
```
buck run xplat/caffe2/fb/test/delegate:interpreter_test -- --model_file_path=/path/to/mobile_model.ptl
```

Reviewed By: raziel

Differential Revision: D27232309

fbshipit-source-id: 8504a3185339d73bfa6e924485c4745acf269cec
2021-04-06 00:55:26 -07:00
7d9a619796 [PyTorch] Fix bin hash comparison failure in clang format script (#55281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55281

## Summary

`python3 tools/clang_format_all.py` is complaining that the binary is not what was expected. It turns out the reference hash includes an extra newline compared with the actual hash. In this PR,
1. Use `repr(hash)` to show the raw string, making it easier to compare the two strings.
2. Remove the extra newline.
3. Run `python3 tools/clang_format_all.py`, which then formats `torch/csrc/jit/runtime/static/passes.h`.

Before the change,
```
(base) chenlai@chenlai-mp pytorch % python3 tools/clang_format_all.py -v
Found pre-existing clang-format binary, skipping download
Reference Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353'
Actual Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353'
The downloaded binary is not what was expected!
(base) chenlai@chenlai-mp pytorch %
```

After the change,
```
(base) chenlai@chenlai-mp pytorch % python3 tools/clang_format_all.py -v
Found pre-existing clang-format binary, skipping download
Reference Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353\n'
Actual Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353'
The downloaded binary is not what was expected!
(base) chenlai@chenlai-mp pytorch %
```

After stripping the hash string:
```
(base) chenlai@chenlai-mp pytorch % python3 tools/clang_format_all.py -v
Downloading clang-format to /Users/chenlai/pytorch/.clang-format-bin
0% |################################################################| 100%
Reference Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353'
Actual Hash: '5fde7bccf65032da297dfb1f18e4a95e96e278fa397e9dcaf364dfe23ec46353'
Using clang-format located at /Users/chenlai/pytorch/.clang-format-bin/clang-format
```

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27556372

Pulled By: cccclai

fbshipit-source-id: 2fd1ba220733e767ffab41ab31e162f0bf3f1d62
2021-04-06 00:33:53 -07:00
df299dbd7d [torch/elastic] Revise the rendezvous handler registry logic.
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D27442325

fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
2021-04-05 23:38:29 -07:00
359d0a0205 [torch/elastic] Improve the implementation of RendezvousParameters and add its unit tests. (#146)
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54807

Improve the implementation and the unit test coverage of `RendezvousParameters`.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27342444

fbshipit-source-id: 88de356c0a799844a739eb9105185bb8c1acf11f
2021-04-05 23:38:27 -07:00
7f06c65a4c [torch/elastic] Improve the implementation of the utility functions and add their unit tests. (#54804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54804

Improve the implementation of the utility functions to handle more edge cases and also have a new set of unit tests to cover their usage.

Test Plan: Run the existing and newly introduced unit tests.

Reviewed By: kiukchung

Differential Revision: D27327898

fbshipit-source-id: 96b6fe2d910e3de69f44947a0e8a9f687ab50633
2021-04-05 23:38:25 -07:00
de7f05b9eb [torch/elastic] Expose a stderr parameter in EtcdServer. (#54805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54805

Expose a `stderr` parameter to `EtcdServer` to have a clean unit test outputs.

Test Plan: Run the existing test suite.

Reviewed By: kiukchung

Differential Revision: D27327495

fbshipit-source-id: 0a342aeda0ff4d85d809aab1cbf155d3fafd4fa1
2021-04-05 23:38:22 -07:00
bad8d34780 [torch/elastic] Revise the rendezvous exception types. (#54803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54803

Revise the rendezvous exception types to align their naming convention more closely with the standard Python exception types.

Test Plan: Run the existing test suite.

Reviewed By: H-Huang

Differential Revision: D27327505

fbshipit-source-id: 862c59222f9ca61a0e5afde89ae8f226090b4f92
2021-04-05 23:36:50 -07:00
5584332180 Wrap cub in its own namespace (#55292)
Summary:
Tentative fix for https://github.com/pytorch/pytorch/issues/55027.
Wraps the cub import in its own namespace so that static variables used by cub and thrust don't conflict if they end up in different libraries when torch is built with BUILD_SPLIT_CUDA. cub variables end up in their own namespace, while thrust variables are unwrapped, so they don't clash.
This also allows extensions to use cub without wrapping it (thrust will still be problematic). The solution to allowing extensions to use thrust is to stop using thrust in pytorch completely.
Now importing cub and importing thrust cannot coexist, so I had to move nonzero to its own file, and remove reliance on thrust functions for it. Nonzero now uses cub only.
Also, we cannot selectively import just some of cub's headers; we are forced to import `cub/cub.cuh`, which is not great.
Caffe2 ops using cub are not touched (there are too many), so mixing caffe2 and torch will (can) still result in the same bug. We are moving towards disabling c2 ops, so I think this is fine.
Still, even with that, the compiler (correctly) warns about redefinition of `CUB_NS_PREFIX`, because including `ATen/ATen.h` transitively includes `thrust/complex.h`, which in turn includes the original (empty) definition of `CUB_NS_PREFIX`. We can probably just ignore this warning. Here's an example warning:
```
In file included from /data/users/ngimel/pytorch/aten/src/ATen/native/cuda/Nonzero.cu:9:
/data/users/ngimel/pytorch/aten/src/ATen/cuda/CubUtils.cuh:4: warning: "CUB_NS_PREFIX" redefined
 #define CUB_NS_PREFIX namespace at{ namespace native{

In file included from /home/ngimel/local/cuda/include/thrust/system/cuda/config.h:76,
                 from /home/ngimel/local/cuda/include/thrust/system/cuda/detail/execution_policy.h:33,
                 from /home/ngimel/local/cuda/include/thrust/iterator/detail/device_system_tag.h:23,
                 from /home/ngimel/local/cuda/include/thrust/iterator/iterator_traits.h:111,
                 from /home/ngimel/local/cuda/include/thrust/detail/type_traits/pointer_traits.h:23,
                 from /home/ngimel/local/cuda/include/thrust/type_traits/is_contiguous_iterator.h:27,
                 from /home/ngimel/local/cuda/include/thrust/type_traits/is_trivially_relocatable.h:19,
                 from /home/ngimel/local/cuda/include/thrust/detail/complex/complex.inl:20,
                 from /home/ngimel/local/cuda/include/thrust/complex.h:1031,
                 from /data/users/ngimel/pytorch/c10/util/complex.h:9,
                 from /data/users/ngimel/pytorch/c10/core/ScalarType.h:4,
                 from /data/users/ngimel/pytorch/c10/core/Scalar.h:10,
                 from /data/users/ngimel/pytorch/build/aten/src/ATen/core/TensorBody.h:8,
                 from /data/users/ngimel/pytorch/aten/src/ATen/Tensor.h:3,
                 from /data/users/ngimel/pytorch/aten/src/ATen/Context.h:4,
                 from /data/users/ngimel/pytorch/aten/src/ATen/ATen.h:9,
                 from /data/users/ngimel/pytorch/aten/src/ATen/native/cuda/Nonzero.cu:1:
/home/ngimel/local/cuda/include/cub/util_namespace.cuh:43: note: this is the location of the previous definition
 #define CUB_NS_PREFIX

```
We will need a lint rule to prevent people from including `cub/cub.cuh`, because doing so would lead to https://github.com/pytorch/pytorch/issues/55027 reappearing for some sequence of operations (and would lead to errors with cub code in extensions).
Also, for this to work reliably, we'll need to make sure that everything calling cub ends up in only one of libtorch_cuda_cu or libtorch_cuda_cpp; otherwise even the namespace won't help (there would still be the same symbols in 2 libraries).

Upd: libtorch_cuda_cpp and libtorch_cuda_cu still contain the same symbols, which means that there exists a sequence of operations that will cause the cache bug to reappear, so this is not a complete solution; we need to adjust the file lists for BUILD_SPLIT_CUDA:
```
(pytorch) [ngimel@ ~/local/pytorch/build/lib] nm libtorch_cuda_cu.so | grep PerDeviceAttributeCache | c++filt
000000000c6bf808 u guard variable for at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
000000000c600830 u guard variable for cub::GetPerDeviceAttributeCache<cub::PtxVersionCacheTag>()::cache
00000000018625e0 t at::native::cub::PerDeviceAttributeCache::DevicePayload at::native::cub::PerDeviceAttributeCache::operator()<at::native::cub::PtxVersion(int&)::{lambda(int&)#1}>(at::native::cub::PtxVersion(int&)::{lambda(int&)#1}&&, int)
00000000009ce630 t cub::PerDeviceAttributeCache::DevicePayload cub::PerDeviceAttributeCache::operator()<cub::PtxVersion(int&)::{lambda(int&)#1}>(cub::PtxVersion(int&)::{lambda(int&)#1}&&, int)
000000000c6bf820 u at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
000000000c600840 u cub::GetPerDeviceAttributeCache<cub::PtxVersionCacheTag>()::cache
(pytorch) [ngimel@ ~/local/pytorch/build/lib] nm libtorch_cuda_cpp.so | grep PerDeviceAttributeCache | c++filt
0000000000ad2d98 u guard variable for at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
0000000000ad2da0 u at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
```
Upd2:
Moved TensorFactories.cu to the torch_cuda_cu sources (see the change to caffe2/CMakeLists.txt), so now cub-related symbols are only in libtorch_cuda_cu. We'd need a test for that; any suggestions on how best to test it?
cc zasdfgbnm malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55292

Reviewed By: anjali411

Differential Revision: D27576442

Pulled By: ngimel

fbshipit-source-id: 1ef29503a342bb214794d34a42a47052092a66c1
2021-04-05 23:21:05 -07:00
0e03a2978a [DDP] Call ensure_prior_reduction_finished within lock (#55074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55074

This function accesses member variables that can be modified by
different threads (i.e. autograd engine threads), so call it within lock scope.
ghstack-source-id: 125707513

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27474526

fbshipit-source-id: 8d43faedd6e6eeeb69e21ce3262337ab83d7ba07
2021-04-05 22:16:13 -07:00
697b130374 Add some missing types to torch (#55184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55184

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D27515470

fbshipit-source-id: 264bc067db8fb430465d14bf9508ac8b1faf0f2f
2021-04-05 21:44:47 -07:00
0521e420fd [Static Runtime] Temporarily disable fusion tests (#55342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55342

The fusion stuff is pretty hard to debug. Given that we're not shipping this part of the stack any time soon, let's temporarily disable these tests and re-enable them when somebody has the cycles to debug them.

Test Plan: Verified that the tests are now disabled

Reviewed By: ajyu

Differential Revision: D27578573

fbshipit-source-id: cb8d7c9339f7c1700b7653b0231cf570996995ff
2021-04-05 20:54:02 -07:00
e0c5d0ea15 Add tutorials to pipeline docs. (#55209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55209

ghstack-source-id: 125588324

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27528715

fbshipit-source-id: e6de3649e7265f34de03d452ffdf66ae45569d58
2021-04-05 20:01:00 -07:00
15b087cdd2 [fx]Allow rewrite a symbolic traced module (#54011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54011

After symbolic tracing, `fn` seems to already have "forward" in its globals. In this case, `new_keys` would have a length of 0, and we take "forward" from `global_dict` directly as `fn_compiled`.

Test Plan: Added a new test in test_fx_experimental.

Reviewed By: ansley

Differential Revision: D27049012

fbshipit-source-id: 7fbeb50ebb717900ff5fc0a8a0925d6a97f5a6dd
2021-04-05 18:35:51 -07:00
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing, (`index_put` exists, for example)
- Corrected docs

It also adds much more thorough testing of the operations and their gradients.
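
For concreteness, a small usage sketch of the semantics above (indices address the flattened tensor, and `accumulate=True` sums values that land on the same position):

```python
import torch

x = torch.zeros(2, 3)
idx = torch.tensor([0, 0, 4])      # flat positions into x.view(-1)
src = torch.tensor([1.0, 2.0, 3.0])

x.put_(idx, src, accumulate=True)  # repeated index 0 accumulates 1.0 + 2.0
print(x)                           # tensor([[3., 0., 0.], [0., 3., 0.]])
print(torch.take(x, idx))          # tensor([3., 3., 3.])
```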

There is a BC-breaking change: we now check that the inputs do not overlap in the `put_` operation. The TH implementation handled this (correctly in some cases, incorrectly in others) by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
bf37bf7da4 Make JSON files more human readable (#55335)
Summary:
Prettifies JSON files .pytorch-test-times and .pytorch-slow-tests so that not everything is on one single line.

This is of slightly more importance as the generated `.pytorch-slow-tests` ends up getting stored in our test-infra repo ([example](ad9cd87565)), and it is nice not to have that little red missing-newline symbol at the end.
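
A minimal sketch of what the prettification amounts to (the payload and file handling here are illustrative, not the actual tooling code):

```python
import json

test_times = {"test_nn": 123.4}  # placeholder payload
with open(".pytorch-test-times", "w") as f:
    json.dump(test_times, f, indent=2)  # multi-line instead of one long line
    f.write("\n")                       # trailing newline avoids GitHub's marker
```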

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55335

Reviewed By: samestep

Differential Revision: D27576930

Pulled By: janeyx99

fbshipit-source-id: be58565b8c8593a9bfcfab383ee19facc79f0572
2021-04-05 17:23:36 -07:00
b986a76d91 Clang-format distributed.py (#55254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55254

ghstack-source-id: 125680320

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27542846

fbshipit-source-id: 700c3e59a9df98233fdb27054b472f5cb33eb604
2021-04-05 16:48:22 -07:00
6a2f046504 [SPMD] Restrict DDP communication hooks to SPSD mode (#55253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253

Previously, DDP communication hooks took a tensor list as input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.

The next step is limiting Reducer to a single model replica.
ghstack-source-id: 125677637

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27533898

fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
2021-04-05 16:46:47 -07:00
d690973295 irange on int64_t (#55148)
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447811

fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
2021-04-05 16:14:00 -07:00
ef262575dd [pytorch] Fix printing of optional string arguments in schemas (#55196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55196

This commit fixes printing of default values for optional string type arguments in schemas. At the moment, these default values are not printed as quoted strings. If a schema with an optional string type parameter with a default value that is not `None` is printed and then parsed, the lack of quotes causes a parsing error.
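
A hedged illustration of the round-trip behavior this fixes; the schema string below is hypothetical:

```python
import torch

schema = torch._C.parse_schema(
    'example::pad(Tensor self, str? mode="constant") -> Tensor'
)
# With this fix, printing quotes the default, so the schema re-parses cleanly;
# previously str(schema) rendered mode=constant and the re-parse failed.
roundtrip = torch._C.parse_schema(str(schema))
print(roundtrip)
```
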
ghstack-source-id: 125655241

Test Plan: This commit adds a unit test to `test_function_schema.py` to test this case.

Differential Revision: D27525450

fbshipit-source-id: 23a93169e7599e7b385e59b7cfafb17fd76318b7
2021-04-05 15:28:18 -07:00
2ee02b30b1 Replace rounding_mode="true" with rounding_mode=None (#51988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51988

* **#51988 Replace rounding_mode="true" with rounding_mode=None**
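
A short usage sketch of the renamed mode (`rounding_mode=None` now spells what `rounding_mode="true"` used to):

```python
import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor(2.0)
print(torch.div(a, b, rounding_mode=None))     # true division: tensor([ 3.5000, -3.5000])
print(torch.div(a, b, rounding_mode="trunc"))  # round toward zero: tensor([ 3., -3.])
print(torch.div(a, b, rounding_mode="floor"))  # round down: tensor([ 3., -4.])
```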

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27561817

Pulled By: mruberry

fbshipit-source-id: 60d1d9c389570f60d599fc1876518717367fb368
2021-04-05 14:53:43 -07:00
3acbaf834e Make structured functions properly check device/dtype of explicit out args (#55150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55150

Somehow I forgot to add these checks.  Now they're in here.  Thanks
ngimel for noticing.

This is probably a slight efficiency hit on TensorIterator, which is
probably already doing all these checks.  Would be good to follow up
on this, though it may not be easily fixable with the TI rewrite.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D27523879

Pulled By: ezyang

fbshipit-source-id: 458e617dbc6de6fcfa9e5841148b30b99f52e001
2021-04-05 14:42:43 -07:00
45aaaef22c Fix timer overflow on small, fast snippets (#55200)
Summary:
- Fixes https://github.com/pytorch/pytorch/issues/54114
- Capped the estimated block size to the largest multiple of ten less than C++ INT_MAX (spelled out in the sketch below)
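
The cap, spelled out (a Python sketch of the arithmetic; variable names are illustrative):

```python
INT_MAX = 2**31 - 1                    # C++ INT_MAX on typical platforms: 2147483647
block_size_cap = (INT_MAX // 10) * 10  # 2147483640, the largest multiple of ten below INT_MAX
```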

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55200

Test Plan: unit test doesn't throw exception as expected

Reviewed By: robieta

Differential Revision: D27542652

Pulled By: naveedgol

fbshipit-source-id: 3ba68ce84d5fa1d8338cdd5c9f9e5d8c9adda51c
2021-04-05 14:11:26 -07:00
7613b1150b [docs][quant] Add fx graph mode quant api doc (#55306)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55306

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27567187

fbshipit-source-id: ceef873b78fc77e366a47be66c8efd856bac013e
2021-04-05 13:56:23 -07:00
e61f5b586b Revert D27404164: [PyTorch] Devirtualize is_contiguous
Test Plan: revert-hammer

Differential Revision:
D27404164 (62aa924368)

Original commit changeset: e1dce8c02100

fbshipit-source-id: 9caad109f371607479314501653c275ad95120b8
2021-04-05 13:41:31 -07:00
62aa924368 [PyTorch] Devirtualize is_contiguous (#54896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54896

This should help performance. (For example, it improves total
time spent in a C++ benchmark that just adds 2 tensors in place by
about 10%.)
ghstack-source-id: 125659451

Reviewed By: bhosmer

Differential Revision: D27404164

fbshipit-source-id: e1dce8c02100ee4ce22510298c7e0d0f192be201
2021-04-05 13:16:49 -07:00
d0ffada9ee .github: Add scale-config.yml (#55315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55315

Tested here: https://github.com/seemethere/test-repo/actions/runs/720143591

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D27572122

Pulled By: seemethere

fbshipit-source-id: 0b5a772cebf2a8adb9b8805fd813e9cfbe0249d7
2021-04-05 12:49:09 -07:00
f4a618bb5a [PyTorch] Don't create intermediate Tensor for at::result_type w/Scalar (#55232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55232

Fixes https://github.com/pytorch/pytorch/issues/55229 .
ghstack-source-id: 125616311

Test Plan: Looks like test/test_type_promotion.py covers this.

Reviewed By: ezyang

Differential Revision: D27536521

fbshipit-source-id: 3e686934f845588da07de9190c9760c8ed453caf
2021-04-05 12:19:19 -07:00
fffdc5fa2f docs: Pin docutils to 0.16 (#55309)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55309

Reviewed By: seemethere, samestep

Differential Revision: D27569585

Pulled By: agolynski

fbshipit-source-id: 09f7ee08a0aea9fffd118a290f2295fe9dcab25a
2021-04-05 11:31:09 -07:00
5339d534a3 Add runner for instruction count benchmarks. (#54652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54652

This PR adds a fairly robust runner for the instruction count microbenchmarks. Key features are:

* Timeout and retry. (In rare cases, Callgrind will hang under heavy load.)
* Robust error handling and keyboard interrupt support.
* Benchmarks are pinned to cores. (Wall times still won't be great, but it's something.)
* Progress printouts, including a rough ETA.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27537823

Pulled By: robieta

fbshipit-source-id: 699ac907281d28bf7ffa08594253716ca40204ba
2021-04-05 11:18:57 -07:00
c5a1eb4156 extend benchmarks (#54651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54651

This PR fleshes out the benchmarks to everything I could come up with. (166 individual cases when all is said and done.) If there's anything you feel warrants a spot in CI that I've missed, by all means let me know.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27537824

Pulled By: robieta

fbshipit-source-id: 3819e8fec2131c6b5f29f5099cd41e79131bed90
2021-04-05 11:17:12 -07:00
c9b214f9fb Add Python-3.9 PyTorch M1 nightly builds (#55278)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55278

Reviewed By: janeyx99

Differential Revision: D27554985

Pulled By: malfet

fbshipit-source-id: 8d2cd0ef1cea7f2c7c586da798f07dde4581d279
2021-04-05 10:43:23 -07:00
a102adb55e Automated submodule update: FBGEMM (#54575)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 7c0c486650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54575

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia, yns88

Differential Revision: D27286716

fbshipit-source-id: 03b83dacc04edecebbb5b49046baa27deb5ba541
2021-04-05 10:18:36 -07:00
7fd3c030ef Write OpInfo for dist (#55092)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53516
cc anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55092

Reviewed By: nikithamalgifb

Differential Revision: D27493577

Pulled By: anjali411

fbshipit-source-id: c7e8400a20bbc7138249b249e322b3b23e112336
2021-04-05 10:09:56 -07:00
6c8270ea21 fix bc breakage of #52043 (#55303)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55303

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27567671

Pulled By: agolynski

fbshipit-source-id: 771e75b68be52dd5dd31437238d1f9fef481f853
2021-04-05 09:49:22 -07:00
ebf40e6ed2 CI: Run test_lite_interpreter_runtime from built lib directly (#55291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55291

In the script, the build happens in `cpp-build/caffe2`. All the executables and dylibs are available there. It may be more straightforward and accurate to use those binaries, instead of copying the test binary to miniconda3 and using the dylibs from there.

Test: CI, especially pytorch_macos_10_13_py3_lite_interpreter_build_test.

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D27566631

Pulled By: iseeyuan

fbshipit-source-id: 402b9941ab422979d53243624f67d65752213191
2021-04-05 09:19:33 -07:00
980d6f2589 torch.linalg.det (#53119)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51652.
In particular:
- the main implementation is in `torch.linalg.det` now. `torch.det` is just a deprecated alias to it
- add a new `OpInfo` for `torch.linalg.det`
- remove the old-style tests for `torch.det` (this is similar to what we did for `torch.linalg.slogdet`, see https://github.com/pytorch/pytorch/issues/49194)
- added an `out=` argument to `torch.linalg.det`, but **not** to `torch.det` (see the usage sketch below).
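
A hedged usage sketch of the points above (shapes are illustrative):

```python
import torch

A = torch.randn(4, 3, 3)      # a batch of four 3x3 matrices
out = torch.empty(4)
torch.linalg.det(A, out=out)  # out= is accepted by the new linalg API only
print(out)
```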

It is worth noting that I had to skip a few tests:
- `TestGradientsCuda::test_fn_gradgrad_linalg_det_cuda_float64`. This is not a regression: the functionality is also broken on master, but the test is not executed properly due to https://github.com/pytorch/pytorch/issues/53361.

And the following tests, which fail only on ROCm:
- `test_variant_consistency_jit_cuda_{float64,float32}`
- `test_fn_grad_cuda_float64`

I think that the ROCm tests fail because the current linalg.det backward is unstable if the matrix has repeated singular values, see https://github.com/pytorch/pytorch/issues/53364 .

(At the moment of writing some CI jobs are still running but I believe the build will be green, since the only difference wrt the last push is the skip of the ROCm tests)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53119

Reviewed By: H-Huang

Differential Revision: D27441999

Pulled By: mruberry

fbshipit-source-id: 5eab14c4f0a165e0cf9ec626c3f4bb23359f2a9e
2021-04-05 08:45:27 -07:00
197f9f0826 Merge CUDA Streams and Events (#53902)
Summary:
-----------
- Updates the current_stream and default_stream APIs to take an `optional[device]` argument
- Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` with `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT
- Merges the StreamContext manager for both Eager and JIT.
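
A hedged sketch of the updated eager-mode surface (assumes a CUDA build; signatures reflect the API at the time):

```python
import torch

if torch.cuda.is_available():
    s = torch.cuda.current_stream()                        # defaults to the current device
    d = torch.cuda.default_stream(torch.device("cuda:0"))  # explicit optional device
    side = torch.cuda.Stream()
    with torch.cuda.stream(side):                          # the merged StreamContext
        torch.ones(1, device="cuda")                       # runs on the side stream
```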

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902

Test Plan:
------
Run JIT tests:
python test/test_jit.py -v TestCUDA

Run eager tests:
python test/test_cuda.py -v TestCuda

Reviewed By: glaringlee

Differential Revision: D27494627

Pulled By: nikithamalgifb

fbshipit-source-id: b30b0570e38a33fb335c83762eb06ffd46a44b5c
2021-04-05 08:19:55 -07:00
5e72571df3 Fix wrong changes from #54103 (#54610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54610

The `.is_view()` method actually only refers to backward-mode views.
This is not a problem right now in master (and thus I didn't revert the other PR) because nothing creates forward AD views.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D27396756

Pulled By: albanD

fbshipit-source-id: 64ff11c6f2486c6430714988d1cf6ecf3d80dccb
2021-04-05 07:48:23 -07:00
f3969d3db6 Fix bug in self.assertExpectedInline (#55149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55149

I was wondering why no one used this function.  It's because it
doesn't work!  Also a small doc improvement for expected inline.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D27523880

Pulled By: ezyang

fbshipit-source-id: a1d80c088ebf1c58a2b9b13d28f7f23d08c42e60
2021-04-05 06:37:36 -07:00
edb919376d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27563536

fbshipit-source-id: 2323c810b4bcac9934e90675d6291822d463b081
2021-04-05 04:17:35 -07:00
c821b83ab3 [typing] make mypy-protobuf output compatible with pyre for caffe2 type stubs (#55294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55294

Some static checkers like pyre have difficulties with types like `builtins.type`, so we strip the `builtins` prefix from the autogenerated proto type stubs.

Test Plan: Let CI run.

Reviewed By: d4l3k

Differential Revision: D27477699

fbshipit-source-id: 45e19835974200a030817d37aec785e3ecb23e8b
2021-04-05 03:23:31 -07:00
3d492b0697 Revert D27505153: [pytorch][PR] OpInfo: atan2
Test Plan: revert-hammer

Differential Revision:
D27505153 (e309ab8510)

Original commit changeset: 45430ad0a7ef

fbshipit-source-id: 630c287e9344b32bd3fcf5092e3e952907774fba
2021-04-05 02:11:00 -07:00
bcdcf347cb Add cusolver potrs and potrsBatched to the backend of torch.cholesky_solve (#54315)
Summary:
This PR adds cusolver potrs and potrsBatched to the backend of torch.cholesky_solve and torch.linalg.cholesky_solve.

`cholesky_solve` heuristics:

- If magma is not installed, or batch_size is 1:
  - If batch_size > 1 and nrhs == 1, dispatch to `cusolverDn<T>potrsBatched`,
  - Otherwise, dispatch to `cusolverDnXpotrs` (64 bit) and `cusolverDn<T>potrs` (legacy).
- Otherwise, use magma.

Note: `cusolverDn<T>potrsBatched` only supports `nrhs == 1`. It is used for `nrhs == 1` batched matrices if magma is **not** installed.
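
A Python paraphrase of this heuristic (illustrative only; the real dispatch lives in the C++ backend):

```python
def choose_backend(has_magma: bool, batch_size: int, nrhs: int) -> str:
    if not has_magma or batch_size == 1:
        if batch_size > 1 and nrhs == 1:
            return "cusolverDn<T>potrsBatched"
        return "cusolverDnXpotrs (64-bit) / cusolverDn<T>potrs (legacy)"
    return "magma"
```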

See also https://github.com/pytorch/pytorch/issues/42666 #47953

Todo:

- [x] benchmark and heuristic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54315

Reviewed By: ngimel

Differential Revision: D27562225

Pulled By: mruberry

fbshipit-source-id: 323e5d60610abbbdc8369f5eb112d9fa01da40f6
2021-04-05 02:03:55 -07:00
0a81034dd0 Port atan2 to structured kernel (#55130)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55130

Reviewed By: gchanan

Differential Revision: D27502777

Pulled By: ezyang

fbshipit-source-id: 9c368e2c3670f5633e059024ccff8b3e95e2733e
2021-04-05 00:12:42 -07:00
d2a58bfe6f Add mkldnn tanh operator (#54656)
Summary:
## 🚀 Feature
Add Mkl-Layout kernel for tanh.

## Motivation
We want to add an Mkl-Layout kernel for tanh to improve tanh's performance when the input tensor has Mkl layout.
PyTorch currently has no Mkl-Layout kernel for tanh, so it cannot execute tanh on an Mkl-Layout tensor.
Of course, you can temporarily avoid this problem by calling to_dense/to_mkldnn, but performance is significantly reduced by the copy overhead (1.6-4.3 times slower than the CPU kernel).
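
A short usage sketch of what this enables (shapes are illustrative):

```python
import torch

x = torch.randn(128, 128)
y_mkl = x.to_mkldnn().tanh()  # tanh now runs directly on the Mkl-Layout tensor
y = y_mkl.to_dense()          # convert back only when a dense tensor is needed
```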

## Performance results

### Environment
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- OS: Ubuntu 18.04.1 LTS
- compiler: gcc 7.5.0
- branch: master
- commit ID: fe2c126
- build Environment variable: USE_CUDA=0
- Python: 3.6.9
- Intel MKL(Math Kernel Library): 2020.2-254
- Intel oneDNN: 1.8.1

### Benchmark script
``` python
import torch
import torch.nn as nn

torch.manual_seed(1)

x = torch.randn(2048, 2048)
x_mkl = x.to_mkldnn()

print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### CPU tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.to_dense().tanh().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.to_dense().tanh_().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

### Results
#### OMP_NUM_THREADS=1 results (self CPU time total, ms)

| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh      | 579.662    | 1658.000                        | 617.565                     |
| tanh_     | 554.477    | 881.997                         | 589.426                     |

#### OMP_NUM_THREADS=6 results (self CPU time total, ms)

| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh      | 182.387    | 421.336                         | 136.226                     |
| tanh_     | 94.331     | 404.931                         | 99.254                      |

## Modification policy for the code
oneDNN already supports the tanh operation.

[oneDNN: Elementwise](https://spec.oneapi.com/versions/latest/elements/oneDNN/source/primitives/eltwise.html)

A sigmoid implementation that uses the same Elementwise API as tanh already exists, so this PR's code was written with reference to that sigmoid implementation.

527c1e0e37/aten/src/ATen/native/mkldnn/UnaryOps.cpp (L28-L42)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54656

Test Plan:
A test for sigmoid already exists, as shown below, so I added a new test for tanh modeled on the sigmoid test.

527c1e0e37/test/test_mkldnn.py (L944-L954)

### mkldnn tanh test result

```
$ python3 test/test_mkldnn.py TestMkldnn.test_tanh
Couldn't download test skip set, leaving all tests enabled...
.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK
```

Reviewed By: gchanan

Differential Revision: D27395827

Pulled By: ezyang

fbshipit-source-id: d4481332de187e2dea095f9b6aabc73a497960fe
2021-04-05 00:00:16 -07:00
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow-up change adding a flag that lets the monitored barrier collect all the failed ranks and then throw, instead of throwing on the first one. This is useful because the monitored barrier can now pick up all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
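
A hedged usage sketch, assuming an already-initialized gloo process group and the Python wrapper available at the time:

```python
import datetime
import torch.distributed as dist

dist.monitored_barrier(
    timeout=datetime.timedelta(seconds=30),
    wait_all_ranks=True,  # collect every unresponsive rank before throwing
)
```
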
ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
0ec1af4b7e [c10d] Enforce order of waited ranks in monitored barrier. (#55009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009

Changes monitoredBarrier so that we await acknowledgement from ranks
in a consistent order (from least to greatest). This will reduce confusion
around the order in which the ranks are awaited. We are still planning to add support
for awaiting all ranks in follow up changes.
ghstack-source-id: 125699838

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27405417

fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
2021-04-04 21:38:25 -07:00
e309ab8510 OpInfo: atan2 (#55132)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55132

Reviewed By: gchanan

Differential Revision: D27505153

Pulled By: mruberry

fbshipit-source-id: 45430ad0a7efab0b32c945356aa49f45d0175f83
2021-04-04 21:24:06 -07:00
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
e3691be2d9 Dump C++ stack traces of all threads for distributed tests. (#55003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003

Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.

I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.

Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27419714

fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
2021-04-03 13:59:56 -07:00
8ed20b3f65 Leak Caffe2 threadpool in child processes right after fork to prevent segfault (#54895)
Summary:
## Problem summary
Fixes https://github.com/pytorch/pytorch/issues/54752: when the number of threads is more than 3 and at least one `set_num_threads` invocation has taken place before the dataloader forks child processes, `set_num_threads(1)` in a child process causes a segfault. During that invocation, the child process touches the data structures of the parent's Caffe2 thread-pool, which it inherited via `fork`'s copy-on-write semantics even though the pool's threads themselves do not exist in the child.

## Solution
malfet [advised](https://github.com/pytorch/pytorch/issues/54752#issuecomment-810315302) & [authored code](https://github.com/pytorch/pytorch/pull/54895#pullrequestreview-625670122) adding a `pthread_atfork` handler in `pytorch/caffe2/utils/threadpool/pthreadpool-cpp.cc` that is invoked in the child process right after fork to deliberately leak the Caffe2 thread-pool (the child inherits the thread-pool's data structures from its parent, but doesn't actually have those threads, since after `fork` a child process has only one thread).

## Additional changes
Added a unit test, `test_no_segfault`, to `test_dataloader.py` to cover this issue.
Also enabled `test_segfault` (which makes sure that segfaults do happen in worker processes in a particular case).
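
A minimal repro sketch of the pre-fix failure mode described above (dataset and helper names here are illustrative, not taken from the actual test):

```
import torch
from torch.utils.data import DataLoader, Dataset

class TrivialDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

torch.set_num_threads(4)  # materializes the Caffe2 thread-pool in the parent

def worker_init(_):
    torch.set_num_threads(1)  # used to segfault in forked workers before this fix

loader = DataLoader(TrivialDataset(), num_workers=2, worker_init_fn=worker_init)
for _ in loader:
    pass
```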

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54895

Reviewed By: zhangguanheng66

Differential Revision: D27542253

Pulled By: malfet

fbshipit-source-id: 10f9c67ce1ff1aa37d3efebf405bd93f7f9d2489
2021-04-03 10:51:20 -07:00
8377e6221a Revert D27478225: [pytorch][PR] Added pow() on CPU for float16 & bfloat16
Test Plan: revert-hammer

Differential Revision:
D27478225 (6d030c14cf)

Original commit changeset: d309dd98d5a9

fbshipit-source-id: e0518f15185b41946caf3a8456c7af3f52e5a910
2021-04-03 10:26:44 -07:00
a84c92b78b [package] populate a special attribute on imported modules (#55255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55255

This allows packaged code to detect whether or not it is running in a
packaged context, and to behave differently depending on that. An example
where this might be useful is controlling dynamic dependency loading
depending on whether or not something is packaged.
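
A sketch of the pattern this enables; note that the attribute name `__torch_package__` is an assumption here, so check the torch.package docs for the exact spelling:

```
import sys

# inside code that may be imported either normally or from a torch.package
IS_PACKAGED = hasattr(sys.modules[__name__], "__torch_package__")

if IS_PACKAGED:
    pass  # e.g. skip dynamic dependency loading inside a package
else:
    pass  # normal filesystem behavior
```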

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D27544245

Pulled By: suo

fbshipit-source-id: 55d44ef57281524b8d9ab890bd387de97f20bd9f
2021-04-03 00:58:59 -07:00
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
322854d2f0 [SPMD] Error out SPMD in C++ Reducer (#55212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212

Error out SPMD in C++ Reducer.

Added a new test `test_reducer_no_multi_replicas`, which checks no multiple replicas are allowed at the Reducer constructor.

Removed 2 tests relevant to reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`

ghstack-source-id: 125602472

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27497747

fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
2021-04-02 22:59:25 -07:00
4170a6cc24 Migrate mode from TH to ATen (#52043)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24731 #24673 https://github.com/pytorch/pytorch/issues/24597 #24526 https://github.com/pytorch/pytorch/issues/46507
Related https://github.com/pytorch/pytorch/issues/24507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52043

Reviewed By: mruberry

Differential Revision: D27468266

Pulled By: ngimel

fbshipit-source-id: 35a3229c2a706da9bad4ccd0070161831e5476ba
2021-04-02 22:21:53 -07:00
e8dbd0e1a0 [TensorExpr] Minor cleanups in kernel.cpp. (#55257)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55257

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D27544659

Pulled By: ZolotukhinM

fbshipit-source-id: c2f51be1a42df090a105689c8e3e91446e9ea8b4
2021-04-02 21:47:48 -07:00
641d4ff160 [FX] Add stride to shape_prop pass (#55108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55108

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27482241

Pulled By: jamesr66a

fbshipit-source-id: 7d928015712126e916c86225dc3ab27aba22d431
2021-04-02 19:57:11 -07:00
28531c97b2 [caffe2] Shape inference for Transpose (#55188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55188

We need to make sure dim types are preserved after applying Transpose.

Test Plan:
```
$ buck build caffe2/caffe2/opt:bound_shape_inference_test && ./buck-out/gen/caffe2/caffe2/opt/bound_shape_inference_test --gtest_filter=*Transpose*
```

Reviewed By: yinghai

Differential Revision: D27514487

fbshipit-source-id: 431b7f2d08664f2ec311a733c926dbb52c63a7d4
2021-04-02 17:43:27 -07:00
6d030c14cf Added pow() on CPU for float16 & bfloat16 (#50999)
Summary:
Added the functionality desired in https://github.com/pytorch/pytorch/issues/50789.

1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types.
However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it).  It replaced code that had previously been duplicated for (float, double) and complex types,
so PowKernel.cpp looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `linalg.norm` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50999

Reviewed By: zou3519

Differential Revision: D27478225

Pulled By: heitorschueroff

fbshipit-source-id: d309dd98d5a96d0cb9b08281757bb1c65266d011
2021-04-02 15:57:06 -07:00
c549a147a9 [DataLoader] Typing Doc (#54773)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54773

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27364246

Pulled By: ejguan

fbshipit-source-id: 48908555853c364d2d3cc173e3b73a6bec2e19f1
2021-04-02 15:22:35 -07:00
0b1c3dfae4 [DataLoader] Typing Enforcement for DataPipe at runtime (#54544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54544

## Feature
- Add `subinstance(data, type)` to check whether `data` is a subtype instance of `type`
- Add a `runtime_validation` decorator to validate that the data returned from `__iter__` is a subtype instance of the hint (see the sketch below)
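
A hedged sketch of how the decorator might be used; the exact import path and application point are assumptions (they have moved between releases):

```
from typing import Iterator, Tuple

from torch.utils.data import IterDataPipe, runtime_validation  # assumed import path

class PairsPipe(IterDataPipe[Tuple[int, str]]):
    @runtime_validation
    def __iter__(self) -> Iterator[Tuple[int, str]]:
        yield (0, "zero")  # matches the Tuple[int, str] hint
        yield (1, 1)       # mismatch -> flagged when this element is yielded
```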

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327234

Pulled By: ejguan

fbshipit-source-id: fb6a332762b0fe75284bb2b52a13ed171b42558c
2021-04-02 15:22:32 -07:00
1535520f08 [DataLoader] Typing Enforcement for DataPipe at construct-time (#54066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54066

## Feature
- Add a decorator `construct_time_validation` to validate each input datapipe according to the corresponding type hint.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327236

Pulled By: ejguan

fbshipit-source-id: a9d4c6edb5b05090bd5a369eee50a6fb4d7cf957
2021-04-02 15:22:29 -07:00
44edf8c421 [DataLoader] Typing Enforcement for DataPipe at Compile-time (#54020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54020

## Feature
- Add `issubtype` to check the type is a subtype of the other type.
- Add `_DataPipeMeta` (mimicking Python 3.6 typing)
  - Add a `type` attribute for each DataPipe
  - Save the original `__init__` function for each DataPipe
  - Validate the return hint of `__iter__`
  - Replace the `__init__` function based on `type`
    - Fixed type: Put the original `__init__` back if it exists, or use a plain `__init__`
    - Non-fixed type: Add a new `__init__` that copies `cls.type` for each instance (optimized for memory)

No errors for the main repo, `torchvision`, `torchaudio`, or `torchtext`.

## Future
- Add the same thing for `__getitem__`.
- When DataFrame support lands, add another type for DataFrame with column names and types.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327232

Pulled By: ejguan

fbshipit-source-id: fd3a6029c16f5d814b1d7e1b1566fdcd8fd1ad9a
2021-04-02 15:22:27 -07:00
560e3be587 [DataLoader] Implement issubtype for type hints (#54299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54299

## Feature
- Check type is a subtype of another type

Prerequisite for the DataPipe typing system.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327235

Pulled By: ejguan

fbshipit-source-id: 8f50a663a86540677c9e132ac7c5216fdac46f70
2021-04-02 15:20:55 -07:00
159fdde9ae Support needsOutputs for RecordFunction and ObserverUtil improvements (#55012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442

Added needsOutputs support to RecordFunction and improved ObserverUtil functions to handle list data. Minor renames for consistency.

To get output data from kernel calls, we need to temporarily capture the outputs before passing them to the record function; the results are then released to the function's return. We handle two cases, for unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.

As an optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)`, and it should not affect other observers or cases where the observer is not enabled.

Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN      ] RecordFunctionTest.TracedTestInputsOutputs
[       OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN      ] RecordFunctionTest.SampledCallbacks
[       OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN      ] RecordFunctionTest.RecordFunctionGuard
[       OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN      ] RecordFunctionTest.Callbacks
[       OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN      ] RecordFunctionTest.ShouldRun
[       OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN      ] RecordFunctionTest.Basic
[       OK ] RecordFunctionTest.Basic (1 ms)
[ RUN      ] RecordFunctionTest.OperatorNameOverload
[       OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[  PASSED  ] 7 tests.

```

Reviewed By: ilia-cher

Differential Revision: D27449877

fbshipit-source-id: 69918b729565f5899471d9db42a587f9af52238d
2021-04-02 15:16:17 -07:00
2452182e6c [SPMD] Remove test_grad_layout_1devicemodule_2replicaperprocess (#54826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54826

This test will no longer work, because we errored out SPMD in #54454.

This test is already disabled.
ghstack-source-id: 125602473

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27381719

fbshipit-source-id: a3079ff0766f91112cbe58c1f00c1b02d241c8cd
2021-04-02 15:13:47 -07:00
e589247a19 [SPMD] Change assertions to raising value errors in distributed.py (#54825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54825

These assertions are tested in test_c10d.py

Context: https://github.com/pytorch/pytorch/pull/54454#discussion_r602657818
ghstack-source-id: 125602462

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

Reviewed By: rohan-varma

Differential Revision: D27381649

fbshipit-source-id: 9b994e9c2acf796770c2f2af2cebdd5561834d14
2021-04-02 15:13:45 -07:00
6a40339920 [SPMD] Error out SPMD mode (#54454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54454

According to the pitch in https://github.com/pytorch/pytorch/issues/47012

1. Let DDP error out if `device_ids` contains multiple devices (see the sketch after this list).
2. If `device_ids` is not specified, DDP will use the provided model (the module argument in the DDP constructor) as-is, regardless of whether the model is on one GPU, multiple GPUs, or the CPU.
3. Remove the assertion that prevents SPMD in DDP's `join()` method, because SPMD is now already forbidden by the constructor. Also remove the relevant unit test `test_ddp_uneven_inputs_replicated_error`.
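
A sketch of the new constructor behavior (module and device indices are illustrative, and a process group is assumed to be initialized already):

```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(8, 8).cuda(0)
ddp = DDP(model, device_ids=[0])   # OK: single-device module
# DDP(model, device_ids=[0, 1])    # now raises: single-process multi-GPU
                                   # (SPMD) replication is no longer allowed
```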

Closes: https://github.com/pytorch/pytorch/issues/47012

ghstack-source-id: 125644392

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_cuda
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn -- test_rnn

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_ids_not_allowed
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_single_device_module_device_ids_None
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_nccl_backend_multi_device_module_device_ids_None

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_multi_device_module_config

waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27226092

fbshipit-source-id: 3ee1e4bc46e5e362fc82cf7a24b2fafb34fcf1b9
2021-04-02 15:11:59 -07:00
6e33420436 Add embedding bag support to fx_glow (#54909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54909

Pull Request resolved: https://github.com/pytorch/glow/pull/5481

This diff adds support for embedding bag to fx_glow and a test case to test_fx_glow.

Reviewed By: jfix71

Differential Revision: D27272897

fbshipit-source-id: 9e3be28efee38a01784afceb188a86f6408393dd
2021-04-02 14:20:40 -07:00
29916dbf1e Clang-format _distributed_c10d.pyi (#55220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55220

ghstack-source-id: 125597170

Test Plan: N/A

Reviewed By: pritamdamania87

Differential Revision: D27531346

fbshipit-source-id: c603cadbff682a9361d0e97d164f18b029e396b1
2021-04-02 13:43:31 -07:00
91a809bbd7 [c10] Adjust macro check that detects if glibc++ use c99 csqrt (#55177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55177

This fixes `warning: '_GLIBCXX11_USE_C99_COMPLEX' is not defined, evaluates to 0`, which would be raised if https://github.com/pytorch/pytorch/pull/54820 were used with libstdc++ compiled without USE_C99_COMPLEX support.

In `c++config.h`, `_GLIBCXX_USE_C99_COMPLEX` is aliased to either `_GLIBCXX98_USE_C99_COMPLEX` or `_GLIBCXX11_USE_C99_COMPLEX` depending on the `__cplusplus` macro, as shown here:
0cf4813202/libstdc%2B%2B-v3/include/bits/c%2B%2Bconfig (L641-L647)

The above-mentioned config file is generated by autoconf, which leaves the macro undefined if the feature is not used, so a conditional like `defined(_GLIBCXX_USE_C99_COMPLEX) && _GLIBCXX_USE_C99_COMPLEX == 0` would trigger an undefined-macro preprocessor warning.

Test Plan: CI

Reviewed By: Orvid

Differential Revision: D27517788

fbshipit-source-id: a6db98d21c9bd98205815641363b765a02399678
2021-04-02 13:20:30 -07:00
fb64caedb5 Don't fail "Add annotations" if "Lint" is canceled (#55242)
Summary:
https://github.com/pytorch/pytorch/issues/54779 split out the logic from our "Lint" workflow into a separate workflow that allows us to annotate PRs from forks. However, as of https://github.com/pytorch/pytorch/issues/54689, it is possible for the "Lint" workflow to be canceled, in which case it may not upload the "flake8-py3" and "clang-tidy" artifacts that the "Add annotations" workflow expects. This often results in GitHub pointlessly sending notification emails due to the failure in the "Add annotations" workflow. This PR fixes the issue by gracefully handling the case where the expected artifact is absent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55242

Test Plan: I tested this in the same external sandbox repo used to test https://github.com/pytorch/pytorch/issues/54779.

Reviewed By: malfet

Differential Revision: D27540120

Pulled By: samestep

fbshipit-source-id: 47cc02950edbbc6381033bda2fe4570cb3e331cb
2021-04-02 12:40:20 -07:00
38a08a49ea Flip clip_grad_norm default for error_if_nonfinite to false (#55169)
Summary:
The non-backwards-compatible change introduced in https://github.com/pytorch/pytorch/pull/53843 is tripping up a lot of code. It is better to set it to False initially and then potentially flip it to True in a later version to give people time to adapt.
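
A sketch of the behavior after the flip:

```
import torch
from torch.nn.utils import clip_grad_norm_

p = torch.nn.Parameter(torch.randn(3))
p.grad = torch.tensor([1.0, float("nan"), 2.0])

clip_grad_norm_([p], max_norm=1.0)  # no error by default now
clip_grad_norm_([p], max_norm=1.0, error_if_nonfinite=True)  # opt back in: raises
```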

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55169

Reviewed By: mruberry

Differential Revision: D27511150

Pulled By: jbschlosser

fbshipit-source-id: 1ac018557c0900b31995c29f04aea060a27bc525
2021-04-02 12:25:32 -07:00
6866c033d5 [JIT] Add recursive scripting for class type module attributes (#55124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55124

**Summary**
This commit modifies type inference (used by the module scripting code)
so that it tries to script the type of any class instances that it
encounters. This enables recursive, automatic scripting of class type
module attributes.
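
A small sketch of what this enables (class and module names are illustrative):

```
import torch

class Scaler:  # plain Python class, never explicitly scripted
    def __init__(self, factor: float):
        self.factor = factor

    def apply(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.factor

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scaler = Scaler(2.0)  # class-type attribute

    def forward(self, x):
        # the attribute's class is now scripted recursively during jit.script
        return self.scaler.apply(x)

scripted = torch.jit.script(M())
```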

**Test Plan**
This commit adds a test case for this to `TestClassType`.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23971883

Pulled By: SplitInfinity

fbshipit-source-id: 7a5a2e7c12ee68cbdeb0a07e6aaf98734a79cb06
2021-04-02 12:16:21 -07:00
6e2d020037 Add interpolation kwarg to torch.quantile (#49267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49267

This PR builds upon the PR https://github.com/pytorch/pytorch/pull/48711 by RockingJavaBean. The original PR introduced a BC breaking change by making the interpolation parameter positional. Thus, previous invocations of torch.quantile that did not include the interpolation parameter failed after the PR landed.

To avoid BC-breaking changes, we preserve the original signatures and make the interpolation parameter in the new signatures kwarg-only. For now, interpolation cannot have a default value, to avoid ambiguity with the deprecated signature. However, due to limitations of codegen and C++, we cannot have a required arg after optional ones. Thus, this PR also makes dim and keepdim required args. Once we can remove the old signatures, the dim, keepdim, and interpolation parameters in the new signature will get their default values back.
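
A sketch of the resulting call patterns:

```
import torch

x = torch.arange(5.0)
torch.quantile(x, 0.4)  # original signature still works
# new signature: interpolation is keyword-only, and dim/keepdim must be
# given for now to avoid ambiguity with the deprecated signature
torch.quantile(x, 0.4, dim=0, keepdim=False, interpolation="lower")
```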

__TODO__
 ---
- [ ] Run backward compat tests

This reverts commit 2f1d1eb7df5e8032392b73751c84025a2aa3d1ee.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D27337117

Pulled By: heitorschueroff

fbshipit-source-id: 7fe31f22027645e0d6cb3cab0392d532a4b362c9
2021-04-02 12:11:36 -07:00
e593044748 [Gradient Compression] Update a warning in ddp_comm_hooks.rst (#55031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55031

It turns out that PowerSGD hooks work with PyTorch's native AMP package, but not with the Apex AMP package, which can somehow mutate gradients during the execution of communication hooks.

ghstack-source-id: 125268206

Test Plan:
Used native amp backend for the same pytext model and worked:
f261564342
f261561664

Reviewed By: rohan-varma

Differential Revision: D27436484

fbshipit-source-id: 2b63eb683ce373f9da06d4d224ccc5f0a3016c88
2021-04-02 12:07:50 -07:00
7ab53eb960 [StaticRuntime] Unbreak benchmarks. (#55199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55199

Test Plan: Imported from OSS

Reviewed By: walterddr, hlu1

Differential Revision: D27526600

Pulled By: ZolotukhinM

fbshipit-source-id: 9318cb5d6adca3e8073f8ec4219afc3cc1c75f7c
2021-04-02 12:03:56 -07:00
a0bb0968d5 [PyTorch] Don't bother with SmallVector in TensorMaker (#55125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55125

We can provide an ArrayRef to 1-5 zeros much more efficiently, like this.
ghstack-source-id: 125471024

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D27494800

fbshipit-source-id: 5e2addfabae70960475a4b322925cd0eae71b4c6
2021-04-02 11:56:23 -07:00
02af4b511d Enhance Pipe docs to explicitly mention RPC initialization. (#55187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55187

As described in https://github.com/pytorch/pytorch/issues/54927, Pipe
docs didn't explicitly mention initializing RPC. This PR improves the docs and
also ensures Pipe throws a more useful error message, rather than an internal
assertion error, when RPC is not initialized.
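
A minimal sketch of the initialization order the docs now call out (single-process setup; the worker name is illustrative, and the usual MASTER_ADDR/MASTER_PORT env vars are assumed to be set):

```
import torch
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe

# RPC must be initialized before constructing Pipe, even on a single process
rpc.init_rpc("worker", rank=0, world_size=1)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
pipe = Pipe(model, chunks=2)
```
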
ghstack-source-id: 125563552

Test Plan:
1) unit test added.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27521783

fbshipit-source-id: d1a5c6ca789b9a66c07a794468178c25cfd4b743
2021-04-02 11:51:22 -07:00
24c904951c Replace AutoNonVariableTypeMode with InferenceMode in fbcode. (#55114)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55114

Test Plan: CI

Reviewed By: ezyang, bhosmer

Differential Revision: D27472768

fbshipit-source-id: 76f17ef7de40f6e04e2968f8958027b5f93e1c0c
2021-04-02 11:45:53 -07:00
181de40688 Split copy_ kernel to InplaceOrView. (#55133)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55133

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27527939

Pulled By: ailzhang

fbshipit-source-id: 5ddaac563b5bab38b7091b5b88e00502cb390f1a
2021-04-02 10:48:47 -07:00
09670c7d43 Don't globally disable any ShellCheck warnings (#55165)
Summary:
https://github.com/pytorch/pytorch/issues/47786 updated ShellCheck and fixed the warnings that it was already giving in CI (since it previously didn't cause the job to fail). https://github.com/pytorch/pytorch/issues/54069 enabled two ShellCheck warnings that previously were globally disabled. This PR continues the trend by reenabling the remaining four ShellCheck warnings that previously were globally disabled.

Also, this PR puts as many remaining ShellCheck arguments as possible into `.shellcheckrc` to make it easier to integrate with editors. For instance, in VS Code, this is now all that is needed (due to https://github.com/koalaman/shellcheck/issues/1818 and the fact that VS Code only runs ShellCheck on one file at a time):

```json
{
  "shellcheck.customArgs": [
    "--external-sources"
  ]
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55165

Test Plan:
[The "Lint / quick-checks" job in GitHub Actions](https://github.com/pytorch/pytorch/pull/55165/checks?check_run_id=2250098330), or this command if you want to check locally:
```
.jenkins/run-shellcheck.sh
```

Reviewed By: walterddr

Differential Revision: D27514119

Pulled By: samestep

fbshipit-source-id: f00744b2cb90a2ab9aa05957bff32852485a351f
2021-04-02 10:41:37 -07:00
978fca64a6 Revert D25399470: add channels last for MaxPool2d
Test Plan: revert-hammer

Differential Revision:
D25399470 (f43eb59a68)

Original commit changeset: b49b9581f132

fbshipit-source-id: ab8c053964aeecf196f6d932c63ada51a3b7ced8
2021-04-02 10:15:11 -07:00
e406d4e6cb Modified lstsq_helper to accept lapack error codes tensor (#54720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54720

lstsq_helper now takes an infos tensor, which is modified in-place.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27439273

Pulled By: mruberry

fbshipit-source-id: b964003982b88be85bf305059a15fb92207e2b6f
2021-04-02 09:38:49 -07:00
8062545c63 ns for fx: weight extraction for conv1d and conv3d (#55079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55079

Extends weight extraction to conv1d and conv3d.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27474696

fbshipit-source-id: 9d5f892160b1b003aa557cfd099c6834e3f70ded
2021-04-02 09:35:34 -07:00
80b1b7e4b1 ns for fx: ensure kwargs are handled when graph matching (#55078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55078

Fixes a TODO, make sure we iterate through kwargs as well as args
when navigating graphs.  We can use `node.all_input_nodes` convenience
property to accomplish this.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27474699

fbshipit-source-id: 8a6e3db5a73328c4f296ac5fce951e81213b6f58
2021-04-02 09:35:32 -07:00
a590fa7af4 ns for fx: clean up debug print statements (#55077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55077

Deletes debugging prints from the code, no logic change.

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27474700

fbshipit-source-id: 3d9d73da6615ddffdfdb0df270bcdfd2c4b50be3
2021-04-02 09:35:30 -07:00
f6b25e758d ns for fx: move it to top level file (#55060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55060

Removes the previous iteration of Numeric Suite for FX graph mode
quantization, and moves the current iteration into the top level
file.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXGraphMatcher
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27467725

fbshipit-source-id: 4c22b5a3221857231f9f59cf6d2908820e6a7f12
2021-04-02 09:35:27 -07:00
c6cb99a6c7 ns for fx: weight extraction for nni.ConvReLU2d (#54335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54335

Simple fix to enable weight extraction for nni.ConvReLU2d.

Note: this module only appears if the internal GraphModule APIs are
called, so we add testing for this path.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_extract_weights_mod
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27192844

fbshipit-source-id: 923cf63e29e4638fd77ca42e69aedb15fb20a330
2021-04-02 09:35:25 -07:00
5319d17be4 ns for fx: make input logging work for multi node subgraphs (#54327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54327

Makes input logging work properly for multi-node subgraphs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16_shadow_activations
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27190137

fbshipit-source-id: 3f39bfd5112d5ee92c1e66c133e970c28db40d46
2021-04-02 09:35:22 -07:00
b8019cee0e ns for fx: make input logging work for multi-node subgraphs (#54326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54326

Fixes unshadowed activation input logging for subgraphs where start_node does
not equal end_node. In detail:
* instead of passing around a single list of nodes, pass around a list
of nodes to instrument inputs, and a list of nodes to instrument
outputs. This way we can handle multi-node subgraphs properly, and we
also keep the subgraph instance definition out of the public APIs.
* add a test case

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16_activations
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27190138

fbshipit-source-id: 58e2377c1c128baaf3b760c1ad29098fb21f53d3
2021-04-02 09:35:20 -07:00
cbcde79023 ns for fx: refactor test cases (#54280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54280

Some easy refactors to reduce duplicate logic in test cases
for NS for FX. In particular, we start reusing a common model
within this file, and we split the fp16 test cases to be more
modular.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D27173373

fbshipit-source-id: cf3f21ee8b9b12dff89f1cd2d3ac1749f3f63fe6
2021-04-02 09:35:18 -07:00
757e3cbf82 ns for fx: add support for shadowing linear fp16 patterns (#54275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54275

Adds support for NS shadow activations path for the fp16 emulation
pattern such as

```
... -> dequantize -> linear -> relu -> to(torch.float16) -> ...
```

There are a couple of changes necessary here:

1. removing the restriction on the shadowing graph pass that the B
subgraph is a single node (since this subgraph is four nodes), and
modifying the code to correctly add the relevant inputs versus output
loggers (input loggers and subgraph copy if we are at start_node,
and output logger if we are at end_node)

2. modifying the logic for calculating node input and output types
to work correctly for the `to` and `dequantize` nodes:
2a. make the function return the first input and output, instead of just
the first input
2b. make the function handle `dequantize` correctly by recursively
using the output of its input
2c. make the function handle `to` correctly by recursively using the
output of its input and the target dtype

3. a bug fix to handle observers in kwargs, while copying subgraphs

Note: input logging for these patterns is not tested yet,
this will be in the next PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27172655

fbshipit-source-id: 3bdc86618b2a5782627fcf303d58af7f47fbc30d
2021-04-02 09:33:36 -07:00
f43eb59a68 add channels last for MaxPool2d (#48917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48917

Squashed sub-commits:
- max_pool2d channels last support, forward path
- max_pool2d channels last support, backward path
- vectorize channels last forward path
- rename the header file
- fix windows build
- combine PoolingKernel.h into Pool.h
- add data type check
- loosen test_max_pool2d_nhwc to cover device CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399470

Pulled By: VitalyFedyunin

fbshipit-source-id: b49b9581f1329a8c2b9c75bb10f12e2650e4c65a
2021-04-02 09:13:06 -07:00
bdb225e9f0 Revert D27478436: Use tensorpipe::Buffer::device() instead of tensorpipe::Buffer::deviceType().
Test Plan: revert-hammer

Differential Revision:
D27478436 (3e185253b6)

Original commit changeset: 3962257bc623

fbshipit-source-id: 6619617af2b32445473f21e73bf3841dd7a491b2
2021-04-02 09:08:50 -07:00
3e185253b6 Use tensorpipe::Buffer::device() instead of tensorpipe::Buffer::deviceType().
Summary: The `tensorpipe::Buffer::deviceType()` method is going away.

Test Plan: CI

Reviewed By: lw

Differential Revision: D27478436

fbshipit-source-id: 3962257bc6237d1dde7e5f4fddae38abe8384c68
2021-04-02 08:39:39 -07:00
61914cb2fa [ATen][qembeddingbag] Avoid tensor refcount bumps (#55023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55023

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D27453856

fbshipit-source-id: f2b5ed97d3cc179baba4c158871a0225e3ba9030
2021-04-02 06:43:13 -07:00
93d0f636bb [c10] Add default constructor to Maybeowned (#55128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55128

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D27495079

fbshipit-source-id: 3bd01956a8b65170d6b38096dbd15c4809904f88
2021-04-02 06:42:04 -07:00
ec609e7420 Adds torch.* API section for TorchScript Lang Ref (#53236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53236

Reviewed By: SplitInfinity

Differential Revision: D27526584

Pulled By: gmagogsfm

fbshipit-source-id: ea931ea63aa4b37a7782935a1760bebffedc5b67
2021-04-02 03:01:08 -07:00
271879fe67 [PyTorch Edge] Provide a method ObservedOperators::getUnobservedOperatorList() so that model tracer can empty it out during tracing (#55017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55017

In D26678637, JacobSzwejbka found a mismatch between the operator list claimed by the YAML file and the one claimed by the dispatcher. After some digging and thorough investigation, we concluded that the non-traced operators are more trouble than they are worth, since they result in phantom operators that every user of the capabilities API (or every language implementation) needs to be aware of. Instead, with this change, we can **reliably** trace all operators called via the dispatcher by clearing the list of un-observed operators during model tracing.

Also note that the ignore-list in the observer is a list of base operator names, not the full operator names (with overload) that tracing-based selective build needs. If we used the ignore-list, we would need to include every overload of the un-traced operators.

Latency isn't an issue during model tracing, so this should be generally okay.

Ran the following command to re-generate all the YAML files: `buck run caffe2/torch/fb/mobile/cli:cli -- --gen_all_model_configs`
ghstack-source-id: 125337353

(Note: this ignores all push blocking failures!)

Test Plan: Sandcastle and wait for unit tests. Also see BSB results in the diff comments.

Reviewed By: JacobSzwejbka

Differential Revision: D27452855

fbshipit-source-id: 410bafec7ac67503f68623a5e3d4ab258f434cbf
2021-04-02 02:31:25 -07:00
09f1f14569 Transition to new tensorpipe::Pipe API. (#55193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55193

Test Plan: CI

Reviewed By: lw

Differential Revision: D27466387

fbshipit-source-id: 07b831d699f56874dd45f37e448b8c4244ead5e3
2021-04-02 02:28:07 -07:00
b074a24394 Port torch.copysign method_tests() to OpInfo (#54945)
Summary:
Related https://github.com/pytorch/pytorch/issues/54261

This PR ports the method_tests() entries of `torch.copysign` to OpInfo.

While porting the tests, the `test_out` cases from `test_ops.py` would fail as the out variant of `torch.copysign` does not support scalar inputs.
```python
>>> x = torch.randn(2)
>>> y = torch.empty_like(x)
>>> torch.copysign(x, 1.)
tensor([1.4836, 1.2156])
>>> torch.copysign(x, 1., out=y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: copysign(): argument 'other' (position 2) must be Tensor, not float
```
This PR fixes the tests by adding an overload entry in `native_functions.yaml` and re-dispatching scalar inputs to the existing `copysign_out` function.
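
A sketch of the now-working call:

```
import torch

x = torch.randn(2)
y = torch.empty_like(x)
torch.copysign(x, 1.0, out=y)  # scalar `other` now re-dispatches to copysign_out
```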

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54945

Reviewed By: gchanan

Differential Revision: D27505300

Pulled By: mruberry

fbshipit-source-id: f68250fa52f8dcfd45426039ec178ca5e883e206
2021-04-01 20:30:25 -07:00
ed4a1d54a7 [OpInfo] Enable jit tests for multi_dot (#55147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55147

Enabling this test now that jit supports TensorList inputs

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D27505270

Pulled By: heitorschueroff

fbshipit-source-id: 05b0d47cb71740309ec5130bf520c576fb90a4d1
2021-04-01 20:11:43 -07:00
ff6b3c76ab [TensorExpr] Add TORCH_APIs to all expr classes. (#55002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55002

Test Plan: Imported from OSS

Reviewed By: navahgar, walterddr

Differential Revision: D27446409

Pulled By: ZolotukhinM

fbshipit-source-id: 3442d5876bc68974fb3d44878f89c1a7895668d2
2021-04-01 19:48:10 -07:00
1ccaec0238 [TensorExpr] Cleanup IRNodeType enum. (#55001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55001

The enum is only used for precedence computation, so we only need to
enumerate node types whose precedence priority we know.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446410

Pulled By: ZolotukhinM

fbshipit-source-id: 217dd63c4fd086155030ebf0c3e1772605109f7b
2021-04-01 19:48:07 -07:00
f8f30a5e27 [TensorExpr] Remove stale docs from DesignOverview.md. (#55000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55000

Test Plan: Imported from OSS

Reviewed By: bertmaher, pbelevich

Differential Revision: D27446413

Pulled By: ZolotukhinM

fbshipit-source-id: 4874dcd992fd4bc60ade008c59822194d39792d7
2021-04-01 19:48:05 -07:00
bdbfb2a035 [TensorExpr] Nuke BaseCallNode. (#54999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54999

BaseCallNode was used as a base class for Intrinsics and FunctionCall.
Now FunctionCall is gone, so BaseCallNode could be removed as well.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446411

Pulled By: ZolotukhinM

fbshipit-source-id: be8ce06fbac72bfe355e5e3e1d2aa2267fae79fd
2021-04-01 19:48:02 -07:00
0b75f862c7 [TensorExpr] Nuke FunctionCall. (#54998)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54998

The only reason why we couldn't use Load instead of FunctionCall was
DepTracker. Now that it is gone, we can finally replace FunctionCall
with Load.

Test Plan: Imported from OSS

Reviewed By: bertmaher, pbelevich

Differential Revision: D27446412

Pulled By: ZolotukhinM

fbshipit-source-id: 9183ae5541c2618abc9026b1dc4c4c9fab085d47
2021-04-01 19:47:59 -07:00
688e350725 [TensorExpr] Nuke DepTracker and findAllNeededTensors. (#54997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997

DepTracker was used to automatically pull in dependent computations from
output ones. While it seems quite convenient, it's led to several
architectural issues, which are fixed in this stack.

DepTracker worked on Tensors, each of which is a pair of a Buf and a Stmt. However, a
Stmt could become stale and there was no way to reliably update the
corresponding tensor. We're now using Bufs and Stmts directly and moving
away from using Tensors to avoid these problems.

Removing DepTracker allowed us to unify Loads and FunctionCalls, which
were essentially duplicates of each other.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D27446414

Pulled By: ZolotukhinM

fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
2021-04-01 19:46:26 -07:00
0d47374c54 construct only necessary elements in OffsetCalculator (#55107)
Summary:
Per title. Elements beyond `dim` are never accessed; see 646510f702/aten/src/ATen/cuda/detail/OffsetCalculator.cuh (L49-L51).
Instruction counts per 30 repetitions:
`addmm`: 1467813 -> 1452261
`add`: 651522 -> 633462
`add_`: 529331 -> 511271

add benchmarking snippet:
```
 timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({2,2},device(at::kCUDA) ); at::Tensor b = torch::empty({2}, device(at::kCUDA));", language="c++", timer=timeit.default_timer)
 stats=timer.collect_callgrind(number=30)
 print(stats.as_standardized().stats(inclusive=False))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55107

Reviewed By: swolchok

Differential Revision: D27494492

Pulled By: ngimel

fbshipit-source-id: 23389a6bc9c9c0096751b95e7f9bf1c9f7bc594f
2021-04-01 19:04:30 -07:00
5610e8271b Fix skip_if_not_multigpu decorator (#54916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54916

Fixes https://github.com/pytorch/pytorch/issues/54887

`skip_if_not_multigpu` was skipping all the tests that use it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27412193

Pulled By: H-Huang

fbshipit-source-id: 28d6697bd8cc6b6784cdb038ccb3ff138d0610eb
2021-04-01 18:01:33 -07:00
8822c7e052 Update TensorPipe submodule. (#55164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55164

Reviewed By: mrshenli

Differential Revision: D27522063

Pulled By: beauby

fbshipit-source-id: 5473ab7a51f5da365bd5931254bc4d9f47b46201
2021-04-01 16:30:58 -07:00
047a487b07 Fix accidental Flake8 excludes (#55178)
Summary:
[Currently](faa4da49ff/.flake8 (L22)), our `.flake8` config file has the `exclude` pattern `scripts`. I'm guessing that this is just meant to exclude the top-level `scripts` dir from Flake8, but it also applies to the following (apparently erroneously):

- `.circleci/scripts`
- `.github/scripts`
- `test/scripts`

This PR corrects the problem by making all the `exclude` patterns (except for the wildcard `*.pyi` pattern) relative to the repository root. Also, since this PR already touches all the `exclude` lines, it also sorts them to help reduce merge conflicts when `.flake8` is edited in the future. This sorting happened to reveal that the `build` pattern was previously present twice, so now it has been deduplicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55178

Test Plan:
Locally:
```
flake8
```
And also [in CI](https://github.com/pytorch/pytorch/pull/55178/checks?check_run_id=2249949511).

Reviewed By: janeyx99

Differential Revision: D27520412

Pulled By: samestep

fbshipit-source-id: 359275c10ca600ee4ce7906e3a7587ffaa4ae1ed
2021-04-01 16:26:46 -07:00
3575e71be8 [DDP Logging] Log use of uneven inputs API (#54919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919

Log the use of the uneven inputs API for better tracking and use-case
detection.
ghstack-source-id: 125446499

Test Plan: CI, added ut

Reviewed By: zhaojuanmao, SciPioneer

Differential Revision: D27410764

fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
2021-04-01 16:22:32 -07:00
057ec81b17 [PyTorch] OperandInfo ctor should take rvalue reference (#54972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54972

No reason to create a temporary.
ghstack-source-id: 125338543

Test Plan: CI

Reviewed By: bdhirsh

Differential Revision: D27437190

fbshipit-source-id: 05eeb3ccd33700d8776b6ce58a120c7697acf49e
2021-04-01 14:55:31 -07:00
dded5d72a4 [PyTorch] Move Tensor::has_names inline (#54965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54965

Yet another small getter that should be inlineable.
ghstack-source-id: 125338544

Test Plan:
Framework overhead benchmark w/o arguments

Before:
```
I0329 13:56:41.268244 447880 bench.cpp:186] Mean 0.562635
I0329 13:56:41.268270 447880 bench.cpp:187] Median 0.562465
I0329 13:56:41.268276 447880 bench.cpp:188] Min 0.561757
I0329 13:56:41.268285 447880 bench.cpp:189] stddev 0.000707741
I0329 13:56:41.268292 447880 bench.cpp:190] stddev / mean 0.0012579
```

After:
```
I0329 14:32:34.116181 607857 bench.cpp:186] Mean 0.557326
I0329 14:32:34.116206 607857 bench.cpp:187] Median 0.557194
I0329 14:32:34.116212 607857 bench.cpp:188] Min 0.556323
I0329 14:32:34.116219 607857 bench.cpp:189] stddev 0.000700897
I0329 14:32:34.116226 607857 bench.cpp:190] stddev / mean 0.00125761
```

So roughly 1% faster overall if I've done the mental arithmetic right?

Reviewed By: ezyang

Differential Revision: D27410928

fbshipit-source-id: 4e66d40c71f534f66deb9c64502fb35d0a5997bf
2021-04-01 14:55:28 -07:00
22f3b4eaa8 Tensor::register_hook: Avoid wrapping hook in two levels of std::function (#53917)
Summary:
The void overload of `register_hook` puts the user's callable into a `std::function` which is used in a lambda, then `_register_hook` wraps that lambda in another `std::function`. This is bad because each call goes through two levels of indirection, and it also requires more heap allocations.

Instead, the lambda can capture the original callable without wrapping it in an `std::function` first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53917

Reviewed By: gchanan

Differential Revision: D27513822

Pulled By: swolchok

fbshipit-source-id: 026d40d7e9fb718757b7203737b0662ba36bc021
2021-04-01 14:53:55 -07:00
8dc29e8a4a [PyTorch] Allow IValue to construct from Tuple with fewer copies (#54534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54534

The moving overload of the tuple -> IValue constructor was missing.
ghstack-source-id: 124671165

Test Plan:
Compare assembly for ivalue_test.cpp before/after this
change. Newly added snippet stops calling `std::__invoke_impl` with a
real function pointer to a by-value variant of
`c10::ivalue::Tuple::create` and starts directly calling
by-const-reference variant of `c10::ivalue::Tuple::create` instead.

Reviewed By: smessmer

Differential Revision: D27271895

fbshipit-source-id: 8b0e146a15d66883146b89b93da5e95f903484e6
2021-04-01 14:49:21 -07:00
faa4da49ff Add code to ensure workflow consistency for autocanceling (#55171)
Summary:
Currently, we only have three GHA workflows that need to be canceled on reruns. To anticipate future workflows, this PR enables a check that makes sure any new workflow that should be autocanceled on reruns is included in the cancel_redundant_workflows.yml GHA workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55171

Test Plan: Succeeded quick-checks https://github.com/pytorch/pytorch/runs/2249162035?check_suite_focus=true

Reviewed By: samestep

Differential Revision: D27514294

Pulled By: janeyx99

fbshipit-source-id: 27da321f648b97a090052823ec955caffeb6ae97
2021-04-01 14:11:32 -07:00
2962fee99a Fix/suppress a type warning in PyTorch (#55142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55142

Declare some functions C10_HOST_DEVICE to fix the NVCC warning.

During PyTorch compilation, the NVCC compiler was emitting several warnings like this one:

```
caffe2/c10/util/TypeCast.h(39): warning: calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
          detected during:
            instantiation of "dest_t c10::static_cast_with_inter_type<dest_t, src_t>::apply(src_t) [with dest_t=c10::complex<double>, src_t=__nv_bool]"
(158): here
            instantiation of "To c10::convert<To,From>(From) [with To=c10::complex<double>, From=__nv_bool]"
(170): here
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=__nv_bool]"
caffe2/c10/core/Scalar.h(63): here
```

How to reproduce.
- Make sure you are on remote/master
- run:
  `buck build mode/dev-nosan caffe2/torch/fb/sparsenn:sparsenn_operators_gpu`

Test Plan: - compilation completes without warnings.

Reviewed By: r-barnes

Differential Revision: D27469757

fbshipit-source-id: f8c4eedb637c6d487ac49bb310e48be11db204e2
2021-04-01 13:59:56 -07:00
787854ce41 [ZeroRedundancyOptimizer] bounding the multiple gpus unit test to 4 gpus, hardcoded values (#54788)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53322. The test has some hardcoded values to check that the sharding works as expected, and it was not designed to run with more than 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54788

Reviewed By: mrshenli

Differential Revision: D27483078

Pulled By: blefaudeux

fbshipit-source-id: 63fe072c41e1601925af23d8fb1ea3f4729b2044
2021-04-01 13:50:29 -07:00
0a329c66bf [PyTorch] Remove stray comments in TensorBody (#54985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54985

I forgot to fix these before landing D27375016 (e829754992).
ghstack-source-id: 125291731

Test Plan: Review

Reviewed By: bhosmer

Differential Revision: D27442002

fbshipit-source-id: 0bff8396e90f4e6889bf3320c2e316760491ce2f
2021-04-01 13:00:53 -07:00
84ad5df8e3 Correct the name of the label in auto-label-rocm (#55170)
Summary:
The label name was meant to be "module: rocm".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55170

Test Plan: None.

Reviewed By: malfet

Differential Revision: D27513290

Pulled By: samestep

fbshipit-source-id: ef86fcd5f94a76c9e04653995c2ba9369c5ecb34
2021-04-01 12:54:56 -07:00
dfa2daac1d [PyTorch] Remove outdated C++11 note on C10_DEPRECATED (#55061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55061

We're C++14.
ghstack-source-id: 125377571

Test Plan: Review

Reviewed By: bhosmer

Differential Revision: D27467852

fbshipit-source-id: 720cdd02813e84a43357ab5e35dfebe3d773bb0f
2021-04-01 12:06:32 -07:00
070169e4d0 [ATen] tensor.contiguous() -> tensor.expect_contiguous (#55022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55022

Replace tensor.contiguous() with tensor.expect_contiguous in aten::narrow_copy

Test Plan: CI

Reviewed By: edvgha

Differential Revision: D27453866

fbshipit-source-id: c5a6e64ccca4cf52cb879dfb02fd4c451fb397cb
2021-04-01 11:24:17 -07:00
b74795c460 [Pyper] resize_as_ -> resize_ (#55098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55098

resize_as_ still goes through the dispatcher because it calls tensor.resize_. We can easily call resize_ directly while bypassing the dispatcher.

Reviewed By: swolchok

Differential Revision: D27457894

fbshipit-source-id: 8a5da185d1a6addafbf4915e29613013451b5e43
2021-04-01 11:17:40 -07:00
f34de6a9f4 Modified lstsq_helper to accept rank and singular_values (#54719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54719

lstsq_helper now takes rank and singular_values tensors that are modified in-place.
This is required for adding the out= variant.

TODO:

- [ ] Fix CI failures

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27439197

Pulled By: mruberry

fbshipit-source-id: f2fe421aa393c2d58f5c50f33e21a9eae57e4f01
2021-04-01 11:14:04 -07:00
cdd9911a12 Revert D27470071: [pytorch][PR] Trigger azure pipeline for multi gpu tests
Test Plan: revert-hammer

Differential Revision:
D27470071 (f0dafeb0cb)

Original commit changeset: 9b7615799da5

fbshipit-source-id: 60a7d9ba5eda31d7381d15920f9fc9ec15df1a6c
2021-04-01 11:02:28 -07:00
0eba63ec93 Improve testing documentation in CONTRIBUTING.md (#54904)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54904

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D27407009

Pulled By: ansley

fbshipit-source-id: ae69d8387b55f714fd105efe7c4ecbdd69674f65
2021-04-01 10:18:34 -07:00
1b2b3ca86d Language Ref Python Builtin Functions and Values (#52830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52830

Reviewed By: SplitInfinity, nikithamalgifb

Differential Revision: D27407474

Pulled By: gmagogsfm

fbshipit-source-id: 06fcafbcc66376c5f1818cb12fca2f2a57843c9d
2021-04-01 10:14:03 -07:00
c64e006fc3 Fix security of ROCm labeling workflow (#55157)
Summary:
The current workflow fails when there are backticks in the PR title, because bash tries to evaluate them right away (example: https://github.com/pytorch/pytorch/runs/2242913870). This PR moves the variables to the env section, out of bash's reach, which removes the security risk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55157

Test Plan: my repo: https://github.com/janeyx99/gha-experiments/actions/runs/709088679

Reviewed By: gchanan

Differential Revision: D27505033

Pulled By: janeyx99

fbshipit-source-id: 1cc7545c18400d63a4490d9b019afe383b272229
2021-04-01 10:07:32 -07:00
53609b4cac fix typo in ReduceMinMaxKernel (#54984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54984

Reviewed By: zou3519

Differential Revision: D27494418

Pulled By: heitorschueroff

fbshipit-source-id: 5066df75ba82c15787edbcb0208594aac2bbaf01
2021-04-01 09:43:17 -07:00
a4125876c9 Move BackendSelect to default_included_set. (#55117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55117

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27490571

Pulled By: ailzhang

fbshipit-source-id: a0d8a25a8217a754061fbf3b8e31cc1cf2d3bdea
2021-04-01 09:38:07 -07:00
2798f38c86 Added checks for dtype and device of OpInfo's sample_inputs (#54949)
Summary:
Currently, it's not tested whether `op.sample_inputs` actually uses the provided dtype and device arguments. This PR fixes that by introducing asserts in `test_supported_dtypes`.
This will help detect incorrectly generated inputs in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54949

Reviewed By: H-Huang

Differential Revision: D27435952

Pulled By: mruberry

fbshipit-source-id: 8465c459b9b0c007411a9a74340bc2755519624a
2021-04-01 09:34:51 -07:00
36c27fd0ac SVD docs improved (#54002)
Summary:
- Corrected a few errata in the SVD docs
- Made the notation more uniform (refer to `Vh` in `linalg.svd`, always use double tilts...)
- Wrote a better explanation about why the gradients of `U` and `V` are not well-defined when the input is complex or real but has repeated singular values. The previous one pointed to a somewhat obscure post on gauge theory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54002

Reviewed By: malfet

Differential Revision: D27459502

Pulled By: mruberry

fbshipit-source-id: f5c35eca02d35dadd2fc0eeadfacc8824f409400
2021-04-01 09:31:40 -07:00
6b20046491 Pin ShellCheck version to v0.7.1 (#55109)
Summary:
Not sure why I didn't do this in https://github.com/pytorch/pytorch/issues/47786. Version 0.7.1 (the latest `"stable"` version of ShellCheck) was released [almost a year ago](https://github.com/koalaman/shellcheck/releases/tag/v0.7.1), but even if releases are infrequent, it's better to just get rid of the nondeterminism that caused https://github.com/pytorch/pytorch/issues/47786 in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55109

Test Plan: The "Lint / quick-checks" job in GitHub Actions.

Reviewed By: janeyx99

Differential Revision: D27483473

Pulled By: samestep

fbshipit-source-id: e09f52844db440f2b6ea3cd54340c9e62dea09f4
2021-04-01 09:09:49 -07:00
8d5df95551 Make TensorIterator, SparseTensorMath and UnaryOps clang-tidy clean (#55087)
Summary:
Disable `cppcoreguidelines-macro-usage` as the PyTorch codebase uses a lot
of macros that violate this rule.

Disable `bugprone-reserved-identifier` and
`performance-unnecessary-value-param` as those checks are very slow

Add `NOLINT` to DEFINE_DISPATCH as it introduces non-const global variables
Replace `for(auto i = 0; i < lim; ++i)` with `for(auto i: c10::irange(lim))` throughout the modified files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55087

Reviewed By: samestep

Differential Revision: D27475822

Pulled By: malfet

fbshipit-source-id: 2651a4b3dc062066a15e69380354414a198fb279
2021-04-01 09:04:35 -07:00
f0dafeb0cb Trigger azure pipeline for multi gpu tests (#52490)
Summary:
The run on CircleCI:  https://app.circleci.com/pipelines/github/pytorch/pytorch/283891/workflows/e049872e-5327-4f8c-abc3-a72ca6a1a548/jobs/11462671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52490

Reviewed By: glaringlee

Differential Revision: D27470071

Pulled By: malfet

fbshipit-source-id: 9b7615799da5fc8381fef226da003449b2698d35
2021-04-01 08:29:56 -07:00
69c5fd1e00 SyncBatchNorm.forward() to handle optional weight (#54568)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54568

Reviewed By: ezyang

Differential Revision: D27285822

Pulled By: malfet

fbshipit-source-id: 4f7b489d80294cb2509eec4f6c4aa22d5c47b35d
2021-04-01 08:02:21 -07:00
f83668b4e5 Update release notes scripts following runbook update (#54594)
Summary:
This adds:
- new categories
- global commit counter
- support for new "Reverted" label on PRs
- new export system to multiple files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54594

Reviewed By: H-Huang

Differential Revision: D27396011

Pulled By: albanD

fbshipit-source-id: ca1ec3a1b90221ba26fd8b053dfb10f614f05909
2021-04-01 07:55:16 -07:00
967e59e557 [tensorexpr] Add sliceHead/sliceTail APIs with short parameter list (#55115)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55115

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27488754

Pulled By: huiguoo

fbshipit-source-id: d8a1b39ec891c80f6a9078768d692ac4ebeb5f79
2021-04-01 07:34:33 -07:00
1324b0dd44 [FX] Adds C-level monkeypatching of torch.randn so that we can capture it during tracing. (#54060)
Summary:
```
def foo(x):
    return x + torch.randn(3, 3)

fx.enable_ctracing(True)
print(fx.symbolic_trace(foo).code)
```
results in
```
def forward(self, x):
    randn = torch.randn(3, 3)
    add = x + randn;  x = randn = None
    return add
```

Seems to slow down tracing by 1.5-3x.

DenseNet121: 0.05 -> 0.12 seconds
ResNet18: 0.10 -> 0.15

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54060

Reviewed By: jamesr66a

Differential Revision: D27208978

Pulled By: Chillee

fbshipit-source-id: b9e19a9b1084dadfc0dfaee41a03bc25a45910b1
2021-04-01 07:34:31 -07:00
0cfd9e881f [static runtime] fix out variant for 4bit embedding bag (#55096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55096

There were issues with D26138322 (5b0a6482c1) that we didn't catch the first time around.
This (rebased on top of the to_copy fixes) fixes the converted remote_ro c2/pt output comparison.

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=0 --benchmark_c2_predictor=1 --do_benchmark=1
```

Reviewed By: hlu1

Differential Revision: D27477104

fbshipit-source-id: 5a95dfa7eae23566fadc3fec323ad03a34e6734d
2021-04-01 07:33:02 -07:00
b880854f31 port copysign to structured kernel (#55040)
Summary:
Related https://github.com/pytorch/pytorch/issues/54945

This PR ports `copysign` to structured, and the `copysign.Scalar` overloads are re-dispatched to the structured kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55040

Reviewed By: glaringlee

Differential Revision: D27465501

Pulled By: ezyang

fbshipit-source-id: 5cbabfeaaaa7ca184ae0b701b9692a918a90b117
2021-04-01 07:29:11 -07:00
8b02d1207b Fixed OpInfo jit tests failing for TensorList inputs (#54954)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54954

Reviewed By: glaringlee

Differential Revision: D27474863

Pulled By: heitorschueroff

fbshipit-source-id: cf8c1cac6fd1cceacd6be73a2eb49d28a5cfc20a
2021-04-01 07:09:07 -07:00
9d6a81d1a6 Avoid aggregate initialization for tensorpipe::{Cpu,Cuda}Buffer and tensorpipe::Message::Tensor. (#55136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55136

This will ease the transition to the new API where `Buffer` does not
store a length anymore.

Test Plan: CI

Reviewed By: lw

Differential Revision: D27466385

fbshipit-source-id: 9a167f8c501455a3ab49ce75257c69d8b4869925
2021-04-01 06:55:02 -07:00
204ac21bf1 Revert D27449031: [pytorch][PR] [ROCm] use hiprtc precompiled header
Test Plan: revert-hammer

Differential Revision:
D27449031 (2a7df657fe)

Original commit changeset: 81a8d7847a47

fbshipit-source-id: b7b970c8ea4110357fba3ad4d52a86fa5641d90c
2021-04-01 06:42:04 -07:00
3036777305 Replace torch.chain_matmul calls to torch.linalg.multi_dot (#55064)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55064

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D27469261

Pulled By: heitorschueroff

fbshipit-source-id: 4a53cb058babc81f93f159747b4ed2b6c985a0bc
2021-04-01 04:50:53 -07:00
d98072b027 Deprecate torch.chain_matmul in favor of torch.linalg.multi_dot (#53453)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53453
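For reference, a sketch of the replacement pattern (the key API difference is that `torch.linalg.multi_dot` takes a sequence of tensors, while the deprecated `torch.chain_matmul` takes varargs):
```
import torch

a, b, c = torch.randn(3, 4), torch.randn(4, 5), torch.randn(5, 2)

out_old = torch.chain_matmul(a, b, c)        # deprecated varargs API
out_new = torch.linalg.multi_dot([a, b, c])  # replacement takes a sequence
print(torch.allclose(out_old, out_new))      # True
```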

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27406282

Pulled By: heitorschueroff

fbshipit-source-id: b6e715d1b88e0613ee6b6208cb28ba4757e31717
2021-04-01 04:50:51 -07:00
5d68b3695c [Relanding] Implemented torch.linalg.multi_dot (#52859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859

This reverts commit 92a4ee1cf6092dd941591f80885eb7fef5b2c0d8.

Added support for bfloat16 for CUDA 11 and removed fast-path for empty input tensors that was affecting autograd graph.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27402390

Pulled By: heitorschueroff

fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034
2021-04-01 04:49:05 -07:00
5a1191d050 Check exception messages in embedding_bag_proxy unit test
Summary:
This replaces the use of `assertRaises` with `assertRaisesRegex` to make sure that we catch the expected exceptions.
It also corrects a few unit tests:
* Test for the case where `input is 2D and offsets is not None` was wrong.
* Check for `empty offsets` was missing.
* Check for `offsets length when include_last_offset=True` was wrong.
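For illustration, a minimal sketch of the pattern this change applies (a hypothetical test using the public `nn.EmbeddingBag`, with the error type and message wording assumed rather than copied from the internal test file):
```
import unittest
import torch

class EmbeddingBagInputChecks(unittest.TestCase):
    def test_2d_input_with_offsets(self):
        bag = torch.nn.EmbeddingBag(10, 3)
        inp = torch.randint(0, 10, (2, 4))  # 2D input
        offsets = torch.tensor([0, 4])
        # Plain assertRaises would accept *any* error of this type; the
        # regex pins down which exception actually fired.
        with self.assertRaisesRegex(ValueError, "offsets has to be None"):
            bag(inp, offsets)

if __name__ == "__main__":
    unittest.main()
```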

Test Plan:
```
buck test mode/opt caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/tests:test_embedding_bag_proxy
    ✓ ListingSuccess: caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/tests:test_embedding_bag_proxy - main (3.049)
    ✓ Pass: caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/tests:test_embedding_bag_proxy - test_module_swapping_py (caffe2.torch.fb.training_toolkit.common.proxy_module_thrift.tests.test_embedding_bag_proxy.EmbeddingBagProxyTest) (1.084)
    ✓ Pass: caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/tests:test_embedding_bag_proxy - test_bad_inputs (caffe2.torch.fb.training_toolkit.common.proxy_module_thrift.tests.test_embedding_bag_proxy.EmbeddingBagProxyTest) (1.164)
    ✓ Pass: caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/tests:test_embedding_bag_proxy - test_module_swapping_jit (caffe2.torch.fb.training_toolkit.common.proxy_module_thrift.tests.test_embedding_bag_proxy.EmbeddingBagProxyTest) (1.388)
Summary
  Pass: 3
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124700133860

buck test caffe2/test:nn
  Pass: 1086
  Skip: 1099
  Timeout: 3
  Omit: 1
    {emoji:2702} caffe2/test:nn - test_conv_double_backward (test_nn.TestNN)
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/6755399476551597

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
    ✓ ListingSuccess: caffe2/benchmarks/static_runtime:static_runtime_cpptest - main (7.985)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.TrivialModel (12.349)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.LongModel (12.805)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.IndividualOps_to (12.890)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.IndividualOps_pow (13.329)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.EmbeddingBag (13.703)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.IndividualOps_Reshape (13.886)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.LeakyReLU (13.964)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.IndividualOps_Binary (13.967)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.DeepWide (14.095)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.KWargsAPI_1 (14.461)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.UnaryOps (14.527)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.CleanUpMemory (14.624)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.FusionPass (14.635)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.KWargsAPI_2 (15.027)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.IndividualOps_flatten (15.299)
Summary
  Pass: 15
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/5348024606957775
```

Reviewed By: qizzzh

Differential Revision: D27415247

fbshipit-source-id: c4915170e89359ea961c1a6df513b29790f147fa
2021-04-01 03:58:11 -07:00
50cb75edce [tensorexpr] Add python bindings for TensorExprKernel (#54450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54450

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27243175

Pulled By: huiguoo

fbshipit-source-id: 820cf0d6cd1dd984d4153628e0f419d234668c82
2021-04-01 02:11:32 -07:00
ba95e08a95 [PyTorch] Use DimVector for inputs to as_strided that don't grow dim (#55016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55016

When we call as_strided() and don't add an extra dimension, we should continue to expect that the number of dimensions will fit in a DimVector and thus that using it will save heap allocations.
ghstack-source-id: 125337281

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D27452838

fbshipit-source-id: 8b3d118de322638c0c0e3a4bfcfb3c820c64e6cc
2021-04-01 01:22:33 -07:00
6145ac07b5 [caffe2] Reintroduce Log1p operator (#55073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55073

Original diff D27422219 (d92e2520de) was reverted, reintroducing this op again.

Reviewed By: ChunliF

Differential Revision: D27473735

fbshipit-source-id: 1af0281724e9ada699ebf2045d51f65083daf5b4
2021-03-31 22:29:23 -07:00
547346d663 [caffe2] Fix -Wundef
Summary:
* `#if` with an undefined name is a warning when `-Wundef` is specified (as it is in ovrsource, for example)
* identifiers starting with two underscores are [reserved for compiler internals](https://en.cppreference.com/w/cpp/language/identifiers)

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D27318070

fbshipit-source-id: 4989fc6a3bf3c176eddd7c25aca47414e4973edd
2021-03-31 22:24:30 -07:00
058357a439 [Gradient Compression] Report compression rate for batched PowerSGD hook (#55103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55103

Previously the compression rate was only reported in the PowerSGD hook. Also report this metric for the batched variant for comprehensive experimentation.

It is very easy to compute the sizes before and after compression, because there is only one matrix factorization per bucket, and no accumulation within the bucket is needed.
1) The size before compression is the input tensor size.
2) The size after compression is the size of P + Q, where each has a size of `square_side_length * state.matrix_approximation_rank`.
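A small sketch of that arithmetic (illustrative names, not the hook's actual code):
```
def batched_powersgd_compression_rate(input_numel, square_side_length,
                                      matrix_approximation_rank):
    size_before = input_numel  # 1) the flattened bucket tensor
    # 2) P and Q each have square_side_length * rank elements
    size_after = 2 * square_side_length * matrix_approximation_rank
    return size_before / size_after

# e.g. a 1M-element bucket padded to a 1000 x 1000 matrix, rank 1:
print(batched_powersgd_compression_rate(1_000_000, 1000, 1))  # 500.0
```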
ghstack-source-id: 125399028

Test Plan: Tested by running scripts/wayi/torch/power_sgd.py locally.

Reviewed By: deadlybulb

Differential Revision: D27474295

fbshipit-source-id: a2225e85be03ab20238f01014d5ec9ae1787c4fb
2021-03-31 22:17:05 -07:00
d2aab53dc2 [PyTorch] Check is_same instead of data_ptr in addmm_out_cuda_impl (#55111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55111

I don't see how we could have ended up with !is_same but also identical data_ptr, and is_same is cheaper.
ghstack-source-id: 125438822

Test Plan: Existing CI?

Reviewed By: ngimel

Differential Revision: D27484914

fbshipit-source-id: 22125b29e6e09d312a2b92e893d08c69059e4435
2021-03-31 21:35:42 -07:00
7824d8277a [ONNX] Fix export of copy_ operator (#51938) (#54870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54870

Before going into the ONNX exporter, the copy_ operator is decomposed into aten::expand_as and aten::index_put.
There is a scenario where the inputs to copy_ are not of the same type; the copy_ op in torch does implicit casting, which is not currently reflected inside the ONNX exporter. This PR adds casting inside the index_put symbolic for the case when the self tensor is not of the same type as the values.
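A small eager-mode illustration of the implicit cast the symbolic must now mirror (a sketch, not exporter code):
```
import torch

self_t = torch.zeros(4, dtype=torch.float32)
src = torch.tensor([1, 2], dtype=torch.int64)

# In the exporter, this slice assignment decomposes into
# aten::expand_as + aten::index_put; eager mode casts the int64 values to
# float32 implicitly, so the index_put symbolic now emits a matching Cast.
self_t[:2] = src
print(self_t.dtype)  # torch.float32
```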

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408975

Pulled By: SplitInfinity

fbshipit-source-id: 15022703e76b9c98b02285c06b13d44f3c4a3f00
2021-03-31 21:14:32 -07:00
40869884cd Add outer export to onnx (#53603) (#54869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54869

Add a symbolic function to support torch.outer export to ONNX.
Adds support for the transfo-xl-wt103 model.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408978

Pulled By: SplitInfinity

fbshipit-source-id: 70c89a9fc1a5e4a4ddcf674afb1e82e492a7d3b9
2021-03-31 21:14:29 -07:00
c5f3d92816 [ONNX] Update scripting docs (#54634) (#54868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54868

* Updating docs for scripting

* Rebase

* Fix formatting

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408980

Pulled By: SplitInfinity

fbshipit-source-id: 2b176a5a746c1a2369be1940d84e6491a1ecd015
2021-03-31 21:14:27 -07:00
cb0cee4a3d [ONNX] Replace decomposeLinear pre process pass with a symbolic (#53077) (#54866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54866

Replace decomposeLinear pre process pass with a symbolic

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408981

Pulled By: SplitInfinity

fbshipit-source-id: d2d76cab3383122a60df1f356742a33db56adc71
2021-03-31 21:14:25 -07:00
849dcb8b69 [ONNX] Fix if output shape mismatch error & Fix graph input directly used as output (#53219) (#54865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54865

Fix if output shape mismatch error & Fix graph input directly used as output

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408979

Pulled By: SplitInfinity

fbshipit-source-id: 4cfc7b8110b6cb73e000c9cf754190044fb5e1c0
2021-03-31 21:14:22 -07:00
cd9dd653e9 [ONNX] Support primitive type input/outputs and attributes (#53550) (#54864)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54864

Support primitive type attributes. Needed for Silero model.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408982

Pulled By: SplitInfinity

fbshipit-source-id: 16b291eedbe9f9bb31d7664a29a484555df53755
2021-03-31 21:14:20 -07:00
ce48b14060 [ONNX] Improve index_put symbolic to handle singular Bool updates (#53690) (#54863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54863

Adds support for cases where the update to the index_put node is a single Bool value, as in the case shown below

```
mask[indices] = True
```

Fixes #53507

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27408977

Pulled By: SplitInfinity

fbshipit-source-id: bcfb55b50ce76b3d4913ffbc16cdef1f98cb7a84
2021-03-31 21:12:53 -07:00
6c235ef267 Allow std=0 in torch.normal, and error if std<0 (#51317)
Summary:
Part of https://github.com/pytorch/pytorch/issues/49998
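A quick illustration of the new behavior (a sketch; the exact error type and message are assumed):
```
import torch

# std == 0 is now allowed: the result is deterministic (just the mean).
print(torch.normal(2.0, 0.0, size=(3,)))  # tensor([2., 2., 2.])

# std < 0 now errors out instead of sampling a nonsensical distribution.
try:
    torch.normal(0.0, -1.0, size=(3,))
except RuntimeError as e:
    print("rejected:", e)
```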

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317

Reviewed By: bdhirsh

Differential Revision: D27253939

Pulled By: mruberry

fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b
2021-03-31 21:06:07 -07:00
15f04e3466 Revert D27408378: [quant][graphmode][fx][refactor] Factor out insert_observers_for_model to a separate function
Test Plan: revert-hammer

Differential Revision:
D27408378 (c445f4ee93)

Original commit changeset: 9143f0a6f939

fbshipit-source-id: ae65ea798a6d72f2ec724c4c1b492937edddf721
2021-03-31 20:51:42 -07:00
8b8c4096ee Added OpInfo gradcheck_wrapper to replace output_func (#54914)
Summary:
Added a field to `OpInfo` to provide a wrapper function for gradcheck. This is useful for functions that need to perform some extra input/output processing to work with gradcheck.

fixes https://github.com/pytorch/pytorch/issues/50837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54914

Reviewed By: H-Huang

Differential Revision: D27435234

Pulled By: heitorschueroff

fbshipit-source-id: fa3e9b61f3d3df221243fd142ddb8b7861dbf669
2021-03-31 20:23:49 -07:00
790b69e096 Language Ref for Statements in Torchscript (#52847)
Summary:
Documents the statements supported in TorchScript for the language spec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52847

Reviewed By: gmagogsfm

Differential Revision: D27463142

Pulled By: nikithamalgifb

fbshipit-source-id: ff3def1b878092b0a2afc7c2f47b7857e6658ecf
2021-03-31 19:15:53 -07:00
c445f4ee93 [quant][graphmode][fx][refactor] Factor out insert_observers_for_model to a separate function (#54733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54733

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27408378

fbshipit-source-id: 9143f0a6f939fa80f1d1d6bae4b2d37aa21cb9b9
2021-03-31 18:50:47 -07:00
c57541ce06 [quant][graphmode][fx] Separate handling Copy operator to a helper function (#54644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54644

Previously we special-cased the copy operator in the normal insert-observer code; this PR splits the
special-case logic into a separate function and keeps the rest of the code clean.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27314678

fbshipit-source-id: d36870ceb3717bc01eaeaa6f3f1532ad562cbaf1
2021-03-31 17:50:32 -07:00
c0d6dbdce4 [quant][fx][graphmode][refactor] Change activation_post_process_map to track the observer name instead (#54643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54643

A refactor needed for future changes.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27314677

fbshipit-source-id: 972fbfb506f86da13f8817b3eaa5e6d0ad16ffe1
2021-03-31 17:50:30 -07:00
c2adedf6fe [quant][graphmode][refactor] Remove redundant code (#54073)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54073

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27086067

fbshipit-source-id: b1995138de56f1352c5df03378ebc2832bf35ef7
2021-03-31 17:50:27 -07:00
55544cb13a [quant][graphmode][fx] Add support for one value being quantized with different qconfigs (#53586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53586

Previously one value could only be quantized to one dtype; this PR adds support for quantizing one value
in the FX graph with multiple dtypes, e.g. first to int8 and then to float16.

We might do some follow-up PRs to clean up the hacks and refactor the code.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_multiple_qconfigs_single_value

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26912676

fbshipit-source-id: ae3653fd67f05870a3a9e808f491871826c555d5
2021-03-31 17:48:50 -07:00
eb52e36460 Revert D27469727: [pytorch][PR] [android] fbjni from prefab dependency 0.2.2
Test Plan: revert-hammer

Differential Revision:
D27469727 (507b46f23e)

Original commit changeset: 2ab22879e81c

fbshipit-source-id: d656463b81a02fbf870dded5d3868bb33e016fe0
2021-03-31 17:21:30 -07:00
c85d3f501f Move shape prop inside acc_tracer (#55091)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55091

Test Plan: All tests are updated and passing.

Reviewed By: 842974287

Differential Revision: D27471887

fbshipit-source-id: 98969fb1bfc72f6c57835525d82d4a8b78bb19bb
2021-03-31 17:16:16 -07:00
0d80f378f6 fix boto3 resource not closed (#55082)
Summary:
Test Plan: GHA CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55082

Reviewed By: samestep

Differential Revision: D27475006

Pulled By: walterddr

fbshipit-source-id: ccf50ea0b15ea6840e593a2c056ed2b388a96c52
2021-03-31 16:49:15 -07:00
f29039677d Refactor get numerical jacobian (#54092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54092

This is the first of several refactors of get_numerical_jacobian.
This one just moves some logic around, splitting the get_numerical_jacobian function into smaller, more manageable functions:
- compute_gradient is no longer nested, so we have to pass in the parameters instead
- iter_tensor extracts the logic of iterating through different types of tensors (the code is almost exactly the same here, except that instead of calling into the update-jacobian function, we yield the arguments)

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27354268

Pulled By: soulitzer

fbshipit-source-id: 73288e3c889ae31bb8bf77a0e3acb3e9020e09a3
2021-03-31 16:28:16 -07:00
70af5db7ca Remove use_c10_dispatcher option (#54969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54969

With all uses of the hacky wrapper removed, all kernels will be
dispatched with the c10 full dispatcher.
ghstack-source-id: 125434790

Test Plan: buck build //caffe2/aten/...

Reviewed By: ezyang, walterddr

Differential Revision: D27436596

fbshipit-source-id: 7a146d1f4a983b4a81f8552be4eec6c482b6bea2
2021-03-31 16:24:24 -07:00
908a74e8c1 [Refactoring] make transformations return whether graph is modified (#54777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54777

Updated RemoveListMutation, PeepholeOptimizedListIdoms,
LoopUnrolling, PeepholeOptimization, and PeepholeAliasSensitivity to
return whether the graph was modified by the transformation.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27412105

fbshipit-source-id: 0c1520bc34f6bd59acd83d98bed58897376eac41
2021-03-31 16:20:12 -07:00
a37fbf9b45 [Futures] Bump log verbosity when ignoring cb errors in python future. (#54476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54476

Per title. For `add_done_callback`, we log but swallow exceptions in order to stay consistent with what the concurrent.futures Python library does; see the discussion in https://github.com/pytorch/pytorch/pull/45675.

However, it would be good to improve the verbosity here, as this can be a source of confusion if users are setting a different future via `add_done_callback` and an error is hit, resulting in an unexpected hang (see https://github.com/pytorch/pytorch/issues/52132 for more details on how this can happen).
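As a concrete illustration of the log-and-swallow behavior (a minimal sketch using the public Python API; the log text itself comes from the C++ side):
```
import torch

fut = torch.futures.Future()

def bad_callback(completed_fut):
    raise RuntimeError("oops")  # logged (now more verbosely), then swallowed

fut.add_done_callback(bad_callback)
fut.set_result(42)
print(fut.wait())  # still prints 42; the callback error did not propagate
```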
ghstack-source-id: 125300389

Test Plan: CI

Reviewed By: lw

Differential Revision: D27253004

fbshipit-source-id: 72ed21c8fb6d27de5797c17fc46b762f893e6fea
2021-03-31 15:17:06 -07:00
28daa6b7dd [Futures] enhance error handling in then() (#54475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54475

Implements the proposal in https://github.com/pytorch/pytorch/issues/53717#issuecomment-800545655. See that issue for more details, but at a high level:

1. markCompleted() immediately sets completed_ = true
2. Subclasses of future (such as cuda future) implement a nontrivial `postMarkCompletedHook` which may throw
3. If the above error is caught and we call `setError`, setError itself will error out because completed_ = true.

To fix this, only call setError if the user-defined callback resulted in an error; otherwise, call `markCompleted` and let postMarkCompletedHook() throw and crash the program (per lw's thoughts, this should be fatal).
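A runnable Python sketch of that control flow (the real logic is C++ in ivalue::Future; the class and method names here are illustrative stubs):
```
class SketchFuture:
    def __init__(self):
        self.completed = False
        self.error = None
        self.value = None

    def set_error(self, err):
        # setError after markCompleted would itself throw, hence the assert.
        assert not self.completed
        self.completed = True
        self.error = err

    def mark_completed(self, value):
        self.completed = True              # 1) completed_ is set first
        self.value = value
        self.post_mark_completed_hook()    # 2) subclass hook may throw

    def post_mark_completed_hook(self):
        pass  # e.g. a CUDA future records events here

def run_then_callback(child, parent_value, cb):
    try:
        result = cb(parent_value)
    except Exception as err:
        child.set_error(err)       # only user-callback errors use setError
        return
    child.mark_completed(result)   # 3) hook errors now crash (fatal) instead
                                   #    of being routed back through setError

f = SketchFuture()
run_then_callback(f, 21, lambda v: v * 2)
print(f.value)  # 42
```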
ghstack-source-id: 125300388

Test Plan: CI

Reviewed By: lw

Differential Revision: D27252965

fbshipit-source-id: fda41e8844104774aaf897286512d83ff06632b1
2021-03-31 15:15:34 -07:00
63c70ae032 various overhead improvements to cuda addmm (#55026)
Summary:
Add a fast common case to `prepare_matrix_for_cublas`, use the index size instead of size(), and move some checks to where they belong so they are not run where they are guaranteed to pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55026

Reviewed By: gchanan

Differential Revision: D27468945

Pulled By: ngimel

fbshipit-source-id: 79c9f7b3d61595536f603d6fb0316e6f21630f38
2021-03-31 14:58:31 -07:00
8eb9a934bc Clarify tools/test_history.py output for re-runs (#55106)
Summary:
This PR clarifies the output of `tools/test_history.py` in the presence of re-runs for a single commit/job pair. Specifically:

- in `multiline` mode, the results from all re-runs are now shown
- in `columns` mode, the wording is now changed from "S3 reports omitted" to "job re-runs omitted"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55106

Test Plan:
```
python tools/test/test_test_history.py
```

Reviewed By: walterddr

Differential Revision: D27480590

Pulled By: samestep

fbshipit-source-id: 5b4ccae7586ef1df744663cba1c16bb5bfa75bb7
2021-03-31 14:54:38 -07:00
2726de0119 [Pytorch] Expose ops present in dispatcher (#54791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54791

Several use cases need to see which ops are present in a specific PyTorch runtime. This diff exposes that information in the dispatcher.
ghstack-source-id: 125314247

Test Plan: D26678637 uses this api.

Reviewed By: swolchok

Differential Revision: D27271371

fbshipit-source-id: e572f0c85dcd75d75356e2cd4cfdd77efee17f94
2021-03-31 14:48:29 -07:00
5c3963373a Handle 1D input for xnnpack::linear (#54986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54986

If the input is 1D, xnnpack::linear fails, while aten::linear reshapes it to (1, D) and continues.
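A sketch of the equivalent handling at the Python level (illustrative only; the actual fix lives in the C++ XNNPACK integration):
```
import torch

def linear_any_dim(input, weight, bias=None):
    if input.dim() == 1:
        # treat the vector as a batch of one: (D,) -> (1, D)
        out = torch.nn.functional.linear(input.unsqueeze(0), weight, bias)
        return out.squeeze(0)  # back to (out_features,)
    return torch.nn.functional.linear(input, weight, bias)

x = torch.randn(8)
w = torch.randn(4, 8)
print(linear_any_dim(x, w).shape)  # torch.Size([4])
```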

Test Plan: buck test //caffe2/test:xnnpack_integration  -- TestXNNPACKOps

Reviewed By: kimishpatel

Differential Revision: D27441966

fbshipit-source-id: dfb2c23b91247632e0e3fd2482056a503c246c39
2021-03-31 14:45:43 -07:00
fb1c193eed Simplify convolution double backward gradInput formulas (#54840)
Summary:
Currently in convolution double backward, the grad of input is computed as `convT(ggW, gO.T)`. Notice how the first argument is, in fact, of the size that the convolution weight has, and the second is of the size of gradOutput, which is the inverse of the order in which convolutions are regularly called, and the sizes are far from what cudnn's heuristics are trained for and what cudnn is guaranteed to have efficient kernels for. This takes cudnn 8 to some dark places, calling kernels that take 20-100 s. But, luckily for us, convT is a commutative operation (unlike conv), so `convT(ggW, gO)` is actually the same as `convT(gO, ggW)`, modulo some transposes due to conventions around the weight size, so we can use `convT(gO, ggW)`. As an added bonus, we don't need a special branch for groups with this formulation.
For the following pretty standard convolution,
 - cudnn 7.6+old formulation takes 7.5 ms for double backward,
 - cudnn 8 + old formulation takes ~40 s,
 - cudnn 8 + new formulation is 1.8 ms with benchmark enabled,
 - cudnn 8 + new formulation is 4 ms with benchmark disabled,
 benchmarking script is below:
```
import torch
import time

#torch.backends.cudnn.benchmark=True

def ggI(conv, inp):
    out = conv(inp)
    grads = torch.autograd.grad(out, conv.weight, torch.rand_like(out), create_graph=True, retain_graph=True)
    torch.cuda.synchronize()
    start = time.time()
    grads[0].backward(torch.rand_like(grads[0]))
    torch.cuda.synchronize()
    print("db time: ", time.time()-start)
    return inp.grad

conv = torch.nn.Conv2d(512,256,kernel_size=3, padding=1, groups=2).cuda()
inp = torch.randn(1,512,128,128, device="cuda", requires_grad=True)
for _ in range(20):
    ggI(conv, inp)
torch.cuda.synchronize()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54840

Reviewed By: mruberry

Differential Revision: D27384866

Pulled By: ngimel

fbshipit-source-id: c6c875776a9801a0a2cd2f34f8ec39d0fcd59df8
2021-03-31 14:44:09 -07:00
26c1e2ee83 [ROCM] enable miopen for rnn f16 (#52475)
Summary:
This PR enables using MIOpen for RNN FP16 on ROCm.

It does this by altering use_miopen to allow fp16. In the special case where LSTMs use projections, we fall back to the default implementation, as projections are not implemented in MIOpen at this time. We emit a warning once to let the user know.

We then remove the various asserts that are no longer necessary since we handle the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52475

Reviewed By: H-Huang

Differential Revision: D27449150

Pulled By: malfet

fbshipit-source-id: 06499adb94f28d4aad73fa52890d6ba361937ea6
2021-03-31 14:39:54 -07:00
7f87358840 [android] bump nightlies version to 1.9.0 (#55076)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55076

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D27474604

Pulled By: IvanKobzarev

fbshipit-source-id: b0b694333464485fd20036e9d5ef982877b1aa19
2021-03-31 14:37:59 -07:00
bcb4583170 [FX] Add a metadata dict to Node and switch shapeprop to use that (#54926)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54926

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27417801

Pulled By: jamesr66a

fbshipit-source-id: 68a5155120a235065f58aa64ba1a6a97818dd0c1
2021-03-31 14:36:54 -07:00
b64d775636 Adding workflow to auto-label PRs with ROCm (#54989)
Summary:
This PR adds a workflow that automatically adds the ROCm label to PRs and issues that have ROCm (case-insensitive) in their titles.
Note that this does not remove the label even if the title is changed to no longer contain ROCm, but I can easily add removal functionality if that is desired.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54989

Test Plan: much test in my own repo: https://github.com/janeyx99/gha-experiments/actions (thanks samestep for your help!)

Reviewed By: walterddr

Differential Revision: D27448651

Pulled By: janeyx99

fbshipit-source-id: 103f39df0697eb6571c96e88c98d28c8b7adcfd7
2021-03-31 14:17:42 -07:00
507b46f23e [android] fbjni from prefab dependency 0.2.2 (#55066)
Summary:
Switching pytorch android to use fbjni from prefab dependencies.
Bumping the fbjni version to 0.2.2 and the soloader version to 0.10.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55066

Reviewed By: dreiss

Differential Revision: D27469727

Pulled By: IvanKobzarev

fbshipit-source-id: 2ab22879e81c9f2acf56807c6a133b0ca20bb40a
2021-03-31 14:12:18 -07:00
0bd96458ba Revert D26820202: Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants
Test Plan: revert-hammer

Differential Revision:
D26820202 (f9097c43b9)

Original commit changeset: 3e8f09523329

fbshipit-source-id: 5742b69a96ce1c848d75348d0f761cf66a69cbf3
2021-03-31 13:57:44 -07:00
8ad32dbbd7 update build tutorial - choose the correct VS version (#54933)
Summary:
There might be regressions in the newest VS.
Remind users to choose the same stable VC version as our CI uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54933

Reviewed By: walterddr

Differential Revision: D27466645

Pulled By: malfet

fbshipit-source-id: a6a1ebea4cc1b22e13c7342ee4c061afcef7e2b5
2021-03-31 13:45:48 -07:00
2a7df657fe [ROCm] use hiprtc precompiled header (#54350)
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release.  Conditionally add support for this feature.  Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.

The use of this feature is gated on a new ROCM_VERSION macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350

Reviewed By: H-Huang

Differential Revision: D27449031

Pulled By: malfet

fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
2021-03-31 13:36:50 -07:00
07602bf7e1 [caffe2] Use the CXX11 version of the USE_C99_COMPLEX macro
Summary: Because the bare CXX version forwards to this without checking whether it's defined, causing errors for builds with -Wundef enabled

Test Plan: contbuilds

Differential Revision: D27443462

fbshipit-source-id: 554a3c653aae14d19e35038ba000cf5330e6d679
2021-03-31 12:54:47 -07:00
f71a0daeb7 Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818

Several flaky tests fail due to some sort of timeout, and it isn't
clear from the error message in CI where exactly each process is stuck. In this
PR, I've added a mechanism to dump the entire Python traceback of all Python
threads when we encounter a timeout.

Example traceback:

```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```
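For reference, one simple way to get such dumps with faulthandler (a sketch, not necessarily this PR's exact wiring; the real helper lives in torch/testing/_internal/common_distributed.py):
```
import faulthandler
import signal
import sys
import threading
import time

# Child side: on SIGUSR1, dump a traceback of every Python thread to stderr.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

def worker():
    time.sleep(60)  # pretend to be stuck

threading.Thread(target=worker, daemon=True).start()
time.sleep(0.1)
# The parent would send the signal on timeout: os.kill(child_pid, signal.SIGUSR1)
signal.raise_signal(signal.SIGUSR1)
```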
ghstack-source-id: 125323810

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27378764

fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
2021-03-31 11:38:30 -07:00
bab8a1a81e [reland] Add annotations to PRs from forks (#54779)
Summary:
We've been using [pytorch/add-annotations-github-action](https://github.com/pytorch/add-annotations-github-action) to add annotations to PRs when they fail Flake8 or clang-tidy. Up until now, though, that functionality has only worked on PRs in pytorch/pytorch itself, not on PRs from forks. This PR fixes that using a technique from [this GitHub blog post](https://securitylab.github.com/research/github-actions-preventing-pwn-requests/) (also linked in a comment in this diff).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54779

Test Plan: janeyx99 and I tested this in the same GitHub repo used to test https://github.com/pytorch/pytorch/issues/54685 and https://github.com/pytorch/pytorch/issues/54693, including with PRs from forks.

Reviewed By: seemethere, xuzhao9

Differential Revision: D27470866

Pulled By: samestep

fbshipit-source-id: d165b8e875d412b910592aa897163fb938d23365
2021-03-31 11:05:27 -07:00
57519e705a Link onnx_library when BUILD_TEST=0 for Windows (#51937)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51877
cc antoniovs1029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51937

Reviewed By: malfet

Differential Revision: D27470392

Pulled By: ezyang

fbshipit-source-id: 5abe47b58df9ea3f0706fa4d5a7c8dd92e738f8b
2021-03-31 10:58:25 -07:00
cff266544a Fix minor style/typos problems in comment_device_type.py (#54768)
Summary:
One typo, one example correction and capitalization for a couple of comment lines.

ailzhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54768

Reviewed By: H-Huang

Differential Revision: D27362999

Pulled By: ezyang

fbshipit-source-id: 91404ac9e9747ef7d7882a5f50b81d7eb570448b
2021-03-31 10:53:17 -07:00
43d4f3b8d0 Implement public API InferenceMode and its error handling (#55008)
Summary:
https://www.internalfb.com/phabricator/paste/view/P360377337
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53343

For easier review, here's a diff against the version before the revert: https://www.internalfb.com/phabricator/paste/view/P360750919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55008

Test Plan: Imported from OSS

Pulled By: ailzhang

Reviewed By: bhosmer

Differential Revision: D27443229

fbshipit-source-id: 01b03446a1f6373f43dd5c7170d26226b50f363c
2021-03-31 10:48:00 -07:00
f1f3c8b0fa Adding PyTorch + DNNL + AMD BLIS path (#54953)
Summary:
These changes provide the user with an additional option to choose the DNNL+BLIS path for PyTorch.

This assumes BLIS is already downloaded or built from source, with the library file available at $BLIS_HOME/lib/libblis.so and the include files at $BLIS_HOME/include/blis/blis.h and $BLIS_HOME/include/blis/cblas.h.

Export the variables below to build PyTorch with MKLDNN+BLIS, then proceed with the regular installation procedure:
```
export BLIS_HOME=path-to-BLIS
export PATH=$BLIS_HOME/include/blis:$PATH LD_LIBRARY_PATH=$BLIS_HOME/lib:$LD_LIBRARY_PATH
export BLAS=BLIS USE_MKLDNN_CBLAS=ON WITH_BLAS=blis
python setup.py install
```

A CPU-only Dockerfile to build PyTorch with AMD BLIS is available at docker/cpu-blis/Dockerfile. Example command lines to build and run using the Dockerfile:
```
sudo DOCKER_BUILDKIT=1 docker build . -t docker-image-repo-name
sudo docker run --name container-name -it docker-image-repo-name
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54953

Reviewed By: glaringlee

Differential Revision: D27466799

Pulled By: malfet

fbshipit-source-id: e03bae9561be3a67429df3b1be95a79005c63050
2021-03-31 10:40:25 -07:00
a74b10def9 Keep Markdown ToCs up to date (#54974)
Summary:
This PR uses [markdown-toc](https://github.com/jonschlinkert/markdown-toc#cli) to [automatically update the table of contents for `README.md` and `CONTRIBUTING.md`](https://github.com/pytorch/pytorch/pull/54904#issuecomment-809682134) in CI.

This keeps the same format already used in `README.md`. While it does slightly change the format for the ToC in `CONTRIBUTING.md`, the new format is actually just the same as the old format that was already being used prior to https://github.com/pytorch/pytorch/issues/51458.

Race condition with https://github.com/pytorch/pytorch/issues/54904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54974

Test Plan: The new "Lint / toc" job in GitHub Actions [succeeds](https://github.com/pytorch/pytorch/pull/54974/checks?check_run_id=2238739005) on this PR, and [fails](https://github.com/pytorch/pytorch/pull/54976/checks?check_run_id=2238784022) on https://github.com/pytorch/pytorch/issues/54976 with an understandable error message.

Reviewed By: malfet

Differential Revision: D27468390

Pulled By: samestep

fbshipit-source-id: 14a73f42ed546d4310140b94ded14e099185d0e0
2021-03-31 10:36:09 -07:00
7fc03dd7c9 Back out "[pytorch][PR] Merge CUDA Streams and Events" (#54996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54996

Original commit changeset: 45d9fee9a582

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D27444718

fbshipit-source-id: deb627230817923eaf84ade50ecb14bfbce4e779
2021-03-31 10:21:35 -07:00
24bfcd537e [FX] Added FX prepare_for_inference for Intel CPUs (#53805)
Summary:
Part of https://github.com/pytorch/pytorch/issues/48209

Taken from the docstring:
 Performs a set of optimization passes to optimize a model for the purposes of inference. Specifically, the passes that are run are:
    1. Conv/BN fusion
    2. Dropout removal
    3. MKL layout optimizations

The third optimization takes a function `use_mkl_heuristic` that's used to determine whether a subgraph should be explicitly run in MKL layout.

I implemented 2 heuristics:
1. Uses the MKL layout if the subgraph is larger than 2 nodes.
2. Benchmarks each subgraph with and without the MKL layout, and keeps the MKL version if it's faster.
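For illustration, a sketch of what the first heuristic might look like (the callable's argument shape is assumed here, not taken from the PR):
```
def size_based_heuristic(subgraph_nodes):
    # Heuristic 1: only keep the MKL layout if the subgraph is big enough
    # to amortize the to/from-MKL layout conversions at its boundary.
    return len(subgraph_nodes) > 2

print(size_based_heuristic(["conv", "bn", "relu"]))  # True
```
The second heuristic instead times each subgraph both ways and keeps whichever layout is faster, which explains its stronger numbers in the tables below.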

### Batch size of 10 and multi-threaded.

Results with the second heuristic are generally as strong as the "jit.freeze" version, except in `densenet` and `vgg`, where it's faster, likely due to the heuristic being better. With the first heuristic, there are some notable gaps, particularly on `inception_v3` and `alexnet`.

```
model         Eager      FX         FX Auto    jit.mkldnn  threads
------------  ---------  ---------  ---------  ----------  -------
custom        0.195614   0.14686    0.15929    0.156442   6
resnet18      0.172012   0.114007   0.119678   0.12945    6
resnet50      0.486463   0.294308   0.299518   0.318121   6
densenet161   0.955309   0.893502   0.882798   1.29315    6
inception_v3  0.38454    0.307076   0.239513   0.233083   6
googlenet     0.229388   0.237486   0.170458   0.174106   6
shufflenet    0.0513613  0.0286739  0.0292908  0.0267209  6
alexnet       0.0709602  0.0768137  0.0660831  0.0650399  6
vgg16         1.053993   0.9013264  0.9360212  1.082820   6
mobilenet     0.12264    0.0970935  0.0936568  0.106314   6
mnasnet       0.0989875  0.0412083  0.0424499  0.0472336  6
resnext       0.476811   0.315428   0.314422   0.343156   6
```

For single-threaded (still running...)
```
model             eager         FX    FX auto        mkl    threads
------------  ---------  ---------  ---------  ---------  ---------
custom        0.0401415  0.259863   0.0263152  0.200667           1
resnet18      0.499931   0.382113   0.383711   0.396335           1
resnet50      1.10353    0.911865   0.923645   0.992125           1
densenet161   2.20158    2.39421    2.08204    2.30124            1
inception_v3  0.79161    0.849207   0.703546   0.724492           1
googlenet     0.66896    0.820965   0.515927   0.529414           1
shufflenet    0.0987308  0.0689343  0.0629298  0.0617193          1
alexnet       0.198795   0.19862    0.19325    0.211934           1
vgg16         3.744      3.2499     3.28503    3.31576            1
mobilenet     0.152725   0.14505    0.135555   0.159754           1
mnasnet       0.141983   0.089406   0.089599   0.0956167          1
resnext       1.13778    0.97016    0.955417   0.965376           1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53805

Reviewed By: gmagogsfm

Differential Revision: D27424611

Pulled By: Chillee

fbshipit-source-id: a39137159de962fba7ca15121dfa9e78c1e01223
2021-03-31 10:15:01 -07:00
a0ae3e520f [Pytorch Mobile] 'fix' filter of named parameters for FL (#54633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54633

There's currently no information that could be used to determine what is a parameter during the loading of a mobile module. This prevents named parameters from functioning correctly. This change is a temporary hack to help out federated learning, the sole user of this API currently.
ghstack-source-id: 124885201

Test Plan: todo

Reviewed By: dhruvbird

Differential Revision: D27308738

fbshipit-source-id: 0af5d1e8381ab7b7a43b20560941aa070a02e7b8
2021-03-31 09:21:35 -07:00
0dd873bdd5 [ROCm] add 4.1 docker image (#54628)
Summary:
Add a ROCm 4.1 docker image for CI. The plan is to keep two ROCm versions at a time; however, we still need the 3.9 image because some CI jobs depend on it. Keep the 4.0.1 and 3.10 images, in addition to the 3.9 image, until the 3.9 image is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54628

Reviewed By: H-Huang

Differential Revision: D27443378

Pulled By: malfet

fbshipit-source-id: 3f3417ec4822c6ef4c10ce2144a5b2957503dfbe
2021-03-31 09:13:56 -07:00
aeedd5c7df cmake: fix ONNX_NAMESPACE if USE_SYSTEM_ONNX (#54973)
Summary:
`ONNX_NAMESPACE` is empty by default if `USE_SYSTEM_ONNX` is `ON`, while it should be equal to `onnx`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54973

Reviewed By: glaringlee

Differential Revision: D27466020

Pulled By: walterddr

fbshipit-source-id: 47cde3604acbda3f45bec5893036b39fd1eb58c9
2021-03-31 08:29:00 -07:00
449a9514d1 Update Kineto submodule (#55011)
Summary:
Update Kineto submodule rev. Fixes
https://github.com/pytorch/pytorch/issues/54889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55011

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D27450222

Pulled By: ilia-cher

fbshipit-source-id: 0652a5d42182197acc4c9e6f07e71b5b55a557ee
2021-03-31 08:07:40 -07:00
99a423f8fc Automated submodule update: tensorpipe (#54970)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 5bc304d17e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54970

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27436760

fbshipit-source-id: 7325350c1798feacdc1faeea8c39ce8e4b91c73d
2021-03-31 07:53:35 -07:00
09756e7280 Revert D27370295: [android] fbjni android use prefab dependency, version 0.2.2
Test Plan: revert-hammer

Differential Revision:
D27370295 (2bee09a577)

Original commit changeset: bde881a8d4ed

fbshipit-source-id: 2fcc8f522fb08d4f8299f7e824341be32afb184a
2021-03-31 06:13:26 -07:00
25e07c6e91 Revert D27422219: [caffe2] Support Log1p operator
Test Plan: revert-hammer

Differential Revision:
D27422219 (d92e2520de)

Original commit changeset: f9eba82bf09c

fbshipit-source-id: 7cd5b778ae5f296187f57b6efaa782de97a6f013
2021-03-31 06:03:45 -07:00
6d87b3667f Added support for TensorList inputs in OpInfo (#54922)
Summary:
Stack:
* https://github.com/pytorch/pytorch/issues/54954 Fixed OpInfo jit tests failing for TensorList inputs
* __#54922 Added support for TensorList inputs in OpInfo__

Updated OpInfo to accept either a `Tensor` or `TensorList` as `sample.input` and added workarounds to make this work with gradcheck.

Note: JIT testing support for TensorList inputs will be added in a follow up PR.

Fixes https://github.com/pytorch/pytorch/issues/51996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54922

Reviewed By: H-Huang

Differential Revision: D27448952

Pulled By: heitorschueroff

fbshipit-source-id: 3f24a56f6180eb2d044dcfc89ba59fce8acfe278
2021-03-31 04:42:10 -07:00
8a170fbacd [package] fix mangling issues with TorchScript (#54915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54915

TorchScript and torch.package have different mangling schemes. To avoid
them interfering with each other, we should undo the torch.package
mangling before processing anything with TorchScript (since TS
independently makes sure that no names collide).

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27410472

Pulled By: suo

fbshipit-source-id: d1cc013c532d9abb7fb9615122bc465ded4785bb
2021-03-31 00:58:05 -07:00
444e5f0b60 Add Type System (I) (#53244)
Summary:
**Summary**
This commit adds a new .rst file that updates the language specification with new content for the Type System section.

**Test Plan**

![image](https://user-images.githubusercontent.com/70345919/109920057-9308b400-7c6e-11eb-8391-83635efbf036.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53244

Reviewed By: H-Huang

Differential Revision: D27445210

Pulled By: nikithamalgifb

fbshipit-source-id: 984c25b06686ba7a72cc03c5c069d819709eedb8
2021-03-30 23:10:27 -07:00
4865195cf4 [PyTorch] Add DimVector variant of infer_size (#54882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54882

Sometimes we have no reason to think that the output of `infer_size` won't be within the range of typical tensor sizes. In those cases, we can use a DimVector.
ghstack-source-id: 125137792

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D27400387

fbshipit-source-id: 9a11d0f93010540f3aa65c0e208fc8e03f0e8a7f
2021-03-30 20:40:50 -07:00
2bee09a577 [android] fbjni android use prefab dependency, version 0.2.2 (#54792)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54792

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D27370295

Pulled By: IvanKobzarev

fbshipit-source-id: bde881a8d4edd4636aa4ec7cecbe770b5b65bb1f
2021-03-30 20:26:36 -07:00
854c92078a Fixed the default size of the workspace array for MAGMA's SVD (#54875)
Summary:
The problem was that MAGMA might not set the value for the optimal size of the workspace array, leaving it uninitialized. This is fixed by setting a default value for the `wkopt` variable.
Fixes https://github.com/pytorch/pytorch/issues/54381 and https://github.com/pytorch/pytorch/issues/53976.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54875

Reviewed By: H-Huang

Differential Revision: D27437702

Pulled By: mruberry

fbshipit-source-id: bf61555abc4c50e8ef2dae933df24ce4d4fe4527
2021-03-30 19:28:06 -07:00
1dffbe759b [ROCm] utilize PUBLIC vs PRIVATE linking to avoid incorrect dependencies (#54727)
Summary:
Fixes the build of projects that depend on torch, such as torchaudio.  Otherwise torchaudio will complain that gloo_hip is missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54727

Reviewed By: H-Huang

Differential Revision: D27361513

Pulled By: ezyang

fbshipit-source-id: 714cc2db23e7adf3e89303e941b78c27625b9460
2021-03-30 19:22:56 -07:00
d4c37b314e Mark redispatch functions with TORCH_API (#54966)
Summary:
So they can be called from out-of-tree extensions

Otherwise I get linking errors like:
```
ImportError: /anaconda/envs/mytorch/lib/python3.8/site-packages/torchy-0.1-py3.8-linux-x86_64.egg/_TORCHY.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at10redispatch3addEN3c1014DispatchKeySetERKNS_6TensorES5_RKNS1_6ScalarE
```

cc ezyang bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54966

Reviewed By: H-Huang

Differential Revision: D27439712

Pulled By: ezyang

fbshipit-source-id: 4c0b45e87e708c57283758da49c54a767ab7ecbc
2021-03-30 18:30:42 -07:00
b907d6e3b6 [ROCm] skip some tests to enable 4.1 CI upgrade (#54536)
Summary:
Skips the tests indicated as failing in https://github.com/pytorch/pytorch/issues/54535.

During the ROCm CI upgrade from 4.0.1 to 4.1, some tests regressed. Specifically, FFT tests in test_spectral_ops.py and test_grid_sample in test_nn.py. In order to keep a passing CI signal, we need to disable these temporarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54536

Reviewed By: H-Huang

Differential Revision: D27442974

Pulled By: malfet

fbshipit-source-id: 07dffb957757a5fc7afaa5bf78b935a427251ef4
2021-03-30 17:49:45 -07:00
3baeeb3f57 Added Azure Pipelines build steps for PyTorch (#54039)
Summary:
This PR adds Azure Pipelines build steps for PyTorch. Three pipelines are added.

1) CI Build
    - Runs when a PR is opened or when new commits are added to an open PR. This build must succeed before the PR can be merged.
    - Currently only TestTorch unit tests are run.
    - Only the CI Build configurations are run.
2) Daily Build
    - Runs once a day during inactive hours to ensure the current PyTorch repo performs as expected.
    - Runs all unit tests.
        - Note: I do not have access to the current [determine-from](b9e900ee52/test/run_test.py (L737)) unit tests that are skipped on Windows builds. This `determine-from` filter can be added once there is a clear way to skip certain unit tests for a given build configuration.
    - Runs on All Build configurations.
3) Official Build
    - Runs once a day during inactive hours to publish official PyTorch artifacts to Azure DevOps Artifacts for consumption.
    - No unit tests are run.
    - Runs in three stages: Build, Verify, Publish, where PyTorch is built, then its wheel is installed in a clean Conda environment for verification, and then the wheel is published to Azure Artifacts as a Universal Package.
    - Runs on All Build configurations.

Ubuntu builds run on Docker with the specified Dockerfile configuration. Windows builds run directly on configured Windows VMs (CPU, CUDA/cuDNN)

CI Build configurations:
1. Ubuntu 18.04
    1. Python 3.9
        a. CUDA 11.2/cuDNN 8.1.0
    2. Python 3.8
        a. CPU
2. Windows 2019
    1. Python 3.8
        a. CUDA 10.2/cuDNN 7.6.5
    2. Python 3.7
        a. CPU

All Build configurations:
1. Ubuntu 18.04
    1. Python 3.9
        a. CUDA 11.2/cuDNN 8.1.0
    2. Python 3.8
        a. CPU
        b. CUDA 10.2/cuDNN 8.1.0
    3. Python 3.7
        a. CPU
        b. CUDA 10.1/cuDNN 7.6.5

2. Windows 2019
    1. Python 3.9
        a. CUDA 11.2/cuDNN 8.1.0
    2. Python 3.8
        a. CPU
        b. CUDA 10.2/cuDNN 7.6.5
    3. Python 3.7
        a. CPU
        b. CUDA 10.1/cuDNN 7.6.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54039

Reviewed By: ezyang

Differential Revision: D27373310

Pulled By: malfet

fbshipit-source-id: 06dcfe2d99da0e9876b6deb224272800dae46028
2021-03-30 17:32:43 -07:00
f956b7524e lazily init AliasDb and add changed status to CSE (#54776)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54776

Reviewed By: H-Huang

Differential Revision: D27422064

Pulled By: Krovatkin

fbshipit-source-id: dfeb61001f60a2080246e128d8b7f83bbc584801
2021-03-30 16:59:55 -07:00
2df4acd025 Remove hacky wrapper for about 70 kernels (#54898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54898

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 125278884

Test Plan:
buck build //caffe2/aten/...
BUILD_TENSOREXPR_BENCHMARK=ON BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install

Reviewed By: smessmer

Differential Revision: D27404868

fbshipit-source-id: cb6593c0d1a2dee4e65f0baa08f32a76cf7f5339
2021-03-30 16:47:33 -07:00
1bf57430f1 Remove hacky wrappers for 21 operators (#54819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54819

20 of them contain both an optional Tensor and an output position.

The hacky wrapper for `_convolution_mode` was added in
04e0cbf5a9f073a1b73195537c12fb332c2fddd9 after hacky wrappers
were removed for optional<Tensor>.

Codemod commands are generated by a hacked version of
https://github.com/pytorch/pytorch/pull/54223 and
https://github.com/pytorch/pytorch/pull/54098.
ghstack-source-id: 125278883

Test Plan:
buck build //caffe2/aten/...
BUILD_TENSOREXPR_BENCHMARK=ON BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install

Reviewed By: smessmer

Differential Revision: D27378819

fbshipit-source-id: b925ed0510a83e3976383aaeec8b7de438b23bf3
2021-03-30 16:46:22 -07:00
d92e2520de [caffe2] Support Log1p operator (#54968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54968

Support Log1p operator to add feature parity with PyTorch.

NumPy doc https://numpy.org/doc/stable/reference/generated/numpy.log1p.html
PyTorch doc https://pytorch.org/docs/stable/generated/torch.log1p.html
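
For intuition, a quick illustration of why `log1p` exists (standard `log1p` semantics, shown here with PyTorch rather than Caffe2):
```
import torch

x = torch.tensor([1e-8, 0.5], dtype=torch.float32)
print(torch.log1p(x))    # accurate even for tiny x: tensor([1.0000e-08, 4.0546e-01])
print(torch.log(1 + x))  # 1 + 1e-8 rounds to 1.0 in float32, so the first entry is 0.0
```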

Test Plan:
```
$ buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:log1p_op_test
```

Differential Revision: D27422219

fbshipit-source-id: f9eba82bf09c1c440f11a33f8ae2bf8084609457
2021-03-30 16:38:37 -07:00
d490e0120f [PyTorch] One less refcount bump in linear() (#54936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54936

Another case where we can use `MaybeOwned<Tensor>` to save a bump at a small cost.
ghstack-source-id: 125218488

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D27421117

fbshipit-source-id: 16bb31ec38817be1f889360e2abfd0d9596e2943
2021-03-30 15:54:57 -07:00
dde7fff0e9 [PyTorch] Avoid refcount bumps in addmm_out_cuda_impl (#54935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54935

A bunch of avoidable copies of Tensor objects, each of which results in a refcount bump.
ghstack-source-id: 125216023

Test Plan:
Compared percentage of self time spent in addmm_out_cuda_impl while running the following sample:

```
import torch
import torch.nn as nn

m = nn.Linear(1024, 1024).cuda().half()
x = torch.randn(16, 1024).cuda().half()
while True: y = m(x)
```

in perf record, decreased from 0.74% to 0.56%.

Reviewed By: ngimel

Differential Revision: D27420388

fbshipit-source-id: d2c5e4c4899cd02c60c45735b2d72c4ed913f6e8
2021-03-30 15:54:55 -07:00
ea37fe34ff [PyTorch] Avoid refcount bump in TensorArg (#54934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54934

It looks like the vast majority of usage is just borrowing a pre-existing Tensor.
ghstack-source-id: 125216052

Test Plan: Existing CI.

Reviewed By: hlu1

Differential Revision: D27415131

fbshipit-source-id: d5a8dc4ca5d48ca3eaa3664655b724094e61f371
2021-03-30 15:53:22 -07:00
5b448cf21a Revert D25966661: Support needsOutputs for RecordFunction and ObserverUtil improvements
Test Plan: revert-hammer

Differential Revision:
D25966661 (0e43a73f76)

Original commit changeset: 707886e1f212

fbshipit-source-id: a4e4af29abf622c1e0aaaf7dfb019c045988b4bc
2021-03-30 15:41:12 -07:00
23b15ef98a test_c10d: use with_nccl_blocking_wait decorator (#54742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54742

Uses with_nccl_blocking_wait decorator for test_c10d.
ghstack-source-id: 125233691

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D27277835

fbshipit-source-id: 063de32646b19d18969e9d60cb9a31a40d73d6a7
2021-03-30 15:33:17 -07:00
3f1cd2e3a0 test_c10d: Run tests with nccl_async_error_handling (#54741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54741

Similar to what we did for distributed_test.py, let MultiProcessTests that run collective comm. tests with NCCL blocking wait run under nccl_async_error_handling. This will better simulate real-world training scenarios.
ghstack-source-id: 125233692

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27277389

fbshipit-source-id: a6c6e9abcf3a53b03ea8b9f8fb63b78e0cb6e81e
2021-03-30 15:33:14 -07:00
0e543b2b00 Provide a decorator to set/unset nccl blocking wait for tests (#54740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54740

Adds a simple helper decorator to set/unset nccl blocking wait for
tests. This will make it easier than having to manually set/unset the
os.environ vars every time.
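
A minimal sketch of the idea (illustrative only; the real decorator lives in the distributed test utilities and may differ in details):
```
import functools
import os

def with_nccl_blocking_wait(func):
    """Run a test with NCCL_BLOCKING_WAIT=1, restoring the previous value after."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        prev = os.environ.get("NCCL_BLOCKING_WAIT")
        os.environ["NCCL_BLOCKING_WAIT"] = "1"
        try:
            return func(*args, **kwargs)
        finally:
            if prev is None:
                os.environ.pop("NCCL_BLOCKING_WAIT", None)
            else:
                os.environ["NCCL_BLOCKING_WAIT"] = prev
    return wrapper
```
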
ghstack-source-id: 125233693

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27277222

fbshipit-source-id: c289b9d05e2f6328d672810b07501979b6e177c6
2021-03-30 15:31:30 -07:00
920eb01e2e Add scatter_add to amp docs (#54908)
Summary:
Updates docs to reflect https://github.com/pytorch/pytorch/pull/52133.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54908

Reviewed By: agolynski

Differential Revision: D27431302

Pulled By: H-Huang

fbshipit-source-id: fa3dc6267bc73c81cdd96f986c971daee1922cb5
2021-03-30 15:26:41 -07:00
4694452d08 [complex] masked_fill: Complex Autograd support and update masked_scatter skips. (#54244)
Summary:
Reference Issue: https://github.com/pytorch/pytorch/issues/33152
Previous PR : https://github.com/pytorch/pytorch/pull/52035, https://github.com/pytorch/pytorch/pull/52483

Fixes : https://github.com/pytorch/pytorch/issues/53608
Fixes : https://github.com/pytorch/pytorch/issues/53523

**Note**: This PR is based on `ci-all/*` branch to ascertain that we don't break the master again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54244

Reviewed By: H-Huang

Differential Revision: D27429147

Pulled By: anjali411

fbshipit-source-id: 97945998b6911c2e7fd3f8db6cbd8963e5d6f21f
2021-03-30 14:58:40 -07:00
0e43a73f76 Support needsOutputs for RecordFunction and ObserverUtil improvements (#54442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54442

Added needsOutputs support to RecordFunction, improved the ObserverUtil functions to handle list data, and did some minor renaming for consistency.

To get output data from kernel calls, we need to capture the outputs temporarily before passing them to the record function; the results are then released to the function return. We handle two cases: unboxed and boxed kernels. The boxed version is fairly simple since all outputs are stored in the stack object. For unboxed kernel calls, we added a `ReturnValue` utility class to properly handle the different return values of unboxed kernels.

For optimization, this intermediate capture is only enabled for observers that request `needsOutputs(true)` and should not affect other observers or when the observer is not enabled.

Test Plan:
```
=> buck build //caffe2/test/cpp/jit: --show-output
=> buck-out/gen/caffe2/test/cpp/jit/jit --gtest_filter=RecordFunctionTest*
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = RecordFunctionTest*-*_CUDA:*_MultiCUDA
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from RecordFunctionTest
[ RUN      ] RecordFunctionTest.TracedTestInputsOutputs
[       OK ] RecordFunctionTest.TracedTestInputsOutputs (226 ms)
[ RUN      ] RecordFunctionTest.SampledCallbacks
[       OK ] RecordFunctionTest.SampledCallbacks (771 ms)
[ RUN      ] RecordFunctionTest.RecordFunctionGuard
[       OK ] RecordFunctionTest.RecordFunctionGuard (0 ms)
[ RUN      ] RecordFunctionTest.Callbacks
[       OK ] RecordFunctionTest.Callbacks (2 ms)
[ RUN      ] RecordFunctionTest.ShouldRun
[       OK ] RecordFunctionTest.ShouldRun (0 ms)
[ RUN      ] RecordFunctionTest.Basic
[       OK ] RecordFunctionTest.Basic (1 ms)
[ RUN      ] RecordFunctionTest.OperatorNameOverload
[       OK ] RecordFunctionTest.OperatorNameOverload (1 ms)
[----------] 7 tests from RecordFunctionTest (1001 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1002 ms total)
[  PASSED  ] 7 tests.

```

Reviewed By: ilia-cher

Differential Revision: D25966661

fbshipit-source-id: 707886e1f212f40ba16a1fe292ea7dd33f2646e3
2021-03-30 14:26:22 -07:00
85c056a508 [JIT] Add EliminateExceptions pass. (#54730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54730

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D27343165

Pulled By: ZolotukhinM

fbshipit-source-id: 1574e7aad4d527c4caf74383335265c9bffc7640
2021-03-30 13:56:54 -07:00
5bcbbf5373 Lint trailing newlines (#54737)
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.

The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:

- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`

I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):

- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)

To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737

Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:

- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true

In contrast, this run (after correcting the trailing newlines in this PR) succeeded:

- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241

To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```

Reviewed By: malfet

Differential Revision: D27409736

Pulled By: samestep

fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
2021-03-30 13:09:52 -07:00
eafa235582 Clarify and document commit choice for CI jobs (#54967)
Summary:
PRs https://github.com/pytorch/pytorch/issues/53652 and https://github.com/pytorch/pytorch/issues/54693 attempted to increase the consistency of our choice of commit (head vs merge) for CI on PRs, and have so far been unsuccessful. This PR takes a less ambitious approach to the problem by clarifying the choice in one specific way (see the following paragraph) and documenting it in `CONTRIBUTING.md`.

In addition to documentation, this PR also removes the current behavior of our GHA jobs that checkout the PR tip instead of the merge commit. At first glance, this behavior seems to increase consistency (by eliminating the special-case for `ghstack` PRs), but in reality, it actually just means that for non-`ghstack` PRs, the question "Which commit is used in CI?" has *two* answers instead of just one; see the description of https://github.com/pytorch/pytorch/issues/53652 for more details.

Once merged, this PR will unblock other PRs that modify our GHA workflows in breaking ways, such as https://github.com/pytorch/pytorch/issues/54737.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54967

Test Plan: None.

Reviewed By: walterddr, seemethere

Differential Revision: D27435913

Pulled By: samestep

fbshipit-source-id: 405fb419cf015cf88107d5eb2498cfb5bcb7ce33
2021-03-30 11:47:40 -07:00
18e61d1ce9 Improve placeholder matching in subgraph rewriter (#54958)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54958

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D27431889

Pulled By: ansley

fbshipit-source-id: 8b1b4f2f0202305530b9648b6b770f9e2ecacfe2
2021-03-30 11:40:33 -07:00
f5d6b90c35 Add a missing sys import in test/distributed/rpc/test_tensorpipe_agent.py (#54925)
Summary:
`sys` is used a couple of lines below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54925

Reviewed By: agolynski

Differential Revision: D27434941

Pulled By: H-Huang

fbshipit-source-id: b03c9373ee77e7a158964f619b29967fa55226d0
2021-03-30 11:24:06 -07:00
46c27ea84d Enabling OneDNN for group conv (#54890)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54890

Reviewed By: ejguan

Differential Revision: D27405252

Pulled By: VitalyFedyunin

fbshipit-source-id: 7f4880ff07a51b83f796e218eb0df048ad4725ce
2021-03-30 11:18:23 -07:00
d49beba071 [pyper] out variant of sigrid_transforms_torch_bind + ListUnpack (#54761)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54761

Test Plan:
Regen adindexer model that uses sigrid_transforms_torch_bind: /mnt/public/ansha/adindexer/merge20210323/adindexer_pt_traced_merge.pt

```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=adindexer_pt_traced_merge.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge2/container_precomputation_bs1.pt --iters=30000 --warmup_iters=300000 --num_threads=1 --pred_net=c2_net_merge.pb --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1
```

Before ms/iter: 0.0647056
After ms/iter: 0.0581197

Reviewed By: hlu1

Differential Revision: D27239617

fbshipit-source-id: dffe6cbaf3a783c41605c97c5947a36e3b1b1f3b
2021-03-30 10:54:44 -07:00
d60874354f [docs] Add updated TorchScript language reference section for types (#53673)
Summary:
**Summary**
This commit adds information about type annotation and inference to
the updated language specification. It will be rebased on top of https://github.com/pytorch/pytorch/issues/52494
after it lands.

**Test Plan**
Continuous integration.

Screen capture:
https://user-images.githubusercontent.com/4392003/110560184-66371f80-80fa-11eb-803a-923cf8de25ff.mov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53673

Reviewed By: gmagogsfm

Differential Revision: D27413001

Pulled By: SplitInfinity

fbshipit-source-id: b54b300b4b1f10537ec06e2ee9eeb6d2b1f1810b
2021-03-30 10:32:58 -07:00
9f93d82907 OpInfo: Add opinfo for cum{min,max} and minor fixes (#54762)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54762

Reviewed By: H-Huang

Differential Revision: D27390171

Pulled By: mruberry

fbshipit-source-id: 9376dafa3bd2228786756f62fed01565134228fa
2021-03-30 10:24:38 -07:00
4e110528bd Added cuSOLVER path for torch.linalg.eigh/eigvalsh (#53040)
Summary:
This PR adds the cuSOLVER based path for `torch.linalg.eigh/eigvalsh`.
The device dispatching helper function was removed from native_functions.yml, it is replaced with `DECLARE/DEFINE_DISPATCH`.

cuSOLVER is used if CUDA version >= 10.1.243. In addition if CUDA version >= 11.1 (cuSOLVER version >= 11.0) then the new 64-bit API is used.

I compared cuSOLVER's `syevd` vs MAGMA's `syevd`. cuSOLVER is faster than MAGMA for all matrix sizes.
I also compared cuSOLVER's `syevj` (Jacobi algorithm) vs `syevd` (QR-based divide-and-conquer algorithm). Although `syevj` is said to be better than `syevd` for smaller matrices, in my tests this holds only for the float32 dtype and matrix sizes 32x32 - 512x512.

For batched inputs, comparing a for loop of `syevd/syevj` calls to `syevjBatched` shows that for batches of matrices up to 32x32 the batched routine is much better. However, there are bugs in `syevjBatched`: sometimes it doesn't compute the result, leaving the eigenvectors as an identity matrix and the eigenvalues as the real diagonal of the input matrix. The output is the same with `cupy.cusolver.syevj`, so the problem is definitely on the cuSOLVER side. This bug is not present in the non-batched `syevj`.

The performance of the 64-bit `syevd` is the same as that of the 32-bit version.
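
A hedged usage sketch of the public API this dispatches through (requires a CUDA build; exact outputs vary):
```
import torch

a = torch.randn(4, 4, dtype=torch.float64, device="cuda")
a = a + a.T                        # symmetrize so the input is valid for eigh
w, v = torch.linalg.eigh(a)        # eigenvalues (ascending) and eigenvectors
w_only = torch.linalg.eigvalsh(a)  # eigenvalues only
print(torch.allclose(a @ v, v @ torch.diag(w)))  # True
```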

Ref. https://github.com/pytorch/pytorch/issues/47953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53040

Reviewed By: H-Huang

Differential Revision: D27401218

Pulled By: mruberry

fbshipit-source-id: aef91eefb57ed73fef87774ff9a36d50779903f7
2021-03-30 10:14:00 -07:00
c9d0c855f7 [special] Alias for special.expm1 and special.exp2 (#54670)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54670

Reviewed By: H-Huang

Differential Revision: D27401440

Pulled By: mruberry

fbshipit-source-id: 02b1fd0e8ffd3f5a017d6b6b9229b76b92b4b745
2021-03-30 10:03:13 -07:00
75ed6fbd91 Fix CUDA 11.2 jobs for Windows (#54955)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/54589#issuecomment-810255467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54955

Reviewed By: walterddr

Differential Revision: D27434722

Pulled By: agolynski

fbshipit-source-id: b99f24be679da65e5894e1a21e3cb2a62320fdda
2021-03-30 09:58:31 -07:00
728d18f976 Enable USE_KINETO (#51273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51273

Reviewed By: malfet

Differential Revision: D26119144

fbshipit-source-id: eab0d17789c1eab89a7369f0574d3b4c2767c98a
2021-03-30 09:39:11 -07:00
9b9e19a808 Fix test_variant_consistency_jit_addmm for complex types (#54917)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54917

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D27411483

Pulled By: anjali411

fbshipit-source-id: 95a2241ff326a7ab8b8d3abe0ad100074c23e47a
2021-03-30 09:33:10 -07:00
6c8d783830 Generate no-op meta functions for all inplace operations (#54901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901

Some subtleties:
- Need to make sure not to clobber composite definitions when
  deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
  nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
  weren't raising the errors they needed.  This is tracked
  in https://github.com/pytorch/pytorch/issues/54897

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27407232

Pulled By: ezyang

fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
2021-03-30 09:31:39 -07:00
7c0941ee63 Clang-format powerSGD_hook.py (#54839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54839

ghstack-source-id: 125089465

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27384796

fbshipit-source-id: 8312059f6a47d60ca29f75041141bb88804e1b32
2021-03-30 09:28:45 -07:00
6c31f56bf4 [Gradient Compression] Add cuda.synchronize back to batched powerSGD (#54838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54838

It turns out that an explicit sync is still needed for the batched PowerSGD hook; I found that a job failure is fixed by this change.

The sync was once removed by #54482.

Test Plan:
f260900882
f260899693

Reviewed By: rohan-varma

Differential Revision: D27384738

fbshipit-source-id: 3efd738b9fd375e2ceb36ed3a6bf99cd8ce8ff95
2021-03-30 09:27:11 -07:00
6f63126b5c [quant][fx] Add pass in convert to fold quant-dequant sequence (#54860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54860

Currently we insert a quantize_per_tensor op when we encounter the quantizable input,
so if it has multiple uses and not all are quantizable then we need to add a dequantize op
before these ops.

In this pass, for a sequence of quantize_per_tensor - dequantize, we combine the two
since the pair is a no-op.
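
A toy sketch of the folding idea (illustrative only; the real pass lives in FX graph mode quantization and handles many more cases):
```
import torch
import torch.fx as fx

def fold_quant_dequant(gm: fx.GraphModule) -> fx.GraphModule:
    # dequantize(quantize_per_tensor(x, ...)) is a no-op pair in the quantized
    # graph, so rewire users of the dequantize node to the original input x.
    for node in list(gm.graph.nodes):
        if node.op == "call_method" and node.target == "dequantize":
            src = node.args[0]
            if (isinstance(src, fx.Node) and src.op == "call_function"
                    and src.target is torch.quantize_per_tensor):
                node.replace_all_uses_with(src.args[0])
                gm.graph.erase_node(node)
                if not src.users:
                    gm.graph.erase_node(src)
    gm.recompile()
    return gm
```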

[internal only][pyper]

Before this change we had redundant dequantize nodes in the graph
Example 1x inline_cvr graph https://www.internalfb.com/intern/everpaste/?handle=GODBxAlUMzGHD6MSACpHKKu9qjorbsIXAAAz
 FC layers -> 37
 quantize_per_tensor -> 30
 dequantize -> 49

After this change
https://www.internalfb.com/intern/everpaste/?handle=GAl0uQnOlDNmpLoSAB-GZqRxu9wMbsIXAAAz
 FC layers -> 37
 quantize_per_tensor -> 30
 dequantize -> 39

We remove extra 10 dequantize nodes in the graph.

Test Plan:
python test/test_quantization.py test_fold_quant_dequant

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27390506

fbshipit-source-id: 56e6fb8496171246eccf4bd45eb8bebd87fcb740
2021-03-30 08:40:24 -07:00
a7dc0ab845 [quant][fx][pyper] Get first linear use of quantize_per_tensor for FQN (#54859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54859

This is applicable to the case when a call_function linear op is one of the users of quantize op
In order to be able to map the qparams of quantize_per_tensor to the qparams of the linear operator
that consumes it, we need to use the FQN of the module with the linear op for the qparams of quantize_per_tensor.

Test Plan:
python test/test_quantization.py test_qparams_fqn

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27390505

fbshipit-source-id: a47af0e5ac016f2b2df74fbdf45afe99dc04be46
2021-03-30 08:38:51 -07:00
c690ed0ae8 Fix override for __iter__ (#54702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54702

This fixes subclassing for __iter__ so that it returns an iterator over
subclasses properly instead of Tensor.
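
A hedged illustration of the fixed behavior:
```
import torch

class MyTensor(torch.Tensor):
    pass

t = torch.randn(2, 3).as_subclass(MyTensor)
for row in t:
    # After this fix, iteration preserves the subclass instead of
    # downcasting each element to a plain Tensor.
    print(type(row))  # <class '__main__.MyTensor'>
```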

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27352563

Pulled By: ezyang

fbshipit-source-id: 4c195a86c8f2931a6276dc07b1e74ee72002107c
2021-03-30 08:30:50 -07:00
2503028ff5 Fix ConvTranspose with padding as a list of values (#54911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54452

The assertion that fails in the issue is necessary to appease mypy. Instead, I fix `_ntuple` to always return a `tuple`.
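
Roughly, the fix looks like this (a sketch based on `torch.nn.modules.utils._ntuple`):
```
import collections.abc
from itertools import repeat

def _ntuple(n):
    # Always materialize a tuple, even when the argument is already an
    # iterable (e.g. a list), so downstream code can rely on tuple semantics.
    def parse(x):
        if isinstance(x, collections.abc.Iterable):
            return tuple(x)
        return tuple(repeat(x, n))
    return parse

_pair = _ntuple(2)
print(_pair(3))       # (3, 3)
print(_pair([1, 2]))  # (1, 2)  (now a tuple, not the original list)
```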

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54911

Reviewed By: H-Huang

Differential Revision: D27411088

Pulled By: jbschlosser

fbshipit-source-id: 7f5045c58dd4f5f3b07b4826d9b4ca85606c5bce
2021-03-30 07:37:31 -07:00
0269a5f481 Re-enable cmath.sqrt(complex(-1,-0.0)) test (#54923)
Summary:
Both JITed and plain `cmath.sqrt(complex(-1, -0.0))` should return `-1j` now that https://github.com/pytorch/pytorch/pull/54820 has been resolved.

Also, use f-strings instead of the `.format` method.
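
For reference, the plain-Python behavior being tested:
```
import cmath

# sqrt's branch cut lies along the negative real axis, so the sign of the
# imaginary zero selects the side of the cut:
print(cmath.sqrt(complex(-1.0, -0.0)))  # -1j
print(cmath.sqrt(complex(-1.0, 0.0)))   # 1j
```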

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54923

Reviewed By: anjali411

Differential Revision: D27415117

Pulled By: malfet

fbshipit-source-id: 52e182feca50b690684de87c99df0ad6bef1ab44
2021-03-30 07:25:26 -07:00
46e7f6773f [Static Runtime] Check for inplace ops explicitly in ReplaceWithCopy (#54657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54657

The constraint checked in D27145406 (acf03b13f1) is too tight for the adindexer model and as a result, 5 ops (4 aten::narrow + 1 aten::permute) are not replaced with the copy version, which resulted in a perf regression. This diff checks for inplace ops explicitly and only applies the input constraint to graphs with inplace ops.

Test Plan: Contbuild

Reviewed By: ajyu

Differential Revision: D27253145

fbshipit-source-id: 23e2b1a018c84dd0fc2880fddd9c41bc0422b8eb
2021-03-30 07:08:00 -07:00
32bb5c3609 [iOS GPU][Kernel] Fix the softmax kernels (#54519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54519

The current MPSCNNSoftmax kernel operates on a tensor's feature channels. Therefore, in order to use it, we need to reshape the input tensors based on the value of `dim`. For now, I decided to limit the input to two dimensions; I'll remove the constraint once we have shader implementations.
ghstack-source-id: 124497702

Test Plan:
- SandcastleCI
- CircleCI

Reviewed By: dhruvbird

Differential Revision: D27218823

fbshipit-source-id: 48c427ceedb42e63c183114939ca801ebfc81fd9
2021-03-30 03:58:55 -07:00
626bb3d310 [iOS GPU][Design] Use function_constants to simplify shader kernels (#54518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54518

When I was reading the Metal Shader Language Specification, I noticed that using `function_constants` in C++ attributes could let us do compile time kernel selection, which can drastically reduce the complexity of writing GPU kernels for different input texture types. We should apply this trick to all our existing shader functions.
ghstack-source-id: 124497703

Test Plan:
- Metal op tests
```
2021-03-20 23:35:20.496922-0700 PyTorchPlayground[48215:8455407] [bool test_view()],[1 10 2 2 ],[SUCCEED]
2021-03-20 23:35:20.522714-0700 PyTorchPlayground[48215:8455407] [bool test_view2()],[1 10 2 2 ],[SUCCEED]
2021-03-20 23:35:20.553591-0700 PyTorchPlayground[48215:8455407] [bool test_view3()],[5 8 ],[SUCCEED]
2021-03-20 23:35:20.571194-0700 PyTorchPlayground[48215:8455407] [bool test_view4()],[5 8 ],[SUCCEED]
```
- Sandcastle CI
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27218965

fbshipit-source-id: 763c54d551de3a88e4ff0007894200d72f00958c
2021-03-30 03:57:02 -07:00
f9097c43b9 Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#53655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53655

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets type to the indices type when they are not the same.
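
A hedged illustration of a now-accepted mixed-dtype call (shapes and values are arbitrary):
```
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3)
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9], dtype=torch.int64)
offsets = torch.tensor([0, 4], dtype=torch.int32)  # int32 offsets, int64 indices

# With this change, the offsets are cast to the indices dtype internally
# instead of the mix being rejected.
out = F.embedding_bag(indices, weight, offsets)
print(out.shape)  # torch.Size([2, 3])
```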

Test Plan: unit tests

Reviewed By: qizzzh

Differential Revision: D26820202

fbshipit-source-id: 3e8f09523329ea12393ea92ee9a6315aa40a0b7f
2021-03-29 23:58:03 -07:00
5c12d97d96 Add script to export a JSON of slow test case times (#54907)
Summary:
This PR introduces a script to spit out a list of slow tests into a file `.pytorch-slow-tests`. The format is currently JSON: simply a dictionary with entries that look like `("test_case_name (__main__.test_suite)" -> average time in seconds)`. This is one of the steps in maintaining a list of slow tests so we can retire the manual slowTest labeling process.

The script reads the previous day's viable/strict data (to ensure we have fully uploaded data) and aggregates the test times for **passed** test cases. It then filters the individual test cases to exclude those faster than 60 seconds.
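
The export step reduces to something like this sketch (hypothetical helper name; the real script also handles fetching and aggregating the S3 data):
```
import json

def export_slow_tests(case_times, threshold_sec=60.0, filename=".pytorch-slow-tests"):
    # case_times maps "test_case_name (__main__.test_suite)" -> average seconds
    slow = {name: t for name, t in case_times.items() if t >= threshold_sec}
    with open(filename, "w") as f:
        json.dump(slow, f, indent=4)
```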

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54907

Test Plan:
`python tools/export_slow_test.py`
Check that `.pytorch-slow-tests` contains data. Mine looks like:
```
{
    "test_matmul_4d_4d_complex_cpu (__main__.TestAutogradDeviceTypeCPU)": 91.22675,
    "test_unary_ops (__main__.TestTEFuser)": 68.6,
    "test_fn_gradgrad_unfold_cpu_complex128 (__main__.TestGradientsCPU)": 82.49153333333334,
    "test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass)": 94.0914375,
    "test_ddp_uneven_inputs (__main__.TestDistBackendWithFork)": 134.4995,
    "test_pdist_norm_large_cuda (__main__.TestTorchDeviceTypeCUDA)": 60.2634,
    "test_cusparse_multiple_threads_same_device (__main__.TestCuda)": 97.9022,
    "test_fn_gradgrad_unfold_cuda_complex128 (__main__.TestGradientsCUDA)": 130.7222,
    "test_ddp_uneven_inputs (__main__.TestDistBackendWithSpawn)": 136.08133333333333,
    "test_jit_cuda_archflags (__main__.TestCppExtensionJIT)": 112.80733333333333,
    "test_lobpcg_ortho_cuda_float64 (__main__.TestLinalgCUDA)": 63.8312,
    "test_matmul_4d_4d_complex_cuda (__main__.TestAutogradDeviceTypeCUDA)": 62.1062,
    "test_inverse_many_batches_cuda_complex128 (__main__.TestLinalgCUDA)": 1434.505,
    "test_inverse_many_batches_cuda_complex64 (__main__.TestLinalgCUDA)": 1403.846,
    "test_inverse_many_batches_cuda_float32 (__main__.TestLinalgCUDA)": 2081.614,
    "test_inverse_many_batches_cuda_float64 (__main__.TestLinalgCUDA)": 1410.788,
    "test_matrix_exp_analytic_cuda_complex128 (__main__.TestLinalgCUDA)": 172.167,
    "test_matrix_exp_analytic_cuda_complex64 (__main__.TestLinalgCUDA)": 172.57,
    "test_matrix_exp_analytic_cuda_float32 (__main__.TestLinalgCUDA)": 258.61,
    "test_matrix_exp_analytic_cuda_float64 (__main__.TestLinalgCUDA)": 174.793,
    "test_inverse_many_batches_cpu_complex128 (__main__.TestLinalgCPU)": 666.464,
    "test_inverse_many_batches_cpu_complex64 (__main__.TestLinalgCPU)": 667.26,
    "test_inverse_many_batches_cpu_float32 (__main__.TestLinalgCPU)": 1100.719,
    "test_inverse_many_batches_cpu_float64 (__main__.TestLinalgCPU)": 651.037,
    "test_matrix_exp_analytic_cpu_complex128 (__main__.TestLinalgCPU)": 72.965,
    "test_matrix_exp_analytic_cpu_complex64 (__main__.TestLinalgCPU)": 74.184,
    "test_matrix_exp_analytic_cpu_float32 (__main__.TestLinalgCPU)": 128.768,
    "test_matrix_exp_analytic_cpu_float64 (__main__.TestLinalgCPU)": 72.138,
    "test_conv1d_with_relu_fc (__main__.TestXNNPACKConv1dTransformPass)": 123.728,
    "test_fn_gradgrad_linalg_householder_product_cuda_complex128 (__main__.TestGradientsCUDA)": 60.708,
    "test_lobpcg (__main__.TestAutograd)": 120.408,
    "test_collect_callgrind (__main__.TestBenchmarkUtils)": 206.896,
    "test_collect_cpp_callgrind (__main__.TestBenchmarkUtils)": 122.507,
    "test_proper_exit (__main__.TestDataLoader)": 172.356,
    "test_proper_exit (__main__.TestDataLoaderPersistentWorkers)": 172.02,
    "testNBit (__main__.operator_test.fused_nbit_rowwise_conversion_ops_test.TestNBitGreedyFused)": 96.9435,
    "IntegerDivider (__main__.TestCUDAIntegerDivider)": 156.73700000000002
}
```

Reviewed By: walterddr, malfet

Differential Revision: D27412861

Pulled By: janeyx99

fbshipit-source-id: ec3d327e0dc6c93093e8b1c8454e3166b0649909
2021-03-29 20:45:02 -07:00
a1bd7918cc [docs][quant] Fix FX Graph Mode Quantization tutorial link (#54715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54715

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27338515

fbshipit-source-id: d61b140284548073df42ead1900f179c6ada2f02
2021-03-29 17:25:19 -07:00
fbaad8c0f9 [PyTorch] TensorIterator::output should return const reference (#54811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54811

Callers can make a refcount bump themselves if they need one.
ghstack-source-id: 125136516

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D27377210

fbshipit-source-id: ea58c7190fe2d7896432e403ecb1c59761aa319d
2021-03-29 15:16:25 -07:00
1267efce75 [nnc] Add a default constructor for Placeholder
Summary:
It's useful to be able to have an uninitialized Placeholder,
sometimes, e.g., as a class member, where member initialization is
awkward/impossible.

(Yes, one could wrap a Placeholder in a unique_ptr, but it's an extra layer of
cruft).

Test Plan: `buck build //caffe2/test:jit`

Reviewed By: navahgar

Differential Revision: D27400784

fbshipit-source-id: 56191ee11cbb4bc91b5624af6329f2d6d007570b
2021-03-29 15:11:21 -07:00
1bccd48465 Allow creating SugaredValue for a complex valued IValue and deserialization logic for "infj" and "nanj" global constants (#54328)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54328

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27369134

Pulled By: anjali411

fbshipit-source-id: aec26750a6fc8917ee15306684b743d13a91570c
2021-03-29 14:46:29 -07:00
f4dfa02c03 Add documentation for torch.jit.Attribute and torch.jit.annotate (#54485)
Summary:
This is to prepare for new language reference spec that needs to describe `torch.jit.Attribute` and `torch.jit.annotate`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54485

Reviewed By: SplitInfinity, nikithamalgifb

Differential Revision: D27406843

Pulled By: gmagogsfm

fbshipit-source-id: 98983b9df0f974ed69965ba4fcc03c1a18d1f9f5
2021-03-29 14:44:53 -07:00
1a0b77e7c4 Suggest TORCH_LIBRARY_FRAGMENT in duplicate TORCH_LIBRARY error message (#54883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54883

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27400592

Pulled By: ezyang

fbshipit-source-id: 45d6a3a890979cce1b07e933f5335f3fa3a375a2
2021-03-29 14:43:11 -07:00
e829754992 [PyTorch] Inline Tensor keyset-checking methods & similar getters (#54806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54806

These are all very small key set checks (or similar getters
like `dtype()`), and we clearly want them to be inlinable -- we've even
made them non-virtual for perf in TensorImpl and said so in
comments. Don't make LTO work to figure that out.
ghstack-source-id: 125060650

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D27375016

fbshipit-source-id: 5c3dbfa38fa493c8f7e0ac4e5acd3598d5896558
2021-03-29 14:40:02 -07:00
49b07ac5d1 Enable complex autograd for index, add index and index_put OpInfos (#54562)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54562

Reviewed By: malfet

Differential Revision: D27300086

Pulled By: anjali411

fbshipit-source-id: 23e8335e6e4c8f10888b5c54a040880c5b499215
2021-03-29 14:36:43 -07:00
d5564618d0 [NCCL][Blocking Wait] Log set exceptions when checking for exceptions in (#54558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54558

In blocking wait's polling synchronization loop, we frequently call checkAndSetException() as part of isCompleted() to check the status of nccl operations. It would be useful to log here in case we encounter any exceptions (which are later thrown by `checkAndThrowException`).

Also slightly refactors code previously added to make use of a helper function to get the error message given an `std::exception_ptr`.
ghstack-source-id: 125124314

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D27136202

fbshipit-source-id: 256eb63c5c2a84be909722d3fd7377ad9303fa11
2021-03-29 14:15:45 -07:00
028d2d6e63 [NCCL] Enhance watchdog to log exceptions (#54557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54557

When looping through the nccl communicator cache checking for errors, enhance the watchdog to log exceptions that are set on the communicator.

This will allow for better debugability since the NCCL error will be logged when the watchdog receives errors for the communicators and aborts them appropriately.

Tested by forcing a NCCL error with NCCL_BLOCKING_WAIT=1 and verifying that the exception is indeed logged.
ghstack-source-id: 125124310

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27106699

fbshipit-source-id: 1d2bd9f057a3796ce15dd8a4ce34cf6899eee45c
2021-03-29 14:15:42 -07:00
8c13dde458 [DDP] Remove redundant pass statement (#54219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54219

There is no need for this ``pass``.
ghstack-source-id: 125124311

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27105234

fbshipit-source-id: 95496fa785fdc66a6c3c8ceaa14af565588325df
2021-03-29 14:15:39 -07:00
d185719455 Expose dist.monitored_barrier() API (#53787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787

Per title, this exposes a Python-based monitored barrier API that we can use as part of debuggability and that may be useful for user applications.
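
A hedged usage sketch (assumes the process was launched so that RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set in the environment, e.g. via the distributed launcher):
```
import datetime
import torch.distributed as dist

dist.init_process_group("gloo", init_method="env://")
dist.monitored_barrier(timeout=datetime.timedelta(seconds=30))
# On timeout, rank 0 raises an error naming the rank(s) that did not join.
```
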
ghstack-source-id: 125124315

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26965127

fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
2021-03-29 14:15:37 -07:00
4541f60390 Gloo-only CPU-based monitored barrier (#53773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53773

Closes https://github.com/pytorch/pytorch/issues/52876

Implements a barrier by having each rank send/recv to rank 0; rank 0 waits for these requests and, on timeout, throws an exception indicating which rank did not join within the given timeout.

This barrier is only intended for CPU use cases and built into process group gloo, and will be used for debugging synchronization/hang issues.

Test Plan: Added UT

Reviewed By: zhaojuanmao

Differential Revision: D26921357

fbshipit-source-id: 7c16e861b4b8ea2bdd67a36b3de7b1029af7d173
2021-03-29 14:14:10 -07:00
8e89d30f09 [nnc] Lower scalar constants as doubles/longs (#54824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54824

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D27383224

Pulled By: bertmaher

fbshipit-source-id: 84b43ba6c22c1338c68c40a11ca647c3717f2abc
2021-03-29 14:06:04 -07:00
7c8b0f2600 Test torch.chain_matmul for complex dtype (#54885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54885

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D27400936

Pulled By: anjali411

fbshipit-source-id: 415d843d7c55f4d84a8e9faab926a4895e1544d0
2021-03-29 13:37:23 -07:00
8cf97cbb55 [ROCm] add 4.1 to nightly builds (#54635)
Summary:
Depends on https://github.com/pytorch/builder/pull/685.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54635

Reviewed By: malfet

Differential Revision: D27368700

Pulled By: walterddr

fbshipit-source-id: 35ac59bed8450e7e69b1a4ba74955a72d729487a
2021-03-29 12:33:38 -07:00
ff537b77ff [PyTorch][easy] Move more strings in torch::class_ (#54547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54547

These arguments to `BuiltinOpFunction`'s ctor don't need to be copied.
ghstack-source-id: 124690196

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D27277318

fbshipit-source-id: 68f1f545ca977b2e1cabc91620da31719bf81e1a
2021-03-29 12:27:11 -07:00
51fa25443f [PyTorch][easy] Move strings into class_::defineMethod (#54533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54533

There were some forgotten moves here. Since the values are
not otherwise used, let's just not give them names.
ghstack-source-id: 124674348

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D27271991

fbshipit-source-id: 793dd4576db659b3b9b973a4e09ee3133cf41dfe
2021-03-29 12:25:41 -07:00
67d44377e3 Remove hacky wrapper for about 100 kernels (#54751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54751

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 125141211

Test Plan:
buck build //caffe2/aten/...
BUILD_TENSOREXPR_BENCHMARK=ON BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install

Reviewed By: smessmer

Differential Revision: D27353530

fbshipit-source-id: 66f83edfb1016ca0040fb603e43604cd2db02c4c
2021-03-29 12:06:34 -07:00
ec1bbe130c Revert D27364777: [pytorch][PR] Add annotations to PRs from forks
Test Plan: revert-hammer

Differential Revision:
D27364777 (56f12e6199)

Original commit changeset: a830d372d7bb

fbshipit-source-id: 56d490a4161a78ab28fd7d948b5a13af58efd9d7
2021-03-29 11:52:55 -07:00
3187a71bbe [test] vc toolchain modification (#54589)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54502
Needs to be merged after https://github.com/pytorch/builder/pull/684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54589

Reviewed By: walterddr

Differential Revision: D27402066

Pulled By: seemethere

fbshipit-source-id: 68f92485d89edf2c3315de8c57447f180679c77d
2021-03-29 11:21:17 -07:00
263180d7fc Revert D26973911: Implement public API InferenceMode and its error handling
Test Plan: revert-hammer

Differential Revision:
D26973911 (7caa464631)

Original commit changeset: 0ebdac7a3cd5

fbshipit-source-id: afd37a3785bc694e8ffbd679eba1cfed89ef2273
2021-03-29 11:17:49 -07:00
1551bcc670 change logging.warn to logging.warning (#51727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51727

logging.warn() is deprecated since Python 3.3 in favor of logging.warning()
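
For illustration:
```
import logging

logging.warning("disk almost full")  # preferred spelling
logging.warn("disk almost full")     # deprecated alias; issues a DeprecationWarning
```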

Reviewed By: yinghai

Differential Revision: D25785598

fbshipit-source-id: 391d834fe607cd571ee147445aa0a98910535099
2021-03-29 10:42:30 -07:00
9ef53f7e0f docs: remove extra backticks in narrow_copy (#54669)
Summary:
fixes https://github.com/pytorch/pytorch/issues/41590
https://11813004-65600975-gh.circle-artifacts.com/0/docs/tensors.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54669

Reviewed By: ailzhang

Differential Revision: D27328228

Pulled By: zou3519

fbshipit-source-id: 9a4a9bc4b265b0e82cf91f94dbbfd842fc42cdcb
2021-03-29 10:38:21 -07:00
63997db6ec [JIT] fix freezing with mkldnn tensors (#54632)
Summary:
We were accessing their storage, which throws for MKL-DNN tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54632

Reviewed By: ezyang

Differential Revision: D27372192

Pulled By: eellison

fbshipit-source-id: 9985e85af7a35a3d6bf1c0be0185699c34877b94
2021-03-29 10:27:33 -07:00
74e01c1dd9 docs: change to FloatTensor for requires_grad=True (#54658)
Summary:
fixes https://github.com/pytorch/pytorch/issues/54506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54658

Reviewed By: ailzhang

Differential Revision: D27328321

Pulled By: zou3519

fbshipit-source-id: d29fa266a1cb2b6d8566055dfb6ce001edde9d96
2021-03-29 10:25:56 -07:00
6dedecc77c docs: add memory_format in torch.empty (#54664)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54664

Reviewed By: ailzhang

Differential Revision: D27328504

Pulled By: zou3519

fbshipit-source-id: 6c3e11473ada34f7e9fae7bae366328e50f71b0e
2021-03-29 10:23:36 -07:00
02f5c50828 docs: separate autosummary for flatten layers (#54663)
Summary:
fixes https://github.com/pytorch/pytorch/issues/46881
https://11815123-65600975-gh.circle-artifacts.com/0/docs/generated/torch.nn.Flatten.html#torch.nn.Flatten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54663

Reviewed By: ailzhang

Differential Revision: D27328367

Pulled By: zou3519

fbshipit-source-id: de1651a670181db8ea8ab16624c17ba08a88eb5d
2021-03-29 10:23:34 -07:00
7eef0c3ab5 docs: add functional group_norm (#54673)
Summary:
fixes https://github.com/pytorch/pytorch/issues/34209
https://11813548-65600975-gh.circle-artifacts.com/0/docs/nn.functional.html#normalization-functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54673

Reviewed By: ailzhang

Differential Revision: D27328211

Pulled By: zou3519

fbshipit-source-id: 75c49849377047502962157239857ed99afe6d1e
2021-03-29 10:21:50 -07:00
475251631b docs: reference links to serialization.html (#54659)
Summary:
fixes https://github.com/pytorch/pytorch/issues/54311
https://11811979-65600975-gh.circle-artifacts.com/0/docs/generated/torch.save.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54659

Reviewed By: ailzhang

Differential Revision: D27328281

Pulled By: zou3519

fbshipit-source-id: b88d02e5407238a338d537d013a297ae9cdf922b
2021-03-29 10:15:07 -07:00
59d1f08b4c docs: fix docstring signature of torch.{onnx,utils} (#54662)
Summary:
fixes https://github.com/pytorch/pytorch/issues/50018
fixes https://github.com/pytorch/pytorch/issues/50017
https://11811000-65600975-gh.circle-artifacts.com/0/docs/onnx.html#functions
https://11811000-65600975-gh.circle-artifacts.com/0/docs/mobile_optimizer.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54662

Reviewed By: ailzhang

Differential Revision: D27328485

Pulled By: zou3519

fbshipit-source-id: e658542072ba633b9c309145fc5182edf895d0a6
2021-03-29 10:07:42 -07:00
84232b762b docs: add reset_peak_memory_stats in cuda.rst (#54668)
Summary:
fixes https://github.com/pytorch/pytorch/issues/41808
https://11812999-65600975-gh.circle-artifacts.com/0/docs/cuda.html

One question: does `reset_peak_stats` exist in `torch.cuda`?
I can't find it anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54668

Reviewed By: ailzhang

Differential Revision: D27328444

Pulled By: zou3519

fbshipit-source-id: 098024d43da98e3249aa9aa71cb10126095504a4
2021-03-29 10:05:20 -07:00
12a454788b docs: fix parameter in torch.take (#54667)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43495
https://11812612-65600975-gh.circle-artifacts.com/0/docs/generated/torch.take.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54667

Reviewed By: ailzhang

Differential Revision: D27328252

Pulled By: zou3519

fbshipit-source-id: 5812ebdaba063ca0a9c0f4a9becd00a570d84d30
2021-03-29 10:01:23 -07:00
56f12e6199 Add annotations to PRs from forks (#54779)
Summary:
We've been using [pytorch/add-annotations-github-action](https://github.com/pytorch/add-annotations-github-action) to add annotations to PRs when they fail Flake8 or clang-tidy. Up until now, though, that functionality has only worked on PRs in pytorch/pytorch itself, not on PRs from forks. This PR fixes that using a technique from [this GitHub blog post](https://securitylab.github.com/research/github-actions-preventing-pwn-requests/) (also linked in a comment in this diff).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54779

Test Plan: janeyx99 and I tested this in the same GitHub repo used to test https://github.com/pytorch/pytorch/issues/54685 and https://github.com/pytorch/pytorch/issues/54693, including with PRs from forks.

Reviewed By: walterddr

Differential Revision: D27364777

Pulled By: samestep

fbshipit-source-id: a830d372d7bb3b2529fc633b707b44f2b6cf9baa
2021-03-29 09:25:12 -07:00
68af6d9565 Use custom sqrt if stdc++ does not fall back to C99 csqrt (#54820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54820

template implementation of std::sqrt() in libstdc++ yields incorrect results for `std::complex(-std::abs(x), -0.0)`, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89991
For example:
```
#include <iostream>
#include <complex>
int main() {
  std::cout << std::sqrt(std::complex<float>(-1.0f, -0.0f)) << std::endl;
}
```
prints `(0, -1)` if libstdc++ is compiled to use C99 csqrt/csqrtf fallback, but `(0, 1)` if configured not to use it.

Test Plan: CI

Reviewed By: luciang

Differential Revision: D27379302

fbshipit-source-id: 03f614fdb7ff734139736a2a5f6872cee0173bee
2021-03-29 09:05:48 -07:00
717e70a824 (BE) Refactor get-test-times-from-S3 into s3_stat_parser (#54808)
Summary:
Moves more S3 parsing code into s3_stat_parser.py. This is another step toward properly modularizing the parsing code. I will also use this exact function in upcoming slowTest code.

Also replaces some `Any` annotations in the code with `Report`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54808

Test Plan:
`.pytorch-test-times` generated before and after this change is the same.
CI should pass, specifically the test tools GHA.

Reviewed By: walterddr

Differential Revision: D27375783

Pulled By: janeyx99

fbshipit-source-id: bec28551668b2eb3fdd60d802200993e493eac83
2021-03-29 08:45:22 -07:00
3ddc6174da Raise error in clip_grad_norm_ if norm is non-finite (#53843)
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`

Fixes https://github.com/pytorch/pytorch/issues/46849
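
A hedged sketch of the new behavior (the exact exception type and message are illustrative):
```
import torch
from torch import nn

p = nn.Parameter(torch.ones(3))
p.grad = torch.tensor([float("inf"), 0.0, 0.0])

try:
    nn.utils.clip_grad_norm_([p], max_norm=1.0)  # total norm is inf -> raises
except RuntimeError as e:
    print("raised:", e)

# Opt back into the old, silent behavior:
nn.utils.clip_grad_norm_([p], max_norm=1.0, error_if_nonfinite=False)
```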

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843

Reviewed By: malfet

Differential Revision: D27291838

Pulled By: jbschlosser

fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
2021-03-29 08:41:21 -07:00
1f36ce6e4d Restore storage on meta tensors; increase meta coverage (#53973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973

Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.

The first part is restoring the concept of storage to meta tensors.  Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:

* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a cludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).

The second part is adding more support for the most used functions in the test suite.

* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case); see the sketch after this list.
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
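
A rough illustration of the copy semantics described in the list above (a sketch; the `device="meta"` spelling is as in current releases):
```
import torch

x = torch.empty(2, 3, device="meta")  # carries shape/stride/dtype, no real data
print(x.shape, x.stride(), x.dtype)

y_meta = torch.empty(2, 3, device="meta")
y_meta.copy_(torch.randn(2, 3))       # copying *into* a meta tensor is always OK
# torch.empty(2, 3).copy_(y_meta)     # copying *out of* a meta tensor raises
```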

Getting more meta function support triggers a number of bugs in the test suite, which I then fix:

- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D27036572

Test Plan: Imported from OSS

Reviewed By: agolynski, bdhirsh

Pulled By: ezyang

fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
2021-03-29 08:37:46 -07:00
94efb48e16 Adds the cfloat dtype to the eager and jit variant consistency tests (#54854)
Summary:
Per title. One skip for addmm was needed. Either it or the jit test doesn't seem to handle a complex literal properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54854

Reviewed By: anjali411

Differential Revision: D27395651

Pulled By: mruberry

fbshipit-source-id: 0bfadf0a8500f26d3a89f56f104fb44561f594d9
2021-03-29 08:15:27 -07:00
2fd1eb3a9f make all arguments in test_history.py optional kwarg (#54797)
Summary:
This makes the script more flexible for reuse when pulling test stats other than by-test-case.
It also makes it less likely to be misused via positional arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54797

Test Plan: see the updated tools/test/test_test_history.py examples.

Reviewed By: samestep

Differential Revision: D27371903

Pulled By: walterddr

fbshipit-source-id: 0ee02d654684315b44f5942904b857053d27e954
2021-03-29 07:25:14 -07:00
6d2bf76bba Using latest windows CUDA exe (#54596)
Summary:
Using the latest cuda_11.2.2_461.33_win10 installer to fix cu112 test failures.
This should fix https://github.com/pytorch/pytorch/issues/51980.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54596

Reviewed By: seemethere

Differential Revision: D27365008

Pulled By: walterddr

fbshipit-source-id: 682e79888d9f10c0a5b227d66165ea50c47ba0f9
2021-03-29 07:20:34 -07:00
86b1f4e9f2 fix silent correctness bug with channels_last usage of upsample cuda kernels (#54744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54744

Fixes https://github.com/pytorch/pytorch/issues/54590

After the porting the upsample operators to be structured, they now forward memory_format information to the output. This is a problem for the cuda kernels, which are not implemented to deal with `torch.channels_last` memory format. The operators are:
* upsample_nearest2d
* upsample_bilinear2d
* upsample_nearest3d
* upsample_trilinear3d

This fix just allocates a temporary, contiguous output tensor when that happens, writes the results to the temporary and copies the results back to the output tensor.

I held off on adding tests to get the fix out quickly, but I wrote a script and ran some manual tests that basically just assert that the outputs are the same for cpu and cuda, within some threshold. I ran it for all 4 operators:
```
import torch

def basically_equal(t1, t2):
    epsilon = 1e-4
    diffs = torch.abs(t1 - t2)
    print(torch.all(diffs < epsilon))

# upsample 2d
a = torch.arange(48).reshape(2, 2, 3, 4).contiguous(memory_format=torch.channels_last).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='bilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='bilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))

# upsample 3d
a = torch.arange(96).reshape(2, 2, 2, 3, 4).contiguous(memory_format=torch.channels_last_3d).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='trilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='trilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))
```

prints
```
tensor(True)
tensor(True)
tensor(True)
tensor(True)
```

One thing that was weird: `upsample_bilinear2d` and `upsample_trilinear3d` only agreed across cpu/cuda to within an epsilon of `1e-4`. That tentatively sounds close enough to say that cuda isn't "wrong", but it's not exactly "equal" either. I also ran the script before my change, and `bilinear2d` and `trilinear3d` likewise matched across cpu/cuda only to within `1e-4`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27351393

Pulled By: bdhirsh

fbshipit-source-id: b33f46e4855dc8b49b363770190b639beebbf5a7
2021-03-29 06:42:03 -07:00
4e5af53d29 Deprecate legacy constructor torch.Tensor() (#54414)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47112

This pull request is the final step in [the proposed plan](https://github.com/pytorch/pytorch/issues/47112#issuecomment-789972007) for deprecating `torch.Tensor()` constructor. Specifically, it **updates the docs and throws `TORCH_WARN_ONCE` if someone uses `torch.Tensor()`**.
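
For reference, a sketch of the deprecated call and its explicit replacements:
```
import torch

t = torch.Tensor()             # deprecated: now warns once (TORCH_WARN_ONCE)

# Preferred, explicit alternatives:
a = torch.empty(0)             # uninitialized tensor of a given shape
b = torch.tensor([1.0, 2.0])   # construct from data, inferring dtype
c = torch.zeros(2, 3)          # shape plus an explicit fill value
```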

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54414

Reviewed By: ailzhang

Differential Revision: D27325267

Pulled By: heitorschueroff

fbshipit-source-id: 5442572603d340b89e8cc5a886a330dd9b13550a
2021-03-29 05:14:47 -07:00
a0a7a2d648 [quant][fx] store dtype, axis as literals in the graph (#54624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54624

previously we were creating setattr nodes for dtype and axis.
The FX convention is that primitive types are embedded as literals in args/kwargs.

With this change we won't see getattr nodes in the graph anymore for dtype/axis

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27306898

fbshipit-source-id: a7c91c7cb21ee96015c7f8830b38d943ada65358
2021-03-28 21:59:49 -07:00
9e6877c5c5 Port torch.outer method_tests() to OpInfos (#54798)
Summary:
An attempt to make an OpInfo-based test for torch.outer (aka torch.ger).

As a part of https://github.com/pytorch/pytorch/issues/54261 effort.

mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54798

Reviewed By: ngimel

Differential Revision: D27384891

Pulled By: mruberry

fbshipit-source-id: 0c90f84a388d2addc8de37d0c1713d8598211555
2021-03-28 18:34:54 -07:00
b7c5d57563 [testing] support op with args/kwargs in test_unary_ufunc (#52194)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52194

Reviewed By: ngimel

Differential Revision: D27385139

Pulled By: mruberry

fbshipit-source-id: 63118dee33a138ef13810465d2d2d9fa194dfb28
2021-03-28 18:10:20 -07:00
07350da3b4 enable bf16 for cat serial kernel (#54674)
Summary:
cat 10 2-D tensors at dim=1

| dtype | shape            | serial kernel | copy kernel |
| ----- | ---------------- | ------------- | ----------- |
| fp32  | 1024 * 16k       | 105.45 ms     | 102.41 ms   |
| fp32  | 1024 * (100 + i) | 324.75 us     | 448.66 us   |
| bf16  | 1024 * 16k       | 49.82 ms      | 51.39 ms    |
| bf16  | 1024 * (100 + i) | 164.74 us     | 244.64 us   |

i = {0, ..., 9}

benchmark code
```
import torch
import torch.utils.benchmark as benchmark
def cat(*args, dim=0):
    return torch.cat(args, dim)

tensors = []
for i in range(10):
    tensors.append(torch.rand(1024, 16 *1024))
    # tensors.append(torch.rand(1024, 16 *1024).bfloat16())
    # tensors.append(torch.rand(1024, 100 + i))
    # tensors.append(torch.rand(1024, 100 + i).bfloat16())

t0 = benchmark.Timer(
    stmt='cat(*tensors, dim=1)',
    setup='from __main__ import cat',
    globals={'tensors': tensors},
    num_threads=1)

print(t0.blocked_autorange())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54674

Reviewed By: ailzhang

Differential Revision: D27325347

Pulled By: heitorschueroff

fbshipit-source-id: 7a0f4bf8d92dbf8e725fdd2e8a2c901188811d6f
2021-03-28 17:05:10 -07:00
01b1557014 enable bf16 vec copy (#54671)
Summary:
Enable bf16 vectorized copy.

BFloat16 copy gets 2x the performance of fp32, as expected (half as many bytes are moved).

BFloat16's vectorized copy does not show a performance gain over the scalar version in the op benchmark. This is likely because the copy is memory-bound: the memory system transfers whole cache lines at a time, even though the code is written to read/write one scalar at a time.

benchmarks code:
```
import torch
import torch.utils.benchmark as benchmark

# x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.bfloat16)
x = torch.empty(10 * 18304 * 1024 * 16, dtype=torch.float)
def copy(tensors):
    for t in tensors:
        x.copy_(t)

tensors = []
for i in range(2):
    # L3 cache size: 36608 KB = 18304 * 1024 bfloat16 elements * 2 bytes each
    # tensors.append(torch.rand(10 * 18304 * 1024 * 16).bfloat16())
    tensors.append(torch.rand(10 * 18304 * 1024 * 16))

t0 = benchmark.Timer(
    stmt='copy(tensors)',
    setup='from __main__ import copy',
    globals={'tensors': tensors},
    num_threads=1)

print(t0.timeit(20))
```

Before this commit:
fp32:
  3.84 s
  1 measurement, 20 runs , 1 thread
bf16:
  1.89 s
  1 measurement, 20 runs , 1 thread

After:
fp32:
  3.71 s
  1 measurement, 20 runs , 1 thread
bf16:
  1.85 s
  1 measurement, 20 runs , 1 thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54671

Reviewed By: ailzhang

Differential Revision: D27325350

Pulled By: heitorschueroff

fbshipit-source-id: 1a3b8ca17b4c60dbb3e86bf196f63e0a05228c65
2021-03-28 08:40:24 -07:00
0527d14248 [numpy] Add torch.take_along_dim (#52833)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349

Wrapper around the existing `torch.gather` with broadcasting logic.
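
A small usage sketch of the new op (values chosen for illustration):

```
import torch

t = torch.tensor([[10, 30, 20], [60, 40, 50]])
idx = t.argsort(dim=1)
# mirrors numpy.take_along_axis: indices are broadcast against the input
print(torch.take_along_dim(t, idx, dim=1))
# tensor([[10, 20, 30],
#         [40, 50, 60]])
```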

TODO:
* [x] Add Doc entry (see if phrasing can be improved)
* [x] Add OpInfo
* [x] Add test against numpy
* [x] Handle broadcasting behaviour and when dim is not given.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52833

Reviewed By: malfet

Differential Revision: D27319038

Pulled By: mruberry

fbshipit-source-id: 00f307825f92c679d96e264997aa5509172f5ed1
2021-03-28 05:22:51 -07:00
eec48303c0 Make index_add take a scalar argument alpha (#54176)
Summary:
```
index_add(Tensor self, int dim, Tensor index, Tensor source) -> Tensor
```
now becomes
```
index_add(Tensor self, int dim, Tensor index, Tensor source, Scalar alpha=1) -> Tensor
```
Generally, this sounds useful and harmless, and inside PyTorch we already need this feature in `add_out_dense_sparse_cuda`; see the `SparseCUDATensorMath.cu` change in this PR.
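
A small usage sketch of the new argument, assuming the schema above:

```
import torch

x = torch.zeros(3, 4)
index = torch.tensor([0, 2])
source = torch.ones(2, 4)

# rows 0 and 2 of x each receive alpha * the corresponding source row
y = x.index_add(0, index, source, alpha=2.0)
print(y[0])  # tensor([2., 2., 2., 2.])
```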

**Test not added yet. Will add if, after discussion, we believe this is a good idea.**
- [ ] TODO: add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54176

Reviewed By: ngimel

Differential Revision: D27319198

Pulled By: mruberry

fbshipit-source-id: fe43be082d1230c87c5313458213d5252be2ff23
2021-03-28 00:22:45 -07:00
695eef05a4 optimizer exploration - v1 and v2 + fix position_weighted optimizer + decoupled weight decay (#54042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53881

1. Fix position_weighted optimizer: the position-weighted layer uses the default optimizer but is actually gradient_slices, which will cause problems if we do not handle it properly in the new optimizer. The solution is to use sparse Adagrad when it is gradient_slices.
2. Optimizer implementation of v1 and v2: using 1st momentum with/without bias_correction.
3. Also implemented decoupled weight decay in the new optimizer (see the sketch after this list).
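
For context on item 3, decoupled weight decay applies the decay directly to the weights rather than folding it into the gradient. A generic SGD-flavored sketch (not the caffe2 implementation in this diff):

```
import torch

def sgd_step_decoupled(param, grad, lr=0.01, weight_decay=1e-4):
    param.add_(grad, alpha=-lr)          # gradient step
    param.mul_(1.0 - lr * weight_decay)  # decay applied to the weights directly
```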

Test Plan:
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_2 -- test_mlp_optimization

buck test //caffe2/caffe2/python:optimizer_test -- TestDecayAdagrad

buck test //caffe2/caffe2/python/operator_test:decay_adagrad_test

ctr_mbl_feed work flow: f255731660
oc work flow: f255739503

Reviewed By: 0x10cxR1

Differential Revision: D26839668

fbshipit-source-id: 2b6881c1a88540ef5766be40f5e80001257e2199
2021-03-27 23:03:29 -07:00
5c3d80d8fa [DDP] Mark a few variables as const in reducer (#54764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54764

We mark a few vars as const in Reducer, and also do this for replicas_ and
process_group_, as they should not be changed by the Reducer during training. This
can help catch issues at compile time and prevent developers from
accidentally changing these variables.
ghstack-source-id: 125040110

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27357132

fbshipit-source-id: 23a0edf754a8e4f9e6440e99860e5549724cb7ad
2021-03-27 21:40:18 -07:00
671f80a313 [c10d] s/torch::autograd::variable/at::Tensor/g (#54763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54763

Replaces deprecated torch::autograd::variable with at::Tensor.
torch::autograd::variable is defined as equal to at::Tensor now so this should
be a no-op, but this follows the convention of using Tensor instead of Variable.
ghstack-source-id: 125040109

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27356450

fbshipit-source-id: 1a001358d7726a597141ec47803c8213db4814c0
2021-03-27 21:38:51 -07:00
7caa464631 Implement public API InferenceMode and its error handling (#53343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53343
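
The corresponding Python entry point is used roughly like this (a minimal sketch):

```
import torch

with torch.inference_mode():
    x = torch.ones(3) * 2  # no version counters or view metadata are recorded
# using x in autograd afterwards raises an error, by design
```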

Test Plan: Imported from OSS

Reviewed By: ezyang, nikithamalgifb

Differential Revision: D26973911

Pulled By: ailzhang

fbshipit-source-id: 0ebdac7a3cd554822d26d5a40f539b6e2aaec61d
2021-03-27 13:44:23 -07:00
2309173143 Compute Tensor::toString() without reference to backend (#54711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54711

Just print the dispatch key directly.  The format here doesn't really
make sense but you'll still get something like CPUFloatTensor (because
the dispatch key is just CPU).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27338811

Pulled By: ezyang

fbshipit-source-id: f459c5f7c006c06df4913ab33697eae89c46d83f
2021-03-27 11:55:52 -07:00
f067972527 Make memory overlap a little less precise so it works with null data ptr (#54710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54710

I'm going to make meta tensors have storage (but DataPtr is always
null) so that I can accurately report memory overlap error checking, but
I now have a problem which is that if memory overlap test looks at the
actual data pointer, everything is going to look like it aliases!  A
more conservative test is to just see if the Storage objects themselves
alias, and assume that the data pointers are unique if they don't.

The loss of precision arises if you unsafely have two distinct
storage objects that point to the same data pointer. This situation
is pretty rare, so I think the tradeoff is worth it (and I am hoping no tests
are triggered by this).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27338810

Pulled By: ezyang

fbshipit-source-id: 5ebaf81c22824494c47c1ae78982d9c0e5cba59f
2021-03-27 11:55:50 -07:00
c782949e17 Make the fuser raise NotImplementedError when unknown device is hit (#54709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54709

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D27338815

Pulled By: ezyang

fbshipit-source-id: 5cbaf3c19b9b85cc3f171f3b405d0cd98f832e65
2021-03-27 11:55:47 -07:00
6445c9a1cb Avoid testing device in cdist when called in a "Math" context (#54708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54708

cdist advertises itself as Math but actually it error checks that the inputs
are CPU/CUDA in cdist_impl, which is invoked from a composite context in some
situations. I worked around this by ensuring that when cdist_impl was called in
this way, we DON'T do the device checks, but the entire function is a little
janky and I filed an issue about it at #54096

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27338813

Pulled By: ezyang

fbshipit-source-id: 1202b02c58584a33dc32a5270e59e5f0af6398c5
2021-03-27 11:55:44 -07:00
c9e0aab2bf Make convolution_overrideable default implementation raise NotImplementedError (#54707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54707

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27338807

Pulled By: ezyang

fbshipit-source-id: b18c39a09d130626709408c08034c260c34e2bc5
2021-03-27 11:55:42 -07:00
ed560cf2c6 Disambiguate where 'Doesn't run' error message comes from (#54706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54706

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: wenleix, anjali411

Differential Revision: D27338812

Pulled By: ezyang

fbshipit-source-id: 76321e49f2a8140595c89775afbecd5717e31c2e
2021-03-27 11:55:39 -07:00
b5ab348253 Fix missing format string qualifier (#54705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54705

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27338808

Pulled By: ezyang

fbshipit-source-id: b21c931c2306e525bc444766bc203bb303868dbf
2021-03-27 11:55:36 -07:00
d9a7c758e1 Rename linalg.det test so that it generates a valid method name (#54704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54704

See https://github.com/pytorch/pytorch/issues/54607

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27338809

Pulled By: ezyang

fbshipit-source-id: 52a246a8b8743b8a887403c02df6271ba6db3617
2021-03-27 11:55:33 -07:00
05fa570bbc Add empty_generic, which allocates an empty tensor in a device-generic way (#54703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54703

The trick is that this function takes in the allocator and dispatch key
explicitly; so you still need to know where to find the appropriate
allocator.  The plan is to use this for meta tensors, but you probably
could also use this for empty_cuda as well.  It also takes in arguments
post optional resolution, which can save a few instructions if you want
to call this function directly (no uses yet).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27338814

Pulled By: ezyang

fbshipit-source-id: 131c97922d245e9a2de547527123b464bddb2f99
2021-03-27 11:55:31 -07:00
90e70ace9b Fix some more native_functions.yaml mistakes (#54597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54597

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27328667

Pulled By: ezyang

fbshipit-source-id: 79ddfda28e05d4cbcbed37a969f2577ea7c292fb
2021-03-27 11:55:28 -07:00
e70f3d1189 Nasty little hack to preserve NotImplementedError raised in interpreter (#54627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54627

This is the simplest little fix to get interpreter to preserve
NotImplementedError, so that the test suite doesn't start choking
on meta tensors not working in interpreter.  It is sound and correct
but doesn't work for other c10::Error subclasses with special handling.
A more proper fix is requested at
https://github.com/pytorch/pytorch/issues/54612

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: wenleix, ngimel

Differential Revision: D27328666

Pulled By: ezyang

fbshipit-source-id: 483bef062de5a907d20e2d9e25eafe2d5197cf8d
2021-03-27 11:53:06 -07:00
e5634f5f25 More types for torch (#54037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54037

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27054755

fbshipit-source-id: f21985e201b35bdb83269595cdcf5e1e64837e52
2021-03-27 08:57:15 -07:00
d59fb7a2f6 Add complex autograd support for torch.unfold (#52999)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52999

Reviewed By: H-Huang

Differential Revision: D26735206

Pulled By: iramazanli

fbshipit-source-id: ee134461e97079722a79f89737a7f0d2b620c2c8
2021-03-27 08:21:28 -07:00
6eaf96961d [codemod] fix tautological imports
Test Plan: waitforsandcastle

Reviewed By: koronthaly

Differential Revision: D27310963

fbshipit-source-id: 9ca0a6468e00d481b1583ab98578dc70f80bb3bf
2021-03-27 01:15:57 -07:00
65781f94ad Enable faulthandler for distributed tests. (#54531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54531

Enabling faulthandler will intercept signals like SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGILL and dump the entire Python traceback before the
process goes down.

This can help us in debugging flaky tests where a process crashes and we need
to debug what happened.
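
For reference, the stdlib mechanism being enabled is just (a minimal sketch):

```
import faulthandler
import sys

# dump the Python traceback of every thread when a fatal signal arrives
faulthandler.enable(file=sys.stderr, all_threads=True)
```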
ghstack-source-id: 125045894

Test Plan:
1) Tested locally to see traceback is produced.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27271048

fbshipit-source-id: ca12125a9da6cdfc7bac5619ad1c7e116666014b
2021-03-27 00:43:58 -07:00
1d5cc6c53d Move requires_grad_/backward out of VariableTypeManual. (#54543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54543

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D27321819

Pulled By: ailzhang

fbshipit-source-id: 991c83e134d109e270c872b4b79026dcb732d77a
2021-03-26 23:16:32 -07:00
d63dd07f06 Add JIT support for cmath unary ops (#54089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54089

**This PR adds:**
1. support for the following [cmath](https://docs.python.org/3/library/cmath.html) functions:
     - Power and logarithmic functions (`cmath.{exp, log, log10, sqrt}`)
     - Trigonometric functions (`cmath.{sin, cos, tan, asin, acos, atan}`)
     - Hyperbolic functions (`cmath.{sinh, cosh, tanh, asinh, acosh, atanh}`)
     - `cmath.phase()`
2. `abs()`
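
A small usage sketch of the new support (assuming complex scalars are accepted as script inputs in this release):

```
import cmath
import torch

@torch.jit.script
def phase_of(z: complex) -> float:
    # cmath unary ops now compile in TorchScript
    return cmath.phase(z)

print(phase_of(1j))  # ~1.5707963 (pi / 2)
```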

**Future work:**
1. support
    - `cmath.{polar, rect}`
    - classification functions (`cmath.{isfinite, isnan, isinf, isclose}`)
    - constants (`cmath.{pi, e, inf, nan, infj, nanj}`)

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27339149

Pulled By: anjali411

fbshipit-source-id: fe1a019c95adbc9f27f7948eb28c0c3b93d8c026
2021-03-26 22:55:34 -07:00
8eb896ce99 Improve error message while setting error twice. (#54464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54464

In the case where we accidentally set an error twice on a Future, we get a
cryptic error like this:

```
Exception in thread pool task: !completed() INTERNAL ASSERT FAILED at "aten/src/ATen/core/ivalue_inl.h":534, please report a bug to PyTorch.
```

This PR, updates the error message to include some additional information about
what the previous error was.
ghstack-source-id: 125039478

Test Plan:
1) unit test
2) waitforbuildbot

Reviewed By: swolchok

Differential Revision: D27249758

fbshipit-source-id: 517cf3837fb7b7821312e101e8813844c188f372
2021-03-26 21:55:19 -07:00
f612d4eb58 Add 'remote_parameters' and 'get_module_rref' to RemoteModule docs. (#54645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54645

Had to replace RRef[..] with just RRef in the return signature since
Sphinx seemed to completely mess up rendering RRef[..]
ghstack-source-id: 125024783

Test Plan: View locally.

Reviewed By: SciPioneer

Differential Revision: D27314609

fbshipit-source-id: 2dd9901e79f31578ac7733f79dbeb376f686ed75
2021-03-26 21:41:28 -07:00
316804e373 [test_c10d] Add wait in nccl high priority stream test (#54714)
Summary:
Add wait in test_pass_nccl_options_high_priority_stream
after the all reduce operation.
Without wait, the allreduce operation might be still running and the
comparison of result might not be valid.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54714

Reviewed By: ezyang

Differential Revision: D27379544

fbshipit-source-id: 6393d25f8f3d5635c5d34c9b3aac8b801315b48e
2021-03-26 20:47:00 -07:00
e4d19798f3 [nnc][tests] Convert a bunch of FileCheck to checkIR
Summary:
I added a helper to convert a Stmt to string and FileCheck it, so
started using it in a bunch of places.  I replaced about half the current uses,
got tired, started to write a Perl script to automate it, realized that was
hard, and decided to give up for a bit.  But this cleans up some of the tests a
bit, so seems easy to review and worth landing.

Test Plan: test_tensorexpr --gtest_filter=LoopNest.*

Reviewed By: navahgar

Differential Revision: D27375866

fbshipit-source-id: 15894b9089dec5cf25f340fe17e6e54546a64257
2021-03-26 20:27:50 -07:00
24f589df44 [nnc] Disabled test case for failure in implementing conv1d (#54756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54756

We have multiple bugs here, one relating to index flattening and the
other to computeAt.
ghstack-source-id: 125054729

Test Plan: yikes

Reviewed By: ZolotukhinM

Differential Revision: D27354082

fbshipit-source-id: 8b15bac28e3eba4629881ae0f3bd143636f65ad7
2021-03-26 20:27:48 -07:00
e542e67253 [nnc] Test case for computeAt with reduction (#54755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54755

As title.  A step on the way to using computeAt to optimize
convolution.
ghstack-source-id: 125054730

Test Plan: new test

Reviewed By: ZolotukhinM

Differential Revision: D27353663

fbshipit-source-id: 930e09d96d1f74169bf148cd30fc195c6759a3e9
2021-03-26 20:25:18 -07:00
71201340c6 Remove 13 hacky wrapper not required (#54793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54793

ghstack-source-id: 125033229

Test Plan:
buck build //caffe2/aten/...
BUILD_TENSOREXPR_BENCHMARK=ON BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install

Generated `build/aten/src/ATen/NativeFunctions.h` is same

Reviewed By: smessmer

Differential Revision: D27369943

fbshipit-source-id: 5171bad44290a4ecf62a8f4deab17252c5bd0852
2021-03-26 20:08:10 -07:00
2620bce42a [ROCM] load only hipfft separately past rocm4.1 (#54349)
Summary:
This PR is a follow up to https://github.com/pytorch/pytorch/pull/53408.

It only loads hipfft if the version is rocm 4.1 or after and stops loading rocfft. This was done to resolve some issues observed in our internal ci due to conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54349

Reviewed By: ezyang

Differential Revision: D27374252

Pulled By: ngimel

fbshipit-source-id: 724e80df5011ea8fabd81739e18ae8a13d3a7ea0
2021-03-26 19:54:25 -07:00
0e320ddb36 Lazily initialize alias db constant prop (#54640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54640

If we are running constant propagation on a graph that doesn't have any operators with constant inputs and any mutable inputs/outputs, we do not need to initialize an alias db. This is going to be used to speed up symbolic shape analysis.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27340863

Pulled By: eellison

fbshipit-source-id: 087b2a33b42c58fa5dae405d652b056d0f1d72e7
2021-03-26 19:44:29 -07:00
ba1f640928 Optimize memory usage in logsumexp_out (#51239)
Summary:
Partly fixes https://github.com/pytorch/pytorch/issues/31837.

### Update: This is ready for review.

Currently, `torch.logsumexp(input, out=result)` internally creates 2 intermediate tensors with the same shape as the `input` tensor. This causes unnecessary OOM problems when the tensor size is large.

These tensors come from the following:
1. `self - maxes` creates a new tensor with the shape of `self`
2. `at::exp` creates another tensor with the shape of `self`

To get rid of this problem, we can use `(self - maxes).exp_()`, which performs the exp operation in-place. This reduces the memory requirement from `~3 x input.shape` to `~2 x input.shape` (`self - maxes` is still there).

I think we can't get rid of having a single intermediate tensor with the shape of `input`, because `self - maxes` has to keep `self` intact. The only alternative would be a `torch.Tensor.logsumexp_` method that does in-place operations on the tensor itself. However, I didn't see any in-place method example for reduction operations, so it might not be a good fit.
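
A runnable sketch of the in-place formulation described above (hypothetical helper name):

```
import torch

def logsumexp_sketch(x, dim):
    maxes = x.amax(dim=dim, keepdim=True)
    shifted = x - maxes      # the one unavoidable intermediate of x's shape
    shifted.exp_()           # in-place exp: no second intermediate
    return shifted.sum(dim=dim).log() + maxes.squeeze(dim)

x = torch.randn(5, 7)
print(torch.allclose(logsumexp_sketch(x, 1), torch.logsumexp(x, 1)))  # True
```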

This is my first contribution here, please let me know if I'm missing anything!

Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51239

Reviewed By: anjali411

Differential Revision: D27363147

Pulled By: ezyang

fbshipit-source-id: 696fa8764b74386a80b4aa33104f3f9ca57ed712
2021-03-26 19:28:55 -07:00
f22bad752d Move some variable ops out of VariableTypeManual. (#54459)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54459

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D27321820

Pulled By: ailzhang

fbshipit-source-id: e45392d2332f3c4bc31f20a500f58cdcd75f9ddf
2021-03-26 18:42:46 -07:00
394b720e38 Fix raw_deleter() bug with PYTORCH_NO_CUDA_MEMORY_CACHING=1 (#54775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54775

Thanks danpovey for reporting. Fixes https://github.com/pytorch/pytorch/issues/54770

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27363730

Pulled By: ezyang

fbshipit-source-id: 81777aff7d9194b060fb076ef97cf788f2a4f43e
2021-03-26 15:00:47 -07:00
416ba5c48f Merge CUDA Streams and Events (#53902)
Summary:
-----------
- Updates the current_stream and default_stream APIs to take an `optional[device]` argument (see the sketch after this list)
- Adds parsing logic to replace `torch.cuda.Stream` and `torch.cuda.Event` -> `torch.classes.cuda.Stream` and `torch.classes.cuda.Event` for JIT
- Merges StreamContext manager for both Eager and JIT.
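
A minimal eager-mode sketch of the updated APIs:

```
import torch

if torch.cuda.is_available():
    s = torch.cuda.current_stream()  # device argument is optional
    d = torch.cuda.default_stream(torch.device('cuda:0'))
    with torch.cuda.stream(torch.cuda.Stream()):
        torch.ones(3, device='cuda').sum()
    torch.cuda.synchronize()
```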

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53902

Test Plan:
------
Run JIT tests:
python test/test_jit.py -v TestCUDA

Run eager tests:
python test/test_cuda.py -v TestCuda

Reviewed By: SplitInfinity

Differential Revision: D27285996

Pulled By: nikithamalgifb

fbshipit-source-id: 45d9fee9a582b5f4c82330f5f99eb88584804270
2021-03-26 14:19:39 -07:00
593295daac Migrate kernels with TensorOptions to C10 full dispatcher (#54539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54539

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54468

ghstack-source-id: 125018630

# Facebook:
The following 2 files are changed on fb side:
```
// Should be hidden
```

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27273744

fbshipit-source-id: 35c1bff63189477645008caaf0dc794096e3fcc4
2021-03-26 13:55:22 -07:00
ee73c752c6 Delete unnecessary empty file (#54796)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54796

Reviewed By: albanD

Differential Revision: D27370733

Pulled By: iramazanli

fbshipit-source-id: 5f78e9250a545afb91b4bc7b14daa7135a2b6a1b
2021-03-26 13:45:05 -07:00
14a2501786 Update max-version in setup.py to 3.9 (#54690)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54690

Reviewed By: seemethere

Differential Revision: D27330462

Pulled By: malfet

fbshipit-source-id: db332acf5aa5bff67af2bef777935f2387bc963c
2021-03-26 12:45:03 -07:00
3ed6e0ce6c Remove ops from the complex_list for which the method_tests have been ported (#54754)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54754

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27354326

Pulled By: anjali411

fbshipit-source-id: 745cbc24b885f7d9263fa8796279200518e56edb
2021-03-26 12:09:28 -07:00
3db2333d09 [JIT] Make NoneType annotation_str emit NoneType instead of None (#54642)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54642

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27314174

Pulled By: jamesr66a

fbshipit-source-id: 153e9aa4ab781fa1d49d9d55a2e487bf7b04f0d7
2021-03-26 11:32:20 -07:00
1e9ad6e5cd [JIT] Fix TupleType.annotation_str to conform to typing module syntax for empty tuple type (#54641)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54641

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D27314173

Pulled By: jamesr66a

fbshipit-source-id: 13c6e6b571672adc443429f59f3b30aae356c03d
2021-03-26 11:30:17 -07:00
df70e2fde5 Refactor get analytical jacobian (#54049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54049

The goal of this is to factor out the core logic of getting the analytical jacobian which is effectively doing `f(grad_out) = grad_out^T J = grad_input`. This allows us to test a lot of logic that was not possible before because now we can replace f with whatever we want in order to simulate potential issues that gradcheck is designed to catch.

Edit: I realize a lot of the things this PR was originally aiming to allow are actually possible with hooks, hence the tests have already been added in an earlier PR in the stack. But this is still slightly useful for reducing code duplication when adding the new fast gradcheck code (more details below)

After this change, `get_analytical_jacobian` is only responsible for gathering a list of rows that are later combined into a single Jacobian tensor. This means we don't have to perform any checks for correctness of the dtypes/size at this step

We factor out that logic into a separate function, `combine_jacobian_rows`, which handles the list of rows -> single Tensor step for each jacobian, and the error checking it entails. (This allows this code to be shared between the fast/slow versions.)
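
A self-contained sketch of the row-gathering step described above (hypothetical function name; the real logic lives in gradcheck):

```
import torch

def analytical_jacobian_rows(fn, inp):
    # each row of J is the vjp grad_out^T J computed with a one-hot grad_out
    out = fn(inp)
    rows = []
    for i in range(out.numel()):
        grad_out = torch.zeros(out.numel())
        grad_out[i] = 1.0
        (row,) = torch.autograd.grad(
            out, inp, grad_out.view_as(out), retain_graph=True)
        rows.append(row.reshape(-1))
    return torch.stack(rows)

inp = torch.randn(3, requires_grad=True)
print(analytical_jacobian_rows(torch.sin, inp))  # diag(cos(inp))
```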

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27307240

Pulled By: soulitzer

fbshipit-source-id: 65bb58cda000ed6f3114e5b525ac3cae8da5b878
2021-03-26 11:19:19 -07:00
0435059ddf docs: fix docstring signature in all_reduce_multigpu (#54665)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54665

Reviewed By: ezyang

Differential Revision: D27340481

Pulled By: rohan-varma

fbshipit-source-id: d53c36b41dd26c7a791d3674a5b4b67daaadae13
2021-03-26 11:08:32 -07:00
db3a9d7f8a Fix __torch_function__ tests. (#54492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54492

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27292567

Pulled By: ezyang

fbshipit-source-id: dc29daea967c6d8aaf63bdbcb4aff0bb13d7a5f7
2021-03-26 10:59:15 -07:00
13b1ca9466 Rename DefaultBackend to CompositeExplicitAutograd (#54470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54470

```
git grep -l 'DefaultBackend' | xargs sed -i 's/DefaultBackend/CompositeExplicitAutograd/g'
```

Plus a quick fixup in native/README.md

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27253240

Pulled By: ezyang

fbshipit-source-id: 964df951ea8b52fa72937f3cc66aeaf49a702e6f
2021-03-26 10:53:30 -07:00
70dd2a2bdd Add myself on all native_functions.yaml code reviews (#54595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54595

Seeing a lot of misuse of DefaultBackend, want to try to
nip some of these in code review.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27301721

Pulled By: ezyang

fbshipit-source-id: 1a39426cb6cac5c7f322df6f8a69ccb463f1b258
2021-03-26 10:51:40 -07:00
5c6208abba remove docker dir (#54729)
Summary:
I don't think the docker/ folder is used anymore. Creating this draft to verify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54729

Reviewed By: ezyang

Differential Revision: D27364811

Pulled By: walterddr

fbshipit-source-id: 3e4a9d061b0e5f00015a805dd8b4474105467572
2021-03-26 10:47:13 -07:00
4399aadcc7 add sndfile yum package to centos dockerfile (#54687)
Summary:
Fixes error when running torch test suite inside a centos CI image.  As described by https://pypi.org/project/SoundFile/0.10.3.post1/, `On Linux, you need to install libsndfile using your distribution’s package manager`.  This was missing from the centos CI image.

```
python test_spectral_ops.py -v
...
Traceback (most recent call last):
  File "test_spectral_ops.py", line 25, in <module>
    import librosa
  File "/opt/conda/lib/python3.6/site-packages/librosa/__init__.py", line 211, in <module>
    from . import core
  File "/opt/conda/lib/python3.6/site-packages/librosa/core/__init__.py", line 6, in <module>
    from .audio import *  # pylint: disable=wildcard-import
  File "/opt/conda/lib/python3.6/site-packages/librosa/core/audio.py", line 8, in <module>
    import soundfile as sf
  File "/opt/conda/lib/python3.6/site-packages/soundfile.py", line 142, in <module>
    raise OSError('sndfile library not found')
OSError: sndfile library not found
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54687

Reviewed By: ezyang

Differential Revision: D27332975

Pulled By: walterddr

fbshipit-source-id: 9c6b37545e9f2536c83e606912859439847c884a
2021-03-26 10:35:24 -07:00
20d8fe83cd [TSAN] Suppress data races in caffe2/c10/util/Logging.cpp
Summary:
This suppresses some data races reported by TSAN. See the associated
task(s) below for context, including sample stack traces caused by these races
and reproduction instructions.

This diff is automatically generated. Therefore, the way it makes suppressions
may not be as beautiful as if written by hand. *However, we don't have the
resources to manually adjust these diffs, nor do we have the capacity to
actually fix the bugs*; we just want to get the existing bugs
out of the way so we can enable TSAN across the fleet. If you are a reviewer
please do one of the following:

1. Accept the diff as is, and you may follow up with more changes (or fix the
   bugs) later.
2. Fix the data races in a different diff and land it within a reasonable amount
   of time (e.g. a week), and comment about it here.
3. Comment to suggest us a different code location(s) to suppress these data
   races.

Test Plan: Unit tests were automatically run as part of https://www.internalfb.com/intern/sandcastle/job/22517998509525934/

Reviewed By: ezyang

Differential Revision: D26094360

fbshipit-source-id: 06c285570bcf7a1491d8f17d1885d065ef0bc537
2021-03-26 10:11:23 -07:00
2be1b486ce Drop Python 2 support in common_device_type.py (#54691)
Summary:
Hey!

Just stumbled across these Python 2 fragments while reading the source code and thought they could be removed, since Python 2 support has already been dropped.

mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54691

Reviewed By: mruberry

Differential Revision: D27344439

Pulled By: ailzhang

fbshipit-source-id: 926303bfff9afa6dabd2efb5e98f9d0d9ef83dc7
2021-03-26 10:04:52 -07:00
f6634be4c2 Fix OpInfo failing without scipy (#54735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54735

One of the tests didn't wrap a scipy call with TEST_SCIPY. Also, the wrapper function seems unnecessary and requires lambdas to be created.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27351349

Pulled By: heitorschueroff

fbshipit-source-id: 029e273785b11e01d6be7b816469654de6583deb
2021-03-26 08:01:24 -07:00
645119eaef Lowering NLLLoss/CrossEntropyLoss to ATen code (#53789)
Summary:
* Lowering NLLLoss/CrossEntropyLoss to ATen dispatch
* This allows the MLC device to override these ops
* Reduce code duplication between the Python and C++ APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53789

Reviewed By: ailzhang

Differential Revision: D27345793

Pulled By: albanD

fbshipit-source-id: 99c0d617ed5e7ee8f27f7a495a25ab4158d9aad6
2021-03-26 07:31:08 -07:00
d4045e9aa1 initial commit to refactor all s3 access codes to s3_stats_parser (#54681)
Summary:
First step toward moving all S3-related operations into the S3 parser utils.
In the end we provide APIs from s3_stats_parser:
1. downloading and uploading data as reports
2. filtering by job name

and handle all compression and formatting internally.

TODO
- [ ] Refactor out upload into s3_stats_parser
- [ ] Remove all S3/BOTO related checkers and try/catch blocks outside of s3_stats_parser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54681

Test Plan:
1. Running tools/test/* covers the refactoring logic (test_test_history.py and test_stats.py as entrypoints, both using the 2 new APIs in s3_stats_parser after the refactoring).
2. print_test_stats.py's main argparse entrypoint is covered by CI step Report Test Result step.
3. run `python test/run_test.py --export-past-test-times` before and after this PR should result in the same file content in .pytorch-test-times

Reviewed By: ailzhang

Differential Revision: D27346742

Pulled By: walterddr

fbshipit-source-id: fb40162e631e007fed9d5821fe4f190bda2cb52e
2021-03-26 06:49:15 -07:00
1126d51de9 Remove useless contiguous calls from torch.matmul (#54616)
Summary:
This reduces the memory usage of matmul significantly for expanded batch size.

This reduces the peak memory usage of
```
a = torch.rand(1, 1024, 1024, device="cuda")
b = torch.rand(1024, 1024, 1, device="cuda")

out = torch.matmul(a, b)
```
From 4GB to 16MB which is not too bad.

It also fixes the same problem when `b` is not batched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616

Reviewed By: ailzhang

Differential Revision: D27327056

Pulled By: albanD

fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54
2021-03-26 06:34:24 -07:00
5e62da2efd [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27356622

fbshipit-source-id: f03ad23a2847b3cbaf61e16055393cbbfbc215ae
2021-03-26 04:18:11 -07:00
b7b481bd07 [PyTorch] Enable template build at aten operator level (#53801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53801

## Summary

Enable a partial explicit ATen-level source list for the lite interpreter. More ATen-level sources will be added.

1. Use `gen_selected_mobile_ops_header.py` to generate `selected_mobile_ops.h`. Currently, it only includes the selected operators; all dtypes are included.
2. Add a custom target that includes only `selected_mobile_ops.h`, and add it as a `torch_cpu` dependency when `BUILD_LITE_INTERPRETER` is enabled.

As a note, the current input yaml file is slightly different from the one used internally. Will align these two yamls as a next step.

**Android**
x86:
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`

libpytorch_jni_lite.so -- 3.4 MB

armeabi-v7a
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh armeabi-v7a`
libpytorch_jni_lite.so -- 2.5 MB

**iOS:**
```
(base) chenlai@chenlai-mp install % du -sh *
 15M	include
 57M	lib
2.8M	share
```

```
(base) chenlai@chenlai-mp lib % ls -lh
total 117296
-rw-r--r--  1 chenlai  staff   3.2M Mar 15 22:03 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   913K Mar 15 22:03 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Mar 15 22:03 libclog.a
-rw-r--r--  1 chenlai  staff    42K Mar 15 22:03 libcpuinfo.a
-rw-r--r--  1 chenlai  staff   1.5M Mar 15 22:03 libeigen_blas.a
-rw-r--r--  1 chenlai  staff    44K Mar 15 22:03 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Mar 15 22:03 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Mar 15 22:03 libtorch.a
-rw-r--r--  1 chenlai  staff    51M Mar 15 22:03 libtorch_cpu.a
```

### **Master (Baseline):**

**Android**
x86:
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`

libpytorch_jni_lite.so -- 3.8 MB

armeabi-v7a
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh armeabi-v7a`
libpytorch_jni_lite.so -- 2.8 MB

**iOS:**
```
(base) chenlai@chenlai-mp install % du -sh *
 15M	include
 58M	lib
2.8M	share
```

```
(base) chenlai@chenlai-mp lib % ls -lh
total 119600
-rw-r--r--  1 chenlai  staff   3.2M Mar  4 23:16 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   910K Mar  4 23:16 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Mar  4 23:16 libclog.a
-rw-r--r--  1 chenlai  staff    42K Mar  4 23:16 libcpuinfo.a
-rw-r--r--  1 chenlai  staff   1.5M Mar  4 23:16 libeigen_blas.a
-rw-r--r--  1 chenlai  staff    44K Mar  4 23:16 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Mar  4 23:16 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Mar  4 23:16 libtorch.a
-rw-r--r--  1 chenlai  staff    52M Mar  4 23:16 libtorch_cpu.a
```

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Differential Revision: D27074814

Pulled By: cccclai

fbshipit-source-id: 762b5ad5b87b6a262444392fd089249c4837ba18
2021-03-25 23:57:48 -07:00
0a18211989 ns for fx: add weight matching for linear fp16 emulation (#54257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54257

Makes the NS weight extraction fuction work correctly with
fp16 emulation patterns for linear.  We navigate to the
weight correctly, and cast it to `torch.float16` before returning.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27159370

fbshipit-source-id: 95f555298e3153e4783c64b3d8c83b9d3fdffa12
2021-03-25 22:35:38 -07:00
182d8c375c ns for fx: add partial support for subgraphs with base_op_node (#54254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54254

In fp16 emulation, we now have patterns such as

```
... -> dequantize -> linear -> relu -> to(torch.float16) -> ...
```

This PR adds support for
* specifying a subgraph's "base_op_node", which is the node with the op
which should be matched to related nodes. In the example above,
"base_op_node" would be the linear node, and it would be the second
node in the matched pattern.
* matching these fusion patterns and properly setting "base_op_node"
based on pattern and index
* using "base_op_node" instead of "start_node" throughout the NS
codebase wherever the intent is to match subgraphs or create names
for subgraphs.

At the end of this PR, matching unshadowed activations with an example
fp16 emulation pattern works e2e.

I'm saving the following work for future PRs (soon), mostly to keep
PR size manageable:
* adding weight matching (will require some changes to function which
extracts weights)
* adding shadowed activation matching (will require some changes to
shadow copying)
* adding input logging for these patterns (will likely require some changes as well)

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_linear_fp16
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27158199

fbshipit-source-id: 49fc445395452fda62e3c7a243544190f9af691c
2021-03-25 22:35:36 -07:00
454832e5fa ns for fx: create subgraph type (#54253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54253

Creates an `NSSubgraph` type for representing a subgraph instance,
and modifies the NS code to use it. This will enable us to add
more information to the subgraph instance definition without
having to change all the callsites.

Test Plan:
```
mypy torch/quantization
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27158198

fbshipit-source-id: 548785dd90144e2da256c23af990620c778e7cfe
2021-03-25 22:35:34 -07:00
9e8e744efe ns for fx: move shadow lstm test to new API (#53828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53828

Moves LSTM shadow activations test to new API. In order
to enable this, adds support for passing two args instead
of one arg when copying a subgraph from A to B.

Since this was the last test of the old API, deletes
the old test case.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_shadow_activations_lstm_dynamic
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26982733

fbshipit-source-id: 03f580688dd37f3ccd688d9f444e9e79cfa84734
2021-03-25 22:35:31 -07:00
cfe7364809 ns for fx: move shadow activations linear test to new API (#53819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53819

Moves the linear tests for shadow activations to new API.
In order to do so, adds logic for fp32 to fp32 dtype cast,
which is an identity.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_shadow_activations_linear
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26982734

fbshipit-source-id: b6203228abf3cdf74ab0638468a6df77658aa662
2021-03-25 22:35:29 -07:00
3dc8ba27a5 ns for fx: move shadow activations conv test to new API (#53818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53818

Moves testing of conv for shadow activations to new NS API

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_shadow_activations_conv
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26982732

fbshipit-source-id: 9e8709a76363fbcdf84413e5d4a6c8a0889cb97b
2021-03-25 22:35:27 -07:00
52a8075f16 ns for fx: add support for lstm activation matching (#53779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53779

Moves the test case for LSTM activation matching to new NS APIs.

This requires adding the ability to log non-Tensor types.
Since we need Loggers to be scriptable and TorchScript does
not support `Union`, we collect statistics in a separate collector
if we have an RNN.  Note: this can scale to a small N of
return types, but not to a large N.  If the N becomes large in
the future, we will solve it then.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967110

fbshipit-source-id: afe60b44fdec28a328813b4f342cf4fe04820baa
2021-03-25 22:33:41 -07:00
c656a5befa [FX] Normalize Python operators to torch. ops when called with Tensors (#54236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54236

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D27149411

Pulled By: jamesr66a

fbshipit-source-id: fe9c468f7c84c254dbb1b70163d08b343725861a
2021-03-25 22:27:49 -07:00
b81e10a291 fx quant: fix bug with fusion patterns and disabling quantization (#54654)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54654

Fixes a bug where disabling quantizaton on potential fusion patterns
would lead to errors in the `convert` function.  For example:
1. have a model with add-relu
2. disable quantization for the part of the model containing add-relu
3. run prepare and convert, the convert step would fail because
intermediate nodes were missing from `env`.

The fix is to add handling for this edge case.  If quantization is
disabled, we manually copy the nodes for multi-node fusion patterns.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_fusion_pattern_unquantized
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D27318454

fbshipit-source-id: 27c1fd1cb7c9711a8e8d338200971c428dae8f98
2021-03-25 22:21:41 -07:00
a28c7db9f9 [FX] Garbage collect values in Interpreter (#54726)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54726

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27341449

Pulled By: jamesr66a

fbshipit-source-id: 9dc5f9675ed197dee4a31c8b0e6276248378f1ea
2021-03-25 20:35:32 -07:00
fd58ececab Pin autocanceling GHA repo to specific commit (#54738)
Summary:
Without pinning, if malicious code got committed and the tag moved forward, we would be at risk. This does mean that we will have to manually update the SHA if there are desirable upgrades to the repository.

We are pinning it to this commit: a81b3c4d59

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54738

Reviewed By: samestep

Differential Revision: D27346792

Pulled By: janeyx99

fbshipit-source-id: 5641a78567c3cd61dce35dfa2fd4918f255a7681
2021-03-25 16:09:42 -07:00
9db4802184 [fuser] Support bfloat16 (#54571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571

Supports bfloat16 via a similar method to half: upconvert inputs to
fp32, do math, then downconvert outputs to bf16.
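
The up/down-convert pattern, sketched at the Python level (the fuser emits the equivalent CUDA code):

```
import torch

def fused_bf16(x):
    x32 = x.float()            # bf16 -> fp32
    y32 = torch.sin(x32) * 2   # math happens in fp32
    return y32.bfloat16()      # fp32 -> bf16

x = torch.randn(4, dtype=torch.bfloat16)
print(fused_bf16(x).dtype)  # torch.bfloat16
```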

Resource strings are mostly derived from cuda-11 headers.

Fixes #53918, for the legacy fuser at least.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27328987

Pulled By: bertmaher

fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc
2021-03-25 15:59:15 -07:00
6b7652e26c [DDP logging] Prefer use of c10::Join (#54649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54649

Some operator<< code manually implemented string join in C++, turns
out there is a c10 util for this. Use the util instead of rolling our own.
ghstack-source-id: 124840043

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27316705

fbshipit-source-id: 5118097f84be2f38a503d8f81faa38c8d95ec17a
2021-03-25 15:54:48 -07:00
dfc7fa03e5 lu_backward: more numerically stable and with complex support. (#53994)
Summary:
As per title.

Numerical stability is increased by replacing inverses with solutions of triangular linear systems.
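
The stability trick in isolation (a minimal sketch): instead of forming an explicit inverse, solve the triangular system directly.

```
import torch

U = torch.triu(torch.rand(4, 4)) + 4 * torch.eye(4)  # well-conditioned upper-triangular
b = torch.rand(4, 1)

x_inv = U.inverse() @ b                                      # explicit inverse: less stable
x_solve = torch.triangular_solve(b, U, upper=True).solution  # direct solve: preferred
print(torch.allclose(x_inv, x_solve, atol=1e-5))
```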

Unblocks computing `torch.det` for FULL-rank inputs of complex dtypes via the LU decomposition once https://github.com/pytorch/pytorch/pull/48125/files is merged:
```
LU, pivots = input.lu()
P, L, U = torch.lu_unpack(LU, pivots)
det_input = P.det() * torch.prod(U.diagonal(0, -1, -2), dim=-1)  # P is not differentiable, so we are fine even if it is complex.
```

Unfortunately, since `lu_backward` is implemented as `autograd.Function`, we cannot support both autograd and scripting at the moment.
The solution would be to move all the lu-related methods to ATen, see https://github.com/pytorch/pytorch/issues/53364.

Resolves https://github.com/pytorch/pytorch/issues/52891
TODOs:
* extend lu_backward for tall/wide matrices of full rank.
* move lu-related functionality to ATen and make it differentiable.
* handle rank-deficient inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53994

Reviewed By: pbelevich

Differential Revision: D27188529

Pulled By: anjali411

fbshipit-source-id: 8e053b240413dbf074904dce01cd564583d1f064
2021-03-25 13:33:58 -07:00
3bb0f1f343 Automated submodule update: tensorpipe (#54686)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 5d15ff7a64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54686

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27328262

fbshipit-source-id: 81e1ede0607da4d8f676145cfb6729ac5544c77d
2021-03-25 13:16:55 -07:00
68bdeef2ce [CMake] Simplify CPU architecture detection logic (#54637)
Summary:
CMAKE_SYSTEM_PROCESSOR set to x86_64 (on Linux) or AMD64 (5ec224496b) (on Windows) indicates the build is running on the x86_64 architecture, while `CMAKE_SYSTEM_PROCESSOR` set to aarch64 or arm64 means we are running on an ARMv8+ architecture.
Delete the `i[3-6]86` pattern, as 32-bit builds are no longer supported

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54637

Reviewed By: ezyang

Differential Revision: D27311897

Pulled By: malfet

fbshipit-source-id: 26989fc9b54a96d70c768ab03ca4528506ee7808
2021-03-25 12:32:18 -07:00
911b8b1bfc [package] rename PackageExporter.external to PackageExporter.extern_modules (#54601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54601

This makes it consistent with PackageImporter and the on-disk format.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D27296915

Pulled By: suo

fbshipit-source-id: a9bc615b1952b6cc4dcba31d4a33932b1fa1a2aa
2021-03-25 11:50:07 -07:00
9c60fc9cd9 Fix broken javadoc URL in README (#54434)
Summary:
The link in the README was broken

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54434

Reviewed By: ailzhang

Differential Revision: D27328733

Pulled By: nairbv

fbshipit-source-id: 12ebb6f66983f9348a90b9738fbd9f3f2660c2d1
2021-03-25 11:28:23 -07:00
a7c7fc96ff Add doc warnings for default SELU gain (#54057)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24991 and provides the alternative solution suggested in https://github.com/pytorch/pytorch/issues/53694. Also related to https://github.com/pytorch/pytorch/issues/54055

Attempt to make people aware of the difference between paper and implementation of SELU gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54057

Reviewed By: ailzhang

Differential Revision: D27292060

Pulled By: jbschlosser

fbshipit-source-id: e0e303595e6a7d05d11dfb68735e1839f55987a2
2021-03-25 11:21:02 -07:00
f1edaabc35 Simplify creation of unary structured kernels. (#54592)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54592

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27293362

Pulled By: bdhirsh

fbshipit-source-id: 805f4e321645bc1ad7b8811f4b6daf96775eac9f
2021-03-25 11:08:33 -07:00
71b9f2dd76 Add GHA to cancel redundant GHA workflows except on master (#54689)
Summary:
Relands https://github.com/pytorch/pytorch/issues/54685 with the fix to filter out master

Tested with samestep in other repository.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54689

Reviewed By: walterddr

Differential Revision: D27330804

Pulled By: janeyx99

fbshipit-source-id: 06d8199af6173eedca2e7db4a1fd7b9a143d29d2
2021-03-25 10:37:41 -07:00
53596cdb73 Remove hacky wrapper for about 100 kernels (#54367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54367

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 124804544

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27210057

fbshipit-source-id: 368dc77843468cfc44535488a040dbc2cb67208d
2021-03-25 10:00:16 -07:00
d12118c0aa Handle stride > 1 with im2col in CUDA thnn conv2d (#54080)
Summary:
The fallback thnn 2d convolution uses `im2col` to get patches and `gemm` to implement convolution.
It has a shortcut to use `gemm` directly for kernel size 1, but this only works for stride == 1 and padding == 0.
This PR adds checks for stride == 1 and padding == 0 when determining whether `im2col` can be skipped.
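
For illustration, the kernel-size-1 shortcut is valid because a 1x1 convolution with stride 1 and no padding is just a channel-wise matrix multiply (a minimal sketch):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(5, 3, 1, 1)

ref = F.conv2d(x, w)  # kernel 1x1, stride 1, padding 0
gemm = torch.einsum('oc,bchw->bohw', w.flatten(1), x)
print(torch.allclose(ref, gemm, atol=1e-5))
```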

Fixes https://github.com/pytorch/pytorch/issues/54036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54080

Reviewed By: ejguan

Differential Revision: D27170482

Pulled By: zou3519

fbshipit-source-id: 055d6502239d34945934de409d78144d8a5c56f4
2021-03-25 09:53:49 -07:00
0b0a5dd35f Revert D27327999: [pytorch][PR] Cancel redundant GHA workflows
Test Plan: revert-hammer

Differential Revision:
D27327999 (f251bb40c1)

Original commit changeset: c5793a7660d2

fbshipit-source-id: 1f65b5341527de00d497780565a5cfd27da5239d
2021-03-25 09:25:31 -07:00
f251bb40c1 Cancel redundant GHA workflows (#54685)
Summary:
This PR adds a lightweight workflow which runs when any of our GitHub Actions lint or test workflows start (currently just the three listed in the YAML in this PR's diff), and cancels redundant ones (e.g. if a PR author pushes several commits in rapid succession). Currently this isn't particularly impactful, but it would become more so if/when we add heavier workflows that run on PRs.

Initially we tried using [`technote-space/auto-cancel-redundant-workflow`](https://github.com/technote-space/auto-cancel-redundant-workflow) instead of [`potiuk/cancel-workflow-runs`](https://github.com/potiuk/cancel-workflow-runs), but for some reason the former doesn't seem to work even when triggered by `workflow_run` with the `TARGET_RUN_ID` input set appropriately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54685

Test Plan: janeyx99 and I tested this in a separate GitHub repo, and confirmed that it successfully cancels redundant `push`-triggered workflows on the source repo and `pull_request`-triggered workflows from forks.

Reviewed By: janeyx99

Differential Revision: D27327999

Pulled By: samestep

fbshipit-source-id: c5793a7660d21361381e0f033d314f2d603f70ec
2021-03-25 09:13:39 -07:00
4bf90558e0 [Gradient Compression] Add logging for gradient compression stats. (#54647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54647

Regularly log stats showing effect of gradient compression when using the PowerSGD DDP communication hook.

Test Plan:
buck run mode/dev-nosan scripts/wayi/torch:power_sgd

Play with the layer sizes of the input model (you can just use linear layers for convenience), and check the log that shows compression stats. For convenience, you can change `logging.info` to `print` locally.

You can create some test diffs on top of this diff, to show that the compression stats are correct in different cases.

Run with power_sgd script:
{F537381542}

Diff with example using a simple linear model: D27299934
sample output:
{F538486535}

Reviewed By: SciPioneer

Differential Revision: D27240254

fbshipit-source-id: 9e142b2f7957cc874804f799b7bb3bffdf824858
2021-03-25 07:44:17 -07:00
267fc27d39 Ensure torch.futures.wait_all exits early on error. (#53953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53953

torch.futures.wait_all would wait for all specified futures to
complete before it returned. As a result, if there was an error it would still
wait for a long time (e.g. long-running RPCs) before it returned an error to the
user.

This PR ensures `wait_all` returns an error as soon as any future runs into an
error and doesn't wait for all futures to complete.

I removed the logic in `_invoke_rpc_python_udf` which raised an error in the unwrap
function, because ideally the error should be set on the Future and not be
raised to the user only when `wait()` is called. As an example, in the case of
`wait_all`, the user never calls `wait()` on the future that errored out but a
future down the chain and we should propagate these errors via `setError`
instead.
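
A minimal sketch of the intended behavior (the error is produced via a failing `then` callback, so no RPC setup is needed):

```
import torch

def fail(fut):
    raise RuntimeError("worker failed")

f = torch.futures.Future()
bad = f.then(fail)  # bad completes with an error once f is set
f.set_result(1)
try:
    torch.futures.wait_all([f, bad])  # surfaces the error instead of hanging
except Exception as e:
    print(e)
```
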
ghstack-source-id: 124721216

Test Plan:
1) Unit test added.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D27032362

fbshipit-source-id: c719e2277c27ff3d45f1511d5dc6f1f71a03e3a8
2021-03-25 07:39:14 -07:00
93bbbeccf7 Make SharedCache thread-safe (#53750)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53731

Make SharedCache thread-safe by using explicit locks instead of relying on the atomicity of certain Python operations.
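
A minimal sketch of the locking pattern, not the actual SharedCache code (names illustrative):

```
import threading

class Cache(dict):
    def __init__(self):
        super().__init__()
        self.lock = threading.Lock()

    def get_and_evict(self, key):
        # read-and-remove happens under one explicit lock, rather than
        # relying on individual dict operations being atomic
        with self.lock:
            return self.pop(key, None)
```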

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53750

Reviewed By: malfet

Differential Revision: D27304793

Pulled By: albanD

fbshipit-source-id: 7c62babe4357bed57df3056fbda6801fb6168846
2021-03-25 06:35:03 -07:00
9029d0d7d8 Introduce a fluent API to construct tensors from external data. (#54530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54530

This diff introduces the following changes and improvements:

- Introduces a new fluent API to construct tensors from external data as an alternative to `from_blob` overloads. See below for an example.
- Leverages several small-buffer optimizations which result in a 50% reduction in tensor construction times.
- Exposes a new (lightweight) way to construct tensors by passing a naked `context` and `context_deleter` pair as an alternative to the existing `deleter` parameter.
- Updates the existing `from_blob` overloads to internally use the fluent API.

```
// Example 1
at::Tensor tensor = at::for_blob(data, sizes)
  .strides(strides)
  .context(context, [](void *ctx) { delete static_cast<Ctx*>(ctx); })
  .options(...)
  .target_device(...)
  .make_tensor();

// Example 2
at::Tensor tensor = at::for_blob(data, sizes).make_tensor();

// Example 3
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter(...)
  .make_tensor();
```

Test Plan:
Below are the folly Benchmark results for the following two equivalent operations:

```
// The fluent API
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter([buffer](void*) mutable { buffer.reset(); })
  .options(dtype(c10::ScalarType::Float))
  .make_tensor();

// The original `from_blob` overload
at::Tensor tensor = at::from_blob(
  data,
  sizes,
  [buffer](void*) mutable { buffer.reset(); },
  dtype(c10::ScalarType::Float));
```

```
============================================================================
scripts/balioglu/from_blob_exp/main.cpp         relative  time/iter  iters/s
============================================================================
fluent                                                     298.34ns    3.35M
from_blob                                         55.19%   540.51ns    1.85M
============================================================================
```

Various similar experiments show an approximately 50% reduction in tensor construction times.

Reviewed By: ezyang

Differential Revision: D27269344

fbshipit-source-id: e6bd0b78384bf89fd24f22254008180329000363
2021-03-25 06:24:50 -07:00
6cdabb2e40 Update .gitignore to ignore NFS handle files (#54618)
Summary:
Ignore NFS handle files starting with .nfs*.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54618

Reviewed By: malfet

Differential Revision: D27304405

Pulled By: heitorschueroff

fbshipit-source-id: 9abeed796fec0a4ff416eacea450f3f8e2813b32
2021-03-25 04:49:36 -07:00
55dfb4a575 Update CODEOWNERS for distributed training (#54661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54661

My username was somehow deleted, and I couldn't receive review requests.
ghstack-source-id: 124853153

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27320286

fbshipit-source-id: c38ea3adb2e8197f949a806127d20982299a2851
2021-03-25 00:04:13 -07:00
c0bcd5a58f Remove NestedTensor from DefaultBackend alias (#54559)
Summary:
Kernels such as "add" are registered to DefaultBackend. At a minimum, NestedTensor is not compatible with structured kernels due to missing fields such as size, which can cause difficult-to-catch bugs when a NestedTensor is passed into a function without a NestedTensor-specific kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54559

Reviewed By: ezyang

Differential Revision: D27283591

Pulled By: cpuhrsch

fbshipit-source-id: fad7c03ca3b2190f2f90039dd2872184e9bc5049
2021-03-24 23:43:13 -07:00
2662e34e92 Add PyTorchDeploy predictor model type (#54120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54120

Construct InterpreterManager inside PyTorchDeployModel
- Add ReadAdapterInterface to deploy::Package

Implement PyTorchDeployModel::makePrediction for FeatureStore examples
- Basic test of loading and executing a 'simple' model

Test Plan: ran unit tests locally and CI

Differential Revision: D26961744

fbshipit-source-id: fce72bc83b9005500d9b7ce3fab2ed466f73d6ed
2021-03-24 23:01:06 -07:00
947ab84fd2 enable_and_enhance_bf16_threshold (#54384)
Summary:
enable_and_enhance_bf16_threshold

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54384

Reviewed By: ngimel

Differential Revision: D27286323

Pulled By: mruberry

fbshipit-source-id: 517fa94764d8202bbcbf94011d2d48f716fbd01b
2021-03-24 22:46:20 -07:00
9f336bdf10 Fixes new tf32 failures in test_nn.py (#52871)
Summary:
Also modify the `tf32_on_and_off` decorator to make it support functions without a `device` argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871

Reviewed By: ngimel

Differential Revision: D27286674

Pulled By: mruberry

fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
2021-03-24 21:53:33 -07:00
64d31e3f45 Add double tensor type to DivFakeFp16 Op (#54636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54636

Test Plan: The model will be rerun after the diff lands...

Reviewed By: hx89

Differential Revision: D27310244

fbshipit-source-id: 88575237596a59996da14a49a8459f8b3d0ee66a
2021-03-24 21:40:29 -07:00
fe2c1268b7 More name refactoring of memory planning codes to make it more readable (#54272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54272

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D27233881

fbshipit-source-id: f257f16ac0684df055961e539f17d002cb8f1bfe
2021-03-24 19:52:35 -07:00
1ceb90405b [TensorExpr] Add plumbing for conv2d fusion. (#54439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54439

For now the only way to represent conv2d in TE is via an external call,
and since the aten library doesn't have an out variant for conv2d, the
external call has to perform an extra copy. Because of that, fusing
conv2d currently regresses performance and hence is disabled. However, in the near
future we should have two alternative ways to enable it:
1) represent conv2d natively in TE (without an external call)
2) add an out variant for conv2d

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27237045

Pulled By: ZolotukhinM

fbshipit-source-id: f5545ff711b75f9f37bc056316d1999a70043b4c
2021-03-24 18:49:07 -07:00
6f8328ef44 [special] Add special.entr (#53500)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

TODO:

* [x] Verify docs rendering (https://11397990-65600975-gh.circle-artifacts.com/0/docs/special.html)
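
A minimal usage sketch of the new op:

```
import torch

x = torch.tensor([0.0, 0.5, 1.0])
# elementwise -x * log(x), with entr(0) defined as 0
print(torch.special.entr(x))
```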

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53500

Reviewed By: ngimel

Differential Revision: D27287096

Pulled By: mruberry

fbshipit-source-id: 6b3dfd53e811a0f023ee444a0b56176f825d39e9
2021-03-24 18:44:42 -07:00
347ab5d8b8 Update Kineto submodule (#54621)
Summary:
Update Kineto submodule to the latest rev.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54621

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D27303589

Pulled By: ilia-cher

fbshipit-source-id: 7cea96f779981acd36d10290a537601e2f361720
2021-03-24 17:31:58 -07:00
1442a92741 Ensure local_used_maps_tmp is distinct from local_used_maps_[i] (#54474)
Summary:
Followup/hotfix for https://github.com/pytorch/pytorch/pull/53160. rohan-varma and zhaojuanmao were seeing https://github.com/pytorch/pytorch/pull/53160/files#diff-9273e5ff7b40f30d6a4444d1c7be9fe9a5c2068070c68af4e7b0ac2d4cff0923R582 fire in some internal workloads, indicating `local_used_maps_tmp` wasn't actually being created as a distinct temporary, in other words, `local_used_maps_[i]` was already pinned for some reason. This seems like a bug with the CPU allocator: [`local_used_maps_` should not have been pinned on construction](9be4c75fa0/torch/lib/c10d/reducer.cpp (L180-L183)). We should [investigate that separately](https://github.com/pytorch/pytorch/pull/53160/files#r599188373).

In the meantime, the present PR should ensure `local_used_maps_tmp` is always distinct from `local_used_maps_[i]` (and therefore prevents the race condition described in https://github.com/pytorch/pytorch/pull/51360) even if `local_used_maps_[i]` is already pinned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54474

Reviewed By: zhaojuanmao

Differential Revision: D27268039

Pulled By: rohan-varma

fbshipit-source-id: ab9af3dd845098bde788cb28a9217caea246ddfa
2021-03-24 16:58:31 -07:00
ac33432606 Fixed out= variants of non-symmetric eigendecomposition and QR decomposition (#54056)
Summary:
This PR modifies the behavior of _out variants of `torch.eig`, `torch.qr`, `torch.linalg.qr`  to match the description here https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".

Tested with OpInfo's `supports_out=True`.

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54056

Reviewed By: heitorschueroff

Differential Revision: D27230275

Pulled By: mruberry

fbshipit-source-id: 3fe1ce6c0e2c20bdfd6742305a20f3cf3632a4d6
2021-03-24 16:52:19 -07:00
7605ce4ed8 [PyTorch] Enable test_lite_interpreter_runtime running in android (#54579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54579

## Summary

1. Eliminate a few more tests when BUILD_LITE_INTERPRETER is on, such that test_lite_interpreter_runtime can build and run on device.
2. Remove `#include <torch/torch.h>`, because it's not needed.

## Test plan

Set `BUILD_TEST=ON` in `build_android.sh`, then run
` BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`

Push the binary to the Android device:
```
 adb push ./build_android_x86/bin/test_lite_interpreter_runtime /data/local/tmp
```

Reorganize the folder in `/data/local/tmp` so the test binary and model file are laid out as follows:
```
/data/local/tmp/test_bin/test_lite_interpreter_runtime
/data/local/tmp/test/cpp/lite_interpreter_runtime/sequence.ptl
```
such that the model file is in the correct path and can be found by the test_lite_interpreter_runtime.

![image](https://user-images.githubusercontent.com/16430979/112276332-d89d1900-8c3d-11eb-91de-7bf10d1e418d.png)

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D27300720

Pulled By: cccclai

fbshipit-source-id: d9526c7d3db8c0d3e76c5a4d604c6877c78afdf9
2021-03-24 14:45:27 -07:00
673ed4623e Gradcheck small fixes (#53916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53916

This PR fixes some bugs that are made more clear by the previous refactor.
- make sure gradcheck returns false when it's supposed to fail and raise_exception=False.
- make sure that when test_batched_grad fails, it returns false when raise_exception=False

Removing checkIfNumericalAnalyticAreClose made sense here to me because underneath it's really doing `torch.allclose`, and using that directly instead of adding another opaque function to call seemed to make the code clearer.

TODO:
- ~add a test to see if when torch.allclose fails, we indeed return false.~
- ~uncomment test from previous PR.~

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27201692

Pulled By: soulitzer

fbshipit-source-id: 8b8dc37c59edb7eebc2e8db6f8839ce98a81d78b
2021-03-24 14:35:40 -07:00
796be045bb Refactor gradcheck (#53857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53857

This PR basically just factors a lot of the logic out of the main gradcheck function into individual functions. It aims to avoid any behavior change (but we may not have enough tests to actually verify this). Refactorings that lead to any behavior change are done in the next PR in this stack.

The rationale for this change is 1) to make the main gradcheck function cleaner to read, and 2) to allow us to reuse the same pieces when we add the fast gradcheck.

Maybe this PR is also a good place to add some tests for gradcheck, i.e., make sure gradcheck fails when it should fail, so as to make sure that we are indeed not changing any logic. This will also help us make sure our fast_gradcheck does all the necessary checks:
So far existing tests are:
- test_gradcheck_fail_when_no_differentiable_outputs_and_num_grad_not_zero` (test_autograd)
- test_gradcheck_single_input (test_autograd)
- test_gradcheck_sparse_input (test_autograd)
- test_gradcheck_nondeterministic (test_autograd)
- test_gradcheck (test_overrides)

Full coverage would potentially require adding the following missing tests (for each test for both raise_exception=True/False) - Methodology for getting the list below is that for every type of error message we spit out, we make sure we can hit it:
- complex:
  - when numerical != analytical when tested with imag grad_out
- check_inputs
  - ~when inputs are not dense, but check_sparse_nnz is false~
  - ~when none of the inputs require grad~
  - ~(warning) when inputs are not double precision~
  - ~when layout is not mkldnn(aka has strides) and input has a dimension with stride 0.~
- check_no_differentiable_outputs:
  - ~when none of the outputs are differentiable, but numerical gradient is not zero~
- check_outputs:
  - ~when sparse outputs (always raise)~
  - ~when mkldnn outputs (always raise)~
- test_batched_grad
  - ~when encounter runtime error while computing batched grad (print big message)~
  - when not allclose (print out big message)
- test_backward_mul_by_grad_output
  - ~when layout of grad_input is not the same as input~
  - ~when grad_input is sparse and has incorrect sparse_dim/dense_dim~
  - ~when backward not multiplied by grad_output (sparse/non-sparse case)~
  - when grad is incorrect type/size
- test_undefined_grad
  - ~when encounter runtime error while running backward~
  - when we complete backward but grad inputs (the output of .grad()) is not none
- check_analytical_jacobian_attributes (for both complex/non complex)
  - when grad input is incorrect dtype/size

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D27201571

Pulled By: soulitzer

fbshipit-source-id: 86670a91e65740d57dd6ada7c6b4512786d15962
2021-03-24 14:34:08 -07:00
d371a9f9c5 Change ScatterGather kernel names on dtype dispatch. (#54498)
Summary:
Changed the `ScatterGather` kernel name used in `dtype` dispatching to a more meaningful name than `"method_name"`.

2a53897114/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp (L146-L148)

2a53897114/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp (L241-L243)

Maybe, a more specific name, based on who is calling (e.g. `gather_cpu_kernel`, `scatter_cpu_kernel`), would be better. Any thoughts?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54498

Reviewed By: malfet

Differential Revision: D27291514

Pulled By: bdhirsh

fbshipit-source-id: 123b77296e685ee34031da661c78e201a10757db
2021-03-24 14:30:02 -07:00
556fc8d418 skip test_symeig if MAGMA not detected (#54526)
Summary:
Add a proper way to skip test_symeig: in case MAGMA is not detected, the test is now skipped properly.
Added the skipCUDAIfNoMagma decorator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54526

Reviewed By: malfet

Differential Revision: D27293640

Pulled By: heitorschueroff

fbshipit-source-id: 245f86540af0e37c8795e80dc003e1ca4c08cd5b
2021-03-24 13:55:36 -07:00
145bc5cd51 Rename Math to CompositeImplicitAutograd (#54466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54466

I had to very carefully audit all the use sites since there are a lot
of other uses of the string Math; I did most of the conversion by
grepping for all occurrences of Math and then doing a search
replace.

I also updated documentation for clarity.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27253239

Pulled By: ezyang

fbshipit-source-id: afb485d07ff39575742a4f0e1e205179b60bc953
2021-03-24 13:49:24 -07:00
87989a6cf9 [caffe2] support serializing float data as bfloat16 (#53735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53735

Add an option to BlobSerializationOptions to request that float data be
serialized as bfloat16.  This reduces the serialized data size at the expense
of some loss in precision.
ghstack-source-id: 124317910

Test Plan: Included a new unit test.

Reviewed By: mraway

Differential Revision: D26658205

fbshipit-source-id: 74521ed161059066355a3f208488ed01a344dbb5
2021-03-24 13:27:22 -07:00
b032316c41 Improve nn.Sequential documentation (#53380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53380

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26849861

Pulled By: ansley

fbshipit-source-id: 2add8c73ae421332ed1c03340806e25656bafabb
2021-03-24 13:02:43 -07:00
2b07bcf9eb [operator benchmarks] Added more interpolation test cases (#54584)
Summary:
Description:
- Added uint8 nearest test case
- Added 3d vectorization test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54584

Reviewed By: malfet

Differential Revision: D27291303

Pulled By: fmassa

fbshipit-source-id: 236ee5af351c8dc34ec3cdb7dda662c77feb8cf0
2021-03-24 11:46:27 -07:00
12a61a172e Fix missing class in cpp tensor documentation (#54488)
Summary:
The given example in the documentation does not compile due to the missing `torch::`. It is correct in the tutorial about [writing a custom extension](https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-mixed-c-cuda-extension)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54488

Reviewed By: bdhirsh

Differential Revision: D27267000

Pulled By: glaringlee

fbshipit-source-id: 86a46d656c1a4fa4098287a6a43a38d1ef80171e
2021-03-24 11:10:19 -07:00
f9ca0d87a7 Teach Python TS frontend to parse complex literals (#52881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52881

**This PR adds:**
1. logic to parse complex constants (complex literals of the form `bj`; see the sketch after this list)
2. logic to parse complex lists
3. support for complex constructors: `complex(tensor/int/float/bool, tensor/int/float/bool)`
4. Limited operator support
     - `add`, `sub`, `mul`, `torch.tensor`, `torch.as_tensor`
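
A heavily hedged sketch of what the frontend can now parse, staying within the limited op coverage listed above (assumes complex locals are type-inferred by the scripted frontend at this commit):

```
import torch

@torch.jit.script
def fn():
    z = 2 + 3j              # complex literal of the form `bj`
    return torch.tensor(z)  # torch.tensor on complex is in the supported set

print(fn())  # tensor(2.+3.j)
```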

**Follow-up work:**
1. Add complex support for unary and other registered ops.
2. Support the complex constructor with a string as input (this is supported in Python eager mode).
3. Test all emitXYZ for all XYZ in `ir_emitter.cpp` (currently only emitConst and emitValueToTensor are tested), e.g. test loops etc.
4. ONNX doesn't support complex tensors, so we should error out with a clear and descriptive error message.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27245059

Pulled By: anjali411

fbshipit-source-id: af043b5159ae99a9cc8691b5a8401503fa8d6f05
2021-03-24 08:12:17 -07:00
2f5db68797 Make nightly checkout work with generated testing py (#54477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54477

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27263065

Pulled By: ezyang

fbshipit-source-id: 7fa653fb334ff91c9100cf5adcabab6b30533a89
2021-03-24 07:40:26 -07:00
67e4618037 Add arg layer_norm_eps to transformer layers (#54494)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44367
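
A hedged usage sketch (parameter values illustrative; the default is assumed to remain 1e-5):

```
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=8, nhead=2, layer_norm_eps=1e-6)
out = layer(torch.randn(4, 2, 8))  # (seq, batch, embed)
print(out.shape)  # torch.Size([4, 2, 8])
```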

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54494

Reviewed By: bdhirsh

Differential Revision: D27264321

Pulled By: jbschlosser

fbshipit-source-id: ed264d253b2df2d6f1d80898464f4f26022482ec
2021-03-24 06:59:25 -07:00
732815af7a Automated submodule update: tensorpipe (#54582)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 52774a0165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54582

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27289673

fbshipit-source-id: c1284b1642c518ce4568e32ddebee5034d8a542e
2021-03-24 06:29:27 -07:00
05c8ddfe05 [AutoAccept][Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D27288729

fbshipit-source-id: 84c9f4cffdabd3c1967e3279ec123867d8eded00
2021-03-24 04:18:23 -07:00
c371542efc testing: dont skip test_ops suite for operators testing against scipy (#54186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54186

Reviewed By: ngimel

Differential Revision: D27287024

Pulled By: mruberry

fbshipit-source-id: 3e19b94b138fb56a7cb2c1c13af3587a5b6d937a
2021-03-24 00:25:24 -07:00
bac566bf61 torch.square : OpInfo and minor fixes (#52551)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Add `out` variant to be consistent with Unary Ops.
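
A minimal sketch of the added `out` variant:

```
import torch

x = torch.tensor([1., -2., 3.])
out = torch.empty(3)
torch.square(x, out=out)
print(out)  # tensor([1., 4., 9.])
```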

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52551

Reviewed By: heitorschueroff

Differential Revision: D27233482

Pulled By: mruberry

fbshipit-source-id: fef6f241849a12c46028bd1aad8f5ecc1dc65ea1
2021-03-24 00:04:42 -07:00
d3f784244e fix comparison of narrow type with wide type in loop condition part2 (#54471)
Summary:
Follow-up PR to https://github.com/pytorch/pytorch/issues/53951.
This PR fixes the remaining Semmle warnings: comparison of a narrow type with a wide type in a loop condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54471

Reviewed By: bdhirsh

Differential Revision: D27262493

Pulled By: malfet

fbshipit-source-id: 05765758da79699936af11de237c3ff3d34373d6
2021-03-23 23:38:38 -07:00
0d81528a47 Definition infrastructure for instruction count ubenchmarks (#53296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53296

Part 1 of the instruction count microbenchmarks. This PR is focused on benchmark definition machinery. (Though you can run `main.py` to see it in action.) A summary of the system is given in the README.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26907092

Pulled By: robieta

fbshipit-source-id: 0f61457b3ce89aa59a06bf1f0e7a74ccdbf17090
2021-03-23 21:59:46 -07:00
a4ca394f8a Revert "Revert D26907093: Add repeats to Timer.collect_callgrind(...)" (#54484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54484

Re-land of https://github.com/pytorch/pytorch/pull/53295. (With fixed unit tests.)

This reverts commit 0dc5abfaa9cac9266791788839d896b14600d123.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27255201

Pulled By: robieta

fbshipit-source-id: 4e9fed7522631d66c5cd7e27ace9b5ffc3a0bbfc
2021-03-23 21:58:17 -07:00
afe339d7dd [static runtime] support DictConstruct (#54438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438

The August 1x model has DictConstruct in the graph (P331168321).
These can be easily removed with a JIT pass, but to easily measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.

Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
    cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```

```
TEST trace
Total num requests                                   100
Num exceptions                                         0
Latency us avg                                    180965
Latency us p25                                     89785
Latency us p50                                    131240
Latency us p75                                    146621
Latency us p90                                    158378
Latency us p95                                    166628
Latency us p99                                   1886680
Latency us p100                                  3803252
Server latency us avg                              91554
Server latency us p25                              51447
Server latency us p50                              86371
Server latency us p75                              95229
Server latency us p90                             102706
Server latency us p95                             116023
Server latency us p99                             557017
Server latency us p100                            716319
Num rankUnits avg                                     28
```

Reviewed By: hlu1

Differential Revision: D27236682

fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
2021-03-23 21:20:03 -07:00
601e79200d [NNC] Implementing LoopFusion (#54461)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54337

This PR adds a new API to NNC to perform loop fusion.

```
static For* fuseLoops(const std::vector<For*>& loops);
```

Loop fusion is done only when all the conditions below are satisfied.
  * All the loops have the same parent.
  * There are no statements between these loops in their parent body.
  * The start bounds are the same for all loops.
  * The stop bounds are the same for all loops.
  * Fusing the loops does not violate or add any dependencies.

This PR also adds an API to check for partial overlaps in `buffer_inference.h` and fixes a bug in `mem_dependency_checker.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54461

Reviewed By: bertmaher

Differential Revision: D27254888

Pulled By: navahgar

fbshipit-source-id: c21b027d738e5022e9cb88f6f72cd9e255bdb15e
2021-03-23 21:20:00 -07:00
5105250e16 [FX] Add docs for shape propagation (#54554)
Summary:
Fixes #54538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54554

Reviewed By: nikithamalgifb

Differential Revision: D27281263

Pulled By: jamesr66a

fbshipit-source-id: 2fd3914f0e24be0b6a18ad7715f3336dcf7949ba
2021-03-23 21:18:11 -07:00
5cd8a77e01 Skip inplace autograd test if inplace variant doesn't exist (#54460)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54413

1. Skip inplace autograd test for an op if its inplace variant does not exist.
2. For ops that don't have an inplace variant, remove redundant `supports_inplace_autograd=False` assignments in their `OpInfo`s.
3. Ops having inplace variants that do not support autograd should not have `supports_inplace_autograd=False` entries removed from their `OpInfo`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54460

Reviewed By: ngimel

Differential Revision: D27255938

Pulled By: mruberry

fbshipit-source-id: f15334b09e68995e9f26adc2ff3e59c292689ee8
2021-03-23 21:10:37 -07:00
789dc6d445 [NCCL] Add more details for checkForNCCLErrors (#54117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54117

https://github.com/pytorch/pytorch/pull/45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with an NCCL error.

However, that PR only used the added function for `C10D_NCCL_CHECK`, which is used to check the return values of NCCL calls. In ProcessGroupNCCL we also have `checkForNCCLErrors`, which checks for errors on NCCL communicators, and in case of errors it would be good to have this logging there too.

Also renames the function s/errorMessage/getNcclErrorDetailStr
ghstack-source-id: 124662592

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27100497

fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715
2021-03-23 20:29:16 -07:00
b93ab10b7a torch.lerp: cuda complex support (#54129)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54048

TODO
* [x] Add test
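
A minimal sketch of the newly supported case (requires a CUDA build):

```
import torch

a = torch.tensor([1 + 1j], device="cuda")
b = torch.tensor([3 + 3j], device="cuda")
print(torch.lerp(a, b, 0.5))  # tensor([2.+2.j], device='cuda:0')
```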

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54129

Reviewed By: bdhirsh

Differential Revision: D27261878

Pulled By: anjali411

fbshipit-source-id: 10937a2eab944c73b5a98ec6278f50a876b8c7dc
2021-03-23 19:58:43 -07:00
5781aec74e Automated submodule update: FBGEMM (#54509)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a2b58dfab5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54509

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D27264145

fbshipit-source-id: 606948e002dcf364bb39aad49ef4f2144bbba7a4
2021-03-23 18:52:30 -07:00
33b95c6bac Add __torch_function__ support for torch.nn.functional.embedding (#54478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54478

Fixes #54292
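
A hedged way to check the override surface (assuming `torch.overrides.get_testing_overrides` reflects this change):

```
import torch
from torch.overrides import get_testing_overrides

# True once F.embedding participates in the __torch_function__ protocol
print(torch.nn.functional.embedding in get_testing_overrides())
```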

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27264179

Pulled By: ezyang

fbshipit-source-id: cd267e2e668fdd8d7f958bf70a0b93e058ec7c23
2021-03-23 17:22:39 -07:00
91d37d7d2f [CI] Install compatible cmath for Win binary builds (#54527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54527

Reviewed By: walterddr

Differential Revision: D27269528

Pulled By: malfet

fbshipit-source-id: 4afdc706598f3a6ad296468dfb77a70433ae7d0f
2021-03-23 17:05:20 -07:00
66a3614b47 Fix typo in .github/workflows/lint.yml (#54551)
Summary:
Fixes a minor typo introduced in https://github.com/pytorch/pytorch/issues/51796.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54551

Test Plan: None, since this only changes a comment.

Reviewed By: seemethere

Differential Revision: D27278347

Pulled By: samestep

fbshipit-source-id: 34a781cce0cb4e93a68821d6006bbf05b0bbe2f0
2021-03-23 16:45:41 -07:00
5754816597 fix SC2126 introduced error (#54545)
Summary:
The SC2126 suggestion from diff CI is wrong; reverting the last commit in https://github.com/pytorch/pytorch/issues/54373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54545

Reviewed By: samestep

Differential Revision: D27276006

Pulled By: walterddr

fbshipit-source-id: 1a9823e9ad6c6509a36896df88d599546826f4e9
2021-03-23 16:39:42 -07:00
4a74b0f2dd Fix logic in TestFX.test_get_torch_func_signature_exhaustive (#54510)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54510

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D27264670

Pulled By: jamesr66a

fbshipit-source-id: 0ef6395dacde3eb2a4b9c7eeff760a1be38b6dfe
2021-03-23 16:23:25 -07:00
7e3cf1ee24 [pytorch] Add native support for segment reduce step1: API definition (#53727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53727

This is the first diff adding native support for segment reduction in PyTorch. It provides functionality similar to torch.scatter or "numpy.ufunc.reduceat".

This diff mainly focuses on the API layer to make sure future improvements will not cause backward-compatibility issues. Once the API is settled, here are the next steps I am planning (a usage sketch follows the list):
- Add support for other major reduction types (e.g. min, sum) for 1D tensor
- Add Cuda support
- Backward support
- Documentation for the op
- Perf optimizations and benchmark util
- Support for multi dimensional tensors (on data and lengths) (not high priority)
- Support for 'indices' (not high priority)
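
A heavily hedged sketch of the intended semantics; the exact public name and signature at this first step are assumptions, and only a 'max'-style reduction over 1D data is in scope here:

```
import torch

data = torch.tensor([1., 2., 3., 4., 5., 6.])
lengths = torch.tensor([2, 3, 1])  # segments [1,2], [3,4,5], [6]
out = torch.segment_reduce(data, "max", lengths=lengths)
print(out)  # tensor([2., 5., 6.])
```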

Test Plan: Added unit test

Reviewed By: ngimel

Differential Revision: D26952075

fbshipit-source-id: 8040ec96def3013e7240cf675d499ee424437560
2021-03-23 16:00:30 -07:00
591084abb8 Deprecate torch.matrix_power in favor of torch.linalg.matrix_power (#53538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53538

* #52608 Added torch.linalg.matrix_power

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27261531

Pulled By: heitorschueroff

fbshipit-source-id: 5a944b390f3cc6896c2aa92ba467319ddc9309e4
2021-03-23 15:11:24 -07:00
f9e7f132fb Added torch.linalg.matrix_power (#52608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52608

**TODO**

- [x] Add OpInfo
- [x] Update documentation
- [x] Add more tests and compare against NumPy
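
A minimal usage sketch:

```
import torch

A = torch.tensor([[2., 0.], [0., 3.]])
print(torch.linalg.matrix_power(A, 3))   # diag(8., 27.)
print(torch.linalg.matrix_power(A, -1))  # matrix inverse, since A is invertible
```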

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27261532

Pulled By: heitorschueroff

fbshipit-source-id: c1e4ab297da3683f6d5751be8790602f9dc37b6b
2021-03-23 15:10:06 -07:00
345b26ca08 [android][utils] Support ChannelsLast in TensorImageUtils (#48990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48990

Introducing TensorImageUtils methods to prepare tensors in the ChannelsLast MemoryFormat.
ChannelsLast is preferred for performance.

To avoid introducing API-breaking changes, an additional MemoryFormat parameter is added, which is CONTIGUOUS by default.

Testing by checking test_app that uses this call
```
gradle -p android installMnetLocalBaseDebug -PABI_FILTERS=arm64-v8a
```

Test Plan: Imported from OSS

Reviewed By: jeffxtang

Differential Revision: D27173940

Pulled By: IvanKobzarev

fbshipit-source-id: 27788082d2c8b190323eadcf18de25d2c3b5e1f1
2021-03-23 14:54:36 -07:00
792f5ffb83 Also strip slow_test (#54528)
Summary:
Since `_test1`, `_test2`, `_build`, and `test` are all stripped, `slow_test` should be stripped as well. This way, the _slow_test stats will be considered as part of all stats relating to a particular build job, though currently it doesn't do much because the jobs don't share a common stemmed name: the build has `_gcc7` while the slow_test CI job does not.

This makes me think...do we omit the `gcc7` intentionally? Are there other things I should strip, e.g., `multigpu_test`?

See:
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_slow_test
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1
ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54528

Reviewed By: samestep

Differential Revision: D27270393

Pulled By: janeyx99

fbshipit-source-id: ffb7289cfe4dba52ded67f50a89f3e75e7bad68d
2021-03-23 14:44:21 -07:00
c06d979731 [Static Runtime] Name refactoring to make MemoryPlanning more readable (#54045)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54045

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D27233880

fbshipit-source-id: 43b38901d8cfea0941a1a2934997a08027b57b6d
2021-03-23 14:28:43 -07:00
35186eb983 Update TensorPipe submodule (#54507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54507

Test Plan: CircleCI

Reviewed By: mrshenli

Differential Revision: D27262943

fbshipit-source-id: cffecd01756180325147d4fb85fbe9bc78727884
2021-03-23 14:22:13 -07:00
1b792a7f15 Fix Flake8 (#54540)
Summary:
https://github.com/pytorch/pytorch/issues/54339 broke Flake8. This PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54540

Test Plan:
```
flake8
```

Reviewed By: walterddr

Differential Revision: D27274171

Pulled By: samestep

fbshipit-source-id: 4b440d72b4b5615f45e6fcb25f7a4c0423add272
2021-03-23 13:50:03 -07:00
e5b97777e3 [ROCm] allow PYTORCH_ROCM_ARCH in cpp_extension.py (#54341)
Summary:
Allows extensions to override ROCm gfx arch targets. Reuses the same env var used during the cmake build for consistency.
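
A hedged sketch (the arch list and separator are illustrative):

```
import os

# set before building; the extension build reuses the cmake-time env var
os.environ["PYTORCH_ROCM_ARCH"] = "gfx906;gfx908"
from torch.utils import cpp_extension  # subsequent builds target these archs
```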

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54341

Reviewed By: bdhirsh

Differential Revision: D27244010

Pulled By: heitorschueroff

fbshipit-source-id: 279e1a41ee395a0596aa7f696b6e908cf7f5bb83
2021-03-23 13:06:00 -07:00
446e477d4f [complex] torch.rsub(): complex autograd support (#53702)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53702

Reviewed By: agolynski

Differential Revision: D27142807

Pulled By: anjali411

fbshipit-source-id: 053d0a0f9a478cf04efcb0d84aacf042abae1a4e
2021-03-23 12:46:51 -07:00
21a9a93eb4 gdb special command to print tensors (#54339)
Summary:
This is something which I wrote because it was useful during my debugging sessions, but I think it might be generally useful to other people as well so I took the liberty of proposing an official `pytorch-gdb` extension.

`pytorch-gdb` is a gdb script written in python. Currently, it contains only one command: `torch-tensor-repr`, which prints a human-readable repr of an `at::Tensor` object. Example:
```
Breakpoint 1, at::native::neg (self=...) at [...]/pytorch/aten/src/ATen/native/UnaryOps.cpp:520
520     Tensor neg(const Tensor& self) { return unary_op_impl(self, at::neg_out); }
(gdb) # the default repr of 'self' is not very useful
(gdb) p self
$1 = (const at::Tensor &) 0x7ffff72ed780: {impl_ = {target_ = 0x5555559df6e0}}
(gdb) torch-tensor-repr self
Python-level repr of self:
tensor([1., 2., 3., 4.], dtype=torch.float64)
```

The idea is that by having an official place to put these things, `pytorch-gdb` will slowly grow other useful features and make the pytorch debugging experience nicer and faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54339

Reviewed By: bdhirsh

Differential Revision: D27253674

Pulled By: ezyang

fbshipit-source-id: dba219e126cc2fe66b2d26740f3a8e3b886e56f5
2021-03-23 12:30:18 -07:00
583c4bf7d3 [Pytorch Mobile] optimize_for_mobile: Fuse Add Relu on any function (#54441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54441

Similar to the previous dropout one
ghstack-source-id: 124544176

Test Plan: Printed graphs before and after fusion. verified input outputs stayed the same {P299343882}

Reviewed By: kimishpatel

Differential Revision: D27014352

fbshipit-source-id: d0a9548f8743472bdd7e194efd8e8d5fe53b95b6
2021-03-23 12:11:59 -07:00
acffa604cc disable cu112 test on windows (#54512)
Summary:
Currently the cu112 test on Windows is broken; see
https://app.circleci.com/pipelines/github/pytorch/pytorch/288940/workflows/c6612fe8-8396-4266-88d8-2ad2736c994c/jobs/11744008/steps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54512

Reviewed By: janeyx99

Differential Revision: D27265585

Pulled By: walterddr

fbshipit-source-id: 49e212d6c332a9725e6f2a78faf41198d4a21ac5
2021-03-23 11:48:06 -07:00
f3c00047ce Reset Optimizer counter while deserializing netWithBackwardOptions
Summary: Add the ability to reset the optimizer counter.

Test Plan: will wait for integration tests to run on diff.

Differential Revision: D27248286

fbshipit-source-id: a608df1bd61b64eb317c9ffd9cfdd804c5288f6d
2021-03-23 11:16:11 -07:00
ba9f12d235 Fix minor whitespace typo in tools/test_history.py (#54504)
Summary:
oops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54504

Test Plan:
```
tools/test_history.py --help
```

Reviewed By: walterddr

Differential Revision: D27262271

Pulled By: samestep

fbshipit-source-id: 0f47e9a69d35a605c558f6c86e3e2ca98720ff86
2021-03-23 08:26:44 -07:00
a4a21e7d8d Automated submodule update: FBGEMM (#54486)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 8998e6f1d7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54486

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D27255655

fbshipit-source-id: 5315687d4121c5ff2628ba7f134c1a5134369ed2
2021-03-23 07:10:05 -07:00
2a53897114 [jit][tensorexpr] Added aten::batch_norm into fuser when in inference mode (#54204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54204

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27134348

Pulled By: huiguoo

fbshipit-source-id: 5ea7a6c5bc694fcdfc436dba3fa6eb269420324e
2021-03-23 04:41:52 -07:00
fee470d8ef [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27257117

fbshipit-source-id: 6fdda695987892a74137d3afa720979c8d5c68bb
2021-03-23 04:04:13 -07:00
f2a38a0edd Enabled BFloat16 support for argmax & argmin on both CPU & CUDA (#52582)
Summary:
1. Enabled `BFloat16` support for `argmax` & `argmin` on both CPU & CUDA (see the sketch after this list)
2. Added `OpInfo`s for `argmax` & `argmin`
3. Enabled `test_argminmax_multiple` for `float16`. It can't be enabled for `bfloat16`, as comparison is done with numpy, which doesn't currently support `bfloat16`.
4. Enabled `test_dim_arg_reduction_scalar` for `float16` & `bfloat16`.
5. Enabled `test_reduction_vectorize_along_output` for `bfloat16`.
6. Enabled `test_reduction_vectorize_along_input_corner` for `bfloat16`.
7. Enabled `test_dim_reduction` for both `float16` and `bfloat16`, except that both of them don't support `prod` on CPU.
8. Unskipped `TestCommonCPU.test_variant_consistency_jit` for dtype `bfloat16` for `amax` & `amin`, as they're passing.
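
A minimal sketch of item 1:

```
import torch

x = torch.tensor([1.0, 3.0, 2.0], dtype=torch.bfloat16)
print(torch.argmax(x), torch.argmin(x))  # tensor(1) tensor(0)
```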

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52582

Reviewed By: anjali411

Differential Revision: D27204704

Pulled By: heitorschueroff

fbshipit-source-id: cdad5df494d070f8e1a8fb83939441a91124b4d9
2021-03-23 03:38:11 -07:00
3519625a34 Fix onnx warning message (#54371)
Summary:
Adding a space between "as" and "Dropout".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54371

Reviewed By: radkris-git

Differential Revision: D27244053

Pulled By: heitorschueroff

fbshipit-source-id: 500ea719e239ce89e5ac4b54e5b32a36155e8544
2021-03-23 03:05:52 -07:00
1041fdd069 Grammatically update tech docs (#54370)
Summary:
Small grammatical update to nn.rst

![Screenshot 2021-03-20 at 11 44 29](https://user-images.githubusercontent.com/80534697/111867047-d868f900-8971-11eb-8cc2-0ae7d2c59229.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54370

Reviewed By: radkris-git

Differential Revision: D27243944

Pulled By: heitorschueroff

fbshipit-source-id: 08d8061d9e74ffaf95c8a610107a8632259474ca
2021-03-23 02:59:19 -07:00
8518b0ee55 [PyTorch] Update Bazel build for TensorPipe (#54416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54416

Once D27230990 lands, we'll need this for TensorPipe to be built with Bazel.
ghstack-source-id: 124512701

Test Plan: None for now.

Reviewed By: beauby

Differential Revision: D27231000

fbshipit-source-id: 474cc1b23118703ecb47ed4b8e0c5b000572eae8
2021-03-23 01:34:13 -07:00
c22fc448cd [Gradient Compression] Remove cuda.synchronize in batched powerSGD (#54482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54482

`cuda.synchronize` is unnecessary for `batched_powerSGD_hook`.
ghstack-source-id: 124607761

Test Plan:
f259607860
f259563921

Reviewed By: rohan-varma

Differential Revision: D27254314

fbshipit-source-id: 4744c07a6f0c8939e766ffa935ddbf3c47e85d18
2021-03-23 00:55:53 -07:00
6d0027197c Delete all unnecessary singular Math entries (#54436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54436

An operator entry with no dispatch table implicitly generates a Math
entry, so you don't need to define one yourself.  I also added
some asserts in the codegen to fail on these cases.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235381

Pulled By: ezyang

fbshipit-source-id: f8c905090b863120f4f3656c37e2b7f26e8bb9ef
2021-03-23 00:44:01 -07:00
6e8c4ad7fd s/StructuredNativeFunctions/NativeFunctionsGroup/ (#54427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54427

A StructuredNativeFunctions is no longer guaranteed to actually
be structured (test structured property for that), so we rename
this to a more neutral name.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235380

Pulled By: ezyang

fbshipit-source-id: 2b438d615bf06a47fc9c7bf6eb66fd8b4df31bc8
2021-03-23 00:43:57 -07:00
bf2ca35f35 Rejigger to use NativeFunctionsGroup even without structured: True (#54426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54426

Previously, we only put NativeFunctions in StructuredNativeFunctions
if the out variant advertised that the kernel was structured.  However,
there are a few code generation things that can take advantage of
this trio structure, even if the kernel itself hasn't been ported
to be structured.  So better to always group things when they are
related, and then let clients decide whether or not to use the
structure or throw it away.

While doing this, I had hoped that there weren't any functional/inplace
pairs that didn't also have an out variant.  This turned out to not
be true.  These are probably all oversights and should get fixed at
some point.

Bill of changes:

- The actual operational change happens in
  StructuredNativeFunctions.from_dict; then I need to relax some
  __post_init__ invariants.  To tell if a StructuredNativeFunctions
  is actually structured, there is a new structured property, which
  is queried from a few new locations in code
- Refactor native_functions.py into gen_structured/gen_unstructured
  functions so I can easily call gen_unstructured from two contexts

I intend to s/StructuredNativeFunctions/NativeFunctionsGroup/ but
for ease of review this rename hasn't been done in this PR.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235379

Pulled By: ezyang

fbshipit-source-id: d8a15de9abb75b365348ab94e67b830704e30cf0
2021-03-23 00:43:54 -07:00
c00d66f73c Move compute_native_function_declaration to its own dest module (#54419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54419

I'm planning to break it into some helper functions, so let's put it in its own module first.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235378

Pulled By: ezyang

fbshipit-source-id: c03c5440d2d753859e2c5ec2b2c8b1b82870f03a
2021-03-23 00:43:50 -07:00
349a17f1c0 Replace some tensor.device().is_cpu() calls with direct tensor.is_cpu() (#54397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54397

I was supposed to have done this in https://github.com/pytorch/pytorch/pull/54079
but apparently I forgot to push these changes before landing, so here's
the clean up.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235382

Pulled By: ezyang

fbshipit-source-id: ffcce5abc78251c81c230992bac70b8973906ace
2021-03-23 00:42:05 -07:00
77ccd4f9a3 [5/n][torch/elastic][upstream] Move torchelastic/agent to torch/distributed/elastic/agent (#54343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54343

Move torchelastic/agent to torch/distributed/elastic/agent

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
      buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...

Reviewed By: kiukchung, wilson100hong

Differential Revision: D27173271

fbshipit-source-id: 26761acc3f962af2afffcc3c7a237f3b6d65e531
2021-03-22 23:15:37 -07:00
5870346173 Port index_copy from TH to ATen (#52203)
Summary:
The design of the `TensorIterator` was similar to that in https://github.com/pytorch/pytorch/pull/50578

Resolves https://github.com/pytorch/pytorch/issues/24670
Resolves https://github.com/pytorch/pytorch/issues/24523
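
For reference, the op's semantics (unchanged by the port):

```
import torch

x = torch.zeros(5, 3)
t = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
index = torch.tensor([0, 4])
print(x.index_copy(0, index, t))  # rows 0 and 4 filled from t
```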

Timings:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device):
    print(f"ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    x = torch.rand(*([size] * ndims), device=device)
    index = torch.randint(size, (index_len,), dtype=torch.long, device=device)
    for d in range(ndims):
        shape_t = [size] * d + [index_len] + [size] * (ndims - d - 1)
        t = torch.rand(*shape_t, device=device)
        command = "x.index_copy(d, index, t)"
        if device == cuda:
            command = command + "; torch.cuda.synchronize()"
        ipython.magic(f"timeit {command}")
    print()

run_test(3, 700, 10, cpu)
run_test(3, 700, 100, cpu)
run_test(3, 700, 700, cpu)
run_test(2, 10000, 10000, cpu)

run_test(3, 700, 10, cuda)
run_test(3, 700, 100, cuda)
run_test(3, 700, 700, cuda)
run_test(2, 10000, 10000, cuda)
```

</details>

<details>
<summary>CPU ATen</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cpu
327 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
329 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
378 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 100, device: cpu
348 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
359 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
526 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 700, device: cpu
560 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
552 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
932 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu
163 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
302 ms ± 5.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
</details>

<details>
<summary>CUDA ATen</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cuda
9.63 ms ± 441 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.65 ms ± 230 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.4 ms ± 881 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 700, index_len: 100, device: cuda
10.8 ms ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11 ms ± 417 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
21.2 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 700, index_len: 700, device: cuda
19 ms ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.8 ms ± 493 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
25.8 ms ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda
5.59 ms ± 109 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
10 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

</details>

<details>
<summary>CPU TH</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cpu
333 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
327 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
366 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 100, device: cpu
336 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
345 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
884 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 3, tensor_size: 700, index_len: 700, device: cpu
441 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
514 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.46 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu
141 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.13 s ± 855 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

</details>

<details>
<summary>CUDA TH</summary>

```
ndims: 3, tensor_size: 700, index_len: 10, device: cuda
9.64 ms ± 390 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.68 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 928 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 700, index_len: 100, device: cuda
11.6 ms ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
30.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 700, index_len: 700, device: cuda
27.2 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.6 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
146 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda
6.5 ms ± 3.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
64.7 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

</details>

According to these timings, we see a slight performance improvement across both CPU and GPU.

cc: nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52203

Reviewed By: jbschlosser

Differential Revision: D27066572

Pulled By: mruberry

fbshipit-source-id: 6101e461cf731afa3db042a383b723d3d6bfdc26
2021-03-22 22:36:35 -07:00
52abd3bd7b [Static Runtime] Fix bug in reshape_copy (#54467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54467

`at::native::copy_` requires src/dest to have the same sizes, which isn't true in reshape.

Test Plan: Added new test cases to cover this case.

Reviewed By: ajyu

Differential Revision: D27249617

fbshipit-source-id: 2c95175fa8564b3c648979445ad4314f97818852
2021-03-22 22:20:55 -07:00
c411017a41 Only allow hub.load() from original repo. (#54451)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54451

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27243825

Pulled By: ailzhang

fbshipit-source-id: 2f65a82064d83b71224b4280ddfaabfa8ec9aec3
2021-03-22 20:27:54 -07:00
9be4c75fa0 [JIT] Add Reinplacing to MKLDNN Subgraphs (#53908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53908

This adds reinplacing to MKLDNN subgraphs so that we replace `aten::add` with `aten::add_`. Normally you would have to prove device and dtype, but we know those already, and because we have explicit broadcast nodes for other reasons we don't have to prove that the output shape of add is the same as the inputs'.

I've tested correctness on resnet, and I'm going to do more extensive testing as well. When I benchmarked the "unsafe" version (always inplace) I saw average speedups of ~16% for both single-threaded and multithreaded runs. I don't think the "safe" version will be far behind; when I looked at resnet, for example, every `add` and `relu` was reinplaced.

There's some question of reusing other alias / liveness / inplacing passes in SR. I thought about it; however, I didn't want to add a cross-dependency between very different parts of the code base with a bunch of different assumptions. The logic here also covers a simpler case and does not add much complexity IMO.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D27132969

Pulled By: eellison

fbshipit-source-id: 121a38daaedf01363f6b66a814beaaa72a0ab0dc
2021-03-22 19:21:03 -07:00
81c6e5fb38 use reshape when possible in broadcasting (#53326)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53326

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D26897275

Pulled By: eellison

fbshipit-source-id: 44278633a1e6429db43443ca689b97d5a077a15c
2021-03-22 19:20:59 -07:00
18c04a3f0f Avoid dispatch overhead in call to mkldnn convolution (#52614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52614

This can speed up models by ~5% (~0.5-1% over the baseline, but ~5% after they've been sped up with MKLDNN).

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696693

Pulled By: eellison

fbshipit-source-id: bfed55242524a4c2f1ae5d63e76d6803016d986d
2021-03-22 18:38:39 -07:00
3959d393b8 [PyTorch][JIT] Less shared_ptr use in dictConstruct (#54110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54110

dictConstruct doesn't need to make its caller have a `shared_ptr<DictType>`. It also doesn't need to do extra `shared_ptr` copies into the `key_type` and `value_type` locals.
ghstack-source-id: 124150642

Test Plan: fitsships

Reviewed By: ezyang

Differential Revision: D27101782

fbshipit-source-id: 3c632ad9d8f1bd7bdf37f517a86aca27bd41548a
2021-03-22 18:31:27 -07:00
7e33dc3498 [PyTorch] Avoid extra intrusive_ptr copy in IValue::toIntrusivePtr (#54124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54124

No need to have an extra temporary intrusive_ptr (`p`) just to do an `incref`.
ghstack-source-id: 124150644

Test Plan:
existing tests for correctness; inspect assembly for
c10::IValue::toObject to double-check & see that it's a bit shorter

Reviewed By: smessmer

Differential Revision: D27109183

fbshipit-source-id: 497706190867eeac0fb1d309d0ecc97cf8d65b08
2021-03-22 18:29:50 -07:00
568d43b935 Automated submodule update: FBGEMM (#54447)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ffff7a3118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54447

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D27242112

fbshipit-source-id: 768b1a40652b6c2f0710bd4bb655697daf45f756
2021-03-22 18:21:47 -07:00
decbdf7b0b Get rid of {Cpu,Cuda}{Channel,Context} in tensorpipe_agent. (#54432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54432

Following the merge of the channel hierarchies, here comes the promised
cleanup.

Test Plan: CI

Reviewed By: lw

Differential Revision: D27232442

fbshipit-source-id: 540dc6bc18a9a415b676e06e75530d729daf2d5b
2021-03-22 18:03:23 -07:00
2668149b8c Export torch::jit::toIValue (#54449)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54449

Reviewed By: SplitInfinity

Differential Revision: D27243154

Pulled By: cpuhrsch

fbshipit-source-id: fc21d6ce251b868356ad8ea13ae891fb56e311ce
2021-03-22 17:17:18 -07:00
92770d25cd fix comparison of narrow type with wide type in loop condition (#53951)
Summary:
Fix Semmle warning: comparison of narrow type with wide type in loop condition.

For example, consider the following piece of code:
for (int i = 0; i < array.size(); ++i) {}

The problem is that array.size() returns size_t, which can be a wider type than int depending on the implementation, so there is a chance that i overflows (for a very large array whose size is beyond the range of int) and the loop never terminates. The fix is to give the loop counter a type at least as wide as the container's size type, e.g. for (size_t i = 0; i < array.size(); ++i) {}.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951

Reviewed By: zou3519

Differential Revision: D27181495

Pulled By: malfet

fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
2021-03-22 16:40:35 -07:00
edfc787df4 Migrate kernels with Tensor? to C10 full dispatcher (#54263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54263

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54223

Signatures of the following 8 methods in LegacyTHFunctionsCUDA.h are
manually changed.

```
_thnn_multi_margin_loss_forward
_thnn_multi_margin_loss_backward
_thnn_nll_loss_forward
_thnn_nll_loss_backward
_thnn_nll_loss2d_forward
_thnn_nll_loss2d_backward
_thnn_conv2d_forward
_thnn_conv_depthwise2d_forward
```

ghstack-source-id: 124539990

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27164092

fbshipit-source-id: 59062179ffd958ca253cbf63fdd495799b9a9586
2021-03-22 16:08:10 -07:00
5a27199149 Add device_of overload for optional<Tensor> (#54262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54262

register_dispatch_key.py might generate a `device_of` call over
`optional<Tensor>` if it happens to be the first Tensor-like
argument.

ghstack-source-id: 124535550

Test Plan: Test together with next diff in stack

Reviewed By: ezyang

Differential Revision: D27164093

fbshipit-source-id: 3b0400d5d603338e884218498106f6481e53f194
2021-03-22 16:06:29 -07:00
2130f4ccc4 Use c10::ArrayRef instead of std::vector for the jit::unpickle's tensor_table. (#54428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54428

Using c10::ArrayRef as the parameter type makes the API more flexible and allows the caller to leverage small-buffer optimizations (e.g. c10::SmallVector, std::array) for performance-critical cases.

Test Plan: No behavioral changes. Run the existing unit and integration tests.

Reviewed By: suo

Differential Revision: D27232222

fbshipit-source-id: 7b13bc6bd02257097ca119077028fbccc68cc925
2021-03-22 15:31:47 -07:00
4919fecf23 Delete dead TensorOptions::key_set (#54004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54004

According to
`glean-search find-decls --refs 'c10::TensorOptions::key_set'`
there are no uses of this function

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27047971

Pulled By: ezyang

fbshipit-source-id: 63662dd7ab27753ecb79c45c152c2cad1160dab2
2021-03-22 15:24:34 -07:00
9fef25e579 [Pytorch Mobile] optimize_for_mobile: Remove dropout from any function (#53846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53846

There's already a variant of removeDropout that takes in a graph, so just switch to calling that one. It doesn't check that the module isn't in training mode (because it doesn't have a module), but optimize_for_mobile guarantees the cloned module is in eval mode.
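A sketch of the user-facing behavior this enables (module and method names hypothetical):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = torch.nn.Dropout(0.5)

    def forward(self, x):
        return self.dropout(x)

    @torch.jit.export
    def foo(self, x):
        return self.dropout(x) + 1

scripted = torch.jit.script(M().eval())
# dropout should now be removed from both `forward` and the preserved `foo`
opt = optimize_for_mobile(scripted, preserved_methods=["foo"])
```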
ghstack-source-id: 124544216

Test Plan: called optimize on forward and foo, both contained dropouts, both dropouts removed. Called both functions afterwords to verify they ran and gave the same output. {P308987364}

Reviewed By: kimishpatel

Differential Revision: D26986251

fbshipit-source-id: 085e08cbaa982aa08803a718fee4380af5f86b78
2021-03-22 14:57:02 -07:00
f1e72a7e18 add uncommit change detector (#54373)
Summary:
Warn if uncommitted changes exist in .circleci/config.yml; unlike other generated code, .circleci/config.yml is actually committed to the repo. (This is a follow-up to https://github.com/pytorch/pytorch/issues/54345.)

Two options I am open to:
1. abort regeneration if uncommitted changes are detected
2. print out the backed-up temp filename

Also remove the `-x`, since it is currently very verbose:
```
++ dirname .circleci/regenerate.sh
+ cd .circleci
++ mktemp
+ OLD_FILE=/var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.54GhUh7w
+ cp config.yml /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.54GhUh7w
++ mktemp
+ NEW_FILE=/var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.aV87RTvQ
+ ./generate_config_yml.py
+ cp /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.aV87RTvQ config.yml
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54373

Test Plan:
1.
```
$ echo "418 I'm a teapot" > .circleci/config.yml
$ .circleci/regenerate.sh
$ .circleci/regenerate.sh
```
Result:
```
$ .circleci/regenerate.sh
Uncommitted change detected in .circleci/config.yml
It has been backed up to /var/folders/89/brnr1wt970130lk0m52605mw0000gn/T/tmp.2VOp4BPo
New config generated in .circleci/config.yml
$ .circleci/regenerate.sh  #-- second time there's no uncommitted changes
New config generated in .circleci/config.yml
```

2.
```
$ echo "418 I'm a teapot" > .circleci/config.yml
$ git add .circleci/config.yml
$ .circleci/regenerate.sh
$ .circleci/regenerate.sh
```
Result:
```
$ .circleci/regenerate.sh
Uncommitted change detected in .circleci/config.yml
It has been backed up to /var/folders/89/brnr1wt970130lk0m52605mw0000gn/T/tmp.2VOp4BPo
New config generated in .circleci/config.yml
$ .circleci/regenerate.sh #-- second time there's still uncommitted changes b/c git split staged vs unstaged changes
Uncommitted change detected in .circleci/config.yml
It has been backed up to /var/folders/89/brnr1wt970130lk0m52605mw0000gn/T/tmp.2ruMAynI
New config generated in .circleci/config.yml
```

Reviewed By: samestep

Differential Revision: D27234394

Pulled By: walterddr

fbshipit-source-id: 6364cc1f6f71a43424a63ca6fce9d2ba69437741
2021-03-22 14:51:22 -07:00
0f628d1503 [ROCm][doc] add ROCm section for building from source (#53845)
Summary:
Instructions for compiling PyTorch from source for ROCm were missing now that PyTorch 1.8 announced beta support for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53845

Reviewed By: heitorschueroff

Differential Revision: D27237916

Pulled By: malfet

fbshipit-source-id: c8be92fd76ea8df7e9f6944c0036568189f58808
2021-03-22 14:35:35 -07:00
1e09880b45 Add support for list insertion for mutation removal (#54271)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54271

Test Plan:
python test/test_jit.py TestRemoveMutation.test_lists_insert

Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27180031

fbshipit-source-id: ba4388b6688cf83caf70901934f4adacd22d8ca6
2021-03-22 14:27:24 -07:00
263cd5cf98 Disable all cu92 in scheduled-ci (#54421)
Summary:
Since we no longer support CUDA 9.2, disable the scheduled CI jobs for those configurations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54421

Reviewed By: janeyx99

Differential Revision: D27234293

Pulled By: walterddr

fbshipit-source-id: 923e32c0229ea861bce6ff473501892bd4e5bec1
2021-03-22 13:25:08 -07:00
7bda8b650c [caffe2] Fix caffe2 build with TensorRT support (#54322)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54322

Reviewed By: heitorschueroff

Differential Revision: D27190319

Pulled By: malfet

fbshipit-source-id: 224b19f71572e07ef5092ce397497f99935a45a6
2021-03-22 13:19:08 -07:00
17f9b5ff4a [caffe2] increase deadline of test_dnnlowp_batch_matmul_int_constant_B (#54241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54241

As title

Test Plan:
buck test //caffe2/caffe2/quantization/server:batch_matmul_dnnlowp_op_test -- test_dnnlowp_batch_matmul_int_constant_B
buck test :batch_matmul_dnnlowp_op_test -- test_dnnlowp_batch_matmul_int_constant_B --run-disabled --stress-runs 10 --jobs 18

Reviewed By: dskhudia

Differential Revision: D27150098

fbshipit-source-id: be8bc1e57077a7399ae5fd5e5df14407b618a7f3
2021-03-22 13:13:22 -07:00
8bb07c7e21 [CI]Install older cmath during Windows build (#54431)
Summary:
Based on peterjc123's analysis, `cmath` after 26bbe2ad50 (diff-3fa97ceb95d524432661f01d4b34509c6d261a2f7f45ddcf26f79f55b3eec88a) causes a lot of CUDA code to fail to compile with:
```
error: calling a __host__ function("__copysignf") from a __host__ __device__ function("c10::guts::detail::apply_impl< ::at::native::AUnaryFunctor< ::>  &,     ::std::tuple<float >  &, (unsigned long long)0ull > ") is not allowed
```
Workaround for https://github.com/pytorch/pytorch/issues/54382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54431

Reviewed By: anjali411

Differential Revision: D27234299

Pulled By: malfet

fbshipit-source-id: b3f1fef941341222cc10cb27346fcf4a1d522a0c
2021-03-22 12:20:23 -07:00
6e7a3c1fdd Clang-format distributed.py (#54402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54402

ghstack-source-id: 124497872

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27225942

fbshipit-source-id: 277f466554fbc034fb76de161bf4b3b7c431daf7
2021-03-22 11:39:58 -07:00
a46d56f988 Update tensorpipe submodule. (#54412)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54412

Reviewed By: lw

Differential Revision: D27230317

Pulled By: beauby

fbshipit-source-id: 9e8380584cdd0f5750047005416202a23abe738c
2021-03-22 11:19:03 -07:00
19f77700ec clean up typos in submodule (#54372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54372

Reviewed By: heitorschueroff

Differential Revision: D27233797

Pulled By: walterddr

fbshipit-source-id: f8d321199b6ae8b482e2ac3f10575402551365ef
2021-03-22 11:13:06 -07:00
c2c97cd290 <tensorexpr> Add python bindings for missing loop transformations in LoopNest (#54355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54355

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27202561

Pulled By: huiguoo

fbshipit-source-id: 87cb6730311845157d4cba749017de80fd9aa82e
2021-03-22 10:32:18 -07:00
afb560065c [testing] OpInfo for sgn and sign (#53885)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO:
* [x] Check rendered docs. https://11525594-65600975-gh.circle-artifacts.com/0/docs/generated/torch.sgn.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53885

Reviewed By: ejguan

Differential Revision: D27114318

Pulled By: mruberry

fbshipit-source-id: 678179d87741aacd3b50f03dc460207c5aa29589
2021-03-22 09:39:40 -07:00
b6bbb41fd8 Temporary disable TestNumbaIntegration.test_from_cuda_array_interface* (#54430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54430

see https://github.com/pytorch/pytorch/issues/54429

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27232636

Pulled By: pbelevich

fbshipit-source-id: 15fb69828a23cb6f3c173a7863bd55bf4973f408
2021-03-22 09:17:28 -07:00
ef472d5b31 Back out "[PT QNNPACK] Temporarily disable input pointer caching" (#52917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52917

Original commit changeset: f6ceef606994

Test Plan:
FB:
This was an attempt to fix IG crashes, but we root-caused them to pthreadpool changes, so this is not needed anymore.

Reviewed By: AshkanAliabadi

Differential Revision: D26485737

fbshipit-source-id: 5d689231cccd11d911b571f8486a19d646352698
2021-03-22 09:05:42 -07:00
2355f61f19 Add logging for debugging S223170
Summary: More context in T86752810. Add logging of the tensor lengths' size to see if it fails on an incomplete batch.

Test Plan: manually created failed run: f258719092

Reviewed By: aartibasant

Differential Revision: D27181049

fbshipit-source-id: 341c020a3430c410f9726d92315efb80d36e9452
2021-03-22 08:58:40 -07:00
635595f706 Change sharding in ci (#54228)
Summary:
Step three (landing this should fix https://github.com/pytorch/pytorch/issues/53882)!

Modifying CI to compute job times during build so that the exported job times can be used for sharding future test jobs.
The builds that are exempted from this:
- `bazel` (no python tests so no need)
- `libtorch` (no python stuff so no need)
- `onnx` (the test shards are not calculated the same way)
- `asan` (runs into an error I don't know how to debug; we can debug later: [logs](https://app.circleci.com/pipelines/github/pytorch/pytorch/288019/workflows/57f95f67-1a1b-44a0-9b02-9652b57f2a5f/jobs/11693962))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54228

Test Plan: CI

Reviewed By: samestep

Differential Revision: D27192978

Pulled By: janeyx99

fbshipit-source-id: 3cb20d14f4989e61873043b81dfd6b0f82d17ccd
2021-03-22 08:40:34 -07:00
d226985257 Read out layout from options directly, rather than via backend (#54074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54074

I don't see why this shouldn't work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D27086594

Pulled By: ezyang

fbshipit-source-id: 1d5f1997017ec48c4140f43e44f0d8a3df28ac7f
2021-03-22 08:20:13 -07:00
36ce673f16 Disable the fusion group which is not supported by XPU device. (#54239)
Summary:
The XPU device doesn't support the fusion group. Disable it for XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54239

Reviewed By: zou3519

Differential Revision: D27188735

Pulled By: ezyang

fbshipit-source-id: f28f62148e7aa12e8b08345df7eb0133216ce6a5
2021-03-22 07:43:28 -07:00
4ffafbac40 Make test_cpp_extensions_aot handle lack of pytest more gracefully (#53740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53740

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26956603

Pulled By: ezyang

fbshipit-source-id: 09ff60b29c4bd64044f4c9f0b7e17ffed33c30db
2021-03-22 07:23:41 -07:00
7b939d934e Simplifes OpInfo test matrix to reduce test time (#53255)
Summary:
This PR:

- Updates the structure of the SampleInput class to require the "input" attribute be a tensor
- Limits unary ufuncs to test only the uint8, long, float16, bfloat16, float and cfloat dtypes by default
- Limits variant testing to the float dtype
- Removes test_variant_consistency from test_unary_ufuncs.py since it's now redundant with variant testing in test_ops.py
- Adds backwards supported testing to clarify failures that were coming from variant testing

This should decrease end-to-end test time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53255

Reviewed By: ngimel

Differential Revision: D27043643

Pulled By: mruberry

fbshipit-source-id: 91d6b483ad6e2cd1b9ade939d42082980ae14217
2021-03-22 03:48:27 -07:00
ab8e9188dc add --gpu-max-threads-per-block=256 to hipMAGMA build (#54161)
Summary:
As of ROCm version 4.0.1, the HIP compiler default for max threads per block is 256 but is subject to change in future releases.  To protect against changes, hipMAGMA should be built with the previously-assumed default.  This change is necessary here in PyTorch until upstream magma project utilizes `__launch_bounds__` or some other means of controlling launch bounds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54161

Reviewed By: zou3519

Differential Revision: D27194829

Pulled By: malfet

fbshipit-source-id: 8be2cff3b38786526954b627ff6ab02b510040a1
2021-03-21 22:09:36 -07:00
80a4a50081 Automated submodule update: FBGEMM (#54118)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 88ba128b7c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54118

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: ejguan

Differential Revision: D27105781

fbshipit-source-id: 3f71299dcee11459efa3a14c051afc031a99ecea
2021-03-21 20:12:30 -07:00
fc58b3f146 Skips failing pinv and pinverse test (#54392)
Summary:
This will unblock the CI failing due to https://github.com/pytorch/pytorch/issues/54381.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54392

Reviewed By: ngimel

Differential Revision: D27221925

Pulled By: mruberry

fbshipit-source-id: 5b6e6f21428fd7d97cc75300e3a1aca2a40fbb24
2021-03-21 14:45:27 -07:00
f48a9712b7 Rewrite functional.tensordot to be TorchScript-able (#53672)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53672

Reviewed By: VitalyFedyunin

Differential Revision: D26934392

Pulled By: gmagogsfm

fbshipit-source-id: f842af340e4be723bf90b903793b0221af158ca7
2021-03-20 23:03:30 -07:00
cffa70d36d Merge channel hierarchies. (#54333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54333

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/326

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/312

This is a first step towards cross-device type transfers: eventually,
channels will not connect devices of a given type between two hosts,
but possibly heterogeneous pairs of devices. Hence, the distinction
between CPU-to-CPU and GPU-to-GPU channels will not make much sense
anymore, and we can afford to simplify the Pipe's code quite a bit.

The main change here is that the `channel::Channel` and
`channel::Context` classes are not templated (on the buffer type)
anymore. Instead, a channel's `send`/`recv` methods act on generic
`Buffer`s and the actual unpacking is done in the
`ChannelBoilerplate`. The
`channel::CpuContext`/`channel::CudaContext` (respectively
`channel::CpuChannel`/`channel::CudaChannel`) aliases now simply
resolve to `channel::Context` (respectively `channel::Channel`). A
subsequent diff will get rid of the aliases altogether.

The Pipe is being simplified: all the duplication due to having
separate hierarchies is gone, which gets rid of a lot of boilerplate
template code. Note that previously, two channels with the same name
could potentially coexist, provided one was a CPU channel and the
other a GPU channel. This is not the case anymore, though it should
not matter.
In its current state, the Pipe still needs to pick a channel based on
whether that channel acts on CPU or GPU buffers. This is solved by
introducing the temporary method
`bool channel::Context::supportsDeviceType(DeviceType t)`. When
iterating through available channels to select one for a given tensor,
the Pipe now discards channels that do not support the tensor's
`DeviceType`. This leads to having a single ordered list of channels,
which in practice is two separate lists (one for CPU, one for GPU)
merged together. This will change soon as we initialize only one
channel per `DeviceType`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D26958187

Pulled By: beauby

fbshipit-source-id: 3e3f7921166892d468fa78cfad3199277588021c
2021-03-20 17:59:23 -07:00
8294bff20d [StaticRuntime] Copy version of reshape/flatten (#54353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353

The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not; it depends entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic, and the output is always a new tensor.
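For reference, a minimal Python sketch of why plain `reshape` is non-deterministic in this sense:

```python
import torch

x = torch.arange(6).reshape(2, 3)

v = x.reshape(3, 2)   # contiguous input: returns a view, shares storage
c = x.t().reshape(6)  # non-contiguous input: must materialize a copy

print(v.data_ptr() == x.data_ptr())  # True  (view)
print(c.data_ptr() == x.data_ptr())  # False (copy)
```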

Reviewed By: ajyu, edvgha

Differential Revision: D26358525

fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
2021-03-20 16:55:30 -07:00
08e4312559 Tag distributed team for review for /torch/nn/parallel (#54221)
Summary:
This folder contains the DDP python interface as well as several misc. communication files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54221

Reviewed By: agolynski

Differential Revision: D27149068

Pulled By: rohan-varma

fbshipit-source-id: 0c23ea9a0d1dfc2719a2008e182ea75f2058d7dc
2021-03-20 13:20:00 -07:00
98baad5764 [nnc] Remove cached argv from LLVMCodeGen to fix race condition (#54286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54286

A generated code object was holding not just a function pointer but a
pre-allocated argument buffer.  I assume this was a performance optimization to
avoid allocating a vector on each call?

This cached buffer makes it unsafe to call a generated function from multiple
threads, which is too severe a limitation.  This diff fixes it by locally
allocating a SmallVector to hold the args.

A better fix will be to avoid creating CallArgs, so the function can be called
directly without this packing-and-unpacking nonsense, but that's a slightly
more involved fix, possibly involving changing the kernel codegen, and this bug
needs fixing now.
ghstack-source-id: 124333028

Test Plan: `threads=64 scripts/bwasti/static_runtime/run.sh`

Reviewed By: asuhan

Differential Revision: D27175715

fbshipit-source-id: 44dafe77b95ede69c63ae6d64f39f0aa4877712f
2021-03-19 22:36:58 -07:00
4fa47e5e7d Support non-tensor inputs and outputs for checkpointed functions. (#52422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52422

As mentioned in https://github.com/pytorch/pytorch/issues/52415,
`torch.utils.checkpoint` doesn't support checkpointing for functions which have
non-tensor inputs and outputs.

This PR resolves this issue by ensuring the autograd machinery ignores the
non-tensor inputs and outputs and processes the tensors accordingly.
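A minimal sketch of the newly supported pattern (function and values hypothetical):

```python
import torch
from torch.utils.checkpoint import checkpoint

def fn(x, scale):                      # `scale` is a non-tensor input
    return (x * scale).sin(), "done"   # the second output is a non-tensor

x = torch.randn(4, requires_grad=True)
out, tag = checkpoint(fn, x, 2.0)
out.sum().backward()                   # gradients flow through the tensor only
```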
ghstack-source-id: 124406867

Test Plan:
1) unit test
2) waitforbuildbot

Reviewed By: albanD

Differential Revision: D26507228

fbshipit-source-id: 0a5a1591570814176185362e83ad18dabd9c84b0
2021-03-19 21:29:03 -07:00
9d9986fd10 Support for Half / bfloat16 / index_select and better testing (#53898)
Summary:
Added support for half / bfloat16 / bool for `index_select`, as suggested by ngimel in
https://github.com/pytorch/pytorch/issues/49707#issuecomment-788140578

For the tests to pass, I also added support for `index_add`.

I added `OpInfo` tests for `index_add` and more thorough forward tests for `index_select` to test these changes.

While doing so, I found that the support for scalar types in the derivative of `index_add` was not correct, so I corrected it.

Resolves https://github.com/pytorch/pytorch/issues/49707

It should also resolve similar issues that I encountered when porting `index_copy`, `take` and `put`.
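For illustration, a small sketch of the newly supported dtypes (values hypothetical):

```python
import torch

x = torch.randn(5, 3).to(torch.bfloat16)
idx = torch.tensor([0, 2, 4])

rows = torch.index_select(x, 0, idx)         # now works for half/bfloat16
x.index_add_(0, idx, torch.ones_like(rows))  # index_add gained support too
```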

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53898

Reviewed By: mruberry

Differential Revision: D27193294

Pulled By: ngimel

fbshipit-source-id: 5a0af2c62a0cf24f3cc9c74f230ab4f3712bbb7a
2021-03-19 20:37:48 -07:00
d58c00a5d8 [package] Make exporters write to buffer in fbcode (#54303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54303

**Summary**
Creating temporary files can cause problems in fbcode. This commit
updates the packaging tests so that exporters write to a memory
buffer when tests run in fbcode.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27180839

Pulled By: SplitInfinity

fbshipit-source-id: 75689d59448de2cd1595ef0ecec69e1bbcf9a96f
2021-03-19 19:59:35 -07:00
41b1ea216f Fix torch.linalg.qr example (#54342)
Summary:
Since a.size() is (3, 4, 5), r.size() is (3, 4, 5), but q.size() is (3, 4, 4).
Also, reduce the tolerance from 1e-8 to 1e-5.
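A quick sketch of the shapes in question (reduced-mode QR):

```python
import torch

a = torch.randn(3, 4, 5)
q, r = torch.linalg.qr(a)   # reduced mode by default
print(q.shape)              # torch.Size([3, 4, 4])
print(r.shape)              # torch.Size([3, 4, 5])
```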

Fixes https://github.com/pytorch/pytorch/issues/54320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54342

Reviewed By: walterddr

Differential Revision: D27193947

Pulled By: malfet

fbshipit-source-id: 362a0fdd90550888a4f0c6deaa49b9f72d379842
2021-03-19 19:49:53 -07:00
270d675f86 update distributed doc table for alltoall nccl (#54277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54277

alltoall is already supported in the NCCL backend, so update the doc to reflect it.

Test Plan: Imported from OSS

Reviewed By: divchenko

Differential Revision: D27172904

Pulled By: wanchaol

fbshipit-source-id: 9afa89583d56b247b2017ea2350936053eb30827
2021-03-19 15:35:10 -07:00
27048c1dfa Remove legacy constructor calls from _torch_ folder. (#53889)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53146
Related to https://github.com/pytorch/pytorch/issues/47112

As mentioned in https://github.com/pytorch/pytorch/issues/47112, the plan is to:

1. Verify that all `torch.Tensor()` scenarios are covered by other functions
2. Scrub internal `torch.Tensor()` uses
3. Update the docs and throw `TORCH_WARN_ONCE` if someone uses `torch.Tensor()`

In this PR, I replaced all occurrences of `torch.Tensor` present in the _torch_ folder.
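A sketch of the kind of replacement involved:

```python
import torch

# legacy constructor: ambiguous between shape and data
# torch.Tensor(2, 3)    -> uninitialized 2x3 tensor
# torch.Tensor([2, 3])  -> 1-D tensor holding [2., 3.]

# preferred, unambiguous replacements
t1 = torch.empty(2, 3)     # construct by shape
t2 = torch.tensor([2, 3])  # construct from data
```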

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53889

Reviewed By: walterddr, zou3519

Differential Revision: D27190743

Pulled By: jbschlosser

fbshipit-source-id: 7ecc201d57935b8dbb98ae3718b60d95cb55a010
2021-03-19 15:20:19 -07:00
6a4d2c61d5 Allow linking against vcomp on Windows (#54132)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54132

Reviewed By: zou3519

Differential Revision: D27181524

Pulled By: malfet

fbshipit-source-id: b79b34afb7edcc594d9b5907c5a7505b9cc5683b
2021-03-19 14:36:07 -07:00
6f7a5a47af port div to structured (#53680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53680

Porting `div` to structured.

One weird thing to call out with div: it has an overload, `div.Tensor_mode`, which uses different TensorIterator settings depending on its input (the "mode" argument that you pass to it). So I ended up switching on the mode inside of the meta function to determine which TensorIterator builder to use.
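For context, this overload corresponds to the Python-level rounding modes (a sketch):

```python
import torch

a = torch.tensor([7., -7.])
b = torch.tensor([2., 2.])

torch.div(a, b)                         # true division:  [ 3.5, -3.5]
torch.div(a, b, rounding_mode="trunc")  # toward zero:    [ 3.0, -3.0]
torch.div(a, b, rounding_mode="floor")  # toward -inf:    [ 3.0, -4.0]
```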

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029819

Pulled By: bdhirsh

fbshipit-source-id: 3f216f6c197a2321087b4c23202bc2fc561491ba
2021-03-19 14:32:28 -07:00
fa482aa4ef port sub to structured, fix sub.Scalar bug (#53679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53679

This PR ports sub to be a structured kernel.

It also fixes a bug with `sub.Scalar`. `sub.Scalar` is currently listed as a `DefaultBackend` op, but it isn't actually backend-agnostic: it calls into `native::sub`, which is CPU/CUDA-specific. That can cause bugs like [this](https://github.com/pytorch/pytorch/pull/51758) for other backends like MKLDNN. `sub.Scalar` is now **really** backend-agnostic, since it performs a redispatch to call the overload.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029820

Pulled By: bdhirsh

fbshipit-source-id: d24b435a42f4c505bc763ea77672956f81ad3e26
2021-03-19 14:32:24 -07:00
779cae9e42 port at::pow to structured (#53669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53669

This PR does two things:
* Ports `pow` to be structured
* Fixes a bug with how pow handles mixed cpu and cuda tensors

**bug fix**
Pow is a binary op, and all binary ops that use TensorIterator are currently written to handle the case when one of the inputs is a CUDA tensor, and the other is a zero-dimensional cpu tensor.

`pow` incidentally only handles one of the two cases: it fails when the CUDA tensor is passed as the exponent, e.g. `torch.pow(torch.tensor(2.0, device='cpu'), torch.tensor([2, 2], device='cuda'))`. Porting `pow` to structured happened to change the error that was output from a `TORCH_CHECK` in TensorIterator to an `INTERNAL_ASSERT` in loop.cuh, so I ended up trying to fix the error and update the tests. I added more details in a comment on the PR.

**notes on the structured port**
Pow is a little weird, so I wrote down a couple of issues I noticed during the port:
* Multiple independent overloads. `pow` has two overloads that have their own cpu/cuda kernels, meaning one doesn't call the other. I had to update the names of the kernel overloads to make the compiler happy, since the codegen would otherwise try to generate two classes with the same name. `pow` actually has 3 overloads that all have `out` variants, so I ported all 3 to structured; one of them just happens to redispatch one of the others in most cases.
* Name propagation. Is name propagation implemented per operator, or is it expected to work for most/all ops by default? Right now it looks like it happens for TensorIterator ops by default. For ops that don't use TensorIterator, we need to explicitly pass the names through to the `set_output()` call in the meta function. This happened to matter for `pow` because it has 3 overloads, but only two of them directly use TensorIterator. I had to pass names directly to `set_output` in the 3rd overload to make the tests happy.
* Lack of `const Tensor &` in the C++ API. It's a goal to slowly make all `Tensor &` arguments const as part of the structured ports, but in this case I needed to explicitly cast constness away because one structured kernel called back into the C++ API, which still has ordinary `Tensor &` arguments. This probably isn't something we'll fix soon, since we have boxing logic that actually relies on the `Tensor &` / `const Tensor &` distinction in some places.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029821

Pulled By: bdhirsh

fbshipit-source-id: c1786e770de6e6c2474b9a48210b88057ab1018e
2021-03-19 14:30:48 -07:00
454dd7ba86 Add codeowners for onnx export (#54287)
Summary:
cc neginraoof spandantiwari SplitInfinity for review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54287

Reviewed By: zou3519

Differential Revision: D27190058

Pulled By: SplitInfinity

fbshipit-source-id: 2f218e6da563be19338531e57fe8138f530cb86d
2021-03-19 13:51:39 -07:00
679f07a017 Backup .circleci/config.yml before regenerating (#54345)
Summary:
If you accidentally modify `.circleci/config.yml` directly and then run `.circleci/regenerate.sh`, it clobbers your changes. This PR saves the previous contents of `.circleci/config.yml` to a temporary file, whose name is printed to the console due to the `-x` already present in the script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54345

Test Plan:
```
$ echo "418 I'm a teapot" > .circleci/config.yml
$ .circleci/regenerate.sh
```

Before:
```
++ dirname .circleci/regenerate.sh
+ cd .circleci
++ mktemp
+ NEW_FILE=/var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.vW7yBQT2
+ ./generate_config_yml.py
+ cp /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.vW7yBQT2 config.yml
```
```
$ echo ':('
:(
```

After:
```
++ dirname .circleci/regenerate.sh
+ cd .circleci
++ mktemp
+ OLD_FILE=/var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.54GhUh7w
+ cp config.yml /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.54GhUh7w
++ mktemp
+ NEW_FILE=/var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.aV87RTvQ
+ ./generate_config_yml.py
+ cp /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.aV87RTvQ config.yml
```
```
$ cat /var/folders/vw/ryb6j4d97xs1t_14024b710h0000gn/T/tmp.54GhUh7w
418 I'm a teapot
$ echo ':D'
:D
```

Reviewed By: janeyx99

Differential Revision: D27195142

Pulled By: samestep

fbshipit-source-id: fcd9e4ac102ec3523d96f772eedbd42123364e26
2021-03-19 13:08:02 -07:00
da18313de3 [caffe2] expose whether FBGEMM is available to the Python code (#54274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54274

Some of the Python tests need to be aware of whether or not FBGEMM is
available, so expose this setting in the pybind extension.
ghstack-source-id: 124317732

Test Plan: Will use this variable in the tests on D26658205.

Reviewed By: mraway

Differential Revision: D27171780

fbshipit-source-id: 4c94144a959bf8bf0e1553b6e029e94a91794e29
2021-03-19 12:52:14 -07:00
f1cbd10276 [PyPer] Port c2 add to pt (#54229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54229

Because caffe2's add uses Eigen for add with broadcasting, which is not well supported by OSS PyTorch, it's easier to just keep `c2_add_out` internal for now. Caffe2 does use the MKL add when the input dims of A and B are the same and no broadcasting is needed.

Reviewed By: bertmaher

Differential Revision: D27036279

fbshipit-source-id: 49f0ec5407ea1f641896f054cad2283faed81687
2021-03-19 12:45:11 -07:00
fa07d0c8eb .github: Add workflow to build libtorch (#53292)
Summary:
Based on https://github.com/pytorch/pytorch/issues/50633 and https://github.com/pytorch/pytorch/issues/51243.

Things left to do:

- [x] modify `.github/scripts/generate_binary_build_matrix.py` further
  - [x] add option for not iterating over Python version
  - [x] add `LIBTORCH_VARIANT`
  - [x] add option for cxx11
  - [x] fix the artifact uploading
  - [x] remove `pull_request` hook before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53292

Test Plan: [CI](https://github.com/pytorch/pytorch/actions/runs/665781075).

Reviewed By: seemethere

Differential Revision: D27189150

Pulled By: samestep

fbshipit-source-id: ec91e1f0b75b8c93613d55801585ed975697be03
2021-03-19 12:39:36 -07:00
05a03a6c8c [FX][EZ] Fix type correctness on GraphModule.graph (#54305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54305

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D27181176

Pulled By: jamesr66a

fbshipit-source-id: ed91cfed193984249c07a5bafc7aa732bfe0194d
2021-03-19 11:48:15 -07:00
bc4f521178 port at::mul to structured (#52692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692

Porting `at::mul` to structured.

One other issue I hit with the port was that there are a bunch of places around the code base that used to call out to variants of `at::native::mul`, which no longer exist. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029822

Pulled By: bdhirsh

fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
2021-03-19 11:34:33 -07:00
61b074581c torch.prod backward for complex types. (#48125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53511
torch.det depends on torch.prod, which in turn depends on several other functions that themselves depend on torch.prod. Because of this circular relationship, this PR enables complex backward support for several functions at once.
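A small sketch of what this enables (values hypothetical):

```python
import torch

x = torch.randn(4, dtype=torch.cfloat, requires_grad=True)

# complex backward through prod now works; .abs() gives a real scalar
# so .backward() can be called without an explicit gradient argument
x.prod().abs().backward()
print(x.grad.dtype)   # torch.complex64
```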

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125

Reviewed By: pbelevich

Differential Revision: D27188589

Pulled By: anjali411

fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a
2021-03-19 09:44:08 -07:00
cc7a28d727 Refactor Unary Ops tests (#49712)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49712

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25673712

Pulled By: izdeby

fbshipit-source-id: 4420d5d129026195097d914e410b75b144bea795
2021-03-19 09:28:00 -07:00
645a3e9a92 Fix inaccurate dispatch tables (#54127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54127

During the meta tensor bringup, I found all of these operators
advertised that they worked on all backends (DefaultBackend/Math)
but actually they only worked on CPU/CUDA.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27109508

Pulled By: ezyang

fbshipit-source-id: 0f474ecf4aba8b8207f2910bdc962bf581f53853
2021-03-19 09:10:29 -07:00
49f1336106 Add Tensor::is_cpu, genericize TensorIterator (#54079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54079

Fixes https://github.com/pytorch/pytorch/issues/53815

Instead of testing whether something is CUDA, we instead test whether it
is not CPU. This is in the general theme of "Don't be so darn CUDA
centric".

Intriguingly, we didn't have an is_cpu() method on Tensor, which seems
like a big oversight and one of the reasons we ended up in this
mess. So in it goes. Maybe we should also get this for the Python bindings
as well (but in that case, we should probably look into redoing all of the
is_X bindings so they aren't done manually).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27109507

Pulled By: ezyang

fbshipit-source-id: abbe72c2e688c452ffe098d206cb79938b5824b1
2021-03-19 09:10:24 -07:00
e0aebe241d Refactor tensor_new.cpp to use TensorOptions instead of DispatchKey (#54034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54034

Fixes #53544

I had to touch a bunch of lines but the refactoring was fairly
mechanical.  Here's how it works.

The basic concept behind this PR is that tensor_new.cpp was previously
abusing DispatchKey when it actually meant TensorOptions.  The provided
DispatchKey argument to most of the constructor functions typically
comes from torch::tensors::get_default_dispatch_key();  it doesn't
really make sense for people to set the default dispatch key, but
this got grandfathered in due to the old API set_default_tensor_type
(where the "Type" concept got refactored into "DispatchKey" concept
over time).  See also #53124.  But the upshot is that, semantically,
what we refer to as the default dispatch key really is more like
torch.set_default_tensor_type(torch.Tensor) versus
torch.set_default_tensor_type(torch.cuda.Tensor): clearly the user
wants to do something about *construction* of the tensor, and
TensorOptions captures that exactly.

So, how exactly to translate from one to the other?
- Sources (things that used to PRODUCE DispatchKey)
  - Most top level functions take a DispatchKey as their argument.  I
    use the new function dispatchKeyToTensorOptions to convert it into
    a TensorOptions
  - typeIdWithDefault now produces a TensorOptions (probably could do
    with a rename, though I didn't)
- Sinks (things that used to CONSUME DispatchKey)
  - Previously, the function options() was typically used to convert the
    DispatchKey into a TensorOptions.  Now its replacement build_options
    just takes a TensorOptions and sets some extra fields on it.
    Irritatingly, I can't just replace
    `build_options(options, scalar_type, device)` with
    `options.dtype(scalar_type).device(device)` because the semantics
    are slightly different: if device is nullopt, we should preserve
    the usage of the device specified in options (what options.device()
    does is overwrite the device unconditionally; e.g., if device is
    nullopt, unset device from options)
  - The other major sink for DispatchKey was `internal_new_from_data`,
    but it turns out it only really extracts the device type from
    the dispatch key.  Now it just pulls out the device from
    TensorOptions.
- To actually do the translation of DispatchKey to TensorOptions, I
  introduce new functions dispatchKeyToLayout (replicating
  layout_from_backend--there are still a few uses of this function
  so I couldn't delete it) and dispatchKeyToDeviceType (replacing
  computeDeviceType)
- In all internal functions, whenever DispatchKey is taken as an argument,
  I instead take TensorOptions as an argument, and pass it along.
- Anywhere `legacyExtractDispatchKey(other.key_set())` equality was
  previously used, I now do `other.options().type_equal()`, which
  is the intended BC for doing "backend to backend" comparisons
- There are a few places in the sparse constructors where we allocated
  a tensor for values, and then read out the dispatch key from the
  result to allocate the keys.  As best as I can tell, this is totally
  equivalent to just passing in the options to both values and indices
  (the only difference is dtype, which is captured via a separate
  argument)

This refactor doesn't really go far enough: for example, there are now
functions that take both TensorOptions and ScalarType, when really
the TensorOptions can capture this all.  I kept it solely just
s/DispatchKey/TensorOptions/ to reduce the number of possible bugs;
also, a lot of this will be mooted by a proper fix to #53124.

Even with this limited refactor, the payoff is sweet.  I can delete:

- backendToCPU
- backendToXPU
- backendToCUDA
- backendToHIP
- backendToBackendOfDeviceType

The reason I can do this is because I can simply overwrite layout in TensorOptions
to do the conversion, rather than having to type out each backend case
explicitly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27109509

Pulled By: ezyang

fbshipit-source-id: 91d16cfbc390127770362ac04fb43f7e070077e9
2021-03-19 09:08:32 -07:00
544a996f83 Revert D27155845: [pytorch][PR] Fixed the size of the workspace array in functions calling MAGMA
Test Plan: revert-hammer

Differential Revision:
D27155845 (04a2506091)

Original commit changeset: 04439bfa82a5

fbshipit-source-id: f45967e94883effbb43d8d0a019596f1f82caa56
2021-03-19 08:27:18 -07:00
887759c9b9 Changes to autograd/custom functions to handle optional arguments (#54270)
Summary:
Small changes to autograd to support optional Tensor values.
On MLC device, we use Autograd Custom Functions to override the autograd engine for a specific operation. We do something like:

```
at::Tensor AtenMLCAutogradTypeDefault::abs(const at::Tensor & self) {
  torch_mlc::mlclogger() << "MLC bridge autograd MLC : abs" << std::endl;
  torch_mlc::AutoNonAtenMLCAutogradTypeDefault guard(true);
  return MLCAbsFunction::apply(self);
}

TORCH_LIBRARY_IMPL(aten, AutogradMLC, m) {
  m.impl("abs", static_cast<at::Tensor (*)(const at::Tensor &)>(&AtenMLCAutogradTypeDefault::abs));
}
```
What I noticed is that the existing code does not always work for optional Tensor types. This PR fixes it. Let me know if you have a better way to deal with this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54270

Reviewed By: ejguan

Differential Revision: D27171623

Pulled By: albanD

fbshipit-source-id: 3aa8d59ee8da3cc943ad5e73521c2755d1ff2341
2021-03-19 07:45:36 -07:00
f2b4b0e9eb [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27184963

fbshipit-source-id: 65355a12697c8bd996b86947e3e0aeb0ee4eff3f
2021-03-19 05:16:43 -07:00
a84afb3a7c Use type-erased union for Buffer. (#54251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54251

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/324

In order to merge the channel hierarchies, we need a generic `Buffer` type, that can wrap either a `CpuBuffer` or a `CudaBuffer`.
The constraints are that, since this type is used by the channels, it cannot explicitly refer to `CudaBuffer`. We propose here a type-erasure based solution, with small-buffer optimization to avoid heap-allocating the wrapped concrete buffer.

This is a new version of D27001339 (c618dc13d2) which broke PyTorch OSS build.

Test Plan: CI

Reviewed By: lw, mrshenli

Differential Revision: D27156053

fbshipit-source-id: 4244302af33a3be91dcd06093c0d6045d081d3cc
2021-03-19 04:58:09 -07:00
8f755b9ed0 initial draft for assert_tensors_(equal|allclose) in torch.testing (#53820)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/53618#issuecomment-795896298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53820

Reviewed By: agolynski

Differential Revision: D27113912

Pulled By: mruberry

fbshipit-source-id: 2a37522eaa37e90bf7b116f3964b06b46068cab7
2021-03-18 20:32:03 -07:00
acf03b13f1 [Static Runtime] Check for number of uses of op inputs > 1 in ReplaceWithCopy (#54230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54230

The comments in the code explain why this change is needed.

Reviewed By: bwasti

Differential Revision: D27145406

fbshipit-source-id: 2a61a42f22dfadfad59ee6c3be3e9e9d19e90ac3
2021-03-18 20:02:20 -07:00
bfd009836e [torch.special] Add special.erf{c, inv} (#53260)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Also adds `overrides` entry for module and the newly added functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53260

Reviewed By: agolynski

Differential Revision: D27114342

Pulled By: mruberry

fbshipit-source-id: b1dd88f373db251bb71df12d33b160382138f63f
2021-03-18 19:06:25 -07:00
19792b45db add a pytest.ini file (#53152)
Summary:
This fixes the first three items of https://github.com/pytorch/pytorch/issues/52984 by adding a pytest.ini configuration file.

#### `--tb=short`
On a failure, pytest shows neither the entire traceback nor the docstring.

- Without `--tb=short`:
<details>

```
$ pytest test/test_typing.py -k tensor_copy
================================================================== test session starts ===================================================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/guilhermel/git/pytorch, configfile: pytest.ini
plugins: hypothesis-5.38.1, typeguard-2.10.0
collected 8 items / 7 deselected / 1 selected

test/test_typing.py F                                                                                                                              [100%]

======================================================================== FAILURES ========================================================================
______________________________________________________________ test_reveal[tensor_copy.py] _______________________________________________________________

path = '/home/guilhermel/git/pytorch/test/typing/reveal/tensor_copy.py', reveal = 'int '
expected_reveal = "/home/guilhermel/git/pytorch/test/typing/reveal/tensor_copy.py:11: note: Revealed type is 'torch.tensor.Tensor'", lineno = 11

    def _test_reveal(path: str, reveal: str, expected_reveal: str, lineno: int) -> None:
        if reveal not in expected_reveal:
>           raise AssertionError(_REVEAL_MSG.format(lineno, expected_reveal, reveal))
E           AssertionError: Reveal mismatch at line 11
E
E           Expected reveal: "/home/guilhermel/git/pytorch/test/typing/reveal/tensor_copy.py:11: note: Revealed type is 'torch.tensor.Tensor'"
E           Observed reveal: 'int '

test/test_typing.py:156: AssertionError
================================================================ short test summary info =================================================================
FAILED test/test_typing.py::test_reveal[tensor_copy.py] - AssertionError: Reveal mismatch at line 11
```

</details>

- With `--tb=short`:
<details>

```
$ pytest test/test_typing.py -k tensor_copy
================================================================== test session starts ===================================================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/guilhermel/git/pytorch, configfile: pytest.ini
plugins: hypothesis-5.38.1, typeguard-2.10.0
collected 8 items / 7 deselected / 1 selected

test/test_typing.py F                                                                                                                              [100%]

======================================================================== FAILURES ========================================================================
______________________________________________________________ test_reveal[tensor_copy.py] _______________________________________________________________
test/test_typing.py:156: in _test_reveal
    raise AssertionError(_REVEAL_MSG.format(lineno, expected_reveal, reveal))
E   AssertionError: Reveal mismatch at line 11
E
E   Expected reveal: "/home/guilhermel/git/pytorch/test/typing/reveal/tensor_copy.py:11: note: Revealed type is 'torch.tensor.Tensor'"
E   Observed reveal: 'int '
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53152

Reviewed By: agolynski

Differential Revision: D26846808

Pulled By: walterddr

fbshipit-source-id: d16c951b370b0643c8bbedca73d5184c6b65aba7
2021-03-18 17:46:26 -07:00
bbb06c05a8 remove type_hint_tests and convert the files to use the new test style (#53167)
Summary:
This is a follow-up PR to https://github.com/pytorch/pytorch/issues/52408 that moves and converts all files under `test/type_hint_tests/*.py` to use the new test style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53167

Reviewed By: ejguan

Differential Revision: D27081041

Pulled By: walterddr

fbshipit-source-id: 56508083800a5e12a7af88d095ca26229f0df358
2021-03-18 17:33:53 -07:00
53d8778b4d Update clang-format linux hash and yaml import calls (#53932)
Summary:
Fixing Bandit security issues.
- yaml_load: Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().
Test ID: B506
Severity: MEDIUM
Confidence: HIGH
File: ./caffe2/contrib/aten/gen_op.py
More info: https://bandit.readthedocs.io/en/latest/plugins/b506_yaml_load.html
235 if __name__ == '__main__':
236     decls = yaml.load(read(os.path.join(args.yaml_dir, 'Declarations.yaml')), Loader=Loader)
237     factory_methods = find_factory_methods(decls)
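The yaml part of the fix amounts to switching loaders; a minimal sketch of the recommended alternative:

```python
import yaml

text = "a: 1\nb: [2, 3]"

# unsafe: yaml.load with a permissive Loader can instantiate arbitrary objects
# data = yaml.load(text, Loader=yaml.Loader)

# safe: only constructs plain Python types
data = yaml.safe_load(text)
print(data)   # {'a': 1, 'b': [2, 3]}
```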

- Blacklist: Use of insecure MD2 (6149a26adb), MD4 (fc7f026980), MD5 (7ea9d9af4e), or SHA1 hash function.
Test ID: B303
Severity: MEDIUM
Confidence: HIGH
File: ./tools/clang_format_utils.py
More info: https://bandit.readthedocs.io/en/latest/blacklists/blacklist_calls.html#b303-md5
36
37     hash = hashlib.sha1()
38

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53932

Reviewed By: jbschlosser

Differential Revision: D27072017

Pulled By: malfet

fbshipit-source-id: 2fef0119388797aee3cacdc880fc345bd2ba68ce
2021-03-18 17:11:58 -07:00
04e0cbf5a9 Add padding='same' mode to conv{1,2,3}d (#45667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45667

First part of #3867 (Pooling operators still to do)

This adds a `padding='same'` mode to the interface of `conv{n}d`and `nn.Conv{n}d`. This should match the behaviour of `tensorflow`. I couldn't find it explicitly documented but through experimentation I found `tensorflow` returns the shape `ceil(len/stride)` and always adds any extra asymmetric padding onto the right side of the input.

Since the `native_functions.yaml` schema doesn't seem to support strings or enums, I've moved the function interface into python and it now dispatches between the numerically padded `conv{n}d` and the `_conv{n}d_same` variant. Underscores because I couldn't see any way to avoid exporting a function into the `torch` namespace.

A note on asymmetric padding. The total padding required can be odd if both the kernel-length is even  and the dilation is odd. mkldnn has native support for asymmetric padding, so there is no overhead there, but for other backends I resort to padding the input tensor by 1 on the right hand side to make the remaining padding symmetrical. In these cases, I use `TORCH_WARN_ONCE` to notify the user of the performance implications.
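A quick sketch of the new interface:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)

y = F.conv2d(x, w, padding='same')  # spatial size preserved at stride 1
print(y.shape)                      # torch.Size([1, 8, 32, 32])

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding='same')
print(conv(x).shape)                # torch.Size([1, 8, 32, 32])
```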

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27170744

Pulled By: jbschlosser

fbshipit-source-id: b3d8a0380e0787ae781f2e5d8ee365a7bfd49f22
2021-03-18 16:22:03 -07:00
a8a1090324 Perform appropriate CUDA stream synchronization in distributed autograd. (#53929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53929

The local autograd engine performs appropriate stream synchronization
between autograd nodes in the graph to ensure a consumer's stream is
synchronized with the producer's stream before executing the consumer.

However in case of distributed autograd, the SendRpcBackward function receives
gradients over the wire and TensorPipe uses its own pool of streams for this
purpose. As a result, the tensors are received on TensorPipe's stream pool but
SendRpcBackward runs on a different stream during the backward pass and there
is no logic to synchronize these streams.

To fix this, I've enhanced DistEngine to synchronize these streams
appropriately when it receives grads over the wire.
ghstack-source-id: 124055277

(Note: this ignores all push blocking failures!)

Test Plan:
1) Added unit test which reproduced the issue.
2) waitforbuildbot.

Reviewed By: walterddr, wanchaol

Differential Revision: D27025307

fbshipit-source-id: 2944854e688e001cb3989d2741727b30d9278414
2021-03-18 16:15:46 -07:00
75498164fe Remove nonexistent files (#54276)
Summary:
Since both of these files were deleted some time ago, we shouldn't be running them anymore, as they implemented the old sharding strategy (see https://github.com/pytorch/pytorch/issues/50660).
```
test_python_nn.bat
test_python_all_except_nn.bat
```

I believe we intend to run all the python files, so I added a call for that instead.

Note: I don't believe there is a single unsharded test build, though, so should I instead just assume that all windows tests will be sharded?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54276

Reviewed By: ejguan

Differential Revision: D27173045

Pulled By: janeyx99

fbshipit-source-id: a7562c1479e18bd63f192f02129a42911a73a70b
2021-03-18 16:10:40 -07:00
8cd4dac78f Move mypy wrapper to tools (#54268)
Summary:
This PR

- moves `torch/testing/_internal/mypy_wrapper.py` (and its accompanying tests from `test/test_testing.py`) to `tools`,
- removes the now-unused `test_run_mypy` from `test/test_type_hints.py`, and
- replaces the hardcoded list of `mypy` configs (previously duplicated across `mypy_wrapper.py` and `.github/workflows/lint.yml`) with a simpler glob

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54268

Test Plan:
Should also be run in the "Test tools" GHA workflow in CI:
```
python tools/test/test_mypy_wrapper.py
```

Reviewed By: janeyx99

Differential Revision: D27168095

Pulled By: samestep

fbshipit-source-id: a8dc18407b5e4c103ace23a636b0a8534951905a
2021-03-18 15:41:27 -07:00
4626886f21 [JIT] Add CUDNN Conv-Add-Relu fusion for Frozen Model Optimization (#52102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52102

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D26646100

fbshipit-source-id: 7f7a82cc0b42c958b9e0c854b3b5dc6ea7cfff6c
2021-03-18 15:18:52 -07:00
90bbe0b38b cmake: auto-detect ccache to speed up developer builds (#49389)
Summary:
https://ccache.dev/ is a compiler cache that speeds up subsequent builds. Auto-detecting ccache ensures that it is used on systems where it is available, greatly improving build times for developers. There is no risk in enabling ccache in practice. Please refer to https://ccache.dev/ for a short summary / motivation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49389

Reviewed By: ejguan

Differential Revision: D27169957

Pulled By: malfet

fbshipit-source-id: 673b60bbceb0d323901c8a992a75792c6da9b805
2021-03-18 14:20:53 -07:00
a95abc4648 Test tools/test_history.py (#54259)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54259

Test Plan:
The main point of this is to be run in our "Test tools" GitHub Actions workflow. To test locally:
```
mypy --config=mypy-strict.ini
python tools/test/test_test_history.py
```

Reviewed By: seemethere

Differential Revision: D27164519

Pulled By: samestep

fbshipit-source-id: 46f90e62e2d4d0c413b202419e509d471bad43de
2021-03-18 14:05:42 -07:00
0645e2b490 Use shard file if present, improve functions used for sharding (#54210)
Summary:
Step 2 to fixing https://github.com/pytorch/pytorch/issues/53882 :)

This changes TARGET_DET_LIST and sharding automation by checking if there's already cached data from the commit in `.pytorch-test-times`. If not, it pulls data from S3 and updates the file to have the stats. This way, S3 pulling does not need to happen more than once for the same commit.
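
A hypothetical sketch of that caching behavior (`pull_job_times_from_s3` is a stand-in for the real S3 fetch; the JSON layout follows the `.pytorch-test-times` format shown elsewhere in this log):

```
import json
import os

def pull_job_times_from_s3(commit):
    # stand-in for the real S3 fetch (hypothetical helper)
    return {"test_binary_ufuncs": 171.157}

def load_test_times(commit, cache_file=".pytorch-test-times"):
    # Reuse cached stats if they belong to the same commit...
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            data = json.load(f)
        if data.get("commit") == commit:
            return data["job_times"]
    # ...otherwise pull from S3 once and cache for later invocations.
    data = {"commit": commit, "job_times": pull_job_times_from_s3(commit)}
    with open(cache_file, "w") as f:
        json.dump(data, f)
    return data["job_times"]
```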

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54210

Test Plan:
The following methods should run the same set of tests.
First `export CIRCLE_JOB=pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2` or your favorite CIRCLE_JOB.

1. Pull data first and use it:
Download the data from S3 and write it to the cache file with `python test/run_test.py --export-historic-test-times .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`

2. Make the sharding job pull data:
Delete the file you just created: `rm .pytorch-test-times`
Now run `python test/run_test.py --shard 1 10`

Reviewed By: walterddr

Differential Revision: D27136849

Pulled By: janeyx99

fbshipit-source-id: 51a42c4e2fa3f8cf15e682679dd3eb6130aad927
2021-03-18 13:25:51 -07:00
3b1e3103ca Remove usage of onEachDevice from legacy profiler (#54125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54125

Fixes https://github.com/pytorch/pytorch/issues/48987

Test Plan:
python setup.py clean
TORCH_CUDA_ARCH_LIST="6.0" USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake 2>&1 | tee ~/output.txt
python test/test_profiler.py -v

python setup.py clean
USE_CUDA=0 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake 2>&1 | tee ~/output.txt
python test/test_profiler.py -v

+ CI

Reviewed By: rohan-varma

Differential Revision: D27109481

Pulled By: ilia-cher

fbshipit-source-id: 3fba8bc55deafeed1ab4680b311e927f40eaf99c
2021-03-18 12:19:51 -07:00
d85faf8d8e Cleanup mypy lint job (#54260)
Summary:
Update to checkout v2
Delete "Get HEAD commit SHA" step

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54260

Reviewed By: samestep

Differential Revision: D27160678

Pulled By: malfet

fbshipit-source-id: d1afe4f1cf0046cfb93de583ee123b4db5b25f9a
2021-03-18 10:49:57 -07:00
04a2506091 Fixed the size of the workspace array in functions calling MAGMA (#54009)
Summary:
The size of the workspace arrays should not be less than 1. This PR fixes lstsq calls to LAPACK and MAGMA. Also `max(1, ...)` guards were added to a few other functions (symeig, svd).
ROCm testing is enabled for lstsq, pinv, pinverse.

Fixes https://github.com/pytorch/pytorch/issues/53976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54009

Reviewed By: ejguan

Differential Revision: D27155845

Pulled By: mruberry

fbshipit-source-id: 04439bfa82a5bdbe2297a6d62b6e68ba1c30e4a2
2021-03-18 10:07:45 -07:00
f0056f89a4 Final kernel launch checks (#54214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54214

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27138004

fbshipit-source-id: 4448ad8242eb721d0ce02b35a65236226eed9a31
2021-03-18 09:37:16 -07:00
cc92117aad cleanup static_cast of AutogradMeta (#54103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54103

The goal is to reduce the spread of static casts in the autograd code as per the comment in https://github.com/pytorch/pytorch/pull/49097#discussion_r543695091
I wasn't sure how to use a virtual method here, but a simple method in impl cleans it up quite nicely.

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D27117840

Pulled By: albanD

fbshipit-source-id: 5f277dde34ccf6bc20f76583b906ff3528cde5aa
2021-03-18 09:29:07 -07:00
004db37358 properly make AutogradMeta/DifferentiableViewMeta attributes internal (#54102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54102

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27117841

Pulled By: albanD

fbshipit-source-id: bb047cf1878ccff81d677ceb02e98e784760c3ec
2021-03-18 09:29:03 -07:00
09b4af2f0f Remove legacy from optional-related function names (#54101)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54101

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27117839

Pulled By: albanD

fbshipit-source-id: 1f50b06ff9b0be8301f6ea9eca14f73a3a5fa137
2021-03-18 09:29:00 -07:00
a425eb2135 Add size check for forward grads (#54100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54100

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27117842

Pulled By: albanD

fbshipit-source-id: ccb6abac38d7fca31bea72cbbf3bba38c6030c37
2021-03-18 09:28:56 -07:00
cba8516b52 make internal forwardAD methods on at::Tensor internal (#54099)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54099

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27117838

Pulled By: albanD

fbshipit-source-id: ede96529a4b099dea9cf885d0bf2cb352aa30fa5
2021-03-18 09:27:17 -07:00
a52e295cbb Add MyPY to lint GHA workflow (#54067)
Summary:
Also disables test_run_mypy from test_type_hints.py, as it now runs as part of GHA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54067

Reviewed By: ezyang

Differential Revision: D27091530

Pulled By: malfet

fbshipit-source-id: 9cfe397260aba34aeb055676855db383cd06f76d
2021-03-18 08:55:04 -07:00
4b2abc4b8e [NNC] Adding API to distribute loops (#53865)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53864

This PR adds the following APIs that perform loop distribution to `LoopNest`:
```
static std::vector<For*> distributeLoop(For* loop, const std::unordered_set<Stmt*>& pivots);
static std::vector<For*> distributeLoop(For* loop);
static std::vector<For*> distributeLoopOverInnerLoops(For* loop);
```

* The first method distributes the given loop over its body by splitting after every given pivot stmt.
* The second method distributes the given loop over every stmt in its body.
* The last method distributes the given loop over its body by splitting after every `For` stmt in its body.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53865

Reviewed By: mruberry

Differential Revision: D27075006

Pulled By: navahgar

fbshipit-source-id: 031746aad619fe84c109e78b53387535e7f77cef
2021-03-18 07:27:39 -07:00
dc35848804 [PyTorch] Rename XPLAT_MOBILE_BUILD to TEMPLATE_SELECTIVE_BUILD (#54217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54217

As title. Find all XPLAT_MOBILE_BUILD usage: [search result](https://www.internalfb.com/intern/codesearch/?bunny_arg=XPLAT_MOBILE_BUILD&bunny_command=fbgs&lucky=0&q=repo%3Afbcode%20case%3Ainsensitive%20regex%3Aoff%20XPLAT_MOBILE_BUILD&source=redirect) and replace.

Since template selective build is now added in OSS, rename the macro to make it clearer.

T86478520 follows up to unify the XPLAT_MOBILE_BUILD (renamed to TEMPLATE_SELECTIVE_BUILD), C10_MOBILE, and BUILD_LITE_INTERPRETER macros.
ghstack-source-id: 124206354

Test Plan: CI

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D27112046

fbshipit-source-id: 6f89b168c1f39c5449c8ed6538d887ea066a2225
2021-03-18 07:25:52 -07:00
9f86b656ba Resubmit: Adding parallel support for the LLVM backend. (#54122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54122

Test Plan:
* USE_TBB=1 ATEN_THREADING=TBB python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=NATIVE python setup.py develop --cmake
* USE_TBB=1 ATEN_THREADING=OMP python setup.py develop --cmake
* cd build; ninja bin/tensorexpr_bench
* bin/test_tensorexpr --gtest_filter="*Parallel*"

Reviewed By: bertmaher

Differential Revision: D27109802

Pulled By: zheng-xq

fbshipit-source-id: db159466d0b46357bcf0fbefb36094bee312368c
2021-03-18 07:19:37 -07:00
444552e7f9 Optimize alias_analysis node lookup (#54115)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54115

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D27104047

Pulled By: tugsbayasgalan

fbshipit-source-id: 0ef4e78be9ea7081b63ab2303711746bf09653eb
2021-03-18 07:14:49 -07:00
382a47b493 Add torch.linalg.vector_norm function (#51099)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50214
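
For reference, a minimal usage sketch of the new function:

```
import torch

x = torch.tensor([3.0, -4.0])
print(torch.linalg.vector_norm(x))         # tensor(5.) -- 2-norm by default
print(torch.linalg.vector_norm(x, ord=1))  # tensor(7.)
```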

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51099

Reviewed By: agolynski

Differential Revision: D27147360

Pulled By: mruberry

fbshipit-source-id: 1056f840e7027ad81971c9d1a9f952ab9648f1b5
2021-03-18 06:41:39 -07:00
564456ac44 Added autograd support for torch.orgqr (#52637)
Summary:
This PR adds autograd support for `torch.orgqr`.

Since `torch.orgqr` is one of few functions that expose LAPACK's naming and all other linear algebra routines were renamed a long time ago, I also added a new function with a new name and `torch.orgqr` now is an alias for it.

The new proposed name is `householder_product`. For a matrix `input` and a vector `tau`, LAPACK's orgqr operation takes columns of `input` (called Householder vectors or elementary reflectors) and scalars of `tau` that together represent Householder matrices, and then computes the product of these matrices. See https://www.netlib.org/lapack/lug/node128.html.
Other linear algebra libraries that I'm aware of do not expose this LAPACK function, so there is some freedom in naming it. It is usually used internally only for QR decomposition, but it can be useful for deep learning tasks now that it supports differentiation.

Resolves https://github.com/pytorch/pytorch/issues/50104
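
A short sketch of how the new op relates to QR (the final `allclose` reflects the shared LAPACK geqrf/orgqr path and is an expectation, not something this PR asserts):

```
import torch

A = torch.randn(5, 3, dtype=torch.float64)
a, tau = torch.geqrf(A)                       # Householder reflectors and scalars
Q = torch.linalg.householder_product(a, tau)  # alias: torch.orgqr(a, tau)

# Q should reproduce the Q factor of the reduced QR decomposition of A.
q_ref, _ = torch.linalg.qr(A)
print(torch.allclose(Q, q_ref))
```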

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52637

Reviewed By: agolynski

Differential Revision: D27114246

Pulled By: mruberry

fbshipit-source-id: 9ab51efe52aec7c137aa018c7bd486297e4111ce
2021-03-18 05:42:18 -07:00
2f3b194dc2 Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition (#53104)
Summary:
This PR adds cusolver potrf and potrfBatched to the backend of torch.cholesky and torch.linalg.cholesky.

Cholesky heuristics:

- Use cusolver potrf for batch_size == 1
- Use magma_xpotrf_batched for batch_size >= 2
- If MAGMA is not available, use a loop of cusolver potrf calls for batch_size >= 2

cusolver potrfBatched currently has a NaN output issue; we will switch to it after it's fixed.
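
A quick illustration of the batched path (requires CUDA; which backend gets picked is internal, so the comment below is an assumption based on the heuristics above):

```
import torch

A = torch.randn(4, 3, 3, dtype=torch.float64, device='cuda')
A = A @ A.transpose(-2, -1) + 3 * torch.eye(3, dtype=torch.float64, device='cuda')

L = torch.linalg.cholesky(A)  # batch_size >= 2 -> batched MAGMA path if available
print(torch.allclose(L @ L.transpose(-2, -1), A))
```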

See also https://github.com/pytorch/pytorch/issues/42666 #47953

Todo:

- [x] benchmark and heuristic

Close https://github.com/pytorch/pytorch/pull/53992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53104

Reviewed By: agolynski

Differential Revision: D27113963

Pulled By: mruberry

fbshipit-source-id: 1429f63891cfc6176f9d8fdeb5c3b0617d750803
2021-03-18 05:35:40 -07:00
8caa7889fc Revert D27001339: Use type-erased union for Buffer.
Test Plan: revert-hammer

Differential Revision:
D27001339 (c618dc13d2)

Original commit changeset: 26d7dc19d69d

fbshipit-source-id: 6e036ed7e1f71c9cf20e3361607c4fe4fa2d3d02
2021-03-18 05:27:17 -07:00
c618dc13d2 Use type-erased union for Buffer. (#322)
Summary:
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54145

In order to merge the channel hierarchies, we need a generic `Buffer` type, that can wrap either a `CpuBuffer` or a `CudaBuffer`.
The constraints are that, since this type is used by the channels, it cannot explicitly refer to `CudaBuffer`. We propose here a type-erasure based solution, with small-buffer optimization to avoid heap-allocating the wrapped concrete buffer.
ghstack-source-id: 124131499

Test Plan: CI

Reviewed By: lw

Differential Revision: D27001339

fbshipit-source-id: 26d7dc19d69d7e3336df6fd4ff6ec118dc17c5b6
2021-03-18 02:23:17 -07:00
133000fe7a [distributed] add processgroup options as argument (#53663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663

This adds the processgroup option as an optional argument to new_group
and init_processgroup, which allows users to pass in an initialized
processgroup option for gloo and nccl.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968857

Pulled By: wanchaol

fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
2021-03-18 01:04:17 -07:00
2d8795c552 [FX] Normalize torch. namespace ops (#53832)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53832

Test Plan: Imported from OSS

Reviewed By: jfix71, Chillee

Differential Revision: D26982801

Pulled By: jamesr66a

fbshipit-source-id: 96ac8efe2b3c644cfb7328168f6db089d3756aa2
2021-03-17 23:34:29 -07:00
72c7983f23 Remove __get__ from Tensor stub. (#54208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54208

It seems like it was added to suppress some errors in LazyModules, but I think we should solve those more directly with some type ignores in more surgical places.

Fixes #54087.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27137363

Pulled By: ezyang

fbshipit-source-id: 017cafcc3350e73cd62436078835b97cd9b3b929
2021-03-17 21:40:58 -07:00
a27f46bbe3 [FX] Experimental type annotation pass using Python signatures (#53831)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53831

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26982804

Pulled By: jamesr66a

fbshipit-source-id: 17db9f71e729206f29ee231e34723d9616f128b7
2021-03-17 20:43:17 -07:00
255b103c1b [WIP] Function to retrieve inspect.Signature instances for PyTorch ops (#53830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53830

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26982802

Pulled By: jamesr66a

fbshipit-source-id: 18fddc9f3f34b09e173de59f2fe886f8eedd000e
2021-03-17 20:41:27 -07:00
0dc5abfaa9 Revert D26907093: Add repeats to Timer.collect_callgrind(...)
Test Plan: revert-hammer

Differential Revision:
D26907093 (74993dcf7b)

Original commit changeset: 72e5b4889691

fbshipit-source-id: 80779ec895920a4e9b33daa56f32b587f8912ed6
2021-03-17 20:14:21 -07:00
ca429fedd3 [StaticRuntime] Fuse SigridTransforms + ListUnpack (#53920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920

Fusing SigridTransforms + ListUnpack allows for enabling out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.

The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) saving the memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still need to pay the cost of constructing a `vector<Tensor>` for outputs and a round of refcount bumps for all the output TensorImpls.

Reviewed By: ajyu

Differential Revision: D26220546

fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
2021-03-17 19:58:02 -07:00
ef9ee46756 Avoid modifying rebuild buckets state in no_grad context (#54159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54159

See https://github.com/pytorch/pytorch/issues/54059 for discussion.

In short, users might want to run evaluation on a single rank
in `torch.no_grad()` mode. When this happens, we need to make
sure that we skip all rebuild-bucket logic, as the forward pass
only runs on one rank and not all peers can participate in the
bucket configuration sync communication.
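
A minimal runnable sketch of the scenario (a single-process gloo group stands in for "one rank of many"; address/port are placeholders):

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))

# Evaluation under no_grad on a single rank: with this fix, DDP skips the
# rebuild-bucket logic instead of blocking on peers that never run forward.
model.eval()
with torch.no_grad():
    out = model(torch.randn(8, 4))
```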

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D27119666

Pulled By: mrshenli

fbshipit-source-id: 4b2f8cce937cdd893e89d8d10c9267d255ba52ea
2021-03-17 19:50:29 -07:00
fef0219f7e [ROCM] Fix hipfft transform type error (#53411)
Summary:
This PR enables some failing fft unit tests in pytorch on ROCm.

These tests were failing due to an error in how hipfft was executed for different transform types with float inputs, causing a mismatch error when compared to baselines.

We solved the problem by calling hipfft with the right config for each transformation type.

This PR does not enable all fft tests; there are still other issues that need to be resolved before that can happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53411

Reviewed By: albanD

Differential Revision: D27008323

Pulled By: mruberry

fbshipit-source-id: 649c65d0f12a889a426ec475f7d8fcc6f1d81bd3
2021-03-17 19:26:04 -07:00
f4a044ca1d [distributed] add options field in ProcessGroupGloo/NCCL (#54090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54090

This PR adds an options field to both ProcessGroupGloo/NCCL so that we
have a constant `options` field even after the initialization of the
ProcessGroup, which gives us the ability to inspect the options during
construction of a specific ProcessGroup. Also, use options inside the
different methods instead of separate fields.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27093670

Pulled By: wanchaol

fbshipit-source-id: b02d9394290e9be88b21bddb94d4de7993b4a2e3
2021-03-17 18:41:55 -07:00
a4f0f8b1e9 [distributed] add base processgroup::options (#53662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53662

Add a base processgroup::options so that we can do inheritance and
provide a universal option API in Python.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968856

Pulled By: wanchaol

fbshipit-source-id: 858f4b61b27aecb1943959bba68f8c14114f67d8
2021-03-17 18:40:04 -07:00
ac78d05d05 [Kineto] Update rev for fix to #53848 (#54226)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53848.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54226

Reviewed By: ilia-cher

Differential Revision: D27144893

Pulled By: gdankel

fbshipit-source-id: f3609de540fd62c58f60f19cdca88f0dbf3ee8ca
2021-03-17 18:23:25 -07:00
74993dcf7b Add repeats to Timer.collect_callgrind(...) (#53295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53295

A lot of the time spent in `collect_callgrind` is spinning up Valgrind and executing the initial `import torch`. In most cases the actual run loop is a much smaller fraction. As a result, we can reuse the same process to do multiple replicates and do a much better job of amortizing that startup cost. This also tends to produce more stable measurements: the kth run is more repeatable than the first because everything has had a chance to settle into a steady state. The instruction microbenchmarks lean heavily on this behavior. I found that in practice several `n=100` replicates are more reliable than one monolithic 10,000+ iteration run, since a rare event like memory consolidation will only contaminate that one replicate, as opposed to getting mixed into the entire long run.
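
A hedged usage sketch (requires valgrind to be installed; `repeats` is the new argument added here):

```
from torch.utils.benchmark import Timer

t = Timer("x + y", setup="import torch; x = torch.ones(8); y = torch.ones(8)")
# One Valgrind process is reused across the replicates, amortizing the
# startup / `import torch` cost described above.
stats = t.collect_callgrind(number=100, repeats=5)
```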

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26907093

Pulled By: robieta

fbshipit-source-id: 72e5b48896911f5dbde96c8387845d7f9882fdb2
2021-03-17 18:05:13 -07:00
8ecb2d35bc Add ability to override _reduce_ex_ function of DataPipe (#52858)
Summary:
Required for `torchdata` graph functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52858

Reviewed By: H-Huang

Differential Revision: D26736348

Pulled By: VitalyFedyunin

fbshipit-source-id: 1735e88374090422e6365d07d5b84075e371500c
2021-03-17 17:23:05 -07:00
2eb3917629 [Vulkan] Add reflection_pad2d to Vulkan (#53604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53604

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D27098310

Pulled By: SS-JIA

fbshipit-source-id: efb6692d20edcab06907d12ad0121676876216dc
2021-03-17 16:11:25 -07:00
06cb9293c5 Add GitHub Actions workflow to test tools (#54207)
Summary:
This PR closes https://github.com/pytorch/pytorch/issues/52866 by adding a GitHub Actions workflow to run the tests in the dir introduced by https://github.com/pytorch/pytorch/issues/53755. It also uses `actions/setup-python@v2`, assuming that https://github.com/pytorch/pytorch/issues/54202 will be merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54207

Test Plan: The added "Test tools" GHA workflow in CI.

Reviewed By: walterddr

Differential Revision: D27135159

Pulled By: samestep

fbshipit-source-id: c8c5e2e2ac2491baab1b1f1ed4f44b4c3266ee8d
2021-03-17 15:30:08 -07:00
7d1e1c7e0d Pyre-ify torch.jit.interface's (#54084)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54084

Test Plan: Sandcastle

Reviewed By: derekmod-fb

Differential Revision: D27075597

fbshipit-source-id: 992592c88320df61e3a65eb0ac4ba5705b0b5802
2021-03-17 14:56:31 -07:00
94b22b5b3b try catch test upload failures (#54194)
Summary:
An exception while sending the report shouldn't fail the entire pipeline

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54194

Test Plan: CI

Reviewed By: samestep

Differential Revision: D27128457

Pulled By: walterddr

fbshipit-source-id: 5404b542bc1a14c6f6c4d8586c1643c8c65e6d1f
2021-03-17 14:47:11 -07:00
8f61b13e80 [Pytorch Mobile] Optimize Non Forward for Mobile (#53314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53314

Introduces an API for optimizing non-forward functions for mobile. As of this diff, all functions that you specify for optimization will be preserved, and those functions will be run through canonical optimizations. The intention is to stack each further optimization onto separate diffs, since they touch multiple files and it would be a nightmare to review in one go.
ghstack-source-id: 123909414

Test Plan:
torch.utils.mobile_optimizer.optimize_for_mobile(net, methods_to_optimize=["forward", "foo"]) runs fine

torch.utils.mobile_optimizer.optimize_for_mobile(net, methods_to_optimize={"foo"}) optimizes just foo if the model doesn't define forward, otherwise optimizes foo and forward

torch.utils.mobile_optimizer.optimize_for_mobile(net, methods_to_optimize=["forward"]) runs fine

torch.utils.mobile_optimizer.optimize_for_mobile(net) runs fine if the model defines forward, throws otherwise

Reviewed By: kimishpatel

Differential Revision: D26618689

fbshipit-source-id: 5bff1fb3f3f6085c4a649a8128af9c10f0fa9400
2021-03-17 14:31:24 -07:00
407d60ee91 Upgrade actions/setup-python from v1 to v2 (#54202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54202

Test Plan: The lint and clang-format workflows in CI.

Reviewed By: janeyx99

Differential Revision: D27134223

Pulled By: samestep

fbshipit-source-id: 7f38240696e31f1a479e93f6b326b9d13e3ddf9c
2021-03-17 14:06:12 -07:00
cd776560d0 [vulkan] Add hardswish and hardsigmoid activations to Vulkan (#53362)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53362

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D27098430

Pulled By: SS-JIA

fbshipit-source-id: aa2edf2af4ebabe95dbc02d33ecff6f7c9f0953c
2021-03-17 14:00:34 -07:00
957700be7e Improved aten::to performance from inline cvr remote_request_only (#53800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53800

copy_impl improvement:

Before: 1732 ns
After:  1159 ns

remote_request_only:

Before: Milliseconds per iter: 1.24185. Iters per second: 805.252
        0.161477 ms. 13.5036%. aten::to (155 nodes)

After:  Milliseconds per iter: 1.14195. Iters per second: 875.696
        0.113893 ms. 10.339%. aten::to (155 nodes)

Test Plan: buck test caffe2:ATen-core-test

Reviewed By: ajyu

Differential Revision: D26967349

fbshipit-source-id: d8f8dc5e8e3df1cec57fa098b21119ec9568e4a5
2021-03-17 13:54:18 -07:00
e442d5c8a5 Disallow CUDA RPC to use new devices in output tensors (#54024)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54024

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27059108

Pulled By: mrshenli

fbshipit-source-id: 1997ce8b130220786883b54c8a32e99989f70f22
2021-03-17 13:44:15 -07:00
8cc06e3ca3 Disable CUDA RPC tests that use new device in user-function outputs (#54023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54023

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27059107

Pulled By: mrshenli

fbshipit-source-id: e878511942f2e2577b2f0b8e7711d70582537851
2021-03-17 13:41:50 -07:00
79534867ac Migrate about 100 kernel to C10 full dispatcher (#54109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54109

Codemod command generated by https://github.com/pytorch/pytorch/pull/54098

ghstack-source-id: 124114894

Test Plan: CI

Reviewed By: smessmer

Differential Revision: D27100359

fbshipit-source-id: 8338405274a2a020856af6e4a35a2fb21438f2a8
2021-03-17 13:35:39 -07:00
fd5c1123e4 wrap AliasDb in Python (#51336)
Summary:
Also added a wrapper for tlemo's graphviz export-to-string.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51336

Reviewed By: ezyang

Differential Revision: D26150809

Pulled By: eellison

fbshipit-source-id: 9beafce5cbdc1785b986b71c3cd986c1087faa11
2021-03-17 12:55:22 -07:00
2e7311ef25 First step to refactoring S3 reading logic (#53755)
Summary:
This is an initial attempt in refactoring and consolidating our S3 read logic for print_test_stats.py, test_history.py, and run_test.py. This way, boto3 and botocore do not need to be imported in various places throughout the code base, and duplicated logic (such as the many type definitions) can exist in one place: `tools/stat_utils/s3_stat_parser.py`. walterddr contributed to this PR by moving print_test_stats.py to the tools folder and the corresponding tests a subfolder within tools.

**NOTE: this removes those tests from CI, as the new `tools/test/test_stats.py` is not in the test/ directory like the other tests listed in TESTS in run_test.py.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53755

Test Plan:
This refactoring change should not break anything, so running the files as before should work as they did previously.
To make sure that print_test_stats.py still functions: run `python tools/test/test_stats.py` and make sure all tests pass.
To make sure that test_history.py works, run the example commands from `tools/test_history.py --help` and check that their output matches that shown. Note that the script will continue printing for a while, so don't be alarmed.

Some next steps:
- Actually coming up with similarities among the three current use cases and further refactoring/consolidating of functions (e.g., combining simplify and get_cases)
- Moving more parsing logic to s3_stat_parser.py to have better abstraction between our files
- Adding tests for s3_stat_parser.py when there is more functionality in it

Reviewed By: agolynski, samestep

Differential Revision: D27030285

Pulled By: janeyx99

fbshipit-source-id: e664781324ef7c0c30943bfd7f17c895075ef7a7
2021-03-17 12:38:09 -07:00
ccdcfba5de [caffe2] Refactor tensor serialization function (#53404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53404

This refactors `TensorSerializer::Serialize()` so that we have a separate
helper function for each data type.

This should make it slightly easier in the future to add new serialization
formats for specific data types.
ghstack-source-id: 124085413

Test Plan:
Confirmed the existing tests pass.  This diff is not expected to have any
behavior changes.

Reviewed By: mraway, glamtechie

Differential Revision: D26658204

fbshipit-source-id: 232776262db6486ba845a7ba223e3987053dac27
2021-03-17 12:36:31 -07:00
a2a7179695 Fix bug in assertRaises NotImplemented handling when no exception is thrown (#54126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54126

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: agolynski, mruberry

Differential Revision: D27109510

Pulled By: ezyang

fbshipit-source-id: ba5a4de85ca00f81724f3d4e645797e8f32aa3b1
2021-03-17 12:30:51 -07:00
7e7533b2e2 Delete denseTypeIdWithDefault and toDense (#54016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54016

I managed to convince myself that typeIdWithDefault was sufficient for
the sparse constructor case.  Here is the reasoning.

The surface reading of the use site of denseTypeIdWithDefault is
to convert what could be a sparse dispatch key into the dense version
so we can properly allocate underlying dense tensors for the sparse
constructor call.  But WHERE does this dispatch key come from?
Inspection of call sites reveals that dispatch key is provided by
torch::tensors::get_default_dispatch_key().  This key is NEVER
sparse, as that would correspond to setting sparse tensors to be
the default tensor via torch.set_default_tensor_type() (which is
forbidden, and even if it worked most of everything in PyTorch would
break).  That means that typeIdWithDefault is a sufficient replacement.

With denseTypeIdWithDefault removed, we can also delete toDense
as this was the sole use of that function.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27109511

Pulled By: ezyang

fbshipit-source-id: c698eff0ab54c0c101fe9f55be3b7657584c4372
2021-03-17 12:28:55 -07:00
f30a7a2739 Add export-historic-test-times option to dump S3 test times into a JSON file (#54083)
Summary:
This will allow for future work to use the test times file (which will save computation time and also allow for more consistency). (Step one to fixing https://github.com/pytorch/pytorch/issues/53882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54083

Test Plan:
export CIRCLE_JOB=your-favorite-circleci-job e.g., pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2
`python test/run_test.py --export-historic-test-times` OR
`python test/run_test.py --export-historic-test-times .your-favorite-file`

When opening either .pytorch-test-times or .your-favorite-file, you should see something like:
```
{"commit": "2d559a09392aabb84dfb4a498010b2f01d99818c", "job_times": {"distributed/test_distributed_spawn": 583.5889999999973, "distributed/test_data_parallel": 4.866999999999997, "test_binary_ufuncs": 171.1569999999998, "test_numpy_interop": 2.5649999999999995, "test_public_bindings": 0.011,...}}
```

Note that no tests will be run when this option is specified.

Reviewed By: walterddr

Differential Revision: D27091351

Pulled By: janeyx99

fbshipit-source-id: e191d739268d86de0a0ba0eea0006969859d1940
2021-03-17 12:22:00 -07:00
7367bca066 [nnc] Tests for proposed feature: loop bounds conditional simplification (#54121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54121

It would be nice to do range analysis to determine if a condition
cannot be satisfied.  These are some tests that we should be able to turn on
once we have this feature.
ghstack-source-id: 124116847

Test Plan: Simplify.*LoopBounds

Reviewed By: ZolotukhinM

Differential Revision: D27107956

fbshipit-source-id: bb27e3d3bc803f0101c416e4a351ba2278684980
2021-03-17 11:01:10 -07:00
a852fdb6b5 [nnc] Test for using int64 dimensions (#54094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54094

We should be able to use 64-bit integers for loop boundaries and
buffer/tensor indexing.
ghstack-source-id: 124116846

Test Plan: New tests, disabled

Reviewed By: ZolotukhinM

Differential Revision: D27094934

fbshipit-source-id: a53de21a0ef523ea3560d5dd4707df50624896ef
2021-03-17 10:59:26 -07:00
0806126aad [fx][trivial] Add TestConstFold coverage to test_fx (#54072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54072

att

Test Plan: Adding coverage

Differential Revision: D27085591

fbshipit-source-id: 8c5ea5a52be619249f23a938ddb0a3aed1ada0f7
2021-03-17 10:38:54 -07:00
91747a5e93 add tests for ddp with activation check points (#52894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52894

Add two success cases and two failure cases for DDP with activation checkpointing, when grad_as_bucket_view = true and false.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26679895

fbshipit-source-id: a6f6cb22b4903ed8b1f7b8ed4fe8b13e102d8c21
2021-03-17 10:16:20 -07:00
ce40ff5c64 Avoid DDP race condition with find_unused_parameters=True when all params are used (#53160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53159.

See comments for a description of the race condition. Thanks to ptrblck xwang233 and especially zasdfgbnm for lots of help isolating the problem and discussing the fix.

PRing for discussion. We can try to concoct a dedicated test for the problem if you want. The ingredients are:
- DDP(..., find_unused_parameters=True)
- Use all the DDP-ed model's params in forward such that the "lazy local used work wait()" path will be taken in backward
- Queue up a lot of asynchronous dummy work just before backward(), so stream work gets pushed far into the future relative to CPU work

Benchmark:
BERT model, with find_unused_parameters=true, latency (sec) per iteration, P50: trunk 1.265 sec; this PR 1.263 sec; with a blocking copy added before calling local_used_.fill(i), 1.236 sec
BERT model, with find_unused_parameters=false, latency (sec) per iteration, P50: trunk 1.00 sec; this PR 1.026 sec
ResNet50 model: accuracy also matches trunk with find_unused_parameters=true and false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53160

Reviewed By: albanD

Differential Revision: D26916766

Pulled By: zhaojuanmao

fbshipit-source-id: 3e0ed91b7b5c42e2f2c82e12d4d2940fdc89e023
2021-03-17 10:08:22 -07:00
4fac72ee9d [fix] Dimension out of range in pixel_shuffle / pixel_unshuffle (#54086)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54051

The problem was the application of the unary minus operator to an unsigned type. Positive indices are now used to build the permutation array for both `pixel_shuffle` and `pixel_unshuffle`.
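
For context, a quick round trip through the two ops (the shapes here are illustrative):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 2, 3)  # (N, C*r^2, H, W) with r = 2
y = F.pixel_shuffle(x, 2)    # -> (N, C, H*r, W*r) == (1, 2, 4, 6)
z = F.pixel_unshuffle(y, 2)  # inverse op restores the original shape
print(torch.equal(x, z))     # True -- both ops only permute elements
```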

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54086

Reviewed By: agolynski

Differential Revision: D27093435

Pulled By: jbschlosser

fbshipit-source-id: 4062f71277d037e91dc3cf5835b29b8ed4d16607
2021-03-17 09:26:47 -07:00
4a24c552cc [PyTorch] Fix string copy in WARN path for both interpreters (#54076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54076

If we don't constrain ourselves to use `torch::jit::pop`, we can avoid copying a string or moving IValues around.
ghstack-source-id: 124040891

Test Plan:
existing tests

spot-checked regular interpreter assembly; seems better

Reviewed By: dhruvbird, walterddr

Differential Revision: D27087204

fbshipit-source-id: 7cf355dbcec31409bdb37afa09d7df85cf2a7e4b
2021-03-17 08:44:08 -07:00
8f1af02f35 [PyTorch][mobile] Audit mobile interpreter for extra copies (#54031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54031

Similar to D27060762 (665d5e2a4f), caught some probably-unintended copies.
ghstack-source-id: 124040889

Test Plan: CI?

Reviewed By: walterddr, iseeyuan

Differential Revision: D27061818

fbshipit-source-id: f4a77cb5c21cd3ebce7b7e82764e4361467bab91
2021-03-17 08:42:34 -07:00
ce15f312a8 [PyTorch] Align function parameters across declaration and definition for max pool 2d (#54105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54105

This is preparing XNNPACK to be enabled on Windows. For some reason Windows clang doesn't consider functions taking `float` and `const float` to have the same signature, and thus throws link errors like:
```
lld-link: error: undefined symbol: bool __cdecl at::native::xnnpack::use_max_pool2d(class at::Tensor const &, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, bool, float, float)
>>> referenced by C:\open\fbsource\buck-out\gen\f84e6a81\xplat\caffe2\pt_ops_full_template_registration\aten\src\ATen\native\Pooling.cpp:127
>>>               libpt_ops_fullWindows.lib(out.obj):(class at::Tensor __cdecl at::native::max_pool2d(class at::Tensor const &, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, bool))

lld-link: error: undefined symbol: class at::Tensor __cdecl at::native::xnnpack::max_pool2d(class at::Tensor const &, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, bool, float, float)
>>> referenced by C:\open\fbsource\buck-out\gen\f84e6a81\xplat\caffe2\pt_ops_full_template_registration\aten\src\ATen\native\Pooling.cpp:129
>>>               libpt_ops_fullWindows.lib(out.obj):(class at::Tensor __cdecl at::native::max_pool2d(class at::Tensor const &, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, class c10::ArrayRef<__int64>, bool))
```

Declaration: `src/ATen/native/xnnpack/Engine.h`
Definition: `src/ATen/native/xnnpack/MaxPooling.cpp`
Reference: `src/ATen/native/Pooling.cpp`

Test Plan: build succeeded

Reviewed By: kimishpatel

Differential Revision: D27097201

fbshipit-source-id: ab557f608713840ee0a65b252fa875624ddd502f
2021-03-17 08:23:05 -07:00
527c1e0e37 [iOS GPU] Remove unnecessary texture size changing (#54108)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54108

Clean up the hardcode for 2d tensors.
ghstack-source-id: 124113406

Test Plan:
```
2021-03-16 00:27:31.280761-0700 PyTorchPlayground[16024:6249832] [bool test_view()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.282833-0700 PyTorchPlayground[16024:6249832] [bool test_view2()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.285320-0700 PyTorchPlayground[16024:6249832] [bool test_view3()],[5 8 ],[SUCCEED]
2021-03-16 00:27:31.286929-0700 PyTorchPlayground[16024:6249832] [bool test_view4()],[5 8
```
- Sandcastle CI
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27075236

fbshipit-source-id: 1005fd82f4a75603697579a191d3acc6fc1bd690
2021-03-17 01:03:56 -07:00
e579b39b9e [iOS GPU] Implement view and reshape in metal shaders (#54107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54107

The current implementation doesn't change the underlying texture's shape. This diff converts MPSImage from one shape to the other. The way we do it is to implement this as an elementwise kernel: we have a thread grid of size (N2, C2, H2, W2) with a thread for each output element, and we compute the "linear index" of the output element and convert it to the equivalent "linear index" of the input element. This is known as sub2ind/ind2sub conversion in MATLAB, ravel_multi_index in numpy, etc. a08841a8e1/cupy/indexing/generate.py (L301-L304) is a clean generic version of ind2sub.
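
A tiny numpy demonstration of the sub2ind/ind2sub round trip the kernel performs per element:

```
import numpy as np

shape = (2, 10, 2, 2)  # an (N, C, H, W) texture shape
idx = np.ravel_multi_index((1, 3, 0, 1), shape)  # sub -> linear index (sub2ind)
print(idx)                                       # 53
print(np.unravel_index(idx, shape))              # (1, 3, 0, 1) (ind2sub)
```
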
ghstack-source-id: 124113407

Test Plan:
```
2021-03-16 00:27:31.280761-0700 PyTorchPlayground[16024:6249832] [bool test_view()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.282833-0700 PyTorchPlayground[16024:6249832] [bool test_view2()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.285320-0700 PyTorchPlayground[16024:6249832] [bool test_view3()],[5 8 ],[SUCCEED]
2021-03-16 00:27:31.286929-0700 PyTorchPlayground[16024:6249832] [bool test_view4()],[5 8 ],[SUCCEED]
```
- Sandcastle CI
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D27074719

fbshipit-source-id: 445f55fefeb9cc7b3eeab106b6d567facef58343
2021-03-17 01:03:53 -07:00
2e8a9d2bfe [iOS GPU] Support multi-dimension tensors via MPSImage (#54106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54106

The texture size will always be four dimensional. For higher dim tensors, we just fold them to the batch dim.
ghstack-source-id: 124113408

Test Plan:
- Sandcastle CI
- CircleCI
- Metal Unit tests

```
2021-03-16 00:27:30.407417-0700 PyTorchPlayground[16024:6249832] [bool test_synchronization()],[1 3 2 2 ],[SUCCEED]
2021-03-16 00:27:30.440521-0700 PyTorchPlayground[16024:6249832] [bool test_nchw_to_nc4_cpu()],[8 2 154 299 ],[SUCCEED]
2021-03-16 00:27:30.478765-0700 PyTorchPlayground[16024:6249832] [bool test_nchw_to_nc4_cpu()],[11 9 25 319 ],[SUCCEED]
2021-03-16 00:27:30.668841-0700 PyTorchPlayground[16024:6249832] [bool test_nchw_to_nc4_cpu()],[12 14 281 86 ],[SUCCEED]
2021-03-16 00:27:30.820580-0700 PyTorchPlayground[16024:6249832] [bool test_nchw_to_nc4_cpu()],[13 3 308 264 ],[SUCCEED]
2021-03-16 00:27:30.863287-0700 PyTorchPlayground[16024:6249832] [bool test_nchw_to_nc4_cpu()],[8 2 281 213 ],[SUCCEED]
2021-03-16 00:27:30.870941-0700 PyTorchPlayground[16024:6249832] [bool test_copy_nchw_to_metal()],[1 3 224 224 ],[SUCCEED]
2021-03-16 00:27:30.881768-0700 PyTorchPlayground[16024:6249832] [bool test_conv2d()],[4 2 10 258 ],[SUCCEED]
2021-03-16 00:27:30.916943-0700 PyTorchPlayground[16024:6249832] [bool test_conv2d()],[7 9 68 111 ],[SUCCEED]
2021-03-16 00:27:31.011515-0700 PyTorchPlayground[16024:6249832] [bool test_conv2d()],[4 25 186 246 ],[SUCCEED]
2021-03-16 00:27:31.018628-0700 PyTorchPlayground[16024:6249832] [bool test_conv2d()],[5 5 291 25 ],[SUCCEED]
2021-03-16 00:27:31.070833-0700 PyTorchPlayground[16024:6249832] [bool test_conv2d()],[2 38 178 109 ],[SUCCEED]
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0316 00:27:31.076831 1843703808 TensorImpl.h:965] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
2021-03-16 00:27:31.094476-0700 PyTorchPlayground[16024:6249832] [bool test_depthwiseConv()],[1 32 112 112 ],[SUCCEED]
2021-03-16 00:27:31.097782-0700 PyTorchPlayground[16024:6249832] [bool test_max_pool2d()],[1 3 4 4 ],[SUCCEED]
2021-03-16 00:27:31.109290-0700 PyTorchPlayground[16024:6249832] [bool test_max_pool2d_ceil()],[1 96 55 55 ],[SUCCEED]
2021-03-16 00:27:31.112203-0700 PyTorchPlayground[16024:6249832] [bool test_relu()],[1 3 4 4 ],[SUCCEED]
2021-03-16 00:27:31.116675-0700 PyTorchPlayground[16024:6249832] [bool test_addmm()],[5 120 105 ],[SUCCEED]
2021-03-16 00:27:31.119392-0700 PyTorchPlayground[16024:6249832] [bool test_addmm()],[6 6 84 ],[SUCCEED]
2021-03-16 00:27:31.122741-0700 PyTorchPlayground[16024:6249832] [bool test_addmm()],[5 110 38 ],[SUCCEED]
2021-03-16 00:27:31.125273-0700 PyTorchPlayground[16024:6249832] [bool test_addmm()],[8 116 90 ],[SUCCEED]
2021-03-16 00:27:31.128231-0700 PyTorchPlayground[16024:6249832] [bool test_addmm()],[5 92 123 ],[SUCCEED]
2021-03-16 00:27:31.132546-0700 PyTorchPlayground[16024:6249832] [bool test_add()],[1 180 12 12 ],[SUCCEED]
2021-03-16 00:27:31.138931-0700 PyTorchPlayground[16024:6249832] [bool test_add_broadcast()],[2 17 58 67 ],[SUCCEED]
2021-03-16 00:27:31.145191-0700 PyTorchPlayground[16024:6249832] [bool test_add_broadcast2()],[2 17 1 67 ],[SUCCEED]
2021-03-16 00:27:31.174218-0700 PyTorchPlayground[16024:6249832] [bool test_sub()],[5 3 167 222 ],[SUCCEED]
2021-03-16 00:27:31.182838-0700 PyTorchPlayground[16024:6249832] [bool test_sub_broadcast()],[3 1 1 ],[SUCCEED]
2021-03-16 00:27:31.205262-0700 PyTorchPlayground[16024:6249832] [bool test_sub_broadcast2()],[2 3 3 192 192 ],[SUCCEED]
2021-03-16 00:27:31.227730-0700 PyTorchPlayground[16024:6249832] [bool test_mul()],[2 7 262 119 ],[SUCCEED]
2021-03-16 00:27:31.244125-0700 PyTorchPlayground[16024:6249832] [bool test_mul_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-16 00:27:31.250476-0700 PyTorchPlayground[16024:6249832] [bool test_mul_broadcast2()],[1 3 192 192 ],[SUCCEED]
2021-03-16 00:27:31.254482-0700 PyTorchPlayground[16024:6249832] [bool test_div()],[1 3 24 24 ],[SUCCEED]
2021-03-16 00:27:31.258273-0700 PyTorchPlayground[16024:6249832] [bool test_div_broadcast()],[4 3 24 24 ],[SUCCEED]
2021-03-16 00:27:31.259873-0700 PyTorchPlayground[16024:6249832] [bool test_div_broadcast2()],[1 3 24 24 ],[SUCCEED]
2021-03-16 00:27:31.269028-0700 PyTorchPlayground[16024:6249832] [bool test_t()],[109 196 ],[SUCCEED]
2021-03-16 00:27:31.271374-0700 PyTorchPlayground[16024:6249832] [bool test_t()],[82 227 ],[SUCCEED]
2021-03-16 00:27:31.273238-0700 PyTorchPlayground[16024:6249832] [bool test_t()],[33 175 ],[SUCCEED]
2021-03-16 00:27:31.275284-0700 PyTorchPlayground[16024:6249832] [bool test_t()],[13 226 ],[SUCCEED]
2021-03-16 00:27:31.277017-0700 PyTorchPlayground[16024:6249832] [bool test_t()],[7 153 ],[SUCCEED]
2021-03-16 00:27:31.280761-0700 PyTorchPlayground[16024:6249832] [bool test_view()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.282833-0700 PyTorchPlayground[16024:6249832] [bool test_view2()],[1 10 2 2 ],[SUCCEED]
2021-03-16 00:27:31.285320-0700 PyTorchPlayground[16024:6249832] [bool test_view3()],[5 8 ],[SUCCEED]
2021-03-16 00:27:31.286929-0700 PyTorchPlayground[16024:6249832] [bool test_view4()],[5 8 ],[SUCCEED]
2021-03-16 00:27:31.515716-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim0()],[3 9 221 193 ],[SUCCEED]
2021-03-16 00:27:31.520599-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim0_nonarray()],[1 3 90 77 ],[SUCCEED]
2021-03-16 00:27:32.122259-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim1_0()],[4 10 271 333 ],[SUCCEED]
2021-03-16 00:27:32.618431-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim1_1()],[3 11 271 333 ],[SUCCEED]
2021-03-16 00:27:32.621299-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim1_nonarray_0()],[1 3 22 33 ],[SUCCEED]
2021-03-16 00:27:32.626100-0700 PyTorchPlayground[16024:6249832] [bool test_cat_dim1_nonarray_1()],[1 9 53 67 ],[SUCCEED]
2021-03-16 00:27:32.630042-0700 PyTorchPlayground[16024:6249832] [bool test_softmax()],[2 3 1 1 ],[SUCCEED]
2021-03-16 00:27:32.632536-0700 PyTorchPlayground[16024:6249832] [bool test_sigmoid()],[1 3 4 4 ],[SUCCEED]
2021-03-16 00:27:32.636125-0700 PyTorchPlayground[16024:6249832] [bool test_hardsigmoid()],[3 3 44 44 ],[SUCCEED]
2021-03-16 00:27:32.638887-0700 PyTorchPlayground[16024:6249832] [bool test_hardswish()],[3 3 44 44 ],[SUCCEED]
2021-03-16 00:27:32.646802-0700 PyTorchPlayground[16024:6249832] [bool test_upsampling_nearest2d_vec()],[1 48 24 24 ],[SUCCEED]
2021-03-16 00:27:32.650445-0700 PyTorchPlayground[16024:6249832] [bool test_adaptive_avg_pool2d()],[1 48 24 24 ],[SUCCEED]
2021-03-16 00:27:32.667118-0700 PyTorchPlayground[16024:6249832] [bool test_hardtanh_()],[1 32 112 112 ],[SUCCEED]
2021-03-16 00:27:32.669041-0700 PyTorchPlayground[16024:6249832] [bool test_reshape()],[1 1280 1 1 ],[SUCCEED]
```

Reviewed By: SS-JIA

Differential Revision: D27033569

fbshipit-source-id: 0b140a76e0ae2b27b57c0c9efb34a5fa03793c59
2021-03-17 01:02:17 -07:00
0c8f16622b [Caffe2] Rework CAFFE_ENFORCE_THAT (#53303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53303

The old code did a heap allocation unnecessarily and was a
little convoluted. I think that it was structured that way to avoid
double-evaluating arguments; I just forced them to be evaluated once
as though they were passed to a function by binding const references
to them.
ghstack-source-id: 123918262

Test Plan:
1) `buck run mode/opt-clang //caffe2/caffe2/fb/tests:logging_bench`

Before:
```
============================================================================
caffe2/caffe2/fb/tests/logging_bench.cpp        relative  time/iter  iters/s
============================================================================
glog_CHECK                                                   2.01ns  498.63M
caffe2_ENFORCE_GE                                 50.00%     4.01ns  249.31M
glog_CHECK_GE                                     17.39%    11.53ns   86.73M
fbcode_ENFORCE                                   100.00%     2.01ns  498.65M
caffe2_ENFORCE                                   100.00%     2.01ns  498.63M
caffe2_ENFORCE_THAT                               50.00%     4.01ns  249.33M
============================================================================
```

After:
```
============================================================================
caffe2/caffe2/fb/tests/logging_bench.cpp        relative  time/iter  iters/s
============================================================================
glog_CHECK                                                   2.01ns  498.63M
caffe2_ENFORCE_GE                                 97.44%     2.06ns  485.88M
glog_CHECK_GE                                     17.39%    11.53ns   86.73M
fbcode_ENFORCE                                   100.00%     2.01ns  498.65M
caffe2_ENFORCE                                   100.00%     2.01ns  498.65M
caffe2_ENFORCE_THAT                               97.28%     2.06ns  485.06M
============================================================================
```

Looks like about a 1.94x speedup!

2) Inspect generated assembly for logging_bench.cpp before & after by:
```
$ compile-commands caffe2/caffe2/fb/tests/logging_bench.cpp -f "mode/opt-clang"
$ jq -r '.[0].arguments | sh' < compile_commands.json | sed -e "s/'-c'/'-S'/g" | sed -E -e "s/'-g[12]'/'-g0'/g" > out.sh
$ sh out.sh
```

Then diff logging_bench.s as you like.

Before: P255408666
After: P277883307

Net about 1500 lines deleted from the assembly. We can see that the
happy path (which the benchmark tests) no longer contains string
creation.

Reviewed By: dzhulgakov

Differential Revision: D26829714

fbshipit-source-id: 6e11f8ea29292ae3d9f2cc89d08afcb06f7d39c9
2021-03-16 23:01:00 -07:00
11a135ec82 Remove _th_take (#52665)
Summary:
These definitions of TH functions were left in the codebase after they were ported to ATen in https://github.com/pytorch/pytorch/pull/45283 and https://github.com/pytorch/pytorch/pull/45430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52665

Reviewed By: mruberry

Differential Revision: D26655236

Pulled By: ailzhang

fbshipit-source-id: eb106b72dfb814bd1fb4d240a1ede621ef4261b2
2021-03-16 22:56:35 -07:00
04d5278cb6 [Static Runtime] Only run ReplaceWithCopy pass when enable_out_variant is true (#54111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111

If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.

Reviewed By: edvgha

Differential Revision: D27036077

fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
2021-03-16 22:06:33 -07:00
fb7bab97c4 Automated submodule update: FBGEMM (#53947)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a7fd8fba11

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53947

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D27031755

fbshipit-source-id: d4cc9a791d4b9908f993a950c539bcbd988bde8b
2021-03-16 17:31:26 -07:00
b936abd840 fix nest openmp performance bug in thnn_conv2d (#52577)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52577

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27063800

Pulled By: VitalyFedyunin

fbshipit-source-id: 000e17b722b2b1d48e1012b3fa222729e26777fb
2021-03-16 17:10:53 -07:00
252916ab61 Update TensorPipe submodule (#54070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54070

Test Plan: Export to CircleCI

Reviewed By: mrshenli

Differential Revision: D27084375

fbshipit-source-id: 9e67916ad5abf91ccb62f8cbce6197e1e7fbc8d6
2021-03-16 17:05:56 -07:00
c4f50162be [typing] suppress errors in fbcode/caffe2 - batch 2
Test Plan: Sandcastle

Differential Revision: D27082725

fbshipit-source-id: a920b4eb62ff07d8e80fa2b9e3fd340cb44b689f
2021-03-16 16:45:41 -07:00
8533a485ea Fix SIGSEGV in CudaIPCTypes.cpp. (#53080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53080

As described in https://github.com/pytorch/pytorch/issues/51619,
ProcessGroupShareTensorTest was failing due to segfaults in CudaIPCTypes.cpp.
There were two issues that had to be fixed for this:

1. The ref_counter_files_ map was looked up and the result was used without
checking whether or not the appropriate key existed in the map. This would
result in default construction in the map if the key didn't exist, resulting in
a nullptr being stored in the map.
2. ~CudaIPCSentData uses the global cuda_ipc_global_entities variable. But as
part of destroying cuda_ipc_global_entities, ~CudaIPCSentData is called which
accesses an already destroyed cuda_ipc_global_entities. This is now avoided by
clearing all shared blocks in ~CudaIPCGlobalEntities to ensure they are all
cleaned up before the destructor exits.

Closes: https://github.com/pytorch/pytorch/issues/51619
ghstack-source-id: 122812319

Test Plan: Run `python test/distributed/test_c10d_spawn.py -v ProcessGroupShareTensorTest`

Reviewed By: VitalyFedyunin

Differential Revision: D26742332

fbshipit-source-id: 6de4c4533f5bca673e6e171af32d034bd6ade5bb
2021-03-16 16:39:40 -07:00
dc070605f1 TST Replaces assertEqualIgnoreTypes with assertEqual in test_indexing (#53115)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38095 and https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53115

Reviewed By: mruberry

Differential Revision: D27086086

Pulled By: VitalyFedyunin

fbshipit-source-id: 7a6af6bcf3d7ce9ba96d47a24a40f451d00f0e67
2021-03-16 16:06:36 -07:00
4b00bce156 [Gradient Compression] Introduce fp16_compress_wrapper in ddp_comm_hooks.rst (#54052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54052

Introduce `fp16_compress_wrapper`, which can give some speedup on top of some gradient compression algorithms like PowerSGD.
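
A sketch of registering the wrapper, following the pattern documented in ddp_comm_hooks.rst (single-process gloo group so it runs standalone; address/port are placeholders):

```
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

ddp_model = DDP(torch.nn.Linear(8, 4))
state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
# Wrap PowerSGD so gradients are cast to fp16 before compression/communication.
ddp_model.register_comm_hook(state, default.fp16_compress_wrapper(powerSGD.powerSGD_hook))
```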

ghstack-source-id: 124001805

Test Plan: {F509205173}

Reviewed By: iseessel

Differential Revision: D27076064

fbshipit-source-id: 4845a14854cafe2112c0caefc1e2532efe9d3ed8
2021-03-16 15:40:10 -07:00
524cb0a514 [PyTorch Mobile] Dedup method names in bytecode serialization (#53677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53677

When serializing bytecode, we serialize it based on methods. It may happen that there are multiple instances of a class. In such a case, the methods inside the class may be serialized multiple times.

To reduce the duplication, we cache the qualified name of the methods, so that one method is serialized only once.

Test Plan: existing unittests and CI

Reviewed By: dhruvbird, raziel

Differential Revision: D26933945

Pulled By: iseeyuan

fbshipit-source-id: 8a9833949fa18f7103a5a0be19e2028040dc7717
2021-03-16 15:24:47 -07:00
282eefebf3 Delete defunct ComplexCPU/ComplexCUDA dispatch keys (#54013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54013

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27051837

Pulled By: ezyang

fbshipit-source-id: c2a20737b6bd4a1317905bafceb2d8cb39f37e76
2021-03-16 15:20:04 -07:00
4878415688 Make storage access error NotImplementedError (#53972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53972

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27036573

Pulled By: ezyang

fbshipit-source-id: 5cc7d9e124bd27ca4041feb56b5007d9408d622a
2021-03-16 15:20:01 -07:00
d47fd3df81 Compute type_equal() without reference to backend() (#53823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53823

Argument for correctness: type_equal previously compared whether backends
are equal.  Backend is computed by translation from the dispatch key.
I verified that computeDispatchKey never computed a weird
dispatch key (e.g., AutogradXLA), so that dispatchKeyToBackend
was effectively injective.  Then it is always valid to compare
the arguments of an injective function for equality, rather than
the output of the injective function.
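
A toy C++ illustration of the injectivity argument; the enums and function names are stand-ins, not the real dispatcher API:

```c++
// If f is injective, f(a) == f(b) exactly when a == b, so the dispatch
// keys can be compared directly, skipping the translation entirely.
enum class DispatchKey { CPU, CUDA };
enum class Backend { CPU, CUDA };

Backend toBackend(DispatchKey k) {  // assumed injective
  return k == DispatchKey::CPU ? Backend::CPU : Backend::CUDA;
}

bool type_equal_old(DispatchKey a, DispatchKey b) {
  return toBackend(a) == toBackend(b);  // compares f(a) and f(b)
}

bool type_equal_new(DispatchKey a, DispatchKey b) {
  return a == b;  // equivalent, without computing the backend
}
```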

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27036575

Pulled By: ezyang

fbshipit-source-id: 6aeafc89f287da0bc0065bd21c1adb5e272dbb81
2021-03-16 15:19:57 -07:00
3c457043fb Also propagate storage_access_should_throw_ when copying tensor metadata (#53816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53816

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D27036574

Pulled By: ezyang

fbshipit-source-id: 71e61b0aa3d46159c9af1112c262cbfa7eaa1879
2021-03-16 15:18:37 -07:00
665d5e2a4f [PyTorch][JIT] Audit interpreter for extra copies (#54029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54029

I found what appear to be some missed moves and/or extra copies in the JIT interpreter.
ghstack-source-id: 123958682

Test Plan:
Existing CI for correctness

Ran AdIndexer inline_cvr local_ro model benchmark with static_runtime off via
`env bin=/tmp/ptvsc2_predictor_bench.StaticDispatchModeFile static_runtime=0 caffe2=0 scripts/swolchok/static_runtime/inline_cvr/run_local_ro.sh`

before:
```
I0315 14:25:23.916893 3075680 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.01635. Iters per second: 983.914
I0315 14:26:05.536207 3080560 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.01689. Iters per second: 983.395
I0315 14:26:47.510561 3083335 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.02697. Iters per second: 973.737
I0315 14:27:29.024830 3086767 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.01326. Iters per second: 986.918
I0315 14:28:10.849496 3091323 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.023. Iters per second: 977.517
```

after:
```
I0315 14:17:43.280469 3046242 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 0.997838. Iters per second: 1002.17
I0315 14:18:24.244606 3046861 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.00173. Iters per second: 998.269
I0315 14:19:05.208899 3051998 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.00187. Iters per second: 998.136
I0315 14:19:46.103854 3055392 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 1.00073. Iters per second: 999.27
I0315 14:20:27.011411 3056062 PyTorchPredictorBenchLib.cpp:215] PyTorch run finished. Milliseconds per iter: 0.999121. Iters per second: 1000.88
```

(This was just a convenient workload I had handy; the plan of record is to use static runtime for inline_cvr inference AIUI.)

Reviewed By: dhruvbird, walterddr

Differential Revision: D27060762

fbshipit-source-id: 5567206d7c2d9ae99776ce5524caf09ec2035e87
2021-03-16 15:09:09 -07:00
ae154a8c2c various doc building cleanups (#53851)
Summary:
brianjo
- Add a javascript snippet to close the expandable left navbar sections 'Notes', 'Language Bindings', 'Libraries', 'Community'
- Fix two latex bugs that were causing output in the log that might have been misleading when looking for true doc build problems
- Change the way release versions interact with sphinx. I tested these via building docs twice: once with `export RELEASE=1` and once without.
  - Remove perl scripting to turn the static version text into a link to the versions.html document. Instead, put this where it belongs in the layout.html template. This is the way the domain libraries (text, vision, audio) do it.
  -  There were two separate templates for master and release, with the only difference between them is that the master has an admonition "You are viewing unstable developer preview docs....". Instead toggle that with the value of `release`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53851

Reviewed By: mruberry

Differential Revision: D27085875

Pulled By: ngimel

fbshipit-source-id: c2d674deb924162f17131d895cb53cef08a1f1cb
2021-03-16 15:01:59 -07:00
aa8714dfed [complex] torch.lerp: complex autograd support (#53689)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53689

Reviewed By: bdhirsh

Differential Revision: D27081150

Pulled By: anjali411

fbshipit-source-id: 06f96b6f67bac69ef56c12a12fc12499c2435641
2021-03-16 14:28:13 -07:00
c0fafcc766 Don't actually print anomalies in TTRR (#54078)
Summary:
This PR disables the bulk of the output for test time regression reporting, since it's obscuring more important signal (especially in cases where shards are shifting around).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54078

Test Plan:
```
python test/test_testing.py
```

Reviewed By: ezyang, walterddr

Differential Revision: D27088987

Pulled By: samestep

fbshipit-source-id: 06a4eeb75641552bad2ab4b9154a8c70c57b0d68
2021-03-16 14:26:32 -07:00
1f5b9170aa Faster backwards for cumsum and cumprod (#53711)
Summary:
Provides a faster backward formula for `cumprod` in the case when the input has zeros. This formula is non-differentiable, so we keep the previous formula for the cases when `at::GradMode::is_enabled()`.

This new formula gives up to 10x and 30x speed-ups on CPU and GPU respectively (see the benchmarks below).

The `cumsum` backward formula was rewritten so that no copies are necessary, and a double negation in the formula was removed. This gives a significant speed-up on CPU, while being almost as efficient as the formula with copies on GPU. We can see this speed-up when comparing the "No zeros" part of the benchmark.

Benchmarks:

nb. It is worth noting that the script tests the forward and the backward for `cumprod`, so the speed-ups should be even larger than those announced here.
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, size_prod, zeros, device):
    print(f"ndims: {ndims}, tensor_size: {size}, size_prod: {size_prod}, zeros: {zeros}, device: {device}")

    for dim in range(ndims):
        sizes = ndims * [size]
        sizes[dim] = size_prod
        tensor = torch.rand(*sizes, device=device)
        with torch.no_grad():
            if zeros:
                # Set 0.1 of them to zero
                p_drop = 0.1
                mask = torch.full_like(tensor, 1.0 - p_drop)
                tensor = tensor * torch.bernoulli(mask)
            else:
                tensor = tensor + 1e-3
        tensor.requires_grad_()
        grad = torch.ones_like(tensor)
        # We test both forward + backward, meaning that the speed-up is actually greater than reported
        # That being said, this is more realistic than doing `retain_graph=True`
        command = "torch.autograd.grad([tensor.cumprod(dim)], [tensor], grad_outputs=[grad])"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
        ipython.magic(f"timeit {command}")
    print()

for device, zeros in product([cuda, cpu], [True, False]):
    run_test(3, 300, 10, zeros, device)
    run_test(3, 300, 100, zeros, device)
    if device == cuda:
        run_test(3, 300, 300, zeros, device)
```

</details>

<details>
<summary>CPU This PR (some regression on small tensors, ~4x speed-up on large tensors)</summary>

```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cpu
28.2 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.8 ms ± 78.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cpu
414 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
428 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
382 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cpu
3.11 ms ± 9.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.83 ms ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.08 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cpu
92.2 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
101 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
87 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
</details>

<details>
<summary>CUDA This PR (7-30x speed-up)</summary>

```

Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cuda
1.46 ms ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.48 ms ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.93 ms ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cuda
10.5 ms ± 914 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.6 ms ± 509 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.7 ms ± 864 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 300, zeros: True, device: cuda
30.3 ms ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.6 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.2 ms ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cuda
248 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
252 µs ± 186 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
438 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cuda
2.1 ms ± 193 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.16 ms ± 380 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.59 ms ± 398 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 300, zeros: False, device: cuda
6.3 ms ± 857 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.39 ms ± 288 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.15 ms ± 233 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

</details>

<details>
<summary>CPU master</summary>

```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cpu
8.27 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.8 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.2 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cpu
1.53 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.95 s ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.86 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cpu
3.42 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.25 ms ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.34 ms ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cpu
104 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
117 ms ± 99.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
94.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

</details>

<details>
<summary>CUDA master</summary>

```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cuda
912 µs ± 431 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.05 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.74 ms ± 381 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cuda
71.3 ms ± 7.91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
85.4 ms ± 9.82 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
119 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ndims: 3, tensor_size: 300, size_prod: 300, zeros: True, device: cuda
646 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
776 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
917 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cuda
301 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
308 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
592 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cuda
2.61 ms ± 375 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.68 ms ± 524 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.38 ms ± 736 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

ndims: 3, tensor_size: 300, size_prod: 300, zeros: False, device: cuda
7.89 ms ± 848 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.03 ms ± 517 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.24 ms ± 405 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

</details>

cc nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53711

Reviewed By: jbschlosser

Differential Revision: D27059662

Pulled By: anjali411

fbshipit-source-id: be610d5590c0199b4412dff66fac47666faaff9d
2021-03-16 13:57:43 -07:00
6332fd6255 enable sc1090 and sc1091 (#54069)
Summary:
SC1090/SC1091 are important for catching accidental deletion or relocation of sourced utility shell scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54069

Test Plan: CI

Reviewed By: samestep

Differential Revision: D27084094

Pulled By: walterddr

fbshipit-source-id: 16deb83fce691eba0263978374564d172bc8d371
2021-03-16 12:59:55 -07:00
2c4a64589b fix mkldnn_add in-place behavior (#51687)
Summary:
There are the following two patterns for calling add in-place.

```python
torch.add(a, b, out=a) # (1) a in-placed
torch.add(a, b, out=b) # (2) b in-placed
```

If a and b are mkldnn Tensors, the result differs from the expected value in case (2).

**Sample code to reproduce the behavior:**

```python
import torch

torch.manual_seed(4)
a = torch.randn(4, 4)
b = torch.randn(4, 4)
b.fill_(1.0)

a_mkl = a.to_mkldnn()
b_mkl = b.to_mkldnn()

torch.add(b, a, alpha=1.0, out=a)
torch.add(b_mkl, a_mkl, alpha=1.0, out=a_mkl)

print(a)
print(a_mkl)
```

**Results:**

Actual:

```python
tensor([[ 0.0586,  2.2632,  0.8162,  1.1505],
        [ 1.1075,  0.7220, -1.6021,  1.6245],
        [ 0.1316,  0.7949,  1.3976,  1.6699],
        [ 0.9463,  1.0467, -0.7671, -1.1205]])
tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]], layout=torch._mkldnn)
```

Expected:

```python
tensor([[ 0.0586,  2.2632,  0.8162,  1.1505],
        [ 1.1075,  0.7220, -1.6021,  1.6245],
        [ 0.1316,  0.7949,  1.3976,  1.6699],
        [ 0.9463,  1.0467, -0.7671, -1.1205]])
tensor([[ 0.0586,  2.2632,  0.8162,  1.1505],
        [ 1.1075,  0.7220, -1.6021,  1.6245],
        [ 0.1316,  0.7949,  1.3976,  1.6699],
        [ 0.9463,  1.0467, -0.7671, -1.1205]], layout=torch._mkldnn)
```

This is because `dnnl::sum` called in `mkldnn_add` has the following specifications:

[oneDNN doc : Sum](https://oneapi-src.github.io/oneDNN/dev_guide_sum.html)

> The sum primitive supports in-place operation, meaning that the src0 tensor can be used as both input and output.
> In-place operation overwrites the original data. Using in-place operation requires the memory footprint of the
> output tensor to be either bigger than or equal to the size of the dst memory descriptor used for primitive creation.

In case (2), however, the in-placed tensor is passed as the second input rather than as src0.
So we modified the code to swap a and b before passing them to `sum` in case (2), so that the in-placed tensor becomes src0.
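
A minimal C++ sketch of the fix with placeholder types (not the real ATen/oneDNN code):

```c++
#include <utility>

// dnnl::sum supports in-place only when the destination aliases src0,
// so when `out` aliases the second operand the operands are swapped
// first (addition is commutative).
struct MklTensor { const void* storage; };

bool sameStorage(const MklTensor& x, const MklTensor& y) {
  return x.storage == y.storage;
}

// Stand-in for the dnnl::sum call (declaration only for this sketch).
void dnnlSum(const MklTensor& src0, const MklTensor& src1, MklTensor& out);

void mkldnnAddOut(MklTensor a, MklTensor b, MklTensor& out) {
  if (sameStorage(out, b) && !sameStorage(out, a)) {
    std::swap(a, b);  // make the in-placed tensor the first argument
  }
  dnnlSum(a, b, out);
}
```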

**Environment**
・CPU : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
・build USE_MKLDNN=1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51687

Reviewed By: jbschlosser

Differential Revision: D27062172

Pulled By: VitalyFedyunin

fbshipit-source-id: bf76d36f9fdb1b4337d71d87bcdbaf4edb11f12f
2021-03-16 12:54:27 -07:00
b27e678dfb [RELAND] [CUDA graphs] Private mempools for CUDA graphs (#54038)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51436.

Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038

Reviewed By: mruberry

Differential Revision: D27068649

Pulled By: ngimel

fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
2021-03-16 12:13:33 -07:00
bea3cb7069 remove aliasMultinomial decode from TH and THC (#52585)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52585

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26654125

Pulled By: albanD

fbshipit-source-id: 6a745080021623a2472dae7862cde91b949983ee
2021-03-16 09:43:56 -07:00
e8e570e9c5 [MacOS] Cross compile stub when building for M1 on x86 (#54046)
Summary:
Also rename `CROSS_COMPILE_ARM` to `CROSS_COMPILE_ARM64`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54046

Reviewed By: walterddr

Differential Revision: D27071928

Pulled By: malfet

fbshipit-source-id: 9143cd5d110ed67f0609f0a4bbb20922012ee665
2021-03-16 00:24:09 -07:00
2ecb2c7931 Pass Scalar by reference (#53583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53583

`Scalar` takes 32 bytes because `c10::complex<double>`
requires 16-byte alignment. Passing Scalar by reference
shows about a 1% improvement in instruction count.
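
A compilable back-of-the-envelope check of the size claim, using a stand-in type rather than the real `c10::Scalar` layout:

```c++
#include <complex>
#include <cstdint>

// A 16-byte-aligned complex<double> payload plus a tag pads the struct
// to 32 bytes, so by-value calls copy 32 bytes per argument while
// const-ref calls pass a single pointer.
struct alignas(16) ScalarLike {
  std::complex<double> payload;  // 16 bytes, needs 16-byte alignment
  std::int64_t tag;              // 8 bytes + 8 bytes of tail padding
};

static_assert(sizeof(ScalarLike) == 32, "padded up to the alignment");

std::int64_t byValue(ScalarLike s) { return s.tag; }            // copies 32 B
std::int64_t byConstRef(const ScalarLike& s) { return s.tag; }  // one pointer
```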

All the changes in this commit are codemoded except for
the following 4 files (which code-gen signatures):
```
tools/codegen/api/cpp.py
tools/codegen/api/native.py
tools/codegen/api/structured.py
caffe2/contrib/aten/gen_op.py
```

# Codemode

## Main Step

For the codemod part, here is the main command used:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
```

As you can tell, it codemods both `Scalar` and `optional<Scalar>`. Apply these commands iteratively until reaching a fix-point (since one method signature might contain multiple `Scalar` parameters).

In retrospect, excluding `third_party` and `torch/csrc/jit` would have been a good idea. (I reverted those manually later; see https://github.com/pytorch/pytorch/pull/53479 as a reference).

## Pre-Step

Prior to applying the main command, note that some `Scalar` occurrences appear as `at::Scalar` or `c10::Scalar`, so I codemodded some of them in advance. Here is an incomplete list:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
```

## Fixup
There are a couple of post-codemod fixups. For example, `const Scalar` will be codemoded into `const const Scalar&`, and `at::Scalar` will be codemoded into `at::const Scalar&` (if the pre-step is not done comprehensively). Here is an incomplete list:
```
fastmod --extensions cpp 'const const Scalar' 'const Scalar'
fastmod --extensions h 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod --extensions cpp 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod 'at::const Scalar&' 'const at::Scalar&'
```

## Supplementary

`cu` and `mm` files also need to be codemoded, for example:

```
fastmod --extensions cu 'at::const Scalar&' 'const at::Scalar&'
fastmod --extensions mm '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
```

Function pointers are not codemoded. Here is an incomplete list:

```
# Cover case: using index_fill_fn = void(*)(TensorIterator & iter, int64_t dim, int64_t self_dim_size, int64_t self_dim_stride, Scalar source);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'

# Cover case: using softplus_fn = void (*)(TensorIterator&, Scalar, Scalar);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions cpp '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)optional<Scalar>([, \)])' '${1}const optional<Scalar>&${2}'
```

Some corner cases need to be fixed manually.

ghstack-source-id: 123970306

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D26904445

fbshipit-source-id: 8d8a002af4b5125f153a32f03c6956be7ae5671d
2021-03-15 23:17:06 -07:00
4dd1c72dde Treat Scalar parameter as if it is constant (#53582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53582

We will pass `Scalar` by reference in the following commit,
i.e. `const Scalar&`.
ghstack-source-id: 123965970

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D26904444

fbshipit-source-id: 7f58ee4e38dcd860f0d1120cab4e82f35ca3770f
2021-03-15 23:15:27 -07:00
603097be18 OneDNN MaxPooling: reduce memory use for inference path (#52728)
Summary:
For OneDNN MaxPooling training, indices are saved as a workspace for the backward pass, but for inference, indices are not necessary. This PR adds a check to avoid saving indices, reducing memory use on the inference path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52728

Reviewed By: jbschlosser

Differential Revision: D27062435

Pulled By: VitalyFedyunin

fbshipit-source-id: 9e70268a8ba491a7914b980079c0945d753cd4f3
2021-03-15 21:53:05 -07:00
2c5579702a [PyTorch Mobile] Add module size to logged metadata (#53578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53578

We want to be able to log the loaded module size to the scuba table `qpl_metrics/pytorch`. Hence, adding the `model_size` field to the logged metadata when logging a module load success event.

ghstack-source-id: 123980964

Test Plan: xcheng16 How should this be tested?

Reviewed By: xcheng16, raziel

Differential Revision: D26902971

fbshipit-source-id: a7c2e9120706bd31f76f6572c8503d4acf8a89e2
2021-03-15 21:11:36 -07:00
08f04c0db2 Test forward reference annotations (#53713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53713

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26946847

Pulled By: ansley

fbshipit-source-id: 2f99247c4b54ee06dcb54b23fdcee3537643cad4
2021-03-15 19:40:26 -07:00
ce2f71836c Disabling dispatch to OneDNN for group convolutions when groups size = 24 * n (#53991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53991

Reviewed By: malfet

Differential Revision: D27048155

Pulled By: VitalyFedyunin

fbshipit-source-id: 5009f064220156ca14e1eb97172cfd4f7531b2a9
2021-03-15 19:30:19 -07:00
f52a3bd634 [DDP] remove dedupe check in reducer (#53919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53919

https://github.com/pytorch/pytorch/pull/53279/files has landed
deduplicating the shared params in python before constructing reducer. Because
of this, we no longer need the changes in
https://github.com/pytorch/pytorch/pull/46755/files.

This is already tested by `test_ddp_shared_grad_acc_unused_params` and
`test_ddp_weight_sharing`
ghstack-source-id: 123828299

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D27015466

fbshipit-source-id: efb079540c1a0e18bb38e68479caeb50cf550304
2021-03-15 18:50:05 -07:00
8c2c9450cc [package] autoformat (#53783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53783

Use isort + black on torch/package/

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26969020

Pulled By: suo

fbshipit-source-id: e2c0738e79bf41b6342355eb7025998178c35dc9
2021-03-15 17:18:43 -07:00
ee35060888 Fix sharding algo + test it (#53942)
Summary:
This PR:
1. moves the sharding algorithm from run_test.py to framework_utils.py (let me know if you have a better place for it)
2. adds tests for the algorithm in test_testing.py
3. fixes the algorithm so that it doesn't tack all of the unknown jobs onto the shard with the minimum time, but instead distributes them around the shards (see the sketch after this list)
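
A minimal C++ sketch of the fixed behavior; all names here are illustrative (the real implementation lives in the Python test tooling):

```c++
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Jobs with known times go greedily onto the currently lightest shard;
// jobs with unknown times are then spread round-robin instead of all
// landing on one shard.
struct Shard {
  double time = 0;
  std::vector<std::string> jobs;
};

std::vector<Shard> shardJobs(std::size_t num_shards,
                             std::vector<std::pair<std::string, double>> timed,
                             const std::vector<std::string>& unknown) {
  std::vector<Shard> shards(num_shards);
  // Longest-processing-time-first greedy assignment for known jobs.
  std::sort(timed.begin(), timed.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
  for (const auto& [job, t] : timed) {
    auto lightest = std::min_element(
        shards.begin(), shards.end(),
        [](const Shard& a, const Shard& b) { return a.time < b.time; });
    lightest->time += t;
    lightest->jobs.push_back(job);
  }
  // Unknown-time jobs: round-robin, not all on the minimum-time shard.
  for (std::size_t i = 0; i < unknown.size(); ++i) {
    shards[i % num_shards].jobs.push_back(unknown[i]);
  }
  return shards;
}
```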

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53942

Test Plan: python test/test_testing.py -k TestFrameworkUtils

Reviewed By: samestep

Differential Revision: D27047223

Pulled By: janeyx99

fbshipit-source-id: 824b20009c0bb707aa5361de445cdec795d5e3f1
2021-03-15 16:33:56 -07:00
e91aeb0470 [4/n][torch/elastic][upstream] Move torchelastic/metrics to torch/distributed/elastic/metrics (#53870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53870

Move torchelastic/metrics to torch/distributed/elastic/metrics

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/...

Reviewed By: kiukchung

Differential Revision: D26970901

fbshipit-source-id: 0e0a211fe509b7bc3ab10adfefba81cd71b6db37
2021-03-15 16:07:18 -07:00
b9fdf72174 Fix doc (#53996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53996

Fixes issue: #52479

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D27051056

Pulled By: nikithamalgifb

fbshipit-source-id: ff5d2fc3599571346e2323fa893c1e238097a164
2021-03-15 15:44:30 -07:00
e87ab2ac4d [DataLoader] Switch to guaranteed determinism & add option to non_deterministic (#53532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53532

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26888825

Pulled By: ejguan

fbshipit-source-id: 1e8c266146aa802a43e8c23c4f0b3b02134c8b50
2021-03-15 14:47:16 -07:00
274b96b878 Move as_view/increment_version to its separate key. (#53342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53342

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26973913

Pulled By: ailzhang

fbshipit-source-id: bc7fc25d1a3a1f20cdfa1d7126fa559a84d194a4
2021-03-15 14:47:12 -07:00
8f98b87212 Update Kineto revision (#53940)
Summary:
Update Kineto revision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53940

Test Plan: CI

Reviewed By: gdankel, ngimel

Differential Revision: D27027834

Pulled By: ilia-cher

fbshipit-source-id: f5515720c641fde8a8b80c38fa4cbb611f76f36e
2021-03-15 14:45:22 -07:00
a7ba3f3aa8 Automated submodule update: tensorpipe (#53999)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 17008b1be8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53999

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27046211

fbshipit-source-id: 72d7eb3814d30afb7956e0e0b43b0b320fbf009a
2021-03-15 14:39:17 -07:00
65087dd1d4 Fix broken link from load_inline to new test location (#53701)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53701

Reviewed By: jbschlosser

Differential Revision: D27047406

Pulled By: ezyang

fbshipit-source-id: 0be6e669cf41527d3ffeb101e5f36db07e41b4af
2021-03-15 13:53:15 -07:00
67f765328b scripts: Change promote pypi to be more flexible (#53774)
Summary:
Promotion to PyPI should be flexible enough to allow any package to be
promoted to PyPI.

After we re-added a version suffix to the CUDA 10.2 packages, this
script needs the flexibility to designate which platform and
which version suffix will actually be uploaded to PyPI.

Should coincide with https://github.com/pytorch/builder/pull/678

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53774

Reviewed By: jbschlosser

Differential Revision: D27052347

Pulled By: seemethere

fbshipit-source-id: 71129cc5afbd7de448c970ef721bc979c3420586
2021-03-15 13:30:21 -07:00
793a29a7d5 add OneDNN batch_norm backward (#50460)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50460

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26006887

Pulled By: VitalyFedyunin

fbshipit-source-id: 472398772af01a31594096ccc714fd487ed33dd4
2021-03-15 13:30:17 -07:00
33e3deed4f add OneDNN relu backward and reshape backward (#49455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49455

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26006886

Pulled By: VitalyFedyunin

fbshipit-source-id: c81ef115205171b80652800a76170dd759905e28
2021-03-15 13:27:56 -07:00
7f88840495 Fix prefix store timeout bug (#53928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53928

HashStoreTest was taking forever to run. It turned out that a default timeout is set when creating Store(), and setTimeout on PrefixStore was not actually able to change the timeout of the underlying store.

After removing the default timeout and updating setTimeout, this saves ~10 minutes on all of the gcc_test CI runs.
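
A minimal C++ sketch of the setTimeout fix, with simplified types rather than the full c10d Store API:

```c++
#include <chrono>
#include <memory>
#include <utility>

// A wrapping store must forward setTimeout to the store it wraps;
// otherwise the wrapped store keeps its default timeout forever.
struct Store {
  std::chrono::milliseconds timeout_{std::chrono::minutes(5)};
  virtual void setTimeout(std::chrono::milliseconds t) { timeout_ = t; }
  virtual ~Store() = default;
};

struct PrefixStore : Store {
  explicit PrefixStore(std::shared_ptr<Store> s) : underlying_(std::move(s)) {}
  void setTimeout(std::chrono::milliseconds t) override {
    Store::setTimeout(t);
    underlying_->setTimeout(t);  // previously missing: the new timeout
                                 // now reaches the wrapped store too
  }
  std::shared_ptr<Store> underlying_;
};
```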

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27025275

Pulled By: H-Huang

fbshipit-source-id: 650c8c1eb8b166da1d412ed88e765747a2ca2069
2021-03-15 13:23:20 -07:00
7ff4955de5 [doc] Fix documentation for tensorsolve (#53320)
Summary:
This PR fixes a typo in the explanation of `dims` for `linalg.tensorsolve`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53320

Reviewed By: jbschlosser

Differential Revision: D27048736

Pulled By: anjali411

fbshipit-source-id: db230b21191cc9cfb73b967cd15305fe74178c2b
2021-03-15 12:22:17 -07:00
b5cdb53af1 Add division logic to a slow/fast path (#49250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49250

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502938

Pulled By: izdeby

fbshipit-source-id: bdd583464eb15d7cb30fd0c22d119cc4b31cbf8d
2021-03-15 12:17:39 -07:00
4bb34c2a75 Update Binary Ops with scalar lists (#49249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49249

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502939

Pulled By: izdeby

fbshipit-source-id: b16e23063b37521be549e83cb17676e3afc4ddb3
2021-03-15 12:16:04 -07:00
c1a39620b8 [nn] nn.Embedding : padding_idx doc update (#53809)
Summary:
Follow-up of https://github.com/pytorch/pytorch/pull/53447

Reference: https://github.com/pytorch/pytorch/pull/53447#discussion_r590521051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53809

Reviewed By: bdhirsh

Differential Revision: D27049643

Pulled By: jbschlosser

fbshipit-source-id: 623a2a254783b86391dc2b0777b688506adb4c0e
2021-03-15 11:54:51 -07:00
5b62b0d9bc [RPC] Fix typo in rref_context.cpp (#53978)
Summary:
untill -> until

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53978

Reviewed By: jbschlosser

Differential Revision: D27039043

Pulled By: rohan-varma

fbshipit-source-id: c9178e79fe8b2a3dc61665148fe55dba5adb0abf
2021-03-15 11:08:59 -07:00
7e39a40300 Fix typo in torchvision_models.py (#53968)
Summary:
accross -> across

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53968

Reviewed By: jbschlosser

Differential Revision: D27035761

Pulled By: ngimel

fbshipit-source-id: 94fac6f2e27648e70652fd29f7800e60b211acd5
2021-03-15 11:02:06 -07:00
da10ccd35f Implements cpu_kernel_multiple_outputs and torch.frexp (#51097)
Summary:
Close https://github.com/pytorch/pytorch/issues/51108
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements the `cpu_kernel_multiple_outputs` to support returning multiple values in a CPU kernel.
```c++
auto iter = at::TensorIteratorConfig()
  .add_output(out1)
  .add_output(out2)
  .add_input(in1)
  .add_input(in2)
  .build();

at::native::cpu_kernel_multiple_outputs(iter,
  [=](float a, float b) -> std::tuple<float, float> {
    float add = a + b;
    float mul = a * b;
    return std::tuple<float, float>(add, mul);
  }
);
```

`out1` will be equal to `torch.add(in1, in2)`, while `out2` will be `torch.mul(in1, in2)`.
This makes it more convenient for developers to implement new torch functions that return two tensors, such as the NumPy-like functions [divmod](https://numpy.org/doc/1.18/reference/generated/numpy.divmod.html?highlight=divmod#numpy.divmod) and [frexp](https://numpy.org/doc/stable/reference/generated/numpy.frexp.html#numpy.frexp).

This PR adds `torch.frexp` function to exercise the new functionality provided by `cpu_kernel_multiple_outputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51097

Reviewed By: albanD

Differential Revision: D26982619

Pulled By: heitorschueroff

fbshipit-source-id: cb61c7f2c79873ab72ab5a61cbdb9203531ad469
2021-03-15 10:44:32 -07:00
ad8d1b2aaa [ONNX] Update embedding export wrt padding_idx (#53931)
Summary:
To be in-sync with https://github.com/pytorch/pytorch/issues/53447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53931

Reviewed By: ngimel

Differential Revision: D27026616

Pulled By: malfet

fbshipit-source-id: 4c50b29fa296c90aeeeb1757bdaada92cbba33d4
2021-03-15 10:03:53 -07:00
4f62c622b3 Cleanup of unused list in adam.py (#53874)
Summary:
Code cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53874

Reviewed By: jbschlosser

Differential Revision: D27036819

Pulled By: ngimel

fbshipit-source-id: c267e20c8d91224cd3c01b302a75f43aa309b560
2021-03-15 09:49:27 -07:00
8734e88f0b delete has no more data after the key (#53886)
Summary:
The TCPStore delete-key implementation inadvertently set "moreData" when sending the key, even though it was in fact the last message.
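
A minimal C++ sketch of the framing fix; the helper names and message layout are assumptions, not the actual TCPStore wire protocol:

```c++
#include <cstdint>
#include <string>

// The final chunk of a request must be sent with the "more data" flag
// cleared, or the peer keeps waiting for data that never arrives.
enum class MoreData : std::uint8_t { No = 0, Yes = 1 };

// Stand-in for the socket send helper (declaration only for this sketch).
void sendChunk(const std::string& payload, MoreData more);

void sendDeleteKey(const std::string& key) {
  sendChunk("DELETE_KEY", MoreData::Yes);  // the key still follows
  sendChunk(key, MoreData::No);            // fix: the key is the last chunk
}
```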

Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes https://github.com/pytorch/pytorch/issues/53872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53886

Reviewed By: jbschlosser

Differential Revision: D27011846

Pulled By: H-Huang

fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
2021-03-15 08:44:55 -07:00
700c817a6a Add install for libCaffe2_perfkernels_avx*.a (#53825)
Summary:
When building the libtorch static library, these three static libraries are generated but won't be installed to CMAKE_INSTALL_LIBDIR:
- libCaffe2_perfkernels_avx2.a
- libCaffe2_perfkernels_avx512.a
- libCaffe2_perfkernels_avx.a

This PR will fix this issue.

Please note that after this fix there are still static libraries missing from CMAKE_INSTALL_LIBDIR, but they belong to third_party repos, and we need to fix them in the corresponding repos:
- libfoxi_loader.a
- libonnx.a
- libonnx_proto.a
- libfmt.a
- libnccl_static.a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53825

Reviewed By: ngimel

Differential Revision: D27013844

Pulled By: malfet

fbshipit-source-id: 8a84cc72b6ae87393ca26c4e474f5526a7b18ab2
2021-03-15 08:37:11 -07:00
2782126bfe Automated submodule update: tensorpipe (#53892)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: cd0eb12c1f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53892

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27009398

fbshipit-source-id: af46edd701cde94c6175d3058fd15487d8b0b8c7
2021-03-15 05:58:27 -07:00
bb21aea37a [iOS GPU] Add the rest of binary ops (#53950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53950

Add four binary ops to Metal

- `aten::mul_`
- `aten::sub_`
- `aten::div`
- `aten::div_`
ghstack-source-id: 123850577

Test Plan:
- `buck test pp-mac`

```
2021-03-11 20:36:47.151139-0800 PyTorchPlayground[8469:5169786] [bool test_sub()],[5 3 167 222 ],[SUCCEED]
2021-03-11 20:36:47.157638-0800 PyTorchPlayground[8469:5169786] [bool test_sub_broadcast()],[1 3 1 1 ],[SUCCEED]
2021-03-11 20:36:47.170640-0800 PyTorchPlayground[8469:5169786] [bool test_sub_broadcast2()],[3 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.194009-0800 PyTorchPlayground[8469:5169786] [bool test_mul()],[2 7 262 119 ],[SUCCEED]
2021-03-11 20:36:47.210344-0800 PyTorchPlayground[8469:5169786] [bool test_mul_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.216610-0800 PyTorchPlayground[8469:5169786] [bool test_mul_broadcast2()],[1 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.224471-0800 PyTorchPlayground[8469:5169786] [bool test_div()],[1 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.240817-0800 PyTorchPlayground[8469:5169786] [bool test_div_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.246816-0800 PyTorchPlayground[8469:5169786] [bool test_div_broadcast2()],[1 3 192 192 ],[SUCCEED]
```

Reviewed By: SS-JIA

Differential Revision: D27003417

fbshipit-source-id: 290f7e524eef4c444f8884fc1315151752e5ac31
2021-03-14 22:14:24 -07:00
530dc828ae [iOS GPU] Support element-wise broadcasting for binary ops in shaders (#53949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53949

As title says
ghstack-source-id: 123849745

Test Plan:
`buck test pp-mac`

```
2021-03-11 18:25:07.922375-0800 PyTorchPlayground[8324:5122672] [bool test_add()],[1 180 12 12 ],[SUCCEED]
2021-03-11 18:25:07.960812-0800 PyTorchPlayground[8324:5122672] [bool test_add_broadcast()],[2 17 58 67 ],[SUCCEED]
2021-03-11 18:25:07.978399-0800 PyTorchPlayground[8324:5122672] [bool test_add_broadcast2()],[2 17 1 67 ],[SUCCEED]
2021-03-11 18:25:08.021570-0800 PyTorchPlayground[8324:5122672] [bool test_sub()],[5 3 167 222 ],[SUCCEED]
2021-03-11 18:25:08.034218-0800 PyTorchPlayground[8324:5122672] [bool test_sub_broadcast()],[1 3 1 1 ],[SUCCEED]
2021-03-11 18:25:08.069419-0800 PyTorchPlayground[8324:5122672] [bool test_sub_broadcast2()],[3 3 192 192 ],[SUCCEED]
2021-03-11 18:25:08.112967-0800 PyTorchPlayground[8324:5122672] [bool test_mul()],[2 7 262 119 ],[SUCCEED]
2021-03-11 18:25:08.136691-0800 PyTorchPlayground[8324:5122672] [bool test_mul_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 18:25:08.148920-0800 PyTorchPlayground[8324:5122672] [bool test_mul_broadcast2()],[1 3 192 192 ],[SUCCEED]
```

Reviewed By: SS-JIA

Differential Revision: D27000487

fbshipit-source-id: f86fca5ac1960ca0a56636da17ae05020c1a4138
2021-03-14 22:12:52 -07:00
df7c0a06d6 [testing] assert no duplicate in method_tests for an OpInfo entry (#53492)
Summary:
Assert no duplicate in method_tests for an OpInfo entry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53492

Reviewed By: izdeby

Differential Revision: D26882441

Pulled By: mruberry

fbshipit-source-id: f0631ea2b46b74285c76365c679bd45abc917d63
2021-03-14 21:58:39 -07:00
547f435763 Fix restriding logic for structured kernels (#53759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53759

Fixes #53587, see issue for in-depth explanation of the bug.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971342

Pulled By: ezyang

fbshipit-source-id: 805983fed2658e27fb033f36a71fd30950a29328
2021-03-14 20:41:23 -07:00
c2f41b6b84 Add meta device to generic device testing framework, skip NotImplementedError (#53682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53682

With this, under the meta device, 101 tests passed and 16953 skipped.
It ain't much, but it's a start.

Some various bits and bobs:
- NotImplementedError suppression at test level is implemented
  in the same way as CUDA memory leak check, i.e., by wrapping
  test methods and monkeypatching them back in.
- I had to reimplement assertRaises/assertRaisesRegex from scratch to
  ignore NotImplementedError when _ignore_not_implemented_error is True.
  The implementation relies on a small amount of private API that hasn't
  changed since 2010
- expectedAlertNondeterministic doesn't really work so I skipped them
  all; there's probably a way to do it better

I tested this using `pytest --disable-warnings --tb=native -k meta --sw
test/*.py` and a pile of extra patches to make collection actually work
(lol).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26955539

Pulled By: ezyang

fbshipit-source-id: ac21c8734562497fdcca3b614a28010bc4c03d74
2021-03-14 20:41:19 -07:00
d47d246206 Add 'noarch' tests which only run in one CI config (#53747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53747

Fixes #53743

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971343

Pulled By: ezyang

fbshipit-source-id: cee7aa10063ae674f741406a3af830e4b4f128df
2021-03-14 20:39:07 -07:00
f6df18f6ca Clean up future imports for Python 2 (#53349)
Summary:
See https://github.com/pytorch/pytorch/issues/42919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53349

Reviewed By: malfet

Differential Revision: D27039089

Pulled By: bugra

fbshipit-source-id: 8063dc184248604506a8dbb1bcb73da8ec85bb18
2021-03-14 15:56:13 -07:00
319ab58e27 Skips test_linalg_lstsq on ROCm (#53977)
Summary:
This test is flaky (tracked in https://github.com/pytorch/pytorch/issues/53976). This PR skips it to let the rest of the ROCm CI run.

cc nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53977

Reviewed By: ngimel

Differential Revision: D27036705

Pulled By: mruberry

fbshipit-source-id: 5bae741fd2a68f23717cb3a7c8b73e97cfb23b5c
2021-03-14 05:42:39 -07:00
790326d49b Fixed the size of the workspace array in functions calling LAPACK (#53909)
Summary:
The size of the workspace array should be max(1, lwork), according to the LAPACK documentation. We got away with this previously because we tested only against MKL, which conveniently returns lwork >= 1.
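
A compilable C++ sketch of the corrected workspace-query pattern, using LAPACK's real `dgeqrf` routine (QR factorization) as an example; linking requires a LAPACK library:

```c++
#include <algorithm>
#include <vector>

// The workspace query may report a small lwork, but the array handed to
// the compute step must have max(1, lwork) elements per the LAPACK docs.
extern "C" void dgeqrf_(int* m, int* n, double* a, int* lda, double* tau,
                        double* work, int* lwork, int* info);

void qrInPlace(int m, int n, double* a, double* tau) {
  int lda = std::max(1, m), info = 0, lwork = -1;
  double work_query = 0;
  dgeqrf_(&m, &n, a, &lda, tau, &work_query, &lwork, &info);  // query step
  lwork = std::max(1, static_cast<int>(work_query));          // the fix
  std::vector<double> work(lwork);
  dgeqrf_(&m, &n, a, &lda, tau, work.data(), &lwork, &info);  // compute step
}
```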

Fixes https://github.com/pytorch/pytorch/issues/53454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53909

Reviewed By: heitorschueroff

Differential Revision: D27017025

Pulled By: mruberry

fbshipit-source-id: 040a8cfb4bfb98db47d0b117938856d9483b20fb
2021-03-14 01:17:11 -08:00
7df176b1f9 Added OpInfo-based testing of some linalg functions (#51107)
Summary:
Added OpInfo-based testing of the following linear algebra functions:
* cholesky, linalg.cholesky
* linalg.eigh
* inverse, linalg.inv
* qr, linalg.qr
* solve

The output of `torch.linalg.pinv` for empty inputs was not differentiable, now it's fixed.

In some cases, batched grad checks are disabled because it doesn't work well with 0x0 matrices (see https://github.com/pytorch/pytorch/issues/50743#issuecomment-767376085).

Ref. https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51107

Reviewed By: albanD

Differential Revision: D27006115

Pulled By: mruberry

fbshipit-source-id: 3c1d00e3d506948da25d612fb114e6d4a478c5b1
2021-03-14 01:10:02 -08:00
d46978cc55 Refines test_orgqr_* skip (#53975)
Summary:
https://github.com/pytorch/pytorch/pull/51348 added CUDA support for orgqr but only a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc IvanYashchuk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53975

Reviewed By: ngimel

Differential Revision: D27036683

Pulled By: mruberry

fbshipit-source-id: f6c0a3e526bde08c44b119ed2ae5d51fee27e283
2021-03-14 00:41:26 -08:00
39f50f468d matmul performance benchmarks (#51647)
Summary:
Minor PR following up the previous PR about sparse benchmarking utils https://github.com/pytorch/pytorch/pull/48397

Fixes https://github.com/pytorch/pytorch/issues/44634:  Performance benchmarks for matrix-matrix and matrix-vector ops (dense-sparse, sparse-sparse, and compare to dense-dense)

I ran all benchmarks on a 2xRTX8000 machine with a 24-core AMD 2970WX, using the `DLMC/magnitude_pruning` dataset with different sparsity levels.

 ---
<details><summary> forward tests (expand for details).
</summary>

- `sparse@sparse`
```
[------------------------------- cpu:matmul-forward -------------------------------]
                           |   0.5   |   0.7   |   0.8   |   0.9   |  0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense    |  108.1  |  100.5  |  101.3  |  108.4  |  98.4  |  187.4
      torch:sparse@sparse  |  659.1  |  368.8  |  156.5  |   53.3  |  26.8  |   14.9
      scipy:sparse@sparse  |  565.1  |  233.9  |  130.2  |   23.1  |  21.6  |   15.2

Times are in milliseconds (ms).

[----------------------------------- cuda:matmul-forward -----------------------------------]
                           |    0.5    |    0.7    |   0.8    |   0.9    |   0.95   |   0.98
1 threads: ----------------------------------------------------------------------------------
      torch:dense@dense    |   2243.5  |   4392.5  |  4419.8  |  2272.3  |  4433.9  |  8920.1
      torch:sparse@sparse  |  21369.2  |  11877.6  |  7339.2  |  1787.2  |  1335.1  |   845.7

Times are in microseconds (us).

```
- `sparse@dense`
```
[------------------------------- cpu:matmul-forward -------------------------------]
                          |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense   |  105.8  |  103.8  |  103.0  |  104.4  |  104.4  |  197.0
      torch:sparse@dense  |  119.9  |  102.4  |   84.0  |   19.7  |   16.8  |   11.6
      scipy:sparse@dense  |  906.5  |  799.6  |  697.8  |  182.2  |  165.5  |  135.4

Times are in milliseconds (ms).

[------------------------- cuda:matmul-forward --------------------------]
                          |  0.5  |  0.7  |  0.8  |  0.9  |  0.95  |  0.98
1 threads: ---------------------------------------------------------------
      torch:dense@dense   |  2.2  |  4.4  |  4.4  |  2.3  |  4.5   |  2.3
      torch:sparse@dense  |  5.7  |  6.6  |  4.5  |  1.4  |  1.4   |  1.3

Times are in milliseconds (ms).

```
- `sparse@vector`
```
[----------------------------------- cpu:matmul-forward ----------------------------------]
                           |    0.5    |   0.7    |   0.8    |   0.9    |   0.95   |   0.98
1 threads: --------------------------------------------------------------------------------
      torch:dense@vector   |    510.6  |   505.8  |   759.6  |   782.1  |   682.4  |  764.6
      torch:sparse@vector  |  10122.8  |  6241.1  |  7935.6  |  2076.3  |  1049.5  |  826.3
      scipy:sparse@vector  |   1756.7  |  1033.9  |   678.2  |   343.5  |   168.5  |   65.4

Times are in microseconds (us).

[-------------------------------- cuda:matmul-forward --------------------------------]
                           |   0.5    |   0.7    |   0.8   |   0.9   |   0.95  |   0.98
1 threads: ----------------------------------------------------------------------------
      torch:dense@vector   |    36.1  |    21.5  |   21.6  |   21.5  |   21.6  |   21.5
      torch:sparse@vector  |  1099.2  |  1289.4  |  775.7  |  327.1  |  285.4  |  274.0

Times are in microseconds (us).

```
</details>

 ---
<details><summary> backward tests (expand for details).
</summary>

- `sparse@sparse`
```
[--------------------------------- cpu:matmul-backward ---------------------------------]
                           |   0.5    |   0.7    |   0.8    |   0.9    |   0.95  |   0.98
1 threads: ------------------------------------------------------------------------------
      torch:dense@dense    |   246.1  |   315.0  |   306.9  |   168.6  |  290.6  |  146.9
      torch:sparse@sparse  |  6417.5  |  4393.7  |  3012.7  |  1029.4  |  908.0  |  650.7

Times are in microseconds (us).

[----------------------------- cuda:matmul-backward -----------------------------]
                           |   0.5   |   0.7   |   0.8   |  0.9   |  0.95  |  0.98
1 threads: -----------------------------------------------------------------------
      torch:dense@dense    |    6.7  |   13.3  |   13.3  |   6.9  |  13.5  |   6.9
      torch:sparse@sparse  |  143.7  |  143.4  |  119.6  |  29.5  |  29.1  |  10.9

Times are in microseconds (us).

```
- `sparse@dense`
```
 [------------------------------ cpu:matmul-backward -------------------------------]
                          |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense   |  185.9  |  304.8  |  305.8  |  169.9  |  308.7  |  168.4
      torch:sparse@dense  |  407.9  |  345.8  |  274.6  |  114.2  |  163.6  |  230.5

Times are in milliseconds (ms).

[--------------------------- cuda:matmul-backward --------------------------]
                          |  0.5   |  0.7   |  0.8   |  0.9  |  0.95  |  0.98
1 threads: ------------------------------------------------------------------
      torch:dense@dense   |   6.7  |  13.3  |  13.3  |  6.9  |  13.4  |   6.9
      torch:sparse@dense  |  16.7  |  19.0  |  15.1  |  6.3  |   8.2  |  12.7

Times are in milliseconds (ms).

```
</details>

Kindly review this PR. cc mruberry, ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51647

Reviewed By: albanD

Differential Revision: D27007809

Pulled By: mruberry

fbshipit-source-id: 8c1922cb3280027ca5e3eef31bfa20500c548cfd
2021-03-14 00:25:45 -08:00
142c6b0e55 increase timeout for test_op_nnpi_fp16
Summary: As title. Otherwise we get flaky results when running on devices in dev mode.

Reviewed By: jfix71

Differential Revision: D27035924

fbshipit-source-id: 4946a90bd341be63d74b7052cace3fabdefdc0c4
2021-03-13 23:17:21 -08:00
84af0c7acd Refactor ForeachUtils.h (#51131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51131

--------
- Refactored `can_use_fast_route` logic in ForeachUtils.h.
- Fixed related bugs in test_foreach.py

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26103904

Pulled By: izdeby

fbshipit-source-id: b3859b39adaab55c87dab6f7709d227adc0f6342
2021-03-13 13:39:25 -08:00
f2689b1e13 Make ideep honor torch.set_num_thread changes (#53871)
Summary:
When compiled with OpenMP support, `ideep`'s computational_cache would cache the max number of OpenMP workers.
This number could be wrong after a `torch.set_num_threads` call, so it is cleared after the call.
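
A minimal C++ sketch of the idea, with stand-in names rather than ideep's actual cache:

```c++
#include <unordered_map>

// Any cache whose entries were built for a particular thread count goes
// stale when that count changes, so the fix clears it on change.
static std::unordered_map<long, int> computation_cache;  // key -> primitive
static int cached_num_threads = -1;

void onSetNumThreads(int n) {
  if (n != cached_num_threads) {
    computation_cache.clear();  // drop entries built for the old count
    cached_num_threads = n;
  }
}
```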

Fixes https://github.com/pytorch/pytorch/issues/53565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53871

Reviewed By: albanD

Differential Revision: D27003265

Pulled By: malfet

fbshipit-source-id: 1d84c23070eafb3d444e09590d64f97f99ae9d36
2021-03-13 11:20:44 -08:00
de70cdb66b Clang format default_hooks.py (#53956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53956

ghstack-source-id: 123852987

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D27032713

fbshipit-source-id: 11d831fa0f08b1c8bc2e44acd144bf85a69a1211
2021-03-13 10:41:11 -08:00
ca4aae85fa [Gradient Compression] Update the docstring of fp16_compress_wrapper (#53955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53955

Per title
ghstack-source-id: 123852836

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D27032700

fbshipit-source-id: 6f9bbc028efe6cc9b54f4ec729fea745368efb2e
2021-03-13 10:39:40 -08:00
3ce51fd5f4 remove th_fill and th_mul dead code (#52546)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52546

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26654127

Pulled By: malfet

fbshipit-source-id: 68b777cd8ce2992a876dc8d22276a2afcef4830e
2021-03-12 20:55:09 -08:00
317ff429d3 [TB] Support writing new style scalar (#53496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53496

New style vs old style
b306651ab5/tensorboard/data_compat.py (L49-L53)

Writing in new style can help avoid the cost of migration
b306651ab5/tensorboard/data_compat.py (L46)

----

Test Plan:
buck run caffe2/test:tensorboard

 ---

Reviewed By: edward-io

Differential Revision: D26879076

fbshipit-source-id: 43cfe9e1ca52dad3efc10332715d39f1cc984862
2021-03-12 19:03:13 -08:00
ef07a04072 [NNC] New APIs to get loops corresponding to a Buf (#53778)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53092

This PR adds the following APIs to NNC.
```
// In For:
static For* getParentLoop(const Stmt* st);
static std::vector<For*> getEnclosingLoopNest(const Stmt* st);

// In LoopNest:
std::vector<const Stmt*> getAllWritesToBuf(const Buf*) const;
std::vector<For*> getAllInnermostLoopsWritingToBuf(const Buf*) const;
std::vector<std::vector<For*>> getAllLoopNestsWritingToBuf(const Buf*) const;
```

These APIs are required for some usecases that involve multiple transformations like `splitWithTail` followed by `reorder` as shown in https://github.com/pytorch/pytorch/issues/53092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53778

Reviewed By: albanD

Differential Revision: D26987013

Pulled By: navahgar

fbshipit-source-id: 491459eddfff045132d2358631ad069bbcc520df
2021-03-12 18:50:15 -08:00
ce0fd095a8 Implemented embedding_bag for SR (#52429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52429

Implemented embedding_bag to support the out variant in Static Runtime

Before: Milliseconds per iter: 1.15443. Iters per second: 866.226

After: Milliseconds per iter: 1.14791. Iters per second: 871.149

Test Plan:
buck test caffe2/test:nn
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D26089498

fbshipit-source-id: c9ba7068d5aa696c8f37a4846d8e80c6379538d2
2021-03-12 17:52:27 -08:00
3078233e9a [Gradient Compression] Make FP16 compression as a wrapper that can be combined with other communication hooks (#53808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53808

Create an FP16 wrapper that can combine FP16 gradient compression with any gradient compression algorithm.

Test Plan:
Unit test:
```
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper
```

Performance Test on DDP QPS Benchmark: Check if AllReduce + FP16 Wrapper = FP16 Compression
1) FP16 Compression:
f256897690

2) FP16 Wrapper + AllReduce (after patching D26960986):
f256897289

Reviewed By: SciPioneer

Differential Revision: D26978832

fbshipit-source-id: 0dcd18b050c02f5e9f3cff56344d1f39a04e20c0
2021-03-12 17:31:07 -08:00
8a5b946ff6 [caffe2] Don't call TensorImpl::size() in dim32() (#53852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53852

dim32() requires that its argument is in range, so we can use the faster `TensorImpl::sizes()` call instead.
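
A minimal C++ sketch of the distinction, using a simplified stand-in for TensorImpl:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// size(i) performs checked access on every call, while dim32(i) already
// requires its argument to be in range and can index the raw sizes
// array directly.
struct TensorImplLike {
  std::vector<std::int64_t> sizes_;
  std::int64_t size(std::size_t i) const {
    return sizes_.at(i);  // slower: bounds-checked on every call
  }
  std::int32_t dim32(std::size_t i) const {
    // caller guarantees 0 <= i < sizes_.size(); skip the check
    return static_cast<std::int32_t>(sizes_[i]);
  }
};
```
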
ghstack-source-id: 123784862

Test Plan:
Ran MergeNet AdIndexer benchmark under perf stat.

Before:

```
 Performance counter stats for 'scripts/bwasti/static_runtime/run.sh' (5 runs):

          7,008.70 msec task-clock                #    0.997 CPUs utilized            ( +-  0.25% )
             4,203      context-switches          #    0.600 K/sec                    ( +- 14.71% )
                 3      cpu-migrations            #    0.000 K/sec
            93,896      page-faults               #    0.013 M/sec                    ( +-  0.80% )
    13,869,719,763      cycles                    #    1.979 GHz                      ( +-  0.23% )  (50.05%)
    27,561,765,867      instructions              #    1.99  insn per cycle           ( +-  0.06% )  (50.04%)
     4,288,245,412      branches                  #  611.846 M/sec                    ( +-  0.05% )  (50.01%)
        19,633,433      branch-misses             #    0.46% of all branches          ( +-  0.83% )  (50.01%)

            # Table of individual measurements:
            7.0670 (+0.0379) #
            6.9897 (-0.0394) #
            7.0203 (-0.0088) #
            6.9829 (-0.0462) #
            7.0856 (+0.0565) #

            # Final result:
            7.0291 +- 0.0205 seconds time elapsed  ( +-  0.29% )
```

After:
```
 Performance counter stats for 'scripts/bwasti/static_runtime/run.sh' (5 runs):

          6,935.61 msec task-clock                #    0.997 CPUs utilized            ( +-  0.47% )
             2,913      context-switches          #    0.420 K/sec                    ( +- 15.25% )
                 3      cpu-migrations            #    0.000 K/sec
            92,628      page-faults               #    0.013 M/sec                    ( +-  0.50% )
    13,724,940,495      cycles                    #    1.979 GHz                      ( +-  0.47% )  (50.01%)
    27,226,217,974      instructions              #    1.98  insn per cycle           ( +-  0.02% )  (50.03%)
     4,220,129,358      branches                  #  608.472 M/sec                    ( +-  0.06% )  (50.04%)
        19,025,346      branch-misses             #    0.45% of all branches          ( +-  0.53% )  (50.04%)

            # Table of individual measurements:
            6.9402 (-0.0145) #
            6.8570 (-0.0978) #
            6.9311 (-0.0236) #
            7.0101 (+0.0554) #
            7.0352 (+0.0805) #

            # Final result:
            6.9547 +- 0.0315 seconds time elapsed  ( +-  0.45% )

```

Roughly a 1% win in cycles, which is outside the quoted noise level.

Reviewed By: hlu1

Differential Revision: D26994107

fbshipit-source-id: f4c4963be0a5c268cbcdac5359f8278750218ae6
2021-03-12 16:22:29 -08:00
5b648ef909 Revert D26922420: [ONNX] fix export of embedding with padding_idx (#53053)
Test Plan: revert-hammer

Differential Revision:
D26922420 (ee4ce8e9d9)

Original commit changeset: b8b867a96a13

fbshipit-source-id: 501392f419f2735658001c96f83d9754acd8e476
2021-03-12 14:51:01 -08:00
00771eff8e [reland] Add OpInfo for bitwise_not (#53181)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Note: Reland https://github.com/pytorch/pytorch/issues/51944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53181

Reviewed By: albanD

Differential Revision: D27004695

Pulled By: mruberry

fbshipit-source-id: 92b4e8c60bb6f3c302907716de040b5c81c8db69
2021-03-12 14:43:56 -08:00
fe08671756 Added cuBLAS path for torch.triangular_solve (#53147)
Summary:
This PR adds a cuBLAS-based path for `torch.triangular_solve`.
The device dispatching helper function was removed from native_functions.yml; it is replaced with DECLARE/DEFINE_DISPATCH.

`magmaTriangularSolve` is removed and replaced with cuBLAS calls; this is not a BC-breaking change because internally MAGMA just calls the same cuBLAS function and does nothing else.

Batched cuBLAS is faster than batched MAGMA for matrices up to size 512x512; beyond that MAGMA is faster. For batches smaller than ~8 and matrix sizes larger than 64x64, a for-loop of cuBLAS calls is faster than the batched version.

Ref. https://github.com/pytorch/pytorch/issues/47953
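
For context, a small end-to-end sketch of the user-facing API this path serves; which backend (cuBLAS vs. MAGMA) handles the solve is an internal detail:

```python
import torch

# Well-conditioned upper-triangular coefficient matrix
A = torch.eye(4, device="cuda") + 0.1 * torch.randn(4, 4, device="cuda").triu()
b = torch.randn(4, 2, device="cuda")

# Returns a namedtuple (solution, cloned_coefficient)
x, _ = torch.triangular_solve(b, A, upper=True)
assert torch.allclose(A @ x, b, atol=1e-5)
```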

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53147

Reviewed By: heitorschueroff

Differential Revision: D27007416

Pulled By: mruberry

fbshipit-source-id: ddfc190346e6a56b84145ed0a9af67ca9cde3506
2021-03-12 13:38:42 -08:00
afa1ff8e04 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface has only `gels` implemented but only for overdetermined systems.

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs
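
A minimal usage sketch of the driver selection described above (CPU drivers shown; on CUDA only `gels` is available):

```python
import torch

A = torch.randn(5, 3)  # overdetermined system: 5 equations, 3 unknowns
b = torch.randn(5, 2)  # two right-hand sides

result = torch.linalg.lstsq(A, b, driver="gelsd")
x = result.solution    # least-squares solution, shape (3, 2)
print(result.rank)     # effective rank of A, reported by gelsd/gelsy/gelss
```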

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: albanD

Differential Revision: D26991788

Pulled By: mruberry

fbshipit-source-id: 8af9ada979240b255402f55210c0af1cba6a0a3c
2021-03-12 13:25:55 -08:00
4932342363 [Static Runtime] Fix bug in ClipRangesGatherRangesX2SigridHash (#53799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799

Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two step graph rewrite process. If step 2 doesn't happen after step 1, then we're stuck with a graph with a `fb::placeholder` op that can't run. Step 3 is added to revert step 1 so we restore the original graph if there's any `fb::placeholder` op left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.

Reviewed By: ajyu

Differential Revision: D26920008

fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
2021-03-12 13:15:44 -08:00
76129c7cdf Revert D26993790: [pytorch][PR] [CUDA graphs] Private mempools for CUDA graphs
Test Plan: revert-hammer

Differential Revision:
D26993790 (90dfdef226)

Original commit changeset: a992eaee1b8c

fbshipit-source-id: 6ddb4aedd6154d7d89847aa5a34181158d06a309
2021-03-12 13:07:28 -08:00
fe38027fc3 [fix] torch.cat : cross-device check for out and input tensors (#53004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52044 (`stack` dispatches to `cat`)

Because of the way the dispatcher works, this case currently reaches only the CUDA kernel (the CPU kernel is chosen when all inputs and `out` are on the CPU). That is why the check is added only on the CUDA side.
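
A sketch of the situation the new check is meant to catch (assuming a CUDA device is available):

```python
import torch

x = torch.randn(2, 3)                   # CPU inputs
out = torch.empty(4, 3, device="cuda")  # out tensor on a different device

# With the check in place, this mixed-device call raises a clear
# RuntimeError instead of misbehaving inside the CUDA kernel.
torch.cat([x, x], out=out)
```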

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53004

Reviewed By: albanD

Differential Revision: D27003956

Pulled By: mruberry

fbshipit-source-id: 818ea0f76153f4fa281740f30705e5ef018413f6
2021-03-12 12:51:11 -08:00
fdbd667e31 compareSet method for HashStore and FileStore (#53803)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53062
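
A minimal sketch of the intended semantics, assuming the Python binding is exposed as `compare_set` like the existing TCPStore one:

```python
import torch.distributed as dist

store = dist.HashStore()
store.set("key", "first")

# compare_set(key, expected, desired) swaps in `desired` only when the
# current value equals `expected`.
store.compare_set("key", "first", "second")  # value becomes "second"
store.compare_set("key", "first", "third")   # mismatch: value stays "second"
print(store.get("key"))                      # b'second'
```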

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53803

Reviewed By: ngimel

Differential Revision: D27017014

Pulled By: H-Huang

fbshipit-source-id: 736aa5ad848f5708e6581e472e48d5682bef7131
2021-03-12 12:38:30 -08:00
d4c877b59b Fix typo "informations" -> "information" (#53746)
Summary:
Hey, fixing the [uncountable](https://www.oxfordlearnersdictionaries.com/definition/american_english/information) noun to the proper form.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53746

Reviewed By: ngimel

Differential Revision: D27012035

Pulled By: albanD

fbshipit-source-id: dc653e739b5f6abed99b74bd2fd514b795d61b2e
2021-03-12 12:07:38 -08:00
f62e9156dc Add missing decorators in test_spectral_ops (#53736)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53456

I'm confused why this wasn't picked up in CI. There's definitely at least one CI job that builds without MKL. Are spectral_ops not being run at all on that job?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53736

Reviewed By: albanD

Differential Revision: D27007901

Pulled By: mruberry

fbshipit-source-id: cd93a2c48f4ccb2fd2e0e35768ee059039868a1b
2021-03-12 12:00:25 -08:00
89fce74d55 fix for method_tests() random failures (#53854)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48125 and https://github.com/pytorch/pytorch/issues/53237

The origin of the problem was that `common_methods_invocations.method_tests()` uses `set_rng_seed(0)`, which is different from the seed used at `TestCase.setUp` -> `set_rng_seed(SEED)`.

As this issue might block removing old tests, I also note that this could be the reason for the test failures at PR https://github.com/pytorch/pytorch/issues/50655

Thanks
cc mruberry, kshitij12345,  imaginary-person

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53854

Reviewed By: albanD

Differential Revision: D27004797

Pulled By: mruberry

fbshipit-source-id: 66a15ed900131c782bc341b16c902972d7bb2541
2021-03-12 11:49:47 -08:00
33aaea912a [caffe2] Support deserializing tensors using alternate serialization formats (#53403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53403

This updates the `TensorProto` field to independently track the data type of
the in-memory (deserialized) data from the serialized data format.

This will allow us to support multiple different serialization formats in the
future.  For instance, we could choose to perform quantization of floating
point data types, or varint encoding for integer fields.

For now this diff does not actually change the serialization code path yet,
and does not introduce any new serialization formats, but only refactors the
deserialization code path to make it easier to introduce new formats.

I'm not really that thrilled with the heavy use of macros and templates here,
but I didn't really see better alternatives that made it as simple to specify
new deserialization function implementations.
ghstack-source-id: 123594220

Test Plan:
Confirmed that the existing unit tests pass.  This diff only touches the
deserialization code path and not the serialization code to help ensure that
the deserialization code works with the existing serialization logic, and that
there are no changes to the current serialization format.

Reviewed By: mraway

Differential Revision: D26658206

fbshipit-source-id: d7297d600aee28b92fd9f4ece437b7f519060942
2021-03-12 11:35:15 -08:00
91531d3047 [caffe2] add a CAFFE2_NODISCARD macro to help support old compilers (#53754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53754

Some of the PyTorch CircleCI builds still use gcc 5.4, and compile with
`-Werror=attributes` causing this old compiler to fail because it does not
understand the `[[nodiscard]]` attribute.

Let's define a `CAFFE2_NODISCARD` macro to work around this.
ghstack-source-id: 123594084

Test Plan: I'm using this macro in subsequent diffs in the stack.

Reviewed By: mraway

Differential Revision: D26959584

fbshipit-source-id: c7ba94f7ea944b6340e9fe20949ba41931e11d41
2021-03-12 11:32:30 -08:00
7763bb6cb3 Use the conda channel defined in docker.Makefile to install cudatoolkit (#53316)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53316

Test Plan:
Nightly Docker build CI

This is a follow-up PR after docker moved default CUDA => 11.1. Only merge this after https://github.com/pytorch/pytorch/issues/53299 is committed.

Reviewed By: albanD

Differential Revision: D26996287

Pulled By: xuzhao9

fbshipit-source-id: 0c2e03da41d036d7aada3e07d479a3dede219f58
2021-03-12 11:18:05 -08:00
90dfdef226 [CUDA graphs] Private mempools for CUDA graphs (#51436)
Summary:
Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new.  It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436

Reviewed By: mruberry

Differential Revision: D26993790

Pulled By: ngimel

fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
2021-03-12 11:07:47 -08:00
804f3f9879 [PyTorch] Remove unnecessary assert in maybe_resize_storage_cpu (#53724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53724

See new code comment -- stealAndSetStoragePtr calls set_storage_keep_dtype.
ghstack-source-id: 123636226

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D26922164

fbshipit-source-id: fe1dd2b3e5f0876b8b41694ff2fb19b9ca2bae61
2021-03-12 10:48:42 -08:00
34eb644e88 Replace thrust with cub in randperm (#53841)
Summary:
Benchmark of
```python
%timeit torch.randperm(100000, device='cuda'); torch.cuda.synchronize()
```
thrust:
```
5.76 ms ± 42.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
cub:
```
3.02 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

sync in thrust sort is removed

Warning:
Thrust supports 64-bit indexing, but cub doesn't, so this is a functional regression. However, `torch.randperm(2**31, device='cuda')` fails with OOM on a 40GB A100, and `torch.randperm(2**32, device='cuda')` fails with OOM on an 80GB A100, so I think this functional regression has low impact and is acceptable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53841

Reviewed By: albanD

Differential Revision: D26993453

Pulled By: ngimel

fbshipit-source-id: 39dd128559d53dbb01cab1585e5462cb5f3cceca
2021-03-12 10:30:30 -08:00
7f4aff8203 Skip dispatch for is_signed (#53847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53847

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26994937

Pulled By: carolineechen

fbshipit-source-id: 8af25ecdade0b31d29fac27de6ee5f704353af10
2021-03-12 10:26:25 -08:00
2912ad1324 ns for fx: move linear activation test case to new API (#53777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53777

Moves linear activation test case to new NS API

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_activations_linear
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967107

fbshipit-source-id: 83c4401b2bf79d15227b7fb3e59c54276ec5626b
2021-03-12 10:02:52 -08:00
57bf13409a ns for fx: move compare activations for conv test to new API (#53776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53776

Moves the test for comparing activations for conv to new API.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_activations_conv
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967106

fbshipit-source-id: 2eb986ff19761a1e2408cb7780ac0b282cdcc523
2021-03-12 10:02:47 -08:00
01c6e9360e ns for fx: move lstm dynamic weight test case to new API (#53772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53772

Moves the test case for extracting LSTM dynamic weights to new NS API.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_weights_lstm_dynamic
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967104

fbshipit-source-id: 0d17e7735ec361167dcf72bcb373bfc1aad84668
2021-03-12 10:02:43 -08:00
a71cd135ae ns for fx: move linear dynamic weight test case to new API (#53765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53765

Moves linear dynamic weight test case to new NS API.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_weights_linear
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967109

fbshipit-source-id: 2096a88a3005270696d536f2e1bbc87e70c07230
2021-03-12 10:02:38 -08:00
9c8f112ada ns for fx: move linear weight test case to new API (#53764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53764

Moving the linear weight test case to new FX NS APIs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_weights_linear
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967111

fbshipit-source-id: f0a90d7863d5d866e391729ec28e0e0dea339900
2021-03-12 10:02:34 -08:00
19fe8a529e ns for fx: move conv weight test case to new API (#53761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53761

Moves the testing of conv weight matching to new NS APIs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels.test_compare_weights_conv
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26967108

fbshipit-source-id: af3647733f954a657e0868c2c40642018de9ea49
2021-03-12 10:02:30 -08:00
986e3c0a00 ns for fx: extract common code in tests to util functions (#53748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53748

Extracts common testing patterns for FX numeric suite into
util functions.  No logic change.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26967105

fbshipit-source-id: 9f6cbe75bb6d2ede142929e0c9e40812006c159d
2021-03-12 10:02:25 -08:00
7d27eb8068 ns for fx: clean up API naming (#53729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53729

Aligns the names of the three core APIs to the design doc.
New names:

```
// weights
_extract_weights_one_model
extract_weights

// unshadowed activations
_add_loggers_one_model
add_loggers
_extract_logger_info_one_model
extract_logger_info

// shadowed activations
add_shadow_loggers
extract_shadow_logger_info
```

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26953071

fbshipit-source-id: dda6df1c26afd99dd7779e72e3eed2d3d72c8128
2021-03-12 10:02:21 -08:00
421e91dfd2 ns for fx: add support for logging inputs
Summary:
This PR implements the option to log inputs for FX Numeric Suite.  The user-facing API looks like

```
def prepare_model_outputs(..., should_log_inputs : bool = False)
def prepare_model_with_stubs(..., should_log_inputs : bool = False)
```

The output data now looks like

```
{
  "layer1": {
    "node_inputs": {
      "model1": [{
        "values": ...,
        ...,
      }],
    },
    "node_outputs": {
      ...,
    }
  },
  ...  // other layers
}
```

One key design decision taken here is that an input logger logs the output of previous nodes, instead of logging the input of the current node.  This matters for a signature such as `cat([x1, x2, x3])`.  We are inserting three input loggers here (for x1, x2, and x3), instead of a single input logger for `[x1, x2, x3]`.  This was chosen in order to preserve the structure of the original graph as much as possible and keep flexibility for future optimizations.

Test Plan:
TODO: fill out

Imported from OSS

Differential Revision: D26931225

Reviewed By: hx89

Pulled By: vkuzo

fbshipit-source-id: dd692bfb5ddaaf5554f80c25e2f40b21762e4fc3
2021-03-12 10:02:17 -08:00
cc940f3580 ns for fx: change dtype cast from once per N node to once per node
Summary:
This PR ensures that when we do a dtype cast for a shadow module,
we insert N dtype casts for N nodes, instead of combining N nodes
into a single dtype cast.

An example where this occurs is `cat([x, y], dim=0)`

```
// original graph

[x, y] -> cat_b -> output

// shadow graph with a single dtype cast, before this PR

  dtype_cast -> cat_a_shadow -> output_a_shadow
  /
[x, y] -> cat_b -> output_b

// shadow graph with multiple dtype casts, after this PR

 [dtype_cast_x, dtype_cast_y] -> cat_a_shadow -> output_a_shadow
 /
[x, y] -> cat_b -> output_b
```

The reason things worked before this PR is that `torch.dequantize`
can take either a single tensor or a list of tensors.  We are changing
this to make an upcoming addition of input loggers easier.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_prepare_model_with_stubs_multiple_dtype_casts
```

Imported from OSS

Differential Revision: D26931226

Reviewed By: hx89

Pulled By: vkuzo

fbshipit-source-id: e9c7d4c7942e0f59c952094d2e446b1e2c838396
2021-03-12 10:02:12 -08:00
d73e36a44a ns for fx: change API to take nn.Module instead of GraphModule (#53075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53075

The input and output types should be `nn.Module`, to hide
the implementation detail that the pass is using FX.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26740548

fbshipit-source-id: d5ed445379355bebdd90d377c95fcd7e671371a3
2021-03-12 10:00:35 -08:00
b00cdfe136 Fix run_test_module logic (#53884)
Summary:
The first argument is either a file name or a test module name, but the key to `CUSTOM_HANDLERS` is the test module name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53884

Test Plan: Run `python3 run_test.py -i distributed/test_distributed_spawn.py`

Reviewed By: janeyx99

Differential Revision: D27006164

Pulled By: malfet

fbshipit-source-id: f30b42856cd2754e5981c1c69618f84e392c986a
2021-03-12 09:53:58 -08:00
ae7984b1d6 Do not use shards for single run tests (#53883)
Summary:
Do not compute shards if the whole test suite needs to be run anyway.
This helps avoid occasional test duplication/gaps when access to the test-time database is not available while one of the shards is computed.

Fixes https://github.com/pytorch/pytorch/issues/53882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53883

Reviewed By: janeyx99

Differential Revision: D27005910

Pulled By: malfet

fbshipit-source-id: f9603db0523a3a2539118e3fec1c6874c54f8d6d
2021-03-12 09:47:00 -08:00
a7ddd15d15 fix static dispatch linker error (#53859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53859

The redispatch API wasn't linking properly when static dispatch is enabled. I'm still not sure why this wasn't caught by the static dispatch test in CI; maybe, as swolchok pointed out, we have a flag set somewhere that defers undefined symbols until runtime.

Before, building with static dispatch enabled locally + running `import torch` gave me this error:
```
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/raid/hirsheybar/pytorch/torch/__init__.py", line 197, in <module>
    from torch._C import *
ImportError: /raid/hirsheybar/pytorch/torch/lib/libtorch_cpu.so: undefined symbol: _ZN2at10redispatch11logical_or_EN3c1014DispatchKeySetERNS_6TensorERKS3_
>>>
```

Printing the symbol:
```
(pytorch) hirsheybar@devfair017:/scratch/hirsheybar/pytorch$ c++filt _ZN2at10redispatch11logical_or_EN3c1014DispatchKeySetERNS_6TensorERKS3_
at::redispatch::logical_or_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&)
```

Sure enough, the functions defined in `RedispatchFunctions.cpp` don't have the DispatchKeySet argument included. Adding them in this PR.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26998735

Pulled By: bdhirsh

fbshipit-source-id: c6c1104e42d13b7ec9d964b7e08d2adc8b344b78
2021-03-12 09:41:08 -08:00
924c15c962 [doc] reorg dist init and non-init functions (#52976)
Summary:
This PR proposes to improve the distributed doc:

* [x] putting the init functions together
* [x] moving post-init functions into their own sub-section as they are only available after init and moving that group to after all init sub-sections

If this is too much, could we at least put these 2 functions together:

```
.. autofunction:: init_process_group

.. autofunction:: is_initialized
```
as they are interconnected. and the other functions are not alphabetically sorted in the first place.

Thank you.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52976

Reviewed By: albanD

Differential Revision: D26993933

Pulled By: mrshenli

fbshipit-source-id: 7cacbe28172ebb5849135567b1d734870b49de77
2021-03-12 08:48:18 -08:00
fff0a3f906 [DataLoader] ZipIterDataPipe (#53554)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53554

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26913406

Pulled By: ejguan

fbshipit-source-id: 24604b41d08eb6f7689add152229049a4c65c06e
2021-03-12 08:26:21 -08:00
7297556d5d Add support for single tensor in inputs argument for backward (#53827)
Summary:
Also updates the doc so that the language matches the type. For example, previously the `tensors` argument was specified as `(sequence of tensor)` but had a type annotation of `_TensorOrTensors`. Now it's correctly documented as `Sequence[Tensor] or Tensor`.
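
A quick sketch of what the change allows (a bare tensor where previously a sequence was required):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
loss = (x * y).sum()

# `inputs` previously had to be a sequence of tensors; a bare tensor now works.
loss.backward(inputs=x)
assert x.grad is not None and y.grad is None  # only x accumulates a gradient
```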

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53827

Reviewed By: albanD

Differential Revision: D26997541

Pulled By: soulitzer

fbshipit-source-id: e1e609a4e9525139d0fe96f6157175481c90d6f8
2021-03-12 08:19:31 -08:00
4884a6ab51 fx quant: clean up names of quantize handlers (#53614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53614

Ensures that every subclass of `QuantizeHandler` has a clear name.  This
prevents ambiguous names like `Cat`, which look like a module but are
really a quantize handler.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26914784

fbshipit-source-id: 6dca7e27975c09f422f8e36f1d2b709bf3eaaadf
2021-03-12 07:43:53 -08:00
279b5372ab [not for land] fix fx quant for quant_layer -> stack -> sum (#53196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53196

Before this PR, code patterns like this did not work:

```
x = some_quant_layer(x)
x = torch.stack([x, ...])
x = torch.sum(x, ...)
```

The reason this did not work is because `torch.sum` is treated as
"quantized" because of the newly added fp16 support, even though it is
not actually "quantized" for models where fp16 is not used.  We may
need to adjust the concept of "quantized vs non-quantized" into a
"dtype" for the longer term fix.

The current PR is a hacky fix to unblock.  We need to clean things
up before this is landable

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_quant_sum
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26783960

fbshipit-source-id: 3be7c3c1eaa2b8fcb99a105e1b0004c9ffd3a1c1
2021-03-12 07:43:50 -08:00
93d5807c1e [not for land yet]fix using size of quant layer in torch._assert (#53187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53187

Before this diff, if we had code like

```
x = any_quant_layer(...)
x_size0 = x.size(0)
torch._assert(x_size0 == 1, "expected size 1")
```

The convert code would try to insert a dequantize after `x_size0`,
because it was a descendant of a quantized node and it was needed
for a non-quantized operation.  Since the actual type of the `size`
function output is an integer, this does not make sense.

For now, this is fixed as a one-off to unblock a customer.  In the
future, we may need to think more deeply about all the functions which
can return non-quantized types from quantized tensors and make sure
they are all covered.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_assert_on_size_after_quant_layer
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26780690

fbshipit-source-id: 44cc25c9179d460efb3f110d40b73d854d676af5
2021-03-12 07:43:48 -08:00
ccab6680d5 [not for land yet] hacky fix for x.ndim followed by sub (#53120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53120

Currently there is a pattern which is not handled correctly by
FX graph mode quantization:

```
def forward(self, x):
    ndim = x.ndim
    # or add, mul, div, etc
    x = torch.sub(x, ndim)
    return x
```

The reason this does not work is as follows:
1. x.ndim becomes a getattr node
2. the real world type of x.ndim is an integer, but this is not known from the graph (yet)
3. binary ops such as `torch.sub` require quantization of inputs
4. the framework inserts an observer to observe the output of `ndim`
5. the observer fails because `ndim` is not a Tensor

For now, we hack a bandaid to unblock some teams, none of this is for
land.  We will have to think of a better fix which is landable (TBD).

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_getattr_with_nontensor_result
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26756180

fbshipit-source-id: c0e498766b22c23df74fbb5aaeaa237c4c944263
2021-03-12 07:42:12 -08:00
4873641602 Fix TCPStore wait() hang when key is previously set (#53860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53860

Fixes [#53840](https://github.com/pytorch/pytorch/issues/53840)

Right now [TCPStore wait([LIST_OF_KEYS_TO_AWAIT])](https://pytorch.org/docs/master/distributed.html#torch.distributed.Store.wait) will hang if any of the keys in [LIST_OF_KEYS_TO_AWAIT] has been previously set. This change ensures that wait() only waits for the keys that have not yet been set.

Before change:
```
# Case 1: HANG
store.set("1", "1")
store.wait(["1", "2"])
store.set("2", "2")

# Case 2: SUCCEED
store.wait(["1", "2"])
store.set("1", "1")
store.set("2", "2")
```
After change:
Both cases work

TODO: working on adding a test for wait()

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26999929

Pulled By: H-Huang

fbshipit-source-id: 8931749923c98b520366538f785af82ef37cca8e
2021-03-12 07:05:31 -08:00
a51f130d37 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27005870

fbshipit-source-id: 5d51d0e64ae3fb15d38f8a9f8479af1c86b18fa9
2021-03-12 04:00:36 -08:00
ee4ce8e9d9 [ONNX] fix export of embedding with padding_idx (#53053) (#53530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53530

fix export of embedding with padding_idx
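
A sketch of the pattern this fixes (the output file name is a placeholder):

```python
import torch

emb = torch.nn.Embedding(10, 4, padding_idx=0)  # row 0 is the padding vector
idx = torch.tensor([[0, 2, 5]])

torch.onnx.export(emb, idx, "embedding.onnx", opset_version=11)
```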

Test Plan: Imported from OSS

Reviewed By: navahgar, jamesr66a, malfet

Differential Revision: D26922420

Pulled By: SplitInfinity

fbshipit-source-id: b8b867a96a13cf810f9c0ae88fcc5c95072bb390
2021-03-12 02:49:46 -08:00
a572f70f2f [ONNX] Support torch.isinf, torch.any and torch.all export to ONNX (#53328) (#53529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53529

Supported for ONNX export with opset 10 and above.
This is not exportable to opsets < 10 due to
1. onnx::IsInf is introduced in opset 10
2. onnx::Equal does not accept float tensor prior to opset 11
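
A sketch of an export exercising these ops, using opset 11 to satisfy both constraints above:

```python
import torch

class HasInf(torch.nn.Module):
    def forward(self, x):
        return torch.isinf(x).any()

x = torch.tensor([1.0, float("inf"), -2.0])
torch.onnx.export(HasInf(), x, "has_inf.onnx", opset_version=11)
```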

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922418

Pulled By: SplitInfinity

fbshipit-source-id: 69bcba50520fa3d69db4bd4c2b9f88c00146fca7
2021-03-12 02:49:41 -08:00
705131c5d3 [ONNX] Update ONNX documentation (#51362) (#53313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53313

Add information about .data field

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922421

Pulled By: SplitInfinity

fbshipit-source-id: 5117ac20990e286dcacb44f7b810b1bcc75d3dd6
2021-03-12 02:49:38 -08:00
a6a811f23a [ONNX] Add repeat_interleave symbolic (#52855) (#53312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53312

- Add support for aten::repeat_interleave
- NOTE: Also adds a fix for cases with the split op where input tensor sizes are not known but `_outputs` is provided

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922422

Pulled By: SplitInfinity

fbshipit-source-id: 5362d0d8ccfdc14c15e1ae73fd70c4c113f823e6
2021-03-12 02:49:34 -08:00
76147b897c [ONNX] Update assign output shape for nested structure and dict output (#52893) (#53311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53311

Fixes dict output & nested tuple.

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922426

Pulled By: SplitInfinity

fbshipit-source-id: c2c6b71c8d978b990181e0b025626dbf6ef2199e
2021-03-12 02:49:30 -08:00
4c1d9e58c2 Fix copy_ export (#53046) (#53310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53310

Fixes export of torch.copy_

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922424

Pulled By: SplitInfinity

fbshipit-source-id: f509e531f5064d2be7f55e1681813f10f17475d2
2021-03-12 02:49:26 -08:00
8dab886d3b [ONNX] enable several script unit tests using new jit passes (#51722) (#53309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53309

* enable scripting related unit test

* fix flake8

* enable more tests

* fix interpolate and BCElogits test

* fix interpolate test ci

* add test_interpolate_upsample opset comment

* add interpolate_upsample tracing support below opset 9

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922423

Pulled By: SplitInfinity

fbshipit-source-id: d1cd6a34c0820a75ffc28ff17acc5daa7807c00b

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-03-12 02:49:22 -08:00
be344e9d88 Update test cases generated by make_test() method to support running them in script mode. (#52748) (#53308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53308

* Update tests for test_gru_* at this moment.

* Update flake8 error.

* Update tests for test_gru_* at this moment.

* Update flake8 error.

* Update test_gru_* test cases only.

* Fix flake8 issue.

* Fix flake8 issue on test.

* Still disable test cases created by make_test.

* Update code to fix issue 'AttributeError: 'RecursiveScriptModule' object has no attribute 'forward'' for test_elman_* test cases.

* Add script model support for test_lstm_* test cases.

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922419

Pulled By: SplitInfinity

fbshipit-source-id: a96432b2e7da9b142a38f87fbaf56737117462c1
2021-03-12 02:49:18 -08:00
7f17058894 [ONNX] Symbolic shape inference (#51481) (#53307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53307

This PR implements symbolic shape inference in the onnx pass _jit_pass_onnx_graph_shape_type_inference.
It creates a singleton ConstantValueMap.
It leverages the constant-folding technique and does per-op handling for ConstantValueMap.
As a byproduct, it enables the fold_if pass for dynamic-axes cases, typically for faster-rcnn etc.

The core change is in `torch/csrc/jit/passes/onnx/shape_type_inference.cpp` and `torch/csrc/jit/passes/onnx/constant_map.cpp`.

We usually need to copy a tensor to store it in the ConstantValueMap; otherwise the underlying value may change. I saw this issue in (1) from_blob and (2) getting a value from a Constant node.

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922414

Pulled By: SplitInfinity

fbshipit-source-id: 7654dc13d1de8d9496ad4be89f1454260d7bdeb0
2021-03-12 02:49:14 -08:00
57d1df071f [ONNX] Support inplace operations on inplace indexing (#52063) (#53306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53306

* [ONNX] Fix for sequence of mutations in blocks (#51577)

Fixes consecutive mutations in a tensor inside blocks.
Also, support append and pop in blocks.

* Support inplace operations + indexing

* Clean up old pass for remove mutations

* Add loop test

* Fixes for set attr in loops

* Removing the new jit API flag

* [ONNX] Redesign onnx pass to enable shape type dependent pattern conversion - cont (#51795)

With the introduction of ONNX shape inference, shape and type are inferred on the fly as operators get converted from ATen to ONNX when running symbolic functions. This resolves the shape/type requirement for the symbolic functions. The pre-onnx passes, however, cannot be supported by shape inference, since at that stage the operators in the graph are still ATen operators.

This PR is to update the design of ONNX pass, to enable a mechanism of capturing subgraphs of ATen operators of certain patterns, and convert them later, when shape/type information of upstream operators are available.

The new design will require pre-onnx passes that need shape/type to be written in two parts, encapsulation and conversion.

The encapsulation part will find the nodes of patterns, like how pre-onnx passes were written previously. But instead of converting the nodes, it will encapsulate them into a sub-block of a new placeholder node. This part is called before the onnx pass, so it runs before calling symbolic functions.

The conversion part will be called inside the onnx pass. In the onnx pass, run_symbolic_func will be called for each node in topological order. When it reaches the placeholder node, the conversion part will be invoked. It will convert the nodes inside the sub-block based on the pattern. By that time, it will have the shape/type of upstream operators available. After the conversion is complete, the placeholder node will be removed and the nodes inside its sub-block converted. Run_symbolic_func will be called for these nodes, and they will be converted from ATen operators to ONNX operators.

This PR includes several other fixes, listed below.
* ~~replace helper.cpp with onnx_utils.cpp for holding utility functions.~~
* fix EraseNumberTypes on Bool type; the code was outdated from a time when the Bool type didn't exist.
* ~~enable onnx shape inference in export with parameter/initializer data.~~
* other code clean ups.
* fix insertion of identity nodes for loop opset 13 sequence output.

~~PR depends on #51603~~

* Fix after merge

* clang

* Fix clang

* Fix clang

* Fix warning message.

* Fixes for non-model param attributes

* Fix for caffe2

* Additional test

* clang

* Skip test for lower opsets

* fix clang-tidy

* Update init.cpp

* Update remove_inplace_ops_for_onnx.cpp

* Update remove_inplace_ops_for_onnx.cpp

* Update remove_inplace_ops_for_onnx.cpp

* Fix for clang formatting

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922416

Pulled By: SplitInfinity

fbshipit-source-id: e7108620b39b6404c594910786c4d275fee59d84

Co-authored-by: Bowen Bao <bowbao@microsoft.com>
2021-03-12 02:49:11 -08:00
38414d29a1 [ONNX] Remove the last Cast in pow symbolic_opset9 (#52646) (#53305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53305

Fixes #52436
For opset 9 of ONNX Pow, if X is int32 and Y is float, we previously cast the result back to int32 to be consistent with X's type.
However, PyTorch's result is still float. The ATen graph sometimes does not bind a type to operators,
so we are fine with the float type and don't want to cast back.
Even if X and Y are both int32, returning float32 instead of int32 makes no practical difference.
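
The eager-mode behavior being matched, as a quick check:

```python
import torch

x = torch.arange(4, dtype=torch.int32)
y = torch.tensor(0.5)

# Eager PyTorch promotes to a floating result, so the exported graph
# should not cast the Pow output back to int32.
print(torch.pow(x, y).dtype)  # torch.float32
```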

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922425

Pulled By: SplitInfinity

fbshipit-source-id: f8c09524acee0de615df10a14310ca1dd583831e
2021-03-12 02:47:19 -08:00
1772e26f63 [PyTorch] Move selected_mobile_ops.h codegen function to tools (#53786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53786

To generate `selected_mobile_ops.h` in OSS, move the header file codegen functions to `tools/lite_interpreter/gen_selected_mobile_ops_header.py` file, so OSS can reuse these functions.
ghstack-source-id: 123754437

Test Plan:
```
buck test //xplat/caffe2:supported_mobile_models_test
```

```
buck run //xplat/caffe2:gen_oplist -- --model_file_list_path @/data/users/chenlai/data/pytorch/oplist_folder/file_list_path.macro  --allow_include_all_overloads --output_dir /home/chenlai/local/data/pytorch/oplist_folder
```

`file_list_path.macro` content is:
```
chenlai@devvm2090:~/fbsource(45a9b7888)$ cat /data/users/chenlai/data/pytorch/oplist_folder/file_list_path.macro
/data/users/chenlai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/supported_mobile_models_test_op_list/model_operators.yaml
```

In output folder `/home/chenlai/local/data/pytorch/oplist_folder`, these files are generated:
```
selected_mobile_ops.h  selected_operators.yaml  SupportedMobileModelsRegistration.cpp
```

the generated files are the same as before.

{P282056731}

{P282055046}

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D26907868

fbshipit-source-id: 9ba786f9c5674a72cad237ae7baadbe4642c51d5
2021-03-12 00:13:03 -08:00
8737c2a1a2 [TensorExpr] Reland: "Simplify index expressions constructed in loop flattening. Fixes #51173" (#53861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53861

Replaced the iterators in the for-loops with integer index variables due to
overflow when handling empty vectors.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26998894

Pulled By: huiguoo

fbshipit-source-id: a1f6475c8ba123968ef7247b4f6f38edbf24b9ef
2021-03-11 23:52:36 -08:00
aeb3e93351 Move view handling logic to gen_inplace_or_view_type.py (#53341)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53341

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26973912

Pulled By: ailzhang

fbshipit-source-id: ea31bdef0beac6996d509f5d45ebefa3ea8e2b89
2021-03-11 21:25:15 -08:00
e09e97ebf9 [DDP] add _distributed_rank helper function (#53795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53795

There are 4 calls to dist.get_rank() in the DDP implementation; move these
to a helper property to ensure that users don't accidentally call `dist.get_rank()`
instead of `dist.get_rank(self.process_group)`.

Keeping API private for now because not sure if there is a user need to call `model.distributed_rank`, but can make it public if we think it's a useful api.
ghstack-source-id: 123640713

Test Plan: Ci

Reviewed By: mrshenli

Differential Revision: D26972368

fbshipit-source-id: a5f1cac243bca5c6f90a44f74d39cfffcc2b9a5a
2021-03-11 21:20:05 -08:00
0c2fe02ec1 [DDP] Fix wrong call to dist.get_rank() (#53793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53793

This call should pass in the process group so it works appropriately
for subgroups instead of only when the whole world is passed into DDP.

Aside: This wasn't caught by tests since we don't have good testing around
passing subgroups into DDP; I believe nearly all tests use the entire world.
Should we add better testing for subgroups which may potentially bring up more
subtle bugs?
ghstack-source-id: 123640712

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26972367

fbshipit-source-id: 8330bd51e2ad66841e4c12e96b67d3e78581ec74
2021-03-11 21:18:31 -08:00
d4602b7e45 [NNC] Fixes case where inlining wouldn't work because dim-size was 1. (#53254)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52581

The git diff is absolutely atrocious since I also refactored the code to share stuff between `Load` and `FunctionCall`.

Biggest questions I have about this diff are:

1. The asserts I added. From my understanding it's not possible to have a constant index in `Store` that's non-zero, since `Store` always creates a new buffer. Perhaps the user can write this kind of incorrect code, though, so perhaps I should just check for it and not assert it?

2. I don't think(?) I need to do any special handling for `index_vars`, but wasn't totally able to track the logic there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53254

Reviewed By: albanD

Differential Revision: D26991064

Pulled By: Chillee

fbshipit-source-id: 0bcd612d5f4b031c0b34e68a72d9c8d12d118be8
2021-03-11 20:53:20 -08:00
ce670238ba Revert D26927500: [libkineto] Log CUPTI errors on libkineto initialization
Test Plan: revert-hammer

Differential Revision:
D26927500 (cffe9aa617)

Original commit changeset: 2a78005239a5

fbshipit-source-id: ff9fdcb197b06b4ff99f41c80b4cecaf6a1820b8
2021-03-11 20:17:19 -08:00
9f75de278f Move common autograd utils functions from gen_variable_type.py to api/autograd.py. (#53340)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53340

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26973914

Pulled By: ailzhang

fbshipit-source-id: 8367a08b27b25808782c77aadc3c67d07c354957
2021-03-11 19:58:45 -08:00
cffe9aa617 [libkineto] Log CUPTI errors on libkineto initialization
Summary: When libkineto is initialized from the PyTorch Profiler, if it fails we will not know why because errors are not reported. Reporting errors is not always safe, e.g. if init happens from static initialization or a dlopen library constructor function, so add a flag to specify whether to log.

Test Plan: Testing in PyTorch OSS build.

Reviewed By: chaekit

Differential Revision: D26927500

fbshipit-source-id: 2a78005239a5fcbe7e1de82e5405f04e07000fa8
2021-03-11 19:47:58 -08:00
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP dict just needs to strip the module prefix from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.
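
A minimal sketch of the prefix stripping (module choices here are placeholders):

```python
import torch

plain = torch.nn.Linear(4, 4)
wrapped = torch.nn.DataParallel(torch.nn.Linear(4, 4))

# DataParallel/DDP prepend "module." to every key; strip the prefix so a
# plain module can consume the checkpoint.
sd = {k[len("module."):] if k.startswith("module.") else k: v
      for k, v in wrapped.state_dict().items()}
plain.load_state_dict(sd)
```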

#Closes: https://github.com/pytorch/pytorch/issues/41048/
ghstack-source-id: 123722976

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
5c2b3d7784 [ROCm] Enable RNN test in test_c10d_spawn.py for ROCm (#52707)
Summary:
Enabling test_rnn test because it is passing for ROCm.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52707

Reviewed By: albanD

Differential Revision: D26994407

Pulled By: mrshenli

fbshipit-source-id: f7d60ab7c4f0128e5f7770f959e2b83694d18275
2021-03-11 18:41:54 -08:00
dfb5f029da Disable TF32 on DDP tests (#52941)
Summary:
When a system has an Ampere and a non-Ampere card, many tests will fail because results on different cards are different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52941

Reviewed By: albanD

Differential Revision: D26994287

Pulled By: mrshenli

fbshipit-source-id: 287537495fc13361104a4460f5bcd79a208b5d8d
2021-03-11 18:31:28 -08:00
06cf6d37b5 [ROCm] Enable test cases in test_data_parallel.py for ROCm (#52708)
Summary:
Enabling the test cases because they are passing for ROCm.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52708

Reviewed By: albanD

Differential Revision: D26994458

Pulled By: mrshenli

fbshipit-source-id: f0b3797c7889287a0154b1d5397df715ffb1c605
2021-03-11 18:29:37 -08:00
c15d943149 [PyTorch] Fix broken build caused by keyword missing on Windows (#53562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53562

On Windows, when we try to build //xplat/caffe2/c10:c10Windows, it fails with an error like
```
stderr: buck-out\gen\83497cbb\xplat\caffe2\c10\c10Windows#header-mode-symlink-tree-only,headers\c10/macros/Macros.h(189): error C2220: warning treated as error - no 'object' file generated
buck-out\gen\83497cbb\xplat\caffe2\c10\c10Windows#header-mode-symlink-tree-only,headers\c10/macros/Macros.h(189): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```
See log here: https://www.internalfb.com/intern/buck/build/6eaea1f8-e237-4860-9f3b-3a8edd2207c6/

This is because Windows doesn't support the `__has_attribute` keyword. Here I'm changing the ordering of `if` and `elif` so that we don't hit that line when building on Windows.

Test Plan: buck build //xplat/caffe2/c10:c10Windows xplat/mode/windows

Reviewed By: kimishpatel, swolchok

Differential Revision: D26896510

fbshipit-source-id: d52438a3df7bf742e467a919f6ab4fed14484f22
2021-03-11 18:24:46 -08:00
b69dd910e8 [docs] Add starter content for new TorchScript language reference (#53837)
Summary:
**Summary**
This commit adds a new .rst file to use for updating the language specification and prepopulates it with the updated content for the expressions section.

**Test Plan**
https://user-images.githubusercontent.com/4392003/110441235-638ee880-806e-11eb-83ae-3b908bf00d5b.mov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53837

Reviewed By: nikithamalgifb

Differential Revision: D26990801

Pulled By: SplitInfinity

fbshipit-source-id: 3b4e711bfaa8aac4ee3a075822fed7267a818121
2021-03-11 18:18:27 -08:00
d57ae6c46d Revert D26906509: Adding parallel support for the LLVM backend.
Test Plan: revert-hammer

Differential Revision:
D26906509 (95d2318510)

Original commit changeset: 12c17f2f21af

fbshipit-source-id: cc86d0dfca0dd791b31bda23a0172fc1cfd89760
2021-03-11 17:54:47 -08:00
8d8a4a0624 Remove the extra ":noindex:" in ddp_comm_hooks.rst (#53855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53855

Remove "noindex" here:

{F492926346}
ghstack-source-id: 123724419

Test Plan:
waitforbuildbot

The failure on doctest does not seem to be relevant.

Reviewed By: rohan-varma

Differential Revision: D26967086

fbshipit-source-id: adf9db1144fa1475573f617402fdbca8177b7c08
2021-03-11 17:26:50 -08:00
5344c3ea9e Remove join_workers from Pipeline destructor. (#53433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53433

As described in https://github.com/pytorch/pytorch/issues/53413, the
pipeline destructor ends up hanging sometimes. The reason for this is that Pipe
uses daemon threads and as a result these threads could be destroyed before the
Pipe destructor is done. The Pipe destructor then calls `join_workers` which
waits on signals from the worker threads, which might be already dead and
results in the main thread blocking forever.

To resolve this issue, in this PR we remove `join_workers` completely since it
is not necessary to wait for daemon threads.

#Closes: https://github.com/pytorch/pytorch/issues/53413
ghstack-source-id: 123641509

Test Plan:
1) Tested with repro in
https://github.com/pytorch/pytorch/issues/53413.
2) Hard to add a unit test for this since the bug really depends on order of
objects being destroyed.

Reviewed By: rohan-varma

Differential Revision: D26863321

fbshipit-source-id: 18fff072cabacfb10390e971eac789859d3dcc81
2021-03-11 17:05:22 -08:00
6da0b94dd8 Add note on forwarding arguments in the dispatcher (#53641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53641

ghstack-source-id: 123466764

Test Plan: comments only

Reviewed By: bhosmer

Differential Revision: D26922477

fbshipit-source-id: ad630b5e1b10a2238f9b48aba656b2ffe65520a1
2021-03-11 16:40:37 -08:00
13f63fda5f Automated submodule update: FBGEMM (#53722)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: d12fc485d5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53722

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D26949768

fbshipit-source-id: 718796736c0641b7cf6c5b0617fc744a090c78c4
2021-03-11 16:13:22 -08:00
ec6a7cace3 [ROCm] Fix the flaky test test_stream_event_nogil (#53850)
Summary:
Fix the flaky test in https://github.com/pytorch/pytorch/issues/53192 properly.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53850

Reviewed By: albanD

Differential Revision: D26993582

Pulled By: malfet

fbshipit-source-id: b0aefb188a236a5e94ee31a30ede7e8175443ff5
2021-03-11 16:07:41 -08:00
b9e900ee52 [ONNX] Update inputs/input_names formatting to avoid ValueError with scriptMethods (#53519) (#53548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53548

fixes issue faced in #53506

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26922415

Pulled By: malfet

fbshipit-source-id: b61842827bb14cef8c7a7089b2426fa53e642c90
2021-03-11 14:26:02 -08:00
cdac61ecd4 Prevent VS from emitting ambiguous symbol errors (third time) (#53490)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/53409

First: https://github.com/pytorch/pytorch/issues/15697
Second: https://github.com/pytorch/pytorch/issues/17863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53490

Reviewed By: VitalyFedyunin

Differential Revision: D26946687

Pulled By: mrshenli

fbshipit-source-id: 27f85abecbb75456354cc0373529c8cadc8133bd
2021-03-11 13:51:41 -08:00
8016d28c0b [Gradient Compression] Update the comment on fp16_compress_hook (#53780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53780

Update the comment, because the input data type of `fp16_compress_hook` does not have to be FP32. For example, the input dtype can also be FP64, as long as it can be cast to FP16.
ghstack-source-id: 123680621

Test Plan: N/A

Reviewed By: iseessel

Differential Revision: D26967224

fbshipit-source-id: 26d79a3629a597e6335b6f59c97d25a764a8ed80
2021-03-11 13:40:32 -08:00
cyy
14d02517e1 replace data with data_ptr (#53097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53097

Reviewed By: albanD

Differential Revision: D26972445

Pulled By: rohan-varma

fbshipit-source-id: 04798a3fd55dd297638377513cfc57ff86c8916d
2021-03-11 13:14:35 -08:00
fa980bb22a [wip][Dist Profiling] Enable dist profiling for MPI backend (#52949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52949

Enables distributed profiling which we have for gloo and nccl for the MPI backend
ghstack-source-id: 123610105

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D26591590

fbshipit-source-id: a20ec9d104faa26bc62c727dd01319c3ea230f5d
2021-03-11 13:08:41 -08:00
7e5ffbfa94 [caffe2] add a SerializationOptions field for the save operator (#53402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53402

Add an `options` field to the `Save` operator which accepts options for how to
serialize different blobs.  At the moment this simply allows controlling the
existing `chunk_size` behavior, but in the future we can add other options,
such as the ability to control compression settings or other serialization
formats.
ghstack-source-id: 123567034

Test Plan:
Added a new test to `load_save_test.py` that passes in options and verifies
that blobs were serialized with the expected number of chunks.

  buck test caffe2/caffe2:caffe2_test_cpu \
    caffe2/caffe2/core:serialization_test \
    caffe2/caffe2/python/operator_test:load_save_test

Reviewed By: mraway

Differential Revision: D26502577

fbshipit-source-id: 6e302e530bb96990517c2e35c505db7f14a56284
2021-03-11 13:02:58 -08:00
1acced4eba Implemented getCodeText(string attr) in llvm/cuda codegen and added python bindings for it - #52974 (#53664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53664

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26929204

Pulled By: huiguoo

fbshipit-source-id: 281fe6c25f4664636b29d51dba396056a222a9e7
2021-03-11 11:57:39 -08:00
379f1f1ede Automated submodule update: tensorpipe (#53810)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 2719d7e0b7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53810

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26979037

fbshipit-source-id: d0cc7c25b764d5f207431a839f396fb8e22b2a22
2021-03-11 11:35:55 -08:00
8b9e3e6fd4 [complex] enable complex autograd cumsum (#53240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53182

Turns out that there is no need to update the formula :)
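A quick example of what now works (a sketch; any complex dtype applies):

```python
import torch

x = torch.randn(4, dtype=torch.cdouble, requires_grad=True)
y = x.cumsum(0)
y.sum().abs().backward()   # complex autograd for cumsum
print(x.grad)
```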

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53240

Reviewed By: VitalyFedyunin

Differential Revision: D26948582

Pulled By: anjali411

fbshipit-source-id: 450aab0d585f15385dd1748c2a3ddf787df0764b
2021-03-11 11:30:18 -08:00
ec484981c6 [3/n][torch/elastic][upstream] Move torchelastic/events to torch/distributed/events (#53760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53760

Pull Request resolved: https://github.com/pytorch/elastic/pull/143

The diff upstreams torchelastic/events to torch.

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/agent/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/events/fb/...

Reviewed By: kiukchung

Differential Revision: D26932830

fbshipit-source-id: 23fc10d2ead5af7f7ed510ae0d2581cc2421cf76
2021-03-11 11:25:24 -08:00
bbce574ccf Pass commit_sha to add-annotations-github-action again (#53834)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53833.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53834

Test Plan: The CI logs for flake8-py3 and clang-tidy on this PR should show `commit_sha` being set to the PR tip in their respective "Add annotations" steps.

Reviewed By: malfet

Differential Revision: D26983201

Pulled By: samestep

fbshipit-source-id: e5d1fbbaf2a2611fec583b430c6353e778bc77a6
2021-03-11 11:17:17 -08:00
5cf4527c88 Update repo name for add-annotations-github-action (#53826)
Summary:
It looks like https://github.com/suo/add-annotations-github-action redirects to https://github.com/pytorch/add-annotations-github-action, so this is a bit less confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53826

Test Plan: The clang-tidy CI job should pass on this PR.

Reviewed By: malfet

Differential Revision: D26981832

Pulled By: samestep

fbshipit-source-id: 273c18535d0d27b14942b02ae552020ffc60623b
2021-03-11 11:11:24 -08:00
3f9c803fe8 [ONNX] Redesign onnx pass to enable shape type dependent pattern conversion - cont (#51795) (#53304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53304

With the introduction of ONNX shape inference, shape and type are inferred on the fly as operators get converted from ATen to ONNX when running the symbolic functions. This resolves the shape/type requirement for the symbolic functions. The pre-ONNX passes, however, cannot be supported by shape inference, since at that stage the operators in the graph are still ATen operators.

This PR updates the design of the ONNX pass to enable a mechanism for capturing subgraphs of ATen operators of certain patterns and converting them later, when shape/type information of upstream operators is available.

The new design requires pre-ONNX passes that need shape/type to be written in two parts, encapsulation and conversion.

    The encapsulation part will find the nodes matching the patterns, like how pre-ONNX passes were written previously. But instead of converting the nodes, it will encapsulate them into a sub-block of a new placeholder node. This part is called before the ONNX pass, so it runs before calling symbolic functions.

    The conversion part will be called inside the ONNX pass. In the ONNX pass, run_symbolic_func will be called for each node in topological order. When it reaches the placeholder node, the conversion part will be invoked. It will convert the nodes inside the sub-block based on the pattern. By that time, it will have the shape/type of upstream operators available. After the conversion is complete, the placeholder node will be removed and the nodes inside its sub-block converted. Run_symbolic_func will be called for these nodes, and they will be converted from ATen operators to ONNX operators.

This PR includes several other fixes, listed below.
* ~~replace helper.cpp with onnx_utils.cpp for holding utility functions.~~
* fix EraseNumberTypes on Bool type; the code was outdated, written back when the Bool type didn't exist.
* ~~enable onnx shape inference in export with parameter/initializer data.~~
* other code clean ups.
* fix insertion of identity nodes for loop opset 13 sequence output.

~~PR depends on #51603~~

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26922417

Pulled By: malfet

fbshipit-source-id: 14ed06158d539e2451c2e5e63ba1b32fb0f75095
2021-03-11 10:30:09 -08:00
5648fe6093 Make storage access throw for meta tensors (#53681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53681

Without throwing, we can easily segfault trying to access nullptr
storage.

To do this I made set_storage_access_should_throw public so that you
don't have to subclass TensorImpl to do it.  An alternative is
to just bite the bullet and add a MetaTensorImpl subclass.  Let
me know what is preferred.
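A sketch of the resulting behavior, assuming meta tensors can be constructed with `device="meta"`:

```python
import torch

t = torch.empty(2, 3, device="meta")
try:
    t.storage()                # any storage access now raises...
except Exception as e:
    print(type(e).__name__)    # ...instead of dereferencing a null storage
```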

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26955540

Pulled By: ezyang

fbshipit-source-id: 8ce22dd07ef1beb042f1d91de981954d59c2f84a
2021-03-11 10:18:14 -08:00
ec713c0eb5 [Pytorch] Improve scale and zero point extraction for per channel quantized (#53726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53726

In quantized linear layers, during deserialization we create scales and zero
points which are later used by the qnnpack kernels.
Scale and zero point extraction for per-channel quantized tensors is slow.
This is because we index directly into the zero point and scales tensors,
and each index creates a 1-element tensor slice which is then cast
to int32 or float.
This is very slow and increases model loading time.
This diff fixes that.
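The fix itself is in C++, but a Python sketch of the slow pattern versus bulk extraction:

```python
import torch

scales = torch.rand(256, dtype=torch.float64)

# Slow pattern: each index creates a 1-element tensor slice, which is
# then cast to a Python float -- one slice allocation per channel.
slow = [float(scales[i]) for i in range(scales.numel())]

# Bulk extraction does the conversion in a single pass.
fast = scales.tolist()
assert slow == fast
```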

Test Plan: CI

Reviewed By: raziel

Differential Revision: D26922138

fbshipit-source-id: b78e8548f736e8fa2f6636324ab1a2239b94a27c
2021-03-11 09:55:31 -08:00
d7b5a6faaa Revert "Revert D26733731: [pytorch][PR] Skip dispatch for `is_floatin… (#53242)
Summary:
…g_point`"

This reverts commit fbf2883d350f62d17292b71a58f404b5e3e58b7b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53242

Reviewed By: mrshenli

Differential Revision: D26896105

Pulled By: iramazanli

fbshipit-source-id: 279a6f6d4fbb7949a7ed65df848db71a9b8d44e2
2021-03-11 09:46:25 -08:00
7484c56fa3 [quant][graphmode][fx] Fix a condition check for CopyNode (#53585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53585

Previously fp16_static CopyNode would be marked as unquantized because of
an incorrect condition check of whether a Node is statically quantized or not.
This PR fixes that.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26912677

fbshipit-source-id: 4ddb538714c5ba2db28430de5e1cf2931baf1993
2021-03-11 09:32:20 -08:00
4c1af249fb [ROCM] load hipfft separately from rocfft (#53408)
Summary:
This PR makes changes to how hipfft is loaded in pytorch. hipfft is packaged in a separate library from rocfft following ROCm 4.1.

We check the ROCm version, and if it is past ROCm 4.1 we load hipfft in addition to rocfft.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53408

Reviewed By: albanD

Differential Revision: D26952702

Pulled By: malfet

fbshipit-source-id: f42be304b587c060816e39d36f5c1a2cdc37bfab
2021-03-11 09:18:33 -08:00
5842d34fac Call nvidia-smi.exe before running tests on Windows (#53422)
Summary:
Follow up for https://github.com/pytorch/pytorch/issues/53334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53422

Reviewed By: VitalyFedyunin

Differential Revision: D26954202

Pulled By: malfet

fbshipit-source-id: fe16a2413618e07d6380824e967d87e29a09b178
2021-03-11 09:12:34 -08:00
0a549f9412 [ROCm] Disable flaky tests on ROCm (#53192)
Summary:
The disabled tests are tracked by
https://github.com/pytorch/pytorch/issues/53190

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192

Reviewed By: zhangguanheng66

Differential Revision: D26782204

Pulled By: mrshenli

fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27
2021-03-11 08:29:12 -08:00
05f137c765 Remove GHA "Checkout PR tip" step (#53719)
Summary:
This PR replaces our current "Checkout PR tip" step (which is duplicated across many places) using a [scenario](https://github.com/actions/checkout#checkout-pull-request-head-commit-instead-of-merge-commit) from the `actions/checkout` README. We previously tried something similar in https://github.com/pytorch/pytorch/issues/49578, but using `github.head_ref` didn't work.

The reason this PR works is because, for events besides `pull_request`, the value of `github.event.pull_request.head.sha` defaults to the empty string, so it's as if we didn't set the `ref` option for `actions/checkout` at all, and it just uses its default behavior (e.g. for `push` events).

Incidentally, this PR also upgrades our use of `actions/checkout` from `v1` to `v2`, which introduces shallow clones by default. A couple of our jobs require deep clones, so we use `fetch-depth: 0` in those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53719

Test Plan: CI.

Reviewed By: albanD

Differential Revision: D26949121

Pulled By: samestep

fbshipit-source-id: e06f8066682ae0557fb5a055a10ea33b6bd320db
2021-03-11 08:06:49 -08:00
f364e492df Autograd functional API should enable_grad (#47543)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47543

Reviewed By: albanD

Differential Revision: D26965136

Pulled By: iramazanli

fbshipit-source-id: 1dd46b9402bb670c0e165db684712e26c1a2036f
2021-03-11 07:41:31 -08:00
e185ec6c3d Revert D26955317: Perform appropriate CUDA stream synchronization in distributed autograd.
Test Plan: revert-hammer

Differential Revision:
D26955317 (0b84f45f03)

Original commit changeset: eace6d4f91d4

fbshipit-source-id: 1f322b4d7cf7d1a7e6caf3194c6f0bf163d45850
2021-03-11 07:27:44 -08:00
ffac9b2ead Revert D26965463: [pytorch][PR] [docs] Add starter content for new TorchScript language reference
Test Plan: revert-hammer

Differential Revision:
D26965463 (d49c5c74f5)

Original commit changeset: 246c76a56d91

fbshipit-source-id: 50de1a2ac92204a2f3a2ad9b8fa163338062bf58
2021-03-11 07:26:00 -08:00
07d315fce8 Revert D26676150: Simplify index expressions constructed in loop flattening - #51173
Test Plan: revert-hammer

Differential Revision:
D26676150 (1f01899e4a)

Original commit changeset: e202e0c8610e

fbshipit-source-id: 9611dda6897b67e16e44c731994bc9e5fccab0b9
2021-03-11 07:17:38 -08:00
95d2318510 Adding parallel support for the LLVM backend. (#53243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53243

Test Plan: Imported from OSS

Reviewed By: bertmaher, Chillee

Differential Revision: D26906509

Pulled By: zheng-xq

fbshipit-source-id: 12c17f2f21af11e73fa4c5b5199043a7a15ecdec
2021-03-11 03:27:37 -08:00
351f6f5e02 [JIT] Update set_stream API to change the device (#53741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53741

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26970606

Pulled By: nikithamalgifb

fbshipit-source-id: 257b9425d105a68fc9ef567af266fa461ddf05ec
2021-03-11 02:13:22 -08:00
cfaa0bf286 [JIT] Update Namespace from cuda to _cuda (#53378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53378

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26970607

Pulled By: nikithamalgifb

fbshipit-source-id: 20a55dd9c0071c5870a4b176d30cb9c1e1496687
2021-03-11 00:52:01 -08:00
0b84f45f03 Perform appropriate CUDA stream synchronization in distributed autograd. (#53769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53769

The local autograd engine performs appropriate stream synchronization
between autograd nodes in the graph to ensure a consumer's stream is
synchronized with the producer's stream before executing the consumer.

However in case of distributed autograd, the SendRpcBackward function receives
gradients over the wire and TensorPipe uses its own pool of streams for this
purpose. As a result, the tensors are received on TensorPipe's stream pool but
SendRpcBackward runs on a different stream during the backward pass and there
is no logic to synchronize these streams.

To fix this, I've enhanced DistEngine to synchronize these streams
appropriately when it receives grads over the wire.
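A sketch of the producer/consumer pattern being applied, with plain CUDA streams standing in for TensorPipe's stream pool and the backward stream:

```python
import torch

recv_stream = torch.cuda.Stream()       # stands in for TensorPipe's stream pool
backward_stream = torch.cuda.Stream()   # stream running SendRpcBackward

with torch.cuda.stream(recv_stream):
    grads = torch.randn(1024, device="cuda")   # "received" gradients

event = torch.cuda.Event()
event.record(recv_stream)
backward_stream.wait_event(event)       # synchronize before consuming

with torch.cuda.stream(backward_stream):
    grads.mul_(2)                       # now safe to use the gradients
```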
ghstack-source-id: 123607221

Test Plan:
1) Added unit test which reproduced the issue.
2) waitforbuildbot.

Reviewed By: wanchaol, mrshenli

Differential Revision: D26955317

fbshipit-source-id: eace6d4f91d4006c9c16ede5ac16362ada052406
2021-03-10 23:39:55 -08:00
1053c96693 [GraphModule] Back out changes to module root version of __init__ (#53791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53791

Reviewed By: houseroad

Differential Revision: D26970869

fbshipit-source-id: 80684516f57fd2d1aca794f17fe488b2fe2b2f64
2021-03-10 23:18:56 -08:00
37ab711822 Adding learning rate schedulers to C++ API (#52268)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50577

Learning rate schedulers had not yet been implemented for the C++ API.

This pull request introduces the learning rate scheduler base class and the StepLR subclass. Furthermore, it modifies the existing OptimizerOptions such that the learning rate scheduler can modify the learning rate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52268

Reviewed By: mrshenli

Differential Revision: D26818387

Pulled By: glaringlee

fbshipit-source-id: 2b28024a8ea7081947c77374d6d643fdaa7174c1
2021-03-10 23:09:51 -08:00
ebfa9276d8 Move prim::layout for lite jit (#53781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53781

Needed for running the noise suppression model in the lite interpreter.

Test Plan: run model

Reviewed By: linbinyu

Differential Revision: D26967227

fbshipit-source-id: 19677fc796f1fb4423ebb11b5ffd9df5870a39cf
2021-03-10 21:26:09 -08:00
3bd250fd03 [nnc] Test ability to vectorize reads from an intermediate tensor (#53752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53752

This test doesn't work today because we don't properly vectorize
"FunctionCall" (which is the way one accesses an intermediate tensor).
ghstack-source-id: 123592860

Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- LoopNest.VectorizeUse`

Reviewed By: ZolotukhinM

Differential Revision: D26895550

fbshipit-source-id: 0798ebf3e6a834bd70181732c81528455d5329fa
2021-03-10 20:32:10 -08:00
a5e19126b6 [NNC] LoopNest cleanup (#53688)
Summary:
* Replacing vector of Tensors with a set of output buffers in `TensorExprKernel`.
* Creating a block statement while compiling in `TensorExprKernel`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53688

Reviewed By: mrshenli

Differential Revision: D26941222

Pulled By: navahgar

fbshipit-source-id: 9eb81ec2effcdeafbeaa67d1e12475166054f80f
2021-03-10 20:20:03 -08:00
d49c5c74f5 [docs] Add starter content for new TorchScript language reference (#52494)
Summary:
**Summary**
This commit adds a new .rst file to use for updating the language specification and prepopulates it with the updated content for the expressions section.

**Test Plan**
https://user-images.githubusercontent.com/4392003/110441235-638ee880-806e-11eb-83ae-3b908bf00d5b.mov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52494

Reviewed By: nikithamalgifb

Differential Revision: D26965463

Pulled By: SplitInfinity

fbshipit-source-id: 246c76a56d911a8061e720abd200a44d7dfa1f35
2021-03-10 19:36:27 -08:00
ced91bb713 [deploy] namespace and rename (#53670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53670

This puts deploy into the torch::deploy namespace. It also renames some
objects to better match their behavior:

* PythonObject -> Obj: in the future it will refer to either a Python object or a handle to a script obj, so rename it torch::deploy::Obj to be generic.
* MovableObject -> ReplicatedObj: to prevent confusion with "std::move", which is unrelated, and to note that we are replicating this object across interpreters.

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D26932131

Pulled By: zdevito

fbshipit-source-id: 8041d6c5b2041a7c3192c1a17d2edb38112a89f3
2021-03-10 19:13:07 -08:00
14acf92b2b [PyTorch] Speed up Tensor::data_ptr (#53723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53723

We know the size of the data item at compile time. Let's take
advantage of that instead of doing a runtime multiplication by the
data type size. (Presumably, constant propagating through
`data_type.itemsize()` to optimize the `imul` away was just a bridge
too far for clang -- I checked assembly and we went from a
load-and-`imul` to a `lea` that multiplied by constant 4 for
`data_ptr<float>()`.)
ghstack-source-id: 123559924

Test Plan:
Compared `perf stat` output for Mergenet AdIndexer
benchmark before/after this change:

Before:
```
         16,943.46 msec task-clock                #    0.999 CPUs utilized            ( +-  0.16% )
             3,771      context-switches          #    0.223 K/sec                    ( +- 15.86% )
                 3      cpu-migrations            #    0.000 K/sec
           101,660      page-faults               #    0.006 M/sec                    ( +-  1.00% )
    33,519,516,740      cycles                    #    1.978 GHz                      ( +-  0.20% )  (49.99%)
    68,556,471,199      instructions              #    2.05  insn per cycle           ( +-  0.08% )  (49.98%)
    11,100,415,689      branches                  #  655.145 M/sec                    ( +-  0.12% )  (50.02%)
        73,082,369      branch-misses             #    0.66% of all branches          ( +-  0.45% )  (50.01%)
```

After:
```
         16,779.99 msec task-clock                #    0.999 CPUs utilized            ( +-  0.40% )
             2,815      context-switches          #    0.168 K/sec                    ( +-  7.89% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  6.25% )
           100,124      page-faults               #    0.006 M/sec                    ( +-  0.40% )
    33,213,000,715      cycles                    #    1.979 GHz                      ( +-  0.39% )  (49.99%)
    68,359,270,731      instructions              #    2.06  insn per cycle           ( +-  0.08% )  (50.00%)
    11,058,104,630      branches                  #  659.005 M/sec                    ( +-  0.11% )  (50.00%)
        72,840,016      branch-misses             #    0.66% of all branches          ( +-  0.51% )  (49.99%)
```

0.9% cycles win and 0.29% instruction count win, both of which seem to
be outside the noise.

Reviewed By: bhosmer

Differential Revision: D26919110

fbshipit-source-id: 23fab7adcfcf6ec9c87ebfb5d5304b6f155f577f
2021-03-10 19:03:42 -08:00
1f01899e4a Simplify index expressions constructed in loop flattening - #51173 (#52882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52882

Test Plan:
Imported from OSS

build/bin/test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D26676150

Pulled By: huiguoo

fbshipit-source-id: e202e0c8610eb107558a3add8a6560a0cb97704a
2021-03-10 18:37:42 -08:00
997f05cd34 [nnc] Add an initialization expression to Reduce() (#53751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53751

Sometimes the initial value of a reduction expression needs to be
computed with reference to the loop axes; for example, adding bias can be
efficiently represented by initializing the accumulator from the bias tensor:
```
C[n, c, h, w] = bias[c]
for (...)
  C[n, c, h, w] += ...
```
ghstack-source-id: 123592861

Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- Reductions.InitFunction`

Reviewed By: navahgar

Differential Revision: D26940321

fbshipit-source-id: 8a08e19e5d0b9ad453a07fab8b61e75dcd3d626b
2021-03-10 17:13:14 -08:00
49a5f99440 skip dispatch in resize_ (#53575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53575

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26902348

Pulled By: bdhirsh

fbshipit-source-id: b322f233d934278f03e56cd1e35acc0665389398
2021-03-10 17:06:35 -08:00
21f9a6da7d Avoid of creating a copy of statusString every inference time (#53756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53756

as title

Reviewed By: yinghai

Differential Revision: D26949450

fbshipit-source-id: a737ce1ed25cf53faef8cdc94912542769a1008f
2021-03-10 16:58:02 -08:00
0584fd9339 [quant][fx][graphmode][fix] Only insert observers for fixed qparam ops (#53330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53330

Fixed a condition check for fixed qparam ops, previously we were including CopyNodes as well

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_fixed_qparams_ops_fp16

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26836867

fbshipit-source-id: 8c486155244f852e675a938c3f4237f26505671c
2021-03-10 16:51:36 -08:00
f9185973d1 [quantization] Add some support for 3d operations (#50003)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50002

The last commit adds tests for 3d conv with the `SubModelFusion` and `SubModelWithoutFusion` classes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50003

Reviewed By: mrshenli

Differential Revision: D26325953

Pulled By: jerryzh168

fbshipit-source-id: 7406dd2721c0c4df477044d1b54a6c5e128a9034
2021-03-10 16:40:35 -08:00
895735c69f TensorIterator: Avoid nesting two levels of function_ref in for_each (#53613)
Summary:
When calling `TensorIterator::for_each` with a 1d loop, it creates a `function_ref` for the 1D iteration, then wraps it with `LOOP_WRAPPER` to transform it into a 2d loop. That 2d loop then gets wrapped in another `function_ref`. This can result in significant overhead if the 1d inner loop is over a small number of elements.

Instead, this wraps the 1d loop before type-erasure so only one level of `function_ref` is introduced. A simple benchmark demonstrates this is a win:
```python
import torch
a = torch.rand((10000, 2))[::2]
%timeit a + a
```

Note the 2D tensor cannot be coalesced into 1D, and both `cpu_kernel` and `cpu_kernel_vec` use 1D for_each. On master, this takes 42 us, but with this change it's down to 32 us.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53613

Reviewed By: VitalyFedyunin

Differential Revision: D26947143

Pulled By: ezyang

fbshipit-source-id: 5189ada0d82bbf74170fb446763753f02478abf6
2021-03-10 16:28:22 -08:00
fe0810e2f8 Add a section to introduce GradBucket class in ddp_comm_hooks.rst (#53253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53253

Since the GradBucket class is becoming public, mention it in ddp_comm_hooks.rst.

Screenshot:
{F478201008}

ghstack-source-id: 123596842

Test Plan: viewed generated html file

Reviewed By: rohan-varma

Differential Revision: D26812210

fbshipit-source-id: 65b70a45096b39f7d41a195e65b365b722645000
2021-03-10 16:14:34 -08:00
c988b78be2 Add a description of GradBucket Python class (#53596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53596

This description will be used in ddp_comm_hook docstrings.
ghstack-source-id: 123590360

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26908160

fbshipit-source-id: 824dea9203ca583676bddf0161c9edca52c9d20e
2021-03-10 16:12:53 -08:00
741d0f41d6 [package] split tests (#53749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53749

Split up tests into cases that cover specific functionality. Goals:
1. Avoid the omnibus test file mess (see: test_jit.py) by imposing early
   structure and deliberately avoiding a generic TestPackage test case.
2. Encourage testing of individual APIs and components by example.
3. Hide the fake modules we created for these tests in their own folder.

You can either run the test files individually, or still use
test/test_package.py like before.

Also this isort + black formats all the tests.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26958535

Pulled By: suo

fbshipit-source-id: 8a63048b95ca71f4f1aa94e53c48442686076034
2021-03-10 16:07:36 -08:00
4351d09683 Fix error message in setStorage (#53198)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53198

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26830030

Pulled By: zdevito

fbshipit-source-id: 34d383c4561bba88dee6d570cbd22bd58a3fe856
2021-03-10 15:45:14 -08:00
fee263595c Remove trailing whitespace introduced in #52175 (#53762)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53762

Test Plan: CI.

Reviewed By: seemethere

Differential Revision: D26961133

Pulled By: samestep

fbshipit-source-id: 972ea480baa3f34b65327abdf7e8bfdf30788572
2021-03-10 15:35:04 -08:00
023948e6d7 [caffe2] update load_save_test.py to also verify the chunking behavior (#53401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53401

This is a reland of D26641599 (cd9ac54ea7) after rebasing onto D26802576 (f595ba1bae).

Add some small utility functions to read the blob names back from the minidb
file so that we can verify how many chunks were written for each blob.
ghstack-source-id: 123567033

Test Plan: buck test caffe2/caffe2/python/operator_test:load_save_test

Reviewed By: mraway

Differential Revision: D26853942

fbshipit-source-id: 0b45078fdd279f547752c8fdb771e296374a00da
2021-03-10 15:29:36 -08:00
99d7c8ff94 [caffe2] use AddNAlreadyReserved() when serializing blobs (#53400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53400

This is a reland of D26617038 (b4a8d98247) after rebasing onto D26802576 (f595ba1bae).

Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.
ghstack-source-id: 123567030

Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function.  This improvement appears
relatively consistent across large and small tensor sizes.

Reviewed By: mraway

Differential Revision: D26853941

fbshipit-source-id: 4ccaa5bc1dd7f7864068d71a0cde210c699cbdba
2021-03-10 15:27:52 -08:00
2cf90982e9 [TestZeroRedundancyOptimizer] Add multi gpu checker (#53564)
Summary:
The test test_collect_shards fails on a single-GPU setup.
Enable the multi-GPU checker.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53564

Reviewed By: H-Huang

Differential Revision: D26952325

Pulled By: rohan-varma

fbshipit-source-id: e8956f9277c7320024bece129767e83fbdf02b2c
2021-03-10 15:17:55 -08:00
d9fa957ecc [quant][graphmode][fix] Handle the case when observed node has no users (#53210)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53210

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26791724

fbshipit-source-id: b2a226a22d6aba86dd01cacbb56577048a289b3e
2021-03-10 15:08:48 -08:00
56f3cb7a99 Add AST rewriter to acc_tracer (#53644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53644

Reviewed By: gcatron

Differential Revision: D26506841

fbshipit-source-id: 64367d7e9f6619d014a01c147476b50467efa5c8
2021-03-10 14:46:31 -08:00
5563248b58 [JIT] [Remove Mutation] Add handling of normal_ (#52175)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52175

Reviewed By: mrshenli

Differential Revision: D26919193

Pulled By: eellison

fbshipit-source-id: d036cbc7b42377f88a3d381e4932a710b8d22a04
2021-03-10 14:28:09 -08:00
c68cc24cee update upsample tests in test_nn.py to test for memory_format (#53665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665

ngimel pointed out to me that we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.

There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26929683

Pulled By: bdhirsh

fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
2021-03-10 14:21:14 -08:00
669fcf3093 Replace supports_tensor_out with supports_out (#53745)
Summary:
https://github.com/pytorch/pytorch/issues/52875 introduced this bug, as `supports_tensor_out` was replaced with `supports_out` in https://github.com/pytorch/pytorch/issues/53259, so CI checks are failing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53745

Reviewed By: gmagogsfm

Differential Revision: D26958151

Pulled By: malfet

fbshipit-source-id: 7cfe5d1c1a33f06cb8be94281ca98c635df76838
2021-03-10 13:42:56 -08:00
76b58dd9ae Fix distributions which don't properly honor validate_args=False (#53600)
Summary:
A number of derived distributions use base distributions in their
implementation.

We add what we hope is a comprehensive test of whether all distributions
actually honor skipping validation of arguments in log_prob, and then
fix the bugs we found. These bugs are particularly cumbersome in
PyTorch 1.8 and master, where validate_args is turned on by default.
In addition, one might argue that validate_args is not performing
as well as it should when the default is not to validate but the
validation is turned on at instantiation.
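A minimal illustration of the contract under test (Bernoulli chosen arbitrarily):

```python
import torch
from torch.distributions import Bernoulli

# With validate_args=False, log_prob must skip argument validation,
# even in distributions that delegate to a base distribution.
d = Bernoulli(probs=torch.tensor(0.5), validate_args=False)
print(d.log_prob(torch.tensor(0.3)))   # out-of-support value, no error raised
```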

Arguably, there is another set of bugs, or at least inconsistencies,
where validation of inputs does not prevent invalid indices in
sample validation (where, with validation, an IndexError is raised
in the test). We would encourage the implementors to be more
ambitious when validation is turned on and amend sample validation
to throw a ValueError for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53600

Reviewed By: mrshenli

Differential Revision: D26928088

Pulled By: neerajprad

fbshipit-source-id: 52784a754da2faee1a922976e2142957c6c02e28
2021-03-10 13:16:32 -08:00
b03c92a9c5 [2/n][torch/elastic][upstream] Move torchelastic/timer torchelastic/multiprocessing to torch/distributed/elastic (#53574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53574

Upstreams `torchelastic/timer|multiprocessing` to `torch/distributed/elastic/timer|multiprocessing`

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/...
buck test mode/dev-nosan //caffe2/test/distributed/elastic/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
buck test mode/dev-nosan //hpc/...
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/...
```

Reviewed By: borovsky-d, wilson100hong

Differential Revision: D26899809

fbshipit-source-id: e6dbc2a78282eac296c262b3206a979e3ef1ff53
2021-03-10 12:32:53 -08:00
cb68039363 Port NumPy typing testing style to PyTorch (#52408)
Summary:
ref: https://github.com/pytorch/pytorch/issues/16574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52408

Reviewed By: anjali411

Differential Revision: D26654687

Pulled By: malfet

fbshipit-source-id: 6feb603d8fb03c2ba2a01468bfde1a9901e889fd
2021-03-10 12:18:01 -08:00
17bc70e6f7 [package] make importer a little more obscure (#51676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51676

We offer the ability to access the importer from within packaged modules by doing
`import resources`. This behavior is nice (and more powerful than the
importlib resources API), but I think `resources` is too common a name
(pip has a package for it)

Change to `import torch_package_importer` but open to bikeshedding

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26620314

Pulled By: suo

fbshipit-source-id: 0942c99f02c0f55f5f3a1b2566961018b796bdd4
2021-03-10 12:13:15 -08:00
b4d8f4af82 [package] implement get_resource_reader API (#51674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51674

See
https://docs.python.org/3/library/importlib.html#importlib.abc.ResourceReader
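A hypothetical usage sketch (the archive and resource names here are illustrative, not from the PR):

```python
from torch.package import PackageImporter

imp = PackageImporter("model_archive.pt")     # hypothetical archive
reader = imp.get_resource_reader("my_pkg")    # importlib ResourceReader protocol
with reader.open_resource("data.txt") as f:   # hypothetical packaged resource
    payload = f.read()
```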

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26237034

Pulled By: suo

fbshipit-source-id: 4c19f6172d16b710737528d3de48372873b9368d
2021-03-10 12:11:11 -08:00
bfc80b3566 Give line numbers in git-grep-based lints (#53733)
Summary:
Meant to make tasks like https://github.com/pytorch/pytorch/issues/53728 easier. The `-n` flag enables line numbers, and the `-o` flag reduces noise by only showing the part of the line that matched (which in this case is just the trailing whitespace).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53733

Test Plan:
```
$ git checkout e937db5dbaeaeae1134b02b3b78c43db3f6a91cd
```

Before:
```
$ (! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above files have trailing spaces; please remove them"; false))
aten/src/ATen/native/cuda/BatchLinearAlgebra.cu
The above files have trailing spaces; please remove them
```

After:

```
$ (! git grep -I -no ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above files have trailing spaces; please remove them"; false))
aten/src/ATen/native/cuda/BatchLinearAlgebra.cu:1972:
The above files have trailing spaces; please remove them
```

Reviewed By: mruberry

Differential Revision: D26953538

Pulled By: samestep

fbshipit-source-id: 5f7d48b79f1a02e5e5a09fe00316ec350cfc340e
2021-03-10 12:03:26 -08:00
70a43425e0 Simplify init._calculate_fan_in_and_fan_out (#53522)
Summary:
This uses the shape of the tensor instead of directly indexing it. This is useful when extending PyTorch's tensor class, e.g. for lazy access. Since the `init` sub-module doesn't check for `__torch_function__`, it is not possible to override its functions. Explicitly indexing the tensor will force a call to tensor() and reconstruct the full tensor / explicitly access the elements. Simply using the shape avoids that.
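A sketch of the shape-only computation (mirroring `_calculate_fan_in_and_fan_out`), which never touches tensor elements:

```python
import torch

def calculate_fan_in_and_fan_out(tensor: torch.Tensor):
    # Uses only tensor.shape, so lazy/wrapper tensor subclasses are not
    # forced to materialize their data through element access.
    if tensor.dim() < 2:
        raise ValueError("Fan in and fan out require at least 2 dimensions")
    num_input_fmaps = tensor.shape[1]
    num_output_fmaps = tensor.shape[0]
    receptive_field_size = 1
    for s in tensor.shape[2:]:
        receptive_field_size *= s
    return num_input_fmaps * receptive_field_size, num_output_fmaps * receptive_field_size
```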

Fixes https://github.com/pytorch/pytorch/issues/53540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53522

Reviewed By: anjali411

Differential Revision: D26947794

Pulled By: jbschlosser

fbshipit-source-id: 80cd65efed16383f21363cee2eb404c9bc05971c
2021-03-10 11:57:17 -08:00
a76b4736db clang format reducer and logger files (#53148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53148

clang format reducer and logger files
ghstack-source-id: 123453983

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26764509

fbshipit-source-id: 711efcfd77420f912861cfd20c69e3af5086f4b9
2021-03-10 11:35:30 -08:00
d032287ec3 fix data type logging (#53162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53162

It is possible for there to be multiple data types in mixed precision training, so log the data types as a list of data type names.
ghstack-source-id: 123452626

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26769256

fbshipit-source-id: 8f7d73821e89864fedbbce723f301fe8fbad5685
2021-03-10 11:35:26 -08:00
7d4b229d61 add is_multi_device_module logging field (#53149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53149

add is_multi_device_module logging field
ghstack-source-id: 123444621

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26765355

fbshipit-source-id: d4d9c5981b18b1744299aebe8af37eb4e2e35c61
2021-03-10 11:35:22 -08:00
a08fc1a7fc allow users to set sample rate and add per iteration latency breakdowns (#53145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145

Add a new API to allow users to set the sample rate for runtime stats, and add per-iteration latency breakdowns to the DDPLoggingData struct. E.g.,
if users set the sample rate to 1, they can analyze the per-iteration latency change over time (not averaged).
ghstack-source-id: 123443369

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D26763957

fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
2021-03-10 11:35:18 -08:00
8f15a2f052 eig_backward: faster and with complex support (#52875)
Summary:
As per title. Compared to the previous version, it is lighter on the usage of `at::solve` and `at::matmul` methods.

Fixes https://github.com/pytorch/pytorch/issues/51621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52875

Reviewed By: mrshenli

Differential Revision: D26768653

Pulled By: anjali411

fbshipit-source-id: aab141968d02587440128003203fed4b94c4c655
2021-03-10 11:33:30 -08:00
b99b6065f8 Removes trailing whitespace (#53728)
Summary:
Fixes

```
Run (! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above files have trailing spaces; please remove them"; false))
aten/src/ATen/native/cuda/BatchLinearAlgebra.cu
The above files have trailing spaces; please remove them
Error: Process completed with exit code 1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53728

Reviewed By: ngimel

Differential Revision: D26953099

Pulled By: mruberry

fbshipit-source-id: 5f1ed2cd767de49447fcbd8e03cb3af7841cbcaf
2021-03-10 11:27:33 -08:00
5658ab5f77 [andoid] publishing to maven central (#53568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53568

Bintray and JCenter are going to be unavailable on May 1st.

https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/

Migrating publishing to Maven Central the same way as other FB OSS projects; reference PR https://github.com/pytorch/pytorch/pull/53568/files

To publish:
```
./android/gradlew -p android publish
```

<img width="697" alt="Screen Shot 2021-03-09 at 3 14 08 PM" src="https://user-images.githubusercontent.com/6638825/110551387-3e3fc000-80ea-11eb-9604-4e69d6e6d085.png">

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D26928884

Pulled By: IvanKobzarev

fbshipit-source-id: 8754c93a2542405870e2621be5b3f14e3d0081b9
2021-03-10 10:42:23 -08:00
05e0ea9661 [android] bump gradle version to 6.8.3 (#53567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53567

Updating Gradle to version 6.8.3.
The proper zip was uploaded to AWS.

Successful CI check: https://github.com/pytorch/pytorch/pull/53619

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D26928885

Pulled By: IvanKobzarev

fbshipit-source-id: b1081052967d9080cd6934fd48c4dbe933630e49
2021-03-10 10:40:28 -08:00
e13ef777a7 Use native ctc loss for target length 256 (#53557)
Summary:
Apparently cudnn (8.1) does not like 256-long targets.
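A sketch of the configuration in question (sizes chosen only to satisfy CTC's length constraints):

```python
import torch
import torch.nn.functional as F

T, C, S = 600, 64, 256                     # target length exactly 256
log_probs = torch.randn(T, 1, C, device="cuda").log_softmax(2)
targets = torch.randint(1, C, (1, S), dtype=torch.int)
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.tensor([T], dtype=torch.int),
                  target_lengths=torch.tensor([S], dtype=torch.int))
# With this change, the native kernel is used here instead of cuDNN.
```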

Thank you raotnameh for reporting.

Fixes https://github.com/pytorch/pytorch/issues/53505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53557

Reviewed By: VitalyFedyunin

Differential Revision: D26947262

Pulled By: albanD

fbshipit-source-id: df6da7db8fd8e35050b4303ff1658646ebc60141
2021-03-10 10:13:42 -08:00
e937db5dba Added CUDA support for torch.orgqr (#51348)
Summary:
**Update:** MAGMA support was dropped from this PR. Only the cuSOLVER path is implemented and it's used exclusively.

**Original PR message:**

This PR adds support for CUDA inputs for `torch.orgqr`.

The CUDA implementation is based on both [cuSOLVER](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) and MAGMA. cuSOLVER doesn't have a specialized routine for the batched case, while MAGMA doesn't have a specialized GPU-native (without CPU sync) `orgqr`. But MAGMA has implemented (and not documented) a batched GPU-native version of the `larft` function (for small inputs of size <= 32), which together with the `larfb` operation forms `orgqr` (see the call graph [here at the end of the page](http://www.netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_ga14b45f7374dc8654073aa06879c1c459.html)).

So now there are two main codepaths for CUDA inputs (if both MAGMA and cuSOLVER are available):
* if `batchsize > 1` and `tau.shape[-1] <= 32` then MAGMA based function is called
* else [cuSOLVER's `orgqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) is used.

If MAGMA is not available then only cuSOLVER is used and vice versa.

Documentation updates and possibly a new name for this function will be in a follow-up PR.

Ref. https://github.com/pytorch/pytorch/issues/50104
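A usage sketch on CUDA (reconstructing Q from the output of `torch.geqrf`):

```python
import torch

A = torch.randn(5, 3, device="cuda")
a, tau = torch.geqrf(A)
Q = torch.orgqr(a, tau)    # cuSOLVER path (MAGMA for small batched inputs)
print(torch.allclose(Q.t() @ Q, torch.eye(3, device="cuda"), atol=1e-5))
```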

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51348

Reviewed By: heitorschueroff

Differential Revision: D26882415

Pulled By: mruberry

fbshipit-source-id: 9f91ff962921932777ff108bedc133b55fe22842
2021-03-10 09:59:56 -08:00
45ddf113c9 [fix] nn.Embedding: allow changing the padding vector (#53447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53368
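A short sketch of what the fix allows (updating the padding row in place; its gradient still stays zeroed during training):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=0)
with torch.no_grad():
    emb.weight[emb.padding_idx] = torch.ones(3)   # custom padding vector
print(emb(torch.tensor([0])))                      # returns the custom vector
```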

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53447

Reviewed By: albanD

Differential Revision: D26946284

Pulled By: jbschlosser

fbshipit-source-id: 54e5eec7da86fa02b1b6e4a235d66976a80764fc
2021-03-10 09:53:27 -08:00
bcbe07200c Improve logic for S3 stats gathering. Uses automatic SLOW_TESTS. (#53549)
Summary:
This PR:
1. refactors the logic for S3 stats gathering.
2. Renames SLOW_TESTS to TARGET_DET_LIST to disambiguate and remove confusion with slowTest
3. Detects slow tests (tests with time > 5 min) to add to the TARGET_DET_LIST based on results in S3 from the previous nightly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53549

Test Plan:
Set CIRCLE_JOB to your favorite CI job (like `pytorch_linux_bionic_py3_8_gcc9_coverage_test1`).
Run `python test/run_test.py --determine-from=<your fave pytorch files>`
e.g., `python test/run_test.py --determine-from=test/run_test.py`

Reviewed By: mrshenli

Differential Revision: D26904478

Pulled By: janeyx99

fbshipit-source-id: 9576b34f4fee09291d60e36ff2631753a3925094
2021-03-10 09:37:06 -08:00
1c9fc38eb2 Remove reference to 9.2 as it's been removed from nightlies (#53716)
Summary:
Removes a small unneeded reference to cuda92 for the Windows binary. Note that the config.yml did not change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53716

Reviewed By: VitalyFedyunin

Differential Revision: D26947029

Pulled By: janeyx99

fbshipit-source-id: 3bbf1faa513756eda182d2d80033257f0c629309
2021-03-10 09:29:10 -08:00
70733f2e67 Marginally improve pytest collection for top-level test files (#53617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53617

I'm trying to make `pytest test/*.py` work--right now, it fails during
test collection.  This removes a few of the easier-to-fix pytest
collection problems, one way or another.  I have two remaining problems,
which are that the default dtype is trashed on entry to test_torch.py and
test_cuda.py; I'll try to fix those in a follow-up.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26918377

Pulled By: ezyang

fbshipit-source-id: 42069786882657e1e3ee974acb3ec48115f16210
2021-03-10 08:56:39 -08:00
6e020a4844 Fix inaccurate dispatch table for fill_ (#53611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53611

fill_ now uses DispatchStub which means it only works for
CPU/CUDA.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D26918374

Pulled By: ezyang

fbshipit-source-id: fc899c28f02121e7719b596235cc47a0f3da3aea
2021-03-10 08:56:29 -08:00
4dbd0b639d Convert a few more checks to raise NotImplementedError (#53610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53610

I noticed these because I was running the test suite under
meta device and triggered these error checks without getting
a NotImplementedError.  Well, now they raise.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26918376

Pulled By: ezyang

fbshipit-source-id: 20d57417aa64875d43460fce58af11dd33eb4a23
2021-03-10 08:53:59 -08:00
e787872a47 [RELAND] Deduplicate shared params before constructing Reducer in DDP (#53279)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/51929 seemed to trigger failures in `pytorch_linux_xenial_py3_clang5_asan_test2`. Resubmitting to figure out why, and hopefully reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53279

Reviewed By: mrshenli

Differential Revision: D26916701

Pulled By: zhaojuanmao

fbshipit-source-id: 75c74c8ad8ad24154eb59eddb2b222da0a09897e
2021-03-10 07:56:20 -08:00
039402b945 If distributed module isn't available, don't run distributed/pipeline tests (#53547)
Summary:
Following up https://github.com/pytorch/pytorch/issues/52945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53547

Reviewed By: mrshenli

Differential Revision: D26946364

Pulled By: ezyang

fbshipit-source-id: 9f93e76e2420d19b46d4eb3429eac5f263fd5c23
2021-03-10 07:43:43 -08:00
6aa5148df2 Filter 0's returned by exponential distribution (#53480)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48841 for half datatype (it was fixed for other datatypes before).
The reason for https://github.com/pytorch/pytorch/issues/48841 happening for half was that `exponential_` for half was producing 0s.
Exponential distribution implementation on cuda is here e08aae2613/aten/src/ATen/native/cuda/DistributionTemplates.h (L535-L545)
with `transformation::exponential` defined here
e08aae2613/aten/src/ATen/core/TransformationHelper.h (L113-L123)
It takes a uniformly distributed random number and takes the `log` of it. If necessary, the result is then converted to a low-precision datatype (half). To avoid 0s, before applying `log`, ones are replaced with std::nextafter(1,0). This seems fine, because log(1-eps) is still representable in half precision (`torch.tensor([1.], device="cuda").nextafter(torch.tensor([0.], device="cuda")).log().half()` produces 5.96e-8), so casting to `scalar_t` should work. However, since the fast log approximation (`__logf`) is used, the log result is ~3e-9 instead of the more accurate 5.96e-8, and underflows when cast to half. Using `::log` instead of the fast approximation fixes it; however, it comes with a ~20% perf penalty on the exponential kernel for the fp32 datatype, probably more for half.

Edit: the alternative approach used now is to filter all small values returned by the transformation. The result is equivalent to the squashing of 1s to 1-eps that was used before, combined with computing the correct log of 1-eps (which is -eps, exactly equal even for doubles). This doesn't incur a noticeable performance hit.
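The underflow arithmetic is easy to check directly:

```python
import torch

print(torch.tensor(5.96e-8).half())   # ~5.96e-08: the accurate log(1-eps) survives
print(torch.tensor(3e-9).half())      # 0.0: the fast-approximate result underflows
```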

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53480

Reviewed By: mruberry

Differential Revision: D26924622

Pulled By: ngimel

fbshipit-source-id: dc1329e4773bf91f26af23c8afa0ae845cfb0937
2021-03-10 00:35:31 -08:00
c5cd993add Adds a bool is_available() method to the backend contract (#53068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53068

Adds a `bool is_available()` method to the backend contract: it returns `true` if `compile()` and `execute()` can be called; `false` otherwise.

It is used to implement the following changes in the `LoweredModule`:
* `compile()` in `__setstate__` will run if `is_available()`, else `__setstate__` throws an exception (“Backend not available.”).
* `compile()` at `LoweredModule` creation will run if `is_available()`, else a WARNING will be thrown.
* `execute()` will only be executed if `is_available()` returns true; else throws an exception (“Backend not available.”).

The goal of these changes is to ensure we have a well defined behaviour for the different combinations of backend availability on-host and on-target.

More specifically, backends may have different capabilities to compile and/or execute the Module, depending whether this happens on-host (i.e. where the program is being written) or on-target (where the program is being executed).

First of all, we know that "preprocess" always takes place, and that only happens on-host at creation time. So, we can assume that if any compilation is needed/possible on-host, then all of it can be pushed there.

Overall, we want to ensure the following:

**On host**

| compile | execute | Outcome |
| -- | -- | -- |
| No | No | On module creation, LoweredModule is generated, with a warning  (since compilation and execution can still take place on-target). On module load, throws an exception (since execution is not possible). |
| No | Yes | This configuration should not be possible. This assumes the full compiler is not available; even if some work was done in preprocess, the program cannot be finalized for execution. |
| Yes | No | In this case, the expectation would be for is_available() to return false, and compilation logic to move into preprocess. |
| Yes | Yes | All good. This is the only case that is_available() should return true. |

**On target**

| compile | execute | Outcome |
| -- | -- | -- |
| No | No | Loading the LoweredModule throws an exception. Since execution is not possible. |
| No | Yes | Basically this is another instance of Yes/Yes: compilation per se may not be possible on device, which means compile() can be called without issue but it is a no-op, and thus is_available should return true. Consequently, loading the LoweredModule: Succeeds, if the preprocessed module is ready for execution. Fails with exception otherwise. |
| Yes | No | This configuration should not be possible. Just putting here for completeness. |
| Yes | Yes | All good. This, along with No/Yes case (because compilation is assumed to have happened on-host, so it's just another instance of Yes/Yes), are the cases where is_available() should return true. |

**Refactoring existing code**
This change also updates other backends (Glow) code, to implement the is_available() method to have the same behaviour as before this change (i.e. always available).

This should not cause backward incompatibilities with already saved models since we're adding a new method to the PyTorchBackendInterface.
Models saved with the old interface that didn't have is_available() will still find the other 2 methods in the bound object (i.e. compile and execute), and the saved LoweredModule logic will be the old one.

**Future**
We plan to use is_available() to implement support for fallback to the PyTorch interpreter.
ghstack-source-id: 123498571

Test Plan: Added C++ (test_backend.cpp) and Python (test_backends.py) tests to validate the exceptions.

Reviewed By: jackm321, spaugh, iseeyuan

Differential Revision: D26615833

fbshipit-source-id: 562e8b11db25784348b5f86bbc4179aedf15e0d3
2021-03-10 00:24:16 -08:00
215950e2be Convert type annotations in nn/functional.py to py3 syntax (#53656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53656

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26926018

Pulled By: jamesr66a

fbshipit-source-id: 2381583cf93c9c9d0c9eeaa6e41eddce3729942d
2021-03-09 22:26:22 -08:00
a20b36b03d [JIT] Fix backward compatibility test broken by #53410 (#53683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53683

**Summary**
This commit fixes the BC test broken by #53410. There are no promises
about operator-level BC with the operators added and modified by that
PR, so this test failure does not represent a real backward
compatibility issue.

**Test Plan**
Ran the BC test locally by running `dump_all_schemas.py` and then
`check_backward_compatibility.py`.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26936505

Pulled By: SplitInfinity

fbshipit-source-id: 829d5d78e4cba44feea382d0fbd66e77dee7eed2
2021-03-09 22:00:15 -08:00
8a6df06a0e Print onnxifi failed status code in readable format (#53648)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53648

Reviewed By: hl475

Differential Revision: D26838564

fbshipit-source-id: 6e0e5695a58422d573f9c97bfb241bce2688f13b
2021-03-09 21:34:57 -08:00
3b0e4a6ed4 [GraphModule] Improve buffer registration during init (#53444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53444

GraphModule construction has two options when constructing the base nn.Module: a dict of names to attrs to assign to the GraphModule, or another nn.Module to copy attrs from.

- For the dict case, add logic to explicitly register `torch.Tensor`s that are not `nn.Parameter`s as buffers on the GraphModule, else fall back to `__setattr__`.
- For the other `nn.Module` case, update it so that it checks in the other module whether the attr to copy in is a buffer, and registers it as such, else falls back to `__setattr__`.
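A sketch of the dict-based case with the new behavior:

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(2))
    def forward(self, x):
        return x * self.scale

graph = fx.symbolic_trace(M()).graph
# "scale" is a plain tensor (not a Parameter), so it is registered
# as a buffer on the resulting GraphModule rather than a bare attribute.
gm = fx.GraphModule({"scale": torch.ones(2)}, graph)
print(dict(gm.named_buffers()))
```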

Test Plan: Added tests for fetching params and buffers from a GraphModule using both dict and module `__init__`s

Reviewed By: jamesr66a

Differential Revision: D26860055

fbshipit-source-id: 8d9999f91fef20aaa10969558006fc356247591f
2021-03-09 21:05:01 -08:00
c3f8d57c70 use DimVector for sizes and strides (#53001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53001

Test Plan: Imported from OSS

Reviewed By: swolchok

Differential Revision: D26719508

Pulled By: bhosmer

fbshipit-source-id: 4053d632e11b2de1576c59c5a6b881a195d6206b
2021-03-09 20:09:28 -08:00
0257eddc16 Editing pass on native/README.md updates (#53638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53638

Mostly slight edits, and deleting some outdated sections.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D26920600

Pulled By: ezyang

fbshipit-source-id: e3bda80ecb622a1fcfde64e4752ba89a71056340
2021-03-09 19:30:59 -08:00
409a76f72c [Static Runtime] Fix bug in static_runtime::to_copy (#53634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53634

Make the op signature of `static_runtime::to_copy` consistent with that of native_functions.yaml so it works with 2-5 args:
```
- func: to.dtype(Tensor self, ScalarType dtype, bool non_blocking=False, bool copy=False, MemoryFormat? memory_format=None) -> Tensor
  variants: method
  device_guard: False
```

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D26906726

fbshipit-source-id: b9203eb23619aba42b1bfed1a077401f9fe2ddf0
2021-03-09 16:26:34 -08:00
60ed8fb244 [JIT] Enable ModuleList non-literal indexing (#53410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53410

**Summary**
This commit enables indexing into `ModuleList` using a non-literal
index if the LHS of the assignment statement of which the indexing is
the RHS is annotated with an interface type.

This feature already exists for `ModuleDict`, and this commit builds on
top of that implementation. A `prim::ModuleContainerIndex` operator is
emitted for any statement of the form `lhs: InterfaceType =
module_container[idx]`. The same operator has to be used for both
`ModuleDict` and `ModuleList` because serialization does not preserve
the metadata that indicates whether a `Module` is a `ModuleDict` or
`ModuleList`.

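A minimal sketch of the pattern this enables (interface and module names are illustrative):

```python
import torch

@torch.jit.interface
class ModuleInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass

class Submodule(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mods = torch.nn.ModuleList([Submodule() for _ in range(3)])

    def forward(self, x: torch.Tensor, idx: int) -> torch.Tensor:
        # Non-literal index; the interface annotation on the LHS is what
        # lets the compiler emit prim::ModuleContainerIndex.
        m: ModuleInterface = self.mods[idx]
        return m.forward(x)

scripted = torch.jit.script(M())
```
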
**Testing**
This commit extends the existing unit tests for non-literal `ModuleDict`
indexing to test non-literal `ModuleList` indexing.

**Fixes**
This commit fixes #47496.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26857597

Pulled By: SplitInfinity

fbshipit-source-id: d56678700a264d79aae3de37ad6b08b080175f7c
2021-03-09 16:11:34 -08:00
5dca8ff6de [FX] Make TracerBase._find_user_frame private (#53654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53654

Test Plan: Imported from OSS

Reviewed By: suo, Chillee

Differential Revision: D26924950

Pulled By: jamesr66a

fbshipit-source-id: 23e641bbcabff148c18db0edeff0a12c10b8c42d
2021-03-09 16:06:22 -08:00
cff22f8794 Port sin to structured. (#52277)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52277

Test Plan: Imported from OSS

Reviewed By: walterddr, nikithamalgifb

Differential Revision: D26732398

Pulled By: bdhirsh

fbshipit-source-id: fa1a3c2359e2bf8fe326d2f74d2f9041ba189d24
2021-03-09 16:06:18 -08:00
b9c3edd583 Remove hacky wrapper from a lot of unary operators. (#52276)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52276

Test Plan: Imported from OSS

Reviewed By: walterddr, nikithamalgifb

Differential Revision: D26732399

Pulled By: bdhirsh

fbshipit-source-id: 4189594e938c9908a4ea98a0b29d75a494d0dc35
2021-03-09 16:04:36 -08:00
233b9490c2 fix channels_last bug in upsample kernels (#53535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53535

During the port of the upsample kernels to structured kernels, I missed that a subset of them explicitly passes `memory_format` information from the input to the output tensors.

Note 1:
I added the logic into the `meta` function of each op, which feels morally correct since this logic affects the output shape/metadata. One consequence is that all backend implementations will get the logic. I synced with fmassa that this seems reasonable.

Note 2:
This logic used to happen in the following operators, which this PR fixes:
- upsample_nearest3d
- upsample_trilinear3d
- upsample_nearest2d
- upsample_bilinear2d

I explicitly didn't patch the other upsample kernels, which look like they never forwarded memory_format information:
- `upsample_bicubic2d` (maybe this should though? `UpSampleBicubic2d.cpp` isn't currently written to do anything different for `channels_last` tensors)
- All of the `upsample_{mode}1d` operators. Probably because, afaik, channels_last isn't supported for 3d tensors
- The corresponding backwards operator for every upsample op.

Note 3:
I'm also wondering why memory_format isn't just directly a part of the `tensor::options()` method, which would cause all ops to universally forward memory_format information from input to output tensors, rather than just the upsample ops. My guess is:
- BC-breakage. I'm not sure whether this would really *break* people, but it's an API change
- performance. `tensor::options()` is called everywhere, and adding a call to `suggest_memory_format()` would probably noticeably hit microbenchmarks. We could probably deal with that by making `memory_format` a precomputed field on the tensor?

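As a quick, hedged illustration of the fixed behavior (shape and mode are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)
out = F.interpolate(x, scale_factor=2, mode="nearest")
# With this fix the output is allocated with the input's suggested memory
# format instead of silently coming back contiguous.
assert out.is_contiguous(memory_format=torch.channels_last)
```
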
Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26891540

Pulled By: bdhirsh

fbshipit-source-id: b3845f4dd5646b88bf738b9e41fe829be6b0e5cf
2021-03-09 15:23:53 -08:00
a3465214ba move rnn cell size check to cpp (#51964)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32193.

Possible further improvements:
- do the same for quantized cells
- reuse newly written functions in 56034636b9/torch/csrc/api/src/nn/modules/rnn.cpp (L699-L715)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51964

Reviewed By: albanD

Differential Revision: D26757050

Pulled By: ngimel

fbshipit-source-id: 9c917d9124de2b914ad9915c79af675ae561295a
2021-03-09 15:02:20 -08:00
0606057af3 [PyTorch] Add c10::MaybeOwned and Tensor::expect_contiguous (#53317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53317

This seems like it might help in cases where we have to call
`Tensor::contiguous`, but we expect that the tensor in question will
be contiguous a good portion of the time.
ghstack-source-id: 123203771

Test Plan:
Profiled AdIndexer on inline_cvr; time spent in
clip_ranges_gather_sigrid_hash_each_feature<int> was cut in half from
1.37% to 0.66%

Reviewed By: smessmer

Differential Revision: D26738036

fbshipit-source-id: b5db10783ccd103dae0ab3e79338a83b5e507ebb
2021-03-09 14:51:23 -08:00
8acb74c405 [PyTorch] Make IValue::toTensor() inlineable (#53213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53213

The failure path for toTensor() is fairly long because it has to stringify tagKind() and construct a std::string. Forcibly outlining it should allow inlining the happy path.
ghstack-source-id: 123012703

Test Plan:
1) Compare perf profile on AdIndexer benchmark before/after --
toTensor frames no longer show up, demonstrating inlining
2) Compare perf stat results on AdIndexer benchmark before/after:

Before:
```
         17,104.66 msec task-clock                #    0.999 CPUs utilized            ( +-  0.26% )
             3,666      context-switches          #    0.214 K/sec                    ( +- 18.53% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  6.25% )
           102,745      page-faults               #    0.006 M/sec                    ( +-  0.47% )
    33,860,604,938      cycles                    #    1.980 GHz                      ( +-  0.25% )  (50.02%)
    69,514,752,652      instructions              #    2.05  insn per cycle           ( +-  0.06% )  (50.01%)
    11,280,877,966      branches                  #  659.521 M/sec                    ( +-  0.11% )  (50.01%)
        75,739,099      branch-misses             #    0.67% of all branches          ( +-  0.98% )  (50.03%)

           # Table of individual measurements:
           17.2467 (+0.1172) #
           17.0014 (-0.1280) #
           17.2134 (+0.0840) #
           17.0951 (-0.0343) #
           17.0905 (-0.0389) #

           # Final result:
           17.1294 +- 0.0447 seconds time elapsed  ( +-  0.26% )
```
After:
```
         16,910.66 msec task-clock                #    0.999 CPUs utilized            ( +-  0.27% )
             3,495      context-switches          #    0.207 K/sec                    ( +- 18.34% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  6.25% )
           101,769      page-faults               #    0.006 M/sec                    ( +-  0.45% )
    33,460,776,952      cycles                    #    1.979 GHz                      ( +-  0.28% )  (50.03%)
    69,243,346,925      instructions              #    2.07  insn per cycle           ( +-  0.17% )  (50.02%)
    11,229,930,860      branches                  #  664.074 M/sec                    ( +-  0.14% )  (50.03%)
        72,273,324      branch-misses             #    0.64% of all branches          ( +-  0.55% )  (50.03%)

           # Table of individual measurements:
           16.9530 (+0.0246) #
           17.0898 (+0.1614) #
           16.8493 (-0.0791) #
           16.8282 (-0.1002) #
           16.9217 (-0.0067) #

           # Final result:
           16.9284 +- 0.0464 seconds time elapsed  ( +-  0.27% )
```

1.1% cycles win, 0.38% instructions win, both apparently outside noise level

Reviewed By: smessmer

Differential Revision: D26793481

fbshipit-source-id: b035b3ad20f9e22ae738d91163641031b1130ce6
2021-03-09 14:49:44 -08:00
f8e7d8bb0d [FX][docs] Render inherited methods in fx.Tracer API reference (#53630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53630

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26918962

Pulled By: jamesr66a

fbshipit-source-id: 2c84e308889d4ba3176018c7bd44a841e715e6c8
2021-03-09 14:30:41 -08:00
a8ecf306da [Static Runtime] Remove dead code (#53588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53588

Remove `SRViewOperatorRegistry` and related code now that it's no longer needed.

Reviewed By: swolchok

Differential Revision: D26901367

fbshipit-source-id: fa73501cd785d4b89466cda81481aea892f8241f
2021-03-09 13:36:41 -08:00
a9e4bb56e5 Add more kernel launch checks (#53286)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53286

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26818164

fbshipit-source-id: 01ba50dc7e4a863e26c289d746bc5b95aa76d3cc
2021-03-09 13:18:54 -08:00
b8546bde09 ci: Remove special versioning privileges for cu102 (#53133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53133

In light of some issues where users were having trouble installing
CUDA-specific versions of PyTorch, we should no longer have special
privileges for CUDA 10.2.

Recently I added scripts/release/promote/prep_binary_for_pypi.sh (https://github.com/pytorch/pytorch/pull/53056) to make
it possible, in theory, to promote to PyPI any wheel we publish to
download.pytorch.org.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26759823

Pulled By: seemethere

fbshipit-source-id: 2d2b29e7fef0f48c23f3c853bdca6144b7c61f22
2021-03-09 13:16:56 -08:00
c0c5f80f36 Lazy Modules Documentation Clarifications (#53495)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53366

gchanan albanD
Thanks for the feedback. Did a first pass trying to address the concerns in the original issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53495

Reviewed By: mrshenli

Differential Revision: D26914768

Pulled By: albanD

fbshipit-source-id: fa049f1952ef05598f0da2abead9a5a5d3602f75
2021-03-09 13:09:33 -08:00
4fa11147c5 Automated submodule update: FBGEMM (#53632)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4b88f40a0e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53632

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D26919594

fbshipit-source-id: 4ac25bbe883b3c2cd4c02bc75a6e2c6f41d2beb7
2021-03-09 13:03:28 -08:00
efb1895f81 [caffe2] use snprintf() instead of sprintf() in the Checkpoint operator (#53434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53434

Use `snprintf()` to avoid buffer overflows.
Also only throw an exception on error, instead of crashing the entire
application.  A failure can occur if the caller supplies an invalid format
string.
ghstack-source-id: 123401582

Test Plan:
Ran the checkpoint tests:

  buck test caffe2/caffe2/python/operator_test:checkpoint_test

Verified that the checkpoint file names logged in the output are the same
before and after this change.

I also manually changed the initial buffer size to 1 to confirm that
the code works when the initial buffer size is too small.  I considered
updating the checkpoint_test.py code to test using long db names that would
exceed this limit, but I figured that long filenames was likely to cause
other problems on some platforms (Windows has a maximum path length of 260
characters up until pretty recent releases).

Differential Revision: D26863355

fbshipit-source-id: 8fc24faa2a8dd145471067718d323fdc8ce055d6
2021-03-09 12:54:15 -08:00
e87a686d21 .circleci: Remove hardcoded tag for rocm (#53636)
Summary:
We shouldn't need the hardcoding anymore

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53636

Reviewed By: malfet

Differential Revision: D26921067

Pulled By: seemethere

fbshipit-source-id: 1e3ba4bbef4c5c6c6a6bcc2f137fef017cec3bb7
2021-03-09 12:52:22 -08:00
bcd94e220d [PyTorch] Fix typo in QNNPACK
Summary: Build failed when `PYTORCH_QNNPACK_RUNTIME_QUANTIZATION` is unset. According to D21339044 (622f5b68f0) it seems like a typo.

Test Plan: buck build //xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:pytorch_qnnpackWindows xplat/mode/windows-msvc-15.9

Reviewed By: kimishpatel

Differential Revision: D26907439

fbshipit-source-id: ac52eeef4ee70726f2a97b22ae65921b39aa0c0b
2021-03-09 12:45:25 -08:00
a496520c1d Automated submodule update: tensorpipe (#53599)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: a11ddfdf99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53599

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26910634

fbshipit-source-id: a2bf808536e42b9208e5d9f88198ce64061385fa
2021-03-09 11:50:05 -08:00
b0afe945a7 Fix pylint error torch.tensor is not callable (#53424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53424

Fixes https://github.com/pytorch/pytorch/issues/24807 and supersedes the stale https://github.com/pytorch/pytorch/issues/25093 (Cc Microsheep). If you now run the reproduction

```python
import torch

if __name__ == "__main__":
    t = torch.tensor([1, 2, 3], dtype=torch.float64)
```

with `pylint==2.6.0`, you get the following output

```
test_pylint.py:1:0: C0114: Missing module docstring (missing-module-docstring)
test_pylint.py:4:8: E1101: Module 'torch' has no 'tensor' member; maybe 'Tensor'? (no-
member)
test_pylint.py:4:38: E1101: Module 'torch' has no 'float64' member (no-member)
```

Now `pylint` doesn't recognize `torch.tensor` at all, but it is promoted in the stub. Given that it also doesn't recognize `torch.float64`, I think fixing this is out of scope of this PR.

 ---

## TL;DR

This is BC-breaking only for users that rely on unintended behavior. Since `torch/__init__.py` loaded `torch/tensor.py`, the latter was populated in `sys.modules`. `torch/__init__.py` then overwrote `torch.tensor` with the actual function. As a result, `import torch.tensor as tensor` did not fail, but returned the function rather than the module. Users that rely on this import need to change it to `from torch import tensor`.

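Concretely, for affected users:

```python
import torch

# Worked only by accident of module loading order; no longer does:
#   import torch.tensor as tensor
# The supported spelling:
from torch import tensor

t = tensor([1, 2, 3], dtype=torch.float64)
```
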
Reviewed By: zou3519

Differential Revision: D26223815

Pulled By: bdhirsh

fbshipit-source-id: 125b9ff3d276e84a645cd7521e8d6160b1ca1c21
2021-03-09 11:32:53 -08:00
ef3765b992 Fix a cuda max_pool3d issue, do multiplication in int64 (#52828)
Summary:
Fix https://github.com/pytorch/pytorch/issues/52822

- [x] benchmark

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52828

Reviewed By: mrshenli

Differential Revision: D26866674

Pulled By: heitorschueroff

fbshipit-source-id: bd8276dd70316a767dc6e1991c1259f1f0b390b2
2021-03-09 10:54:43 -08:00
9f2aea7b88 [Pytorch] Fix embedding bag bug accessing unaligned memory (#53300)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53300

Float scale and bias are packed as per-row parameters at the end of each row.
This takes 8 bytes. However, if the number of elements in a row is such that
the end-of-row address is not aligned for float (not a multiple of 4 bytes),
we will get unaligned memory access.

The current solution is inefficient, so this should really be fixed at weight
packing time. It seems that, longer term, there will be a prepack function
that packs weights, so this fallback path should eventually match that and
not store scale and bias inline.

Test Plan: python test/test_quantization.py

Reviewed By: pengtxiafb

Differential Revision: D26828077

fbshipit-source-id: 8512cd95f3ac3ca53e1048139a9f6e19aa8af298
2021-03-09 09:48:04 -08:00
7e6a84d238 Add logic to auto-fetch submodules (#53461)
Summary:
In setup.py add logic to:
 - Get list of submodules from .gitmodules file
 - Auto-fetch submodules if none of them has been fetched (see the sketch below)

In CI:
 - Test this on non-docker capable OSes (Windows and Mac)
 - Use shallow submodule checkouts whenever possible

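A minimal sketch of the auto-fetch logic (helper names here are hypothetical, not the exact ones used in setup.py):

```python
import os
import subprocess

def get_submodule_folders(repo_root):
    # Hypothetical helper: read the submodule paths out of .gitmodules.
    with open(os.path.join(repo_root, ".gitmodules")) as f:
        return [
            os.path.join(repo_root, line.partition("=")[2].strip())
            for line in f
            if line.strip().startswith("path")
        ]

def checkout_submodules_if_missing(repo_root):
    folders = get_submodule_folders(repo_root)
    # If no submodule directory has any content, none were fetched yet.
    if all(not os.path.isdir(p) or not os.listdir(p) for p in folders):
        subprocess.check_call(
            ["git", "submodule", "update", "--init", "--recursive"],
            cwd=repo_root,
        )
```
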
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53461

Reviewed By: ezyang

Differential Revision: D26871119

Pulled By: malfet

fbshipit-source-id: 8b23d6a4fcf04446eac11446e0113819476ef6ea
2021-03-09 09:13:35 -08:00
2f91cda37e Modify error message (#53525)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53525

Reviewed By: mthrok

Differential Revision: D26900045

Pulled By: carolineechen

fbshipit-source-id: 387301381603d37d24cc829c8fed38123f268c0b
2021-03-09 09:12:00 -08:00
02c0c7a32b Add Meta support for empty_strided (#53397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53397

It turns out once you remove all the indirection from the
empty_cpu_strided implementation, this implementation is pretty
simple.  We should see if we can simplify empty_cpu this way too.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26891870

Pulled By: ezyang

fbshipit-source-id: 9bddd332d32d8bf32fa3175e3bb0ac3a8954ac91
2021-03-09 09:06:30 -08:00
707fc354eb Add debug only layout assert for empty_cpu (#53396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53396

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26891806

Pulled By: ezyang

fbshipit-source-id: 4789ab5587d1a11d50e9a60bbfa1c21c1222823e
2021-03-09 09:06:25 -08:00
28d6e01511 Add TORCH_CHECK_NOT_IMPLEMENTED/c10::NotImplementedError; make dispatch use it (#53377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53377

My underlying goal is to make the test suite ignore NotImplementedError
instead of failing when bringing up a backend (meta) that doesn't have
very many functions implemented.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D26850766

Pulled By: ezyang

fbshipit-source-id: ffbdecd22b06b5ac23e1997723a6e2a71dfcd14a
2021-03-09 09:04:22 -08:00
2d36b30a8c Expands OpInfo out= testing (#53259)
Summary:
Addresses several of the challenges described in https://github.com/pytorch/pytorch/issues/49468.

This PR builds on https://github.com/pytorch/pytorch/pull/50741 and https://github.com/pytorch/pytorch/issues/53105 to extend OpInfo out= testing. It covers the following cases for ops that produce a single tensor:

- out= values don't affect computation
- out= noncontiguous produces the correct output and preserves strides
- out= with the wrong shape throws a warning
- out= with an empty tensor throws no warning
- out= with the wrong device throws an error
- out= with a dtype the computation's result can't be "safely" cast to throws an error

It works with operations that produce a single tensor and operations that produce an iterable of tensors (the latter is tested with operations like torch.svd).

In addition to the new out= test, the OpInfos have been updated. "supports_tensor_out" is replaced with the more general and straightforward "supports_out" metadata, and many operations which previously had to skip out= testing with an explicit SkipInfo no longer need to. A couple redundant tests in test_unary_ufuncs.py have been removed, too.

One other perk of these tests is that once all operations have OpInfos this will allow us to validate that we've universally deprecated incorrectly sized tensors passed to out=, and give us the option to actually disable the behavior.

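Two of the covered cases, as a hedged sketch:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])

# out= with the wrong shape: the out tensor is resized and a UserWarning
# about the deprecated behavior is raised.
wrong_shape = torch.empty(2)
torch.sin(a, out=wrong_shape)

# out= with a dtype the result can't be safely cast to: RuntimeError.
wrong_dtype = torch.empty(3, dtype=torch.long)
try:
    torch.sin(a, out=wrong_dtype)
except RuntimeError:
    pass
```
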
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53259

Reviewed By: mrshenli

Differential Revision: D26894723

Pulled By: mruberry

fbshipit-source-id: 2b536e9baf126f36386a35f2f806dd88c58690b3
2021-03-09 08:19:26 -08:00
9df1b98bab Quality of life improvements to Timer (#53294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53294

Just a bunch of little things, none of which are big enough to need a full PR.

1) C++ wall time should release the GIL
2) Add option to retain `callgrind.out` contents. This will allow processing with kCachegrind for more detailed analysis.
3) Stop subtracting the baseline instruction counts. (People just found it confusing when they saw negative instruction counts.) There is a finesse in #53295 that drops the baseline to ~800 instructions for `number=100`, and at that level it's not worth correcting.
4) Add a `__mul__` overload to function counts. e.g. suppose `c0` was run with `number=100`, and `c1` was run with `number=200`; then `c0 * 2 - c1` is needed to properly diff them (see the sketch after this list). (Obviously there are correctness concerns, but I think it's fine as a caveat emptor convenience method.)
5) Tweak the `callgrind_annotate` call, since by default it filters very small counts.
6) Move some args to kwargs only since types could be ambiguous otherwise.
7) Don't omit rows from slices. It was annoying to print something like `stats[:25]` and have `__repr__` hide the lines in the middle.

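A hedged sketch of item (4); `collect_callgrind` requires valgrind to be available:

```python
from torch.utils.benchmark import Timer

timer = Timer("x + 1", setup="import torch; x = torch.ones((8, 8))")
c0 = timer.collect_callgrind(number=100).stats()
c1 = timer.collect_callgrind(number=200).stats()
# Scale the number=100 counts so both sides represent the same amount of
# work before diffing.
delta = c0 * 2 - c1
```
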
Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26906715

Pulled By: robieta

fbshipit-source-id: 53d5cd92cd17212ec013f89d48ac8678ba6e6228
2021-03-09 08:15:30 -08:00
f4b344ad5c Definition infrastructure for instruction count ubenchmarks (#53293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53293

Instruction count benchmarks need some includes for IValues, but this is also just generally useful. (Unlike Python where you can just drop imports anywhere, C++ will get very upset if you `#include` in a function body...)

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26906684

Pulled By: robieta

fbshipit-source-id: cbdfd79d3b8383100ff2e6857b6f309c387cbe2a
2021-03-09 08:13:38 -08:00
0a97712326 [caffe2] don't use static for template declarations in headers (#53602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53602

Using static in headers causes code bloat. Remove the unnecessary `static` qualifiers.

Test Plan: sandcastle

Reviewed By: asp2insp

Differential Revision: D26886180

fbshipit-source-id: 6008bce0d47f06d3146ce998234574a607c99311
2021-03-09 07:33:45 -08:00
34d9278c19 Remove notion of "level" from Module::dump_to_str. (#52539)
Summary:
The code uses `torch::jit::jit_log_prefix` for handling recursive
indenting in most places in this function. There was one place that was
using "level", but it was buggy -- it would result in a compounding
superlinear indent. Note that changing it to "level+1" doesn't fix the
bug.

Before/after:
https://gist.github.com/silvasean/8ee3ef115a48de6c9c54fbc40838d8d7

The new code establishes a recursive invariant for
`Module::dump_to_str`: the function returns the module printed at the
base indent level (i.e. no indent). `torch::jit::jit_log_prefix` is used
to prefix recursive calls. The code was already nearly there, except for
this spurious use of "level".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52539

Reviewed By: navahgar

Differential Revision: D26773657

Pulled By: gmagogsfm

fbshipit-source-id: ab476f0738bf07de9f40d168dd038dbf62a9a79e
2021-03-09 05:45:57 -08:00
bf88a4dad5 Support parsing Ellipsis in JIT frontend (#53576)
Summary:
De-sugars `Ellipsis` into dots (`...`)

Fixes https://github.com/pytorch/pytorch/issues/53517

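A minimal sketch of what now scripts:

```python
import torch

@torch.jit.script
def first_channel(x: torch.Tensor) -> torch.Tensor:
    # `Ellipsis` is parsed the same as the literal `...`.
    return x[Ellipsis, 0]
```
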
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53576

Reviewed By: pbelevich

Differential Revision: D26904361

Pulled By: gmagogsfm

fbshipit-source-id: 5b23e049a075a9a99e37dcb47a9410b6f82a6fb7
2021-03-09 00:06:21 -08:00
c2ccb3578e Fix inport -> import typo in documentation (#53589)
Summary:
Fixes a small documentation typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53589

Reviewed By: ngimel

Differential Revision: D26907045

Pulled By: Chillee

fbshipit-source-id: 15c35bec8d75dd897fe8886d0e0e1b889df65b24
2021-03-08 23:56:42 -08:00
cb36e503d8 [iOS GPU][BE][5/n] Remove indirection calls from MPSCNNOps.mm and MetalAten.mm (#53432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53432

1. Create individual .mm files for each op under the ops/ folder; each op has its own function, registered at the end of the file.
2. Remove the indirection calls from MetalAten.mm to MPSCNNOps.mm.
3. Delete MPSCNNOps.mm.
ghstack-source-id: 123205443

Test Plan:
1. Sandcastle
2. CircleCI
3. Mobilelab

Reviewed By: SS-JIA

Differential Revision: D26840953

fbshipit-source-id: e1664c8d7445fdbd3b016c4dd51de0a6294af3a5
2021-03-08 22:44:04 -08:00
aa687bb6f4 [iOS GPU][BE][4/n] - Convert Objective-C class methods to C functions (#53431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53431

Objective-C’s dynamism comes at the cost of code size, perf, and safety. At Facebook, we tend not to use Objective-C primitives, or keep their use to a minimum, unless they are truly needed.
ghstack-source-id: 123063340

Test Plan:
1. CircleCI
2. SandCastleCI
3. Mobilelab

Reviewed By: SS-JIA

Differential Revision: D26800753

fbshipit-source-id: b5a752a700d72ca3654f6826537aa3af47e87ecd
2021-03-08 22:42:25 -08:00
2dffb4e38e [Static Runtime] Back out D26659824 (#53570)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53570

Reviewed By: allwu

Differential Revision: D26899099

fbshipit-source-id: 87c6d74a91c102e6b0487f9e6f49394755792a94
2021-03-08 22:14:15 -08:00
dc29604fd1 [iOS GPU][BE][3/n] - Rename MetalTensor to MetalTensorImplStorage (#53430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53430

The definition of Metal tensor is confusing, as we're using it to initialize the MetalTensorImpl. It acts more like a TensorImplStorage.
ghstack-source-id: 123038073

Test Plan:
1. Sandcastle CI
2. Circle CI
3. AIBench/Mobilelab

Reviewed By: SS-JIA

Differential Revision: D26685439

fbshipit-source-id: e0487d0884e4efc3044d627ed0e4af454eca9d67
2021-03-08 21:41:35 -08:00
d521fd799d [FX Acc] Add support for multi partitions in fx-glow (#53280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53280

Add support for handling multiple partitions in the fx_glow e2e flow.

Test Plan: `buck test glow/fb/fx/fx_glow:test_fx_glow`

Reviewed By: gcatron

Differential Revision: D26819886

fbshipit-source-id: b31aa4612aab3aee694bb155571ba6f5e75c27ba
2021-03-08 20:16:39 -08:00
5b52ff6c8e [fx] Add DCE pass (#52658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52658

DCE iterates over the graph in reverse, looking for nodes without users, and deletes them. It skips unused placeholders (since removing them would change the signature of the method) and outputs (which never have users but we want to keep them :) )

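A hedged usage sketch (assuming the pass is exposed as `Graph.eliminate_dead_code()`):

```python
import torch
import torch.fx

def fn(x):
    y = x + 1  # dead: `y` has no users
    return x * 2

gm = torch.fx.symbolic_trace(fn)
gm.graph.eliminate_dead_code()  # reverse-iterates, dropping userless nodes
gm.recompile()
print(gm.code)  # the `x + 1` node is gone
```
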
Test Plan: Added unit tests

Reviewed By: jamesr66a, khabinov, chenccfb

Differential Revision: D26602212

fbshipit-source-id: f4f196973e40546076636090bb0008c24f33795e
2021-03-08 19:54:56 -08:00
17d00319bc Install GCC-9 into ROCm builders (#53459)
Summary:
Should prevent intermittent internal compiler errors reported in https://bugs.launchpad.net/ubuntu/+source/gcc-7/+bug/1917830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53459

Reviewed By: izdeby

Differential Revision: D26870602

Pulled By: malfet

fbshipit-source-id: 1e90bb0d33736d01a696f80fc981aedcf7e3b639
2021-03-08 19:14:41 -08:00
97460d3545 [static runtime] Minimum fusion group size (#50217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50217

Fusing very small groups makes things slower than not fusing at all, so require a minimum fusion group size before fusing.

Test Plan: buck test //caffe2/test:static_runtime

Reviewed By: bertmaher

Differential Revision: D25643460

fbshipit-source-id: d2f39a4d612df3e1e29362abb23c2d997202f6ea
2021-03-08 19:06:16 -08:00
a947bfaa26 [Pytorch] Remove assumption forward exists in freeze_module (#52918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52918

`freeze_module` seems to operate under the assumption that `forward` always exists. This isn't true, so the change first checks for its existence and then retrieves the function.
ghstack-source-id: 123215242

Test Plan: Try freezing something with and without forward.

Reviewed By: dhruvbird

Differential Revision: D26671815

fbshipit-source-id: d4140dad3c59d3d20012143175f9b9268bf23050
2021-03-08 18:29:44 -08:00
53c77e7d5d Add mock.patch() to clear environment for test (#53537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53537

fixes #53526

This fixes the issue where one of the environment variables being tested is somehow set by a previous test. For example:
`WORLD_SIZE=1 python test/distributed/test_c10d.py RendezvousEnvTest.test_common_errors` would have previously failed but now passes

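One way to get that isolation (a sketch; the test itself may apply it as a decorator):

```python
import os
from unittest import mock

# Clear inherited environment variables (e.g. WORLD_SIZE) for the duration
# of the block so earlier tests or the shell can't leak state in.
with mock.patch.dict(os.environ, {}, clear=True):
    assert "WORLD_SIZE" not in os.environ
```
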
Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26891207

Pulled By: H-Huang

fbshipit-source-id: 1c23f6fba60ca01085a634afbafbb31ad693d3ce
2021-03-08 17:15:47 -08:00
b0984f7925 [pytorch] use correct warning type for tracer warnings (#53460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53460

We have code to ignore this category of warnings and found that this one used the incorrect warning type.

Use `stacklevel=2`, otherwise the warning is always filtered by TracerWarning.ignore_lib_warnings()

Test Plan: sandcastle

Reviewed By: wanchaol

Differential Revision: D26867290

fbshipit-source-id: cda1bc74a28d5965d52387d5ea2c4dcd1a2b1e86
2021-03-08 17:02:41 -08:00
0d04e51233 [caffe2] Add an optimization to avoid extra fp32->fp16 conversions in Onnxifi (#53560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53560

If an op like Fused8BitRowwiseQuantizedToFloat ends up on CPU while Tile ends up on an accelerator that only supports FP16, then we want to make sure the conversion from FP32 to FP16 is done on CPU to save cycles on the accelerator.

Reviewed By: ChunliF

Differential Revision: D26862322

fbshipit-source-id: a7af162f2537ee9e4a78e6ef3f587129de410b07
2021-03-08 16:36:12 -08:00
d0b32156f0 move test to CUDA only (#53561)
Summary:
Helps make master green by removing this hefty memory allocation from the CPU test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53561

Reviewed By: malfet, albanD

Differential Revision: D26897941

Pulled By: janeyx99

fbshipit-source-id: 9f6c2d55f4eea1ab48665f7819fc113f21991036
2021-03-08 16:32:14 -08:00
a0d425d38d Automated submodule update: FBGEMM (#53509)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: da1e687ee3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53509

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D26885426

fbshipit-source-id: 80a3d0680fa584744380bb993ee3a2dc13991847
2021-03-08 16:26:07 -08:00
7c0a4e78ca [static runtime] convert to->to_copy (#53524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53524

Add to->to_copy in the ReplaceWithCopy pass so that it plays well with
AliasDb.

Test Plan:
Run bench with CastedBatchOneHot fusion off
(https://www.internalfb.com/intern/diff/view-version/123230476/),
on adindexer and adfinder models

Reviewed By: hlu1

Differential Revision: D26887050

fbshipit-source-id: 3f2fb9e27783bcdeb91c8b4181575f059317aff1
2021-03-08 16:19:03 -08:00
1e992810b5 Revert D26811466: [pytorch][PR] [reland] Add OpInfo for bitwise_not and make ROCM and CUDA OpInfo tests consistent
Test Plan: revert-hammer

Differential Revision:
D26811466 (a5ada2127d)

Original commit changeset: 8434a7515d83

fbshipit-source-id: 9c2c760e18154a88cf7531e45843a802e3f3d19c
2021-03-08 15:47:47 -08:00
067ad31210 [NNC] Added some more external function bindings (#53420)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53420

Reviewed By: navahgar

Differential Revision: D26876784

Pulled By: Chillee

fbshipit-source-id: 05e7c782a72de5159879f88a104f1a273e0345eb
2021-03-08 14:18:30 -08:00
c72473fe2c Adding print_test_stats.py job to Windows CI (#53387)
Summary:
This way, we can get S3 test time stats for Windows tests as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53387

Reviewed By: samestep

Differential Revision: D26893613

Pulled By: janeyx99

fbshipit-source-id: ac59e4406e472c9004eea0aae8a87a23242e3b34
2021-03-08 13:56:46 -08:00
48ec939d39 [iOS GPU][BE][2/n] - Use dispatcher in MPSCNNTests.mm (#53429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53429

Call the testing ops through the dispatcher instead of calling them through `at::native`. Some Metal ops can't be called through the dispatcher yet. For example, `at::t` will call `at::as_strided`, which hasn't been implemented on Metal yet. For those ops, we'll skip the dispatcher and call `mpscnn::` directly. We'll convert those ops once we have implemented the missing ones.
ghstack-source-id: 123038068

Test Plan:
- Sandcastle CI
- Circle CI
- AIBench/Mobilelab

Reviewed By: SS-JIA, AshkanAliabadi

Differential Revision: D26683366

fbshipit-source-id: bf130b191046f5d9ac9b544d512bc6cb94f08c09
2021-03-08 13:50:42 -08:00
e90e773445 Fix to empty_like example (#53088)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53088

Reviewed By: zou3519

Differential Revision: D26752772

Pulled By: iramazanli

fbshipit-source-id: 21e395c6bbfd8f2cc808ddc12aefb2a426bb50d0
2021-03-08 13:19:47 -08:00
64255294ba [PyTorch][CI] Enable building test_lite_interpreter_runtime unittest in CI (macos) (#52566)
Summary:
## Summary

1. Enable building libtorch (lite) in CI (macos)
2. Run `test_lite_interpreter_runtime` unittest in CI (macos)

![image](https://user-images.githubusercontent.com/16430979/110189039-b2b8ed00-7dd2-11eb-8fa1-be2d9e23792a.png)


![image](https://user-images.githubusercontent.com/16430979/110189119-e3008b80-7dd2-11eb-9e80-7c2ae6862468.png)


Pull Request resolved: https://github.com/pytorch/pytorch/pull/52566

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26601585

Pulled By: cccclai

fbshipit-source-id: da7f47c906317ab3a4ef38fe2dbf2e89bc5bdb24
2021-03-08 13:09:25 -08:00
7b7775bec2 feature_segmented_histogram_binning_calibration
Summary: We implement a hierarchical, fine-grained binning structure, with the top level corresponding to different feature segments and the bottom level corresponding to different ranges of ECTR. The model is designed to be general enough to perform segmented calibration on any useful feature.

Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration_by_feature

buck test dper3/dper3_models/ads_ranking/model_impl/mtml/tests:mtml_lib_test -- test_multi_label_dependent_task_with_histogram_binning_calibration_by_feature

e2e test:
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration_by_feature

buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_mtml_with_dependent_task_histogram_binning_calibration_by_feature

All tests passed

Canary packages:
Backend -> aml.dper2.canary:e0cd05ac9b9e4797a94e930426d76d18
Frontend -> ads_dper3.canary:55819413dd0f4aa1a47362e7869f6b1f

Test FBL jobs:
**SparseNN**
ctr mbl feed
f255676727

inline cvr
f255677216

**MTML regular task**
offsite cvr
f255676719

**MTML dependent task**
mobile cvr
f255677551

**DSNN for AI models**
ai oc
f255730905

**MIMO for both AI DSNN part and AF SNN part**
mimo ig
f255683062

Reviewed By: zhongyx12

Differential Revision: D25043060

fbshipit-source-id: 8237cad41db66a09412beb301bc45231e1444d6b
2021-03-08 12:47:19 -08:00
b2758cdc77 [PyTorch] Don't copy vector arguments to caffe2::Tensor::Resize (#53389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53389

Resize was written to take arguments by value, which was
totally fine if they were ArrayRef or a series of integers, but not so
fine if they're std::vector.
ghstack-source-id: 123212128

Test Plan:
Existing CI should make sure it builds

Inspected assembly for ios_caffe.cc and saw no more vector copy before
calling Resize

Reviewed By: smessmer

Differential Revision: D26852105

fbshipit-source-id: 9c3b9549d50d32923b532bbc60d0246e2c2b5fc7
2021-03-08 12:33:33 -08:00
b64acfa9ac [PyTorch] Move non-template part of TensorImpl::Resize to cpp (#53388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53388

Most of this method did not depend on the template parameter. No need to include it in the .h file or duplicate it in the generated code.
ghstack-source-id: 123211590

Test Plan: Existing CI should cover this

Reviewed By: smessmer

Differential Revision: D26851985

fbshipit-source-id: 115e00fa3fde547c4c0009f2679d4b1e9bdda5df
2021-03-08 12:33:29 -08:00
98943bb863 [PyTorch] Enable explicit ATen level sources for lite interpreter (#52769)
Summary:
Enable a partial explicit ATen-level source list for the lite interpreter. More ATen-level sources will be added.

x86:
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`

libpytorch_jni_lite.so -- 3.8 MB

armeabi-v7a
`SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/experiemnt/deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh armeabi-v7a`
libpytorch_jni_lite.so -- 2.8 MB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52769

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26717268

Pulled By: cccclai

fbshipit-source-id: 208300f198071bd6751f76ff4bc24c7c9312d337
2021-03-08 12:31:50 -08:00
25a9f45a5a fix broken quantization_test in operator_benchmark (#53153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53153

This diff fixes quantization_test in operator_benchmark, which broke when the py_module for learnable fake_quantization was removed.
ghstack-source-id: 123103477

Test Plan: `buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test`

Reviewed By: z-a-f

Differential Revision: D26764881

fbshipit-source-id: 8d40c6eb5e7090ca65f48982c837f7dc87d14378
2021-03-08 12:12:57 -08:00
1fc8831322 Add missing tensor header (#53489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53489

It appears that D26675801 (1fe6a6507e) broke Glow builds (and probably other installs) by including the python_arg_parser header. That dep lives in a directory of its own and was not included in setup.py.

Test Plan: OSS tests should catch this.

Reviewed By: ngimel

Differential Revision: D26878180

fbshipit-source-id: 70981340226a9681bb9d5420db56abba75e7f0a5
2021-03-08 12:05:17 -08:00
117a49c4cb .circleci: Restore docker builds for scheduled workflows (#53412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53412

Docker builds for scheduled workflows still need to happen within the
regular build workflow, since new Docker image builds are actually only
done within the `build` workflow.

A follow up to #52693

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D26890300

Pulled By: seemethere

fbshipit-source-id: d649bfca5186a89bb5213865f1f5738b809d4d38
2021-03-08 12:03:33 -08:00
1588df6b99 Fix typo in tools/test_history.py (#53514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53514

Test Plan:
```
tools/test_history.py columns --ref=0ca029b22d17d236d34bcecad44b94b35b1af4bb test_common_errors pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1
```

Reviewed By: janeyx99

Differential Revision: D26886385

Pulled By: samestep

fbshipit-source-id: d3d79282e535707616d992ab8cf6216dfb777639
2021-03-08 11:42:30 -08:00
36dc5d3b3a [iOS GPU][BE][1/n] - Remove unused headers + improve error message (#53428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53428

Start to do some code clean up work.
ghstack-source-id: 123038070

Test Plan:
- CircleCI
- Sandcastle CI
- AIBench

Reviewed By: SS-JIA, AshkanAliabadi

Differential Revision: D26681115

fbshipit-source-id: b1b7cfc6543b73928f517cd52e94a2664ee0bd21
2021-03-08 11:36:10 -08:00
1e306b9a71 Disable failing distributed test (#53527)
Summary:
See https://github.com/pytorch/pytorch/issues/53526. We're disabling the test temporarily until we can figure out what's going on (since it's unclear what needs to be reverted).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53527

Reviewed By: zhangguanheng66

Differential Revision: D26888037

Pulled By: samestep

fbshipit-source-id: f21a2d665c13181ed3c8815e352770b2f26cdb84
2021-03-08 11:29:05 -08:00
2b359dd6dc Automated submodule update: tensorpipe (#53504)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 46949a8ca3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53504

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26883701

fbshipit-source-id: 9e132a1389ac9cee9507c5600668af1afbb26efd
2021-03-08 11:02:52 -08:00
115df4fa28 Fix set_device_map docs (#53508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53508

closes #53501

Differential Revision: D26885263

Test Plan: Imported from OSS

Reviewed By: H-Huang

Pulled By: mrshenli

fbshipit-source-id: dd0493e6f179d93b518af8f082399cacb1c7cba6
2021-03-08 10:56:46 -08:00
93f1b10f72 Add missing attr in LazyModuleMixin doc (#53363)
Summary:
To fix some rendering issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53363

Reviewed By: izdeby

Differential Revision: D26884560

Pulled By: albanD

fbshipit-source-id: fedc9c9972a6c68f311c6aafcbb33a3a881bbcd2
2021-03-08 10:50:56 -08:00
656930df26 [FX] Fix default to align with documentation in fuser.py (#53457)
Summary:
Currently it says it does a deepcopy by default, but that's not true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53457

Reviewed By: navahgar

Differential Revision: D26876781

Pulled By: Chillee

fbshipit-source-id: 26bcf76a0c7052d3577f217e79545480c9118a4e
2021-03-08 10:06:40 -08:00
c07a62b854 [FX] change dynamic control flow example to a *more* dynamic version (#53250)
Summary:
This is a more fundamental example, as we may support some amount of shape specialization in the future.

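A hedged sketch of the kind of value-dependent control flow in question (exception class per current torch.fx):

```python
import torch
import torch.fx

def f(x):
    # Branches on a tensor *value*, not just its shape, so symbolic
    # tracing cannot capture it.
    if x.sum() > 0:
        return x
    return -x

try:
    torch.fx.symbolic_trace(f)
except torch.fx.proxy.TraceError:
    pass
```
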
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53250

Reviewed By: navahgar

Differential Revision: D26841272

Pulled By: Chillee

fbshipit-source-id: 027c719afafc03828a657e40859cbfbf135e05c9
2021-03-08 10:00:19 -08:00
0ca029b22d [caffe2] Fix DBFileReader (#53498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53498

This code depended on `Blobs()` being returned in sorted order:

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/caffe2/python/db_file_reader.py?commit=472774e7f507e124392491800d9654e01269cbaf&lines=89-91

But D26504408 (69bb0e0285) changed the underlying storage to a hashmap, so now the blobs are returned in arbitrary order. (Note that `Blobs()` also returns non-local blobs, and for those there was already no guarantee of ordering.)

So we need to explicitly sort the result.

Test Plan:
```
$ buck test dper3/dper3/toolkit/tests:lime_test
$ buck test //dper3/dper3/toolkit/tests:model_insight_test
```
Pass after this diff.

Differential Revision: D26879502

fbshipit-source-id: d76113f8780544af1d97ec0a818fb21cc767f2bf
2021-03-08 08:34:39 -08:00
d54be1a946 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26879724

fbshipit-source-id: 0e2dd4c5f7ba96e97e7cbc078184aed2a034ad2c
2021-03-08 03:49:09 -08:00
a5ada2127d [reland] Add OpInfo for bitwise_not and make ROCM and CUDA OpInfo tests consistent (#53181)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

This PR also enables the OpInfo tests on ROCM to check the same dtypes that of CUDA.

Note: Reland https://github.com/pytorch/pytorch/issues/51944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53181

Reviewed By: zhangguanheng66

Differential Revision: D26811466

Pulled By: mruberry

fbshipit-source-id: 8434a7515d83ed859db1b2f916fad81a9deaeb9b
2021-03-08 03:39:01 -08:00
54a2498919 Modify tests to use assertWarnsOnceRegex instead of maybeWarnsRegex (#52387)
Summary:
Related to https://github.com/pytorch/pytorch/issues/50006

Follow on for https://github.com/pytorch/pytorch/issues/48560 to ensure TORCH_WARN_ONCE warnings are caught. Most of this is straight-forward find-and-replace, but I did find one place where the TORCH_WARN_ONCE warning was not wrapped into a python warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52387

Reviewed By: albanD

Differential Revision: D26773387

Pulled By: mruberry

fbshipit-source-id: 5be7efbc8ab4a32ec8437c9c45f3b6c3c328f5dd
2021-03-08 03:32:14 -08:00
d3cde6c23c [NNC] Implementation for aten::cat without conditionals. (#53128)
Summary:
This PR adds an implementation for `aten::cat` in NNC without any conditionals. This version is not enabled by default.

Here is the performance of some micro benchmarks with and without conditionals. There is up to 50% improvement in performance without conditionals for some of the shapes.

aten::cat implementation in NNC **with** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s
```
aten::cat implementation in NNC **without** conditionals
```
$ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat
pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s
pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s
pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s
pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s
pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128

Reviewed By: bertmaher

Differential Revision: D26758613

Pulled By: navahgar

fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a
2021-03-07 22:57:02 -08:00
c7b1979b6b Use Store collect and verify names in all RPC agents (#53209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209

closes #40048

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791524

Pulled By: mrshenli

fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b
2021-03-07 16:51:46 -08:00
affdcce833 Extract TensorPipeAgent's collectNames to be a standalone utility function (#53202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53202

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791525

Pulled By: mrshenli

fbshipit-source-id: 8234c4d0350a5cd61926dce4ecc9e918960d30d2
2021-03-07 16:48:46 -08:00
e08aae2613 Automated submodule update: FBGEMM (#53478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53478

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: c3612e67ee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53087

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D26744074

Pulled By: jspark1105

fbshipit-source-id: c16de118a5befb9dae9e256a5796993cdcfb714b
2021-03-07 12:39:10 -08:00
b26c0bb2b9 [PyTorch Mobile] Allow skipping operator exists check when bytecode model is loaded (#52814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52814

Currently, there is no way to load a model on a devvm (CPU) if that model has operators that the runtime doesn't support. This ends up happening (currently) for Metal GPU models, and potentially in the future for other backends that have backend-specific operators that don't have a registered implementation (even a dummy one) on CPU.

There are at least a couple of reasons why this is needed:

1. We want to extract the operator list directly from the bytecode (instead of looking it up from `mobile_info.json`).
2. We want to be able to trace the quantized operators that are invoked when loading the compressed weights for a model that has prepacked weights. xta0 root-caused this after husthyc discovered that there are untraced operators showing up when loading a Metal GPU model.

If we want to scale out to support different types of models, we absolutely need the ability to load a model on a devvm irrespective of what backend (device/etc...) it is targeted at.

ghstack-source-id: 123284366

Test Plan: The next diff in this stack is using the newly introduced methods.

Reviewed By: iseeyuan

Differential Revision: D26656266

fbshipit-source-id: eed9af2f7b55979e9c18b986b8c3b9a767153297
2021-03-07 02:56:12 -08:00
3236efa4de [Static Runtime] Call native resize_/resize_as_ as much as possible (#53425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53425

t.resize_ goes through the dispatcher. Replace with direct native calls
- t.resize_/resize_as_ -> at::native::resize_/resize_as_
- t.resize_({0}) -> fastResizeToZero(t)

Reviewed By: ajyu, edvgha

Differential Revision: D26836278

fbshipit-source-id: d1a95240099a35f5ece0de2eea50620ba8054ee5
2021-03-06 21:12:23 -08:00
dbbe0a2105 [DataLoader] Introduce deterministic context (#53271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53271

- [x] Add `set_determinism` context manager
- [x] Add `non_deterministic` decorator for `DataPipe`
  - Raise error at the construction time for non-deterministic DataPipe when `determinism` is set to `True`
 - [ ] Support `non_deterministic` with option
   - When `GreedyJoin` only contains one datapipe, it should still be deterministic.

Note: Test is in the [PR](https://github.com/facebookexternal/torchdata/pull/15). As the main repo doesn't have non-deterministic DataPipe yet.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26823023

Pulled By: ejguan

fbshipit-source-id: 51bb92fc3d18d1fc9536c1229363c536ad120876
2021-03-06 07:37:39 -08:00
1ba80264f4 [DataLoader] ConcatDataPipe (#53301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53301

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26829322

Pulled By: ejguan

fbshipit-source-id: eeea42fd9ab267d10f39ad7debc279eaded23570
2021-03-06 07:32:02 -08:00
1fe6a6507e [WIP][FX] Fix tracing support for torchbind (#52884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52884

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26675801

Pulled By: jamesr66a

fbshipit-source-id: 8e5100bcea17589a53163abf6ab991658e11fa3a
2021-03-05 23:40:16 -08:00
a0d1e701db Replace internal::GRAIN_SIZE by grain_size (parameter). (#53177)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53177

Reviewed By: SplitInfinity, nikithamalgifb

Differential Revision: D26860248

Pulled By: ngimel

fbshipit-source-id: 56917f8421f7b45c461945fd3d1ff107ce8535b2
2021-03-05 22:13:01 -08:00
369601355f [caffe2] Use extended versions of cuDNN calls for SpatialBN
Summary: Using `cudnnBatchNormalizationForwardTrainingEx` and `cudnnBatchNormalizationBackwardEx` if cuDNN version is greater than 8.0.0.

Reviewed By: xw285cornell

Differential Revision: D26794173

fbshipit-source-id: dc4994375350f303a3fa0aee03255e8f8be1c605
2021-03-05 18:18:15 -08:00
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
cab2689eb1 Revert D26849826: [pytorch][PR] Call nvidia-smi.exe before running tests Windows
Test Plan: revert-hammer

Differential Revision:
D26849826 (efebc6524d)

Original commit changeset: 14f0d9dfe41a

fbshipit-source-id: 5069f25a6bb1301df50a970817729e5241c30c81
2021-03-05 15:19:57 -08:00
1974969842 Cleanup async execution for python RPC calls. (#53230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53230

As part of https://github.com/pytorch/pytorch/issues/39351, cleaning
up the python call async execution.
ghstack-source-id: 123120119

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26800758

fbshipit-source-id: 50fe94c684bf53b907762e8bf196a6f6b97e4cf0
2021-03-05 15:05:45 -08:00
7bfa9dc7de Simplify async execution for script remote calls. (#53207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53207

Simplifying some of the async execution logic in request_callback_impl
as part of https://github.com/pytorch/pytorch/issues/39351.
ghstack-source-id: 123004020

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D26791325

fbshipit-source-id: 790ad413dad410dbcd07787583674cb5af1d1c92
2021-03-05 15:04:07 -08:00
6cbbef2fea Modify assert order to correct the error message when nan appears in multinomial on cuda (#53288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53288

Modify assert order to correct the error message when nan appears in multinomial on cuda

Test Plan: unittest

Reviewed By: ngimel

Differential Revision: D26824353

fbshipit-source-id: af6195e7c36fd51b3fc90df558ad6fac41288142
2021-03-05 14:58:39 -08:00
f595ba1bae [caffe2] move the SaveOp implementation from a header to a .cc file (#53298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53298

This is a re-land of D26641600 (3969391c07), but with the `SaveOpImpl` class marked as
`TORCH_API` to ensure that its symbols get exported properly in shared library
builds.

This moves the `SaveOp` code from `load_save_op.h` to `load_save_op.cc`.

Previously this implementation was all in the templatized `SaveOp` class, even
though most of the logic didn't depend on the template parameters.  Having
this code in the header file slows down the build, and forces more files to
be rebuilt than necessary when changing the SaveOp code.  Having this code
in a template class can also make the generated code larger than needed, as
we don't need separate copies instantiated for each context type.
ghstack-source-id: 123146018

Test Plan:
buck test //caffe2/caffe2/python/operator_test:load_save_test

Also tested performing the CMake-based build using shared libraries with CUDA
enabled, and confirmed that the build succeeded.

Reviewed By: mraway

Differential Revision: D26802576

fbshipit-source-id: fc2dbdc1cd20680b082c887366a6305d86688138
2021-03-05 14:52:14 -08:00
474fe7d976 docker: Update default cuda => 11.1 (#53299)
Summary:
We no longer build binaries for CUDA 11.0, so let's ensure that we have
builds for CUDA 11.1 by default instead.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53299

Reviewed By: anjali411

Differential Revision: D26857194

Pulled By: seemethere

fbshipit-source-id: 6094913922c0da832b96e5e49a67369d69d0b8ad
2021-03-05 14:45:57 -08:00
f58f7b786c add distributed backend options in setup.py (#53214)
Summary:
Currently there's only one indicator for build_ext regarding the distributed backend: `USE_DISTRIBUTED`.

However, one can build with selective backends, so this adds the three distributed backend options in setup.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53214

Test Plan: Set the 3 options in environment and locally ran `python setup.py build_ext`

Reviewed By: janeyx99

Differential Revision: D26818259

Pulled By: walterddr

fbshipit-source-id: 688e8f83383d10ce23ee1f019be33557ce5cce07
2021-03-05 14:39:36 -08:00
387d9a6bab Simplify async execution for script calls. (#53204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53204

Async execution for script calls in request_callback_impl.cpp had two
similar if-else blocks that were hard to read. This PR simplifies some of the
logic by breaking it into reusable components.
ghstack-source-id: 122996440

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D26788459

fbshipit-source-id: f2818c6251a465936ed75b7bd356b616f0580094
2021-03-05 13:55:52 -08:00
c0adabe172 automate sharding using S3 test time stats (#53269)
Summary:
Uses nightly commit stats to automatically shard tests based on execution time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53269

Test Plan:
set CIRCLE_JOB to an existing job, like `pytorch_linux_bionic_py3_6_clang9_test`
Then you can run something like: `python test/run_test.py --shard 1 10`

Reviewed By: malfet

Differential Revision: D26819440

Pulled By: janeyx99

fbshipit-source-id: 6bc73d6aa3d52d9850817536be15d7b54a72780e
2021-03-05 13:40:24 -08:00
00bd0e9862 [caffe2] Fix shape inference for LpNorm (#53332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53332

This is to make sure we don't get `BATCH` dim type for the output.

Reviewed By: ChunliF

Differential Revision: D26836902

fbshipit-source-id: bedbd12330c608406e3466b240015235a28d2c4a
2021-03-05 13:35:32 -08:00
efebc6524d Call nvidia-smi.exe before running tests Windows (#53334)
Summary:
To display the basic information about the GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53334

Reviewed By: anjali411

Differential Revision: D26849826

Pulled By: ngimel

fbshipit-source-id: 14f0d9dfe41a35fa45fdf6aa7bf2a41704887c0c
2021-03-05 12:46:01 -08:00
c3405e5ba1 Revert "Automated submodule update: tensorpipe (#53012)" (#53394)
Summary:
This reverts commit 62d1cdd725a9b2af332b7ee67a75be2bbac1481a.

Fixes https://github.com/pytorch/pytorch/issues/53393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53394

Reviewed By: samestep

Differential Revision: D26852966

Pulled By: seemethere

fbshipit-source-id: 325c6c3478a990ade8c7b51d40260caf3028b62d
2021-03-05 11:48:34 -08:00
ba75cedfc5 [1/n][torch/elastic][upstream] Move torchelastic/rendezvous to torch/distributed/rendezvous (#53172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172

Pull Request resolved: https://github.com/pytorch/elastic/pull/141

Upstreams two modules to torch:

1. `torchelastic.rendezvous`
2. `torchelastic.utils`

These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.

==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.

2. I've fixed all lint errors in the Python files, but there are some remaining in the cpp files in ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
     1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
     1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems, and I think it's better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic - T86012579.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle

Reviewed By: H-Huang

Differential Revision: D26718746

fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51
2021-03-05 11:27:57 -08:00
14fa47631b [DDP Logging] Log comm. hook in ddp logging (#52966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966

Logs the registered comm hook if there is one, else logs
"builtin_allreduce"
ghstack-source-id: 123174803

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26709388

fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
2021-03-05 11:23:26 -08:00
5d9b7bee1a [DDP Logging] Log nccl_async_error_handling (#52965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52965

Logs nccl async error handling in ddp logger
ghstack-source-id: 123171876

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26709030

fbshipit-source-id: 530456a5005b8e4956d7fb023986e9b948ebe1a8
2021-03-05 11:23:22 -08:00
bdbfc2582d [Dist Debugality] Log key DDP metrics to stderr under debug mode. (#52957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52957

This diff:
1. Under TORCH_DISTRIBUTED_DEBUG=INFO or DETAIL, logs DDP information during init time (all stats in ddp_logging_data_)
2. Under TORCH_DISTRIBUTED_DEBUG=DETAIL, logs runtime stats when they are collected (first 10 iterations and then once every 100 iterations). We avoid logging every iteration so as not to spam the logs.

Verified by inspecting logs:

```
I0226 19:12:47.109243 2818475 logger.cpp:140] [Rank 1]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 1 output_device: 1 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/A broadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
I0226 19:12:47.109252 2818473 logger.cpp:140] [Rank 0]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 0 output_device: 0 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/A broadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
```

```
I0226 19:12:48.117936 2818473 logger.cpp:286] [Rank 0 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 568944
 Avg backward compute time: 885504
Avg backward comm. time: 692496
 Avg backward comm/comp overlap time: 113536
I0226 19:12:48.118517 2818475 logger.cpp:286] [Rank 1 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 565584
 Avg backward compute time: 876992
Avg backward comm. time: 201872
 Avg backward comm/comp overlap time: 128624
```
ghstack-source-id: 123171875

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26708184

fbshipit-source-id: 16defd5610d28bc4cf3fc2a0cc564e84efcfa791
2021-03-05 11:23:18 -08:00
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating the reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
1b35b1a0c4 Properly skip distributed tests when distributed module is not built (#52945)
Summary:
Currently there is some code that intends to skip distributed tests if
the distributed module is not built. However, it is missing from some
test files; and in some other test files the check happens after the
distributed module is imported, which leads to failure. This generates
a lot of headaches when testing minimal builds locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52945

Reviewed By: anjali411

Differential Revision: D26848241

Pulled By: ezyang

fbshipit-source-id: 983a848844add40869a86f3c9413503a3659b115
2021-03-05 10:28:47 -08:00
c697e48023 Refactor ForeachUnaryOp.cu (#51894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51894

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26323605

Pulled By: izdeby

fbshipit-source-id: eb65269ab3e14160d7cb5e6e84e85ef4037d3b0d
2021-03-05 10:26:58 -08:00
56f8379802 [static runtime] Move all heavy constructor logic into InferenceModule (renamed to StaticModule) (#51564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564

Constructor logic was spread throughout InferenceModule and StaticRuntime.  This diff unifies the two.  After a lot of discussion on this diff D25961626 it became apparent that `clone` is uglier than a cheap StaticRuntime.

This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.

```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```

Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25592967

fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
2021-03-05 10:15:26 -08:00
5ebfabb310 MAGMA: Initialize ipiv data to avoid internal memory access violation (#53064)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51930

Running the reproducer under `cuda-gdb`, I see access violations in either [`zswap_kernel_batched`](4fd4634f35/magmablas/zgetf2_kernels.cu (lines-276)) (part of the LU factorization) and other times in [`zlaswp_columnserial_kernel`](4fd4634f35/magmablas/zlaswp_batched.cu (lines-335)) (part of the inverse).

The common factor between both of these is that they use `ipiv` to index into the matrix. My best guess is the `ipiv` indices aren't written when the factorization fails, hence garbage data is used as matrix indices and we get an access violation. Initializing `ipiv` to a known-good value before the factorization fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53064

Reviewed By: zhangguanheng66

Differential Revision: D26829053

Pulled By: heitorschueroff

fbshipit-source-id: 842854a6ee182f20b2acad0d76d32d27cb51b061
2021-03-05 08:59:27 -08:00
268b96f069 Automated submodule update: tensorpipe (#53353)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: a4816001b8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53353

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26844238

fbshipit-source-id: 9895773f616c53d7d3b3a5e1b95507d26bb93fee
2021-03-05 08:48:15 -08:00
e9d7137072 fixes #38775 #38779: complex support for linspace and logspace (#38875)
Summary:
Closes https://github.com/pytorch/pytorch/issues/38775, Closes https://github.com/pytorch/pytorch/issues/38779

TO-DO:
* [x] Add Tests

Quansight Tracking : q-38775, q-38779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38875

Reviewed By: malfet

Differential Revision: D26628530

Pulled By: anjali411

fbshipit-source-id: ca4259b9f6725c4a4350f944465327169d12122e
2021-03-05 08:37:55 -08:00
42e0983230 [NNC] Added some APIs for dealing directly with Bufs (instead of Tensors) (#53011)
Summary:
(also includes some python binding stuff :P)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53011

Reviewed By: gchanan, robieta

Differential Revision: D26801120

Pulled By: Chillee

fbshipit-source-id: 42a1efb6cbc9ddc0b72b780f3d6b712b3ae62b09
2021-03-05 06:55:48 -08:00
854cc53594 Automated submodule update: tensorpipe (#53265)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 20224c5fe7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53265

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: walterddr, lw

Differential Revision: D26816470

fbshipit-source-id: 8e381a3d6632acbc90691128ef85591b325ecf64
2021-03-05 02:27:28 -08:00
63e0e88ccc [PyPer] More at::empty -> at::detail::empty_cpu (#53333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53333

- Add more variants to `create_empty_from` to take more args, like dtype/layout/device.
- Clean up stray at::empty uses, mostly in the out variants.

Reviewed By: ajyu

Differential Revision: D26799900

fbshipit-source-id: 6676d8043fead63208913ef3a28cabbae76e46bb
2021-03-05 00:16:51 -08:00
69bb0e0285 [caffe2] Avoid some double (and triple) lookups in workspace (#53319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319

Noticed these in profiles.

Also switch to `unordered_map`.

Test Plan: Unit tests.

Reviewed By: swolchok

Differential Revision: D26504408

fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
2021-03-04 22:57:02 -08:00
35364c3641 [static runtime] Enable ClipRangesGatherRangesX2SigridHash fusion for SigridHashPrecompute (#53324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53324

Reviewed By: maratsubkhankulov

Differential Revision: D26833478

fbshipit-source-id: 55ab63faf5b535f2acd2ec5dc5721f5b692832d7
2021-03-04 22:01:08 -08:00
dfd5331e9c Skip tests on ROCm (#53339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53339

Skip tests on ROCm

Test Plan: CI

Reviewed By: gdankel, ZolotukhinM

Differential Revision: D26838813

fbshipit-source-id: e26286a61a192710e393c19d3eb2316b6c76a42e
2021-03-04 21:55:34 -08:00
8bac382d9d [TensorExpr] Remove unused classes from TensorExprKernel. (#53283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53283

We had `ShapeArg` and `KernelArg` classes, which were wrappers over
`BufferArg` without adding any new functionality on top of what already
existed. This PR removes them and replace their uses with `BufferArg`s
directly.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26821993

Pulled By: ZolotukhinM

fbshipit-source-id: d1f95ea069b9f38f1d32424464551df2565b3c49
2021-03-04 21:24:29 -08:00
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
51592a9e0a [package] Add deny method to PackageExporter (#53233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53233

**Summary**
This commit adds a `deny` method to `PackageExporter` that allows
modules to be prohibited during the packaging process. A dependency on a
module matching the names or globs that `deny` was called with will
cause an exception to be raised.
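
A minimal usage sketch (hypothetical file name, module name, and model object; the exact signature may differ):

```python
from torch.package import PackageExporter

# assumption: `my_model` is any picklable model object
with PackageExporter("package.pt") as exporter:
    exporter.deny("some_private_module")  # raise if anything depends on it
    exporter.save_pickle("model", "model.pkl", my_model)
```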

**Test Plan**
This commit adds unit tests to `PackagingTest` for this new method:
`test_deny` and `test_deny_glob`.

**Fixes**
This commit fixes #53217.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26834010

Pulled By: SplitInfinity

fbshipit-source-id: 469b5c6741bcc6dab77e352f41db38fa1e0dae12
2021-03-04 20:37:41 -08:00
f1eedfa2c8 [package] Add allow_empty flag to mock and extern (#53232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53232

**Summary**
This commit adds an optional `allow_empty` argument to
`PackageExporter.mock` and `PackageExporter.extern` that allows certain
patterns for mocked modules and extern modules to be marked as ones that
*must* be matched during the packaging process. If a mock or extern
module with `allow_empty=False` is not matched while packaging, an error
is thrown (see the sketch below).
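
A sketch of how the flag reads at the call site (hypothetical patterns and model object):

```python
from torch.package import PackageExporter

with PackageExporter("package.pt") as exporter:
    # must match at least one dependency, or packaging errors out
    exporter.extern("numpy.**", allow_empty=False)
    # default allow_empty=True: a pattern that matches nothing is fine
    exporter.mock("pandas.**")
    exporter.save_pickle("model", "model.pkl", my_model)  # my_model is assumed
```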

**Test Plan**
This commit adds two new test cases to `PackagingTest`,
`test_extern_glob_allow_empty` and `test_mock_glob_allow_empty` that
test this new flag. Existing tests already cover `allow_empty=True`.

**Fixes**
This commit fixes #53217.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26834011

Pulled By: SplitInfinity

fbshipit-source-id: 9cf4ea56079ae210d6cfa8604218849eb5cde5f4
2021-03-04 20:35:06 -08:00
842ba90739 [iOS] Bump up the Cocoapods version (#53335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53335

ghstack-source-id: 123166245

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: xta0

Differential Revision: D26838693

fbshipit-source-id: 0007eba40b3145c8ba77b3211759f0609e17f561
2021-03-04 20:29:23 -08:00
fdd074e806 [caffe2] Fix shape inference for Softmax (#53132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53132

Input and output should have the same shape for Softmax https://caffe2.ai/docs/operators-catalogue.html#softmax.

Reviewed By: walterddr, yinghai, ChunliF

Differential Revision: D26536592

fbshipit-source-id: 8b50794803aeadcb75d8f370c77f4fef98a1f2ad
2021-03-04 19:37:43 -08:00
795ed5ca3f Enable Kineto in CPU builds (#53174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53174

Enable Kineto also in the CPU builds (non-mobile, non-Windows(atm))

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D26776112

Pulled By: ilia-cher

fbshipit-source-id: 8733f65c2993105136c853f2a7b6e497d0fa53bf
2021-03-04 19:15:52 -08:00
17495e0318 [PyTorch Mobile] Fix case when error messages are stripped, and stack value isn't popped off in lite-interpreter (#53201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53201

This resulted in [S22350](https://www.internalfb.com/intern/sevmanager/view/s/223540), which caused trouble on Android.

1. The Python code has a call to `warnings.warn()`, which resulted in code generated to emit the `WARN` instruction on lite-interpreter.
2. The code for handling that instruction/op-code popped off the value in a call to the `TORCH_WARN()` *macro*.
3. This macro conditionally compiled out evaluation of the arguments if `STRIP_ERROR_MESSAGES` was defined, which resulted in the stack not getting popped, and the lite-interpreter returning the last pushed value onto the stack (see the Python analogy below).
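
A Python analogy of the failure mode (illustrative only; the real code is the C++ `TORCH_WARN` macro):

```python
stack = ["real return value", "warning message"]

def torch_warn(make_msg, strip_error_messages):
    # mimics the macro: when error messages are stripped, the argument
    # expression is never evaluated, so its side effect (the pop) vanishes too
    if not strip_error_messages:
        print("WARN:", make_msg())

torch_warn(lambda: stack.pop(), strip_error_messages=True)
print(stack[-1])  # "warning message" is still on top of the stack, so the
                  # interpreter would return it instead of the real value
```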

I've attempted to re-produce it using this python code: {P243842428}
ghstack-source-id: 122990001

(Note: this ignores all push blocking failures!)

Test Plan:
Created a new unit test to re-produce the failure in the test. Was able to do so locally using the following command:

```
buck test -c pt.strip_error_messages=1 //xplat/caffe2:test_s223540
```

However, since `pt.strip_error_messages=0` for dev and continuous builds, I have had to check in a separate contbuild config to try and trigger this failure on contbuild.

Reviewed By: iseeyuan

Differential Revision: D26765662

fbshipit-source-id: 63c3c96d84ce6a9e5471f13d80165aa3718be9a2
2021-03-04 19:10:07 -08:00
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
110a17a4d9 Update foreach APIs to use scalar lists (#51893)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51893

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26323606

Pulled By: izdeby

fbshipit-source-id: 53791087c924d04526fe7adb8f4ab5676d383b04
2021-03-04 18:20:53 -08:00
47dbdfcfe9 [Static Runtime] remove redundant gather_ranges when fusing (#53323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53323

While optimizing inline_cvr local_ro, we found a pattern where gather_ranges is used redundantly. Fuse this pattern to remove the unnecessary gather_ranges.

Reviewed By: hlu1

Differential Revision: D26659824

fbshipit-source-id: 6420afa3a2c3272c57706b70c2e9834014d6c32d
2021-03-04 18:14:29 -08:00
97d4ed3d2d [torch.futures] Add note about error handling for non-chained futures. (#53212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53212

Ran into a strange issue with error handling in future callbacks (more
details in https://github.com/pytorch/pytorch/issues/52132): essentially,
after a callback throws, all additional processing stops, and other futures can
never be completed, resulting in a hang. Add a note to warn about this.
ghstack-source-id: 123122890

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26793310

fbshipit-source-id: b1ae73a81163d7b37ba07b0685e8de4228f01da6
2021-03-04 18:09:23 -08:00
ac668c55e5 [Static Runtime] Remove dead code in MemoryPlanner and rename unmanaged_value_set to unmanaged_ivalue_set
Test Plan:
```
buck test mode/opt //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: bwasti

Differential Revision: D26827700

fbshipit-source-id: a8696af3e1d2b504fa5754f823b389d45b48af38
2021-03-04 17:37:43 -08:00
36180c1322 [static runtime] aten::to copy out variant (#52343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52343

aten::to returns self when the TensorOptions match and copy is set to false. For static runtime, we always copy. There isn't a separate op for the copying aten::to; it's the same function
with different arguments.
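
The eager-mode aliasing behavior being worked around, for reference:

```python
import torch

t = torch.randn(2, 2)                            # already float32
assert t.to(torch.float32) is t                  # options match: returns self
assert t.to(torch.float32, copy=True) is not t   # forcing the copy gives a fresh tensor
```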

Test Plan:
On AdFinder local_ro:

Before:
0.896742
0.00824827 ms.    0.92773%. aten::to (5 nodes)

After:
0.88233
0.0056607 ms.   0.644675%. aten::to (5 nodes)

buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D26477980

fbshipit-source-id: 8e8448092adff38c141af1ce27a10acd39c07dd1
2021-03-04 17:30:15 -08:00
18277137ff make torch.load() aware of import path changes: torch.tensor -> torch._tensor (#53139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53139

ghstack-source-id: 123090847

Test Plan:
Sandcastle

Also explicitly tests that this test passes after incorporating the changes from D26656767, and adding a `torch.tensor` -> `torch._tensor` mapping to the `load_module_mapping` dict: `buck test mode/dev //pandora/utils/tests:manifold_utils_tests -- --exact 'pandora/utils/tests:manifold_utils_tests - test_load_dataset_valid_dir (pandora.utils.tests.manifold_utils_tests.TestManifoldUtils)'`

With just D26656767, that test fails. With D26656767 + the changes in this diff, that test passes.

Reviewed By: ezyang

Differential Revision: D26760600

fbshipit-source-id: cb16493b858a358acf468d755740aa272ae9d363
2021-03-04 17:11:20 -08:00
a558d3629f Remove MNIST for XLA (#53274)
Summary:
Mitigates https://github.com/pytorch/pytorch/issues/53267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53274

Reviewed By: zhangguanheng66, ailzhang

Differential Revision: D26819702

Pulled By: cpuhrsch

fbshipit-source-id: 5b9b30db6f8fc414aa9f3c841429bf99bc927763
2021-03-04 17:05:56 -08:00
a3c3141dd2 Fix gradfn attr bindings when saved variable is of an output (#53205)
Summary:
When the saved variable is an output, its grad_fn is not saved in SavedVariable, so it must be passed in during `unpack`.
Here, we can always pass in grad_fn (whether or not the saved variable is an output) because it is ignored if the saved variable is not an output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53205

Reviewed By: gchanan, zhangguanheng66

Differential Revision: D26794365

Pulled By: soulitzer

fbshipit-source-id: e039baba20c364c4ab42ff99d0b242dd95c67fb3
2021-03-04 16:59:42 -08:00
6db2f012a5 [PyTorch] Reduce size of register_symbols.cpp (#53278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53278

We can avoid duplicating the string data for the namespaces
by assembling qualified names ourselves as needed.
ghstack-source-id: 123111718

Test Plan:
CI

buildsizebot some iOS apps

Reviewed By: dhruvbird, walterddr, ot

Differential Revision: D26820648

fbshipit-source-id: e2560874c54f46210181ddfee354967644bd41e1
2021-03-04 16:53:58 -08:00
4739d15a67 Skip some nodes during discovery using sequence number (#52180)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12635

This change will help us speed up autograd's discovery algorithm in cases where we use `.grad` and we try to "unroll" the training loop. For example, the example in the issue and also https://github.com/pytorch/pytorch/pull/52180#issuecomment-783400832 observe an unbounded speed-up.

We do this by adding a new sequence_nr-type numbering: for each node, we maintain the length of the longest path from it to any leaf node. How does this help us speed up discovery (dfs)? Previously the bottleneck was that the dfs that computes which nodes need to be executed always explored every node. With this change, before we run dfs, we first compute the minimum seq_nr among all the nodes passed as the `inputs`. If we let this be some number N, intuitively this means that dfs should stay at least N units away from any leaf node. So, if we find ourselves too close to any leaf node, we should stop our search early.
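
A rough Python sketch of the pruning idea (names like `topological_nr` and `next_nodes` are illustrative, not the actual autograd internals):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    topological_nr: int   # length of the longest path from this node to any leaf
    next_nodes: list = field(default_factory=list)

def reachable(roots, targets):
    # topological_nr strictly decreases along every edge toward the leaves, so a
    # subtree whose root is already below min(target numbers) cannot contain a
    # target and can be skipped entirely
    bound = min(t.topological_nr for t in targets)
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        if id(node) in seen or node.topological_nr < bound:
            continue  # prune instead of walking all the way down to the leaves
        seen.add(id(node))
        stack.extend(node.next_nodes)
    return seen
```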

Edit:
After some discussion offline, the plan is:
 - make old sequence_nr a construct of the profiler. This means we can avoid accessing thread local state in cases where the profiler is disabled. Note that we cannot replace sequence_nr as-is because profiler's use-case requires that thread-id + sequence_nr can uniquely identify a given node in order for downstream users/programs to correlate nodes from backward and forward passes. This means we must maintain two sequence_nr's and that we have an extra field in Node.
 - In a future PR, we can potentially remove sequence_nr entirely from the profiler as well, but we avoid doing it now because we haven't measured, and it's a larger effort because we'd have to mess around with the dispatcher and profiler

Testing with this [code](https://gist.github.com/kyunghyuncho/5fb9991ce1233f909051854a84b7148e), we see that runtime no longer increases as we iterate.

Before:
```
100: Time taken: 0.47s, loss: 1.1e+06
200: Time taken: 0.064s, loss: 6.5e+05
300: Time taken: 0.088s, loss: 4.4e+05
400: Time taken: 0.1s, loss: 3.2e+05
500: Time taken: 0.12s, loss: 2.5e+05
600: Time taken: 0.15s, loss: 2e+05
700: Time taken: 0.18s, loss: 1.7e+05
800: Time taken: 0.2s, loss: 1.4e+05
900: Time taken: 0.22s, loss: 1.2e+05
1000: Time taken: 0.24s, loss: 1.1e+05
1100: Time taken: 0.27s, loss: 9.3e+04
1200: Time taken: 0.3s, loss: 8.3e+04
1300: Time taken: 0.34s, loss: 7.4e+04
1400: Time taken: 0.36s, loss: 6.7e+04
1500: Time taken: 0.38s, loss: 6.1e+04
1600: Time taken: 0.4s, loss: 5.6e+04
1700: Time taken: 0.42s, loss: 5.1e+04
1800: Time taken: 0.44s, loss: 4.7e+04
1900: Time taken: 0.47s, loss: 4.4e+04
2000: Time taken: 0.5s, loss: 4.1e+04
```
After:
```
100: Time taken: 0.49s, loss: 1.2e+06
200: Time taken: 0.031s, loss: 6.9e+05
300: Time taken: 0.031s, loss: 4.6e+05
400: Time taken: 0.031s, loss: 3.3e+05
500: Time taken: 0.031s, loss: 2.6e+05
600: Time taken: 0.031s, loss: 2.1e+05
700: Time taken: 0.031s, loss: 1.7e+05
800: Time taken: 0.031s, loss: 1.4e+05
900: Time taken: 0.031s, loss: 1.2e+05
1000: Time taken: 0.031s, loss: 1.1e+05
1100: Time taken: 0.031s, loss: 9.6e+04
1200: Time taken: 0.031s, loss: 8.6e+04
1300: Time taken: 0.031s, loss: 7.7e+04
1400: Time taken: 0.031s, loss: 7e+04
1500: Time taken: 0.031s, loss: 6.3e+04
1600: Time taken: 0.031s, loss: 5.8e+04
1700: Time taken: 0.031s, loss: 5.3e+04
1800: Time taken: 0.031s, loss: 4.9e+04
1900: Time taken: 0.031s, loss: 4.5e+04
2000: Time taken: 0.032s, loss: 4.2e+04

```
Testing w/ small graph to check for regression:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Result: there doesn't seem to be any significant regression
```
Time before: 12.74 us
Time after: 13.12 us
Instruction count before:
                           All          Noisy symbols removed
    Instructions:      8078960                    8000882
    Baseline:             4226                       3838
Instruction count after:
                           All          Noisy symbols removed
    Instructions:      8091846                    8017940
    Baseline:             4336                       3838
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52180

Reviewed By: gchanan, zhangguanheng66

Differential Revision: D26794387

Pulled By: soulitzer

fbshipit-source-id: c00d387a29f151109c33dc6f1b56a8f275cdec58
2021-03-04 16:13:53 -08:00
85109ce427 Support submodule manipulation in GraphModule (#52358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52358

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26759260

Pulled By: ansley

fbshipit-source-id: 25d2b9124a7d957704f1700a45dca143aaed391d
2021-03-04 14:52:35 -08:00
72ec718373 Leak autograd threads after wait limit (#53170)
Summary:
Leak autograd threads if TORCH_AUTOGRAD_SHUTDOWN_WAIT_LIMIT is reached
(defaults to 10 seconds)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53170

Reviewed By: zhangguanheng66

Differential Revision: D26821983

Pulled By: malfet

fbshipit-source-id: 310960564da7cd8c9f475432a8efbee32cfe6009
2021-03-04 14:42:15 -08:00
51718c2f3c Update CODEOWNERS to be tagged as reviewer (#53277)
Summary:
Fixes #FOMOOCR (fear of missing out on code review)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53277

Reviewed By: mrshenli

Differential Revision: D26820361

Pulled By: H-Huang

fbshipit-source-id: 9e985a6a7e6dbda5e454f54fa95cc7d7050245b2
2021-03-04 14:05:36 -08:00
b0aa03b703 fix tensorpipe_agent linked even when USE_TENSORPIPE is turned off (#53281)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53281

Reviewed By: xuzhao9

Differential Revision: D26822375

Pulled By: walterddr

fbshipit-source-id: d4e2b7ed1b38782a9e7f6c5b96b7bb0e31c4bdae
2021-03-04 13:29:27 -08:00
b4395b046a Edit SiLU documentation (#53239)
Summary:
I edited the documentation for `nn.SiLU` and `F.silu` to:
- Explain that SiLU is also known as swish and that it stands for "Sigmoid Linear Unit."
- Ensure that "SiLU" is correctly capitalized.

I believe these changes will help users find the function they're looking for by adding relevant keywords to the docs.
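
For reference, SiLU (a.k.a. swish) is simply `x * sigmoid(x)`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4)
silu = torch.nn.SiLU()
assert torch.allclose(silu(x), x * torch.sigmoid(x))
assert torch.allclose(F.silu(x), silu(x))
```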

Fixes: N/A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53239

Reviewed By: jbschlosser

Differential Revision: D26816998

Pulled By: albanD

fbshipit-source-id: b4e9976e6b7e88686e3fa7061c0e9b693bd6d198
2021-03-04 12:51:25 -08:00
7aeee2849b Parametrization Functionality (#33344)
Summary:
Provides the implementation for feature request issue https://github.com/pytorch/pytorch/issues/28937.

Adds the `Parametrization` functionality and implements `Pruning` on top of it.
It adds the `auto` mode, in which the parametrization is computed just once per forward pass. The previous implementation computed the pruning on every forward, which is not optimal when pruning RNNs, for example.

It implements a caching mechanism for parameters. This is implemented through the mechanism proposed at the end of the discussion https://github.com/pytorch/pytorch/issues/7313. In particular, it assumes that the user will not manually change the updated parameters between the call to `backwards()` and the `optimizer.step()`. If they do so, they would need to manually call the `.invalidate()` function provided in the implementation. This could be made into a function that gets a model and invalidates all the parameters in it. It might be the case that this function has to be called in the `.cuda()` and `.to` and related functions.

As described in https://github.com/pytorch/pytorch/issues/7313, this could be used to implement the `weight_norm` and `spectral_norm` functions in a cleaner way. It also allows, as described in https://github.com/pytorch/pytorch/issues/28937, for the implementation of constrained optimization on manifolds (i.e. orthogonal constraints, positive definite matrices, invertible matrices, weights on the sphere or the hyperbolic space...)
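
A sketch of the kind of usage this enables (API spelling as of the merged feature; details may have shifted during review):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # constrain a square weight to be symmetric
        return X.triu() + X.triu(1).transpose(-1, -2)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())
assert torch.allclose(layer.weight, layer.weight.T)
```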

TODO (when implementation is validated):
- More thorough test
- Documentation

Resolves  https://github.com/pytorch/pytorch/issues/28937

albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33344

Reviewed By: zhangguanheng66

Differential Revision: D26816708

Pulled By: albanD

fbshipit-source-id: 07c8f0da661f74e919767eae31335a9c60d9e8fe
2021-03-04 12:45:27 -08:00
3826a07a63 [PyTorch] Don't inline Dispatcher::call on mobile (#53197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53197

This probably causes a code size blowup and we care more about the size savings than the incremental perf on mobile.
ghstack-source-id: 122977713

Test Plan: buildsizebot some mobile apps

Reviewed By: dhruvbird

Differential Revision: D26731181

fbshipit-source-id: 78a926278a85028af09bfa0731d4d59a55ee3746
2021-03-04 11:10:16 -08:00
8c54cd7f37 Declare NamedTuple at top level (#53273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53273

This prevents a mypy bug.  Fixes #53272

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26819428

Pulled By: ezyang

fbshipit-source-id: e71575ed13321665a976cc5ef8b2993c00626b7d
2021-03-04 10:41:40 -08:00
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
6557ea0509 Context manager for hiding source ranges (#53188)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52456

## Background

Provides a context manager `_hide_source_ranges()` that disables printing graph source ranges by default. It can be overridden on a per-graph basis if desired.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53188

Test Plan:
```
python test/test_jit.py TestJit.test_hide_source_ranges_context_manager
```

```python
import torch

torch.jit.script
def foo(x):
    return torch.add(x, x)

print(foo.graph)
with torch.jit._hide_source_ranges():
    print(foo.graph)

    # Override context manager
    print(foo.graph.str(print_source_ranges=True))

print(foo.graph)
```

```
graph(%x.1 : Tensor):
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%x.1, %x.1, %3) # /Users/jbschlosser/misc/example.py:5:11
  return (%4)

graph(%x.1 : Tensor):
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%x.1, %x.1, %3)
  return (%4)

graph(%x.1 : Tensor):
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%x.1, %x.1, %3) # /Users/jbschlosser/misc/example.py:5:11
  return (%4)

graph(%x.1 : Tensor):
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%x.1, %x.1, %3) # /Users/jbschlosser/misc/example.py:5:11
  return (%4)
```

Reviewed By: walterddr, zhangguanheng66

Differential Revision: D26817070

Pulled By: jbschlosser

fbshipit-source-id: e9d123452c616b0a9dda9e134ef6c2886f229d9b
2021-03-04 09:11:08 -08:00
6dce0cd0d4 Optimize module path finding (#52990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52990

This PR changes module path finding from O(N^2) to O(1)

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26779399

Pulled By: ansley

fbshipit-source-id: ff49d8e10bb4f82583ab4757926198ed46507c29
2021-03-04 09:00:30 -08:00
e698a634cc Enabled amin & amax for float16 & bfloat16 (#52579)
Summary:
1. Enabled `amax` & `amin` for `float16` & `bfloat16` dtypes for both CPU & CUDA (see the sketch after this list).
2. Added `OpInfo`s for `amax` & `amin`.
3. Enabled `test_min_with_inf` & `test_max_with_inf` for both `float16` & `bfloat16`, as they also use `torch.amin` & `torch.amax` respectively.
4. Enabled `test_amax` & `test_amin` for `float16` but not for `bfloat16`, as comparison is done with `numpy`, which doesn't support `bfloat16`.
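
A minimal sketch of the newly enabled dtype support:

```python
import torch

t = torch.tensor([[1., 5.], [3., 2.]], dtype=torch.float16)
print(torch.amax(t, dim=0))  # tensor([3., 5.], dtype=torch.float16)
print(torch.amin(t, dim=1))  # tensor([1., 2.], dtype=torch.float16)
```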

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52579

Reviewed By: pbelevich

Differential Revision: D26784194

Pulled By: heitorschueroff

fbshipit-source-id: 1050de3e155b83f282fb30b0db6658eead89936c
2021-03-04 07:03:03 -08:00
5095332ab9 Minor cleanup of interpolate microbenchmark
Summary: Minor cleanup, addresses comments from https://www.internalfb.com/diff/D26780116 (1559fa6a5c)

Test Plan:
```
➜  vision buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Parsing buck files: finished in 0.6 sec
Building: finished in 6.2 sec (100%) 10951/10951 jobs, 0 updated
  Total time: 6.9 sec
/data/users/nicolashug/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: nearest
Forward Execution Time (us) : 1346.156

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: linear
Forward Execution Time (us) : 1283.784

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True, mode: bicubic
Forward Execution Time (us) : 4769.578

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modenearest
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: nearest
Forward Execution Time (us) : 982.910

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modelinear
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: linear
Forward Execution Time (us) : 1182.191

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse_modebicubic
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False, mode: bicubic
Forward Execution Time (us) : 3545.873

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modenearest
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: nearest
Forward Execution Time (us) : 34373.955

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modelinear
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: linear
Forward Execution Time (us) : 42248.109

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue_modebicubic
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True, mode: bicubic
Forward Execution Time (us) : 405944.286
...
```

Reviewed By: fmassa

Differential Revision: D26782757

fbshipit-source-id: 2039e1e6b4fea2b56bb4bcf2a017476f928e4928
2021-03-04 05:36:28 -08:00
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
bf5e5bf901 [ROCm] Enable test in test_linalg.py, test_optim.py and test_vmap.py … (#52818)
Summary:
Enable test in test_linalg.py, test_optim.py, and test_vmap.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52818

Reviewed By: H-Huang

Differential Revision: D26694091

Pulled By: mruberry

fbshipit-source-id: 285d17aa7f271f4d94b5fa9d9f6620de8a70847b
2021-03-04 02:29:45 -08:00
c4c77e2001 [special] add torch.special namespace (#52296)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

* Add `torch.special` namespace
* Add `torch.special.gammaln` (alias to `torch.lgamma`; usage sketch below)
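
A minimal usage sketch of the alias:

```python
import torch

x = torch.tensor([0.5, 1.5, 4.0])
assert torch.allclose(torch.special.gammaln(x), torch.lgamma(x))
```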

TODO:
* Add proper entries for docs.
   * [x] Add .rst file entry
   * [x] Add documentation
   * [x] Update `lgamma` OpInfo entry for alias to `special.gammaln`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52296

Reviewed By: ngimel

Differential Revision: D26754890

Pulled By: mruberry

fbshipit-source-id: 73479f68989d6443ad07b7b02763fa98973c15f6
2021-03-04 00:04:36 -08:00
c5b0c2fa8b Support torch.complex (#53227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53227

========
Adds support for torch.complex in JIT
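
A minimal sketch of what the added support enables:

```python
import torch

@torch.jit.script
def make_complex(re: torch.Tensor, im: torch.Tensor) -> torch.Tensor:
    return torch.complex(re, im)

print(make_complex(torch.tensor([1.0]), torch.tensor([2.0])))  # tensor([1.+2.j])
```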

Test:
====
python test/test_jit.py -k test_torch_complex

Test Plan: Imported from OSS

Reviewed By: zdevito, bhosmer

Differential Revision: D26808285

Pulled By: nikithamalgifb

fbshipit-source-id: c6918b2baac814e78613a264d90941b8c6102237
2021-03-04 00:03:00 -08:00
d98839e53e [static runtime] register pow out variant (#52454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52454

Test Plan:
adfinder local net
Before:
7.13307 ms/iter
0.0222672 ms.   0.311136%. aten::pow (1 nodes)
After:
7.10623 ms/iter
0.0174462 ms.   0.242774%. aten::pow (1 nodes)

Reviewed By: malfet, hlu1

Differential Revision: D26521717

fbshipit-source-id: 8d9279b59d37c8786a9eeccd0f54bd84c400c128
2021-03-03 21:33:11 -08:00
68810c1836 Delete test_rand_quantization (#53234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53234

Test has been permanently skipped since Nov 2019, see https://github.com/pytorch/pytorch/pull/29463

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D26802660

fbshipit-source-id: ea66be1afd4d7cfbe692594df5d9dd8c29bc5d23
2021-03-03 20:59:00 -08:00
457b9f672c [CI]Shard cuda11_1 tests (#53235)
Summary:
As single pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test hits the timeout

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53235

Reviewed By: glaringlee

Differential Revision: D26802806

Pulled By: malfet

fbshipit-source-id: 8dbd30defa978e806d685b0d851145dc7a9049b4
2021-03-03 20:31:14 -08:00
d5507aa5b5 fix output dtype test in compute_types (#52731)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52731

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26630251

Pulled By: bhosmer

fbshipit-source-id: 5f61967c7e94882a3cc3c1b6beaa2b69d68b9656
2021-03-03 20:30:00 -08:00
fc7171badc inline TensorIteratorConfig setters (#52661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52661

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26602944

Pulled By: bhosmer

fbshipit-source-id: 54ab402a33cb35927ca5de0106884223475f7528
2021-03-03 20:26:47 -08:00
30a8a13a7d Revert D26625807: [pytorch][PR] Deduplicate shared params before constructing Reducer in DDP
Test Plan: revert-hammer

Differential Revision:
D26625807 (5c15a5bb46)

Original commit changeset: f5f5959fef90

fbshipit-source-id: c875cc86b8fd21d9d64f934559f8e3126ed1d23d
2021-03-03 20:05:47 -08:00
38a34887ac [PyTorch] Fix missing move in {List,Tuple}Construct (#53206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53206

Copying the List in ListConstruct is 1 extra refcount bump. Copying the vector in TupleConstruct is 1 extra bump per tuple element.
ghstack-source-id: 123001815

Test Plan: Don't have a precise measurement but it's very roughly 0.5% off total time for AdIndexer inline_cvr based on wall time, and more like 1.2% based on change in perf profile.

Reviewed By: hlu1

Differential Revision: D26790670

fbshipit-source-id: 697ef82fe72a85719bf8ce28f2bb87fe56bbd8ad
2021-03-03 19:28:44 -08:00
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish GradBucket APIs for publishing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
b59075eced [Gradient Compression] Refactor tensor grouping in PowerSGD (#52981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52981

No need to create a hard boundary between rank-1 tensors and high-rank tensors, since some high-rank tensors will not be compressed if the compression cannot save enough bandwidth, according to the `_should_compress` function.

Therefore, refactor and simplify the tensor grouping logic, which addresses the comment in https://github.com/pytorch/pytorch/pull/52541#discussion_r580867311
ghstack-source-id: 122997032

Test Plan:
waitforbuildbot

Already LGTMed by PowerSGD paper author.

Ads1x (completed):
https://www.internalfb.com/intern/tupperware/details/job/?handle=priv3_global%2Fmast_hpc%2Ftsm_hpc-wayi_ads_10x_POWER_SGD_gpu8_2021-02-28_15-29.trainer&tatwTabs=tasks&task_id=0&task_tab=TASK_LOGS

Detectron2:
1) Before refactoring:
f254353864
Accuracy: 39.972
Overall training speed: 67498 iterations in 6:15:42 (0.3340 s / it)

2) After refactoring:
f254353380
Accuracy: 39.944
Overall training speed: 67498 iterations in 6:09:41 (0.3286 s / it)

Reviewed By: rohan-varma

Differential Revision: D26713689

fbshipit-source-id: 12cfcb65feaa2a2d94e3c7793073031f13828305
2021-03-03 19:20:41 -08:00
248e8b42fa [Static Runtime] Use native version of at::empty (#53216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53216

- at::native::empty_cpu calls at::detail::empty_cpu without any changes to the arguments. So we could call at::detail::empty_cpu directly.
- There is no need to create a TensorOptions object first since we can get all the relevant information from the tensor directly.

Reviewed By: bertmaher, swolchok

Differential Revision: D26792255

fbshipit-source-id: 7a4e368a19cea79e136e34dab854cb1d37dbeb58
2021-03-03 17:13:26 -08:00
9b7396e7e2 [pyper] casted_batch_one_hot_lengths with 4-arg to (#53215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53215

The current 5-arg version doesn't fuse in the inline_cvr model instances

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_weights=/data/users/ansha/tmp/adfinder/models/c2_local_weight_data.pb --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_local_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_local_net.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/local_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --compare_results=1 --iters=2000 --warmup_iters=2000 --num_threads=1 --do_profile=1 --do_benchmark --benchmark_c2_predictor=1
```

```
Time per node type:
        3.82029 ms.    71.8523%. aten::addmm (9 nodes)
       0.926298 ms.    17.4219%. fb::sigrid_transforms (1 nodes)
       0.122496 ms.    2.30391%. fb::clip_ranges_gather (210 nodes)
        0.11985 ms.    2.25416%. fb::clip_ranges_gather_sigrid_hash_precompute_v3 (54 nodes)
      0.0973721 ms.    1.83138%. aten::sigmoid (3 nodes)
      0.0352937 ms.   0.663807%. fb::batch_box_cox (1 nodes)
       0.034759 ms.    0.65375%. prim::TupleConstruct (1 nodes)
      0.0222235 ms.   0.417981%. aten::index (4 nodes)
      0.0215314 ms.   0.404964%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0199659 ms.   0.375521%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0192885 ms.   0.362779%. aten::cat (2 nodes)
      0.0181285 ms.   0.340963%. aten::mul (2 nodes)
      0.0109381 ms.   0.205725%. aten::pow (1 nodes)
      0.0091476 ms.   0.172049%. prim::ListConstruct (8 nodes)
     0.00794012 ms.   0.149338%. aten::relu (2 nodes)
     0.00668873 ms.   0.125802%. prim::ListUnpack (1 nodes)
     0.00569745 ms.   0.107158%. aten::to (4 nodes)
     0.00527507 ms.   0.099214%. aten::narrow_copy (4 nodes)
     0.00483189 ms.  0.0908785%. fb::lengths_range (4 nodes)
     0.00399056 ms.  0.0750548%. aten::logit (1 nodes)
     0.00324574 ms.  0.0610462%. fb::gather_ranges (4 nodes)
     0.00161166 ms.  0.0303122%. fb::clip_ranges (2 nodes)
        5.31686 ms. in Total
StaticRuntime setup time: 0.016461 ms
Memory allocation time: 0.00220284 ms
Memory deallocation time: 0.118134 ms
Outputs deallocation time: 0.0674883 ms
Total memory managed: 716352 bytes
Total number of reused tensors: 22
```

Reviewed By: hlu1

Differential Revision: D26789260

fbshipit-source-id: 52adadddaae29a946de8a58bd592c06e6d4ce8c8
2021-03-03 16:41:39 -08:00
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
14a2ef0932 Deduplicate test cases in suites by taking the longer test case (#53154)
Summary:
Also removes unneeded filename field in S3.

Tested locally:
I locally installed
```
conda install -c anaconda boto3
conda install -c conda-forge unittest-xml-reporting
```
I ran `python test/test_type_hints.py --save-xml=/tmp/reports/test_type_hints` twice to generate two reports of the same test cases.
Then, I edited the print_test_stats.py file to print the report instead of upload to S3, and then ran `CIRCLE_SHA1="$(git rev-parse HEAD)" CIRCLE_JOB=foo python torch/testing/_internal/print_test_stats.py --upload-to-s3 /tmp/reports/test_type_hints`. I verified the report object looked correct:
```
{
   'build_pr': '',
   'build_tag': '',
   'build_sha1': '67cecd7f6cf2956bda1178ae2369cd74ba946f78',
   'build_branch': '',
   'build_job': 'foo',
   'build_workflow_id': '',
   'total_seconds': 67.316,
   'format_version': 2,
   'files': {
       'test/test_type_hints': {
             'total_seconds': 67.316,
             'suites': {
                    'TestTypeHints': {
                           'total_seconds': 67.316,
                           'cases': {
                                 'test_doc_examples': {
                                        'seconds': 8.821,
                                        'status': None
                                  },
                                 'test_run_mypy': {
                                       'seconds': 58.495,
                                       'status': None
                                  }
                           }
                    }
             }
       }
   }
}
```
It did take the longer of the two test cases for both test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53154

Reviewed By: samestep

Differential Revision: D26793522

Pulled By: janeyx99

fbshipit-source-id: 5644c1bd38acb8bca0d69851cf1d549a03334b7a
2021-03-03 16:12:44 -08:00
c94b8e13ec Remove docker_config_defaults from CircleCI config (#53200)
Summary:
It doesn't seem to be used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53200

Test Plan: CI.

Reviewed By: xuzhao9

Differential Revision: D26785924

Pulled By: samestep

fbshipit-source-id: f4698ff2c213d4679e6d76b7677d9b9004917ee1
2021-03-03 16:05:47 -08:00
79944f7ad9 [fx] simple doc fix
Reviewed By: houseroad

Differential Revision: D26739803

fbshipit-source-id: e680ce961a9ed1a5042d675aca9f5cf118c8ff85
2021-03-03 15:47:40 -08:00
ba36e32406 [Gradient Compression] Correct the usage of min_compression_rate (#52979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52979

Compression rate = uncompressed size / compressed size, so the compression rate is usually greater than 1.

Previously the compression rate was perceived as compressed size / uncompressed size, which can be very confusing.
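
For illustration, a minimal sketch (not the actual PowerSGD hook code; the names are hypothetical) of applying the corrected definition:

```
def should_compress(numel_before, numel_after, min_compression_rate=2.0):
    # compression rate = uncompressed size / compressed size;
    # a value > 1 means the compressed representation is actually smaller
    return (numel_before / numel_after) >= min_compression_rate
```
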
ghstack-source-id: 122996272

Test Plan: unit tests

Reviewed By: zhaojuanmao

Differential Revision: D26713349

fbshipit-source-id: 83b7f8908c101954cf01f56a22161047fbfeaa53
2021-03-03 15:35:40 -08:00
d30f4d1dfd Migrate apex.parallel.SyncBatchNorm channels_last to pytorch (#46906)
Summary:
per title

This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing `sum`, `div` kernels into backward elementwise kernel
b167402e2e/torch/nn/modules/_functions.py (L76-L95)

Todo
- [x] Discuss a regression introduced in https://github.com/pytorch/pytorch/pull/37133#discussion_r512530389, which is the synchronized copy here
b167402e2e/torch/nn/modules/_functions.py (L32-L34)

**Comment**: This PR uses the apex version for the size check. Tests passed and I haven't seen anything wrong so far.

- [x] The restriction to use channels_last kernel will be like this
```
inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) {
  return self.is_contiguous(at::MemoryFormat::ChannelsLast) || self.ndimension() == 2;
}
```
I think we can relax that for channels_last_3d as well?

**Comment**: we don't have a benchmark for this now; will check this and add functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close https://github.com/pytorch/pytorch/issues/50781
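
For reference, a hedged usage sketch of the migrated module (illustrative only; running the forward pass requires an initialized distributed process group):

```
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# The migrated kernels target the channels_last memory format:
model = model.to(memory_format=torch.channels_last)
```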

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46906

Reviewed By: albanD

Differential Revision: D26771437

Pulled By: malfet

fbshipit-source-id: d00387044e9d43ac7e6c0e32a2db22c63d1504de
2021-03-03 15:29:45 -08:00
9c2673df46 Revert D26723384: [pytorch][PR] Implements torch.linalg.lstsq
Test Plan: revert-hammer

Differential Revision:
D26723384 (3ac9013235)

Original commit changeset: c9866a95f140

fbshipit-source-id: 3e5263d71facdc91ca09d7dcbbbe3ba818ee2821
2021-03-03 15:24:25 -08:00
a812175173 Update Kineto revision (#53199)
Summary:
Update Kineto revision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53199

Reviewed By: gdankel

Differential Revision: D26784476

Pulled By: ilia-cher

fbshipit-source-id: 7e908f63ee2790ddb5348c580ad5a4d5ad94b921
2021-03-03 15:07:18 -08:00
096c66a99f [sparsity][refactor] Rename row/col to out/in features
Summary: Names such as `row_block_size` and `col_block_size` might be ambiguous, especially if different engines use different tensor layouts (i.e. rows=output features, etc.). Having names such as `out_features_block_size` and `in_features_block_size` makes more sense.

Test Plan:
`buck test mode/opt //caffe2/torch/fb/model_optimization:sparsity_test`

```
Building with Remote Execution [RE]. Used 36:09 minutes of total time.
[RE] Waiting on 0 remote actions. Completed 264 actions remotely.
Building: finished in 02:34.4 min (100%) 18884/18884 jobs, 420 updated
  Total time: 02:34.8 min
More details at https://www.internalfb.com/intern/buck/build/b34b5c52-eba6-4e17-92f9-1f5ce620f8f0
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 8fe8fa95-c1f8-4b4f-9cbf-88b3b1b28eaf
Trace available for this run at /tmp/tpx-20210302-000019.503678/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4785074650825194
    ✓ ListingSuccess: caffe2/torch/fb/model_optimization:sparsity_test - main (4.094)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseKernels) (1.896)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (1.907)
    ✓ Pass: caffe2/torch/fb/model_optimization:sparsity_test - test_sparse_qlinear_serdes (caffe2.torch.fb.model_optimization.test.sparsity.quantized_test.TestQuantizedSparseLayers) (2.035)
Summary
  Pass: 3
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4785074650825194
```

Reviewed By: dskhudia

Differential Revision: D26747065

fbshipit-source-id: 685fe864062ed532de284b22db757a921806d4ab
2021-03-03 15:05:40 -08:00
f7d65c5cd2 Use .gv instead of .dot for Graphviz in fast_nvcc (#53208)
Summary:
See this page for context: https://marc.info/?l=graphviz-devel&m=129418103126092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53208

Test Plan:
```
tools/fast_nvcc/fast_nvcc.py --help
```

Reviewed By: janeyx99

Differential Revision: D26791398

Pulled By: samestep

fbshipit-source-id: 6a0363a4664e79b80ddf2ae799ec05ee7d028357
2021-03-03 15:01:21 -08:00
86166f2124 [quant][fix] MHA tensor assignment fix (#53031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53031

During the module conversion, the weight was assigned directly to the linear layer inside the quantizable MHA. Instead, the weight must be assigned to `layer.weight`.

Test Plan:
`buck test mode/opt //caffe2/test:quantization -- test_custom_module_multi_head_attention`

```
Building: finished in 6.9 sec (100%) 7316/7316 jobs, 3 updated
  Total time: 7.4 sec
More details at https://www.internalfb.com/intern/buck/build/914cb095-806e-4891-8822-e2644283f05c
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: fcccbd0b-a887-4874-8455-d1cf8411be1d
Trace available for this run at /tmp/tpx-20210301-004359.492205/trace.log
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1688849910412609
    ✓ ListingSuccess: caffe2/test:quantization - main (2.440)
    ✓ Pass: caffe2/test:quantization - test_custom_module_multi_head_attention (quantization.test_quantized_op.TestQuantizedOps) (5.672)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/1688849910412609
```

Reviewed By: raghuramank100

Differential Revision: D26720500

fbshipit-source-id: 3ba5d5df1c23cc5150c4a293d3c93c44dc702e50
2021-03-03 14:49:19 -08:00
4008df3507 Add property binding in torchbind (#50670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50670

This PR adds property support to Torchbind. There are two cases that it needs to work:

**Torchscript**
Inside Torchscript, we don't go through pybind so there is no issue with accessing properties through ClassType.

**Eager Mode**
In Eager Mode, Torchbind creates a ScriptObject, to which we cannot dynamically add (i.e., access) properties after initialization (https://stackoverflow.com/questions/1325673/how-to-add-property-to-a-class-dynamically). Therefore we created a Python wrapper (ScriptObjectWrapper) around ScriptObject where we can use the `property` method to set properties. By doing so, we can look up the wrapped object's properties through the `__getattr__` method of ScriptObjectWrapper. This logic is inspired by https://github.com/pytorch/pytorch/pull/44324
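
A rough Python sketch of that lookup pattern (simplified; the real ScriptObjectWrapper lives in the torchbind sources):

```
class ScriptObjectWrapper:
    def __init__(self, wrapped):
        # write through __dict__ to avoid triggering __getattr__ recursion
        self.__dict__["wrapped_obj"] = wrapped

    def __getattr__(self, name):
        # only called when normal attribute lookup fails, so properties
        # defined on the wrapper class take precedence; everything else
        # falls through to the wrapped ScriptObject
        return getattr(self.wrapped_obj, name)
```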

Test Plan:
test cases in test_torchbind.py

Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26632781

fbshipit-source-id: dd690887cfda0c48ff0d104aa240ce0ab09055bc
2021-03-03 14:25:52 -08:00
59c0c19be2 Add RemoteModule to master RPC docs. (#53084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53084

Adding RemoteModule to master RPC docs since it is a prototype
feature.
ghstack-source-id: 122816689

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26743372

fbshipit-source-id: 00ce9526291dfb68494e07be3e67d7d9c2686f1b
2021-03-03 13:52:11 -08:00
e5ecd1ddf8 [Vulkan] Fix build warnings-treated-as-error on Linux. (#52781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52781

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D26669311

Pulled By: AshkanAliabadi

fbshipit-source-id: 78b08d0b264d4d5cf8af964c589b9b7d0ddc7311
2021-03-03 13:48:43 -08:00
f3190a77b2 [Vulkan] Update VMA to VMA::e74dc79903f3e59b15a48f112b5c804fea2220b0. (#52938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52938

Update VMA to git revision
e74dc79903f3e59b15a48f112b5c804fea2220b0 to fix
https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator/issues/164

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D26784525

Pulled By: AshkanAliabadi

fbshipit-source-id: a1d88f708f4d64d00167b2f02fefd7d51a25a3ca
2021-03-03 13:47:05 -08:00
7cec4b3d4a [quant][fx] add _remove_qconfig flag to convert_fx (#53166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53166

Context: For fx modules that contain scriptmodules, calling
delattr(module, 'qconfig') throws an attribute error. We will follow up
with a separate issue/repro to fix this problem.

This PR adds a temporary flag to the convert_fx API to preserve the qconfig attributes on the converted model.
We will remove this flag once we reach a conclusion on calling delattr on scriptmodules.
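
A hedged sketch of the flag in use (quantize_fx API of this era; the toy model and qconfig are illustrative):

```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
prepared = prepare_fx(model, {"": get_default_qconfig("fbgemm")})
prepared(torch.randn(2, 4))  # calibration pass
quantized = convert_fx(prepared, _remove_qconfig=False)  # keep qconfig attributes
```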

Test Plan:
python test/test_quantization.py test_preserve_qconfig

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26771518

fbshipit-source-id: 9fd72816576856ffb4aa11f8fde08303d1df10a2
2021-03-03 12:58:05 -08:00
25a3732c8d [vulkan] Add, sub, mul, and div ops with broadcasting for Vulkan (#52842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52842

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26665326

Pulled By: SS-JIA

fbshipit-source-id: ed73918a5cd3390d6c8a7fa284c79eb7de9f9906
2021-03-03 12:55:54 -08:00
8b5b7fa83d [WIP][FX] Optionally record stack traces when symtracing (#53081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53081

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D26742402

Pulled By: jamesr66a

fbshipit-source-id: 7987f9ddf061f6de3b4a638d98e0fae6d68d90c6
2021-03-03 12:30:43 -08:00
510c03d922 [Gradient Compression] Remove some low-level methods of GradBucket class (#53098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098

Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.

Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122979064

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: osalpekar

Differential Revision: D26784249

fbshipit-source-id: d1b27bb026989c25a5b65be4767cb752afd6f19b
2021-03-03 12:06:14 -08:00
f8238d7917 [optim] bugfix when all parameters have no grad (#52944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52944

This fixes a bug introduced while refactoring optimizers in https://github.com/pytorch/pytorch/pull/50411. When all parameters have no grads, we should still allow `beta`-like hyperparameters to be defined.

Reviewed By: ngimel

Differential Revision: D26699827

fbshipit-source-id: 8a7074127704c7a4a1fbc17d48a81e23a649f280
2021-03-03 11:56:09 -08:00
ecd8e4c1d5 Add guard to run on current thread (#52361)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52361

Test Plan:
buck build //xplat/caffe2:aten_test_test_thread_pool_guard
./aten_test_test_thread_pool_guard

Reviewed By: kimishpatel

Differential Revision: D26429540

fbshipit-source-id: 16e4a56d4bf9b73b1ea1ff88d7dc6730e0b1e029
2021-03-03 11:43:40 -08:00
0f81a69a96 Make meta a device (getting rid of empty_meta) (#53143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143

Meta is now an honest-to-goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors.  This is way better
than empty_meta since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we need to register Meta
versions of those functions.)
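
An illustrative sketch of the intended API (per the caveat above, not every factory function has a Meta kernel yet):

```
import torch

x = torch.empty(2, 3, device='meta')
print(x.device, x.shape)  # meta torch.Size([2, 3]); no real storage is allocated
```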

Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
  tensors are device agnostic.  It's hard to say exactly what the
  correct level of abstraction here is, but in this particular case
  implementation considerations trump semantic considerations: it
  is way easier to have just a meta device, than to have a meta device
  AND a cpu device AND a cuda device.  This may limit the applicability
  of meta tensors for tracing models that do explicit cpu()/cuda()
  conversions (unless, perhaps, we make those operations no-ops on meta
  tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
  Are they really supposed to be all caps?  That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
  dispatch keys.
- I intentionally did NOT add a Backend for Meta.  For now, I'm going to
  hope meta tensors never exercise any of the Backend conversion code;
  even if it does, better to fix the code to just stop converting to and
  from Backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26763552

Pulled By: ezyang

fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
2021-03-03 11:24:13 -08:00
fd3004d3ee Add NoOpDeviceGuardImpl (#53142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53142

It turns out to make Meta a device I need to substantively reuse
the CPUGuardImpl implementation.  It's pretty parametrizable so
just move this over to DeviceGuardImplInterface templated over
the DeviceType.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: anjali411, samestep

Differential Revision: D26763553

Pulled By: ezyang

fbshipit-source-id: 464fb3e3a72ba7c55a12adffe01c18171ce3e857
2021-03-03 11:24:08 -08:00
99098c1d70 Delete dead Backend toSparse (#53116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53116

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26753226

Pulled By: ezyang

fbshipit-source-id: 2941876d546c39ee3913c2ffffdb0a0ea7360f0c
2021-03-03 11:22:03 -08:00
f5e725527d [PyTorch] Save a single add instruction in the dispatcher (#52543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52543

This saves one (1) add instruction. New code comments should
explain exactly why. In short, we store a direct pointer in
`OperatorHandle` in addition to the `std::list<OperatorDef>::iterator`
because converting the latter to the former requires an add instruction.

It is not clear to me whether this is a particularly great tradeoff,
but I spent (more) time on it (than I expected), so here it is for
review.
ghstack-source-id: 122147199

Test Plan:
Inspect assembly for at::empty in benchmark code -- see add
instruction disappeared.

Compare empty benchmark performance to baseline with perf stat.

Baseline:
```
          5,077.43 msec task-clock                #    1.000 CPUs utilized            ( +-  0.25% )
               405      context-switches          #    0.080 K/sec                    ( +-  1.37% )
                 3      cpu-migrations            #    0.001 K/sec                    ( +- 18.22% )
            12,259      page-faults               #    0.002 M/sec                    ( +-  0.10% )
    10,089,754,343      cycles                    #    1.987 GHz                      ( +-  0.25% )  (50.04%)
    29,516,000,227      instructions              #    2.93  insn per cycle           ( +-  0.04% )  (50.08%)
     5,662,629,032      branches                  # 1115.256 M/sec                    ( +-  0.02% )  (50.08%)
         1,955,729      branch-misses             #    0.03% of all branches          ( +-  0.88% )  (50.04%)

            5.0796 +- 0.0128 seconds time elapsed  ( +-  0.25% )
```

After:
```
          5,017.77 msec task-clock                #    1.001 CPUs utilized            ( +-  0.19% )
               400      context-switches          #    0.080 K/sec                    ( +-  3.09% )
                 4      cpu-migrations            #    0.001 K/sec                    ( +- 46.91% )
            12,240      page-faults               #    0.002 M/sec                    ( +-  0.37% )
     9,960,189,535      cycles                    #    1.985 GHz                      ( +-  0.19% )  (50.02%)
    29,467,149,773      instructions              #    2.96  insn per cycle           ( +-  0.11% )  (50.03%)
     5,661,074,219      branches                  # 1128.206 M/sec                    ( +-  0.02% )  (50.07%)
         2,032,712      branch-misses             #    0.04% of all branches          ( +-  1.35% )  (50.07%)

            5.0151 +- 0.0101 seconds time elapsed  ( +-  0.20% )
```

1.2% cycles win, outside the noise
0.16% instruction count win, barely outside noise

I am surprised at the size of the cycles win.

Reviewed By: bhosmer

Differential Revision: D26564192

fbshipit-source-id: 71f731ba54ec1cb407673db691eaf77a257de4a9
2021-03-03 10:47:34 -08:00
43906f9b8b [ZeroRedundancyOptimizer] Minor stub fix (#53165)
Summary:
Not sure how important that is
Tied to https://github.com/pytorch/pytorch/issues/53108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53165

Reviewed By: albanD

Differential Revision: D26781956

Pulled By: blefaudeux

fbshipit-source-id: b7daca0ea95be190a5ffeae12123e301204ed4eb
2021-03-03 10:15:10 -08:00
5c15a5bb46 Deduplicate shared params before constructing Reducer in DDP (#51929)
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.

We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which i guess is one potential consequence of multiple DDP hooks on a given param, not sure why, i'd have to dig further).

This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
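
A minimal sketch (not the actual DDP diff) of order-preserving deduplication over a parameter list:

```
def dedup_params(params):
    seen = set()
    unique = []
    for p in params:
        if id(p) not in seen:  # identity, not value, distinguishes shared params
            seen.add(id(p))
            unique.append(p)
    return unique
```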

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929

Reviewed By: zou3519

Differential Revision: D26625807

Pulled By: zhaojuanmao

fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
2021-03-03 10:13:24 -08:00
20860ab01a Revert D26727918: [pytorch][PR] Added CUDA support for torch.orgqr
Test Plan: revert-hammer

Differential Revision:
D26727918 (e29d8477a6)

Original commit changeset: 1c4d15fa76ba

fbshipit-source-id: f3d5d6811ab77332a333cd165d69fcd9ecd92dc6
2021-03-03 10:06:49 -08:00
fbf60b5aaf Store only coverage info as artifacts (#53150)
Summary:
I noticed https://github.com/pytorch/pytorch/issues/53126 stored everything in the test folder as an artifact, which isn't exactly what we want. Here, I try to store just the relevant info: coverage files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53150

Reviewed By: albanD

Differential Revision: D26767185

Pulled By: janeyx99

fbshipit-source-id: 286d341ccdfa97d138a2048bb4ee01c7ae2579a1
2021-03-03 09:56:17 -08:00
c8cc2e2133 Update CODEOWNERS for test_public_bindings (#53158)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53158

Reviewed By: glaringlee

Differential Revision: D26779568

Pulled By: albanD

fbshipit-source-id: f7d56a30dff95dc3f24608ff01367c134cc08bbf
2021-03-03 09:27:04 -08:00
a1d204807a Add shape inference for SparseLengthsSumSparseLookup
Summary: Just copy the corresponding input shape info. Otherwise we will miss the shape info for the output of SparseLengthsSumSparseLookup, which is then inferred as the input of the downstream SparseLengthsSum op, whose int64/int32 mode would be undetermined.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: khabinov, ChunliF

Differential Revision: D26769226

fbshipit-source-id: 4032bc4643a125095a48fa8c23ca4ebcf26dc29c
2021-03-03 09:25:29 -08:00
1559fa6a5c [operator benchmarks] Added more modes to interpolation tests (#53186)
Summary:
Description:
- Added more modes: bicubic and nearest to interpolation tests
- Added a test case for downsampling a small image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53186

Reviewed By: albanD

Differential Revision: D26780116

Pulled By: fmassa

fbshipit-source-id: f4f498e6e1da1ec131e6d9d9f42dc482135ae9e2
2021-03-03 09:18:38 -08:00
85e5fdb919 disable TCPStore multi_worker tests for windows (#53156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53156

Will SSH into windows machine to validate that these tests are skipped.

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D26769791

Pulled By: H-Huang

fbshipit-source-id: e4427ba2d6cfe5a1de26e335cd27c1e8875174d3
2021-03-03 08:37:08 -08:00
b3c4ac6319 Fix OpenBLAS discovery (#53168)
Summary:
Fix accidental regression introduced by https://github.com/pytorch/pytorch/issues/47940

`FIND_PACKAGE(OpenBLAS)` does not validate that the discovered library can actually be used, whereas `check_fortran_libraries` does

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53168

Test Plan: Build PyTorch with static OpenBLAS and check that `torch.svd(torch.ones(3, 3)).S` does not raise an exception

Reviewed By: walterddr

Differential Revision: D26772345

Pulled By: malfet

fbshipit-source-id: 3e4675c176b30dfe4f0490d7d3dfe4f9a4037134
2021-03-03 08:23:02 -08:00
c957e2ab42 Add more datapipe to functional API (#53123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53123

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26756638

Pulled By: ejguan

fbshipit-source-id: 6ff0eb6c7ee702056ff19eeb723949e4642f2784
2021-03-03 07:01:00 -08:00
0aa9f22f1a Move groupbykey to grouping (#53122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53122

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26756641

Pulled By: ejguan

fbshipit-source-id: c4bc5864d841ce20c49446a03cfd195245b2be6e
2021-03-03 06:59:22 -08:00
59b2b8b091 Revert D26727660: [pytorch][PR] Add OpInfo for bitwise_not and make ROCM and CUDA OpInfo tests consistent
Test Plan: revert-hammer

Differential Revision:
D26727660 (816646bd6f)

Original commit changeset: 3aea236cf000

fbshipit-source-id: 91c6ec0c55c0295bb209f450ae3c96bee0a37356
2021-03-03 06:08:48 -08:00
d90d7245f4 [PyPer] Optimize sigrid_hash (#53065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53065

Reviewed By: ajyu

Differential Revision: D26563512

fbshipit-source-id: a1a76f92ba500605ab2e3370737bd3965d81deb1
2021-03-03 01:31:53 -08:00
30dd15e778 [PyTorch] Add doc string for lite interpreter related api in Android (#53136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53136

As the title says: the doc strings in iOS, C++, and Python are ready.

As a reference, the doc strings for other lite interpreter related APIs:
[_load_for_mobile](https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/torch/csrc/jit/mobile/import.h?commit=c95d12f9d67ee198aa4b5aafec980e9048de1702&lines=16-43)
[_save_for_lite_interpreter](https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/torch/jit/_script.py?commit=b1d7f0ba6001beed6ba3b0a69a225abab4ed3866&lines=496-509)
ghstack-source-id: 122936777

Test Plan: CI

Reviewed By: IvanKobzarev, iseeyuan

Differential Revision: D26742092

fbshipit-source-id: 76464b5e4ceafe71348b58ba2af98c3debdaae63
2021-03-02 23:17:54 -08:00
a2a88990cd [PyTorch] Remove extra RNN.cpp file (#53169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53169

As the title says: `aten/src/ATen/native/RNN.cpp` appears twice in `aten_native_source_list`.
ghstack-source-id: 122936706

Test Plan: CI

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D26715640

fbshipit-source-id: 54717ded9b293e022a47ab7891dfd04afae48ce5
2021-03-02 23:09:03 -08:00
70d0aab7bd De-prioritise Dimname and DimnameList in python overload resolution (#51350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51350

`None` being a valid `Dimname` is awkward for optional `dim` arguments, as found
in NumPy's reduction functions like `std` and `var`. In these cases `dim=None`
should mean an all-reduction, but instead you get an error
"Please look up dimensions by name".

I've also had to fix `FunctionParameter::check` to actually check the first
element of `INT_LIST` arguments and reject non-int types. Otherwise, the dim
names end up calling the `int[]` overload and fail.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26756208

Pulled By: mruberry

fbshipit-source-id: 44221ca0f4822ec2c1f62b092466fd4f779eb45a
2021-03-02 23:07:08 -08:00
816646bd6f Add OpInfo for bitwise_not and make ROCM and CUDA OpInfo tests consistent (#51944)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

This PR also enables the OpInfo tests on ROCM to check the same dtypes as CUDA.

A few tests had to be skipped (due to failures).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51944

Reviewed By: H-Huang

Differential Revision: D26727660

Pulled By: mruberry

fbshipit-source-id: 3aea236cf0002f46c2737afbda2ed3efccfe14f5
2021-03-02 22:56:40 -08:00
926e011cde Fixed out= variant of linalg.solve (#51968)
Summary:
This PR modifies the behavior of the `linalg_solve_out` variant to match the description here https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".
It's allowed to pass out tensors with complex dtypes for float inputs.

`linalg_solve_out` was broken for batched vector inputs and it's now fixed.
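
A short usage sketch of the fixed out= behavior (illustrative):

```
import torch

A = torch.randn(3, 3, dtype=torch.float64)
b = torch.randn(3, dtype=torch.float64)
out = torch.empty(3, dtype=torch.complex128)  # complex out for float inputs is allowed
torch.linalg.solve(A, b, out=out)
```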

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51968

Reviewed By: H-Huang

Differential Revision: D26728825

Pulled By: mruberry

fbshipit-source-id: c06fe937e7f452193b23ba09ca6cfa2703488455
2021-03-02 22:33:19 -08:00
bd7ac755d8 Fix loop type (#50484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50484

I currently see the compilation warning:
```
Jan 13 16:46:21 [3644/5223] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/ivalue.cpp.o
Jan 13 16:46:21 ../aten/src/ATen/core/ivalue.cpp:855:22: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >::size_type' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:21   for (auto i = 0; i < slots_.size(); ++i) {
```
This diff fixes that

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25901674

fbshipit-source-id: 0a09570866f23b5878bf06f46f918d71a733974f
2021-03-02 21:59:31 -08:00
e29d8477a6 Added CUDA support for torch.orgqr (#51348)
Summary:
This PR adds support for CUDA inputs for `torch.orgqr`.

CUDA implementation is based on both [cuSOLVER](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) and MAGMA. cuSOLVER doesn't have a specialized routine for the batched case. While MAGMA doesn't have a specialized GPU native (without CPU sync) `orgqr`. But MAGMA has implemented (and not documented) the batched GPU native version of `larft` function (for small inputs of size <= 32), which together with `larfb` operation form `orgqr` (see the call graph [here at the end of the page](http://www.netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_ga14b45f7374dc8654073aa06879c1c459.html)).

So now there are two main codepaths for CUDA inputs (if both MAGMA and cuSOLVER are available):
* if `batchsize > 1` and `tau.shape[-1] <= 32` then MAGMA based function is called
* else [cuSOLVER's `orgqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) is used.

If MAGMA is not available then only cuSOLVER is used and vice versa.
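
For context, a hedged usage sketch (`orgqr` reconstructs the orthogonal factor Q from the output of `geqrf`; a CUDA device is assumed for the new codepaths):

```
import torch

a = torch.randn(5, 3, device='cuda')
reflectors, tau = torch.geqrf(a)
q = torch.orgqr(reflectors, tau)  # explicitly forms Q from the Householder reflectors
```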

Documentation updates and possibly a new name for this function will be in a follow-up PR.

Ref. https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51348

Reviewed By: ngimel

Differential Revision: D26727918

Pulled By: mruberry

fbshipit-source-id: 1c4d15fa76ba624e341a69a32337a9a16cc01013
2021-03-02 21:34:23 -08:00
0819d5f9e9 [FX] Added docstring for concrete_args (#53151)
Summary:
An oversight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53151

Reviewed By: jamesr66a

Differential Revision: D26766450

Pulled By: Chillee

fbshipit-source-id: 26e6e44386bbff4bc06b41c39dff9e02cadfcc73
2021-03-02 21:15:00 -08:00
e1e19a71ce [shape inference] fix pruning
Summary: Use the dim type of the first input for the output.

Test Plan:
unit test
flow test: f254777437
https://fburl.com/n933wc3a
shapes {
  shape {
    dims: 19102004
    dims: 68
    data_type: UINT8
    name: "sparse_nn_2/sparse_arch_2/grouped_embedding_10/grouped_generic_embedding_10/GSF_IDLIST_IG_BUSINESS_AUTHOR_PPR_ORGANIC_ENGAGEMENT_UNIFORM_RIDS/w_EmbeddingFusedUint4Quantization"
  }
  dim_type: CONSTANT
  dim_type: CONSTANT
  name: "sparse_nn_2/sparse_arch_2/grouped_embedding_10/grouped_generic_embedding_10/GSF_IDLIST_IG_BUSINESS_AUTHOR_PPR_ORGANIC_ENGAGEMENT_UNIFORM_RIDS/w_EmbeddingFusedUint4Quantization"
  shape_is_final: true
}

Reviewed By: yinghai, khabinov

Differential Revision: D26763978

fbshipit-source-id: b9c0d6ca4a2b0e4d50d34e08f724e99ad705196b
2021-03-02 20:59:27 -08:00
5c1c8cb93b [caffe2] Fix shape inference for pruning ops (#53082)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53082

Reviewed By: yinghai, khabinov

Differential Revision: D26742532

fbshipit-source-id: 6cdfb293541b601f7916a95e08bf573876c9ca74
2021-03-02 20:57:45 -08:00
0dac7d86ca blas copy and axpy to aten (#52345)
Summary:
Fixes #{issue number}

Follow-up PR: https://github.com/pytorch/pytorch/pull/50984

`copy` and `axpy` functions are ported to ATen. `THBlas_axpy` and `THBlas_copy` are removed.

Looking forward to your comments. cc ngimel, mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52345

Reviewed By: zou3519

Differential Revision: D26756533

Pulled By: ngimel

fbshipit-source-id: 97649485eeb6b361d6434c4701539b5abba4a17d
2021-03-02 20:50:57 -08:00
565d8235e5 [nnc] Test cases for uneven split + reorder (#53091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53091

Split with tail followed by reorder causes a segfault in NNC. Split with
mask followed by reorder generates invalid code that writes out of bounds.
ghstack-source-id: 122870733

Test Plan: LoopNest.ColReduceSplit*

Reviewed By: navahgar

Differential Revision: D26746254

fbshipit-source-id: f8a0de18531b34d2bf06ccaa35d9c98b81b5c600
2021-03-02 20:36:48 -08:00
d8730194e7 use device methods (#52899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52899

Reviewed By: zou3519

Differential Revision: D26752203

Pulled By: albanD

fbshipit-source-id: eaef89377999b20655fe85d5a38ca7a2c5882de7
2021-03-02 20:14:23 -08:00
aba33b0042 [TensorExpr] IRVerifier: add index verifier for Store. (#53137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53137

Also, add casting to Int for Load and Store indices.

Fixes #52773.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26760256

Pulled By: ZolotukhinM

fbshipit-source-id: a2d3141b17584724a5feabcabec25d0577b83a30
2021-03-02 19:56:28 -08:00
0f7f600e01 Fix constexpr __host__ warning (#52702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52702

Fixes:
```
stderr: caffe2/c10/util/MathConstants.h(22): warning: calling a constexpr __host__ function("from_bits") from a __host__ __device__ function("pi") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26589533

fbshipit-source-id: 42c4b36b0ba1e08cbdc9a122fedf35610483c764
2021-03-02 19:44:08 -08:00
3ac9013235 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface has only `gels` implemented but only for overdetermined systems.
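
A hedged usage sketch (driver names per the list above; output fields may differ across versions):

```
import torch

A = torch.randn(5, 3, dtype=torch.float64)
b = torch.randn(5, 1, dtype=torch.float64)
result = torch.linalg.lstsq(A, b, driver='gelsd')  # CPU-only driver choice
print(result.solution.shape)  # torch.Size([3, 1])
```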

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: H-Huang

Differential Revision: D26723384

Pulled By: mruberry

fbshipit-source-id: c9866a95f14091955cf42de22f4ac9e2da009713
2021-03-02 19:00:07 -08:00
c0b31a5ba7 [StaticRuntime] Clean up (#53096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53096

- auto[&] -> const auto[&]
- clean up size() calls

Test Plan:
```
buck test //caffe2/torch/fb/sparsenn:test
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26747001

fbshipit-source-id: 6ec81310747d86f7c5d2d17202eef7e299ef610c
2021-03-02 18:51:09 -08:00
870bac13bc Fixed out= variant of linalg.inv (#51977)
Summary:
This PR modifies the behavior of the `linalg_inv_out` variant to match the description here https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".
It's allowed to pass out tensors with complex dtypes for float inputs.
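
An illustrative sketch, analogous to the linalg.solve out= change:

```
import torch

a = torch.randn(3, 3, dtype=torch.float32)
out = torch.empty(3, 3, dtype=torch.complex64)  # complex out for float input is allowed
torch.linalg.inv(a, out=out)
```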

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51977

Reviewed By: H-Huang

Differential Revision: D26725718

Pulled By: mruberry

fbshipit-source-id: 2acc2a311328268706ce27ce060fc88fc7416753
2021-03-02 18:45:29 -08:00
fd582af06c enable coverage test for dataloader on Windows (#52550)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50661
Under coverage, the class qualified name is `'SimpleCustomBatch': <class '__mp_main__.SimpleCustomBatch'>`.

Under pytest, the class qualified name is `'SimpleCustomBatch': <class 'test_dataloader.SimpleCustomBatch'>`.

So the class was moved to a separate file.

![image](https://user-images.githubusercontent.com/16190118/108611869-d6b51f80-741d-11eb-908e-be7a64da916d.png)

Per malfet's suggestion, `__import__` is used to avoid adding a new file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52550

Reviewed By: walterddr

Differential Revision: D26754023

Pulled By: malfet

fbshipit-source-id: 34b0fbe7336b9303cedc28ec6116ab752a2d3630
2021-03-02 18:40:47 -08:00
e86476f736 Huber loss (#50553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48595.

## Background

This PR implements HuberLoss, which differs from SmoothL1Loss by a factor of beta. The current implementation does not share logic between the two. Feedback is welcome for the optimal way to minimize code duplication while remaining performant.

I've done some early [benchmarking](https://pytorch.org/tutorials/recipes/recipes/benchmark.html#collecting-instruction-counts-with-callgrind) with Huber calling in to the Smooth L1 kernel and scaling afterwards; for the simple test case I used, instruction counts are as follows:
```
Huber loss calls dedicated Huber kernel: 2,795,300
Huber loss calls Smooth L1 kernel and scales afterwards: 4,523,612
```
With these numbers, instruction counts are ~62% higher when using the pre-existing Smooth L1 kernel.
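
A brief usage sketch of the new module (`delta` controls the quadratic-to-linear transition point):

```
import torch
import torch.nn as nn

loss_fn = nn.HuberLoss(delta=1.5)
input = torch.randn(4, requires_grad=True)
target = torch.randn(4)
loss = loss_fn(input, target)
loss.backward()
```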

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50553

Test Plan:
```
python test/test_nn.py TestNN.test_HuberLoss
python test/test_nn.py TestNN.test_HuberLoss_delta
python test/test_nn.py TestNN.test_huber_loss_invalid_delta
python test/test_nn.py TestNNDeviceTypeCPU.test_smooth_l1_loss_vs_huber_loss_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_smooth_l1_loss_vs_huber_loss_cuda
python test/test_nn.py TestNNDeviceTypeCPU.test_invalid_reduction_strings_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_invalid_reduction_strings_cuda
python test/test_nn.py TestNN.test_loss_equal_input_target_shape
python test/test_nn.py TestNN.test_pointwise_loss_broadcast
python test/test_overrides.py
python test/test_jit.py TestJitGeneratedFunctional.test_nn_huber_loss
python test/test_type_hints.py
python test/test_cpp_api_parity.py
build/bin/test_api
```

## Documentation
<img width="677" alt="Screen Shot 2021-01-14 at 4 25 08 PM" src="https://user-images.githubusercontent.com/75754324/104651224-5a445980-5685-11eb-884b-14ea517958c2.png">
<img width="677" alt="Screen Shot 2021-01-14 at 4 24 35 PM" src="https://user-images.githubusercontent.com/75754324/104651190-4e589780-5685-11eb-974d-8c63a89c050e.png">
<img width="661" alt="Screen Shot 2021-01-14 at 4 24 45 PM" src="https://user-images.githubusercontent.com/75754324/104651198-50225b00-5685-11eb-958e-136b36f6f8a8.png">
<img width="869" alt="Screen Shot 2021-01-14 at 4 25 27 PM" src="https://user-images.githubusercontent.com/75754324/104651208-53b5e200-5685-11eb-9fe4-5ff433aa13c5.png">
<img width="862" alt="Screen Shot 2021-01-14 at 4 25 48 PM" src="https://user-images.githubusercontent.com/75754324/104651209-53b5e200-5685-11eb-8051-b0cfddcb07d3.png">

Reviewed By: H-Huang

Differential Revision: D26734071

Pulled By: jbschlosser

fbshipit-source-id: c98c1b5f32a16f7a2a4e04bdce678080eceed5d5
2021-03-02 17:30:45 -08:00
2c8f9aec64 avoid TLS in has_names (#53003)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53003

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26719724

Pulled By: bhosmer

fbshipit-source-id: b575e2cec6509e287ed216d9926bbf1108eb7636
2021-03-02 17:19:06 -08:00
e2ecfb60a6 FIX Validates target in cosine_embedding (#53110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53030

This PR validates the target for `cosine_embedding_loss`. This is consistent with how `cross_entropy` handles non 1d targets:

```py
import torch
import torch.nn.functional as F

input = torch.randn(3, 5, requires_grad=True)
target = torch.randint(5, (3, 1))

# Raises RuntimeError
loss = F.cross_entropy(input, target)
```
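
And the analogous case this PR now catches (illustrative):

```py
import torch
import torch.nn.functional as F

input1 = torch.randn(3, 5)
input2 = torch.randn(3, 5)
target = torch.ones(3, 1)  # non-1D target

# With this PR, this also raises a RuntimeError
loss = F.cosine_embedding_loss(input1, input2, target)
```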

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53110

Reviewed By: VitalyFedyunin

Differential Revision: D26766579

Pulled By: jbschlosser

fbshipit-source-id: 73ad559ff9376543b6528a36af094e82eb6f9735
2021-03-02 16:50:44 -08:00
593b0fbade Revert D26720919: [Gradient Compression] Remove some low-level methods of GradBucket class
Test Plan: revert-hammer

Differential Revision:
D26720919 (521e1e83ea)

Original commit changeset: 46fb64230087

fbshipit-source-id: e2b68892d1735b7249b4d36f3dff57160c9cbc78
2021-03-02 16:18:39 -08:00
c4c20a5d2d Suppress unsigned comparison warning (#52653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52653

Fixes:
```
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu(105): warning: pointless comparison of unsigned integer with zero
```

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26588918

fbshipit-source-id: b1a72cebbb7dcb516f63c7c8e2526840ed7c85d1
2021-03-02 16:00:55 -08:00
6ab3a8b6f2 Update torch.nn.quantizable.MultiHeadAttention docstring (#53106)
Summary:
Apply the same fix as PR https://github.com/pytorch/pytorch/pull/49950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53106

Reviewed By: zou3519

Differential Revision: D26752234

Pulled By: albanD

fbshipit-source-id: 5c924319b8365da4d3d2ba2206e2586e23e718f0
2021-03-02 15:43:00 -08:00
a3a2150409 Codegen python bindings to access attributes of grad_fn (#52451)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9922

Adds python bindings to *selected* fields that grad_fn saves - we did not add python bindings to certain types such as 'TypeAndSize' and 'TensorGeometry'. All field names are prefixed with `_saved_` so they are easy to discern. User code should not depend on particular saved fields to exist as what grad_fn saves for the backward pass is considered an implementation detail and thus prone to change.

Warning: Not all parameters that are passed in are necessarily stored to be used for the backward pass. What you put in is not necessarily what you get out either. Here we pass `kernel_size=3`, but `b.grad_fn._saved_kernel_size` returns `(3, 3)` instead of 3. It seems to vary case-by-case.

For example:
```
import torch
import torch.nn as nn

model = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=2, padding=1, dilation=1)

a = torch.ones(1, 3, 32, 32, requires_grad=True)
b = model(a)

print("kernel_size: ", b.grad_fn._saved_kernel_size)
print("stride: ", b.grad_fn._saved_stride) # returns tuple: (3, 3)
# print("dilation: ", b.grad_fn._saved_dilation) # dilation is not stored for backward pass
print("padding: ", b.grad_fn._saved_padding)
print("weight: ", b.grad_fn._saved_weight)
```

Sample of generated code:
```
PyObject* THPThnnConv2DBackward_self_getter(THPCppFunction *self, void *_unused) {
  const auto& prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->self_;
  return THPVariable_Wrap(prop.unpack());
}

PyObject* THPThnnConv2DBackward_weight_getter(THPCppFunction *self, void *_unused) {
  const auto& prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->weight_;
  return THPVariable_Wrap(prop.unpack());
}

PyObject* THPThnnConv2DBackward_kernel_size_getter(THPCppFunction *self, void *_unused) {
  auto prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->kernel_size;
  PyObject* tup = PyTuple_New((Py_ssize_t) prop.size());
  for (int i = 0; i < prop.size(); i++) {
    PyTuple_SetItem(tup, (Py_ssize_t) i, PyLong_FromUnsignedLong((uint64_t) prop[i]));
  }
  return tup;
}

PyObject* THPThnnConv2DBackward_stride_getter(THPCppFunction *self, void *_unused) {
  auto prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->stride;
  PyObject* tup = PyTuple_New((Py_ssize_t) prop.size());
  for (int i = 0; i < prop.size(); i++) {
    PyTuple_SetItem(tup, (Py_ssize_t) i, PyLong_FromUnsignedLong((uint64_t) prop[i]));
  }
  return tup;
}

PyObject* THPThnnConv2DBackward_padding_getter(THPCppFunction *self, void *_unused) {
  auto prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->padding;
  PyObject* tup = PyTuple_New((Py_ssize_t) prop.size());
  for (int i = 0; i < prop.size(); i++) {
    PyTuple_SetItem(tup, (Py_ssize_t) i, PyLong_FromUnsignedLong((uint64_t) prop[i]));
  }
  return tup;
}

PyObject* THPThnnConv2DBackward_finput_getter(THPCppFunction *self, void *_unused) {
  const auto& prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->finput_;
  return THPVariable_Wrap(prop.unpack());
}

PyObject* THPThnnConv2DBackward_fgrad_input_getter(THPCppFunction *self, void *_unused) {
  const auto& prop = static_cast<ThnnConv2DBackward*>(self->cdata.get())->fgrad_input_;
  return THPVariable_Wrap(prop.unpack());
}

static struct PyGetSetDef ThnnConv2DBackward_properties[] = {
  THP_FUNCTION_DEFAULT_PROPERTIES,
  {(char*)"_saved_self", (getter)THPThnnConv2DBackward_self_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_weight", (getter)THPThnnConv2DBackward_weight_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_kernel_size", (getter)THPThnnConv2DBackward_kernel_size_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_stride", (getter)THPThnnConv2DBackward_stride_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_padding", (getter)THPThnnConv2DBackward_padding_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_finput", (getter)THPThnnConv2DBackward_finput_getter, nullptr, nullptr, nullptr},
  {(char*)"_saved_fgrad_input", (getter)THPThnnConv2DBackward_fgrad_input_getter, nullptr, nullptr, nullptr},
  {nullptr} /* sentinel */
};

...

void initialize_autogenerated_functions() {
...
  static PyTypeObject ThnnConv2DBackwardClass;
  addClass<ThnnConv2DBackward>(ThnnConv2DBackwardClass, "ThnnConv2DBackward", ThnnConv2DBackward_properties);
...
}
```

Before:
```
void initialize_autogenerated_functions() {
...
  static PyTypeObject ThnnConv2DBackwardClass;
  addClass<ThnnConv2DBackward>(ThnnConv2DBackwardClass, "ThnnConv2DBackward");
...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52451

Reviewed By: H-Huang

Differential Revision: D26692633

Pulled By: soulitzer

fbshipit-source-id: a09b5b8138e4641093aff68c7e9dffdbb96911b8
2021-03-02 15:20:56 -08:00
baed2cfe01 Back out "Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled" (#53127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53127

Original commit changeset: cc9cc4f508af
ghstack-source-id: 122871468

Test Plan: run flake8 on the files locally

Reviewed By: malfet, janeyx99

Differential Revision: D26757859

fbshipit-source-id: 7e7bde5c1f2b434442079656e2186b500d53fdc2
2021-03-02 14:46:56 -08:00
521e1e83ea [Gradient Compression] Remove some low-level methods of GradBucket class (#53098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098

Remove some low-level methods that are no longer needed since `get_per_parameter_tensors` method is added to `GradBucket` class.

Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122723683

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720919

fbshipit-source-id: 46fb6423008792e72d7a1dd68930a31e0724c92c
2021-03-02 14:39:19 -08:00
b05dd931ee [Gradient Compression] Add is_the_last_bucket_to_allreduce method to GradBucket class (#53010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010

To determine the boundary between different iterations in a DDP communication hook, the user code currently needs to check `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hooks.

Create an API to hide the details and improve the usability before publishing GradBucket APIs.
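
A hedged sketch of a hook using the new method instead of the index check (hook structure heavily simplified; `state.iteration` is hypothetical bookkeeping):

```
def example_hook(state, bucket):
    # before: iteration boundary detected via internals,
    #   if bucket.get_index() == 0: ...
    # after: ask the bucket directly
    if bucket.is_the_last_bucket_to_allreduce():
        state.iteration += 1
    ...
```
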
ghstack-source-id: 122723081

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D26720813

fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
2021-03-02 14:39:12 -08:00
4997c38a15 [Gradient Compression] Don't provide default values in GradBucket constructor (#53102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53102

In `GradBucket` constructor, `offsets`, `lengths`, and `sizes_vec` are optional arguments and could possibly be empty. It will be safe to remove the default values.
ghstack-source-id: 122833603

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26748199

fbshipit-source-id: 2e3bcd1b732851919a64bbbd20fe85e77a616fe3
2021-03-02 14:39:07 -08:00
ecb5ac90ed [Gradient Compression] Add get_per_parameter_tensors method to GradBucket class (#53009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009

It can be a common operation to apply layer-wise operations over per-parameter tensors in a DDP communication hook.

Create a util method in GradBucket class before publishing GradBucket APIs.
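
A short sketch of the intended usage inside a communication hook (the layer-wise operation shown is hypothetical):

```
def example_hook(state, bucket):
    for tensor in bucket.get_per_parameter_tensors():
        tensor.mul_(0.5)  # hypothetical layer-wise operation
    ...
```
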
ghstack-source-id: 122833594

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

f254364097

Reviewed By: rohan-varma

Differential Revision: D26717893

fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
2021-03-02 14:39:03 -08:00
ab7f6f3f5b Add default arguments to cuda stream and events (#53025)
Summary:
* **https://github.com/pytorch/pytorch/issues/53025 Add default args for CUDA stream and events**

Tests:
=====
python test/test_jit.py -v TestCUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53025

Reviewed By: H-Huang

Differential Revision: D26734499

Pulled By: nikithamalgifb

fbshipit-source-id: 5311623a501e2e6fb3fc70e39522e3970e401feb
2021-03-02 14:37:24 -08:00
2444b4d122 Add wait_for_worker param to TCPStore and fix port in use flaky test failures (#52888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52888

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26678707

Pulled By: H-Huang

fbshipit-source-id: 5662e60c4d06d88d2e57834f496b52fb7600de29
2021-03-02 14:31:33 -08:00
41765d4681 Store coverage files as artifacts for better debugging (#53126)
Summary:
Helps with https://github.com/pytorch/pytorch/issues/44120 by storing coverage as artifacts to be investigated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53126

Reviewed By: walterddr

Differential Revision: D26757702

Pulled By: janeyx99

fbshipit-source-id: f7db2b3f51b9ee1a95178bdbd4b1c453078d2ba7
2021-03-02 14:24:36 -08:00
d697090260 Add a note in DDP doc to point to ZeroRedundancyOptimizer (#53113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53113

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26752339

Pulled By: mrshenli

fbshipit-source-id: 7a082f1007bc550eabb82b559d020bbe717fa497
2021-03-02 14:18:06 -08:00
29034b9487 [Reland] Update and expose ZeroRedundancyOptimizer docs (#53112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53112

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26752289

Pulled By: mrshenli

fbshipit-source-id: 897257417b530e6e18788cb40c44e5cb7ac688d5
2021-03-02 14:16:12 -08:00
66b20bb738 [CUDA graphs] [JIT] improves readability and nvfuser convenience for graph-safe cuda RNG (#51580)
Summary:
I'm trying to make jitted RNG graph-safe in csarofeen's nvfuser branch. Doing so requires diffs in files outside torch/csrc/jit, and we'd like these to go upstream through the present simple separate PR (instead of needing to be reviewed as part of Christian's branch's eventual merge, which will be massive).

From the perspective of eager mode consumers, diffs here are purely cosmetic. I moved raw definitions of `PhiloxCudaState` and `at::cuda::philox::unpack` to standalone headers the codegen can easily copy from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51580

Reviewed By: malfet

Differential Revision: D26626972

Pulled By: ngimel

fbshipit-source-id: 7f04d6c5ffe0af7a8a66d3ae6ed36191d12f7d67
2021-03-02 14:13:12 -08:00
37bf6c134b Register DefaultBackend implementations for functional/inplace structured operators (#53037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53037

As remarked in #52277 it is easy to give an (inefficient, due to extra
redispatches) DefaultBackend implementation of foo and foo_ in terms of
foo_out.  This patch enables code generation for DefaultBackend in these
cases by default for all structured kernels.  You can see the payoff
in MSNPU extension: it only has to register a kernel for add.out, and it
gets add and add_ kernels automatically.

The actual code changes are very modest:
- When DefaultBackend, call the dispatched (not direct native::)
  functions to allocate tensors, change device guard, etc
- Don't call impl() for DefaultBackend (as it doesn't exist); instead,
  directly generate a call to at::foo_out to do the actual work.
- Do NOT generate DefaultBackend implementation for foo_out.  Actually,
  there is a case to be made for this being a good idea with more infra;
  see comments inside.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26731225

Pulled By: ezyang

fbshipit-source-id: 939da7cb69f694722ec293e5e42e74a755dd0985
2021-03-02 14:13:08 -08:00
c5a67f1675 Fix minor inaccuracy in translate error reporting (#53032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53032

Previously, you could get this error message:

```
Failed to synthesize the expression "Tensor & out".
When I failed, the following bindings were available in the context:

  const Tensor & self;
  const Tensor & other;
  Scalar alpha;
  const Tensor & op.outputs_[0];
```

There's a problem with this error message: it doesn't seem like there
is any 'out' argument available, but actually there is: it is the last
binding in the context.  We printed the *expression*, not
the *ctype name*.

After this patch, the context now prints as:

```
  const Tensor & self; // self
  const Tensor & other; // other
  Scalar alpha; // alpha
  const Tensor & out; // op.outputs_[0]
```

Now it becomes clear that it's a const mismatch.  Maybe we could also
beef up the error message so it points out near misses, but I'll leave
that to future work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26729768

Pulled By: ezyang

fbshipit-source-id: adb363551a7145eac788943c20969c86b1f8a81b
2021-03-02 14:11:28 -08:00
fbf2883d35 Revert D26733731: [pytorch][PR] Skip dispatch for is_floating_point
Test Plan: revert-hammer

Differential Revision:
D26733731 (4fb82a8808)

Original commit changeset: 87398d3b7583

fbshipit-source-id: 9eac2b63c72c7d3da43e6a2fe1610549f5c13b70
2021-03-02 13:21:21 -08:00
890e051047 Clang-format quantization_hooks.py (#53100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53100

ghstack-source-id: 122723751

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26721146

fbshipit-source-id: 985057fc02c997124b676854eb0a55e569971a3f
2021-03-02 12:48:43 -08:00
cb1596a193 [operator_benchmark] Added channels last 3d option to interpolate test (#53117)
Summary:
Description:

- Added channels last 3d option to interpolate test
  - split the non-4d config into two: 3d and 5d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53117

Reviewed By: NicolasHug

Differential Revision: D26754243

Pulled By: fmassa

fbshipit-source-id: 49bbab3bb47de27790e39537d0fbeca0f01782c4
2021-03-02 11:54:45 -08:00
62d1cdd725 Automated submodule update: tensorpipe (#53012)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: f73bcd9dfa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53012

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26722108

fbshipit-source-id: ea6fa719c8fb666818a0e91da8d4f2edcc88fc49
2021-03-02 11:49:09 -08:00
2d7119f943 Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled
Test Plan: revert-hammer

Differential Revision:
D26753571 (fbf9745c85)

Original commit changeset: 2bda03bab39f

fbshipit-source-id: cc9cc4f508af122b0fdec7f8475343bd9badb9db
2021-03-02 11:11:31 -08:00
73a57246d9 disable dill extension behavior (#53118)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53118

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26754878

Pulled By: suo

fbshipit-source-id: e088d1dc841633bfc0902e3d19f151892ac5c38c
2021-03-02 11:07:08 -08:00
43f810fa96 Add streams boundary check to `torch::cuda::scatter` (#53057)
Summary:
Accessing elements of `std::vector` outside of its boundaries can lead to crashes/memory corruptions

Fixes https://github.com/pytorch/pytorch/issues/52526

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53057

Reviewed By: janeyx99

Differential Revision: D26736829

Pulled By: malfet

fbshipit-source-id: 7aa13c53c8d062adfef082153809a7a724a74ee5
2021-03-02 10:58:10 -08:00
e5e54ada61 fix logcumsumexp functor to properly handle infs and nans (#52947)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52213
NaNs were previously propagated inconsistently because std::min always
returns its first argument if one of the arguments is NaN.
When the reduction functor was called on two `-inf` arguments,
`std::min(x,y) - std::max(x,y)` evaluated to `-inf - (-inf)` = NaN, even
though logcumsumexp is well defined for a `-inf, -inf` pair.
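For a quick sanity check of the fixed behavior (illustrative, not from the PR): since log(exp(-inf) + exp(-inf)) = -inf, the running result should stay -inf rather than turn into NaN.

```python
import torch

# Two leading -inf entries: the cumulative log-sum-exp should remain -inf
# (not NaN) until a finite value enters the running sum.
x = torch.tensor([float("-inf"), float("-inf"), 0.0])
print(torch.logcumsumexp(x, dim=0))  # expected: tensor([-inf, -inf, 0.])
```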

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52947

Reviewed By: H-Huang

Differential Revision: D26718456

Pulled By: ngimel

fbshipit-source-id: a44433889da352cc959786dd15b6361a68fcfed7
2021-03-02 10:58:01 -08:00
d8ef3a4793 [ROCm] Enable test cases in test_nn.py for ROCm (#52836)
Summary:
Enabling tests in test_nn.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52836

Reviewed By: H-Huang

Differential Revision: D26725891

Pulled By: mruberry

fbshipit-source-id: 59655a2515ddce92ffc4c55dcf6f28257c05e3c9
2021-03-02 10:56:07 -08:00
2bf079d060 Remove useless test_reference_numerics skip infos (#52890)
Summary:
These are no longer useful. Let's wait for a few days before merging this, just in case somebody finds failures in them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52890

Reviewed By: H-Huang

Differential Revision: D26725500

Pulled By: mruberry

fbshipit-source-id: 3ebc18ee11ebef34451e60861414521730742288
2021-03-02 10:49:21 -08:00
fbf9745c85 add submodules to sys.modules so their attributes can be pickled (#53107)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38137

As mentioned in the issue, this is a workaround for [python issue 43367](https://bugs.python.org/issue43367). There are a number of other places where `sys.modules` is modified, if something changes in python perhaps those should be reviewed as well.
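As a minimal sketch of the workaround (module and function names here are hypothetical, not from the PR): pickle resolves a function through its `__module__`, so a dynamically created submodule must be registered in `sys.modules` before its attributes can be pickled.

```python
import pickle
import sys
import types

parent = types.ModuleType("fake_pkg")
child = types.ModuleType("fake_pkg.sub")

def helper():
    return 42

# Make the function look like it lives in the dynamic submodule.
helper.__module__ = "fake_pkg.sub"
child.helper = helper
parent.sub = child

sys.modules["fake_pkg"] = parent
sys.modules["fake_pkg.sub"] = child  # without this line, pickling fails

pickle.dumps(helper)  # succeeds: pickle can now import "fake_pkg.sub"
```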

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53107

Reviewed By: zou3519

Differential Revision: D26753571

Pulled By: ezyang

fbshipit-source-id: 2bda03bab39ff9ca58ce4bc13befe021da91b9c4
2021-03-02 10:47:21 -08:00
aa603cb2ce add OpInfo entry for signbit (#52198)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52198

Reviewed By: H-Huang

Differential Revision: D26727598

Pulled By: mruberry

fbshipit-source-id: 282350febbd0b1af73320f0e912bf553d386d4b0
2021-03-02 10:38:34 -08:00
4fb82a8808 Skip dispatch for is_floating_point (#52998)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52998

Reviewed By: H-Huang

Differential Revision: D26733731

Pulled By: iramazanli

fbshipit-source-id: 87398d3b7583632ca18e906fc997e939c73a57e3
2021-03-02 10:30:49 -08:00
d4e64dad15 [static runtime] Register both TupleConstruct and ListConstruct as out variants (#52684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684

With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations.  `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations.  Ideally every node will run out of place every time.  As far as I know, there's never a reason to disable it, and we may want to remove that functionality.

This diff does introduce a "leak" in the memory management - containers are not cleaned up.  This only happens when out variants are enabled.

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled

Reviewed By: maratsubkhankulov, hlu1

Differential Revision: D26515801

fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
2021-03-02 09:55:25 -08:00
2d67b76fa6 [static runtime] Add Alias analysis to Memory Management/Planning (#50060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060

Aliasing is currently mishandled in SR.

This diff fixes that issue entirely and allows us to avoid hard coded "view" registration.  I'll remove the macro in a follow up diff.

However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias."  Some care will need to be taken to actually make sure this is enforced going forward.

Benchmarks for this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04

$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: eellison

Differential Revision: D25581156

fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
2021-03-02 09:53:32 -08:00
b22df26361 Explicitly export submodules and variables from torch module (#52339)
Summary:
For https://github.com/pytorch/pytorch/issues/47027.

Some progress has been made in https://github.com/pytorch/pytorch/issues/50665, but in my testing, trying to unwrap the circular dependencies is turning into a never-ending quest.

This PR explicitly exports things in the top-level torch module without any semantic effect, in accordance with this py.typed library guidance: https://github.com/microsoft/pyright/blob/master/docs/typed-libraries.md#library-interface

It may be possible to do some of the other fixes just using `__all__` where needed, but `__all__` has a semantic effect I would like to further review. This PR at least fixes simple completions like `torch.nn` in Pylance/pyright.
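For illustration, the explicit re-export convention from the pyright guidance looks like this (a sketch, not the actual diff):

```python
# In a py.typed package, aliasing a name to itself on import marks it as
# part of the public interface, so type checkers surface it in completions.
from torch import nn as nn
from torch import optim as optim
from torch import Tensor as Tensor
```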

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52339

Reviewed By: smessmer

Differential Revision: D26694909

Pulled By: malfet

fbshipit-source-id: 99f2c6d0bf972afd4036df988e3acae857dde3e1
2021-03-02 09:03:08 -08:00
048e3917f9 Add duplicate scheduled-ci to allow for debugging (#53109)
Summary:
This should trigger the 11.2 and 9.2 tests on ci-all and release branch pushes so that debugging can happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53109

Reviewed By: yns88

Differential Revision: D26752151

Pulled By: janeyx99

fbshipit-source-id: 3272038cc97560896ee3e9f5bc461212806c71e2
2021-03-02 08:37:53 -08:00
28f87bb734 Don't run cpp tests a second time in the sharded ort_test2 job (#53067)
Summary:
Currently, the same C++ tests are run twice in CI: once in the onnx_ort_test1 job and again in the onnx_ort_test2 job. This PR runs them only once, in the test1 job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53067

Reviewed By: walterddr

Differential Revision: D26739857

Pulled By: janeyx99

fbshipit-source-id: 8960ad5c70181b8154a230914167286f1d9b64f6
2021-03-02 07:46:18 -08:00
09ce9b5877 Store test file in S3 as well for every TestSuite (#52869)
Summary:
We want to store the file names that trigger each test suite so that we can use this data for categorizing those test files.

~~After considering several solutions, this one is the most backwards compatible, and the current test cases in test_testing.py for print test stats don't break.~~

The previous plan did not work, as there are multiple Python test jobs that spawn the same suites. Instead, the new S3 format will store test files (e.g., `test_nn` and `distributed/test_distributed_fork`) which will contain the suites they spawn, which will contain the test cases run within the suite. (Currently, there is no top layer of test files.)

Because of this major structural change, a lot of changes have now been made (thank you samestep!) to test_history.py and print_test_stats.py to make this new format backwards compatible.

Old test plan:
Make sure that the data is as expected in S3 after https://github.com/pytorch/pytorch/pull/52873 finishes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52869

Test Plan: Added tests to test_testing.py which pass, and CI.

Reviewed By: samestep

Differential Revision: D26672561

Pulled By: janeyx99

fbshipit-source-id: f46b91e16c1d9de5e0cb9bfa648b6448d979257e
2021-03-02 07:36:00 -08:00
931100f829 Revert D26696938: Update and expose ZeroRedundancyOptimizer docs
Test Plan: revert-hammer

Differential Revision:
D26696938 (a586c02962)

Original commit changeset: dafb00e5c9f0

fbshipit-source-id: b08604d2009f4df7b620699dd6659dfed2b02792
2021-03-02 07:14:23 -08:00
46bd76fdec [quant][graphmode][fx][fp16] Add fp16 support for silu (#52865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52865

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_silu

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26672270

fbshipit-source-id: a6a6ab58c347a56f0ded612b2e0a3e2230a91d9e
2021-03-02 02:11:29 -08:00
267aeb8a56 [quant][graphmode][fx][fp16] Add fp16 support for tanh (#52864)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52864

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_fixed_qparams_ops_fp16

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26672271

fbshipit-source-id: 539017c3045a28fc95f4f9d32591c2b2d10af6c0
2021-03-02 02:11:25 -08:00
d40b501cfc [quant][graphmode][fx][fp16] Add fp16 support for sigmoid (#52863)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52863

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_fixed_qparams_ops_fp16

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26672273

fbshipit-source-id: 30d5befe2a24081ac12ac773df4d2bd26d2d0192
2021-03-02 02:11:21 -08:00
3fb324f05b [quant][graphmode][fx][fp16] Add fp16 support for layer_norm (#52862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52862

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_layer_norm

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26672272

fbshipit-source-id: 4cfdce986efa98db7dc58bf2a62b650e45a69ed0
2021-03-02 02:11:17 -08:00
fc6fdade9f [quant][graphmode][fx][fp16] Add fp16 support for torch.sum (#52811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52811

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_sum

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26655619

fbshipit-source-id: 642e0de47d0da7bd1abe1e981819de33e84c32f3
2021-03-02 02:11:13 -08:00
97c51d5d5d [quant][graphmode][fx][fp16] Add fp16 support for div (#52810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52810

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_div

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26655620

fbshipit-source-id: e46cb895ba456e99e4433bd6037229b8248a1b28
2021-03-02 02:11:08 -08:00
a6af93e921 [quant][graphmode][fx][fp16] Add fp16 support for sub (#52809)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52809

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_sub

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26655618

fbshipit-source-id: b47966ee1b75a2f814b9019d8d16b2da2212f5da
2021-03-02 02:09:07 -08:00
d382693263 [NNC] Build aggregate stmt for kernel before LoopNest. (#53024)
Summary:
This PR builds an aggregate stmt for all the tensors in the kernel before constructing LoopNest. This migrates to using the LoopNest constructor that takes in a stmt and output buffers. This is one step closer to eliminating the dependency of LoopNest on Tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53024

Reviewed By: H-Huang

Differential Revision: D26729221

Pulled By: navahgar

fbshipit-source-id: 43e972585351f6902c14b383b137aaaee3aaa3e1
2021-03-02 00:51:56 -08:00
f448c59a57 Fix jit.trace mis-handling of InterfaceType (#53052)
Summary:
`jit.trace` recursively gathers all named attributes in a module at the
beginning of tracing. This is fine in a pure-tracing environment, but
breaks when a scripted module containing an InterfaceType'd submodule
is involved: because an InterfaceType, by design, is not allowed to
have any attributes, some of the gathered attributes turn into fatal
errors in subsequent graph rewrite passes.

This PR fixes the bug by distinguishing InterfaceType'd submodules from
normal ClassType'd submodules.
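A minimal sketch of the failing pattern (class names illustrative), assuming the usual module-interface idiom:

```python
import torch

@torch.jit.interface
class ModuleInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass

class Impl(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class Scripted(torch.nn.Module):
    proxy: ModuleInterface  # InterfaceType'd submodule: no attributes allowed

    def __init__(self):
        super().__init__()
        self.proxy = Impl()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proxy.forward(x)

class Traced(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = torch.jit.script(Scripted())

    def forward(self, x):
        return self.inner(x)

# Tracing gathers named attributes of `inner`, including the
# interface-typed `proxy`, which previously tripped up later graph
# rewrite passes.
traced = torch.jit.trace(Traced(), torch.randn(2))
```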

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53052

Reviewed By: wanchaol

Differential Revision: D26735566

Pulled By: gmagogsfm

fbshipit-source-id: a14aee6f1fe8000f80c2dc60bdf19acee6225090
2021-03-02 00:40:19 -08:00
aae188c529 [NNC] Handle non literal constant bounds in Unroll. (#53029)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53000

Also added a test to confirm this case works in FlattenLoop as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53029

Reviewed By: bertmaher

Differential Revision: D26742705

Pulled By: navahgar

fbshipit-source-id: d87a0f9698411026b5b6e55eee7c2b9fb123d06b
2021-03-02 00:35:27 -08:00
748285ccd7 [complex] add autograd support for torch.polar (#52488)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152
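`torch.polar(abs, angle)` constructs `abs * (cos(angle) + i*sin(angle))`; with this PR gradients flow back to both real-valued inputs. A small hedged check (not from the PR):

```python
import torch

mag = torch.tensor([2.0], requires_grad=True)
ang = torch.tensor([0.5], requires_grad=True)
z = torch.polar(mag, ang)    # complex output
z.real.sum().backward()      # real part = mag * cos(ang)
print(mag.grad)              # cos(0.5)
print(ang.grad)              # -2 * sin(0.5)
```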

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52488

Reviewed By: zou3519

Differential Revision: D26711841

Pulled By: anjali411

fbshipit-source-id: b8538fb8cb44456b832e4f993cf41954b3ddd2e8
2021-03-01 21:57:35 -08:00
87b6702833 [distributed] make the pickler in distributed_c10d pluggable (#53060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060

As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
come from a torch.package.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D26737317

Pulled By: suo

fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
2021-03-01 21:37:48 -08:00
ac122a5a6d [package] catch exceptions from calling reduce function. (#53061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53061

We only care about evaluating the variant of `reduce()` that returns a
string. If `reduce()` throws an error, we should just continue on with
pickling.
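For context, the string-returning form of `__reduce__` tells pickle to serialize the object by name and look it up in its module on load. A self-contained sketch (names hypothetical):

```python
import pickle

class _Sentinel:
    def __reduce__(self):
        # Returning a string makes pickle store a by-name reference
        # instead of serializing the object's state.
        return "SENTINEL"

SENTINEL = _Sentinel()

restored = pickle.loads(pickle.dumps(SENTINEL))
assert restored is SENTINEL  # resolved by name, so identity is preserved
```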

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26737652

Pulled By: suo

fbshipit-source-id: 0b6fbbe345ad0b6a33330b2efa39d7bab703193d
2021-03-01 21:27:08 -08:00
506f756a0a Include max pool in fusion groups (#52613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52613

Including MaxPool as part of the MKLDNN fusion group sped up resnet18 by ~20%, and was a win on other models I tested as well. I will post more complete benchmarks.

As mentioned in the diff, in some cases MaxPool can be slower than aten - ideally we'd only include maxpool if it decreased the number of layout transformations that occur. That hasn't actually mattered for any of the torchvision models, so I don't think it's necessary for this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696704

Pulled By: eellison

fbshipit-source-id: 61a025dbf5e7591c0a0f75def3beb439a138a21e
2021-03-01 21:22:46 -08:00
6149a26adb Extend subgraph utils to cover merging a node following a subgraph (#52513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52513

Subgraph Utils previously only supported merging a node into a subgraph if the node came before the subgraph; this extends the logic to the case where the subgraph comes first.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696697

Pulled By: eellison

fbshipit-source-id: b0595b7d400161b0972321c55718b67103c7bbcd
2021-03-01 21:22:43 -08:00
dbbe21dfd7 Remove unused subgraph vmap api (#52512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52512

This API is not used at all, and is tricky to maintain. When we were using it last, we ran into lifetime issues when using `Value *` as the key. In hindsight, we should have been using `value->unique()`, but regardless, this is not being used and should be removed.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696695

Pulled By: eellison

fbshipit-source-id: 97ed92e88ecab0085fabbac46573611666bf2420
2021-03-01 21:22:39 -08:00
b1284cfbfb Only functionalize ops which we want to include in mkldnn group (#51924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51924

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696705

Pulled By: eellison

fbshipit-source-id: df2a780f6316d66f0d6ae99bbb54d044947195e5
2021-03-01 21:22:36 -08:00
9a990dafd9 Add a filter to remove mutation (#51923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51923

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696700

Pulled By: eellison

fbshipit-source-id: 9665e9b786f55b6e5b98420eae19de262d46bb96
2021-03-01 21:22:33 -08:00
f41c80c267 Dont error on 0-dim in convolution (#51922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51922

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696701

Pulled By: eellison

fbshipit-source-id: f8b2c19e134931971fac00246920c1584dd43581
2021-03-01 21:22:30 -08:00
42bfda36e1 Add 0-dim support for binary mkldnn ops (#51921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51921

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696696

Pulled By: eellison

fbshipit-source-id: 96ca79c0d6b5ed7c32c14dc4e7c383f2522a85cb
2021-03-01 21:22:26 -08:00
32fed3f375 Handle mkldnn broadcasting in mkldnn fuser (#51736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51736

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696694

Pulled By: eellison

fbshipit-source-id: 473cc64c8d9f775e9d06340437aff2eb6c0619b9
2021-03-01 21:22:23 -08:00
a2f7e929ef Add MKLDNN fuser (#51600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51600

Looking for notes on implementation first, will post more notes on benchmarks and overall thoughts/implementation and solicit more input soon.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696702

Pulled By: eellison

fbshipit-source-id: cd612f093fe3859e42fb0b77560ebd1b44fccff7
2021-03-01 21:22:19 -08:00
43f56e19a6 [NNC] Make NNC sanitize input names (#52786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52786

Previously, NNC did not sanitize input names. I ran into this in the next PR, where making subgraph creation preserve debug names caused a number of NNC CUDA failures. I also previously ran into this with some masked_fill failures internally, which led me to disable the operator.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696699

Pulled By: eellison

fbshipit-source-id: 7c3af4d559d58762fb8332666784a4d5cd6a4167
2021-03-01 21:22:16 -08:00
4b40141d2c Add support for linear in mkldnn fusion (#51484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51484

This PR moves the linear weights of a frozen model to MKLDNN. When the weights are already in MKLDNN, even computing a single linear by converting the input and output from/to MKLDNN provides large speedups. I benchmarked the results of the top 200 shapes in predictor [here](https://www.internalfb.com/phabricator/paste/view/P171537854) (taken from aten::matmul), and verified that it sped up popular models.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696698

Pulled By: eellison

fbshipit-source-id: 53d03b9e6956e11b700ee58214e2266e2aa4106a
2021-03-01 21:22:13 -08:00
bfae3789ba Move conv to mkldnn (#51483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51483

This PR moves the conv weights of a frozen model to MKLDNN and reorders the weights ahead of time. When the weights are already in MKLDNN, even computing a single conv by converting the input and output from/to MKLDNN provides large speedups. I benchmarked the results of the top 200 shapes in predictor [here](https://www.internalfb.com/phabricator/paste/view/P171537938), and verified that it sped up popular models in torchvision.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26696703

Pulled By: eellison

fbshipit-source-id: 0b4441bee4f6e0890a4540fbca3bb5e58b8c5adf
2021-03-01 21:19:27 -08:00
7a60b7dc3e Add support to compare devices (#53045)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53045
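A hedged sketch of what this enables in TorchScript (function name illustrative, not from the PR):

```python
import torch

@torch.jit.script
def on_different_devices(x: torch.Tensor, y: torch.Tensor) -> bool:
    # `!=` (and `==`) on torch.device values now compiles under TorchScript.
    return x.device != y.device
```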

Test Plan:
python test/test_jit.py -k test_device_not_equal

Reviewed By: pbelevich

Differential Revision: D26737964

Pulled By: nikithamalgifb

fbshipit-source-id: 2205aa1f214a86282602168c364dca1363d2f7dd
2021-03-01 21:04:43 -08:00
a586c02962 Update and expose ZeroRedundancyOptimizer docs (#52937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52937

Test Plan: Imported from OSS

Reviewed By: blefaudeux

Differential Revision: D26696938

Pulled By: mrshenli

fbshipit-source-id: dafb00e5c9f0c0c602f471fdcb6416bde74f806b
2021-03-01 20:50:33 -08:00
a176c73ed5 [TensorExpr] Reland: "PyBind: bind ExternalCalls." (#53063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53063

The problem was that a derived class was marked with "py::nodelete",
while the base class wasn't. Now they both are marked correctly.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26737877

Pulled By: ZolotukhinM

fbshipit-source-id: 17d9d430651c8f695fc7b6bf6784e7719e20a4d2
2021-03-01 20:44:10 -08:00
e22da0a5c4 [TensorExpr] Add IRVerifier. (#52901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52901

This PR implements IR Verifier and adds a call to it in `LoopNest`
constructors. Checks that were in expr/stmt constructors before are now
moved to the corresponding `::make` functions or to the verifier. They
didn't really help from the constructors anyway since an exception
thrown from there led to a segfault due to the fact our memory
management works (object was not fully created but was registered in the
kernel arena for destruction anyway).

Fixes #52778.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26682928

Pulled By: ZolotukhinM

fbshipit-source-id: c56524015cdffb1ed8bce4394509961a4071dcfa
2021-03-01 20:38:00 -08:00
3bd779cec6 [rpc] make pickler/unpickler pluggable in RPC (#53050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53050

As title. We would like to use alternative pickler/unpickler
implementations without changing the entire RPCPickler, to make it
possible to send objects over the wire that are coming from a
torch.package

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26734592

Pulled By: suo

fbshipit-source-id: d9d9fa62ee15bfcb00e09192030541b61df8c682
2021-03-01 18:40:56 -08:00
83a93ee145 [package] Pull out _UnpicklerWrapper into PackageUnpickler (#53049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53049

This makes our API symmetric--now we have an `Importer` aware Pickler
and Unpickler implementation that have similar interfaces.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26734593

Pulled By: suo

fbshipit-source-id: 3479437cf6b98e0d6a8aa4907c75f0c61d5495d4
2021-03-01 18:40:52 -08:00
ec128eadea [package] _custom_import_pickler -> _package_pickler (#53048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53048

I am planning the custom pickler and unpicklers that we use as
semi-public interfaces for `torch.rpc` to consume. Some prefatory
movements here.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26734594

Pulled By: suo

fbshipit-source-id: 105ae1161d90f24efc7070a8d80c6ac3d2111bea
2021-03-01 18:38:43 -08:00
272dfc7bb9 Add MANIFEST.in (#52908)
Summary:
- Do not build PyTorch if `setup.py` is called with the 'sdist' option
- Regenerate the bundled license file while the sdist package is being built
- Refactor `check_submodules` out of `build_deps` and check that submodule projects are present during the source-package build stage
- Test that the sdist package is configurable during the `asan-build` step

Fixes https://github.com/pytorch/pytorch/issues/52843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52908

Reviewed By: walterddr

Differential Revision: D26685176

Pulled By: malfet

fbshipit-source-id: 972a40ae36e194c0b4e0fc31c5e1af1e7a815185
2021-03-01 18:28:25 -08:00
b5ae8e69a7 [Lite Interpreter] Support features from to_backend (#52870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52870

Add the missing parts to support to_backend modules by lite interpreter.
1. Add ISINSTANCE instruction support, which is used in to_backend for output type check.
2. Bypass lite interpreter's type parser by checking the qualified name. If it starts with "torch.jit", use the same type resolver as nn module (starting with "__torch__").

Tests:
Mobile module is serialized and loaded in `BackendTest.TestCompiler`. The results are compared to those from the original TorchScript module.

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D26715351

Pulled By: iseeyuan

fbshipit-source-id: ad9d74ee81c6aa692ab9e5dd7a9003bae5d4f01f
2021-03-01 17:56:01 -08:00
8467e5cad3 Remove ci-all and release branches running scheduled tests (#53069)
Summary:
The previous code allowed these tests to run every four hours on certain ci-all branches...which is really bad and resource-intensive. This change removes that, but in turn disallows the 11.2 and 9.2 tests from running on ci-all branches.

To debug CUDA 11.2 or 9.2 tests, one must now manually change the config to allow for them. (Look at https://github.com/pytorch/pytorch/issues/51888 and https://github.com/pytorch/pytorch/issues/51598 for examples of how to do that.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53069

Reviewed By: H-Huang

Differential Revision: D26739738

Pulled By: janeyx99

fbshipit-source-id: 7577b9b2e876bac0e4e868ce2a1f3ffdb6aca597
2021-03-01 17:22:13 -08:00
cfa41cea7e [numpy] torch.logit: promote integer inputs to float (#52028)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
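As a quick illustration of the promotion behavior (a sketch, assuming the same semantics as the other ops in the linked issue):

```python
import torch

x = torch.tensor([0, 1])   # int64 input
print(torch.logit(x))      # promoted to float: tensor([-inf, inf])
```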

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52028

Reviewed By: ngimel

Differential Revision: D26400552

Pulled By: mruberry

fbshipit-source-id: 5aec9c9755a7ae283aa52294517ea28f4b0fd3e7
2021-03-01 17:07:52 -08:00
c7c03dd388 [PyTorch] Fix TORCH_CHECK_INDEX(false, ...) in IndexKernel (#53028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53028

TORCH_CHECK (and variants) wrap the condition in C10_UNLIKELY, so this code is both prettier and better.
ghstack-source-id: 122755165

Test Plan: CI

Reviewed By: malfet

Differential Revision: D26522821

fbshipit-source-id: 70aa11f1859f979657a1f376f7039b5015c69321
2021-03-01 16:54:20 -08:00
07ae4e9309 scripts: Add script to prep wheels for pypi (#53056)
Summary:
Adds a script so that we can take wheels directly from
download.pytorch.org and publish them to pypi

This is currently mainly used to prep windows binaries for publication to PyPI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53056

Reviewed By: H-Huang

Differential Revision: D26738642

Pulled By: seemethere

fbshipit-source-id: 96777ed6c3f3454bddb4bc13121f727074312816
2021-03-01 16:46:44 -08:00
fd4722949d Fix the repeated entry in the Tensor Attributes doc (#52995)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52995

Reviewed By: H-Huang

Differential Revision: D26732911

Pulled By: iramazanli

fbshipit-source-id: 86ab93f7f3540cf16dde02670e05cb56999b4929
2021-03-01 16:42:32 -08:00
e2462745ba Update kineto submodule (#53039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53039

Reviewed By: gdankel

Differential Revision: D26732608

Pulled By: malfet

fbshipit-source-id: 5c7f30d237f238fc69a6d2a18a0aee41a68f6f09
2021-03-01 15:43:57 -08:00
3993fb2bf9 fix(docs): indent in docstring of key_averages (#53006)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53006

Reviewed By: H-Huang

Differential Revision: D26725101

Pulled By: albanD

fbshipit-source-id: 867be12b0ee363a3c0ddcaf8cb4f6354dd4aa901
2021-03-01 15:18:20 -08:00
b3bf08e67f Log nccl debug level in ProcessGroupNCCL (#52803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52803

This is useful for double checking we have the expected nccl_debug
level when debugging problematic jobs.

New logs:

When default is warn:
```
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 60000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
```

off:

```
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: N/A
```
ghstack-source-id: 122751110

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D26653699

fbshipit-source-id: 845cc1236f3838f4763c6dcf2a30d059b3d44f02
2021-03-01 14:57:22 -08:00
ec42c2d89c [pyper] fuse clip_ranges+gather_ranges (#52461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52461

TODO: add tests

Test Plan:
Before:
7.10623 ms/iter
0.0849279 ms.    1.21267%. fb::clip_ranges (212 nodes)
0.254071 ms.    3.62783%. fb::gather_ranges (214 nodes)

After:
7.0654 ms/iter
0.300174 ms.     4.2739%. fb::clip_ranges_gather (264 nodes)

Reviewed By: hlu1

Differential Revision: D26523903

fbshipit-source-id: 9b2420c522232659b198cbe250d4454bbcd9297b
2021-03-01 14:50:39 -08:00
991160ebd9 [quant][graphmode][fx] Add support for fp16 bmm pattern (#52808) (#53021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53021

Add support for producing fp16 bmm pattern

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_bmm

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26725349

fbshipit-source-id: debee718fc33e562aff3f5664757bb52ee85f651
2021-03-01 14:45:55 -08:00
b039dd15ce Delete defunct LegacyTHFunctions templates (#53016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53016

We just checked in the generated files directly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26724876

Pulled By: ezyang

fbshipit-source-id: 887d781cac47b7cf16ba2cd6079c63b8f186fe44
2021-03-01 14:41:45 -08:00
812339ca3d [ZeroRedundancyOptimizer] Buckets as tensor view + minimize public interface (#52987)
Summary:
Updated version following  https://github.com/pytorch/pytorch/issues/52764 (including comments from Shen), but this one I expect to be able to land.
ZeroRedundancyOptimizer:
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52987

Reviewed By: mrshenli

Differential Revision: D26728851

Pulled By: blefaudeux

fbshipit-source-id: f8c745966719c9076c20a554ef56198fb838856c
2021-03-01 14:37:04 -08:00
e10d2f477b Clang-format c10d/init.cpp (#53008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53008

ghstack-source-id: 122722409

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26720507

fbshipit-source-id: e3ddbcd9e430c8261cc5364795e4b55320e05c5c
2021-03-01 14:31:50 -08:00
084839faa6 Clang-format test_c10d.py (#52978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52978

ghstack-source-id: 122701029

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D26713240

fbshipit-source-id: 25301f794a68bee3d6a73d15986a96edab498310
2021-03-01 14:24:36 -08:00
e00e42dbab [reland][quant][graphmode][fx][test][refactor] Refactoring binary op tests to split int8 and float16 tests (#52807) (#53020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53020

Test Plan:
Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26725351

fbshipit-source-id: 35086ab19087501e1c9fdef4f16993ee9f364d0d
2021-03-01 14:06:10 -08:00
a9f7ae5357 [ROCm] Enable test cases in test/test_dataloader.py for ROCm (#52766)
Summary:
Enabling test cases in test_dataloader.py for ROCm because they are passing now.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52766

Reviewed By: H-Huang

Differential Revision: D26706402

Pulled By: ngimel

fbshipit-source-id: 63d4ea6d9b16f6244eb0f0f8f7a957bac8469111
2021-03-01 13:32:35 -08:00
096bea5251 [reland][quant][graphmode][fx][fp16] Add fp16 support for {add|mul}{_relu} (#52714) (#53019)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53019

Test Plan:
python test/test_quantization.py TestQuantizedOps.test_add
python test/test_quantization.py TestQuantizedOps.test_mul
python test/test_quantization.py TestQuantizedOps.test_add_relu
python test/test_quantization.py TestQuantizedOps.test_mul_relu

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26725350

fbshipit-source-id: 2a89f5da6a21908f454f870521d2a4549fdd291e
2021-03-01 13:19:42 -08:00
89b1053413 [DataLoader] Move BufferedShuffle from Dataset to DataPipe (#52141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141

Remove BufferShuffleDataSet, as it's not being used anywhere within PyTorch (no usage on Github based on a search) and it's not included in the release of PyTorch 1.7.1.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26710940

Pulled By: ejguan

fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
2021-03-01 12:54:44 -08:00
f2657d2e4f [ROCm] Enable test cases in test_cuda.py for ROCm (#52739)
Summary:
Enabling four test cases in test_cuda.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52739

Reviewed By: H-Huang

Differential Revision: D26706321

Pulled By: ngimel

fbshipit-source-id: 6907c548c4ac4e387f0eb7c646e8a01f0d036c8a
2021-03-01 12:54:40 -08:00
0a70ec45d1 [ROCm] Enable test cases in autocast_test_lists.py for ROCm (#52737)
Summary:
Enabling test cases in autocast_test_lists.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52737

Reviewed By: H-Huang

Differential Revision: D26706346

Pulled By: ngimel

fbshipit-source-id: c1b3b3d8c0ef2a5b1f7e2bd061a749afbae16590
2021-03-01 12:51:56 -08:00
4daa81e267 Automated submodule update: FBGEMM (#52992)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a431ee37cb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52992

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D26718007

fbshipit-source-id: 7b35ab2012b8b6300a6e78c8425f9e08864a9f68
2021-03-01 12:46:23 -08:00
e36576d153 Probable fix for out of place BinaryOpScalar bad values and/or IMAs on 11.2 (ci-all edition) (#52634)
Summary:
Should close https://github.com/pytorch/pytorch/issues/51992.

ci-all resubmit of https://github.com/pytorch/pytorch/pull/52591. The plot also thickened considerably since then. Every foreach functor, it turns out, has bad `r_args` accesses for certain code paths and instantiations.

Also, I noticed the [`n % kILP == 0`](2680ff7759/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L87)) condition for vectorization in all functors is way too restrictive: it'll refuse to vectorize anything on any tensor whose overall numel is not a multiple of ILP. That's out of scope though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52634

Reviewed By: H-Huang

Differential Revision: D26725991

Pulled By: izdeby

fbshipit-source-id: 4bade0ac186bf85527baddc1c44b2c2b8e3c9777
2021-03-01 12:41:24 -08:00
8870c391e9 Update mkl to 2020.2.254 (#52964)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52964

Reviewed By: H-Huang

Differential Revision: D26726464

Pulled By: seemethere

fbshipit-source-id: 8f3067292e6416e299b4b040c8fb73510134f02e
2021-03-01 11:13:57 -08:00
d4527b4e16 add a full pipeline test for a TypeCheck (#52933)
Summary:
This tests a simple failure mode for a TypeCheck when a shape changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52933

Reviewed By: H-Huang

Differential Revision: D26727583

Pulled By: Krovatkin

fbshipit-source-id: b277218af9572cd6f89f2ece044f7d84d4c10283
2021-03-01 10:58:08 -08:00
7d060735ca Back out "[TensorExpr] PyBind: bind ExternalCalls."
Summary: Original commit changeset: e1ea3b3630d1

Test Plan: Broke tests, e.g. T85754010

Reviewed By: ZolotukhinM

Differential Revision: D26727166

fbshipit-source-id: 42d2090bc55ec2982a4a08c38278c80617d5398a
2021-03-01 10:51:06 -08:00
b22b082cc8 Fixed the error of generator in the RandomSampler. (#52956)
Summary:
In `__iter__` of `RandomSampler`, when `self.replacement` is `False`, the original code always passed `self.generator` to `torch.randperm` instead of the `generator` variable set up earlier in the method.
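For reference, a usage sketch of the generator argument whose handling this fixes (illustrative; the fix ensures `torch.randperm` consistently receives the generator selected at the top of `__iter__`):

```python
import torch
from torch.utils.data import RandomSampler

# With a user-supplied generator, the no-replacement permutation is
# fully determined by the generator's seed.
data = range(10)
perm1 = list(RandomSampler(data, generator=torch.Generator().manual_seed(0)))
perm2 = list(RandomSampler(data, generator=torch.Generator().manual_seed(0)))
assert perm1 == perm2
```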

Fixes https://github.com/pytorch/pytorch/issues/52568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52956

Reviewed By: mruberry

Differential Revision: D26724303

Pulled By: H-Huang

fbshipit-source-id: 86f2795c76f3548e31181fb077af046078a173cb
2021-03-01 10:05:43 -08:00
3403babd94 [doc] Fix documentations of torch functions (#52982)
Summary:
This PR includes multiple small fixes of docstrings.

* Fix documentation for [`torch.atleast_2d`](https://pytorch.org/docs/master/generated/torch.atleast_2d.html) and [`torch.atleast_3d`](https://pytorch.org/docs/master/generated/torch.atleast_3d.html) by adding a new line before `Args::`.
* Fix indentation for [`torch.isfinite`](https://pytorch.org/docs/master/generated/torch.isfinite.html) and [`torch.isinf`](https://pytorch.org/docs/master/generated/torch.isinf.html). The "Arguments", "Parameters" and "Examples" sections need to be at the same level as the first description.
* Insert a new line after `Example::` where it is missing. This makes difference in the way the documentations are rendered: see [this](https://pytorch.org/docs/master/generated/torch.gt.html) (with a new line) and [this](https://pytorch.org/docs/master/generated/torch.triu_indices.html) (without). As the majority of the docs seems to follow the former style, this PR amends the latter cases.
* Fix the "Returns" section of [`torch.block_diag`](https://pytorch.org/docs/master/generated/torch.block_diag.html) and [`torch.cartesian_prod`](https://pytorch.org/docs/master/generated/torch.cartesian_prod.html). The second and the subsequent lines shouldn't be indented, as can be seen in the docstring of [`torch.vander`](https://pytorch.org/docs/master/generated/torch.vander.html).
* Fix variable names in the example of `torch.fft.(i)fftn`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52982

Reviewed By: mruberry

Differential Revision: D26724408

Pulled By: H-Huang

fbshipit-source-id: c65aa0621f7858b05fd16f497caacf6ea8eb33c9
2021-03-01 09:59:57 -08:00
6d29aa5486 Make lambda supported by Map DataPipe (#52856)
Summary:
Pickle lambda functions with the `dill` module. Tests are in `torchdata`.
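A quick sketch of why `dill` is needed here (assuming `dill` is installed):

```python
import dill  # third-party serializer that can handle lambdas

square = lambda x: x * x
payload = dill.dumps(square)         # the stdlib pickle would raise here
assert dill.loads(payload)(4) == 16
```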

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52856

Reviewed By: anjali411

Differential Revision: D26673337

Pulled By: VitalyFedyunin

fbshipit-source-id: c2a1b41b7c4cd824945a016d3c1637eb489da700
2021-03-01 09:22:45 -08:00
66f07c0c12 Optimized bilinear interpolation using TensorIterator (#51653)
Summary:
Related to https://github.com/pytorch/pytorch/issues/10482

Description:

- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

<details>
<summary>
Interpolation 2d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

</details>

<details>
<summary>
Interpolation 1d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

</details>

<details>
<summary>
Interpolation 3d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

</details>

<details>
<summary>
Interpolation 2d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

</details>

<details>
<summary>
Interpolation 1d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

</details>

<details>
<summary>
Interpolation 3d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

</details>

<details>
<summary>
Versions and build configs
</summary>

PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:
```
BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```

PR : 1.9.0a0+74b172b
PR build setting:
```
BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,
```
</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Linear upsampling generic implementation using TensorIterator for Nd case (single loop function for 1d, 2d and 3d cases)
  - can be generalized to nearest, bicubic interpolation modes.
- works for channels first and last cases.
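For reference, one of the benchmarked 2d configurations above corresponds to a call like this (usage sketch, not code from the PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 320, 320)  # contiguous, channels-first input
y = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
print(y.shape)  # torch.Size([1, 3, 256, 256])
```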

Joint work with Francisco Massa (fmassa).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f
2021-03-01 09:13:31 -08:00
0d46926c63 ns for fx: remove subgraphs from user facing API (#52928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52928

Changes the user facing API of `prepare_single_model_output` to
require a list of nodes instead of a list of subgraphs. This ensures
that how we define a subgraph is an implementation detail and is
not exposed to the user, keeping the eng cost of updating this
implementation later low.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26693471

fbshipit-source-id: 67c2feb844556225e36f8d6d4023246939bcb445
2021-03-01 08:56:26 -08:00
87be8c1d7c ns for fx: clean up duplicate code in get_matching_activations_a_shadows_b (#52927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52927

Refactor to use an existing util instead of duplicating code, no logic
change.

Test Plan:
CI

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26693474

fbshipit-source-id: 06b7047eb9a762557b7f679347e424c0dd009aad
2021-03-01 08:56:22 -08:00
5b93cdace1 ns for fx: remove model_name from get_matching_activations API (#52926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52926

Model name is already stored in the Loggers in the prepare call.
Removing the need to specify it again in the extract activations
functions, to simplify things.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26693473

fbshipit-source-id: 52511cacc16f79fa09c78ccde78e7f439f4b315c
2021-03-01 08:56:18 -08:00
907ee5b290 ns for fx: docblock fixes (#52925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52925

Cleans up some incorrect comments and docblocks in
`numeric_suite_core_apis.py`.

Test Plan:
CI

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26693472

fbshipit-source-id: 17f3ff464c6ea01374bcc6ac5899da7034627152
2021-03-01 08:53:57 -08:00
0569f638fe Update CODEOWNERS for torch.nn (#52942)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52942

Reviewed By: H-Huang

Differential Revision: D26699633

Pulled By: albanD

fbshipit-source-id: cf8b213e9bb69fa4980dba380bd42deee40faf85
2021-03-01 07:50:03 -08:00
a06cf5d8a4 [numpy] torch.{rad2deg, deg2rad}: promote integer inputs to float (#51853)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Depends on https://github.com/pytorch/pytorch/issues/51283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51853

Reviewed By: albanD

Differential Revision: D26399743

Pulled By: mruberry

fbshipit-source-id: a6f0e12723e1451c6479d818752fe5d41788715d
2021-03-01 06:25:23 -08:00
f5617b0932 [testing] Add Opinfo for torch.frac and minor fixes (#52660)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52660

Reviewed By: ailzhang

Differential Revision: D26618151

Pulled By: mruberry

fbshipit-source-id: cf0df38e46f44d3afff6e0015af5a840c661aa0e
2021-03-01 04:58:31 -08:00
312b297b82 Revert D26626092: [quant][graphmode][fx][fp16] Add fp16 support for {add|mul}{_relu}
Test Plan: revert-hammer

Differential Revision:
D26626092 (2962fbb03c)

Original commit changeset: 91d040efa51e

fbshipit-source-id: cc6bcc0f451d6adcd7bf7572451e6e3cd6ad59d1
2021-03-01 04:52:47 -08:00
03693c7e4a Revert D26655617: [quant][graphmode][fx][test][refactor] Refactoring binary op tests to split int8 and float16 tests
Test Plan: revert-hammer

Differential Revision:
D26655617 (f2f7fdba05)

Original commit changeset: c36edef09f52

fbshipit-source-id: 7a43cfc9385e45f4532168d2c3d9227da2f1967f
2021-03-01 04:52:44 -08:00
3a024a7ae2 Revert D26655616: [quant][graphmode][fx] Add support for fp16 bmm pattern
Test Plan: revert-hammer

Differential Revision:
D26655616 (2c44b256d8)

Original commit changeset: 1d0639303e5c

fbshipit-source-id: 403429c706c8a9e6a657669daf8aadf282025f83
2021-03-01 04:50:35 -08:00
e43ea227fe Automated submodule update: tensorpipe (#52930)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 4b9f7f8abe

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52930

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26694739

fbshipit-source-id: d8c835f6e74fec6e2c9a3a6e6713926ccf7dcedd
2021-03-01 01:54:33 -08:00
57c7a61237 [NNC] Added NNC IR specification (#52912)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52912

Reviewed By: bhosmer

Differential Revision: D26695726

Pulled By: Chillee

fbshipit-source-id: c2f1efe0696d7567d4ed85487cc20a2db4e73cd5
2021-03-01 01:48:44 -08:00
b9e12a0e82 [pytorch] Fix mkldnn heuristic for multithreaded convolution (#52909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52909

PR #46675 introduced heuristics to use thnn_conv2d for 1x1
convolutions, since mkldnn had a bug that was slowing those cases
down. Unfortunately, the test plan for that PR only tested single-threaded
convolutions; mkldnn is considerably faster on multithreaded convolutions.

An example from yolov3, on 24 cores of a Xeon Platinum 8175M CPU @ 2.50GHz
```
input:{1, 64, 192, 256}, weight:{32, 64, 1, 1}
thnn_conv2d: GFLOPS/s=104.574G/s
mkldnn_convolution: GFLOPS/s=467.357G/s
```
ghstack-source-id: 122627564

Test Plan: Multithreaded 1x1 convolutions

Reviewed By: wconstab, xuzhao9

Differential Revision: D26685272

fbshipit-source-id: e8e05db89e43856969e26570a170c13b3e73ac74
2021-02-28 22:46:47 -08:00
2c44b256d8 [quant][graphmode][fx] Add support for fp16 bmm pattern (#52808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52808

Add support for producing fp16 bmm pattern

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_bmm

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26655616

fbshipit-source-id: 1d0639303e5ca2ca4ceae08d03ebc3b25256de57
2021-02-28 16:48:41 -08:00
4d94ee566e Ge v1 (#52136)
Summary:
This is a second attempt to use the graph executor to run forward on a gradient. This allows a second chance to profile intermediate tensors introduced by autodiff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52136

Reviewed By: pbelevich

Differential Revision: D26693978

Pulled By: Krovatkin

fbshipit-source-id: 91dde8009a210950af8e5173668ada241e16dd52
2021-02-28 00:53:13 -08:00
f2f7fdba05 [quant][graphmode][fx][test][refactor] Refactoring binary op tests to split int8 and float16 tests (#52807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52807

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26655617

fbshipit-source-id: c36edef09f522fe4c8eb0a8872add80c8dae4938
2021-02-27 23:16:49 -08:00
2962fbb03c [quant][graphmode][fx][fp16] Add fp16 support for {add|mul}{_relu} (#52714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52714

Test Plan:
python test/test_quantization.py TestQuantizedOps.test_add
python test/test_quantization.py TestQuantizedOps.test_mul
python test/test_quantization.py TestQuantizedOps.test_add_relu
python test/test_quantization.py TestQuantizedOps.test_mul_relu

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26626092

fbshipit-source-id: 91d040efa51e9c955eb688ec16a30f0c12233958
2021-02-27 22:12:10 -08:00
729d88119a Fix GradBucket Typing (#52943)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52943

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26699759

Pulled By: mrshenli

fbshipit-source-id: 712165a29d114da761ef4f161096ca46a958df03
2021-02-27 20:04:38 -08:00
0818dbf49d [quant][refactor] Merge add and mul handler (#52651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52651

Merging them for easier extensions to fp16 and more binary ops

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26600118

fbshipit-source-id: a1816e593cf3065afe87d2e6e44cdace13bf6aeb
2021-02-27 19:56:32 -08:00
a296fa36ac [Caffe2] Implement BlackBoxPredictor::BenchmarkIndividualOps (#52903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52903

Implement BlackBoxPredictor::BenchmarkIndividualOps so that we can clean up the output tensors properly after each iteration and get more accurate per operator timing.

Add four more metrics to track setup_time, memory_alloc_time, memory_dealloc_time, and output_dealloc_time.

Reviewed By: ajyu

Differential Revision: D26657473

fbshipit-source-id: 1cf282192b531513b9ee40b37252087818412f81
2021-02-27 19:49:22 -08:00
249c213462 [ZeroRedundancyOptimizer] Pytorch compliant state (#52960)
Summary:
Same as https://github.com/pytorch/pytorch/issues/52760, which I could not get to land. I just could not live with ghstack/ghimport randomly breaking things (I break enough of them myself), so this is a fresh copy without the ghstack shenanigans. I'm hopeful that this can land relatively bug-free, and am sorry for the duplication.

What this does:
- call the common_utils test runner instead of unittest, since that seems to be how it should be done
- change the returned state from ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world size independent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52960

Reviewed By: mrshenli

Differential Revision: D26710932

Pulled By: blefaudeux

fbshipit-source-id: 1d914bc9221442ba1bb2b48f5df10c313e674ece
2021-02-27 11:54:08 -08:00
b685864f50 [quant][graphmode][fx] Add reference option support for linear_static_fp16 (#52650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52650

linear_dynamic_fp16 has following dtypes for activation, weight, bias, output:
(fp32, fp16, fp32, fp32)

linear_static_fp16 has following dtypes:
(fp16, fp16, fp16, fp16)

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26599803

fbshipit-source-id: b4a8345d355125070be718a227288cc848cc8bbc
2021-02-27 08:25:44 -08:00
7f1693d95e Fix type hints of the callable arguments for DataLoader (#52924)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52924

Reviewed By: malfet

Differential Revision: D26694894

Pulled By: ejguan

fbshipit-source-id: 55734ec9684caa90f1e599b65659b7c57047f802
2021-02-27 07:45:49 -08:00
177694681e [quant][graphmode][fx] Add reference option support for linear_dynamic_fp16 (#52534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52534

Currently linear_dynamic_fp16 has a signature that's tied to fbgemm/qnnpack.
We'll need to produce a pattern equivalent to linear_dynamic_fp16 to support
extensions to other backends.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear_dynamic_fp16

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26557726

fbshipit-source-id: 270c9f781f73c79416a092b7831294cabca84b0c
2021-02-26 21:12:22 -08:00
e63ec556bf [TensorExpr] PyBind: bind ExternalCalls. (#52905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52905

Differential Revision: D26707491

Test Plan: Imported from OSS

Reviewed By: Chillee

Pulled By: ZolotukhinM

fbshipit-source-id: e1ea3b3630d115e3d81842895c62e22c4cb06fb8
2021-02-26 20:28:13 -08:00
94e23e51c4 [caffe2] EnforceFinite: log blobs finiteness in workspace on error (#52892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52892

When an EnforceFinite check fails this logs all of the tensors in the workspace and whether they are finite or not.

This is a little bit hacky since it uses the aten APIs. I've `ifdef`ed the implementation so it should compile fine on xplat and mobile. It's also accessing the workspace directly but since this is a logging op it seems fine to bend the rules.

Test Plan:
$ buck test //caffe2/caffe2/python/operator_test:enforce_finite_op_test

  $ buck-out/gen/caffe2/caffe2/python/operator_test/enforce_finite_op_test#binary.par
  I0225 16:29:46.166507 311548 enforce_finite_op.h:62] blob X isfinite=false

Reviewed By: dzhulgakov

Differential Revision: D26626336

fbshipit-source-id: f68e219b910a7242f2e72bb4d734c3e84f46eec5
2021-02-26 16:48:19 -08:00
10087337c7 Exclude 'test' from codecoverage (#52935)
Summary:
Also, do not generate coverage report on patch level

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52935

Reviewed By: walterddr

Differential Revision: D26696285

Pulled By: malfet

fbshipit-source-id: 87518682f883c94409778525524e7c392407efa8
2021-02-26 16:41:51 -08:00
1d6bd15790 [JIT] Add torch._C._jit submodule (#52910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52910

**Summary**
PR #52158 tried to move all JIT bindings from `torch._C` to a new
submodule `torch._C._jit`, but that...did not go well. This pull request
adds the new `torch._C._jit` submodule, but does not migrate the
existing bindings. Instead, it adds a unit test that fails if any new
bindings are added to `torch._C`. A comment in the test instructs
developers to add their new binding to the allowlist if it really should
be in `torch._C`, or to add it to the appropriate submodule (e.g.,
`torch._C._jit`). The idea is to prevent the issue
described in #51691 from getting *worse* if it cannot be fixed.
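A minimal sketch of the allowlist idea (the real test lives in the PyTorch test suite; the allowlist contents here are placeholders):
```python
import torch

# placeholder allowlist - the real test keeps the actual list of known bindings
KNOWN_TORCH_C_BINDINGS = {"_jit_script_compile", "ScriptModule"}

def test_no_new_bindings():
    current = {name for name in dir(torch._C) if not name.startswith("__")}
    new = current - KNOWN_TORCH_C_BINDINGS
    assert not new, (
        f"New bindings {sorted(new)} were added to torch._C; add them to an "
        f"appropriate submodule such as torch._C._jit, or extend the allowlist."
    )
```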

**Test Plan**
Continuous integration.

**Fixes**
This commit fixes #51691.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26698373

Pulled By: SplitInfinity

fbshipit-source-id: ec9f5426051227a513d4fd09512b624420e0100b
2021-02-26 16:05:05 -08:00
cb6b65699f [quant][graphmode][fx] Add support for packed params in state_dict (#51639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51639

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D26228185

fbshipit-source-id: 6cf8b4106fec9c6900521a2afe0de6f3d29cc896
2021-02-26 15:13:50 -08:00
b8e6e2971c Run distributed_test with NCCL_ASYNC_ERROR_HANDLING (#52619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52619

Runs this test suite with nccl_async_error_handling enabled. It is the
default for many distributed training jobs, and can also help catch
errors/hangs in tests more easily. We don't expect any changes in the actual
existing tests, since they shouldn't have any hangs.
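For context, a rough sketch of how the flag is enabled; it must be set in the environment before the NCCL process group is created, and `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` are assumed to come from the launcher:
```python
import os

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # must be set before init_process_group

import torch.distributed as dist

dist.init_process_group("nccl", init_method="env://")
```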

Also removes a commented out line
ghstack-source-id: 122595646

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D26588108

fbshipit-source-id: a57bbe2ae5a0c86731d77be45756b17151618eb6
2021-02-26 11:59:49 -08:00
b2520ab3dc Add a demo backend with compiler (#52603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52603

This PR introduces a backend with minimum compilation capability into the to_<backend> flow. The targets are:

- Demonstrate the end-to-end flow of adding a backend -> compilation -> runtime
- Show how backend compilation errors are surfaced to the user, with the original model's source code information. (C++ only in this PR. Python APIs will be demonstrated in a following PR.)

Changes:

- Compilation

1. A backend with minimum compilation features, "backend_with_compiler_demo" is added.
2. The compilation happens AOT in the ```pre_process``` function registered to this backend.
3. Compiled results are stored in a string blob for each method. They are serialized to the lowered module with the ```__getstate__``` function.
4. An error message with the model's source code is thrown for features not handled by the backend compiler.

- Runtime

1. The compiled blob is loaded in the ```__setstate__``` method.
2. The ```compile``` function of the backend passes through the AOT-compiled blob. (TODO: parsing the blob into a format that the backend can understand can happen here.)
3. The ```execute``` function of the backend executes the specified method (handle).

Test Plan:
- ```BackendTest.TestCompiler```: the C++ end-to-end demonstration on a supported model. After compilation and running, the lowered model produces the same result as the original torchscript model.
- ```BackendTest.TestCompilerNotSupport```: Demonstrate the error message from the AOT compilation for a feature not supported from the input module. The error message looks like:

```
"The node of aten::mul is not supported in this compiler. Source code:   File "<string>", line 3

    def forward(self, x, h):
        return x * h
               ~~~~~ <--- HERE
```
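A rough Python-side sketch of the lowering step, assuming the `torch._C._jit_to_backend` entry point used by the backend tests; the method-compile-spec contents below are illustrative, since they are backend-defined:
```python
import torch

class AddModule(torch.nn.Module):
    def forward(self, x, h):
        return x + h  # aten::add is handled; x * h would trigger the error above

scripted = torch.jit.script(AddModule())

# Lower to the demo backend; AOT compilation (pre_process) happens here.
# The method_compile_spec dict is illustrative only.
lowered = torch._C._jit_to_backend(
    "backend_with_compiler_demo", scripted, {"forward": {"": ""}}
)
print(lowered(torch.ones(2), torch.ones(2)))
```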

Reviewed By: raziel

Differential Revision: D26593968

Pulled By: iseeyuan

fbshipit-source-id: 8f264f60a0470e9f07e36fdeccbf17da6c1d7cd7
2021-02-26 11:53:34 -08:00
502a85990d [PyTorch] Move Aten level source list to build_variable.bzl (#52792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52792

Move the aten level source code list from `pt_template_srcs.bzl` to `build_variables.bzl`, such that this source list can be shared by both OSS and internal.
ghstack-source-id: 122458909

Test Plan: CI

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D26647695

fbshipit-source-id: 88469c934d4a73c261418c0c584e46104295a0c2
2021-02-26 11:26:22 -08:00
44b9fcfb55 Fix local version generation (#52898)
Summary:
Add "git" prefix to PyTorch local version, otherwise it might strip leading zeroes from git hashum according to https://www.python.org/dev/peps/pep-0440/#local-version-identifiers:
> If a segment consists entirely of ASCII digits then that section should be considered an integer for comparison purposes
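A quick illustration of that normalization rule, using the third-party `packaging` parser (assumed here to follow PEP 440):
```python
from packaging.version import Version

# an all-digit local segment is normalized as an integer, dropping leading zeros
print(Version("1.9.0+0123456").local)     # '123456'
# prefixing with "git" keeps it a string segment, so the hash survives intact
print(Version("1.9.0+git0123456").local)  # 'git0123456'
```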

Fixes https://github.com/pytorch/pytorch/issues/52857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52898

Reviewed By: anjali411

Differential Revision: D26681878

Pulled By: malfet

fbshipit-source-id: 0e7baa2716fc06193cfacd7c4e6cdc6f4bbac4a9
2021-02-26 10:57:07 -08:00
155b19ef1a [Pytorch Mobile] Remove useless line from bundled_inputs (#52824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52824

How was this not breaking? `_bundled_inputs_deflated` doesn't exist.
ghstack-source-id: 122491970

Test Plan: unit tests

Reviewed By: iseeyuan

Differential Revision: D26658098

fbshipit-source-id: 9ebf961b8764ba8779052c520dd46a8724be042a
2021-02-26 10:36:32 -08:00
18ee39944a .circleci: Change conda image to be cuda specific (#51494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51494

The overall `pytorch/conda-cuda` image was getting to a ridiculous size
of 36GB, so this splits that image into CUDA-specific ones to try to
reduce the amount of things we have to download.

coincides with: https://github.com/pytorch/builder/pull/634

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D26281958

Pulled By: seemethere

fbshipit-source-id: 83b498532a6f04801952438537b564f998b62d94
2021-02-26 10:25:04 -08:00
97568d7471 Use --delta=0 by default for tools/test_history.py (#52877)
Summary:
This is less surprising than the current default, `--delta=12`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52877

Test Plan: Run the example commands from `tools/test_history --help` and check that their output matches that shown.

Reviewed By: pritamdamania87

Differential Revision: D26674258

Pulled By: samestep

fbshipit-source-id: 1413e11519854b0a47e14af2f1d20c57f145dacd
2021-02-26 08:58:08 -08:00
7a178a8a52 [Static Runtime] Add memory alloc/dealloc time to benchmark (#52902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52902

Add more metrics to track memory_alloc_time, memory_dealloc_time, and output_dealloc_time.

Reviewed By: maratsubkhankulov

Differential Revision: D26660715

fbshipit-source-id: 96c6cfac2d2ec66d4c31c84129721a846c3914f0
2021-02-25 22:55:14 -08:00
7cfe140705 Add distributed debug mode func to python (#52481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52481

Adds an API `get_debug_mode` that can be used by the distributed package and by users to retrieve the debug mode. There are currently no functionality changes; the goal is to get the bare-bones function out and add relevant debug-mode logging in follow-up diffs.
ghstack-source-id: 122471216

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26508972

fbshipit-source-id: d1153774f8697bc925a05db177d71c0566d25344
2021-02-25 22:35:55 -08:00
a3cd881890 Fix grammar in reducer warning (#52835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52835

Addresses comment in https://github.com/pytorch/pytorch/pull/52385
that was missed before landing the PR
ghstack-source-id: 122543534

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26660764

fbshipit-source-id: 3edfebed56f382c1414ba9eb65a753ced7e34154
2021-02-25 22:29:52 -08:00
af1fb4e4ee Revert D26641600: [caffe2] move the SaveOp implementation from a header to a .cc file
Test Plan: revert-hammer

Differential Revision:
D26641600 (3969391c07)

Original commit changeset: 84ebe8164ffa

fbshipit-source-id: c3a85b7b15b8cdbf019abfabfd740a5b1d5e8775
2021-02-25 21:32:44 -08:00
21c3f6f415 Revert D26617038: [caffe2] use AddNAlreadyReserved() when serializing blobs
Test Plan: revert-hammer

Differential Revision:
D26617038 (b4a8d98247)

Original commit changeset: 97dedbae889d

fbshipit-source-id: 6921d0a64dee26e18f16628773953bbe7280998e
2021-02-25 21:32:40 -08:00
69b2d5c7c3 Revert D26641599: [caffe2] update load_save_test.py to also verify the chunking behavior
Test Plan: revert-hammer

Differential Revision:
D26641599 (cd9ac54ea7)

Original commit changeset: bccb0af157d8

fbshipit-source-id: 9fe35382876d19aefd16496bf8f920e12aa6f169
2021-02-25 21:30:36 -08:00
c423733967 Add support for builtin sum (#52188)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18627
Adds torch.sum support for JIT
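A minimal sketch of what now compiles:
```python
import torch
from typing import List

@torch.jit.script
def total(xs: List[int]) -> int:
    return sum(xs)  # builtin sum now compiles under TorchScript

print(total([1, 2, 3]))  # 6
```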

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52188

Test Plan:
python test/test_jit.py -k test_list_sum
python test/test_jit.py -k test_torch_sum

Reviewed By: pbelevich, anjali411

Differential Revision: D26670022

Pulled By: nikithamalgifb

fbshipit-source-id: eb58f0a3a64dab4b9fa1f4eb854e9854fa9bda55
2021-02-25 21:09:34 -08:00
25001a0148 ns for fx: remove ".stats" suffix (#52799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52799

We agreed that it's better not to add this, so it is being removed.
We can make Eager mode NS match this in a future PR.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26652638

fbshipit-source-id: 5baa51a6bf6de5632946417fe9fd3d0f3e78f7fa
2021-02-25 20:45:53 -08:00
1d3172130d ns for fx: add node name and type to results (#52798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52798

Adds the node name and node target type to Numeric Suite outputs.
This is useful for debugging which node got matched to which node,
and what the type of the operation is.

```
// before
{
  layer_name: {
    model_name: {
      'type': 'weight',
      'values': [...],
    },
  },
}

// after
{
  layer_name: {
    model_name: {
      'type': 'weight',
      'values': [...],
      'node_name': '0',
      'node_target_type': "<class 'torch.nn.modules.conv.Conv2d'>",
    },
  },
}
```

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26652637

fbshipit-source-id: ba75b110cb91234f17a926ccbc5d0ccee2c3faeb
2021-02-25 20:45:49 -08:00
d2e88246d8 ns for fx: make return type of ns APIs future proof (#52789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52789

Changes the return type of NS APIs from

```
{
  layer_name: {
    model_name: [torch.Tensor(...), ...],
  },
}
```

to

```
{
  layer_name: {
    model_name: {
      'type': 'weight',  # or node_output, etc.
      'values': [torch.Tensor(...), ...],
      # future info can be added here, such as node name, etc.
    },
  },
}
```

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26652640

fbshipit-source-id: 4b31164e402754141368d5a04d595f2b643af3bb
2021-02-25 20:45:44 -08:00
fe068157de ns for fx: unify return types of weight and activation APIs (#52779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52779

1. makes the return type of the weight comparison APIs match the return
type of the activation comparison APIs:

```
# before
{layer_name: {model_name: weight_tensor}}
{layer_name: {model_name: [activation_tensor]}}

# after
{layer_name: {model_name: [weight_tensor]}}
{layer_name: {model_name: [activation_tensor]}}
```

2. makes a type alias for the type, so future changes are easier

Test Plan:
```
mypy torch/quantization
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26652639

fbshipit-source-id: eb1f04d6913cedf88d628f362468875ae9ced928
2021-02-25 20:45:39 -08:00
7094d970d1 ns for fx: decouple subgraph names from node names (#52771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52771

Before this PR, subgraph names were derived from node names
in model B.  For example, if we had

```
A: linear0 -> relu0 -> ...
B: linear_relu0 -> ...
```

Then the subgraph name would be `linear_relu0`, and the outputs before this
PR would look like

```
{
  'linear_relu0': {
    'model_a': ...,
    'model_b': ...,
  },
}
```

This PR decouples subgraph naming from node names.
The outputs after this PR look like:

```
{
  # guaranteed to match the right subgraphs across different models
  # without needing more than one model during the prepare passes
  'base_op_torch.nn.functional.linear_0': {
    'model_a': ...,
    'model_b': ...,
  },
}
```

There are future requirements for which using node_name as subgraph name does not work well:
a. the need to support N models, without having all of them in memory at the same time
b. the need to support fusions and match subgraphs with related but non-equal types

This PR changes the naming of subgraphs to be based on two things:
1. the name of the underlying set of related ops (i.e. `torch.nn.functional.linear`)
2. the order in which this subgraph was named (i.e. `foo_0`, `foo_1`, ...)

Basically, we can't use a node name because of (a): there would have to be
a reference model whose node names the other models use, but that
reference model is not guaranteed to be available.  Note: we could add
some state and require the reference model to go through the APIs first,
saving the reference node names, but I'm deliberately not doing that
to minimize the state used throughout.

To support (b), we need a way to determine a name of a subgraph which is
the same for all related subgraphs (i.e. linear-relu vs quantized_linear
vs quantized_linear_relu). In this PR, this is done by using the base
aten op's name.  We use a string name so it looks nice in the output
(I tried `str(underlying_type)`, and it is not easy for humans to read).

Note: after this PR, it's hard to parse the results to see which layer
is related to which node in the graph. This will be fixed in a future PR
where we will store the node name on the logger, and expose it in the
output.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
python test/test_quantization.py TestFXGraphMatcherModels
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26652641

fbshipit-source-id: ee8dacc2d6e875357c1574cbf426923f9466ea10
2021-02-25 20:43:45 -08:00
cd9ac54ea7 [caffe2] update load_save_test.py to also verify the chunking behavior
Summary:
Add some small utility functions to read the blob names back from the minidb
file so that we can verify how many chunks were written for each blob.

Test Plan: buck test caffe2/caffe2/python/operator_test:load_save_test

Reviewed By: mraway

Differential Revision: D26641599

fbshipit-source-id: bccb0af157d85e585e95bc7be61c4584fba3cb04
2021-02-25 20:24:06 -08:00
b4a8d98247 [caffe2] use AddNAlreadyReserved() when serializing blobs
Summary:
Optimize the blob serialization code by using `AddNAlreadyReserved()` when
serializing tensor data, rather than making N separate `Add()` calls.
`AddNAlreadyReserved()` is a simple addition operation, while each `Add()`
call checks to see if it needs to reserve new space, and then updates the
element data, which is unnecessary in this case.

Test Plan:
This appears to improve raw serialization performance by 30 to 35% for float,
double, and int64_t types which use this function.  This improvement appears
relatively consistent across large and small tensor sizes.

Differential Revision: D26617038

fbshipit-source-id: 97dedbae889d35463628f3016ac56986e685289e
2021-02-25 20:24:01 -08:00
3969391c07 [caffe2] move the SaveOp implementation from a header to a .cc file
Summary:
Move the `SaveOp` code from `load_save_op.h` to `load_save_op.cc`.

Previously this implementation was all in the templatized `SaveOp` class, even
though most of the logic didn't depend on the template parameters.  Having
this code be in the header file slows down the build, and forces more files to
be rebuilt than necessary when changing the SaveOp code.  Having this code in
a template class can also make the generated code larger than needed, since we
don't need separate copies instantiated for each context type.

Test Plan: buck test //caffe2/caffe2/python/operator_test:load_save_test

Reviewed By: mraway

Differential Revision: D26641600

fbshipit-source-id: 84ebe8164ffac1e4a691be41147f0c5d8e890e09
2021-02-25 20:21:55 -08:00
fdd25f82c9 Update to replace AT_ERROR with TORCH_CHECK (#52711)
Summary:
Fixes #52699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52711

Reviewed By: ailzhang

Differential Revision: D26654677

Pulled By: malfet

fbshipit-source-id: 97079250d144c9b1c69028f35e4a23a34481b2a5
2021-02-25 19:51:29 -08:00
a0a1bb074b Make NumPy dependency dynamic (#52794)
Summary:
Move NumPy initialization from `initModule()` to a singleton inside the
`torch::utils::is_numpy_available()` function.
This singleton will print a warning that NumPy integration is not
available, rather than failing the import of torch altogether.
The warning will be printed only once, and will look something like the
following:
```
UserWarning: Failed to initialize NumPy: No module named 'numpy.core' (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:66.)
```

This is helpful if PyTorch was compiled with the wrong NumPy version, or if
NumPy is not commonly available on the platform (which is often the case
on AARCH64 or Apple M1)

Tests that PyTorch is usable after NumPy is uninstalled, at the end of
the `_test1` CI config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52794

Reviewed By: seemethere

Differential Revision: D26650509

Pulled By: malfet

fbshipit-source-id: a2d98769ef873862c3704be4afda075d76d3ad06
2021-02-25 19:45:00 -08:00
9a03e65456 Adding functional way of stacking DataPipes with fixed mypy (#52885)
Summary:
Readding reverted PR with MyPY fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52885

Reviewed By: ejguan

Differential Revision: D26676405

Pulled By: VitalyFedyunin

fbshipit-source-id: 020216c5522d21a4994cd896ae778c0b77f6444b
2021-02-25 19:37:35 -08:00
f40c9db622 [FX][EZ] Hoist custom class .so loading into setUp (#52883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52883

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26675802

Pulled By: jamesr66a

fbshipit-source-id: 7a7bcb1d0a6f8c9b1431bc3e09143ada6e5fbf4d
2021-02-25 18:46:05 -08:00
6514a47385 [quant] Fix conv packed param serialization in state_dict (#52787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52787

Currently, conv packed params can be serialized/deserialized with `torch.jit.save`/`torch.jit.load`, but
can't be round-tripped with `torch.save(m.state_dict())`/`torch.load(...)`.

The reason is (from James):
```
I think the issue probably has to do with the normal pickle deserialization not detecting List[Optional[Tensor]] if it doesn't witness a None in the list. IIRC this is implemented on the TorchScript side through this type tag mechanism: https://github.com/.../jit/serialization/unpickler.cpp...
```

This PR is a hack, but it is acceptable to the JIT team until a proper solution is proposed.
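A minimal sketch of the round-trip this enables (module path assumes the quantized `Conv2d` module API):
```python
import io
import torch

m = torch.nn.quantized.Conv2d(3, 3, kernel_size=1)
buf = io.BytesIO()
torch.save(m.state_dict(), buf)  # packed params now survive plain pickling
buf.seek(0)

m2 = torch.nn.quantized.Conv2d(3, 3, kernel_size=1)
m2.load_state_dict(torch.load(buf))
```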

Test Plan:
Will be tested in next PR

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26649272

fbshipit-source-id: 4fc47a4c63e4cd1fabb404de5f0b95e127a9fca0
2021-02-25 17:52:31 -08:00
a27aaa49aa quant norm layers: move scale + zp to buffers (#52861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52861

Currently, scale and zp in these layers are not buffers, which
means they do not get saved to the state dict.  Moving them
into buffers allows people to properly use state_dict.

Note: this is a redo of https://github.com/pytorch/pytorch/pull/45313,
with BN taken out. Not doing this for BN because it has dependencies on existing
behavior.  We should clean it up eventually.

Note: not handling BC because it's 100% broken now, so there is
no practical value in handling BC.
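A minimal sketch of the underlying pattern, assuming nothing beyond the standard `register_buffer` mechanics:
```python
import torch

# buffers (unlike plain attributes) are included in state_dict()
# and restored by load_state_dict()
class NormStub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.tensor(1.0))
        self.register_buffer("zero_point", torch.tensor(0, dtype=torch.long))

print(NormStub().state_dict().keys())  # odict_keys(['scale', 'zero_point'])
```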

Test Plan:
```
python test/test_quantization.py TestPostTrainingStatic.test_normalization
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26671761

fbshipit-source-id: 7615b1dd0d1ae88eeff8b1d150f3846815dc2bc9
2021-02-25 17:23:39 -08:00
51d8543ac7 [FX] Use precompiled regex in graph name processing (#52853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52853

ghstack-source-id: 122531132

Test Plan: waitforsadcastle

Reviewed By: anjali411

Differential Revision: D26668527

fbshipit-source-id: bd34d860cd3a71d3b29f2430df97a0501d542f5b
2021-02-25 17:21:38 -08:00
569d4fe3f9 .github: Add workflow to build conda packages (#51243)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51243

Reviewed By: walterddr

Differential Revision: D26669795

Pulled By: seemethere

fbshipit-source-id: 1e54aa8cab2b0b5324815fa4f1706e468f9f57dd
2021-02-25 16:50:02 -08:00
649760e5f1 maybe_resize_storage_cuda new_size argument should be unsigned (#52672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52672

This allows correct handling of very large tensor allocations.

Also, replace AT_ERROR with TORCH_CHECK(false)

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26607547

Pulled By: malfet

fbshipit-source-id: 247f7e8c59f76af3b95799afc9bc4ab4cc228739
2021-02-25 16:30:16 -08:00
0f3a3f22af Add sample validation for LKJCholesky.log_prob (#52763)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52724.

This fixes the following for the LKJCholesky distribution in master:
 - `log_prob` does sample validation when `validate_args=True` (a quick example follows the list below).
 - exposes documentation for the LKJCholesky distribution.
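A quick example of the new validation behavior (a sketch; the invalid input is anything violating the `corr_cholesky` constraint):
```python
import torch
from torch.distributions import LKJCholesky

d = LKJCholesky(dim=3, concentration=1.0, validate_args=True)
L = d.sample()
print(d.log_prob(L))            # fine: L is a valid correlation Cholesky factor
d.log_prob(torch.eye(3) * 2.0)  # raises ValueError: rows are not unit-norm
```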

cc. fehiepsi, fritzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52763

Reviewed By: anjali411

Differential Revision: D26657216

Pulled By: neerajprad

fbshipit-source-id: 12e8f8384cf0c3df8a29564c1e1718d2d6a5833f
2021-02-25 16:12:29 -08:00
a52001f923 Improve test_reference_numerics (#51604)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50749
ci-all version of https://github.com/pytorch/pytorch/pull/50550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51604

Reviewed By: anjali411

Differential Revision: D26666951

Pulled By: mruberry

fbshipit-source-id: b87db68f1d2a0f6c151edbc5c7809bbceece69b0
2021-02-25 15:38:42 -08:00
94da8b9816 Fix resource leak bug in TCPStore constructor (#52860)
Summary:
This PR fixes a resource leakage bug in the constructor of `TCPStore` where an exception thrown in `TCPStoreDaemon` or `tcputil::connect()` can leave the server socket dangling. The ideal long-term solution would be to have a RAII wrapper for TCP sockets returned by `tcputil`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52860

Reviewed By: osalpekar

Differential Revision: D26671775

Pulled By: cbalioglu

fbshipit-source-id: ccebbd7533ac601a4b80e6e759f2fb4fe01c70fa
2021-02-25 15:32:38 -08:00
8ba7c4918a [nnc] Test for direct usage of ramp/broadcast
Summary:
I was attempting to experiment with "manual" vectorization, and boy
was it hard.  I finally came up with this, which I want to write down as a test
case.  Eventually the APIs should make this easier...

Test Plan: buck test

Reviewed By: navahgar

Differential Revision: D26631189

fbshipit-source-id: c28794b25d7852890ea843fdbcaf8751648258c0
2021-02-25 15:02:20 -08:00
0b93974075 Fix incorrect runtime error in mul_() when the tensor layout is Mkldnn (#51758)
Summary:
Calling the MKL layout's mul_ from the C++ API raises a RuntimeError.
The error message is below:
```
terminate called after throwing an instance of 'c10::Error'
  what():  unsupported tensor layout: Mkldnn
```

Environment
・CPU : Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
・OS : Ubuntu 18.04.1 LTS
・compiler : gcc 7.5.0
・branch : master
・commit ID: 16cfe97
・build Environment variable: USE_CUDA=0, USE_DISTRIBUTED=0, USE_MKLDNN=1
・Python: 3.6.9

CMakeLists.txt
```
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(mkldnn_test)

find_package(Torch REQUIRED)

add_executable(mkldnn_test mkldnn_test.cpp)
target_link_libraries(mkldnn_test "${TORCH_LIBRARIES}")
set_property(TARGET mkldnn_test PROPERTY CXX_STANDARD 14)
```

mkldnn_test.cpp
```
#include <torch/torch.h>

int main() {
  torch::Tensor a = torch::randn({2, 2});
  torch::Tensor a_mkl = a.to_mkldnn();
  a.mul_(0.5);
  a_mkl.mul_(0.5);
  std::cout << a << std::endl;
  std::cout << a_mkl.to_dense() << std::endl;
  return 0;
}
```

Expected Result
```
$ ./mkldnn_test
 0.1344  0.8107
-0.8157 -0.2610
[ CPUFloatType{2,2} ]
 0.1344  0.8107
-0.8157 -0.2610
[ CPUFloatType{2,2} ]
```

Execution Result
```
$ ./mkldnn_test
terminate called after throwing an instance of 'c10::Error'
  what():  unsupported tensor layout: Mkldnn
Exception raised from validate at /home/gtka7311/pytorch_v180/c_api_test/pytorch/aten/src/ATen/TensorIterator.h:128 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f8a1472690b in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f8a1472316e in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x965bc3 (0x7f8a0d07dbc3 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIteratorBase::populate_operands(at::TensorIteratorConfig&) + 0xf1 (0x7f8a0d079ee1 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x3b (0x7f8a0d07ad3b in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #5: at::TensorIteratorBase::build_binary_op(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0x129 (0x7f8a0d07b339 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #6: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x38 (0x7f8a0d07b418 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #7: at::native::mul_out(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x33 (0x7f8a0d217793 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #8: at::native::mul_(at::Tensor&, c10::Scalar) + 0x45 (0x7f8a0d217865 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x1435c21 (0x7f8a0db4dc21 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #10: at::Tensor& c10::Dispatcher::call<at::Tensor&, at::Tensor&, c10::Scalar>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, c10::Scalar)> const&, at::Tensor&, c10::Scalar) const + 0x15c (0x7f8a0d9e482c in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2a86269 (0x7f8a0f19e269 in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor& c10::Dispatcher::call<at::Tensor&, at::Tensor&, c10::Scalar>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, c10::Scalar)> const&, at::Tensor&, c10::Scalar) const + 0x15c (0x7f8a0d9e482c in /home/gtka7311/pytorch_v180/c_api_test/pytorch/torch/lib/libtorch_cpu.so)
frame #13: main + 0xfd (0x5653221cd282 in ./mkldnn_test)
frame #14: __libc_start_main + 0xe7 (0x7f8a0bba5b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: _start + 0x2a (0x5653221ccf2a in ./mkldnn_test)
```

Modification policy for the code:
Generally, ``mul_`` is processed by the ``TensorIterator`` path in ``mul_out``.
However, ``TensorIterator`` does not support MKL-layout tensors.
Therefore, to solve this problem, I modified ``aten/src/ATen/native/BinaryOps.cpp`` so that ``mkldnn_mul_out`` is executed when an MKL-layout tensor is passed to ``mul_out``.
The modifications to the code are as follows:
```
diff --git a/aten/src/ATen/native/BinaryOps.cpp b/aten/src/ATen/native/BinaryOps.cpp
index ee55114285..5c403546f2 100644
--- a/aten/src/ATen/native/BinaryOps.cpp
+++ b/aten/src/ATen/native/BinaryOps.cpp
@@ -270,6 +270,9 @@ Tensor& floor_divide_(Tensor& self, const Tensor& other) {
 }

 Tensor& mul_out(Tensor& result, const Tensor& self, const Tensor& other) {
+  if (self.is_mkldnn()) {
+    return native::mkldnn_mul_out(result, self, other);
+  }
   auto iter = TensorIterator::binary_op(result, self, other);
   mul_stub(iter.device_type(), iter);
   return result;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51758

Reviewed By: pbelevich

Differential Revision: D26655442

Pulled By: bdhirsh

fbshipit-source-id: fcc5e74734cae91f725fab525f181b3066eafa28
2021-02-25 14:36:01 -08:00
da732c76c4 Revert D26644079: [pytorch][PR] Adding functional way of stacking DataPipes
Test Plan: revert-hammer

Differential Revision:
D26644079 (7972036bbb)

Original commit changeset: dcf464637b4f

fbshipit-source-id: a12a06d7e7fb3821a0990bbc6305d02721ead82c
2021-02-25 14:30:49 -08:00
c2558b4b61 [vulkan] Add nonVarTypeModeGuard to vulkan tests and speed_benchmark_torch (#52535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52535

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26580994

Pulled By: SS-JIA

fbshipit-source-id: 94f091432265cf6607b73c34846c07273d47c70b
2021-02-25 14:23:40 -08:00
e94940b169 Use touch() in pathlib for better compatibility on Windows (#52729)
Summary:
https://github.com/pytorch/pytorch/issues/52477 introduced the usage of `touch`, which is not available in a plain Windows environment unless you have the tools that come with Git Bash available. This PR fixes the build break on those systems by using the `touch` provided by Python's pathlib.
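A minimal sketch of the portable replacement:
```python
from pathlib import Path

# portable replacement for the `touch` shell command
Path("generated.stamp").touch(exist_ok=True)
```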

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52729

Reviewed By: anjali411

Differential Revision: D26666724

Pulled By: walterddr

fbshipit-source-id: aae357eb55c6787631eadf22bee7901ad3c2604e
2021-02-25 13:46:21 -08:00
19a8ada8d5 quant: fix conv transpose with qconfig == None (#52844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52844

Fixes a crash in qconfig checking which happened if a model had conv transpose
with qconfig set to None.

Test Plan:
```
python test/test_quantization.py TestPostTrainingStatic.test_convtranspose_per_channel_qconfig_none
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26666043

fbshipit-source-id: e1b62840b4e3c67acbb4dbdcd32514b374efce1e
2021-02-25 11:52:30 -08:00
c871abecf5 Added torch.no_grad() to update_bn (#52654)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52055

This fixes the **out of memory error** while using update_bn in **SWA**, by not allocating memory for backpropagation.
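A minimal usage sketch (model and loader are stand-ins):
```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
swa_model = AveragedModel(model)
loader = [torch.randn(8, 4) for _ in range(3)]
update_bn(loader, swa_model)  # forward passes no longer build an autograd graph
```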

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52654

Reviewed By: malfet

Differential Revision: D26620077

Pulled By: albanD

fbshipit-source-id: 890b5a78ba9c1a148f3ab7c63472a73d8f6412a4
2021-02-25 11:35:38 -08:00
0e86f14ec0 Upgrade onednn to v.1.8.1 (#51184)
Summary:
This PR upgrades oneDNN to v1.8.1 to pick up bug fixes.

- https://github.com/pytorch/pytorch/issues/50042 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51184

Reviewed By: ailzhang

Differential Revision: D26645894

Pulled By: VitalyFedyunin

fbshipit-source-id: 5fb3e5f673c819bccc158672e4b648e570bda3a0
2021-02-25 11:29:15 -08:00
7972036bbb Adding functional way of stacking DataPipes (#52507)
Summary:
Allows using the functional API to stack DataPipes:
```python
numbers_dp = NumbersDataset(size=10).filter(filter_fn = lambda x: x % 2 == 1).map(fn = lambda x: x * 10)
```

DataPipes have to be decorated with:
```python
@functional_datapipe('map')
class MapIterDataPipe(IterDataPipe[T_co]):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52507

Reviewed By: ailzhang

Differential Revision: D26644079

Pulled By: VitalyFedyunin

fbshipit-source-id: dcf464637b4fcf9ea1eb8e84c2a0cd4dfd58b43d
2021-02-25 11:22:01 -08:00
a11b601100 Expose Store's timeout and TCPStore's host and port in Python API (#52784)
Summary:
This PR introduces a `timeout` accessor on `Store` and `host` and `port` accessors on `TCPStore` to help with testing and troubleshooting higher-level APIs.
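A small usage sketch, assuming a single-process master store (host/port values are arbitrary):
```python
import datetime
from torch.distributed import TCPStore

store = TCPStore("127.0.0.1", 29500, 1, True, datetime.timedelta(seconds=30))
print(store.host, store.port)  # accessors introduced by this PR
print(store.timeout)
```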

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52784

Reviewed By: anjali411

Differential Revision: D26648202

Pulled By: cbalioglu

fbshipit-source-id: 9cf23bf998ed330d648dfec2a93e1bbb50817292
2021-02-25 11:05:15 -08:00
f974cf4688 Test for distributed RL with RPC (#52393)
Summary:
Addresses one item in https://github.com/pytorch/pytorch/issues/46321

## Background
This is a test version of the RL RPC example defined [here](https://github.com/pytorch/examples/blob/master/distributed/rpc/rl/main.py) and [here](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html), with the following differences:
* It defines and uses a `DummyEnv` to avoid a dependency on `gym`. The `DummyEnv` simply returns random states & rewards for a small number of iterations.
* It removes the `ArgumentParser` and utilizes `RpcAgentTestFixture` + hard-coded constants for configuration and launching.
* It changes the worker names to match what the internal Thrift RPC tests expect.

The code is purposefully kept very similar to the original example code outside of these differences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52393

Test Plan:
```
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_rl_rpc -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_rl_rpc -vs
```

Reviewed By: glaringlee

Differential Revision: D26515435

Pulled By: jbschlosser

fbshipit-source-id: 548548c4671fe353d83c04108580d807108ca76e
2021-02-25 10:52:53 -08:00
163a91bed3 Fix TensorPipe agent trying to double-set error (#52837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52837

After https://github.com/pytorch/pytorch/pull/52749 we started seeing an increased flakiness of the TensorPipeDistAutogradTestWithSpawn.test_backward_node_failure_python_udf test, with failures like this one:

https://app.circleci.com/pipelines/github/pytorch/pytorch/277824/workflows/cfcbef5a-544e-43bd-b3b0-ebc7b95134fe/jobs/11145394

https://gist.github.com/lw/a0b48900673b5ae0f5d03aca1e72ffff

The logs are very clear and point to the changes in the error handling code upon a write error. Namely, the bug is triggered when an incoming read fails while there is an outgoing write, in which case the read callback (invoked first) flushes all pending futures, which then causes the write callback (invoked after) to not find the future it's looking for.

In a sense this bug wasn't introduced by https://github.com/pytorch/pytorch/pull/52749; however, that PR introduced a check for whether the outgoing message was found, whereas before we would silence such a condition.

A fix for this could be to just resume silencing the error. However, I'm trying to go a bit further: when an outgoing write fails, we know that all subsequent callbacks will fail too, and thus all pending operations should be flushed. Hence we can do so, instead of just trying to flush a single given operation. This allows us to merge the error-handling code of both the read and write paths.
ghstack-source-id: 122509550

Test Plan: Will export to GitHub, run on CircleCI, and manually SSH into a machine and stress-run that test that was flaky.

Reviewed By: mrshenli

Differential Revision: D26663448

fbshipit-source-id: fbff0f6aff0d98994c08018a27c47c97149b920c
2021-02-25 10:47:04 -08:00
3ff6c9174a Update TensorPipe submodule (#52677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52677

Test Plan: CircleCI

Reviewed By: beauby

Differential Revision: D26609075

fbshipit-source-id: 7dc2f8a1e6b9d8fe1ff49398379888237c115f2b
2021-02-25 10:36:56 -08:00
39fa0b5d0a Add scatter_add to amp promote list (#52133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51730

I've added the `scatter_add` and `scatter_add.dimname` to the promote list as well as test cases for the former op.
However, it seems that `scatter_add` [doesn't support named tensors yet](8b0cb5ede3/aten/src/ATen/native/NamedTensor.cpp (L356-L358)) (thanks t-vi for the pointer):
```python
dev = 'cuda'
torch.scatter_add(torch.zeros(2, 2, 2, dtype=torch.float16, device=dev, names=('N', 'C', 'L')),
                             'C',
                             torch.randint(0, 2, (2, 2, 2), device=dev),
                             torch.randn((2, 2, 2), dtype=torch.float32, device=dev))
> RuntimeError: scatter_add: You passed a dimname (string) to this op in place of a dimension index but it does not yet support this behavior. Please pass a dimension index to work around this.
```
which raised this error after adding this test case.

I'm thus unsure whether I should also remove `scatter_add.dimname` from the promote list or not.

In any case, once named tensors are supported a potential test could be added as:
```python
            ("scatter_add", (torch.zeros(2, 2, 2, dtype=torch.float16, device=dev, names=('N', 'C', 'L')),
                             'C',
                             torch.randint(0, 2, (2, 2, 2), device=dev),
                             torch.randn((2, 2, 2), dtype=torch.float32, device=dev))),
```

CC mcarilli ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52133

Reviewed By: ejguan

Differential Revision: D26440392

Pulled By: ngimel

fbshipit-source-id: f4ee2d0b9e1f81afb6f94261c497cf2bf79ec115
2021-02-25 09:37:01 -08:00
316eabe9ba fix(docs): remove redundant hardsigmoid() in docstring so the inplace parameter shows up (#52559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52559

Reviewed By: ailzhang

Differential Revision: D26636347

Pulled By: vkuzo

fbshipit-source-id: da615d0eb6372637a6441e53698e86252591f6d8
2021-02-25 09:09:32 -08:00
1618dc2ac6 ns for fx: update graph matching to handle dicts and tuples in node args (#52681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52681

Updates the NS graph matching to properly traverse through args of nodes
if args are lists or tuples.  As a side benefit, refactors the code to
make future similar improvements easier.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26611221

fbshipit-source-id: 4ddd9b26338a5a2763b2883967e100f73e207538
2021-02-25 08:53:44 -08:00
608f44b24b ns for fx: update graph matching to not match nodes with equal types (#52402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52402

Before this PR, any pair of subgraphs with base nodes of equal
types matched.

While sometimes this is useful, this should be off by default to
properly handle user defined modules and functions, for which we do not
know how to extract weights or cast to the right input type.

In a future PR, we can add hooks to turn on matching for nodes
of equal types, for the situations where it makes sense.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_nodes_with_equal_types_do_not_get_matched
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26499848

fbshipit-source-id: 5818b88eb7fd8ed36390f60aa1a18228bb50507e
2021-02-25 08:53:39 -08:00
4483c48eb1 ns for fx: support linear_relu for weight matching (#52395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52395

Simple change to add logic to get the weight of a quantized
`linear_relu` node.

More flavors of conv and linear will be added in future PRs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_compare_weights_fun
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D26497992

fbshipit-source-id: e6d88e92eedd6cdbf9116cbcfc8f6164f8499246
2021-02-25 08:53:35 -08:00
64b4e37c26 ns for fx: allow graph matching of parents of cat (#52368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52368

Before this PR, the graph matching logic only handles node arguments of
type Node. This PR extends it to allow to handle node arguments of type
Tuple, so that the matcher can properly navigate through the arguments
of `cat`.

Test Plan:
```
python test/test_quantization.py TestFXGraphMatcher.test_nodes_before_cat
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26490101

fbshipit-source-id: 2de8d6acc30f237e22bfc3cfa89728b37411aab6
2021-02-25 08:51:48 -08:00
13121598ef [Pytorch, sparsity] Bug fix to update requantization and zp parameters of input (#52797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52797

Also sneaking in a change to check for realloc failure for the packed activation buffer.

FB:
In dynamic quantization, the input's quantization scale and zero point can be
different on every iteration. Thus the requantization scale needs to be
recomputed.

An earlier bug that calculated those only at op creation time resulted in
wrong results on subsequent runs.

This diff fixes that.
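An illustrative sketch of why caching is wrong here (not the actual kernel code):
```python
# Illustrative only - not the actual kernel code. In dynamic quantization the
# input scale/zero point can change on every call, so the requantization
# multiplier must be derived per run rather than cached at op-creation time.
def requantization_scale(input_scale: float, weight_scale: float,
                         output_scale: float) -> float:
    return input_scale * weight_scale / output_scale

print(requantization_scale(0.02, 0.01, 0.05))  # recomputed each iteration
```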

Test Plan:
FB:
buck test caffe2/torch/fb/model_optimization:sparsity_test

Reviewed By: z-a-f, jiatongzhou

Differential Revision: D26651968

fbshipit-source-id: e5b9acef03fc45f31c43d88a175f3a64f7dbf4bd
2021-02-25 08:44:30 -08:00
99a428ab22 Lower ReLu6 to aten (#52723)
Summary:
- Lowers ReLU6 to ATen
- Changes Python and C++ to reflect the change
- Adds an entry in native_functions.yaml for the new function
- This is needed as we would like to intercept ReLU6 at a higher level with an XLA-approach codegen
- Functional C++ tests should pass, but please let me know if more tests are required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52723

Reviewed By: ailzhang

Differential Revision: D26641414

Pulled By: albanD

fbshipit-source-id: dacfc70a236c4313f95901524f5f021503f6a60f
2021-02-25 08:38:11 -08:00
fa7575ea05 Update backwards compatibility check to ignore reverted op (#52841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52841

ghstack-source-id: 122515522

Test Plan: CircleCI

Reviewed By: malfet

Differential Revision: D26665136

fbshipit-source-id: f2aafa8e05f39e284f66f88685d9ce675bebe1cf
2021-02-25 08:28:38 -08:00
914126901e Fix typos in tools/test_history.py helpstring (#52840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52840

Test Plan:
```
$ tools/test_history.py --help
```

Reviewed By: janeyx99

Differential Revision: D26665121

Pulled By: samestep

fbshipit-source-id: 3607a4a598f1b1639ac1752b4e377491bff7188f
2021-02-25 08:21:23 -08:00
1ac59d9db3 Fix RPC get_worker_info for rank=0 (#52804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52804

`rpc.get_worker_info` used to take only a string in v1.6. We recently
allowed it to accept `int` and `WorkerInfo` as well, but the previous check
on `worker_name` is no longer correct. This commit adds an explicit
`not None` check.
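A small illustration of why the explicit check matters (`describe` is a hypothetical stand-in for the fixed logic):
```python
def describe(worker=None):
    # hypothetical stand-in: `if worker:` would treat rank 0 (falsy) as
    # "not provided"; the explicit None check handles rank 0 correctly
    if worker is not None:
        return f"worker {worker}"
    return "current worker"

print(describe(0))  # 'worker 0'; a truthiness check would wrongly say 'current worker'
```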

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26655089

Pulled By: mrshenli

fbshipit-source-id: fa1545bd6dd2b33bc1e919de46b94e799ab9719c
2021-02-25 08:15:01 -08:00
f71d9e28f9 Store test filename in test report path (#52791)
Summary:
This way, we can have a mapping from the test files we directly execute (the tests [here](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L20)) to the test suites that we store data for in XML reports.

This will come in use later for categorizing the tests we run in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52791

Reviewed By: samestep

Differential Revision: D26655086

Pulled By: janeyx99

fbshipit-source-id: 94be32f80d7bc0ea1a7a11d4c4b1d3d8e774c5ea
2021-02-25 07:53:30 -08:00
92a4ee1cf6 Revert D26375734: Implemented torch.linalg.multi_dot
Test Plan: revert-hammer

Differential Revision:
D26375734 (0396f492b9)

Original commit changeset: 839642692424

fbshipit-source-id: cb64db646010128d802e1930d5e9526c1f7aa6a2
2021-02-25 00:43:57 -08:00
0048d97eda remove index_fill side-effect for scalar tensors (#52209)
Summary:
`index_fill` silently promotes zero-dim Tensors to 1-dim Tensors. This PR fixes that.
Was:
```
In [1]: import torch

In [2]: x = torch.tensor(1)

In [3]: idx = torch.tensor(0).long()

In [4]: x.dim()
Out[4]: 0

In [5]: x.index_fill(0, idx, -1).dim()
Out[5]: 1

```
Now:
```
In [6]: x.index_fill(0, idx, -1).dim()
Out[6]: 0

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52209

Reviewed By: ejguan

Differential Revision: D26446470

Pulled By: ngimel

fbshipit-source-id: 4737e6941a7216b57f3416b59362817834df3a3a
2021-02-25 00:35:27 -08:00
57947c5d85 [TensorExpr] Add Placeholder::handle method to get the corresponding BufHandle. (#52793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52793

Fixes #52776.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26650481

Pulled By: ZolotukhinM

fbshipit-source-id: 54461137b857d3ac5d475cfa3d3ba07432c9bf59
2021-02-24 22:47:29 -08:00
d3b427a0e3 [TensorExpr] Add an unmasked Load constructor. (#52790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52790

Fixes #52774.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26649542

Pulled By: ZolotukhinM

fbshipit-source-id: ab1c9e55f52e59d0bd00fbde2ec3125f8c7917ee
2021-02-24 22:45:29 -08:00
30cb6ac53c Introduce mlc device (ML Compute device) to PyTorch's device list (#50634)
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:

```
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel);
  ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634

Reviewed By: malfet

Differential Revision: D26614213

Pulled By: smessmer

fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
2021-02-24 22:39:11 -08:00
2bdf6305a0 Drop unused variables (#52643)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52643

Test Plan: Sandcastle

Reviewed By: borovsky-d

Differential Revision: D26588961

fbshipit-source-id: 5e00ec05d006ccad8ff8cd98916a0265e592f9fd
2021-02-24 22:20:56 -08:00
a649d808e6 Added fast path in the case of no hooks (#52576)
Summary:
See the discussion here: https://github.com/pytorch/pytorch/pull/50431

~~Not completely done yet - need to figure out the backwards compatibility stuff as well as `RemovableHandle`.~~

~~Also, this concretely breaks Torchscript (which tries to script the properties), and more generally, probably requires modifying Torchscript hook support: https://github.com/pytorch/pytorch/issues/34329~~

Just kidding, I think all problems are solved :)

Another thing I could do in this PR is to simply replace all the `len(x) > 0` checks with the faster checks. That's about 1.5-2k more Python instructions and .4 - .5 microseconds slower.
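A rough sketch of the fast-path shape, using the hook dicts `nn.Module` already keeps; the helper below is a hypothetical simplification, not the actual implementation:
```python
import torch

def fast_call(module: torch.nn.Module, *args, **kwargs):
    # Hypothetical simplification of Module.__call__: when no hooks are
    # registered on the module, skip hook bookkeeping and call forward()
    # directly; otherwise fall back to the full __call__ path.
    if not (module._backward_hooks or module._forward_hooks
            or module._forward_pre_hooks):
        return module.forward(*args, **kwargs)
    return torch.nn.Module.__call__(module, *args, **kwargs)

print(fast_call(torch.nn.ReLU(), torch.tensor([-1.0, 2.0])))
```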

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52576

Reviewed By: ailzhang

Differential Revision: D26650352

Pulled By: Chillee

fbshipit-source-id: 0fd73e916354b9e306701a8a396c5dc051e69f0d
2021-02-24 21:48:09 -08:00
a6b7da7dfe Add 64bit indexing support for softmax (#52713)
Summary:
fixes https://github.com/pytorch/pytorch/issues/52715 https://github.com/pytorch/pytorch/issues/52716

split across batch dimension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52713

Reviewed By: ailzhang

Differential Revision: D26640033

Pulled By: ngimel

fbshipit-source-id: f169cb0d6abc1cfbddf658d9775759a7d56f5c12
2021-02-24 21:39:58 -08:00
c140a5ec04 Use finer-grained mutexes in TensorPipe RPC agent (#52749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52749

TensorPipe has recently changed some implementation details in how it schedules callbacks and this has exposed an issue in the RPC agent. Previously the callbacks of each pipe were executed independently and possibly simultaneously. For safety reasons (especially during shutdown) TensorPipe now synchronizes the pipes and thus invokes one callback at a time. Another characteristic of TensorPipe is that it "hijacks" some user threads to run some callbacks inline (e.g., if a low-level event loop completes an operation while a pipe is already busy, this completion is queued up and the user callback could be invoked later by a different thread, including the user's own thread).

These two effects combined caused a "reentrancy" phenomenon, where calling `context->connect` (formerly on line 850) to create a new client-side pipe could invoke a read callback on another pipe. Since we were holding `mutex_` when calling `context->connect`, and we were trying to re-acquire `mutex_` inside the read callback, this led to a deadlock.

One solution to this problem is using finer-grained mutexes. In particular, introduce a mutex for each outgoing pipe (rather than a global one), which thus becomes the only one we need to acquire inside callbacks. At this point, the old `mutex_` is only guarding the vector of ClientPipes, thus we can rename it and release it earlier.

I also fixed the agent not acquiring any mutex when it set a message to error after a failed write (and also not removing the message from the timeout map).
ghstack-source-id: 122410367

Test Plan: Ran CI in #52677 together with the TensorPipe submodule update.

Reviewed By: mrshenli

Differential Revision: D26636345

fbshipit-source-id: d36da989f2aab51f4acb92d2e81bb15b76088df1
2021-02-24 21:28:30 -08:00
c954817696 print matrix dims in torch cuda matrix multiply error (#52780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52780

trying to improve the error message for torch matrix multiply dimension mismatch

Test Plan: check if code compiles

Reviewed By: akyrola

Differential Revision: D26617036

fbshipit-source-id: de23e551af985a00384fb1cccd04120b9d2728b3
2021-02-24 20:09:25 -08:00
29c4290a8d Use c10::irange for great good (#52153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52153

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26407087

fbshipit-source-id: ea8ce1c17299cb9d89621e4a39f31edc2faa9fd6
2021-02-24 18:43:50 -08:00
373a20ad4a Modernize for-loops in caffe2/torch (#52618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50797

Modernize for-loops throughout caffe2/ subdirs to use range-based loops where possible (all `.cpp` files were examined).

```
find caffe2/ -iname "*.cpp" > /home/rbarnes/files
buck run mode/opt foundation/clangr:clangr_local -- -j 10 --file=/home/rbarnes/files --multi --apply-replacements=true tidy '--checks=-*,modernize-loop-convert'
```

Test Plan: Sandcastle tests

Reviewed By: suo

Differential Revision: D26585065

fbshipit-source-id: 439b9f9ce7c54fa9b4b80161f6bb27ebe8a35967
2021-02-24 18:17:46 -08:00
0567988e74 Kernel launch checks for aten/src/ATen (#52185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52185

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D26408276

fbshipit-source-id: 554dcfca52304b8e17ffbd0ba0dcf73f99cf28c6
2021-02-24 16:34:28 -08:00
98873b9258 Update Gloo submodule (#52754)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52754

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26608421

fbshipit-source-id: 034ee34faa62ec4d4672d0197c59fa48894adae0
2021-02-24 15:42:22 -08:00
0396f492b9 Implemented torch.linalg.multi_dot (#51807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.
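A quick usage sketch of the new function:

```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 1000)
C = torch.randn(1000, 5)

# chooses the cheapest multiplication order automatically;
# numerically equivalent to A @ B @ C
out = torch.linalg.multi_dot([A, B, C])
assert out.shape == (10, 5)
```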

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
2021-02-24 15:32:30 -08:00
964d47dfb9 Add torch.linalg to generated annotated_args for test_overrides (#52464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52464

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26618696

Pulled By: heitorschueroff

fbshipit-source-id: 9889fcaafcb307319b4526ee86355389653a6b61
2021-02-24 15:30:32 -08:00
7b54a8fc23 [quant] Reference option for conv module (#52316)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52316

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26505642

fbshipit-source-id: e25b1a3cc37c4b744e694946e6ddf1470dd4692b
2021-02-24 14:54:02 -08:00
3cf08eaf15 [Pytorch Mobile] Improve Bundled Inputs Error Checking (#52386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52386

Remove the stale aliasing-inputs warning; error-check that inputs is not null and has at least one entry; error-check that the list of inputs is a list of tuples. Passing a bare list of tensors (the most common mistake) can cause subtle bugs where the first dimension of each tensor is dropped. This can go unnoticed because it's often the batch dimension, which PyTorch occasionally silently re-adds if it's missing.
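For illustration, a sketch of the shape pitfall (assuming the `torch.utils.bundled_inputs` helper):

```python
import torch
from torch.utils import bundled_inputs

model = torch.jit.script(torch.nn.Linear(4, 2))

# Correct: a list of tuples, one tuple of arguments per bundled call
bundled_inputs.augment_model_with_bundled_inputs(
    model, inputs=[(torch.randn(1, 4),)]
)

# Incorrect (now rejected): a bare list of tensors. A tensor is iterable,
# so each one used to be unpacked as if it were the argument tuple itself,
# silently splitting along the first (batch) dimension and dropping it.
# bundled_inputs.augment_model_with_bundled_inputs(model, inputs=[torch.randn(1, 4)])
```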
ghstack-source-id: 122363487

Test Plan:
Bundle something with an input, bundle something with {} for inputs

For typing check below paste

{P199554712}

Reviewed By: dhruvbird

Differential Revision: D26374867

fbshipit-source-id: cd176f34bad7a4da850b165827f8b2448cd9200d
2021-02-24 13:55:45 -08:00
88a160dc21 [TensorExpr] LoopNest: Cleanup LoopNest constructors. (#52726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52726

This change removes `input_bufs_` and `intermediate_bufs_` from the
`LoopNest` class, as they can be deduced from the root stmt and the list
of output bufs. As a result, the constructor of the LoopNest also becomes
simpler: we now need to pass just one list of bufs.

Note: we might consider passing a list of input bufs for verification
purposes (only input buffers are allowed to not have a definition), but
since we don't really have an IR verifier yet, there is no need for it
now. Once we add an IR verifier, we could reconsider this.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26629596

Pulled By: ZolotukhinM

fbshipit-source-id: 81f544e9602b6855b7968d540b9ae06bd7c7e6d8
2021-02-24 13:26:22 -08:00
08d266943d structured kernels - error check when structured_delegate is not marked structured (#52227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52227

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26431105

Pulled By: bdhirsh

fbshipit-source-id: 82e8c6eb5eee6f9cdb39637ecfee84ab5bb2cabe
2021-02-24 13:17:52 -08:00
5e977d9c38 Catch Flake8 error codes with multiple letters (#52750)
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.
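A hypothetical sketch of the regex fix (the actual pattern used in CI may differ); the point is `[A-Z]+` versus a single `[A-Z]`:

```python
import re

line = "torch/utils/data/datapipes/utils/common.py:1:1: EXE002 file is executable"

old = re.compile(r".*:\d+:\d+: [A-Z]\d+ ")   # misses multi-letter codes
new = re.compile(r".*:\d+:\d+: [A-Z]+\d+ ")  # catches EXE002 as well as e.g. W605

print(bool(old.match(line)))  # False: the warning slips through
print(bool(new.match(line)))  # True
```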

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- https://github.com/pytorch/pytorch/issues/50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- https://github.com/pytorch/pytorch/issues/51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- https://github.com/pytorch/pytorch/issues/51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- https://github.com/pytorch/pytorch/issues/51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- https://github.com/pytorch/pytorch/issues/51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- https://github.com/pytorch/pytorch/issues/51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in https://github.com/pytorch/pytorch/issues/52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
2021-02-24 12:56:12 -08:00
7ae7768617 [ZeroRedundancyOptimizer] Remove pseudo futures handling, not needed (#52698)
Summary:
This was mostly needed for ShardedDDP and is not used here; dead code removal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52698

Reviewed By: mrshenli

Differential Revision: D26617893

Pulled By: blefaudeux

fbshipit-source-id: 9bcfca5135bf332ebc1240300978c138d2041146
2021-02-24 11:39:59 -08:00
27d04f291e Clarify usage and output of tools/test_history.py (#52640)
Summary:
This PR makes several UX improvements to `tools/test_history.py`:
- warn if `--all` is unset and no jobs are passed
- print output even in `multiline` mode if no reports are found for a commit
  - this makes it easier to tell whether the script is just hanging
- if there are multiple reports for a commit/job pair, say so
- distinguish between not finding any reports and just not finding the desired test in the reports found
- don't require the suite name as a CLI arg, just use the test name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52640

Test Plan: Example shell session: https://pastebin.com/SSemHqP8

Reviewed By: walterddr

Differential Revision: D26594350

Pulled By: samestep

fbshipit-source-id: 9ce2245f91eef289817aafe955a4343d4a068eda
2021-02-24 11:19:45 -08:00
b4b7db2f3b [FX acc]Add fx_glow support for multi outputs (#52527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52527

Add e2e support for multiple outputs.

Test Plan: `buck test glow/fb/fx/fx_glow:test_fx_glow -- test_fx_glow_binding_with_multiple_outputs`

Reviewed By: gcatron

Differential Revision: D26555520

fbshipit-source-id: f3ccd61a0c2429d4a5f511c403fa6e782012e21e
2021-02-24 10:22:50 -08:00
59ac0ff037 Change maybe_resize_storage_cpu new_size arg to unsigned (#52671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52671

The code is written with the assumption that new_size is an unsigned value,
and when the function is called with a negative value it silently returns a nullptr rather than raising an exception.
Fix the above-mentioned logic by converting new_size to an unsigned type and letting cpu_allocator raise an exception on a negative alloc.

Unroll the nested if blocks by returning early if new_size is 0.

Add TestNN.test_adaptive_pooling_size_overflow to indirectly validate the fix.

Fixes https://github.com/pytorch/pytorch/issues/50960
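A hypothetical repro sketch (exact sizes are illustrative): an output size whose byte count overflows the signed arithmetic now surfaces as an allocator error instead of a silent nullptr.

```python
import torch

# 2**61 elements * 4 bytes overflows a signed 64-bit byte count
pool = torch.nn.AdaptiveMaxPool1d(output_size=2 ** 61)
try:
    pool(torch.randn(1, 1, 8))
except RuntimeError as e:
    print("raised instead of segfaulting:", e)
```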

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26607549

Pulled By: malfet

fbshipit-source-id: e3d4f7548b098f24fa5aba42d8f4e9288ece1e2e
2021-02-24 09:50:28 -08:00
08d7f29601 Add discontiguous kwarg to make_tensor (#51985)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51985

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26375733

Pulled By: heitorschueroff

fbshipit-source-id: bb7831dc28c24b90c6f83885681eeccfdbb83438
2021-02-24 08:57:24 -08:00
3489b4a7b8 Fix the ordering of TCPStore's compare_set parameters (#52696)
Summary:
- Fixes the ordering of the value parameters of TCPStore's `compare_set()` in the pybind11 interop layer. The C++ API expects (old, new) while we are passing (new, old) in Python (see the sketch below).
- Fixes the implementation of TCPStore's `compareSetHandler()` for cases where the key already exists in the store.
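A minimal sketch of the corrected calling convention (single-process store; host and port are placeholders; the return value is shown as expected after the fix):

```python
import torch.distributed as dist

# host, port, world_size, is_master
store = dist.TCPStore("127.0.0.1", 29500, 1, True)
store.set("key", "first")

# compare_set(key, expected, desired): swaps only if the stored value
# matches `expected`; Python now forwards the arguments in the order
# the C++ API expects
print(store.compare_set("key", "first", "second"))  # b'second'
```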

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52696

Test Plan: `python test/distributed/test_c10d.py`

Reviewed By: malfet, H-Huang

Differential Revision: D26616976

Pulled By: cbalioglu

fbshipit-source-id: e6a70542e837be04697b5850947924edd896dbf6
2021-02-24 06:59:03 -08:00
97b6b3df51 [Reland] Update XNNPACK (#52691)
Summary:
This update contains the fix to XNNPACK by kimishpatel
Add unit test that exposed the problem
Updated torchvision checkout to 0.9.0rc1 hash

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52691

Reviewed By: walterddr

Differential Revision: D26614595

Pulled By: malfet

fbshipit-source-id: d0fe155a084690a3459a9358dac8488292e734fb
2021-02-24 06:40:38 -08:00
8af648354f [nnc] Benchmarks for concat (#52592)
Summary:
This PR adds a c++ benchmark for "concat" with 3 different versions - 1) aten::cat, 2) NNC implementation with if-then-else, 3) NNC implementation using multiple loops. It also adds a python benchmark for "concat" which can now be invoked with and without CPU fusion.

Here are the results of these benchmarks on a `Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz` machine with `OMP_NUM_THREADS=1`

```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
Concat2D2Input/ATen/1/160/1/14/1                                        1211 ns       1211 ns     567896 GB/s=1.14953G/s
Concat2D2Input/ATen/1/580/1/174/1                                       1296 ns       1296 ns     537060 GB/s=4.65362G/s
Concat2D2Input/ATen/20/160/20/14/1                                      1823 ns       1823 ns     382052 GB/s=15.2677G/s
Concat2D2Input/ATen/20/580/20/174/1                                     3347 ns       3347 ns     210036 GB/s=36.0432G/s
Concat2D2Input/ATen/8/512/8/512/1                                       2093 ns       2093 ns     324760 GB/s=31.3061G/s
Concat2D2Input/NNC/1/160/1/14/1                                          694 ns        694 ns    1002902 GB/s=2.00692G/s
Concat2D2Input/NNC/1/580/1/174/1                                         852 ns        852 ns     803002 GB/s=7.08127G/s
Concat2D2Input/NNC/20/160/20/14/1                                       1639 ns       1639 ns     419683 GB/s=16.9828G/s
Concat2D2Input/NNC/20/580/20/174/1                                      5956 ns       5956 ns     117833 GB/s=20.2548G/s
Concat2D2Input/NNC/8/512/8/512/1                                        3136 ns       3136 ns     224122 GB/s=20.8958G/s
Concat2D2Input/NNCLoop/1/160/1/14/1                                      581 ns        581 ns    1209873 GB/s=2.39737G/s
Concat2D2Input/NNCLoop/1/580/1/174/1                                     614 ns        614 ns    1132332 GB/s=9.82955G/s
Concat2D2Input/NNCLoop/20/160/20/14/1                                   1091 ns       1091 ns     622952 GB/s=25.5247G/s
Concat2D2Input/NNCLoop/20/580/20/174/1                                  2399 ns       2399 ns     288376 GB/s=50.289G/s
Concat2D2Input/NNCLoop/8/512/8/512/1                                    1500 ns       1500 ns     478360 GB/s=43.6968G/s
Concat2D3Input/ATen/8/512/8/512/8/512/1                                 2584 ns       2584 ns     266394 GB/s=38.0397G/s
Concat2D3Input/NNC/8/512/8/512/8/512/1                                  5056 ns       5056 ns     139768 GB/s=19.4416G/s
Concat2D3Input/NNCLoop/8/512/8/512/8/512/1                              1917 ns       1917 ns     369626 GB/s=51.2758G/s
Concat2D7Input/ATen/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1         3888 ns       3888 ns     178124 GB/s=46.3571G/s
Concat2D7Input/NNC/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1         24639 ns      24638 ns      28336 GB/s=7.31481G/s
Concat2D7Input/NNCLoop/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1      3093 ns       3093 ns     226326 GB/s=58.265G/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52592

Reviewed By: bertmaher

Differential Revision: D26596701

Pulled By: navahgar

fbshipit-source-id: 650fa88febf4423ea49f5a1d3d734edc2294d257
2021-02-24 06:09:32 -08:00
b56f59ea20 Revert D26599390: [pytorch][PR] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py
Test Plan: revert-hammer

Differential Revision:
D26599390 (075bbe0d6a)

Original commit changeset: d822658076f7

fbshipit-source-id: 6c4421f4de99794ea66780175af549cef9410a20
2021-02-24 05:38:34 -08:00
958d9a8364 [fx/package] make GraphModules packageable (#51976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51976

FX serializes things by serializing Python code as a string and exec'ing
it on load. This accomplishes one goal (we don't have to pickle the
graph object directly) but breaks the pickle abstraction in ways that
are not composable with `torch.package`.

In particular:
1. `forward` is serialized by saving Python code. On load, it's installed
by `exec`ing that code. This `exec` call needs to have the right
importer installed, otherwise it will not import modules from the
`torch.package` but instead import from the Python environment.
2. Any types/functions used are emitted as `import` statement in the
generated Python code. These are effectively dynamic dependencies of the
`GraphModule` being saved, and need to be registered as such so that the
`PackageImporter` will package them.

To address these, this PR introduces a new protocol for the
importer/exporter: `__reduce_package__`.

A class can implement `__reduce_package__` to customize how it is placed
in the importer/exporter. It functions very similarly to `__reduce__`,
except:
- `__reduce_package__` takes one argument, which is the
`PackageExporter`
instance. Users can use this instance to save stuff to the package to
implement their serialization. `__reduce__` takes no args.
- Only the 2-element tuple version of the return value for `__reduce__`
is supported (this could be extended if necessary).
- When the reduction function is called on load, an additional argument
is added to the beginning of the args tuple. This is the
`PackageImporter`
instance doing the loading.

The `__reduce_package__` protocol is defined using `persistent_id` and
`persistent_load`, which ensures that we can still use the C pickle
implementation of the pickler by default.
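A hedged sketch of the protocol (hypothetical class and resource names; assuming `PackageExporter.save_text` and `PackageImporter.load_text`):

```python
class Config:
    def __init__(self, text):
        self.text = text

    def __reduce_package__(self, exporter):
        # use the exporter to stash whatever this object needs in the package
        exporter.save_text("configs", "my_config.txt", self.text)
        # 2-element tuple, as with __reduce__: (callable, args)
        return (load_config, ("configs", "my_config.txt"))

def load_config(importer, package, resource):
    # on load, the PackageImporter is prepended to the args tuple
    return Config(importer.load_text(package, resource))
```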

Pull Request resolved: #51971

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26340591

Pulled By: suo

fbshipit-source-id: 5872a7d22e832056399a7372bae8a57807717882
2021-02-23 22:43:00 -08:00
075bbe0d6a Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#51739)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51739

Reviewed By: bdhirsh

Differential Revision: D26599390

fbshipit-source-id: d822658076f7b08ebfde3dc9994159539490fda0
2021-02-23 22:30:37 -08:00
2d75346c25 [Gradient Compression] Add a minimum compression rate threshold for PowerSGD communication hook (#52541)
Summary:
Fixes #52034
- Add a minimum compression rate threshold to `PowerSGDState`
- Use the threshold to determine whether to compress high-rank tensors or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52541

Test Plan:
No performance regression using rank-8 compression:
baseline: f253000411
updated one: f253010955

Reviewed By: rohan-varma

Differential Revision: D26594862

Pulled By: SciPioneer

fbshipit-source-id: 2859a91b4ca6bd1862bf6cd6441dc2a89badb2d5
2021-02-23 22:03:02 -08:00
755c60bffc [PyTorch Mobile] Allow loading of all extra files using the extra_file argument (#52635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52635

Currently, the method `_load_for_mobile()` accepts an extra-files map named `extra_files` which serves as an in-out parameter, i.e. the caller fills in the keys of this map with the files under the `extra/` folder that it wishes to extract, and the method fills in the values of the `extra_files` map with the contents of those files.

In a specific case we have encountered, it is desirable to extract all the extra files so that they can be forwarded in an opaque manner into a `save_for_mobile()` call with the same set of extra files as during load.

This change adds a method `_get_all_archive_file_names()` which returns the names of all files in the `.ptl` archive. The caller can then extract the ones within the `extra/` directory and pass them in to the `extra_files` map argument.
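The in-out map pattern mirrors the one already exposed in Python via `torch.jit`; a sketch (file names are hypothetical):

```python
import torch

m = torch.jit.script(torch.nn.Linear(2, 2))
torch.jit.save(m, "model.pt", _extra_files={"metadata.json": "{}"})

# the caller lists the keys it wants; load fills in the values
extra_files = {"metadata.json": ""}
torch.jit.load("model.pt", _extra_files=extra_files)
print(extra_files["metadata.json"])  # the stored contents
```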

ghstack-source-id: 122356928

Test Plan: Added additional test + `buck test //xplat/caffe2:test_lite_interpreter`

Reviewed By: iseeyuan

Differential Revision: D26590027

fbshipit-source-id: 4dc30997929e132f319c32cb9435d8a40fe0db5e
2021-02-23 21:57:13 -08:00
0c455332e8 docs: add link to Tensor.share_memory_ in Module.share_memory (#52561)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52561

Reviewed By: malfet

Differential Revision: D26626012

fbshipit-source-id: 7aab44c60d1bcbda68012521ec852843864abc7f
2021-02-23 20:20:50 -08:00
238b0bbb68 Allow Transformer accept output result that is not Proxy (#52473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52473

Use `map_aggregate` to create the output for the new graph so that it won't raise an error when we have outputs that are not `Proxy`.
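For illustration, a small sketch of how `map_aggregate` walks an arbitrary structure (assuming `torch.fx.node.map_aggregate`):

```python
from torch.fx.node import map_aggregate

# applies the function to every leaf of nested tuples/lists/dicts,
# keeping the container structure intact
result = map_aggregate((1, [2, {"k": 3}]), lambda a: a * 10)
print(result)  # (10, [20, {'k': 30}])
```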

Test Plan: `test_transformer_multi_outputs` in `test_fx.py`

Reviewed By: jamesr66a

Differential Revision: D26502277

fbshipit-source-id: 404d9030a9b84db3f66f8505887a75717a28ad30
2021-02-23 19:28:37 -08:00
75f7b22025 Fix hipify_python (#52709)
Summary:
Two changes:
 - Print a warning rather than fail if creating hipified file fails with permission denied error
 - Do not attempt to create /usr/include/libpng/png_hip.h in the first place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52709

Reviewed By: walterddr

Differential Revision: D26625033

Pulled By: malfet

fbshipit-source-id: ff82dc24aee12eac2daaa6e5bc938811b49ebbc6
2021-02-23 19:19:13 -08:00
26419815af Modernize for-loops (#52330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52330

Test Plan: Sandcastle

Reviewed By: mruberry

Differential Revision: D26001961

fbshipit-source-id: e75cc8f1a8d30917b4d55df9e1a3c7836c271820
2021-02-23 17:32:33 -08:00
cyy
caa377f546 replace type().backend() with device() (#52558)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52558

Reviewed By: malfet

Differential Revision: D26616025

Pulled By: jbschlosser

fbshipit-source-id: ef9f3f42e830788c21feab533e192ba9c6eb8edb
2021-02-23 16:32:21 -08:00
b534466f01 [DataLoader] TransformsIterDataPipe (#52604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52604

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26581511

Pulled By: ejguan

fbshipit-source-id: c927726b7afba14586f16cde0237f2cef9080079
2021-02-23 15:47:27 -08:00
cabb1e7a94 Fix wrong TORCH_CHECK usages (#52670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52670

TORCH_CHECK followed by a string literal is a no-op, and from the text of the message it's clear that the authors intended those instances to be `TORCH_CHECK(false, "msg")`

Discovered while trying to figure out if tensor_offset can be negative in Resize.h

s/TORCH_CHECK\("/TORCH_CHECK(false, "/

Test Plan: Imported from OSS

Reviewed By: walterddr, janeyx99, mruberry

Differential Revision: D26607546

Pulled By: malfet

fbshipit-source-id: 661812da84adb1d1af0284da60c93ec4bf5ef08e
2021-02-23 14:47:51 -08:00
420fc42eab add OneDNN pooling backward (#49454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49454

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26006888

Pulled By: VitalyFedyunin

fbshipit-source-id: 6a4930982db784819fea70ffc9029441d673d90e
2021-02-23 14:45:55 -08:00
df30cb78d2 Remove unused variable (#52652)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52652

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D26589146

fbshipit-source-id: 704a93e479e1bf2420dd47589319a5438a2f92f1
2021-02-23 14:38:23 -08:00
d5ed57569b Move cuda9 and cuda11.2 CI jobs to a scheduled workflow (#52693)
Summary:
Moving master-only, resource-intensive CI jobs to a less regular basis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52693

Reviewed By: malfet, seemethere

Differential Revision: D26615060

Pulled By: janeyx99

fbshipit-source-id: def46a7890ea46c655ef2ee0f7c548171464cb48
2021-02-23 14:17:15 -08:00
ecf3ca00d8 [fx] Separate globals assignment from code generation (#51974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51974

Right now, when an FX `Graph` references an external object, we will emit
code like:

    import foo
    def forward(input: foo.bar.baz):
        ...

This is problematic in a world with `torch.package`, since the name
`foo.bar.baz` may reference a name from any number of packages.

This PR lays the groundwork for FX-package integration by separating the
resolution of external references from the generation of the function
code.

When generating a Graph's Python source, we keep track of all external
references and assign them unique names. At the end, we have a
dictionary mapping names -> actual objects. This becomes the `globals`
namespace we pass to `exec` when installing the forward function in a
`GraphModule`. This is nice because we can always be sure that `exec` is
seeing the same objects that were referenced from the `Graph`, no import
statements needed.

At serialization time, we use a `ModuleEnv` to resolve the globals dict
to a set of import statements that can be run to reproduce the `global`
namespace. This is only used on serialization/deserialization, and those
functions are expected to check that the import statements are producing
the correct results.

Concretely, the code above will now look like:

    from foo.bar import baz as foo_bar_baz
    def forward(input: foo_bar_baz):
        ...

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26340593

Pulled By: suo

fbshipit-source-id: fe247f75205d0a03fd067bdd0f95491e8edf1436
2021-02-23 13:48:03 -08:00
1cddb27f39 [FX acc]Store shape and dtype in serialized output node args (#52462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52462

This is the step one for supporting multiple outputs in fx nnpi path.

During serialization, we store the shape and dtype in the output args, so that the importer doesn't need to go back and find the nodes.

The output nodes will look like
```
                {
                    "target": "output",
                    "op_code": "output",
                    "name": "output",
                    "args": [
                        {
                            "is_node": true,
                            "name": "add_1",
                            "shape": "[1, 1]",
                            "dtype": "torch.float32"
                        }
                    ],
                    "kwargs": {}
                }
```

Test Plan: Doesn't break existing tests; will be tested in step two.

Reviewed By: jfix71

Differential Revision: D26500742

fbshipit-source-id: 755d2dec704d9da579af40e754b556d6c01aa796
2021-02-23 13:29:02 -08:00
e2afb269b8 [caffe2] add a Python test for SaveOp chunking
Summary:
Add a test in `load_save_test.py` that passes in a chunk_size parameter,
to ensure that we exercise the logic that passes the chunk size to the C++
serialization code.

Test Plan:
Ran the tests with the vlog level set to 3 and manually verified the log
messages showed that we were serializing in the expected chunks.
There are existing C++ tests that confirm chunking behavior works as expected
in the pure C++ code.

Reviewed By: mraway

Differential Revision: D26502578

fbshipit-source-id: cd0074f2358da81c68b0fed2c2a94818d83a957d
2021-02-23 11:52:13 -08:00
1c63cb2c0f Pass child error to parent in distributed tests. (#52632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632

Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:

```
Process 0 exited with error code 10
```

The child process also logs its own exception, but it is cumbersome to go
through the logs and track this down.

To alleviate this, I've added a pipe for each child process so that
the child process writes the error to the pipe before exiting and the parent
process can read the appropriate error from the pipe and display it.
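For illustration, a minimal sketch of the mechanism (hypothetical helper names, not the actual test harness):

```python
import multiprocessing as mp
import traceback

def _child(test_fn, conn):
    try:
        test_fn()
        conn.send(None)                    # success
    except Exception:
        conn.send(traceback.format_exc())  # ship the traceback to the parent
    finally:
        conn.close()

def failing_test():
    raise ValueError("ProcessGroupGloo::broadcast: invalid root rank: -1")

def run_in_subprocess(test_fn):
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=_child, args=(test_fn, child_conn))
    p.start()
    p.join()
    err = parent_conn.recv() if parent_conn.poll() else None
    if err is not None:
        raise RuntimeError(f"Process exited with exception:\n{err}")

if __name__ == "__main__":
    run_in_subprocess(failing_test)
```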

The new output printed by the parent is as follows:

```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1

Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
ghstack-source-id: 122273793

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26589274

fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
2021-02-23 11:50:25 -08:00
e3a805b9c5 Fake Quantization support for f16 and f32 (#52612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52612

Used the float-type macro to generalize the per-tensor fake_quantization functions to f16 and f64.
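A usage sketch, assuming the per-tensor op accepts half inputs after this change:

```python
import torch

x = torch.randn(4, dtype=torch.float16)
# fake-quantize to the int8 range while staying in fp16
# args: input, scale, zero_point, quant_min, quant_max
y = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
print(y.dtype)  # torch.float16
```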

Test Plan:
Added a test to show it works in AMP and extended the forward and backward tests below to cover float16 and float64 operations. Note: the reference function doesn't work with these types, so I had to convert in and back out of them to compare.

```
python test/test_quantization.py TestFakeQuantize.test_forward_backward_per_tensor_with_amp

python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cachemask_cpu

python test/test_quantization.py TestFakeQuantize.test_backwards_per_tensor_cachemask_cpu

python test/test_quantization.py TestFakeQuantize.test_forward_per_tensor_cachemask_cuda

python test/test_quantization.py TestFakeQuantize.test_backwards_per_tensor_cachemask_cuda

python test/test_quantization.py
```
Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26586416

fbshipit-source-id: 55fe83c5e47f45cd1de8ddd69bd4a5653ab6dc12
2021-02-23 10:49:12 -08:00
e658d7c37b Ignore user annotated ignored attributes (#52367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52367

This fixes https://github.com/pytorch/pytorch/issues/52217

Test Plan: Imported from OSS

Reviewed By: navahgar, gmagogsfm

Differential Revision: D26574411

Pulled By: tugsbayasgalan

fbshipit-source-id: 7eac097f5b97cfe65854bceca14d41c156cd6e0a
2021-02-23 10:40:44 -08:00
2680ff7759 Revert D26598115: [pytorch][PR] Update XNNPACK
Test Plan: revert-hammer

Differential Revision:
D26598115 (3721962c33)

Original commit changeset: d652bacdee10

fbshipit-source-id: 7e0128aa9b7691ecd323687da6f6054363b3174a
2021-02-23 10:27:43 -08:00
49b59e3472 Add OpInfo entries for i0 and logical_not (#51956)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51956

Reviewed By: albanD

Differential Revision: D26404440

Pulled By: mruberry

fbshipit-source-id: dd73e63155dd4a200afb38a5e566eb2132e69fde
2021-02-23 10:12:05 -08:00
dc6fab4452 Fix performance of CUDA trilinear interpolate backward (#52351)
Summary:
Close https://github.com/pytorch/pytorch/issues/51206

This PR basically reverts the CUDA launch configuration changes made in https://github.com/pytorch/pytorch/issues/48675, then only apply a `gpuAtomicAdd` -> `fastAtomicAdd` replacement in the CUDA kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52351

Reviewed By: seemethere

Differential Revision: D26597006

Pulled By: ngimel

fbshipit-source-id: 4a34a351a75c80f714e50cf6dae2c31ddb901ffe
2021-02-23 07:41:38 -08:00
3721962c33 Update XNNPACK (#52645)
Summary:
This update contains the fix to XNNPACK by kimishpatel
Add unit test that exposed the problem
Updated torchvision checkout to 0.9.0rc1 hash

Fixes https://github.com/pytorch/pytorch/issues/52463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52645

Reviewed By: kimishpatel, seemethere

Differential Revision: D26598115

Pulled By: malfet

fbshipit-source-id: d652bacdee10bb975fc445ab227de37022b8ef51
2021-02-23 06:59:57 -08:00
64847c7f0b [TensorExpr] Properly handle ExternalCalls in LoadStore analysis and Inliner. (#52628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52628

Prior to this change ExternalCalls were not considered as Loads or
Stores to/from their buffers, which led to incorrect behavior in inlining.
This PR fixes it.

Differential Revision: D26589378

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: cd69d5f7075f6dc756aabcf676842b9a250334d6
2021-02-22 21:50:48 -08:00
b63a1e31d3 [TensorExpr] Inlining: allow inlining into Load exprs. (#52627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52627

Currently the inliner only inlines into Calls; this PR extends this to
cover Loads too. Eventually we will remove Calls altogether and use
Loads everywhere; this is one step in that direction.

Differential Revision: D26589377

Test Plan: Imported from OSS

Reviewed By: asuhan

Pulled By: ZolotukhinM

fbshipit-source-id: ca28f0df2273eb214f203467c6ba3d8f02a8a3b6
2021-02-22 21:47:24 -08:00
67794b14bb Use int8_t instead of char in [load|store]_scalar` (#52616)
Summary:
Since `char` is not guaranteed to be signed on all platforms (it is unsigned on ARM)
Fixes https://github.com/pytorch/pytorch/issues/52146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52616

Test Plan: Run ` python3 -c "import torch;a=torch.tensor([-1], dtype=torch.int8);print(a.tolist())"` on arm-linux system

Reviewed By: walterddr

Differential Revision: D26586678

Pulled By: malfet

fbshipit-source-id: 91972189b54f86add516ffb96d579acb0bc13311
2021-02-22 21:11:18 -08:00
7ecc1b603a [TensorPipe] Update [Cpu|Cuda]Buffer fwd declarations (#52600)
Summary:
They've changed from class to struct in the tensorpipe repo, but have not
been updated in the header, which triggers a compiler warning if clang is
used and would have triggered a linker error if the same code were
compiled with MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52600

Reviewed By: lw

Differential Revision: D26579754

Pulled By: malfet

fbshipit-source-id: 800c02e7ba839bac01adf216de2d8547b7e9128b
2021-02-22 21:03:41 -08:00
fa8568184f [caffe2] Delete some unused fields from TensorProto (#52521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52521

The `storage_type` and `external_data` fields were added a few years ago in
D10246743 (30aaa07594) but don't appear to have been used anywhere.  Let's remove them to
help simplify the `TensorProto` message definition.
ghstack-source-id: 122110201

Test Plan: Confirmed the code still builds.

Reviewed By: dzhulgakov

Differential Revision: D26500028

fbshipit-source-id: 1e188f98f59e0b8673ea342ad9aaf7e5ba9b5fac
2021-02-22 20:51:27 -08:00
f111ec48c1 docs: add fractional_max_pool in nn.functional (#52557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52557

Reviewed By: bdhirsh

Differential Revision: D26591388

Pulled By: jbschlosser

fbshipit-source-id: 42643864df92ea014e69a8ec5c29333735e98898
2021-02-22 20:45:07 -08:00
6cfe55dea9 Add psutil to requirements.txt (#52285)
Summary:
psutil is used in many test scripts under test/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52285

Reviewed By: jbschlosser

Differential Revision: D26516673

Pulled By: malfet

fbshipit-source-id: 09a81d5dba3bf5189e3e5575c2095eb069b93ade
2021-02-22 20:07:07 -08:00
a59c4039e0 Fix undefined symbol for CUDA 11.1 Windows (#52506)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52467.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52506

Reviewed By: bdhirsh

Differential Revision: D26582788

Pulled By: seemethere

fbshipit-source-id: a03489449e0492ed023bf54aa9da194491f0e67f
2021-02-22 19:02:51 -08:00
a0652c8f08 [static runtime] Fix up deprecated exact equality in tests (#52617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52617

swaps `.equals` with `torch::allclose`

tests are broken right now

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled

Reviewed By: bertmaher, maratsubkhankulov, yinghai

Differential Revision: D26585079

fbshipit-source-id: 9bd2a7b87208301415a8925f95c69fe44accf159
2021-02-22 17:50:14 -08:00
7f4dff5496 docs: add FractionalMaxPool3d in pooling layers (#52556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52556

Reviewed By: smessmer

Differential Revision: D26593666

Pulled By: bdhirsh

fbshipit-source-id: 3d81d23fa70efa0f794dde47a34baad0aaa9c751
2021-02-22 17:04:09 -08:00
1865499d49 [Pytorch Mobile] Improve export_opnames Documentation (#52333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52333

The current documentation for `export_opnames` is a bit misleading. Change it to better clarify what it does.
ghstack-source-id: 121810264

Test Plan: n/a

Reviewed By: iseeyuan

Differential Revision: D26471803

fbshipit-source-id: 496d10b161c9a4076c4e12db8a0affafc4e1e359
2021-02-22 16:46:08 -08:00
108ec77fa7 [NNC] Added reductions to NNC python bindings. (#52492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52492

Reviewed By: bdhirsh

Differential Revision: D26575506

Pulled By: Chillee

fbshipit-source-id: 9a070f591a9709dab55dfff849184b1bcffc4fa5
2021-02-22 16:31:18 -08:00
3309f034aa remove pointless test (#52609)
Summary:
Fixes T81870118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52609

Reviewed By: mruberry

Differential Revision: D26584288

Pulled By: ngimel

fbshipit-source-id: 7cec37db46cfe5b5b2fd21fe7c3e3fcbb8aba049
2021-02-22 16:25:04 -08:00
fd5792f857 docs: add :nosignatures: in torch.jit (#52555)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52555

Reviewed By: ZolotukhinM

Differential Revision: D26573956

Pulled By: SplitInfinity

fbshipit-source-id: ce011c66ce771bc7e9357f98db9994d54faa7013
2021-02-22 16:19:07 -08:00
09fe753a33 Enable TCPStore fixed slow test (#52511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52511

Re-enable a test that was previously fixed but never re-enabled.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26586980

Pulled By: H-Huang

fbshipit-source-id: 3cfe21de09036d2b87273680dae351e20125e815
2021-02-22 16:07:37 -08:00
973e306c84 changed TE 'Allocate' API to take one argument 'Buf' instead of three arguments 'Var', 'dtype', 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
0bc57f47f0 torch.Package zipfile debugging printer (#52176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52176

Added tooling to print out zipfile structure for PackageExporter and PackageImporter.

API looks like:
```
exporter.print_file_structure("sss" /*only include files with this in the path*/)
importer3.print_file_structure(False /*don't print storage*/, "sss" /*only include files with this in the path*/)
```

The output looks like this with the storage hidden by default:
```
─── resnet.zip
    ├── .data
    │   ├── extern_modules
    │   └── version
    ├── models
    │   └── models1.pkl
    └── torchvision
        └── models
            ├── resnet.py
            └── utils.py
```
The output looks like this with the storage being printed out:
```
─── resnet_added_attr_test.zip
    ├── .data
    │   ├── 94574437434544.storage
    │   ├── 94574468343696.storage
    │   ├── 94574470147744.storage
    │   ├── 94574470198784.storage
    │   ├── 94574470267968.storage
    │   ├── 94574474917984.storage
    │   ├── extern_modules
    │   └── version
    ├── models
    │   └── models1.pkl
    └── torchvision
        └── models
            ├── resnet.py
            └── utils.py
```

If the output is filtered with the string 'utils' it'd looks like this:
```
─── resnet_added_attr_test.zip
    └── torchvision
        └── models
            └── utils.py
```

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26429795

Pulled By: Lilyjjo

fbshipit-source-id: 4fa25b0426912f939c7b52cedd6e217672891f21
2021-02-22 15:04:56 -08:00
b72a72a477 torch.Package extend PyTorchStreamWriter to track written records (#52218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52218

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26429794

Pulled By: Lilyjjo

fbshipit-source-id: 5f68e7991c673ada629d0370c705520243d0637a
2021-02-22 15:02:41 -08:00
a39b1c42c1 MHA: Fix regression and apply bias flag to both in/out proj (#52537)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52257

## Background
Reverts MHA behavior for `bias` flag to that of v1.5: flag enables or disables both in and out projection biases.

Updates type annotations for both in and out projections biases from `Tensor` to `Optional[Tensor]` for `torch.jit.script` usage.

Note: With this change, `_LinearWithBias` defined in `torch/nn/modules/linear.py` is no longer utilized. Completely removing it would require updates to quantization logic in the following files:
```
test/quantization/test_quantized_module.py
torch/nn/quantizable/modules/activation.py
torch/nn/quantized/dynamic/modules/linear.py
torch/nn/quantized/modules/linear.py
torch/quantization/quantization_mappings.py
```
This PR takes a conservative initial approach and leaves these files unchanged.

**Is it safe to fully remove `_LinearWithBias`?**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52537

Test Plan:
```
python test/test_nn.py TestNN.test_multihead_attn_no_bias
```

## BC-Breaking Note
In v1.6, the behavior of `MultiheadAttention`'s `bias` flag was incorrectly changed to affect only the in projection layer. That is, setting `bias=False` would fail to disable the bias for the out projection layer. This regression has been fixed, and the `bias` flag now correctly applies to both the in and out projection layers.
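A quick check of the restored behavior:

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, bias=False)
# bias=False once again disables both projection biases
print(mha.in_proj_bias)   # None
print(mha.out_proj.bias)  # None
```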

Reviewed By: bdhirsh

Differential Revision: D26583639

Pulled By: jbschlosser

fbshipit-source-id: b805f3a052628efb28b89377a41e06f71747ac5b
2021-02-22 14:47:12 -08:00
bfc2645981 [BE] force cmake to always generate version.py (#52477)
Summary:
Fix the issue that `add_custom_command(OUTPUT ...)` is only invoked when the target output is missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52477

Reviewed By: malfet

Differential Revision: D26538718

Pulled By: walterddr

fbshipit-source-id: 0fef40585a0f888dcbe268deb2e7a7a8d81e6aa1
2021-02-22 13:54:39 -08:00
2eb9c0832e Modernize for-loops in torch misc (#52452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52452

Test Plan: Sandcastle

Reviewed By: pritamdamania87

Differential Revision: D26520760

fbshipit-source-id: c13161324f24f553ad679308d0dc279ab178e129
2021-02-22 13:37:19 -08:00
947225cd1b update tracing codegen to use redispatch API (#52009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52009

Taking advantage of the new `redispatch` API to clean up the codegen'd tracing kernels. Instead of directly interacting with the Dispatcher, the tracing kernels now just call the `redispatch` API directly.

One small benefit to this: hopefully the compiler is more likely to inline `Dispatcher::redispatch()`, since it's now used in fewer call-sites. After this change, the only places it's used are:
- the `redispatch` API (`RedispatchFunctions.cpp`)
- BackendSelect kernels.

One small complication: the redispatch API doesn't interact too well with `manual_cpp_binding` ops currently. I put a note with some thoughts in the comments.

Example tracing kernel before:
```
Tensor add_Tensor(c10::DispatchKeySet ks, const Tensor & self, const Tensor & other, Scalar alpha) {
  torch::jit::Node* node = nullptr;
  std::shared_ptr<jit::tracer::TracingState> tracer_state;
  if (jit::tracer::isTracing()) {
    tracer_state = jit::tracer::getTracingState();
    at::Symbol op_name;
    op_name = jit::Symbol::fromQualString("aten::add");
    node = tracer_state->graph->create(op_name, /*num_outputs=*/0);
    jit::tracer::recordSourceLocation(node);
    jit::tracer::addInputs(node, "self", self);
    jit::tracer::addInputs(node, "other", other);
    jit::tracer::addInputs(node, "alpha", alpha);
    tracer_state->graph->insertNode(node);

    jit::tracer::setTracingState(nullptr);
  }
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::add", "Tensor")
      .typed<Tensor (const Tensor &, const Tensor &, Scalar)>();
  auto result = c10::Dispatcher::singleton()
      .redispatch<Tensor, const Tensor &, const Tensor &, Scalar>(op, ks & c10::DispatchKeySet(c10::DispatchKeySet::FULL_AFTER, c10::DispatchKey::Tracer), self, other, alpha);
  if (tracer_state) {
    jit::tracer::setTracingState(std::move(tracer_state));
    jit::tracer::addOutput(node, result);
  }
  return result;
}
```

after: (note the lack of `Dispatcher::` calls)
```
Tensor add_Tensor(c10::DispatchKeySet ks, const Tensor & self, const Tensor & other, Scalar alpha) {
  torch::jit::Node* node = nullptr;
  std::shared_ptr<jit::tracer::TracingState> tracer_state;
  if (jit::tracer::isTracing()) {
    tracer_state = jit::tracer::getTracingState();
    at::Symbol op_name;
    op_name = jit::Symbol::fromQualString("aten::add");
    node = tracer_state->graph->create(op_name, /*num_outputs=*/0);
    jit::tracer::recordSourceLocation(node);
    jit::tracer::addInputs(node, "self", self);
    jit::tracer::addInputs(node, "other", other);
    jit::tracer::addInputs(node, "alpha", alpha);
    tracer_state->graph->insertNode(node);

    jit::tracer::setTracingState(nullptr);
  }
  auto result = at::redispatch::add(ks & c10::DispatchKeySet(c10::DispatchKeySet::FULL_AFTER, c10::DispatchKey::Tracer), self, other, alpha);
  if (tracer_state) {
    jit::tracer::setTracingState(std::move(tracer_state));
    jit::tracer::addOutput(node, result);
  }
  return result;
}
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26356078

Pulled By: bdhirsh

fbshipit-source-id: bc96ca4c6d90903f1e265859160d4b13a8cc7310
2021-02-22 13:26:47 -08:00
80240d0888 update autograd kernels to use redispatch (#51363)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51363

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26153580

Pulled By: bdhirsh

fbshipit-source-id: 5d7905d2c39c9bb7f219e703940ed3eef5230491
2021-02-22 13:24:34 -08:00
6b8e670eb7 [CI][IOS] Add lite interpreter ios build job (#52567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52567

## Summary
As title, add libtorch (lite) in ios x86 Em and

## Test plan
In `config.yml`, remove `context: org-member`:
```
      - pytorch_ios_build:
          build_environment: pytorch-ios-12.0.0-arm64_lite_interpreter_build
          context: org-member
          ios_arch: arm64
          ios_platform: OS
          lite_interpreter: "1"
          name: pytorch_ios_12_0_0_arm64_lite_interpreter_build
```
The build is:

https://app.circleci.com/pipelines/github/pytorch/pytorch/276113/workflows/49fa2f6e-c978-424b-9177-bbe313955876/jobs/11050851

**Build** step finishes successfully:
![image](https://user-images.githubusercontent.com/16430979/108619899-d183b080-73dc-11eb-809d-a21f811cf821.png)

It fails **Run Build Test** because of missing `IOS_DEV_TEAM_ID`

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D26572842

Pulled By: cccclai

fbshipit-source-id: 9d868ac7e94af37ef90212b754e91d98c0d20b30
2021-02-22 13:17:41 -08:00
067fd78f05 add RECORD_FUNCTION to grad_sum_to_size (#52516)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52516

Reviewed By: bdhirsh

Differential Revision: D26582645

Pulled By: Krovatkin

fbshipit-source-id: f3aa7d959cc31fc6fd6f8a38c36488b01cc1a515
2021-02-22 12:53:45 -08:00
09c56ef45e Remove DepTracker from LoopNest (#52405)
Summary:
Remove the dependency tracker that works on Tensors, DepTracker, from LoopNest. This is essential to the goal of removing Tensors from LoopNest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52405

Reviewed By: heitorschueroff

Differential Revision: D26548621

Pulled By: navahgar

fbshipit-source-id: b20f23d608c19ac71aebd31c14777d653eead36c
2021-02-22 12:48:07 -08:00
847d1d4d53 add debug_flush_compilation_cache to Method (#52317)
Summary:
Forgot to add `debug_flush_compilation_cache` to `Method` as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52317

Reviewed By: bdhirsh

Differential Revision: D26583313

Pulled By: Krovatkin

fbshipit-source-id: 1b3e503950cc3314796aff53b3b8038d16767870
2021-02-22 12:31:09 -08:00
783b5c0c9f op_whitelist -> op_allowlist (#52150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52150

Renames "whitelist" to "allowlist" to conform to company use standards, prevent critical errors raised by linters which detect the old usage, and to move toward more self-descriptive terminology.

Test Plan: Sandcastle

Reviewed By: suo

Differential Revision: D26405520

fbshipit-source-id: 9c3a41591d4e29c0197de9a8f5858c9c29271e26
2021-02-22 12:23:42 -08:00
03ae6d9903 Remove useless _allgather_then_aggregate_hook (#52593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52593

This hook is not used at all, and it probably can only be used for demonstrating that allgather is slower than allreduce, so it should never be used in practice.

However, this hook and its helper function sit alongside the communication hook public APIs in the same file, and it is better to keep the public API file as concise as possible.

Since I don't think we will use this hook in the future, I prefer deleting it to moving it to a separate file.
ghstack-source-id: 122180969

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26575318

fbshipit-source-id: b258154a7c92e33236c34104bd79bc244ecdb158
2021-02-22 12:12:53 -08:00
ad3319cbc2 fractional_max_pool{2/3}d : Fix segfaults for incorrect kernel_size and output_size (#51626)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50967

TODO:

* [x] Add test for `fractional_max_pool3d` similar to `fractional_max_pool2d` (since there is no test for the same).

Needs Resolution:
* [ ] ASAN failure on the newly added 3d variant test. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673756
* [ ] Failing gradcheck on MacOS. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51626

Reviewed By: jbschlosser

Differential Revision: D26514064

Pulled By: heitorschueroff

fbshipit-source-id: e2cc57585dbc3a08c7f24591b202e0fabfd2a459
2021-02-22 12:06:36 -08:00
116d402200 Skip handle_r_to_c for dot & vdot (#52474)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52455.

**Summary:**
`dot` & `vdot` operate on tensors of the same `dtype`,  so skip `handle_r_to_c` for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52474

Reviewed By: bdhirsh

Differential Revision: D26570931

Pulled By: anjali411

fbshipit-source-id: 07c6c50e3550e521d1807c519154b028d9168de7
2021-02-22 12:00:08 -08:00
4386a3803c Replace all ASSERTM in serialization (#51756)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51756

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26280320

Pulled By: bdhirsh

fbshipit-source-id: ddba1fe46b9f39234f010aac9cdf198e82727f84
2021-02-22 11:29:10 -08:00
014d2123a3 Replace all AT_ASSERTM in ATen (#51677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51677

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26280321

Pulled By: bdhirsh

fbshipit-source-id: cef273e45ba7167ae240b85410ca7a3913ad54b4
2021-02-22 11:27:08 -08:00
d02a2bd5d1 codegen'd API for redispatching (#52008)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52008

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26356079

Pulled By: bdhirsh

fbshipit-source-id: 1fd34fbb4dbc48cc8390cad99e30e0d04fc75a4f
2021-02-22 10:55:38 -08:00
ed71cbdd39 Revert PR 52483 "[reland][complex] masked_fill (#52587)
Summary:
Revert "[reland][complex] `masked_fill`: Complex Autograd support update masked_scatter skips. (https://github.com/pytorch/pytorch/issues/52483)"

This reverts commit b6cf17deeeb526b0dfee5434c96223debe62c506.

Reference: https://github.com/pytorch/pytorch/pull/52483#issuecomment-783023560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52587

Reviewed By: anjali411

Differential Revision: D26579741

Pulled By: malfet

fbshipit-source-id: 9b53c8aab51d844d0f65393609861a4ff72ef7bb
2021-02-22 10:53:37 -08:00
57637e0ab4 port upsample_nearest3d and upsample_trilinear3d to structured (#52065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52065

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26373027

Pulled By: bdhirsh

fbshipit-source-id: 76b283ea8142732ffc8f7b200a8494349739e326
2021-02-22 10:38:52 -08:00
d659477ae0 port upsample_bilinear2d and upsample_bicubic2d to structured (#52012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52012

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26356329

Pulled By: bdhirsh

fbshipit-source-id: 8f974224799493e3172fe5dff3fbd43af8c09722
2021-02-22 10:38:48 -08:00
f3ea5ca672 port upsample_linear1d to structured (#51917)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51917

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26327750

Pulled By: bdhirsh

fbshipit-source-id: 443ad278010ce655eb5f08fa6889c45ccb328268
2021-02-22 10:38:43 -08:00
c78a4a52d2 remove unnecessary/dead code in upsample_nearest1d cuda kernel (#51916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51916

After getting ported to structured kernels, the vector overloads of `upsample_nearest1d` are DefaultBackend kernels, meaning they are backend agnostic. We can kill their CUDA-specific implementations.

I also removed a few redundant checks in the cuda kernels that are now performed by the meta shape-checking function.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26327749

Pulled By: bdhirsh

fbshipit-source-id: b5a17e14237fb36236d4079433f99c71cd3beef3
2021-02-22 10:36:47 -08:00
ee04cd9587 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26575694

fbshipit-source-id: afb40c6c12126b4f3cb8ea2ffc526b1d817f5471
2021-02-22 04:41:17 -08:00
c2b9283d4a [PyTorch Mobile] Use real if constexpr behind macro in hot template (copy D26154964 in a different setting) (#52420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52420

Inspired by D26154964 (6e1a5b1196), I'm basically going to just blindly copy the change that swolchok has made since it promises to reduce compile time, and who doesn't want faster compiles! I haven't actually checked if it has any impact on build time, but I have come to trust what swolchok does.

In addition, swolchok observed a size reduction with the change, which I assume happens when the `constexpr` is true since the lambda is invoked and possibly needs to be compiled in. When tracing based selective build is enabled, many many many of these will be enabled, and this will use valuable size. This change is required to get the maximum bang for our buck. In addition, I'll look into making the lambda not capture all arguments by ref via the ref-capture `[&]` directive.
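
For illustration, a minimal sketch of the technique (hypothetical function; in the real change the `if constexpr` sits behind a macro so older toolchains keep a fallback): a real `if constexpr` discards the untaken branch at compile time, so nothing is instantiated or carried into the binary for it, unlike the lambda-based emulation where the callable still has to be materialized.

```
#include <string>
#include <type_traits>

template <typename T>
std::string describe(const T& value) {
  // Pre-C++17 emulation wraps each branch in a generic lambda and only
  // invokes it when the condition holds; the compiler still has to
  // materialize the lambda machinery either way.
  // With real if constexpr, the false branch is discarded outright:
  if constexpr (std::is_arithmetic_v<T>) {
    return std::to_string(value);  // stand-in for the "selected" path
  } else {
    return "<non-arithmetic>";     // stand-in for the stripped path
  }
}
```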

I can probably have an entire half's worth of impact by copying Scott's changes and mirroring them in other parts of the PyTorch codebase lol.

#accep2ship
ghstack-source-id: 122178246

Test Plan: Build

Reviewed By: iseeyuan

Differential Revision: D26506634

fbshipit-source-id: b91d5e4773ade292fddce8dddd7e5ba1e5afeb29
2021-02-21 23:51:47 -08:00
d177654981 [Take-2] [PyTorch Mobile] 15KiB size reduction by reducing MaxTypeIndex from 256 to 32 (#52466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52466

`MaxTypeIndex` controls the size of the array

```
detail::TypeMetaData* TypeMeta::typeMetaDatas() {
  static detail::TypeMetaData instances[MaxTypeIndex + 1]
```

in `typeid.cpp`.

In practice, I have seen that this array doesn't hold more than 18 elements once the PyTorch library has been initialized (in mobile unit tests). I couldn't find situations where elements may be added to this array post library initialization.

There is a runtime check to prevent array overflow, so reducing the size of the storage shouldn't come at any additional risk from the perspective of loss in visibility of errors.

The fact that this array is statically allocated ends up using a bunch of space in the binary (potentially to initialize the trailing elements?). I'm somewhat surprised by this. However, this change registered a 15KiB size win on both fbios as well as igios.

Found this when I was looking at a bloaty run that I shared with smessmer on friday: https://www.internalfb.com/intern/everpaste/?handle=GLXImQisHOfT74EBAKw47V3ktuAzbsIXAAAB

I initially thought that the methods being passed in to the constructor of `detail::TypeMetaData` were causing the size increase, but only later realized the issue after reading the following helpful comment:

```
// The remainder of the array is padded with TypeMetaData blanks.
// The first of these is the entry for ScalarType::Undefined.
// The rest are consumed by CAFFE_KNOWN_TYPE entries.
```

This change was originally reverted at https://www.internalfb.com/diff/D26525208 due to an ONNX test failure. Re-trying the change gated under `C10_MOBILE`.
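
A sketch of the kind of gating described, with values assumed from the message (the real diff may differ):

```
#include <cstdint>

// typeid.h (sketch): shrink the table only on mobile, so the server
// (ONNX-exercising) build keeps the original capacity.
#ifdef C10_MOBILE
constexpr uint16_t MaxTypeIndex = 32;   // ~18 entries observed in practice
#else
constexpr uint16_t MaxTypeIndex = 256;  // original capacity
#endif
```
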
ghstack-source-id: 122178181

Test Plan:
Sandcastle runs + the following BSB runs.

### igios

```
D26299594 (9e54532947)-V1 (https://www.internalfb.com/intern/diff/D26299594 (9e54532947)/?dest_number=121221891)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: +596 B
Change in Uncompressed Size for arm64 + 3x assets variation: -15.8 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:443632243487886@base/bsb:443632243487886@diff/
```

### fbios

```
D26299594 (9e54532947)-V1 (https://www.internalfb.com/intern/diff/D26299594 (9e54532947)/?dest_number=121221891)

fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: +104 B
Change in Uncompressed Size for arm64 + 3x assets variation: -15.7 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:169233698063125@base/bsb:169233698063125@diff/
```

Reviewed By: iseeyuan

Differential Revision: D26527921

fbshipit-source-id: f019e5fd37e6caf24c58c6f144bedcda942d7164
2021-02-21 23:49:48 -08:00
d491fc6d48 [PyTorch] Add comment to unify macro and rename one macro (#52573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52573

## Summary
Address comments in https://github.com/pytorch/pytorch/pull/52540
1. Add a comment to indicate that the macros `BUILD_LITE_INTERPRETER` and `C10_MOBILE` will be unified.
2. Rename the macro `DBUILD_LITE_INTERPRETER` to `BUILD_LITE_INTERPRETER`

## Test plan
1. `MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ USE_CUDA=0 DEBUG=1 MAX_JOBS=16 BUILD_LITE_INTERPRETER=1  python setup.py develop`
2. `/Users/chenlai/pytorch/cmake-build-debug/bin/test_lite_interpreter_runtime --gtest_filter=* --gtest_color=no`

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26572742

Pulled By: cccclai

fbshipit-source-id: c8895fcfe8dd893f8157913f110e2ba025fc3955
2021-02-21 21:53:55 -08:00
e677b71056 Add support for pow (#52374)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18627
Adds pow support for JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52374

Test Plan: python test/test_jit.py -k test_torch_pow

Reviewed By: heitorschueroff

Differential Revision: D26555070

Pulled By: nikithamalgifb

fbshipit-source-id: 0d325f09cf893e4ae50277a95a6b7ad67d94f342
2021-02-21 19:55:58 -08:00
d819a21692 Support any (#52360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18627
Adds torch.any support for JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52360

Test Plan:
python test/test_jit.py -k test_torch_any
python test/test_jit.py -k test_any

Reviewed By: tugsbayasgalan

Differential Revision: D26550626

Pulled By: nikithamalgifb

fbshipit-source-id: 36c2ae15e3bfb7b32bbf442818c879b0d2120cf1
2021-02-21 15:49:57 -08:00
14f7bf0629 [PyTorch] update CMake to build libtorch lite (#51419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51419

## Summary

1. Add an option `BUILD_LITE_INTERPRETER` in `caffe2/CMakeLists.txt` and set `OFF` as default.
2. Update 'build_android.sh' with an argument to switch `BUILD_LITE_INTERPRETER`, 'OFF' as default.
3. Add a mini demo app `lite_interpreter_demo` linked with `libtorch` library, which can be used for quick test.

## Test Plan
Built lite interpreter version of libtorch and test with Image Segmentation demo app ([android version](https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation)/[ios version](https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation))

### Android
1. **Prepare model**: Prepare the lite interpreter version of the model by running the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl`
```
import torch

model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True)
model.eval()

scripted_module = torch.jit.script(model)
# Export full jit version model (not compatible with the lite interpreter); kept here for comparison
scripted_module.save("deeplabv3_scripted.pt")
# Export lite interpreter version model (compatible with lite interpreter)
scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl")

```
2. **Build libtorch lite for android**: Build libtorch for android for all 4 android abis (armeabi-v7a, arm64-v8a, x86, x86_64) with `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh`. This pr is tested on a Pixel 4 emulator with x86, so use the cmd `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86` to specify the abi and save build time. After the build finishes, it will show the library path:
```
...
BUILD SUCCESSFUL in 55s
134 actionable tasks: 22 executed, 112 up-to-date
+ find /Users/chenlai/pytorch/android -type f -name '*aar'
+ xargs ls -lah
-rw-r--r--  1 chenlai  staff    13M Feb 11 11:48 /Users/chenlai/pytorch/android/pytorch_android/build/outputs/aar/pytorch_android-release.aar
-rw-r--r--  1 chenlai  staff    36K Feb  9 16:45 /Users/chenlai/pytorch/android/pytorch_android_torchvision/build/outputs/aar/pytorch_android_torchvision-release.aar
```
3. **Use the PyTorch Android libraries built from source in the ImageSegmentation app**: Create a folder 'libs' in the path; the path from the repository root will be `ImageSegmentation/app/libs`. Copy `pytorch_android-release` to the path `ImageSegmentation/app/libs/pytorch_android-release.aar`. Copy `pytorch_android_torchvision` (downloaded from [here](https://oss.sonatype.org/#nexus-search;quick~torchvision_android)) to the path `ImageSegmentation/app/libs/pytorch_android_torchvision.aar`. Update the `dependencies` part of `ImageSegmentation/app/build.gradle` to
```
dependencies {
    implementation 'androidx.appcompat:appcompat:1.2.0'
    implementation 'androidx.constraintlayout:constraintlayout:2.0.2'
    testImplementation 'junit:junit:4.12'
    androidTestImplementation 'androidx.test.ext:junit:1.1.2'
    androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0'

    implementation(name:'pytorch_android-release', ext:'aar')
    implementation(name:'pytorch_android_torchvision', ext:'aar')

    implementation 'com.android.support:appcompat-v7:28.0.0'
    implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
}
```
Update `allprojects` part in `ImageSegmentation/build.gradle` to
```

allprojects {
    repositories {
        google()
        jcenter()
        flatDir {
            dirs 'libs'
        }
    }
}
```
4. **Update model loader api**: Update `ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java` by
4.1 Add new import: `import org.pytorch.LiteModuleLoader;`
4.2 Replace the way to load pytorch lite model
```
//            mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt"));
            mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl"));
```
5. **Test app**: Build and run the ImageSegmentation app in Android Studio,
![image](https://user-images.githubusercontent.com/16430979/107696279-9cea5900-6c66-11eb-8286-4d1d68abff61.png)

### iOS
1. **Prepare model**: Same as Android.
2. **Build libtorch lite for ios** `BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1   ./scripts/build_ios.sh`
3. **Remove Cocoapods from the project**: run `pod deintegrate`
4. **Link ImageSegmentation demo app with the custom built library**:
Open your project in XCode, go to your project Target’s **Build Phases - Link Binaries With Libraries**, click the **+** sign and add all the library files located in `build_ios/install/lib`. Navigate to the project **Build Settings**, set the value **Header Search Paths** to `build_ios/install/include` and **Library Search Paths** to `build_ios/install/lib`.
In the build settings, search for **other linker flags**. Add a custom linker flag below
```
-all_load
```
Finally, disable bitcode for your target by selecting the Build Settings, searching for Enable Bitcode, and setting the value to No.

5. **Update library and api**
5.1 Update `TorchModule.mm`
To use the custom built libraries in the project, replace `#import <LibTorch/LibTorch.h>` (in `TorchModule.mm`), which is needed when using LibTorch via Cocoapods, with the code below:

```
//#import <LibTorch/LibTorch.h>
#include "ATen/ATen.h"
#include "caffe2/core/timer.h"
#include "caffe2/utils/string_utils.h"
#include "torch/csrc/autograd/grad_mode.h"
#include "torch/script.h"
#include <torch/csrc/jit/mobile/function.h>
#include <torch/csrc/jit/mobile/import.h>
#include <torch/csrc/jit/mobile/interpreter.h>
#include <torch/csrc/jit/mobile/module.h>
#include <torch/csrc/jit/mobile/observer.h>
```
5.2 Update `ViewController.swift`
```
//        if let filePath = Bundle.main.path(forResource:
//            "deeplabv3_scripted", ofType: "pt"),
//            let module = TorchModule(fileAtPath: filePath) {
//            return module
//        } else {
//            fatalError("Can't find the model file!")
//        }
        if let filePath = Bundle.main.path(forResource:
            "deeplabv3_scripted", ofType: "ptl"),
            let module = TorchModule(fileAtPath: filePath) {
            return module
        } else {
            fatalError("Can't find the model file!")
        }
```

### Unit test
Add `test/cpp/lite_interpreter`, with one unit test `test_cores.cpp` and a light model `sequence.ptl` to test `_load_for_mobile()`, `bc.find_method()` and `bc.forward()` functions.

### Size:
**With the change:**
Android:
x86: `pytorch_android-release.aar` (**13.8 MB**)

IOS:
`pytorch/build_ios/install/lib` (lib: **66 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 135016
-rw-r--r--  1 chenlai  staff   3.3M Feb 15 20:45 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   965K Feb 15 20:45 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Feb 15 20:45 libclog.a
-rw-r--r--  1 chenlai  staff    42K Feb 15 20:45 libcpuinfo.a
-rw-r--r--  1 chenlai  staff    39K Feb 15 20:45 libcpuinfo_internals.a
-rw-r--r--  1 chenlai  staff   1.5M Feb 15 20:45 libeigen_blas.a
-rw-r--r--  1 chenlai  staff   148K Feb 15 20:45 libfmt.a
-rw-r--r--  1 chenlai  staff    44K Feb 15 20:45 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Feb 15 20:45 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Feb 15 21:19 libtorch.a
-rw-r--r--  1 chenlai  staff    **60M** Feb 15 20:47 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
 14M	include
 66M	lib
2.8M	share
```

**Master (baseline):**
Android:
x86: `pytorch_android-release.aar` (**16.2 MB**)

IOS:
`pytorch/build_ios/install/lib` (lib: **84 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 172032
-rw-r--r--  1 chenlai  staff   3.3M Feb 17 22:18 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   969K Feb 17 22:18 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Feb 17 22:18 libclog.a
-rw-r--r--  1 chenlai  staff    42K Feb 17 22:18 libcpuinfo.a
-rw-r--r--  1 chenlai  staff   1.5M Feb 17 22:18 libeigen_blas.a
-rw-r--r--  1 chenlai  staff    44K Feb 17 22:18 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Feb 17 22:18 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Feb 17 22:19 libtorch.a
-rw-r--r--  1 chenlai  staff    78M Feb 17 22:19 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
 14M	include
 84M	lib
2.8M	share
```

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26518778

Pulled By: cccclai

fbshipit-source-id: 4503ffa1f150ecc309ed39fb0549e8bd046a3f9c
2021-02-21 01:43:54 -08:00
a935118c90 Fix caffe2 to use MaybeAlign when using LLVM trunk
Summary: LLVM trunk (at version 13) uses a different type for `CreateAlignedStore` and `CreateAlignedLoad`, so usage here is updated to reflect this.

Test Plan:
buck build mode/opt-clang-thinlto sigrid/predictor/v2:sigrid_remote_predictor -c cxx.extra_cxxflags="-Wforce-no-error -fbracket-depth=300" -c cxx.profile="fbcode//fdo/autofdo-bolt-compatible/sigrid/predictor/v2/sigrid_remote_predictor:autofdo-bolt-compatible" -c cxx.modules=False

Previously:
caffe2/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:1079:21: error: no matching member function for call to 'CreateAlignedLoad'
      value_ = irb_.CreateAlignedLoad(vaddr, 4);
               ~~~~~^~~~~~~~~~~~~~~~~
third-party-buck/platform009/build/llvm-fb/include/llvm/IR/IRBuilder.h:1681:13: note: candidate function not viable: no known conversion from 'int' to 'llvm::MaybeAlign' for 2nd argument
  LoadInst *CreateAlignedLoad(Value *Ptr, MaybeAlign Align,

Now:
Passes

Differential Revision: D26562330

fbshipit-source-id: dbf9ca5247ccd4351861995c2c5480a7cc55c202
2021-02-20 23:12:00 -08:00
a61a8d059e Restore fast path in OnnxifiOp::adjustOutputBatchSize (#52498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52498

If `max_shape[dim]` equals `real_shape[dim]`, we shouldn't need to adjust that dim in terms of output slicing. Consider the case when we have output compiled at [10, 4] and the real input is [5, 4]: we only need to adjust the outermost dim (10->5); for the second dim, we don't need to do anything. Thus this should fall to the fast path.
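
A minimal sketch of the fast-path condition being described (hypothetical helper, not the actual OnnxifiOp member):

```
#include <cstdint>
#include <vector>

// Inner dims that already match need no slicing; if they all match,
// at most the outermost (batch) dim differs, e.g. [10, 4] compiled
// vs [5, 4] real, and the cheap batch-only adjustment suffices.
bool needsFullAdjustment(const std::vector<int64_t>& max_shape,
                         const std::vector<int64_t>& real_shape) {
  for (size_t dim = 1; dim < max_shape.size(); ++dim) {
    if (max_shape[dim] != real_shape[dim]) {
      return true;  // an inner dim shrank: slow path with full slicing
    }
  }
  return false;  // fast path: only the batch dim (dim 0) may differ
}
```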

Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi
```

Reviewed By: khabinov

Differential Revision: D26542773

fbshipit-source-id: 0475e0a1c35be6f28ccc63dc69cb0b5acf695141
2021-02-20 13:50:07 -08:00
65bfa1389d [PyTorch Mobile] Do not create a static variable in Dispatcher::singleton() (#52447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52447

Currently, `Dispatcher::singleton()` is always inlined. Additionally, `Dispatcher::singleton()` contains a static variable, which means that the generated code calls `__cxa_guard_acquire` and `__cxa_guard_release`, which help implement exactly-once semantics for the initialization of the `static Dispatcher& s` variable. For `C10_MOBILE`, we should not create the additional static ref within the inlined function, to save binary size, since it results in a lot of additional code being generated by the compiler. The `Dispatcher::singleton()` method is called from the generated method stubs for all aten operators that are code-generated, and potentially also from other operators that hand off execution to the kernel function for the right backend via the PyTorch Dispatcher.

This is a classic space/time (efficiency) tradeoff, so feedback would be welcome. kimishpatel, I'll need your expertise in figuring out how to perf-test this change, specifically for mobile.

Here's the godbolt link in case you wish to check out the generated code for a `static` variable within a function: https://godbolt.org/z/cdsG3v
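
A simplified sketch of the tradeoff (the real class lives in c10 and the helper name is assumed):

```
class Dispatcher {
 public:
  static Dispatcher& singleton() {
#ifdef C10_MOBILE
    // Mobile: no function-local static here, so call sites don't inline
    // the __cxa_guard_acquire/release init check; they pay one extra
    // non-inlined call instead.
    return realSingleton();
#else
    // Server: cache the reference locally so the hot path stays fully
    // inlined after the first call.
    static Dispatcher& s = realSingleton();
    return s;
#endif
  }

 private:
  static Dispatcher& realSingleton();  // defined once, out of line
};

// In exactly one .cpp file:
Dispatcher& Dispatcher::realSingleton() {
  static Dispatcher instance;
  return instance;
}
```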

{F375631117}
ghstack-source-id: 121989311

Test Plan:
Build + BSB

### lightspeed-messenger

*Divide the number below by 2*

```
D26507049-V1 (https://www.internalfb.com/intern/diff/D26507049/?dest_number=121944956)

messenger-experimental-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -21.7 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -65.4 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:243392763936025@base/bsb:243392763936025@diff/
```

### igios

```
D26507049-V1 (https://www.internalfb.com/intern/diff/D26507049/?dest_number=121944956)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -15.6 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -34.3 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:882756935844095@base/bsb:882756935844095@diff/
```

### fbios-pika

```
D26507049-V1 (https://www.internalfb.com/intern/diff/D26507049/?dest_number=121944956)

fbios-pika: Succeeded
Change in Download Size for arm64 + 3x assets variation: -8.6 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -29.1 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:832297083999539@base/bsb:832297083999539@diff/
```

Reviewed By: swolchok

Differential Revision: D26507049

fbshipit-source-id: 0d2f55ea2d42a0782fb69aabfa517f2ec60c8036
2021-02-20 13:01:19 -08:00
597c9f8b22 fix zero_point rounding for _fake_quantize_learnable_per_channel_affine (#52290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52290

_fake_quantize_learnable_per_channel_affine should allow taking a non-integer zero_point as input, and perform rounding and clamping before doing forward/backward. In this diff, we make _fake_quantize_learnable_per_channel_affine round and clamp zero_point beforehand, as in _fake_quantize_learnable_per_tensor_affine.
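
A sketch of the pre-processing described (hypothetical helper, not the actual ATen kernel):

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// A learnable, possibly fractional zero_point is rounded and then
// clamped into the quantized range before the forward/backward math.
int64_t prepare_zero_point(double zero_point, int64_t qmin, int64_t qmax) {
  const int64_t rounded = static_cast<int64_t>(std::nearbyint(zero_point));
  return std::min(std::max(rounded, qmin), qmax);
}
```
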
ghstack-source-id: 122148099

Test Plan: `buck test mode/dev-nosan -c fbcode.platform=platform009 //caffe2/test:quantization -- test_learnable`

Reviewed By: raghuramank100

Differential Revision: D26446342

fbshipit-source-id: fc9b6832fa247cc9d41265eb4fd1575a2d2ed12c
2021-02-20 12:21:03 -08:00
15892a651f [PyTorch Mobile] Create compile time string for passing in to the exception message instead of 4 arguments that will be concatenated at runtime (#52303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52303

swolchok did some stellar work in D26372806 (22b12179db) (and friends) to simplify exception handling code-paths and outline uncommon code paths. In addition, non-inlined versions of exception handling functions were provided, but only for specific cases where 1 (or 2?) arguments were passed in to the exception throwing macros.

This change hopes to take advantage of that infrastructure and only pass in a single `const char*` to `AT_ERROR` to leverage any current (or future) optimizations that may take place in this space.

Since this isn't yet in production, it won't have a size impact. However, my guess is that it will be a significant size win once we turn on tracing based selective build, because the exception code path will be present in every kernel function multiple times over, given that most dtypes will be unselected.
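
A minimal before/after sketch of the pattern (hypothetical functions; `AT_ERROR` is the real macro from `c10/util/Exception.h`):

```
#include <c10/util/Exception.h>

// Variadic form: several arguments get concatenated into a std::string
// at the throw site, dragging string-building code into every caller.
void fail_with_runtime_concat(const char* op, int dtype) {
  AT_ERROR("Operator ", op, " does not support dtype ", dtype);
}

// Single-literal form: one const char* can be handed straight to an
// outlined throw helper, keeping the (cold) call site as small as possible.
void fail_with_literal() {
  AT_ERROR("Operator does not support the requested dtype");
}
```
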
ghstack-source-id: 122149806

Test Plan: Build + auto-generated unit tests for tracing based selective build.

Reviewed By: swolchok

Differential Revision: D26463089

fbshipit-source-id: 349160a37d43d629249b92fa24f12b5bd128df1c
2021-02-20 10:54:24 -08:00
a62b0deae0 [pytorch] make is_tracing scriptable (#49853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49853

fix https://github.com/pytorch/pytorch/issues/47379

Test Plan: buck test mode/dev-nosan //caffe2/test:jit -- 'test_script_is_tracing'

Reviewed By: SplitInfinity

Differential Revision: D25704315

fbshipit-source-id: 33c09c5bc1f1b62ef254f58e18ab1e951dbd1790
2021-02-20 02:53:28 -08:00
d9161d6da3 Optimize setDebugName time complexity (#52346)
Summary:
`setDebugName` maintains an invariant that all debug names of values in the same graph must be distinct. This is achieved by appending numeric suffixes to requested debug names. However, the implementation was slow (O(N^2)) when there were a lot of name conflicts. This PR fixes the problem by adding more book-keeping logic so that the time complexity is brought down to O(1) on average.
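
A sketch of the bookkeeping idea (simplified; the real Value/Graph code differs): remember the next free numeric suffix per base name, so each conflicting request resolves in O(1) amortized instead of probing "x", "x.1", "x.2", ... linearly.

```
#include <cstddef>
#include <string>
#include <unordered_map>

class DebugNames {
 public:
  std::string assign(const std::string& requested) {
    auto [it, inserted] = used_.emplace(requested, 0);
    if (inserted) {
      return requested;  // first use: the plain name is free
    }
    std::size_t& counter = it->second;  // stays valid across rehashes
    std::string candidate;
    do {  // usually runs once; the loop guards against a taken "x.N"
      candidate = requested + "." + std::to_string(++counter);
    } while (!used_.emplace(candidate, 0).second);
    return candidate;
  }

 private:
  // base name -> last numeric suffix handed out for it
  std::unordered_map<std::string, std::size_t> used_;
};
```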

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52346

Reviewed By: SplitInfinity

Differential Revision: D26564462

Pulled By: gmagogsfm

fbshipit-source-id: 3260fc3b436f1b0bcb45fdd2d1ec759b5828263f
2021-02-20 01:38:43 -08:00
cyy 53373a8e8c remove deprecated function (#52426)
Summary:
In order to remove the annoying compilation warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52426

Reviewed By: glaringlee

Differential Revision: D26525718

Pulled By: SplitInfinity

fbshipit-source-id: 0a46389bd8e27e77250ca9501125c6ffc4b5d45b
2021-02-20 00:19:39 -08:00
bb34fd6191 [DataLoader] Fix util ImportError (#52459)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52459

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26523941

Pulled By: ejguan

fbshipit-source-id: 8cd84982348687cf84fe5e821f51fbac43a783fa
2021-02-19 20:28:36 -08:00
1c64f862f6 Update vec_mergee operand specifiers (_vecb) (#52091)
Summary:
Patch needed in order to build on ppc64le with the g++ V7 compiler (without the fix, it only works with compiler V8 as the minimum).

Fixes https://github.com/pytorch/pytorch/issues/51592

To be clear, credit where due:
I tested this patch on a ppc64 RHEL container using gcc/g++ 7.4 compiler to ensure a complete pytorch build was successful -- and it was. However, I do not take credit for this patch.  I found and reported the issue, but the full brainpower to identify the cause of the error and the appropriate solution and thus the credit for this fix truly belongs to quickwritereader (and I am just helping with the legwork to integrate it after having tested it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52091

Reviewed By: ejguan

Differential Revision: D26494943

Pulled By: glaringlee

fbshipit-source-id: 0babdb460db5047c54144f724466b77dd2d8a364
2021-02-19 20:04:14 -08:00
72f9b3c8d5 [StaticRuntime] Add function to check for memory leak (#52342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52342

Reviewed By: yinghai

Differential Revision: D26420826

fbshipit-source-id: 4023f80fadd21e192afa485d96acd37c845146be
2021-02-19 19:45:09 -08:00
ef8d17e112 [DDP] Separate error messages for unused params in forward and not all outputs (#52391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52391

There are 2 ways DDP can throw the exception refactored here -
1) Unused params in the forward pass. We provide `find_unused_parameters=True` for this.
2) All params used in fwd pass, but not all outputs used in loss computation. There are a few workarounds for this but we do not provide native support.

Previously, these 2 issues were combined into 1 error message but that has historically resulted in confusion, with users reporting getting this error even when they enable `find_unused_parameters=True` (which they expect to fix this error). As a result there is additional churn to debug these issues because the true cause (1) vs (2) is not known.

This commit helps to fix the issue by separating out the 2 error messages depending on if we ran with unused parameter detection or not. Hopefully this should make the error message much more clear and actionable.

error msg with `find_unused_params=True`:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely  means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
error msg without `find_unused_params` specified:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
ghstack-source-id: 122097900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26496688

fbshipit-source-id: 4a9eeeda10293da13d94a692d10cb954e4506d7c
2021-02-19 17:09:22 -08:00
a3e693789f [quant][graphmode][fx] Enable test for non quantized input for cat (#52414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52414

When the input is not quantized, we'll still quantize cat as requested by the qconfig, even though
it might be slower

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26503554

fbshipit-source-id: 29d7c136711a12c124791c10ae436b61c1407668
2021-02-19 16:56:41 -08:00
8fe6d17847 Moving 11.2 CI to master only (#52536)
Summary:
Moves the 11.2 linux and windows builds to master only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52536

Reviewed By: walterddr

Differential Revision: D26557925

Pulled By: janeyx99

fbshipit-source-id: 28b8018112c159e2ae259dc00884c17796951c90
2021-02-19 15:33:03 -08:00
09516d2d0c Reenables skipped tests for all CUDA versions except 11.2 (#52359)
Summary:
This PR adds functionality to skip a test based on CUDA version.

This way, we can be more specific when skipping a test, such as when the test only fails for a particular CUDA version.

This allows us to add back the skipped tests for CUDA 11.2 for other CUDA versions, such as 10.1 and 11.1.

I tested this locally (by using 11.0 instead of 11.2), but will run all the CI to make sure it works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52359

Reviewed By: walterddr

Differential Revision: D26487951

Pulled By: janeyx99

fbshipit-source-id: 45c71cc6105ffd9985054880009cf68ea5ef3f6a
2021-02-19 15:30:55 -08:00
626756ac39 [quant][graphmode][api] debug --> reference (#52179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52179

Rename debug to reference. We'll use this to produce a reference quantized model
that can be used as a common interface between pytorch quantized model and backends.

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26424656

fbshipit-source-id: a0299b023f6ba7d98f5750724c517b0ecb987b35
2021-02-19 14:20:01 -08:00
941ebecc54 [glow aot] Support --onnxifi_min_ops in AOT flow (#52380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52380

Reviewed By: jfix71, ChunliF

Differential Revision: D26455464

fbshipit-source-id: 4def6192a8898d4bbe407b819207b80f262b4721
2021-02-19 13:56:54 -08:00
db33afbf9f Change cmake to allow building with MLC kick-off build (#51326)
Summary:
- Allows the build process to build with MLC enabled if the subrepo folder mlc is in the path and we can link against ML Compute on macOS Big Sur
- To build with MLC enabled you will need to clone the mlc repo inside the pytorch repository.
- We need both this change and https://github.com/pytorch/pytorch/pull/50634 on pytorch/pytorch to enable the `mlc` device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51326

Reviewed By: glaringlee

Differential Revision: D26533138

Pulled By: malfet

fbshipit-source-id: 0baa06b4eb2d62dbfc0f6fc922096cb0db1cc7d1
2021-02-19 13:04:25 -08:00
0c0de542be [quant][graphmode][fx] Guard the supported quantization type for add/mul (#52413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52413

TODO: We'll need to add this guard for other ops as well

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_mul_add_fp16_config

Imported from OSS

Reviewed By: supriyar

Differential Revision: D26503348

fbshipit-source-id: 5aaba518742a516cc3521fd5f23f1a264d2973e2
2021-02-19 12:56:22 -08:00
7cd9892f83 [PyTorch] Sync TORCH_INTERNAL_ASSERT optis with TORCH_CHECK (#52226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52226

This gets TORCH_INTERNAL_ASSERT to parity with TORCH_CHECK in terms of optimization for 0 or 1 argument.
ghstack-source-id: 121877054

(Note: this ignores all push blocking failures!)

Test Plan:
Compare generated assembly for
```
#include <c10/util/Exception.h>

void f(bool b) {
  TORCH_INTERNAL_ASSERT(b, "message");
}

void g(bool b) {
  TORCH_INTERNAL_ASSERT(b);
}

void h(bool b) {
  TORCH_INTERNAL_ASSERT(b, "message", random());
}
```

before/after this diff.
Before: P174916324
After: P174916411

Before, f and g called out to outlined lambdas to build
std::strings. After, they load string constants and call
torchInternalAssertFail. Similarly, h calls random() and c10::detail::_str_wrapper() inline and then calls out to torchInternalAssertFail. As with D26380783 (efbb854ed8), I hope to solve the problem of outlining the random & _str_wrapper calls separately.

Profile AdIndexer benchmark & verify that toTensor() is still inlined (it calls TORCH_INTERNAL_ASSERT with an integer argument, like `h` above).

Reviewed By: bhosmer

Differential Revision: D26410575

fbshipit-source-id: f82ffec8d302c9a51f7a82c65bc698fab01e1765
2021-02-19 12:45:40 -08:00
566f7c79d3 [c10] Take advantage of c10::str optis for simple CAFFE_ENFORCE (#52223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52223

After the previous diffs, `c10::str()` will return a
`CompileTimeEmptyString` when passed 0 arguments and a `const char*` when
passed 1 `const char *` argument. We can take advantage of this to
outline further std::string creation from CAFFE_ENFORCE.
ghstack-source-id: 121877053

(Note: this ignores all push blocking failures!)

Test Plan:
Compare assembly for
```
#include <c10/util/Logging.h>

void f(bool b) {
  CAFFE_ENFORCE(b);
}

void g(bool b) {
  CAFFE_ENFORCE(b, "message");
}

void h(bool b) {
  CAFFE_ENFORCE(b, "message", random());
}
```

before & after this diff.

before: P174902847
after: P174902912

f & g are clearly much improved, and h is about the same.

(I tried measuring caffe2 perf on the AdIndexer MergeNet benchmark, but didn't see a win, which makes sense because the change is small.)

Reviewed By: bhosmer

Differential Revision: D26405181

fbshipit-source-id: c51a9e459ae7d9876494a83ade6f6fe725619512
2021-02-19 12:45:35 -08:00
d6755934fa [PyTorch] Make c10::str(const char*) return const char* (#52222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52222

`c10::str()` is often used with variadic macros. It can be more efficient to get a C string out if you put a C string in, like if you are able to defer std::string creation to an outlined function or even never do it at all. Meanwhile, there is an implicit conversion from const char* to std::string, so users who expected a std::string will still make one.
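
A simplified sketch of the overload set this enables (the real code in c10 has more cases; `CompileTimeEmptyString` is the name mentioned above):

```
#include <sstream>
#include <string>

struct CompileTimeEmptyString {
  operator const char*() const { return ""; }
};

// Zero arguments: a tag type, so no std::string is ever built.
inline CompileTimeEmptyString str() { return {}; }

// One C string: passed through untouched; callers that wanted a
// std::string still get one via the implicit const char* conversion.
inline const char* str(const char* s) { return s; }

// General case: only here do we pay for stream-based string building.
template <typename... Args>
std::string str(const Args&... args) {
  std::ostringstream oss;
  (oss << ... << args);  // C++17 fold over all arguments
  return oss.str();
}
```
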
ghstack-source-id: 121877052

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: bhosmer

Differential Revision: D26419663

fbshipit-source-id: 400bef71e6a0004b5914f5f511ea0e04e0d7599b
2021-02-19 12:43:03 -08:00
b6cf17deee [reland][complex] masked_fill: Complex Autograd support and update masked_scatter skips. (#52483)
Summary:
Reland https://github.com/pytorch/pytorch/issues/52035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52483

Reviewed By: heitorschueroff

Differential Revision: D26545097

Pulled By: anjali411

fbshipit-source-id: f154c239183279be381a7393a8226778b36148bb
2021-02-19 12:36:49 -08:00
44ff79d849 Automatically set BUILD_SPLIT_CUDA for cpp exts (#52503)
Summary:
Fixes https://github.com/pytorch/vision/pull/3418#issuecomment-781673110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52503

Reviewed By: malfet

Differential Revision: D26546857

Pulled By: janeyx99

fbshipit-source-id: a100b408e7cd28695145a1dda7f2fa081bb7f21f
2021-02-19 12:22:55 -08:00
b6ed05130e Adding a flag to enable CPU fusion in benchmarks (#48612)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48612

Test Plan: python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion element

Reviewed By: heitorschueroff

Differential Revision: D26548643

Pulled By: navahgar

fbshipit-source-id: adb537818d77c9b6b0fe434ae6d963a5f348ad24
2021-02-19 12:11:06 -08:00
bfb007a438 Example LSTMCell (#51983)
Summary:
Fixes #51801
LSTMCell example updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51983

Reviewed By: agolynski

Differential Revision: D26467104

Pulled By: zou3519

fbshipit-source-id: 31c8bf89b21cd2f748b2cc28a74169082d81503c
2021-02-19 11:49:31 -08:00
c9c4b871a5 [pytorch] reintroduce static dispatch (#51957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51957

This is a simplified version of #51554.

Compared to #51554, this version only supports statically dispatching to
a specific backend. The benefit is that it skips the dispatch key
computation logic and thus has less framework overhead. The downside is that
if input tensors do not match the specified backend, it will throw an error
instead of falling back to regular dispatch.

Sample code:
```
Tensor empty(IntArrayRef size, TensorOptions options, c10::optional<MemoryFormat> memory_format) {
    return at::cpu::empty(size, options, memory_format);
}

// aten::conj(Tensor(a) self) -> Tensor(a)
Tensor conj(const Tensor & self) {
    return at::math::conj(self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_out(Tensor & out, const Tensor & self) {
    return at::cpu::conj_out(out, self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_outf(const Tensor & self, Tensor & out) {
    return at::cpu::conj_out(out, self);
}

// aten::_conj(Tensor self) -> Tensor
Tensor _conj(const Tensor & self) {
    return at::defaultbackend::_conj(self);
}
```

For ops without the specific backend dispatch, it will throw error:
```
// aten::_use_cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank) -> bool
bool _use_cudnn_ctc_loss(const Tensor & log_probs, const Tensor & targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t blank) {
    TORCH_CHECK(false, "Static dispatch does not support _use_cudnn_ctc_loss for CPU.");
}
```

Differential Revision: D26337857

Test Plan: Imported from OSS

Reviewed By: bhosmer

Pulled By: ljk53

fbshipit-source-id: a8e95799115c349de3c09f04a26b01d21a679364
2021-02-19 11:41:39 -08:00
28e3dfdcca [JIT] Allow __exit__ to have a return value (#52336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52336

**Summary**
In Python, the boolean interpretation of the return value of `__exit__` of objects that are used as context managers with `with` statements is used to determine
whether or not to propagate exceptions thrown inside the body of the with
statement. This latter feature is not possible to add to TorchScript at
the moment, but the current requirement that `__exit__` not have any
return values can make it difficult to script a context manager whose
`__exit__` *does* have a return value.

Accordingly, this commit removes the requirement that `__exit__` must
not have any return value. TorchScript does not interpret this return
value in the same way Python does (or at all), but this should make it
easier to share context managers between eager mode and script code.

**Test Plan**
This commit adds a return value to one of the context managers used in
`TestWith`.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26504910

Pulled By: SplitInfinity

fbshipit-source-id: 2ab635a24d111ac25df4e361b716be8fada5128e
2021-02-19 11:32:47 -08:00
bcd77cece4 [JIT] Display an error message when with item is not an object (#52335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52335

**Summary**
`with` statements can only be used with objects that have `__enter__`
and `__exit__` defined. At present, any attempt to use an expression
that returns something that is not an instance of a class type results
in a cryptic internal assert failure instead of a useful error message.
This is because the code that generates IR for `with` statements uses
`Type::expect` as if it were `Type::cast`; that is, as if it returns
`nullptr` on failure.

This commit fixes this issue by checking the `kind()` of the type of the
expression used as the with item before calling `expect<ClassType>()` on
it.
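
One way to express the check, sketched with the nullable `cast<>` (the actual commit inspects `kind()`; the error text is illustrative):

```
#include <torch/csrc/jit/frontend/error_report.h>
#include <torch/csrc/jit/ir/ir.h>

void checkWithItem(const torch::jit::Value* item) {
  // cast<> returns nullptr on mismatch, so a non-class with item
  // becomes a user-facing error instead of an internal assert.
  auto cls = item->type()->cast<c10::ClassType>();
  if (!cls) {
    throw torch::jit::ErrorReport(item->node()->sourceRange())
        << "the expression used as a with item must be an object "
        << "with __enter__ and __exit__ methods";
  }
  // From here, item->type()->expect<c10::ClassType>() is safe.
}
```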

**Test Plan**
This commit adds a unit test to `test_with_errors` to test this case.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26504909

Pulled By: SplitInfinity

fbshipit-source-id: 92d108e0c010370fd45131a57120f50c0b85c401
2021-02-19 11:30:48 -08:00
338d2eca4a [quant][graphmode][fx] Enable test for non quantized input for add/mul (#52412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52412

When the input is not quantized, we'll still quantize add/mul

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26503347

fbshipit-source-id: 457b3444c50e5b49b911b04c67684f5eead78ec9
2021-02-19 11:08:27 -08:00
49a923c8b5 [ONNX] Update LayerNorm symbolic to handle autocasting (#52199) (#52350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52350

When onnx export creates a 0-dim tensor of constant type, this action overrides the type promotion logic as quoted in #9515. In order to prevent this from happening, this PR adds the following functionality:
if the data type is a floating point type, it is converted to a 0-dim double tensor; otherwise, it is converted to a 0-dim tensor of its original type.
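
A sketch of the stated rule (hypothetical helper; the real logic lives in the ONNX symbolic):

```
#include <ATen/ATen.h>

// 0-dim floating-point constants are widened to double so they cannot
// pin the promoted type; everything else keeps its original dtype.
at::Tensor promoteZeroDimConstant(const at::Tensor& t) {
  if (t.dim() == 0 && c10::isFloatingType(t.scalar_type())) {
    return t.to(at::kDouble);
  }
  return t;
}
```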

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26490325

Pulled By: SplitInfinity

fbshipit-source-id: 4c47c69c9b6523d2e45b74c2541d6d8ca7e28fc9
2021-02-19 10:57:15 -08:00
26e8f8f223 [ONNX] Update fuseLogSoftmaxNllLoss function to handle autocasting (#51729) (#52349)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52349

Adds a check for cases with autocasting enabled in which a cast node is inserted before the NegativeLogLikelihoodLoss
node, causing these patterns not to be recognized by the peephole pass function.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26490326

Pulled By: SplitInfinity

fbshipit-source-id: 4a6d806acc51b4696fd3932734d55af075fba6b1
2021-02-19 10:57:10 -08:00
12cbd6975a [ONNX] Fix for sequence of mutations in blocks (#51577) (#52347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52347

Fixes consecutive mutations in a tensor inside blocks.
Also adds support for append and pop in blocks.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26490328

Pulled By: SplitInfinity

fbshipit-source-id: f0cdc706d2793e1f4eb0503d3e0f63f4127ea47a
2021-02-19 10:55:05 -08:00
08017f4598 Add explicit cudart_static dependency for cublas_static (#52509)
Summary:
Fixes the following error during static linking by enforcing that the cudart dependency is placed after cublasLt:
```
/usr/bin/ld: /usr/local/cuda/lib64/libcublasLt_static.a(libcublasLt_static.a.o): undefined reference to symbol 'cudaStreamWaitEvent@libcudart.so.11.0'
/usr/local/cuda/lib64/libcudart.so: error adding symbols: DSO missing from command line
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52509

Reviewed By: janeyx99

Differential Revision: D26547622

Pulled By: malfet

fbshipit-source-id: 4e17f18cf0ab5479a549299faf2583a79fbda4b9
2021-02-19 10:45:49 -08:00
752d808fa0 Trace linear as aten::linear (#51897)
Summary:
https://github.com/pytorch/pytorch/pull/51613 made `torch.nn.functional.linear` compile as `aten::linear`, extend the same behavior with tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51897

Reviewed By: albanD

Differential Revision: D26320711

Pulled By: eellison

fbshipit-source-id: a26d3c37323a0706313c6ebb210bad60eec6a64b
2021-02-19 10:20:42 -08:00
d5ac929b62 [package] Introduce Importer to manage module namespace collisions. (#51975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51975

See comments in code.

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26340592

Pulled By: suo

fbshipit-source-id: 61b16bafad15e19060710ad2d8487c776d672847
2021-02-19 10:06:04 -08:00
76e8324370 [package] rename ex/importer.py to package_ex/importer.py (#52320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52320

as title

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26468416

Pulled By: suo

fbshipit-source-id: 890eecea76426918daff900402fbcbc149e48535
2021-02-19 10:04:14 -08:00
bc6852c192 Change TCPStore world_size and is_master to be optional (#51809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51809

Changes to TCPStore which will make world_size and is_master optional parameters for initialization.

API before change:
```python
# arguments: host_name, port, world_size, is_master, timeout=300s
server_store = dist.TCPStore("127.0.0.1", 0, 2, True)
client_store = dist.TCPStore("127.0.0.1", 0, 2, False)
```

API after change:
```python
# arguments: host_name, port, world_size=-1, is_master=False, timeout=300s
server_store = dist.TCPStore("127.0.0.1", 0, is_master=True)
client_store = dist.TCPStore("127.0.0.1", 0)
```

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D26461770

Pulled By: H-Huang

fbshipit-source-id: 5b2157029c73e8706e158cd49ecce60c9f3a7f41
2021-02-19 09:56:51 -08:00
9699c703c2 Stable sort for the CPU take 2. (#51790)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38681.
A duplicate of https://github.com/pytorch/pytorch/pull/50052, created so it can be imported into the fb internal tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51790

Reviewed By: agolynski

Differential Revision: D26279045

Pulled By: glaringlee

fbshipit-source-id: 348e171dee9c370a76002b65d0c82c329f57a421
2021-02-19 09:28:57 -08:00
5fda3b094c Add conj OpInfo and fix out inconsistency (#52059)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
Fixes: https://github.com/pytorch/pytorch/issues/51949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52059

Reviewed By: ailzhang

Differential Revision: D26373800

Pulled By: anjali411

fbshipit-source-id: d2c92263a690072c0f23cb60885be42eebea48c6
2021-02-19 08:18:55 -08:00
8094e4844d [ROCm] Enable test_jit_c10.py tests for ROCm (#52410)
Summary:
Re-enabling these test cases for ROCm because they are passing.

jeffdaily

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52410

Reviewed By: glaringlee

Differential Revision: D26516757

Pulled By: malfet

fbshipit-source-id: 49921ee724a50f19afd8e6884a5f3ecd9291fa5c
2021-02-19 08:11:04 -08:00
dbeda994db Update FindvecLib.cmake for macOS 10.14, 10.15 and Big Sur (#51288)
Summary:
When compiling libtorch on macOS there is the option to use the `vecLib` BLAS library from Apple's [Accelerate](https://developer.apple.com/documentation/accelerate) framework. Recent versions of macOS have changed the location of veclib.h; this change adds the new locations to `FindvecLib.cmake`.

To test, run the following command:
```
BLAS=vecLib python setup.py install --cmake --cmake-only
```

The choice of BLAS library is confirmed in the output:
```
-- Trying to find preferred BLAS backend of choice: vecLib
-- Found vecLib: /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51288

Reviewed By: jbschlosser

Differential Revision: D26531136

Pulled By: malfet

fbshipit-source-id: ce86807ccbf66973f33b3acb99b7f40cfd182b9b
2021-02-19 08:04:10 -08:00
93c4067f25 [BE] Cleanup UnaryOpsKernel.cpp (#52444)
Summary:
Delete the unused `dispatchtypes` argument of `IMPLEMENT_FLOAT_KERNEL` and `IMPLEMENT_COMPLEX_KERNEL`.
Move the common part of the above-mentioned macros into the `IMPLEMENT_ITERATOR_LAMBDA` macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52444

Reviewed By: walterddr

Differential Revision: D26517032

Pulled By: malfet

fbshipit-source-id: f03f89602f14fb513c66f3f2a96596e4c1e4cd16
2021-02-19 07:56:43 -08:00
b71215a909 Revert D26515596: [pytorch][PR] Add support for pow
Test Plan: revert-hammer

Differential Revision:
D26515596 (83feaebfc3)

Original commit changeset: 0c25a8eba8ed

fbshipit-source-id: 1a206f0b2923d922911fdaa5448a4e3a844ac5c4
2021-02-19 07:29:37 -08:00
7ca9776874 Fixed _out variants of linear algebra functions (#51560)
Summary:
This PR modifies the behavior of `_out` variants to match the description here https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR result and input tensors must be on the same device and have the same "type kind".

I skipped `qr` and `eig` in this process as they require a bit more work.

Functions that can use the provided storage directly do so. If `result` is not empty and not in the batched column-major format, or does not have the same type as the input, then we have to allocate a temporary tensor and copy it.
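
A rough sketch of the out= contract described, under assumed helper semantics (hypothetical op; the real linalg kernels test the layout differently):

```
#include <ATen/ATen.h>

at::Tensor& my_linalg_out(const at::Tensor& input, at::Tensor& result) {
  TORCH_CHECK(result.device() == input.device(),
              "out tensor must be on the same device as the input");
  TORCH_CHECK(result.scalar_type() == input.scalar_type(),
              "out tensor must have the same dtype as the input");
  // Batched column-major: the last two dims are transpose-contiguous.
  const bool reusable = result.sizes().equals(input.sizes()) &&
                        result.transpose(-2, -1).is_contiguous();
  if (reusable) {
    result.copy_(input);             // stand-in for computing into result
  } else {
    at::Tensor tmp = input.clone();  // compute into a temporary...
    result.resize_(tmp.sizes());
    result.copy_(tmp);               // ...then pay one extra copy
  }
  return result;
}
```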

TODO:

- [x] Add more tests for same device and valid safe dtype
- [x] Move inv and solve changes to separate PRs https://github.com/pytorch/pytorch/pull/51968, https://github.com/pytorch/pytorch/pull/51977

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51560

Reviewed By: albanD

Differential Revision: D26400734

Pulled By: heitorschueroff

fbshipit-source-id: a6201ed7e919c1670c6ff3ef60217d1dbfb72e67
2021-02-19 04:03:35 -08:00
df3d1d9378 [RPC] delete torch/csrc/utils/future.h (#51698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51698

Completely eliminates torch::utils::Future as we are now full relying on JitFuture.
ghstack-source-id: 122037612

Test Plan: CI

Reviewed By: kiukchung

Differential Revision: D26243735

fbshipit-source-id: 95010a730f9d35e618f74c5f9de482738cd57c15
2021-02-19 01:02:04 -08:00
3b11822825 [RPC] Refactor rref_context to not use utils::Future (#51697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51697

Refactors the rest of rref_context, specifically the pendingOwners map and `getOwnerRRef`, to use JitFuture.
ghstack-source-id: 122037611

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D26243268

fbshipit-source-id: ab8874c8253274e8fe50dcd7291e0655a8f3f1df
2021-02-19 00:59:38 -08:00
d0795ab358 log newly added construction and runtime stats at randomly selected iterations (#51394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51394

log newly added construction and runtime stats at randomly selected iterations
ghstack-source-id: 121934040

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26161885

fbshipit-source-id: add6e02c1a03e6f74f08b9a9aecf90fa81631d60
2021-02-19 00:15:04 -08:00
c75fa39b6c add stats that can only be collected at runtime (#51386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events to be created and recorded on multiple devices
2. use the at::cuda::event API for safer calls
3. events may not be created in the autograd hook if the hook is not triggered in the user's code, e.g., the user runs in non-sync mode in some iterations. So we check whether events were created before synchronizing, and also skip invalid results.
4. users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
2021-02-19 00:13:11 -08:00
0c46b6b3f6 [DDP] Enhance warning for find_unused_params (#52385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52385

This warning should specify that we did not find unused params in the
_forward_ pass, which is when we log this warning. This is to avoid confusion
when we get an error because not all outputs were used to compute loss, which
also raises an error about unused parameters (to be fixed in the next diff)
ghstack-source-id: 122001929

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26494136

fbshipit-source-id: d9b41732ea7e5e31b899d590d311080e3dc56682
2021-02-18 23:36:08 -08:00
c29e279f72 [DDP] unittest for when params arent used in backward pass (#52384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52384

Adds a simple UT with unittest that we can modify when we enable DDP backward without needing all parameters to get gradients.
ghstack-source-id: 122001930

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26482479

fbshipit-source-id: c80bdeea7cf9db35390e385084ef28d64ed239eb
2021-02-18 23:34:16 -08:00
4ee5bc74d3 [DataLoader] Change signature of Functional DataPipe (#52458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52458

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26523282

Pulled By: ejguan

fbshipit-source-id: c7358fc351f859617754a27b8a701d11ada5d61a
2021-02-18 23:30:58 -08:00
3adc8f8cf7 Enable min & max for Float16 & BFloat16 (#51244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50790.

Added `min()` & `max()` support for `Float16` & `BFloat16`.
CUDA already supported these ops on `Float16`, so the other three combinations had to be enabled.
`OpInfo`s for `min` & `max` were also added, and their sample inputs were removed from `method_tests()`.

### MORE INFO
The (slightly) long-term goal is to add dispatch for `min()` & `max()` related operations on CPU & CUDA for `Float16` & `BFloat16`,
wherever they aren't present already:
1. `amin()`
2. `argmax()`
3. `amax()`
4. `argmin()`
5. `torch._aminmax()`
6. `torch.clamp()` on CPU. Was already supported on CUDA
7. `min()` (in this PR)
8. `max()` (in this PR)
9. `minimum()`
10. `maximum()`

I'll submit separate PRs for the other ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51244

Reviewed By: jbschlosser

Differential Revision: D26503455

Pulled By: anjali411

fbshipit-source-id: c32247f214e9272ca2e4322a23337874e737b140
2021-02-18 23:13:51 -08:00
fb9f89507a [quant][graphmode][fx] Fix fp16 dynamic quant for functional linear (#52369)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52369

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26491425

fbshipit-source-id: d2c2a70bf1bc43ac2b63ac4cf9ae9c07887f12e9
2021-02-18 23:05:30 -08:00
83feaebfc3 Add support for pow (#52374)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18627
Adds pow support for JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52374

Test Plan: python test/test_jit.py -k test_torch_pow

Reviewed By: Lilyjjo

Differential Revision: D26515596

Pulled By: nikithamalgifb

fbshipit-source-id: 0c25a8eba8ed93291c5e447e863edac2a35b61fb
2021-02-18 23:03:28 -08:00
d8b28579c3 Add NNC support for aten::hardtanh (a hot operation in mobilenet v2/v3) (#52394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52394

Test Plan:
Imported from OSS

test/test_tensorexpr.py
test/test_jit_fuser_te.py

Reviewed By: bertmaher

Differential Revision: D26497856

Pulled By: huiguoo

fbshipit-source-id: 8558f89826cad250da6f970bfc49384f2b9d7ee0
2021-02-18 22:56:03 -08:00
f4c33edb45 Add onnxifi interface for set/get options (#52388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52388

Pull Request resolved: https://github.com/pytorch/glow/pull/5364

This allows us to change global variables through onnxifi calls, and adds python bindings along with it. Note that we supply a dummy backend_id, as it's not needed by glow due to the setting being global.

#codemod

Test Plan:
```
buck test mode/dev //glow/fb/test:test_onnxifi_optionnnpi
```

Reviewed By: jfix71, khabinov

Differential Revision: D26481652

fbshipit-source-id: 19b8201c77f653cf7d93ad68760aa7fb5ec45ff4
2021-02-18 20:12:34 -08:00
82548f3a00 [ROCm] missing template declarations for complex blas (#52472)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52472

Reviewed By: jbschlosser

Differential Revision: D26533896

Pulled By: anjali411

fbshipit-source-id: 55503028d5e087fc91992b417836cc87eb60ad55
2021-02-18 19:12:12 -08:00
65f6e665e6 Improvements for FX tracer (#52232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52232

Pull Request resolved: https://github.com/pytorch/glow/pull/5327

Reviewed By: gcatron

Differential Revision: D26355583

fbshipit-source-id: f062e0b3a9cadf1584738bed85e9964b9a63efaf
2021-02-18 18:53:05 -08:00
bb7e07ce8e [glow] Extending AOT config with two more fields (#5359)
Summary: Pull Request resolved: https://github.com/pytorch/glow/pull/5359

Reviewed By: ChunliF

Differential Revision: D26468908

fbshipit-source-id: 16c4f4215f302c023d75c204b999f23ed6254aa1
2021-02-18 16:08:55 -08:00
e0b6252de0 [ROCm] Enable test_ddp_hooks.py test cases (#52403)
Summary:
Re-enabling these test cases for ROCm because they are passing.

jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52403

Reviewed By: jbschlosser, SciPioneer

Differential Revision: D26516727

Pulled By: malfet

fbshipit-source-id: 6c70805eda39b0aadfbeb30a527af3906d2da867
2021-02-18 15:51:18 -08:00
89bc9a58e2 Add arm64 binary build (#52443)
Summary:
This is getting tested by https://github.com/pytorch/pytorch/issues/52441.

Adds a new config for macOS arm64 to our binary builds.
Also stores artifacts for Mac builds now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52443

Reviewed By: walterddr

Differential Revision: D26517330

Pulled By: janeyx99

fbshipit-source-id: 02774937a827bdd4c08486dc9f8fe63446917f1e
2021-02-18 14:29:41 -08:00
22adea04df Revert D26299594: [PyTorch Mobile] 15KiB size reduction by reducing MaxTypeIndex from 256 to 32
Test Plan: revert-hammer

Differential Revision:
D26299594 (9e54532947)

Original commit changeset: 9a78c03da621

fbshipit-source-id: 2be1149539892447872eb3289f3fdef0ac92c090
2021-02-18 13:15:55 -08:00
2d4354423e Revert nightly docker build cuda version to 11.1.1. (#52234)
Summary:
CUDA 11.2 has a performance regression, so revert to CUDA 11.1.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52234

Test Plan: [CI](https://github.com/pytorch/pytorch/actions?query=workflow%3A%22Build+PyTorch+nightly+Docker+image+and+push+to+GitHub+Container+Registry%22)

Reviewed By: glaringlee

Differential Revision: D26519105

Pulled By: xuzhao9

fbshipit-source-id: d1e1ecb7904c196292d83767b71000b465de73ce
2021-02-18 12:30:07 -08:00
49c90648d3 [iOS GPU] Fix max_pool_2d (#52431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52431

The previous implementation was missing the padding information and was therefore not correct.
ghstack-source-id: 121950755

Test Plan:
- `buck test pp-macos`
- CircleCI

Reviewed By: SS-JIA

Differential Revision: D26508482

fbshipit-source-id: b28b99c399c4f1390a5cc4f023e470eed0f8c073
2021-02-18 12:28:09 -08:00
c7a70eec1b Make LLVM the default backend for TE (#52314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264

When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (> 50x). This PR makes the LLVM backend the default backend for TE. Now, an error is reported if CPU fusion is enabled without LLVM support, avoiding this performance regression.

This PR also updates the tests to not use LLVM, so that the old flow continues to be exercised. This is necessary because the tests run in CI do not have LLVM.
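
For context, a minimal sketch of the knobs involved (these are internal flags, not a public API; their interaction with this PR's error is an assumption):

```python
import torch

# Opt in to CPU fusion via the tensorexpr (TE) fuser. After this PR, doing
# this in a build without LLVM reports an error instead of silently running
# very slow fused kernels.
torch._C._jit_override_can_fuse_on_cpu(True)   # allow fusion on CPU
torch._C._jit_set_texpr_fuser_enabled(True)    # route fusion through NNC/TE
```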

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314

Reviewed By: ejguan

Differential Revision: D26491294

Pulled By: navahgar

fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
2021-02-18 12:00:38 -08:00
8f3ed60d3e enable mkldnn conv2d backward to support mkldnn tensor input (#48994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48994

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25537189

Pulled By: VitalyFedyunin

fbshipit-source-id: d81d247798fad3815b735468d66ef9d62c07ef77
2021-02-18 10:23:10 -08:00
9e54532947 [PyTorch Mobile] 15KiB size reduction by reducing MaxTypeIndex from 256 to 32 (#51881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51881

`MaxTypeIndex` controls the size of the array

```
detail::TypeMetaData* TypeMeta::typeMetaDatas() {
  static detail::TypeMetaData instances[MaxTypeIndex + 1]
```

in `typeid.cpp`.

In practice, I have seen that this array doesn't hold more than 18 elements once the PyTorch library has been initialized (in mobile unit tests). I couldn't find situations where elements may be added to this array post library initialization.

There is a runtime check to prevent array overflow, so reducing the size of the storage shouldn't come at any additional risk from the perspective of loss in visibility of errors.

The fact that this array is statically allocated ends up using a bunch of space in the binary (potentially to initialize the trailing elements?). I'm somewhat surprised by this. However, this change registered a 15KiB size win on both fbios and igios.

Found this when I was looking at a bloaty run that I shared with smessmer on friday: https://www.internalfb.com/intern/everpaste/?handle=GLXImQisHOfT74EBAKw47V3ktuAzbsIXAAAB

I initially thought that the methods being passed in to the constructor of `detail::TypeMetaData` were causing the size increase, but only later realized the issue after reading the following helpful comment:

```
    // The remainder of the array is padded with TypeMetaData blanks.
    // The first of these is the entry for ScalarType::Undefined.
    // The rest are consumed by CAFFE_KNOWN_TYPE entries.
```
ghstack-source-id: 121875657

Test Plan:
Sandcastle runs + the following BSB runs.

### igios

```
D26299594-V1 (https://www.internalfb.com/intern/diff/D26299594/?dest_number=121221891)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: +596 B
Change in Uncompressed Size for arm64 + 3x assets variation: -15.8 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:443632243487886@base/bsb:443632243487886@diff/
```

### fbios

```
D26299594-V1 (https://www.internalfb.com/intern/diff/D26299594/?dest_number=121221891)

fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: +104 B
Change in Uncompressed Size for arm64 + 3x assets variation: -15.7 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:169233698063125@base/bsb:169233698063125@diff/
```

Reviewed By: raziel, iseeyuan

Differential Revision: D26299594

fbshipit-source-id: 9a78c03da621fbc25a1d8087376628bccc8dbfda
2021-02-18 10:01:42 -08:00
983347fa25 Allow broadcasting against lerp weights. (#52319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52319

Fixes: https://github.com/pytorch/pytorch/issues/52254
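
A small sketch of the now-allowed behavior (assuming the fix lets `start`/`end` broadcast against a larger `weight`; not code from the PR):

```python
import torch

start = torch.zeros(4)
end = torch.ones(4)
weight = torch.rand(3, 4)  # weight has more dims than start/end

# start and end broadcast against weight; result has shape (3, 4).
out = torch.lerp(start, end, weight)
print(out.shape)  # torch.Size([3, 4])
```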

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26488411

Pulled By: gchanan

fbshipit-source-id: 60eb471609986584c4235ba7f263581e988e7642
2021-02-18 09:53:25 -08:00
b52e2e6045 [BE] _get_torch_cuda_version should return tuple (#52409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52409
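
A one-line motivation sketch (my inference from the title, not text from the PR): tuples compare numerically, strings compare lexicographically.

```python
# Why a tuple: lexicographic string comparison mis-orders CUDA versions.
print("9.2" < "11.1")    # False -- "9" sorts after "1"
print((9, 2) < (11, 1))  # True  -- the intended ordering
```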

Reviewed By: jbschlosser, glaringlee

Differential Revision: D26513924

Pulled By: walterddr

fbshipit-source-id: ee18ef357c326c5ad344d80c59821cc2b8814734
2021-02-18 09:28:38 -08:00
f72b4b83fe Fix upsample bicubic2d batching handling on CPU. (#52389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52389

Fixes: https://github.com/pytorch/pytorch/issues/49159

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26496319

Pulled By: gchanan

fbshipit-source-id: d385cd683ef09e0596a9875ce84d03e6e77acc93
2021-02-18 09:14:41 -08:00
c7b0005831 Enhance Tensor.unflatten to support -1 as the inferred size (#51955)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51719, https://github.com/pytorch/pytorch/issues/28142

**Change**
- Update `torch.Tensor.unflatten` to let users pass `-1` as the inferred size, for both tensors and named tensors (see the example below).
- Examples of using `-1` in the `unflatten` function are added to the docs.
- Fix the rendering issue in the original `unflatten` docs by removing a blank line in its example section.
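
A minimal example of the new behavior (the named-tensor call shape is an assumption based on the summary above):

```python
import torch

x = torch.randn(2, 12)
print(x.unflatten(1, (3, -1)).shape)  # torch.Size([2, 3, 4]); -1 is inferred

# Named-tensor variant: sizes are given as (name, size) pairs.
y = torch.randn(2, 12, names=("A", "B"))
print(y.unflatten("B", (("B1", 3), ("B2", -1))).shape)  # torch.Size([2, 3, 4])
```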

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51955

Reviewed By: agolynski

Differential Revision: D26467198

Pulled By: zou3519

fbshipit-source-id: 6a3ede25561223187273796427ad0cb63f125364
2021-02-18 08:37:41 -08:00
ad9746456e ns for fx: make unshadowed activation comparison work for N models (#52357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52357

Refactor the NS for FX compare unshadowed activations API to be able
to work on N models and do arbitrary matching strategies.

We factor out a util which takes a model and a list of
nodes to extract weights for, with names to give the extracted
weights. The user can then call this util with a set
of nodes and names created in any way they want.
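
A hedged sketch of the shape of such a util (the function name, signature, and mapping scheme are illustrative, not the PR's actual API):

```python
import torch
import torch.fx as fx

def extract_weights(model: fx.GraphModule, node_to_key):
    # node_to_key maps call_module node targets to user-chosen result names.
    results = {}
    modules = dict(model.named_modules())
    for node in model.graph.nodes:
        if node.op == "call_module" and node.target in node_to_key:
            results[node_to_key[node.target]] = modules[node.target].weight.detach()
    return results

m = fx.symbolic_trace(torch.nn.Sequential(torch.nn.Linear(4, 4)))
print(extract_weights(m, {"0": "model_a.linear_0"}))
```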

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26487270

fbshipit-source-id: 1372ef07b5f3ddc7cebdfb2dee0221a2facd0527
2021-02-18 08:20:14 -08:00
a937d1cb16 ns for fx: make weights comparison work on N models (#52356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52356

Refactor the NS for FX compare weights API to be able to
work on N models and do arbitrary matching strategies.

We factor out a util which takes a model and a list of
nodes to extract weights for, with names to give the extracted
weights.  The user can then call this util with a set
of nodes and names created in any way they want.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26487271

fbshipit-source-id: 0c2172a1b33d47565004a307aff14d205671add7
2021-02-18 08:20:09 -08:00
d903106bad [wip] ns for fx: add support for subgraph matching (#52130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52130

We have patterns like (F.linear, F.relu) which need to match
to (toq.linear_relu).  So, we need to match subgraphs.

This PR does the following:
* defines a "subgraph" as (start_node, end_node). The current assumption
is that subgraphs are simple, there is always a path from start_node to
end_node, and we can ignore any non-input args/kwargs of these nodes
for the purposes of matching and copying things. An example one node
subgraph is (F.linear, F.linear).  An example two node subgraph
is (F.linear, F.relu).
* changes the matching logic to iterate over subgraphs instead of nodes
* changes the NS core APIs to use subgraph pairs instead of node pairs:
1. for weights, we match on the start node
2. for unshadowed activations, we observe the end nodes
3. for shadowed activations, we copy the subgraph of a to graph c
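
A toy sketch of the matching idea (illustrative only, not the PR's implementation):

```python
import torch
import torch.nn.functional as F
import torch.fx as fx

def linear_relu_subgraphs(gm: fx.GraphModule):
    # Return (start_node, end_node) pairs: a lone linear is a one-node
    # subgraph (start == end); a linear feeding only a relu is two nodes.
    subgraphs = []
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is F.linear:
            users = list(node.users)
            if len(users) == 1 and users[0].target is F.relu:
                subgraphs.append((node, users[0]))
            else:
                subgraphs.append((node, node))
    return subgraphs
```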

TODO(before review) write up better, not ready for review yet

Test Plan:
TODO before land: better test plan

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26403092

fbshipit-source-id: e49aaad4b02b8d60589435848bee422b8f41937a
2021-02-18 08:20:04 -08:00
3978ffb37a NS for FX: add test for a simple sparsenn model (#52092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52092

Adds a very simple toy sparsenn model, and enables
its inspection with the new NS APIs.

Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_sparsenn_compare_activations
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_sparsenn_shadow
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26403095

fbshipit-source-id: 3c3650aca47186deb32f2b3f1d87a0716d1ad9d1
2021-02-18 08:17:57 -08:00
efbb854ed8 [PyTorch] Avoid std::string in TORCH_CHECK when possible (#52221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52221

The previous code forced a `std::string` to be created even when the default message or a user-provided string literal message was used. Now it's not forced and we don't need an outlined lambda in those cases either.
ghstack-source-id: 121877056

Test Plan:
Compare assembly for

```
#include <c10/util/Exception.h>

void f(bool b) {
  TORCH_CHECK(b, "message");
}

void g(bool b) {
  TORCH_CHECK(b);
}

void h(bool b) {
  TORCH_CHECK(b, "message", random());
}
```

before/after in fbcode optimized build.

Before: P174696735
After: P174696840

For `f()` and `g()`, we go from a call to an outlined lambda that did a bunch of `std::string` creation to a load of a string constant before calling `torchCheckFail`. This is a clear improvement.

For `h()`, results are mixed: we save a bunch of *extra* string goop in the outlined lambda and instead call `c10::detail::_str_wrapper` directly. This is good for overall size. However, we no longer outline the call to `random()`, which is less than ideal. I hope to recover the ability to fully outline the `random()` call in future diffs; this is just thorny enough that I don't want to cram even more into one diff.

Added automated test to make sure `TORCH_CHECK` and `TORCH_INTERNAL_ASSERT` only evaluate their arguments once.

Profiled AdIndexer mergenet benchmark in perf to check that `IValue::toTensor` is still getting inlined.

Reviewed By: bhosmer

Differential Revision: D26380783

fbshipit-source-id: 288860772423994ac739a8f33e2c09f718e8dd38
2021-02-18 07:51:53 -08:00
ba77b8d84e [PyTorch][easy] Make shared empty string static instead of thread_local (#52220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52220

D21268320 (d068a456d3) made this thread_local, but I don't think it was necessary to do so.
ghstack-source-id: 121877050

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D26378724

fbshipit-source-id: 7f17b5cff42983ea8f5be1bd254de01bf8db9a0e
2021-02-18 07:49:07 -08:00
c8b3686a3e Make bias in lazy modules lazy and avoid create empty tensors (#52212)
Summary:
Some minor improvement for lazy modules introduced in https://github.com/pytorch/pytorch/issues/44538, https://github.com/pytorch/pytorch/issues/47350 and https://github.com/pytorch/pytorch/issues/51548.

This PR mainly turns the bias into an `UninitializedParameter`; instead of creating empty tensors like
```python
self.bias = Parameter(torch.Tensor(0))
self.bias = UninitializedParameter()
```
I think it would be better to
```python
self.register_parameter('bias', None)
self.bias = UninitializedParameter()
```

In addition, I change the constructor of the `LazyBatchNorm` from
```python
self.running_mean = UninitializedBuffer()
```
to
```python
self.register_buffer('running_mean', UninitializedBuffer())
```
as the original one would not change the underlying `self._buffers`.
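
For reference, a small usage sketch of a lazy module after these changes (the `True` print assumes this PR's lazy-bias behavior):

```python
import torch
import torch.nn as nn

m = nn.LazyLinear(out_features=8)
print(isinstance(m.bias, nn.parameter.UninitializedParameter))  # True (lazy bias)
m(torch.randn(4, 16))    # first forward materializes the parameters
print(m.bias.shape)      # torch.Size([8])
print(m.weight.shape)    # torch.Size([8, 16])
```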

Thank you for your time on reviewing this PR :).

Gently ping albanD, mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52212

Reviewed By: jbschlosser

Differential Revision: D26504508

Pulled By: albanD

fbshipit-source-id: 7094d0bb4fa9e2a40a07b79d350ea12a6ebfd080
2021-02-18 06:34:53 -08:00
758aa45563 Revert D26369476: [pytorch][PR] [complex] masked_fill: Complex Autograd support and update masked_scatter skips.
Test Plan: revert-hammer

Differential Revision:
D26369476 (7a408c7290)

Original commit changeset: 7a79d5a609b0

fbshipit-source-id: f0011f40962ccbcd8e7c19bd727e1e49cf2ec0c4
2021-02-18 05:01:03 -08:00
60518d10f6 [deploy] torch::deploy API (#51754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51754

This API allows you to manage multiple python interpreters in a single
process to deploy PyTorch models packaged with torch.package.

torch/csrc/deploy/deploy.h contains the API definition
torch/csrc/deploy/test_deploy.cpp has some examples.

Notes:
* mutex is added to PyTorchStreamReader to make it safe to use from multiple threads at once.
* USE_DEPLOY is only true for the special libtorch_deployinterpreter.so library, when enabled
  we use a hash table to maintain the PyObject <> at::Tensor mapping rather than the internal pointer
  in Tensor since >1 interpreter may have a reference to the tensor.
* serialization.py has some additional functions for creating pickle objects
  but keeping storages in memory for use in transferring tensors between interpreters.
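
For context, a hedged sketch of the Python-side packaging that torch::deploy's InterpreterManager consumes (the `PackageExporter` calls follow the 1.9-era API; the exact method surface at this commit is an assumption):

```python
# Sketch: produce a torch.package archive that C++ torch::deploy can load.
import torch
from torch.package import PackageExporter

model = torch.nn.Linear(4, 4)
with PackageExporter("model_package.pt") as e:  # hypothetical output path
    e.save_pickle("model", "model.pkl", model)
```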

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D26329468

Pulled By: zdevito

fbshipit-source-id: d75f4ebb9a27f1d911179d9996041bcb3ca04a07
2021-02-18 02:30:08 -08:00
9cf6be6b3e Fix torch.nn.functional.interpolate microbenchmark for non-4D inputs
Summary: This diff fixes the `interpolate` microbenchmark for non-4D inputs, which are not supported by the `bilinear` mode
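
A small illustration of the constraint being worked around (standard `F.interpolate` behavior, not code from this diff):

```python
import torch
import torch.nn.functional as F

x3d = torch.randn(4, 512, 320)         # (N, C, W)
F.interpolate(x3d, size=256, mode="linear")               # ok: 3D wants 'linear'
x5d = torch.randn(1, 3, 16, 320, 320)  # (N, C, D, H, W)
F.interpolate(x5d, size=(8, 256, 256), mode="trilinear")  # ok: 5D wants 'trilinear'
# F.interpolate(x3d, size=256, mode="bilinear")  # raises: 'bilinear' needs 4D input
```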

Test Plan:
5D and 3D:

```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,16,320,320)_output_size(8,256,256)
# Input: input_size: (1, 3, 16, 320, 320), output_size: (8, 256, 256)
Forward Execution Time (us) : 221008.660

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(4,512,320)_output_size(256,)
# Input: input_size: (4, 512, 320), output_size: (256,)
Forward Execution Time (us) : 9727.900

```

4D
```
# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 375.181

```

Reviewed By: fmassa

Differential Revision: D26486678

fbshipit-source-id: 5d476afba3f35da9f8b86db16e21505bdb00888b
2021-02-18 02:07:54 -08:00
7a67a7a396 [static runtime] Generate sigmoid with NNC (#52424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52424

NNC has a fast sigmoid on par with aten.  Using it for static runtime
lets us skip dispatch overhead.
ghstack-source-id: 121953787

Test Plan:
```
caffe2=0 batch=1 run.sh
```

Reviewed By: bwasti

Differential Revision: D26291425

fbshipit-source-id: a2ad79765dacee352625f0e5322e871556e0ca9e
2021-02-18 01:56:50 -08:00
8228086e64 [static runtime] Use VML-inspired logarithm with NNC, tweak scheduling (#52423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52423

NNC has a new logarithm implementation that closely matches the
performance of VML (see D26246400 (2e35fe9535)).  Using this in the NNC generated kernel for
logit increases the win slightly.
ghstack-source-id: 121953008

Test Plan:
```
caffe2=0 bs=20 scripts/bwasti/static_runtime/run.sh
```

Reviewed By: bwasti

Differential Revision: D26291426

fbshipit-source-id: c5c3933732c6ade5235f23d6dc71410170a6c749
2021-02-18 01:54:56 -08:00
e1d927e552 [JIT] Update freezing api (#52337)
Summary:
Update the freezing API for 1.8, and add a corresponding C++ API. The `optimize` flag hasn't been publicly released yet, so we are able to change it without breaking BC. I will submit a PR to the release branch as well; there are a few more days to do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52337

Reviewed By: ejguan

Differential Revision: D26491833

Pulled By: eellison

fbshipit-source-id: 6dcd74eb8f76db64ac53183d03dabdd0f101f4b5
2021-02-18 00:17:27 -08:00
ac121165e2 Remove ReduceOp::accumulator (#52196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196

A reduction does not need to know the buffer into which its
result will be written.  This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071

Test Plan: test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D26420107

Pulled By: bertmaher

fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
2021-02-17 23:36:23 -08:00
a788c2d777 [nnc] Remove output_args from ReduceOp (#52187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187

ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075

Test Plan:
test_tensorexpr, tensorexpr_bench

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D26420575

fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
2021-02-17 23:36:18 -08:00
62d5f60ad2 Avoid using ReduceOp->output_args() in rfactor (#52177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52177

I'm trying to get rid of `output_args` for reductions, because they
shouldn't be necessary; it's reducing over its reduction axis, why
does it need to know where its output is going?

Rfactor is probably the trickiest place where we use output_args, but
it looks like it's mostly just carrying around the location of the
store, so use that instead.
ghstack-source-id: 121813072

Test Plan:
build/bin/test_tensorexpr && build/bin/tensorexpr_bench

Imported from OSS

Reviewed By: navahgar

Differential Revision: D26420548

fbshipit-source-id: aeab564c6113fa02eabb14c9b70c7edfd05b264d
2021-02-17 23:36:13 -08:00
f6a6814a4f Dont look at reduction output args when computing mem dependencies (#52170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52170

ghstack-source-id: 121813073

Test Plan: unit tests

Reviewed By: navahgar

Differential Revision: D26411735

fbshipit-source-id: 31c35af80d4f3e06df17ec65e4c91f604fc8745a
2021-02-17 23:34:08 -08:00
de9016007a [PyTorch][easy] Coalesce string literals in data_ptr error message (#52379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52379

There's no reason to create and concatenate multiple string literals here when we could just combine them.
ghstack-source-id: 121890478

Test Plan: builds

Reviewed By: ilia-cher

Differential Revision: D26492399

fbshipit-source-id: a9e611a5b7ce5c1255135f3a0db12cc765b29a87
2021-02-17 23:06:57 -08:00
7a408c7290 [complex] masked_fill: Complex Autograd support and update masked_scatter skips. (#52035)
Summary:
Now that the `masked_fill` CUDA migration has landed, the skips on masked_scatter can be removed.

Reference: https://github.com/pytorch/pytorch/issues/33152
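
A minimal sketch of what the added autograd support enables (assuming current behavior; not code from the PR):

```python
import torch

x = torch.randn(3, 3, dtype=torch.cfloat, requires_grad=True)
mask = torch.eye(3, dtype=torch.bool)
y = x.masked_fill(mask, 1 + 1j)   # fill the diagonal with a complex scalar
y.sum().abs().backward()          # backward through a complex graph
print(x.grad.dtype)               # torch.complex64
```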

**Note**:

The tensor shapes for `masked_scatter` have been decreased from (M, M) to (S, S), and so on.

With shapes of M : **96.53s**
```
test/test_ops.py ........................................ssssssssssss........................ssssssssssss........................                                                   [100%]

=============================================================== 88 passed, 24 skipped, 7981 deselected in 96.53s (0:01:36) ================================================================
```

With shapes of S : **46.53s**
```
test/test_ops.py ........................................ssssssssssss........................ssssssssssss........................                                                   [100%]

==================================================================== 88 passed, 24 skipped, 7981 deselected in 46.53s =====================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52035

Reviewed By: VitalyFedyunin

Differential Revision: D26369476

Pulled By: anjali411

fbshipit-source-id: 7a79d5a609b0019f8fe9ce6452924dd33390dce1
2021-02-17 22:49:26 -08:00
f6321977e9 Fix shape inference for multiple outputs with different output dtypes (#52417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52417

When we have multiple outputs, previously we would set `infered_data_type` based on the first output and stick to it. This is not correct if later outputs have different dtypes. The fix here reverts `infered_data_type` back to its previous value (`UNDEFINED`) so that we can still enter the conditional check and set the right dtype for the second and subsequent outputs.

Test Plan:
```
buck test caffe2/caffe2/fb/operators:infer_bound_shape_op_test
```

Reviewed By: khabinov

Differential Revision: D26502161

fbshipit-source-id: 647f0106a5785dc156fddfc196ac67001602fda8
2021-02-17 22:24:02 -08:00
f1e004b954 Fix compiler warning for MathConstants.h (#52123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52123

Compiler currently complains:
```
caffe2/c10/util/MatchConstants.h(18): warning: calling a constexpr __host__ function("from_bits") from a __host__ __device__ function("pi") is not allowed.
```
This diff extirpates the warning

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D26379485

fbshipit-source-id: ab4821119cba8c43fd1d5788c4632d0613529ec8
2021-02-17 22:08:03 -08:00
eaad002cf6 [PyTorch] s/__attribute__((__noinline__))/__attribute__((noinline))/ (#52362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52362

AFAICT, it is documented to be the latter and not the former.
GCC: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
Clang: https://clang.llvm.org/docs/AttributeReference.html

Both versions work in the oldest and newest GCC & Clang versions on Godbolt: https://godbolt.org/z/s6f4PW

So why change?
1) lack of underscores matches the documentation
2) AMD HIP defines `__noinline__` as a macro, which doesn't play well with the underscore version.
See 2080cc113a/include/hip/hcc_detail/host_defines.h (L54)
ghstack-source-id: 121875424

Test Plan: Rely on existing CI

Reviewed By: bhosmer

Differential Revision: D26488991

fbshipit-source-id: 6cfcdfd41c58170659e263cd519ac5359ffd5d46
2021-02-17 21:04:28 -08:00
f7aa88b400 [caffe2] Explicitly define all DataTypes in python/core.py (#51768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51768

This updates python/core.py to explicitly define all of the `DataType`
values rather than dynamically defining them at runtime from the
`caffe2_pb2` values.

This allows type checkers like Pyre and Mypy to see the members of the
`DataType` class.  Otherwise the type checkers report errors such as
`"core.DataType" has no attribute "INT64"`.

This code does keep a run-time check that all of the data types defined
by `caffe2_pb2.proto` are defined correctly in this file.  This way if
someone does add a new type to `caffe2_pb2.proto` it should be very
quickly apparent that this file needs to be updated and kept in sync.
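
A hedged sketch of the pattern (the enum values shown are my reading of `caffe2.proto`; the real file covers every member):

```python
from caffe2.proto import caffe2_pb2

class DataType:
    # Explicit values so Pyre/Mypy can see the attributes statically.
    UNDEFINED = 0
    FLOAT = 1
    INT32 = 2
    # ... remaining members elided in this sketch

# Run-time check that the literals above stayed in sync with the proto.
for name in ("UNDEFINED", "FLOAT", "INT32"):
    assert getattr(DataType, name) == caffe2_pb2.TensorProto.DataType.Value(name)
```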
ghstack-source-id: 121936201

Test Plan:
Confirmed that various caffe2/python tests still pass.
Verified that this allows many `pyre-fixme` comments to be removed in
downstream projects, and that Pyre is still clean for these projects.

Reviewed By: jeffdunn

Differential Revision: D26271725

Pulled By: simpkins

fbshipit-source-id: f9e95795de60aba67d7d3872d0c141ed82ba8e39
2021-02-17 20:54:17 -08:00
27d89057f8 [caffe2] fix deserialization of unknown tensor data_type values (#52411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52411

The `TensorDeserializer` code previously did not correctly handle unknown
`data_type` values.  It attempted to deserialize the data as floats, rather
than recognizing that it did not understand the data type and erroring out.

Google protobuf will never return unknown values for enum fields.  If an
unknown value is found in serialized data, the protobuf code discards it.
As a result `has_data_type()` will return false, but `get_data_type()` will
simply return the default value, which happens to be set to `FLOAT`.  As a
result if we ever encounter a serialized blob with an unknown data type the
previous code would incorrectly think the data type was `FLOAT`.

This fixes the code to check if the `data_type` value is present before
reading it.
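
The protobuf behavior being guarded against can be demonstrated directly (a small sketch, assuming the proto2 `optional ... [default = FLOAT]` declaration):

```python
from caffe2.proto import caffe2_pb2

proto = caffe2_pb2.TensorProto()
# Unknown enum values are dropped during parsing, so the field looks unset:
print(proto.HasField("data_type"))                       # False
print(proto.data_type == caffe2_pb2.TensorProto.FLOAT)   # True -- just the default
```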
ghstack-source-id: 121915981

Test Plan:
Included a unit test that verifies this behavior.  Confirmed that without this
fix the code proceeded with the float deserialization code path.  When
deserializing int32_t data it fortunately did fail later due to an unexpected
field length check, but this isn't guaranteed to be the case.  In some cases
it potentially could incorrectly succeed and return wrong data.

Reviewed By: mraway

Differential Revision: D26375502

fbshipit-source-id: 4f84dd82902e18df5e693f4b28d1096c96de7916
2021-02-17 19:13:43 -08:00
e0d9d0f248 update symeig backward note about similar eigenvalues (#52311)
Summary:
First part of https://github.com/pytorch/pytorch/issues/49886 to at least properly warn users of the current state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52311

Reviewed By: soulitzer

Differential Revision: D26495644

Pulled By: albanD

fbshipit-source-id: 72abdfe41cdbcc1ac739a536eb85d1aa4ba90897
2021-02-17 19:07:25 -08:00
08b95e3c48 [Pytorch, Sparsity] Integrate sparse qnnpack operator in framework (#52377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52377

Add QNNPACK specific packed params for sparse linear.
Add sparse linear dynamic op with appropriate registration.
Add python side LinearDynamic module for sparsity.
Add tests to validate sparse linear qnnpack kernels.
Note that since these tests are mostly run on the x86 platform, and given that
the 1x4 sparse kernels are implemented in both SSE and ARM, LinearDynamic
defaults to the 1x4 pattern for the moment.
The plan is to add another diff that will allow a global override for the 8x1
pattern, so that the prepare/convert flow can work for exporting models for mobile.

Test Plan: buck run caffe2/torch/fb/model_optimization:sparsity_test

Reviewed By: z-a-f

Differential Revision: D26491944

fbshipit-source-id: b98839b4c62664e1fabbb0cbeb2e5c1bd5903b4d
2021-02-17 18:25:13 -08:00
908ba05a06 [Pytorch] Add python binding to use mobile cpu allocator. (#52376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52376

Using the default CPU allocator for ops executed on the qnnpack backend will
result in ASAN failures with heap overflow, since qnnpack (and xnnpack) can
access input beyond its end/beginning.

Here we are enabling this feature specifically to enable the dynamic sparse
linear op test using the qnnpack engine. In the dynamic linear op, the fp32
bias is not packed and hence can result in out-of-bounds access.
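
A hedged usage sketch (the binding names below are what I recall this PR adding; treat them as assumptions):

```python
import torch

# Swap in the mobile CPU allocator, which over-allocates guard bytes so
# QNNPACK/XNNPACK reads past the ends of inputs stay in bounds under ASAN.
torch._C._set_default_mobile_cpu_allocator()
try:
    pass  # ... run qnnpack-backed dynamic sparse linear tests here ...
finally:
    torch._C._unset_default_mobile_cpu_allocator()
```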

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D26491943

fbshipit-source-id: bcc2485e957c7abdef0853c36f6e0f876c20cee3
2021-02-17 18:23:14 -08:00
70bed6a55a Removes deprecated preprocess method from the backend interface (#52258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52258

Removes deprecated preprocess method from the backend interface.

Preprocessing logic should be now registered along with the backend interface (i.e. PyTorchBackendInterface) via the BackendPreprocessFunction.

Also refactored internal dependencies.
ghstack-source-id: 121704837

Test Plan:
Validates all related tests pass:

buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - BackendTest.ToBackend'

python test/test_jit.py TestBackends

===== Glow

buck test mode/dev //glow/fb/torch_glow/tests:TorchGlowBackendTests

buck test mode/dev //glow/fb/torch_glow/tests:torch_glow_backend_tests

Reviewed By: jackm321

Differential Revision: D26443479

fbshipit-source-id: afdc51ae619ced293d10c7a6a12f3530e4c4e53c
2021-02-17 17:53:36 -08:00
b9f051db9f Add type hints for the _import_c_extension module (#51767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51767

The `_import_c_extension.py` finds the right C extension library to use,
and then simply re-exports all of the symbols that it defines.

This adds a `_import_c_extension.pyi` file with type hints to let type
checkers like Pyre and Mypy know the names of the symbols that will be
re-exported from the C extension.

This does not define all of the symbols provided by the C extension,
but does define all of the symbols necessary to make type checkers happy
about other code in the `caffe2/python` directory.
ghstack-source-id: 121916324

Test Plan:
Was able to have Pyre successfully type check the `caffe2/python`
directory with this stub file plus a few other changes.

Confirmed that all of the dependent projects affected by this report no new
pyre issues in sandcastle.

Ran `python test/test_type_hints.py` in the PyTorch github repository and
confirmed it also passes.

Differential Revision: D26271726

Pulled By: simpkins

fbshipit-source-id: 6dbadcf02e0b2cc44a9e3cdabe9291c1250959b4
2021-02-17 17:37:47 -08:00
76af821c36 [PyTorch] "Fix" wrong-looking move in TensorImpl (#52344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52344

This line is a bug-prone use of std::move combined with a reference to the moved-from parameter in the same series of function call arguments. This is normally a problem because the order of evaluation is undefined -- if the move happens before the call to `storage.device()`, we may have problems. It is not a problem here because we are merely forwarding from one `Storage&&` parameter to another.
ghstack-source-id: 121837267

Test Plan: See no clang-tidy/HowToEven warning on the diff, I hope

Reviewed By: bhosmer

Differential Revision: D26436550

fbshipit-source-id: da85d79be854ff42c5a0cab9649ba82295816eca
2021-02-17 17:26:04 -08:00
2b202667c1 [1/N] CPU pointwise optimization: Add a benchmark for Relu
Summary: As title

Test Plan:
Building: finished in 01:58.4 min (100%) 16761/16761 jobs, 16761 updated
  Total time: 02:32.3 min
Run on (24 X 2394.45 MHz CPU s)
2021-02-16 21:29:30
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
relu_nnc/64                                        1738 ns       1738 ns     410535 log/s=36.8257M/s
relu_nnc/512                                       1708 ns       1708 ns     408678 log/s=299.711M/s
relu_nnc/8192                                      3297 ns       3297 ns     214362 log/s=2.48499G/s
relu_nnc/32768                                    10725 ns      10722 ns      61032 log/s=3.05603G/s
log_nnc_sleef/64                                   2076 ns       2075 ns     326248 log/s=30.8436M/s
log_nnc_sleef/512                                  3070 ns       3069 ns     230616 log/s=166.81M/s
log_nnc_sleef/8192                                22214 ns      22210 ns      31251 log/s=368.849M/s
log_nnc_sleef/32768                               85835 ns      85824 ns       8366 log/s=381.804M/s
log_nnc_fast/64                                    1852 ns       1852 ns     379123 log/s=34.5532M/s
log_nnc_fast/512                                   2456 ns       2456 ns     299463 log/s=208.503M/s
log_nnc_fast/8192                                 10953 ns      10952 ns      69894 log/s=747.957M/s
log_nnc_fast/32768                                35424 ns      35422 ns      19986 log/s=925.08M/s
log_nnc_vml/64                                     2361 ns       2361 ns     356220 log/s=27.1063M/s
log_nnc_vml/512                                    2218 ns       2218 ns     313444 log/s=230.857M/s
log_nnc_vml/8192                                   8420 ns       8420 ns      81594 log/s=972.912M/s
log_nnc_vml/32768                                 29484 ns      29484 ns      21701 log/s=1.1114G/s
log_aten/64                                       15970 ns      15970 ns      44401 log/s=4.00742M/s
log_aten/512                                      18344 ns      18344 ns      41056 log/s=27.9114M/s
log_aten/8192                                     24894 ns      24893 ns      27414 log/s=329.084M/s
log_aten/32768                                    29129 ns      29125 ns      22477 log/s=1.12508G/s
logit_nnc_sleef/64                                 2379 ns       2379 ns     261168 logit/s=26.8981M/s
logit_nnc_sleef/512                                5778 ns       5774 ns     114009 logit/s=88.6757M/s
logit_nnc_sleef/8192                              57268 ns      57236 ns      12429 logit/s=143.127M/s
logit_nnc_sleef/32768                            216356 ns     216344 ns       3026 logit/s=151.462M/s
logit_nnc_fast/64                                  2178 ns       2173 ns     282306 logit/s=29.4565M/s
logit_nnc_fast/512                                 2955 ns       2943 ns     202527 logit/s=173.95M/s
logit_nnc_fast/8192                               14836 ns      14835 ns      46794 logit/s=552.192M/s
logit_nnc_fast/32768                              53999 ns      53997 ns      12842 logit/s=606.846M/s
logit_nnc_vml/64                                   2132 ns       2132 ns     335874 logit/s=30.018M/s
logit_nnc_vml/512                                  3029 ns       3029 ns     250988 logit/s=169.058M/s
logit_nnc_vml/8192                                13264 ns      13263 ns      53504 logit/s=617.655M/s
logit_nnc_vml/32768                               49395 ns      48284 ns      14526 logit/s=678.654M/s
logit_aten/64                                     88180 ns      86690 ns       9270 logit/s=738.261k/s
logit_aten/512                                    54682 ns      54489 ns      10000 logit/s=9.3964M/s
logit_aten/8192                                  170878 ns     164357 ns       6965 logit/s=49.8427M/s
logit_aten/32768                                 452291 ns     434638 ns       3967 logit/s=75.3915M/s
logit_caffe2/64                                   30170 ns      29902 ns      24686 logit/s=2.14029M/s
logit_caffe2/512                                 203517 ns     201201 ns       3570 logit/s=2.54472M/s
logit_caffe2/8192                               3199528 ns    3157098 ns        220 logit/s=2.59479M/s
logit_caffe2/32768                             12520838 ns   12504846 ns         56 logit/s=2.62042M/s
tanh_nnc_fast/64                                   1979 ns       1977 ns     309745 tanh/s=32.3752M/s
tanh_nnc_fast/512                                  2331 ns       2331 ns     300937 tanh/s=219.636M/s
tanh_nnc_fast/8192                                 8323 ns       8323 ns      83601 tanh/s=984.26M/s
tanh_nnc_fast/32768                               30767 ns      30766 ns      23024 tanh/s=1065.06M/s
tanh_aten/64                                      17181 ns      17180 ns      36818 tanh/s=3.72522M/s
tanh_aten/512                                     19071 ns      19036 ns      37243 tanh/s=26.8968M/s
tanh_aten/8192                                    53542 ns      52006 ns      16268 tanh/s=157.521M/s
tanh_aten/32768                                  619869 ns     587600 ns       1000 tanh/s=55.7658M/s
tanh_caffe2/64                                     9668 ns       9654 ns      70926 tanh/s=6.62919M/s
tanh_caffe2/512                                   70409 ns      70409 ns       9881 tanh/s=7.27184M/s
tanh_caffe2/8192                                1179098 ns    1179011 ns        644 tanh/s=6.9482M/s
tanh_caffe2/32768                               4384300 ns    4382613 ns        156 tanh/s=7.47682M/s
BatchNorm/ATen/1/64/112/112                    23186429 ns   23183715 ns         27 GB/s=277.028M/s
BatchNorm/ATen/1/256/14/14                      1772907 ns    1770636 ns        394 GB/s=226.703M/s
BatchNorm/ATen/1/128/28/28                      3069417 ns    3069229 ns        232 GB/s=261.569M/s
BatchNorm/ATen/1/64/56/56                       6367276 ns    6367190 ns        111 GB/s=252.173M/s
BatchNorm/ATen/1/512/7/7                        1334734 ns    1334373 ns        516 GB/s=150.411M/s
BatchNorm/ATen/5/64/112/112                   131727903 ns  131721364 ns          7 GB/s=243.792M/s
BatchNorm/ATen/5/256/14/14                      7879002 ns    7874672 ns         85 GB/s=254.873M/s
BatchNorm/ATen/5/128/28/28                     15561373 ns   15269781 ns         42 GB/s=262.877M/s
BatchNorm/ATen/5/64/56/56                      29169722 ns   29107393 ns         24 GB/s=275.812M/s
BatchNorm/ATen/5/512/7/7                        5042006 ns    5028687 ns        100 GB/s=199.559M/s
BatchNorm/NNC/1/64/112/112                      3303598 ns    3271058 ns        188 GB/s=1.96344G/s
BatchNorm/NNC/1/256/14/14                        330641 ns     326644 ns       2033 GB/s=1.22889G/s
BatchNorm/NNC/1/128/28/28                        498706 ns     497894 ns       1131 GB/s=1.61242G/s
BatchNorm/NNC/1/64/56/56                        1116910 ns    1114768 ns        641 GB/s=1.44033G/s
BatchNorm/NNC/1/512/7/7                          163380 ns     163351 ns       3493 GB/s=1.22867G/s
BatchNorm/NNC/5/64/112/112                     16392078 ns   16386427 ns         41 GB/s=1.95971G/s
BatchNorm/NNC/5/256/14/14                       1133781 ns    1133369 ns        674 GB/s=1.77086G/s
BatchNorm/NNC/5/128/28/28                       2053208 ns    2053211 ns        276 GB/s=1.95503G/s
BatchNorm/NNC/5/64/56/56                        3874949 ns    3874734 ns        165 GB/s=2.07193G/s
BatchNorm/NNC/5/512/7/7                          653665 ns     651498 ns       1236 GB/s=1.54033G/s
BatchNorm/ATenRelu/1/64/112/112                36878892 ns   36100523 ns         22 GB/s=177.907M/s
BatchNorm/ATenRelu/1/256/14/14                  6404318 ns    5544976 ns        100 GB/s=72.3913M/s
BatchNorm/ATenRelu/1/128/28/28                  5897059 ns    5735509 ns        106 GB/s=139.973M/s
BatchNorm/ATenRelu/1/64/56/56                  10075458 ns    9965146 ns         62 GB/s=161.125M/s
BatchNorm/ATenRelu/1/512/7/7                    2680507 ns    2662541 ns        254 GB/s=75.3806M/s
BatchNorm/ATenRelu/5/64/112/112               145738113 ns  144253693 ns          5 GB/s=222.612M/s
BatchNorm/ATenRelu/5/256/14/14                 13582519 ns   13427209 ns         65 GB/s=149.476M/s
BatchNorm/ATenRelu/5/128/28/28                 22747138 ns   22627185 ns         31 GB/s=177.401M/s
BatchNorm/ATenRelu/5/64/56/56                  53609692 ns   52936728 ns         15 GB/s=151.656M/s
BatchNorm/ATenRelu/5/512/7/7                   11378314 ns   11083777 ns         65 GB/s=90.5395M/s
BatchNorm/NNCRelu/1/64/112/112                  3154436 ns    3148939 ns        193 GB/s=2.03958G/s
BatchNorm/NNCRelu/1/256/14/14                    337341 ns     337163 ns       1926 GB/s=1.19055G/s
BatchNorm/NNCRelu/1/128/28/28                    505570 ns     505569 ns       1231 GB/s=1.58794G/s
BatchNorm/NNCRelu/1/64/56/56                     903452 ns     903421 ns        659 GB/s=1.77728G/s
BatchNorm/NNCRelu/1/512/7/7                      158521 ns     158321 ns       3781 GB/s=1.2677G/s
BatchNorm/NNCRelu/5/64/112/112                 15488210 ns   15480019 ns         41 GB/s=2.07446G/s
BatchNorm/NNCRelu/5/256/14/14                   1149186 ns    1148963 ns        649 GB/s=1.74683G/s
BatchNorm/NNCRelu/5/128/28/28                   2011589 ns    2011424 ns        320 GB/s=1.99564G/s
BatchNorm/NNCRelu/5/64/56/56                    3776274 ns    3776060 ns        161 GB/s=2.12607G/s
BatchNorm/NNCRelu/5/512/7/7                      699762 ns     699582 ns        975 GB/s=1.43446G/s
BM_CompileSwish                                30471825 ns   30470017 ns         24
BM_CompileSwishLLVMOnly                        27479624 ns   27473475 ns         25
FusedOverhead                                    196219 ns     196195 ns       3342
UnfusedOverhead                                  220210 ns     220119 ns       3302
Gemm/Torch/128/128/128                           115526 ns     115343 ns       7414 GFLOPS=36.3637G/s
Gemm/TensorExprNoopt/128/128/128                3155851 ns    3155706 ns        210 GFLOPS=1.32912G/s
Gemm/TensorExprTile32x32/128/128/128             124454 ns     124452 ns       5774 GFLOPS=33.7021G/s
Gemm/TensorExprTile4x16/128/128/128              174408 ns     174366 ns       3987 GFLOPS=24.0546G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      72949 ns      72948 ns       9028 GFLOPS=57.4974G/s
Gemm/TensorExprTile4x16Cache/128/128/128          73237 ns      73234 ns       9501 GFLOPS=57.2726G/s
Reduce1D/Torch/16777216                       426865265 ns  426853756 ns          2 BYTES=157.217M/s
Reduce1D/Naive/16777216                       132347709 ns  132343710 ns          5 BYTES=507.08M/s
Reduce1D/NativeRfactor/16777216               234668375 ns  234664682 ns          3 BYTES=285.978M/s
Reduce1D/TeNaive/16777216                      20468304 ns   20467906 ns         34 BYTES=3.27874G/s
Reduce1D/TeSplitTail/16777216                  20378995 ns   20378678 ns         34 BYTES=3.29309G/s
Reduce1D/TeSplitMask/16777216                  20371783 ns   20371260 ns         36 BYTES=3.29429G/s
Reduce1D/TeRfactorV2/16777216                   8235908 ns    8235723 ns         84 BYTES=8.14851G/s

CPU info:

Running ```sudo lshw -class processor```. Get 24 CPUs with identical architecture as follows:

  *-cpu:0
       description: CPU
       product: Intel Core Processor (Broadwell)
       vendor: Intel Corp.
       physical id: 400
       bus info: cpu@0
       version: 6.61.2
       slot: CPU 0
       size: 2GHz
       capacity: 2GHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp x86-64 constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
       configuration: cores=1 enabledcores=1 microcode=1 threads=1

Reviewed By: bwasti

Differential Revision: D26275048

fbshipit-source-id: 3de669f622eb8cd328787caa878dc0c05de600a5
2021-02-17 17:18:28 -08:00
2775ff4a47 [BE] Decorate unused functions with C10_UNUSED (#52378)
Summary:
This suppresses repeated warnings for every file that includes vec256 or
math.h:
```
../aten/src/ATen/native/Math.h:1095:15: warning: unused function 'calc_igamma' [-Wunused-function]
c10::BFloat16 calc_igamma<c10::BFloat16>(c10::BFloat16 a, c10::BFloat16 x) {
              ^
../aten/src/ATen/native/Math.h:1100:11: warning: unused function 'calc_igamma' [-Wunused-function]
c10::Half calc_igamma<c10::Half>(c10::Half a, c10::Half x) {
          ^
../aten/src/ATen/native/Math.h:1105:15: warning: unused function 'calc_igammac' [-Wunused-function]
c10::BFloat16 calc_igammac<c10::BFloat16>(c10::BFloat16 a, c10::BFloat16 x) {
              ^
../aten/src/ATen/native/Math.h:1110:11: warning: unused function 'calc_igammac' [-Wunused-function]
c10::Half calc_igammac<c10::Half>(c10::Half a, c10::Half x) {
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52378

Reviewed By: walterddr

Differential Revision: D26492412

Pulled By: malfet

fbshipit-source-id: c570c9beb9915c96fca297e0b88d0291937d3132
2021-02-17 16:39:16 -08:00
a11650b069 .circleci: Downgrade CUDA 11.2 -> 11.1 for binaries (#52151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52151

CUDA 11.2 might not be as performant as we thought, so let's downgrade to
something we think is more performant.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26408314

Pulled By: seemethere

fbshipit-source-id: e2446aa0115e2c2a79718b1fdfd9fccf2072822d
2021-02-17 16:20:14 -08:00
79e10ce97b [PyTorch] Construct IValue from List without copies in args (#52325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52325

List's move constructor is comparatively expensive (copies the type) and so is its destructor (has to destroy the type, which isn't null). So, it's best not to create intermediate `List` objects in function parameters. Copy elision won't save us here; it's not allowed to! (see https://en.cppreference.com/w/cpp/language/copy_elision)
ghstack-source-id: 121807291

Test Plan:
Profile AdIndexer benchmark. Time spent in push_outputs is
down from 0.2% to 0.01%.
Inspecting assembly for
`c10::impl::push_outputs<c10::List<at::Tensor>,false>::call`
shows that we have gone from 2 List move ctor calls and 3
~instrusive_ptr dtor calls to 0 calls and 1 call, respectively.

Reviewed By: bhosmer

Differential Revision: D26471092

fbshipit-source-id: 412a85fcc36d141fb91710c7855df24c137813a9
2021-02-17 16:14:51 -08:00
7e2becb70f [PyTorch] Reduce copy/move in c10::ivalue::from (#52324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52324

`c10::ivalue::from` took its parameter by value. `List` has
an expensive move ctor (it has to copy the type shared_ptr) and dtor
(it has to decref the type, which isn't null), so it's better to avoid
intermediate List objects in function parameters.
ghstack-source-id: 121807292

Test Plan:
Profiled AdIndexer benchmark; time spent in push_outputs is
down from 0.5% to 0.23%.
Comparing assembly for
`c10::impl::push_outputs<c10::List<at::Tensor>, false>::call`, we went
from 4 List move ctor calls and 5 ~intrusive_ptr calls to 2 move ctor
calls and 3 dtor calls, respectively.

Reviewed By: bhosmer

Differential Revision: D26471093

fbshipit-source-id: 7b2c5e8d391a428f2b4d895717a43123c8d7a054
2021-02-17 16:07:45 -08:00
f7a3634466 [WIP][FX] Normalize torch.nn.functional calls (#51816)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51816

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26290764

Pulled By: jamesr66a

fbshipit-source-id: 9c05ff1b7c6f0ab8a13516f7cc2fe279980ebe5d
2021-02-17 15:18:03 -08:00
8bf846d2c8 Skip OneDNN Convolution in case of groups = 24 #50042 (#52327)
Summary:
Temporarily disabling OneDNN conv for group size = 24, as the OneDNN update came too late to be fully tested https://github.com/pytorch/pytorch/issues/50042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52327

Reviewed By: agolynski

Differential Revision: D26474186

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d6964d33c8dcab70e207088c3940810eabbd068
2021-02-17 14:49:23 -08:00
f6e0f5b85a [typing] ignore mypy false positives in aten_test.py (#52370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52370

After adding .pyi stubs for torch / caffe2 protos, there were some mypy false positives (https://github.com/pytorch/pytorch/pull/52341). We tell mypy to ignore the offending file here.

Test Plan: Let CI run.

Reviewed By: malfet, dzhulgakov

Differential Revision: D26490302

fbshipit-source-id: 87cdfd7419efdc7abece9ca975a464201732b7a0
2021-02-17 14:31:40 -08:00
5003d417d4 [PyTorch Mobile] Outline DispatchStub::get_call_ptr() (#51908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51908

As suggested by swolchok. The idea is to outline `DispatchStub::get_call_ptr()` and also not have it specialized per instantiation of the `DispatchStub` class since it results in size bloat for mobile.

ghstack-source-id: 121873712

Test Plan:
Build + Circle CI.

### lightspeed

```
D26324255-V8 (https://www.internalfb.com/intern/diff/D26324255/?dest_number=121462800)

messenger-experimental-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -13.0 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -51.4 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:325359338899206@base/bsb:325359338899206@diff/
```

### igios

```
D26324255-V8 (https://www.internalfb.com/intern/diff/D26324255/?dest_number=121462800)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -9.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -23.4 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:823799391811488@base/bsb:823799391811488@diff/
```

### fbios-pika

```
D26324255-V8 (https://www.internalfb.com/intern/diff/D26324255/?dest_number=121462800)

fbios-pika: Succeeded
Change in Download Size for arm64 + 3x assets variation: -8.0 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -22.7 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:1345469719167068@base/bsb:1345469719167068@diff/
```

Reviewed By: swolchok

Differential Revision: D26324255

fbshipit-source-id: 61aba8687f4c1b742fa9d9d917a026abc8d9c328
2021-02-17 13:42:28 -08:00
e7f28d4241 [PyTorch Mobile] Restructure DispatchStub::operator() code to move template independent code into an external method (#51403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51403

Turns out this isn't a new idea. swolchok posted about this a while ago and this was discussed in the composability group.

Links to posts:
* Template Hoisting: https://fb.workplace.com/groups/llvm.gcc/permalink/2321250667923535/
* C++: Most of the code in a template should depend on the template parameter(s): https://fb.workplace.com/groups/2088132188069398/permalink/2224983771050905/
ghstack-source-id: 121873716

Test Plan: Results in a 10KiB size reduction on fbios. Will re-run BSB for igios.

Reviewed By: swolchok

Differential Revision: D25859327

fbshipit-source-id: 915abebb2643f8ac9a901f3b4d79c63f4bbb5fee
2021-02-17 13:40:30 -08:00
6c875f17ca Enable PyTorch_QNNPACK for Apple Silicon builds (#52308)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52308

Reviewed By: janeyx99

Differential Revision: D26488223

Pulled By: malfet

fbshipit-source-id: ecc3925f3374ad4a9e8f740b007bf6f3b23d8e51
2021-02-17 13:31:16 -08:00
51c28e4d7e [ROCm] enable fft tests (#51581)
Summary:
This PR enables some failing fft unit tests in PyTorch on ROCm.

These tests were failing because hipfft clobbers its inputs, causing mismatches in tests that check that applying an fft and its inverse gets you back to the original input.

We solve this by cloning the input, using an existing flag, on the ROCm platform.

This PR does not enable all fft tests; there are other issues that need to be resolved before that can happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51581

Reviewed By: ejguan

Differential Revision: D26489344

Pulled By: seemethere

fbshipit-source-id: 472fce8e514adcf91e7f46a686cbbe41e91235a9
2021-02-17 13:27:55 -08:00
edf8130e9e [PyTorch] Add set_data_ptr_noswap & use where possible (#52244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52244

`StorageImpl::set_data_ptr` returns the old pointer and thus has to do extra
work. Found because `std::swap<at::DataPtr>` was showing up in
profiling, although at < 1%.
ghstack-source-id: 121795131

Test Plan:
Run AdIndexer benchmark under `perf stat`.

Before:
```
         17,990.01 msec task-clock                #    0.998 CPUs utilized            ( +-  0.43% )
             6,550      context-switches          #    0.364 K/sec                    ( +- 31.42% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  7.14% )
           103,820      page-faults               #    0.006 M/sec                    ( +-  2.47% )
    35,610,511,494      cycles                    #    1.979 GHz                      ( +-  0.40% )  (50.03%)
    71,651,045,779      instructions              #    2.01  insn per cycle           ( +-  0.07% )  (50.02%)
    11,679,947,910      branches                  #  649.246 M/sec                    ( +-  0.10% )  (50.03%)
        69,088,927      branch-misses             #    0.59% of all        branches          ( +-  0.24% )  (50.06%
```

After:
```
         17,896.20 msec task-clock                #    0.999 CPUs utilized            ( +-  0.24% )
             4,011      context-switches          #    0.224 K/sec                    ( +- 27.77% )
                 3      cpu-migrations            #    0.000 K/sec
           100,350      page-faults               #    0.006 M/sec                    ( +-  1.58% )
    35,418,702,208      cycles                    #    1.979 GHz                      ( +-  0.23% )  (50.05%)
    71,449,334,935      instructions              #    2.02  insn per cycle           ( +-  0.09% )  (50.03%)
    11,652,819,899      branches                  #  651.134 M/sec                    ( +-  0.12% )  (50.04%)
        69,744,411      branch-misses             #    0.60% of all branches          ( +-  0.53% )  (50.06%)
```

Cycles difference is within the noise, but it looks like we have an
0.28% instruction count win, which is outside the noise (and fits with
intuition that this should be better).

Reviewed By: hlu1

Differential Revision: D26437297

fbshipit-source-id: bf0fceccf6ad78f1497b03ccb4cdfd1a21c6846c
2021-02-17 12:42:21 -08:00
a07530e57f [quant] Factoring out the list of no_observers (#50459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50459

Some of the custom modules cannot have observers inserted automatically. This PR factors that list out into a separate function.
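
A hedged sketch of the factored-out helper's shape (the function name and set members are illustrative, not necessarily the PR's):

```python
import torch.nn as nn

def no_observer_set():
    """Modules whose internals handle their own observation, so no
    observers should be inserted for them automatically."""
    return {nn.quantizable.LSTM, nn.quantizable.MultiheadAttention}
```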

Test is not required, as it is covered by the unittests for those modules.

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26092531

fbshipit-source-id: 1f89daf3a13ef31bc4e9058c3443559c65a05812
2021-02-17 12:38:30 -08:00
b8584b884e [quant] Quantizable MultiheadAttention (#49866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49866

- Adds the `torch.nn.quantizable.MultiheadAttention`

The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module.
The main difference is that it allows for linear units observation after the `prepare` step in the quantization flow.

Note: The `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes.
This is done to avoid the error of assigning a quantized tensor to a `torch.nn.Parameter`.
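
A small usage sketch (the call signature mirrors `nn.MultiheadAttention`; the exact constructor arguments shown are assumptions):

```python
import torch
import torch.nn as nn

mha = nn.quantizable.MultiheadAttention(embed_dim=16, num_heads=2)
q = torch.randn(5, 3, 16)          # (seq_len, batch, embed_dim)
out, attn_weights = mha(q, q, q)   # same interface as nn.MultiheadAttention
print(out.shape)                   # torch.Size([5, 3, 16])
```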

(Note: this ignores all push blocking failures!)

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25706179

fbshipit-source-id: e27ab641d8d1eccc64cf9e44343459331f89eea4
2021-02-17 12:36:30 -08:00
440fddf07b Remove unnecessary statement in capture_stderr (#52366)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52366

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26489602

Pulled By: ansley

fbshipit-source-id: dd0db0a631840b5efd5dc48887fbf724781c6be4
2021-02-17 12:28:46 -08:00
6dabe0b291 [Dist Profiling] Enable dist profiling for DDP (gloo only) (#52031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031

Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, this previously didn't work because the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080
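A minimal sketch of the now-working flow (single-process gloo group for illustration; the address/port values are placeholders):
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 4))
with torch.autograd.profiler.profile() as prof:
    model(torch.randn(2, 4)).sum().backward()  # allreduce runs off the main thread
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
dist.destroy_process_group()
```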

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26356192

fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
2021-02-17 12:21:37 -08:00
059ee85ca4 [PyTorch] Devirtualize TensorImpl::storage() (#51050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51050

Subclasses want to be able to make storage() calls throw, so
we find some free space in TensorImpl to add a flag that they can set
to make that happen without making storage() virtual. It should still
be inlineable.
ghstack-source-id: 121819684

Test Plan:
Compared `perf stat` on 1M iterations on AdIndexer benchmark before/after

Before:
```
         74,483.15 msec task-clock                #    0.999 CPUs utilized            ( +-  0.14% )
            16,637      context-switches          #    0.223 K/sec                    ( +- 11.97% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  7.20% )
           107,085      page-faults               #    0.001 M/sec                    ( +-  2.39% )
   147,356,440,831      cycles                    #    1.978 GHz                      ( +-  0.14% )  (50.06%)
   278,678,430,378      instructions              #    1.89  insn per cycle           ( +-  0.01% )  (50.05%)
    43,540,698,177      branches                  #  584.571 M/sec                    ( +-  0.01% )  (50.05%)
       141,028,843      branch-misses             #    0.32% of all branches          ( +-  1.00% )  (50.05%)

```

After:
```
         74,178.77 msec task-clock                #    0.999 CPUs utilized            ( +-  0.31% )
            17,125      context-switches          #    0.231 K/sec                    ( +-  3.41% )
                 3      cpu-migrations            #    0.000 K/sec
           109,535      page-faults               #    0.001 M/sec                    ( +-  1.04% )
   146,803,364,372      cycles                    #    1.979 GHz                      ( +-  0.30% )  (50.03%)
   277,726,600,254      instructions              #    1.89  insn per cycle           ( +-  0.02% )  (50.03%)
    43,299,659,815      branches                  #  583.720 M/sec                    ( +-  0.03% )  (50.03%)
       130,504,094      branch-misses             #    0.30% of all branches          ( +-  1.14% )  (50.03%)

```

Looks like an approximately 0.3% instruction count win (and similarly for cycles, but that's within the noise).

Reviewed By: ezyang

Differential Revision: D26013815

fbshipit-source-id: 07939957929070e18b9981d492d8279c9bb33c55
2021-02-17 11:48:06 -08:00
4305609d66 Fix complex acos edge cases (#52287)
Summary:
Use `std::acos` even when AVX2 is available.
Add a slow but accurate implementation of complex arc cosine based on
W. Kahan's "Branch Cuts for Complex Elementary Functions" paper, where
cacos(z).re = 2*atan2(sqrt(1-z).re(), sqrt(1+z).re())
cacos(z).im = asinh((sqrt(conj(1+z))*sqrt(1-z)).im())
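A quick numerical check of the quoted identity against Python's `cmath.acos` (illustrative only; this is not the actual C++ implementation):
```python
import cmath
import math

def kahan_acos(z: complex) -> complex:
    # Real part: 2*atan2(Re(sqrt(1-z)), Re(sqrt(1+z)))
    re = 2.0 * math.atan2(cmath.sqrt(1 - z).real, cmath.sqrt(1 + z).real)
    # Imaginary part: asinh(Im(sqrt(conj(1+z)) * sqrt(1-z)))
    im = math.asinh((cmath.sqrt((1 + z).conjugate()) * cmath.sqrt(1 - z)).imag)
    return complex(re, im)

z = complex(0.3, -0.8)
print(kahan_acos(z))  # ~(1.3370+0.7501j)
print(cmath.acos(z))  # agrees to floating-point precision
```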

Fixes https://github.com/pytorch/pytorch/issues/42952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52287

Reviewed By: walterddr

Differential Revision: D26455027

Pulled By: malfet

fbshipit-source-id: a81ce1ba4953eff4d3c2a265ef9199896a67b240
2021-02-17 11:36:09 -08:00
72d1ccd3ca Revert D26263480: [Pytorch, Sparsity] Integrate sparse qnnpack operator in framework
Test Plan: revert-hammer

Differential Revision:
D26263480 (87ebaa4eb1)

Original commit changeset: 04ab60aec624

fbshipit-source-id: ad7690eebdc4b2782c2c94b5bbadbde4ef7c0627
2021-02-17 11:29:08 -08:00
cbede834d4 [JIT] Add support for default argument values to Torchbind (#51253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51253

**Summary**
This commit adds support to Torchbind for specifying default values for
arguments of custom class methods.

**Test Plan**
This commit adds a unit test to `test_torchbind.py` that exercises this
feature.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26131529

Pulled By: SplitInfinity

fbshipit-source-id: 68bc86b045dd2f03ba41e1a116081a6eae6ba9ff
2021-02-17 11:27:03 -08:00
324c6aada1 BFloat16: enable prepacked weights' inference (#48922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48922

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25537188

Pulled By: VitalyFedyunin

fbshipit-source-id: ab6eb1ba8cffb5ba9d00d05db8ef616628f8c932
2021-02-17 11:20:00 -08:00
e36a900e89 [tools] Use anonymous access to access S3 bucket (#52338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52338

Reviewed By: samestep

Differential Revision: D26475415

Pulled By: malfet

fbshipit-source-id: a96e8868b11e9e7691daa54ff2baef4446605ba7
2021-02-17 11:00:52 -08:00
0e2520baae [PyTorch] Don't read 1 char per iteration in Unpickler::readString (#51901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51901

It's much more efficient to read multiple chars with 1 memcpy than to call `read<char>` multiple times.
ghstack-source-id: 121278774

Test Plan:
Run WireSerializerBench before/after for small tensors:

```
/tmp/WireSerializerBench.Reader --real_data /mnt/homedir/hwwang/test_serialized_api_request --real_pytorch_api_request --bm_regex '[Ss]mall'
```

Before:
```
DeSerializeWire(Small)                                       7.65us  130.65K
DeSerializeWire(small_Zstd)                      100.49%     7.62us  131.29K
DeSerializeWire(small_Snappy)                    100.49%     7.62us  131.29K
DeSerializeWireIValue(Small)                      82.89%     9.23us  108.30K
DeSerializeWireIValue(small_Zstd)                 82.87%     9.24us  108.27K
DeSerializeWireIValue(small_Snappy)               82.33%     9.30us  107.57K
DeSerializeC2ToBlob(small_NoCompress)           1150.28%   665.39ns    1.50M
DeSerializeC2ToBlob(small_Zstd)                 1149.70%   665.72ns    1.50M
DeSerializeC2ToBlob(small_Zstd_Fast)            1150.94%   665.00ns    1.50M
DeSerializeC2ToBlob(Small_Snappy)               1151.70%   664.57ns    1.50M
DeSerializeC2ToString(small)                    9297.81%    82.32ns   12.15M
```

After:
```
DeSerializeWire(Small)                                       6.86us  145.84K
DeSerializeWire(small_Zstd)                      100.52%     6.82us  146.60K
DeSerializeWire(small_Snappy)                    100.13%     6.85us  146.03K
DeSerializeWireIValue(Small)                      83.94%     8.17us  122.42K
DeSerializeWireIValue(small_Zstd)                 84.00%     8.16us  122.50K
DeSerializeWireIValue(small_Snappy)               84.53%     8.11us  123.28K
DeSerializeC2ToBlob(small_NoCompress)           1019.48%   672.58ns    1.49M
DeSerializeC2ToBlob(small_Zstd)                 1020.03%   672.23ns    1.49M
DeSerializeC2ToBlob(small_Zstd_Fast)            1020.59%   671.85ns    1.49M
DeSerializeC2ToBlob(Small_Snappy)               1020.30%   672.05ns    1.49M
DeSerializeC2ToString(small)                    7709.63%    88.94ns   11.24M
```

Second run after to demonstrate it wasn't just variance:

```
DeSerializeWire(Small)                                       6.92us  144.57K
DeSerializeWire(small_Zstd)                       99.24%     6.97us  143.47K
DeSerializeWire(small_Snappy)                     99.58%     6.95us  143.97K
DeSerializeWireIValue(Small)                      84.83%     8.15us  122.63K
DeSerializeWireIValue(small_Zstd)                 84.72%     8.16us  122.49K
DeSerializeWireIValue(small_Snappy)               84.59%     8.18us  122.29K
DeSerializeC2ToBlob(small_NoCompress)           1031.03%   670.89ns    1.49M
DeSerializeC2ToBlob(small_Zstd)                 1030.64%   671.14ns    1.49M
DeSerializeC2ToBlob(small_Zstd_Fast)            1013.39%   682.57ns    1.47M
DeSerializeC2ToBlob(Small_Snappy)               1013.95%   682.19ns    1.47M
DeSerializeC2ToString(small)                    8155.98%    84.81ns   11.79M
```

By the way, this gets us closer to deserialization parity for the real data sample included in D26049387:

baseline:
```
DeSerializeWire(RealData)                                    7.34ms   136.24
DeSerializeWire(RealData_Zstd)                    99.95%     7.34ms   136.17
DeSerializeWire(RealData_Snappy)                 100.09%     7.33ms   136.36
DeSerializeWireIValue(RealData)                   82.69%     8.88ms   112.65
DeSerializeWireIValue(RealData_Zstd)              82.76%     8.87ms   112.75
DeSerializeWireIValue(RealData_Snappy)            82.68%     8.88ms   112.64
DeSerializeC2ToBlob(RealData_NoCompress)         116.87%     6.28ms   159.23
DeSerializeC2ToBlob(RealData_Zstd)               117.33%     6.26ms   159.85
DeSerializeC2ToBlob(RealData_Zstd_Fast)          117.38%     6.25ms   159.91
DeSerializeC2ToBlob(RealData_Snappy)             117.61%     6.24ms   160.23
DeSerializeC2ToString(RealData)                 4571.81%   160.55us    6.23K
```

with this diff:
```
DeSerializeWire(RealData)                                    6.57ms   152.17
DeSerializeWire(RealData_Zstd)                   100.17%     6.56ms   152.43
DeSerializeWire(RealData_Snappy)                 100.09%     6.57ms   152.31
DeSerializeWireIValue(RealData)                   83.06%     7.91ms   126.40
DeSerializeWireIValue(RealData_Zstd)              83.16%     7.90ms   126.54
DeSerializeWireIValue(RealData_Snappy)            83.22%     7.90ms   126.64
DeSerializeC2ToBlob(RealData_NoCompress)         104.02%     6.32ms   158.29
DeSerializeC2ToBlob(RealData_Zstd)               103.46%     6.35ms   157.43
DeSerializeC2ToBlob(RealData_Zstd_Fast)          104.64%     6.28ms   159.23
DeSerializeC2ToBlob(RealData_Snappy)             104.65%     6.28ms   159.25
DeSerializeC2ToString(RealData)                 4051.03%   162.22us    6.16K
```

Reviewed By: qizzzh

Differential Revision: D26321083

fbshipit-source-id: 92d45e760580bb290078ddac84128174daef0e55
2021-02-17 11:00:48 -08:00
b2aa63f17c [PyTorch] Fix return value of IValue::to for Tensor/String (#51463)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51463

We can make the return type of the `to()` template match the return type of toFoo() by using the same technique we use for `list_element_to_const_ref`. Also simplifies `list_element_to_const_ref`.
ghstack-source-id: 121363468

Test Plan:
CI

built and ran AdIndexer benchmark w/ batch size 1 under perf stat
--repeat 5 to make sure it didn't regress

Reviewed By: bhosmer

Differential Revision: D26163848

fbshipit-source-id: b8563263b9f9fa5311c7d7cedc89e28bc5badda0
2021-02-17 11:00:44 -08:00
a9f5e7229e [PyTorch] Remove reference_cast in make_boxed_from_unboxed_functor (#51319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51319

We were going out of our way to accommodate `IValue::to<Tensor>` returning a copy of the inner Tensor. `IValue::toTensor` is capable of returning a reference without copying, so if we use it directly, we can allow kernels that want to take `Tensor &` to do so!
As a bonus, we get reduced build times.
ghstack-source-id: 121378961

Test Plan:
Rely on CI for correctness.
Profiled build time with -ftime-trace for RegisterCPU.cpp using an extracted build invocation.

Before: P168244900

After: P168245014

Note reduced time spent compiling make_boxed_from_unboxed_functor.

I also ran the AdIndexer benchmark (https://fb.quip.com/ztERAYjuzdlr) with static runtime disabled and batch size 1 to see how big the effect on boxed call performance was (any kernels that take `Tensor&` or `const Tensor&` should now actually save a refcount bump). Looks like it was roughly 1% better:

Before: 124-125 usec/iter
After: 122-123 usec/iter

Reviewed By: bhosmer

Differential Revision: D26138549

fbshipit-source-id: b0f830527da360c542c815bef2f7e1692615b32a
2021-02-17 10:58:44 -08:00
c442776f3c [PyTorch] Debug-gate static_assert in KernelFunction::makeFromUnboxedFunctor (#51367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51367

Templight said that this assertion was taking about 5% of build time for RegisterCPU.cpp (a hopefully-representative example I picked to shorten my iteration cycle).

I've debug-gated it on the grounds that 1) we at least try to build
everything in debug mode and 2) optimized builds presumably take
longer in general, so we can more afford to pay the build time cost in
debug builds.

The win is not entirely clear; please see the test plan for details.
ghstack-source-id: 121378960

Test Plan:
1) Built RegisterCPU.cpp with -ftime-trace before and after. It doesn't seem to call out any difference in the details, but the overall time is stably down more like 10% (55s before and 49s after).
2) Did a full rebuild of aten-cpu with -ftime-trace before and
after. No significant difference in build times shown (it says *after*
is a regression, but it's using wall-time data and the machine is
loaded during builds so there's some noise).
3) Re-profiled with Templight.

Before:

{F366557311}

After:

{F366557501}

Not sure what to conclude overall. A known problem with templight is that template instantiations form more of a dependency graph than a tree because they're cached internally, so eliminating the first caller of a template may just move the time to another caller. However, it looks like we have actually reduced is_functor traffic.

UPDATE: I don't think that the -ftime-trace measurement was reliable; it seems to skew running times. I built this diff vs its base 5 times and measured the CPU ("user") time each time. Results (in seconds):

previous diff: [51.97, 50.54, 50.49, 52.89, 51.61]
mean: 51.5 std: 0.906

this diff: [50.53, 50.41, 50.57, 50.67, 50.94]
mean: 50.6 std: 0.179

Reviewed By: ezyang

Differential Revision: D26153793

fbshipit-source-id: 9a66912c1b2b068f453e78be57454e4e62b7107b
2021-02-17 10:47:07 -08:00
975d9f2551 Mypy fixes for pytorch master (#52090)
Summary:
This PR fixes mypy issues on the current PyTorch master branch. In particular, it replaces occurrences of `np.bool`/`np.float` with `np.bool_`/`np.float64`, respectively:

```
test/test_numpy_interop.py:145: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?  [attr-defined]
test/test_numpy_interop.py:159: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float64"?  [attr-defined]
```
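The fix is a mechanical rename to the non-deprecated scalar types, e.g. (a sketch, not the actual diff):
```python
import numpy as np
import torch

x = torch.tensor([0.0, 1.0, 2.0])
# Before (deprecated aliases, flagged by mypy):
#   mask = x.numpy().astype(np.bool)
#   vals = x.numpy().astype(np.float)
mask = x.numpy().astype(np.bool_)    # numpy's bool scalar type
vals = x.numpy().astype(np.float64)  # explicit 64-bit float
print(mask, vals)
```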

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52090

Reviewed By: walterddr

Differential Revision: D26469596

Pulled By: malfet

fbshipit-source-id: e55a5c6da7b252469e05942e0d2588e7f92b88bf
2021-02-17 10:39:51 -08:00
a8885ee7e6 [BE][typing] add caffe2/torch proto stubs (1 of 2) (#52341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52341

Add type stubs for caffe2 protos and scripts for generating them.

It's worth calling special attention to the following. In order to make `DeviceType`s like `CPU`, `CUDA`, etc. directly accessible from the `caffe2_pb2` module, they are currently freedom-patched into it in `caffe2/python/__init__.py`. This is not ideal: it would be better if these were autogenerated when the protobuf definitions were created by using `allow_alias = true` in the `DeviceTypeProto` definition in `caffe2.proto`.

However, it is impossible to do this currently without significant effort. The issue is that the generated proto constants would conflict with various constants defined in the C++ caffe2 codebase in `caffe2_pb.h`. We cannot simply remove these constants and replace them with the caffe2 DeviceTypeProto constants, because a huge portion of code expects `at::DeviceType` constants defined in `core/DeviceType.h` (apparently duplicated to avoid having to figure out how to autogenerate the protobuf definitions using cmake for ATen).

Instead, we make a best-effort to add additional definitions in `caffe2_pb2.py` by looking for any freedom-patched constants in `caffe2/python/__init__.py` and making sure they have corresponding stubs in the pyi (see `gen_proto_typestubs_helper.py`).

Test Plan: Make sure CI is green; we're just adding stubs.

Reviewed By: d4l3k

Differential Revision: D26331875

fbshipit-source-id: 2eea147e5bf393542f558ff8cf6385c47624b770
2021-02-17 10:30:11 -08:00
99619ea3b7 Automated submodule update: FBGEMM (#52354)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: c520088927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52354

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: ejguan

Differential Revision: D26484989

fbshipit-source-id: c9ccce0141be49c57b80e14992f842364bb18a00
2021-02-17 09:30:47 -08:00
d8bb932245 Support AST rewriting for submodules (#52297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52297

Before, an `nn.Module` with submodules would fail AST rewriting with `TypeError: 'RewrittenModule' object does not support item assignment`. (Try the `test_ast_rewriter_reassigns_submodules` test case on `master`.) This PR fixes the issue and adds additional test cases.
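A sketch of the submodule pattern involved (this uses the plain symbolic tracer for brevity; the commit's fix targets the AST-rewriting path specifically):
```python
import torch
import torch.fx as fx
from torch import nn

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(4, 4)  # submodule assignment

    def forward(self, x):
        return self.inner(x).relu()

traced = fx.symbolic_trace(Wrapper())
print(traced(torch.randn(2, 4)).shape)  # torch.Size([2, 4])
```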

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26483820

Pulled By: ansley

fbshipit-source-id: 757e898dc2b0a67daf2bd039d555b85f4e443322
2021-02-17 09:08:07 -08:00
87ebaa4eb1 [Pytorch, Sparsity] Integrate sparse qnnpack operator in framework
Summary:
Add QNNPACK specific packed params for sparse linear.
Add sparse linear dynamic op with appropriate registration.
Add python side LinearDynamic module for sparsity.
Add tests to validate sparse linear qnnpack kernels.
Note that since these tests are mostly run on the x86 platform, and
given that 1x4 sparse kernels are implemented in both SSE and ARM,
LinearDynamic currently defaults to the 1x4 pattern.
The plan is to add another diff that will allow a global override for the 8x1 pattern
so that the prepare/convert flow can work for exporting models for mobile.

Test Plan: buck run caffe2/torch/fb/model_optimization:sparsity_test

Reviewed By: z-a-f

Differential Revision: D26263480

fbshipit-source-id: 04ab60aec624d1ecce8cfb38b79c7e94f501cdf6
2021-02-17 08:44:16 -08:00
a6e94d274f [Pytorch] Add python binding to use mobile cpu allocator. (#52323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52323

Using the default CPU allocator for ops executed on the QNNPACK backend will result in
ASAN failures with heap overflow, since QNNPACK (and XNNPACK) can access input
beyond its end and/or beginning.

Here we are enabling this feature specifically to enable the dynamic sparse linear op test
using the QNNPACK engine. In the dynamic linear op, the fp32 bias is not packed and
hence can result in out-of-bounds access.

Test Plan: test_set_default_mobile_cpu_allocator.py

Reviewed By: z-a-f

Differential Revision: D26263481

fbshipit-source-id: a49227cac7e6781b0db4a156ca734d7671972d9f
2021-02-17 08:42:23 -08:00
4501b52fe5 Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52229

Create benchmarks for
torch.ops.quantized.linear_prepack_fp16  and torch.ops.quantized.linear_unpack_fp16 operators

Benchmark for these operators are written in the same format as the other benchmarks for other operators.

Test Plan:
linear_prepack_fp16 test was successfully run with various parameters:

Sample test run output:
```
 ----------------------------------------
 PyTorch/Caffe2 Operator Micro-benchmarks
 ----------------------------------------
 Tag : long

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K256_cpu
 Input: M: 8, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 14.002

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N32_K512_cpu
 Input: M: 8, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 14.114

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K256_cpu
 Input: M: 8, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 19.355

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M8_N64_K512_cpu
 Input: M: 8, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 19.056

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K256_cpu
 Input: M: 128, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 115.963

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N32_K512_cpu
 Input: M: 128, N: 32, K: 512, device: cpu
Forward Execution Time (us) : 116.259

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K256_cpu
 Input: M: 128, N: 64, K: 256, device: cpu
Forward Execution Time (us) : 229.336

 Benchmarking PyTorch: linear_prepack_fp16
 Mode: Eager
 Name: linear_prepack_fp16_M128_N64_K512_cpu
 Input: M: 128, N: 64, K: 512, device: cpu
Forward Execution Time (us) : 220.016
```

linear_unpack_fp16 test was successfully run with identical parameters.
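For reference, a sketch of calling the benchmarked ops directly (requires an FBGEMM-enabled build; the shape matches the N32_K256 case above):
```python
import torch

w = torch.randn(32, 256)  # N x K fp32 weight
packed = torch.ops.quantized.linear_prepack_fp16(w)
w_back, bias = torch.ops.quantized.linear_unpack_fp16(packed)
print(w_back.shape)  # torch.Size([32, 256]), values rounded to fp16 precision
```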

Reviewed By: b-koopman

Differential Revision: D26403343

fbshipit-source-id: 11a98e56177952b94f291006975b0b719f48d1b9
2021-02-17 08:02:01 -08:00
6e1a5b1196 [PyTorch] Use real if constexpr behind macro in hot template (#51368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51368

This seems to noticeably reduce build times, at least for
RegisterCPU.cpp. It makes sense that a compiler builtin would be
faster than simulating the same builtin with templates.

Identified with templight.
ghstack-source-id: 121378959

Test Plan:
Confirmed this speeds up RegisterCPU.cpp optimized build by
simply running builds under `time(1)`:

previous diff: [50.53, 50.41, 50.57, 50.67, 50.94]
mean: 50.6 std: 0.179

this diff: [45.71, 45.89, 46.21, 48.51, 45.84]
mean: 46.4 std: 1.05

Reviewed By: bhosmer

Differential Revision: D26154964

fbshipit-source-id: 62ee2f5a872007db032dfebf7ad4d1b6e1ce63d1
2021-02-17 07:38:50 -08:00
680c4ce1dd [PyTorch] Avoid some extra intrusive_ptr<Tuple> copies in Unpickler (#51902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51902

These seem like straightforward improvements. (I don't have measurements; feel free to reject if you're skeptical)
ghstack-source-id: 121278775

Test Plan: CI

Reviewed By: qizzzh

Differential Revision: D26322438

fbshipit-source-id: d393a32cc34bb68bc4f804f4b1cc5a8af27763c9
2021-02-17 07:31:58 -08:00
f235c65a2b [TorchScript] C++ interface of to_<backend> (Re-land) (#52340)
Summary:
This is a re-land off https://github.com/pytorch/pytorch/pull/51797 with fix for spurious libcuda dependency

Fix limits the scope of `no-as-needed` linker flag to just `jitbackend_test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52340

Reviewed By: agolynski, iseeyuan

Differential Revision: D26476168

Pulled By: malfet

fbshipit-source-id: f909428af82182b3bffd020ca18cca7a9b5846b6
2021-02-17 07:17:50 -08:00
8c185e62f9 torchvision hipify revamp fix (#51453)
Summary:
This PR fixes the torchvision build error from the hipify revamp: `KeyError: '/usr/include/libpng16/png.h'`.

Description:

```
Traceback (most recent call last):
  File "setup.py", line 471, in <module>
    ext_modules=get_extensions(),
  File "setup.py", line 329, in get_extensions
    extra_compile_args=extra_compile_args
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 892, in CUDAExtension
    is_pytorch_extension=True,
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/hipify/hipify_python.py", line 978, in hipify
    clean_ctx=clean_ctx)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/hipify/hipify_python.py", line 212, in preprocess
    hip_clang_launch, is_pytorch_extension, clean_ctx, show_progress)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/hipify/hipify_python.py", line 175, in preprocess_file_and_save_result
    hip_clang_launch, is_pytorch_extension, clean_ctx, show_progress)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/hipify/hipify_python.py", line 792, in preprocessor
    output_source = RE_ANGLE_HEADER.sub(mk_repl('#include <{0}>', False), output_source)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/hipify/hipify_python.py", line 785, in repl
    value = HIPIFY_FINAL_RESULT[header_filepath]["hipified_path"]
KeyError: '/usr/include/libpng16/png.h'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51453

Reviewed By: agolynski

Differential Revision: D26459979

Pulled By: fmassa

fbshipit-source-id: f653f55fd34c71314e6c6682217f84b2d1e49335
2021-02-17 03:12:20 -08:00
35b0560ea2 Automated submodule update: FBGEMM (#52255)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 7f3baec496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52255

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D26443031

fbshipit-source-id: 9e2758c73a15e7d2b5aefa5bc38270404cb5862a
2021-02-17 01:12:51 -08:00
bb9e0c625e [nnc] Add dummy reference to llvm::cfg::Update<BasicBlock> (#52321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52321

We're seeing undefined references to this function in coverage builds.
I don't even know why the toolchain is trying to look for it, because it's not
actually used in our code anywhere.

Obviously dropping in a dummy reference is a workaround more than a real
solution, but I'd like to get the coverage build back online.
ghstack-source-id: 121818432

Test Plan: `buck build mode/dbgo-cov //caffe2/test/...`

Reviewed By: asuhan

Differential Revision: D26467484

fbshipit-source-id: 4de8d950b03d0818ffc317fc1bed9be8cf470352
2021-02-16 23:27:31 -08:00
bfc7e28188 reland - ns for fx - stubs of the three APIs (compare weights, activations, activations with shadow) (#52302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52302

Adds the basic functionality for the three Numeric Suite core APIs to work on FX models:
1. comparing weights
2. comparing activations, with the same input fed to both models
3. comparing activations, with nodes of A shadowing nodes of B

Note: there are a lot of TODOs in the code, and some/most of the APIs and implementation details may change as we iterate.  This is just the first PR.

Test Plan:
We have unit test coverage for all of the APIs, for now this is with toy models:

```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Reviewed By: raghuramank100

Differential Revision: D26463013

Pulled By: vkuzo

fbshipit-source-id: e454115099ad18e4037d3c54986951cdffcab367
2021-02-16 19:59:32 -08:00
fa393b56e7 [static runtime] use NNC to generate logit, relu and tanh (#52322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52322

diff BS=1
```
C2 run finished. Milliseconds per iter: 0.0564008. Iters per second: 17730.3
PyTorch run finished. Milliseconds per iter: 0.0677778. Iters per second: 14754.1
```
diff BS=20
```
C2 run finished. Milliseconds per iter: 0.51086. Iters per second: 1957.48
PyTorch run finished. Milliseconds per iter: 0.510077. Iters per second: 1960.49
```

master BS=1
```
C2 run finished. Milliseconds per iter: 0.0567362. Iters per second: 17625.4
PyTorch run finished. Milliseconds per iter: 0.0706478. Iters per second: 14154.7
```

master BS=20
```
C2 run finished. Milliseconds per iter: 0.510943. Iters per second: 1957.17
PyTorch run finished. Milliseconds per iter: 0.516338. Iters per second: 1936.72
```

Reviewed By: bertmaher

Differential Revision: D25407106

fbshipit-source-id: 08595ba5e4be59e2ef95fb9b24da7e7671692395
2021-02-16 18:55:34 -08:00
4156588365 [nnc] Allow 1 ulp tolerance in log approximation (#52165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52165

Apparently bitwise identicality is too high a bar (I'm seeing
differences at this level depending on the HW platform; e.g.,
Broadwell is bitwise accurate but Skylake is 1 ulp off). In any case,
VML is accurate to 1 ulp, so let's allow that.
ghstack-source-id: 121815001

Test Plan: test_approx

Reviewed By: asuhan

Differential Revision: D26408079

fbshipit-source-id: 46cbd1487c72ae7bc40567f2f72ed2b919707d0d
2021-02-16 16:49:36 -08:00
9409a3a39b Check kernel launches in caffe2/operators (#52240)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52240

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D26408330

fbshipit-source-id: 60779ba0e38c8f90e0e341c8faa2661e631112dd
2021-02-16 16:42:05 -08:00
059c564ba4 [DataLoader] Fix module import (#52224)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52224

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26429871

Pulled By: ejguan

fbshipit-source-id: fcf2e5435658ecb92af1079def953b08cebb1f7f
2021-02-16 16:12:33 -08:00
4e36891e4f Temporary disable cat tests on MacOS due to Sandcastle failure
Summary: The `cat` op tests pass on device and on local macOS, but fail during Sandcastle runs. Disabling them for now while we investigate why they fail in Sandcastle.

Test Plan: `buck test //fbobjc/Apps/Internal/PyTorchPlaygroundMac:PyTorchPlaygroundMacTests`

Reviewed By: xta0

Differential Revision: D26468606

fbshipit-source-id: 440369bb68641060fa98dbf37fb8825ee56083e0
2021-02-16 15:15:46 -08:00
52af23b912 Update PyBind to official v2.6.2 tag (#52304)
Summary:
This moves the PyBind submodule from a pre-2.6.2 commit to the official 2.6.2 release hash.
See https://github.com/pybind/pybind11/releases/tag/v2.6.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52304

Reviewed By: samestep

Differential Revision: D26463177

Pulled By: malfet

fbshipit-source-id: 6c6c5d0a4ff0c3f399370194e90dc8295fdd4bb2
2021-02-16 13:40:28 -08:00
63206ada8f Adding back CUDA 11.1 CI (#52171)
Summary:
- Does not disable current CUDA 11.2 CI jobs
- Does not reenable tests disabled for CUDA 11.2
- Removes some unused docker images

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52171

Reviewed By: malfet

Differential Revision: D26461533

Pulled By: janeyx99

fbshipit-source-id: e0e23117498320e72f2cbca547981c5894b48b68
2021-02-16 13:09:36 -08:00
f3f72b5c6b when BUILD_SPLIT_CUDA=ON, create dummy torch_cuda (#52305)
Summary:
Makes a dummy torch_cuda target to maintain better backwards compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52305

Test Plan:
Run `export BUILD_SPLIT_CUDA=ON && python setup.py develop`.
When it's done building, run `ls -lah` within `build/lib` to check that `libtorch_cuda.so` exists and is the same size as `libtorch.so`.

Reviewed By: walterddr

Differential Revision: D26463915

Pulled By: janeyx99

fbshipit-source-id: 2b4cb8ee49bd75e11dc89d94b5956917b1800df1
2021-02-16 12:33:42 -08:00
b887c30980 Out version for sum (#52225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52225

Support an out variant of sum for Static Runtime (SR).

Test Plan:
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

sum node runtime before out variant (1000 runs): 3558 us

sum node runtime after out variant (1000 runs): 2173 us

Reviewed By: ajyu

Differential Revision: D26259744

fbshipit-source-id: bc6a1231353d79a96d45f1cdc676e78a92469d85
2021-02-16 12:01:02 -08:00
71d5a8ea62 [nnc] Benchmark inference batchnorm (#52251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52251

Batchnorm in inference is just a bunch of pointwise ops.  NNC
should be able to do a good job of this, and indeed it does.  For fun
I've included a fused BN->Relu (although the real fusion fun would be
Conv->BN->Relu...).
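For reference, a sketch of the pointwise decomposition being fused (plain eager-mode math, not the NNC kernel itself):
```python
import torch
import torch.nn.functional as F

def bn_inference(x, mean, var, gamma, beta, eps=1e-5):
    # (x - mean) / sqrt(var + eps) * gamma + beta, folded into one scale and shift
    inv_std = torch.rsqrt(var + eps)
    scale = (gamma * inv_std)[None, :, None, None]
    shift = (beta - mean * gamma * inv_std)[None, :, None, None]
    return x * scale + shift

x = torch.randn(1, 64, 112, 112)
mean, var = torch.randn(64), torch.rand(64) + 0.5
gamma, beta = torch.randn(64), torch.randn(64)
ref = F.batch_norm(x, mean, var, gamma, beta, training=False, eps=1e-5)
print(torch.allclose(bn_inference(x, mean, var, gamma, beta), ref, atol=1e-5))
```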

```
---------------------------------------------------------------------------------------
Benchmark                                Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------------
BatchNorm/ATen/1/64/112/112         252886 ns     252875 ns       2785 GB/s=25.3981G/s
BatchNorm/ATen/1/256/14/14           12145 ns      12145 ns      55347 GB/s=33.0525G/s
BatchNorm/ATen/1/128/28/28           18919 ns      18918 ns      37749 GB/s=42.437G/s
BatchNorm/ATen/1/64/56/56            61434 ns      61433 ns      11315 GB/s=26.1363G/s
BatchNorm/ATen/1/512/7/7             11924 ns      11923 ns      59070 GB/s=16.8327G/s
BatchNorm/ATen/5/64/112/112        1873321 ns    1873292 ns        382 GB/s=17.1424G/s
BatchNorm/ATen/5/256/14/14           83470 ns      83459 ns       8538 GB/s=24.0483G/s
BatchNorm/ATen/5/128/28/28          157521 ns     157520 ns       4440 GB/s=25.4829G/s
BatchNorm/ATen/5/64/56/56           314675 ns     314670 ns       2235 GB/s=25.513G/s
BatchNorm/ATen/5/512/7/7             48129 ns      48128 ns      14582 GB/s=20.851G/s

BatchNorm/NNC/1/64/112/112          249454 ns     249428 ns       2802 GB/s=25.749G/s
BatchNorm/NNC/1/256/14/14             9321 ns       9321 ns      74573 GB/s=43.066G/s
BatchNorm/NNC/1/128/28/28            16874 ns      16873 ns      40999 GB/s=47.5797G/s
BatchNorm/NNC/1/64/56/56             59276 ns      59275 ns      12047 GB/s=27.0878G/s
BatchNorm/NNC/1/512/7/7               3452 ns       3452 ns     202610 GB/s=58.1394G/s
BatchNorm/NNC/5/64/112/112         1820201 ns    1820038 ns        373 GB/s=17.6439G/s
BatchNorm/NNC/5/256/14/14            78429 ns      78420 ns       8871 GB/s=25.5935G/s
BatchNorm/NNC/5/128/28/28           155214 ns     155202 ns       4514 GB/s=25.8635G/s
BatchNorm/NNC/5/64/56/56            311454 ns     311449 ns       2163 GB/s=25.7768G/s
BatchNorm/NNC/5/512/7/7              26853 ns      26851 ns      25283 GB/s=37.3735G/s

BatchNorm/ATenRelu/1/64/112/112     378879 ns     378849 ns       1844 GB/s=16.9528G/s
BatchNorm/ATenRelu/1/256/14/14       16707 ns      16705 ns      41391 GB/s=24.029G/s
BatchNorm/ATenRelu/1/128/28/28       30235 ns      30235 ns      23060 GB/s=26.5529G/s
BatchNorm/ATenRelu/1/64/56/56        91164 ns      91160 ns       7662 GB/s=17.6132G/s
BatchNorm/ATenRelu/1/512/7/7         14681 ns      14681 ns      46088 GB/s=13.6707G/s
BatchNorm/ATenRelu/5/64/112/112    2864060 ns    2863566 ns        243 GB/s=11.2142G/s
BatchNorm/ATenRelu/5/256/14/14      118376 ns     118367 ns       5907 GB/s=16.9561G/s
BatchNorm/ATenRelu/5/128/28/28      237893 ns     237873 ns       2936 GB/s=16.8749G/s
BatchNorm/ATenRelu/5/64/56/56       472452 ns     472386 ns       1479 GB/s=16.9949G/s
BatchNorm/ATenRelu/5/512/7/7         61389 ns      61379 ns      11442 GB/s=16.3496G/s

BatchNorm/NNCRelu/1/64/112/112      248378 ns     248341 ns       2812 GB/s=25.8618G/s
BatchNorm/NNCRelu/1/256/14/14         9965 ns       9964 ns      76013 GB/s=40.2861G/s
BatchNorm/NNCRelu/1/128/28/28        16153 ns      16153 ns      43343 GB/s=49.7004G/s
BatchNorm/NNCRelu/1/64/56/56         58761 ns      58757 ns      12095 GB/s=27.3265G/s
BatchNorm/NNCRelu/1/512/7/7          10529 ns      10529 ns      66590 GB/s=19.0625G/s
BatchNorm/NNCRelu/5/64/112/112     1799001 ns    1798757 ns        362 GB/s=17.8527G/s
BatchNorm/NNCRelu/5/256/14/14        78252 ns      78246 ns       8974 GB/s=25.6504G/s
BatchNorm/NNCRelu/5/128/28/28       154940 ns     154923 ns       4483 GB/s=25.9102G/s
BatchNorm/NNCRelu/5/64/56/56        312329 ns     312324 ns       2244 GB/s=25.7046G/s
BatchNorm/NNCRelu/5/512/7/7          51203 ns      51199 ns      13559 GB/s=19.6004G/s
```

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D26440786

Pulled By: bertmaher

fbshipit-source-id: 7d3f7bf6eee4c37736e9875d31ae1b483af9fb6f
2021-02-16 10:57:38 -08:00
0019a20a2b [WIP] Add a _flush_compilation_cache for testing (#52001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52001

Reviewed By: eellison

Differential Revision: D26371876

Pulled By: Krovatkin

fbshipit-source-id: db773d7124916bad31e80bdd7bb9b4170060977b
2021-02-16 10:49:38 -08:00
b01b7ea4f3 store artifacts for windows binary build (#52239)
Summary:
Better debugging: allows you to download the final package for binary Windows builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52239

Reviewed By: agolynski

Differential Revision: D26463613

Pulled By: janeyx99

fbshipit-source-id: ffb0ec044be23286b8975b9a6d2f90d05c2aff9c
2021-02-16 09:55:28 -08:00
4df8e774e6 [ROCm] warn unsupported PYTORCH_CUDA_FUSER_DISABLE_FMA (#50508)
Summary:
nvcc's `--fmad=false` is not valid for the HIP compiler.  Upcoming ROCm releases will start treating unrecognized compiler flags as an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50508

Reviewed By: albanD

Differential Revision: D25920291

Pulled By: mrshenli

fbshipit-source-id: c0ff3b74dd07f3d0661ba29efafaab291ef3621c
2021-02-16 08:09:57 -08:00
68e2a8c420 Reenable test_nn tests for Windows (#52051)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52051

Reviewed By: ngimel

Differential Revision: D26409749

Pulled By: janeyx99

fbshipit-source-id: 5fa76d4fff8cf0fe2130c925fde9dffd0d1e7172
2021-02-16 08:00:07 -08:00
df837d0384 Use the libc++ detection instead of clang detection around std::isinfinite (#52164)
Summary:
Fixes #52163

The libc++ vs. libstdc++ detection in the preprocessor is taken from https://stackoverflow.com/questions/31657499/how-to-detect-stdlib-libc-in-the-preprocessor

Note that in our case the presence of `std::isinfinite` means that we don't need to include any additional headers to guarantee that `_LIBCPP_VERSION` is defined for libc++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52164

Reviewed By: albanD

Differential Revision: D26413108

Pulled By: malfet

fbshipit-source-id: 515e258d6758222c910ababf5172c3a275aff08f
2021-02-15 08:59:37 -08:00
cd46ee6175 Revert D26280518: [TorchScript] C++ interface of to_<backend>
Test Plan: revert-hammer

Differential Revision:
D26280518 (a184ef8df5)

Original commit changeset: fd466e4b4488

fbshipit-source-id: e4def49703ab525c063b8cc5d11296b9cc614fbb
2021-02-15 08:05:16 -08:00
1903b32c35 Directly Return when Numel == 0 for WeightedSum and ScatterWeightedSum
Summary:
The current Caffe2 operators WeightedSum and ScatterWeightedSum enforce that the first input is not empty; otherwise they raise an error. However, in some cases we will have a 0 batch size in training and eval. For example, when training and evaluating current AF and AI OC models, we filter out the search ads in the data pipeline, which might cause a 0 batch size in some iterations. As a result, if the models use Dper3 modules that contain WeightedSum or ScatterWeightedSum (e.g., the HistogramBinningCalibration module), they will occasionally fail in training or eval.

To address this issue, we revise the implementation of WeightedSum and ScatterWeightedSum so that we directly return when their first inputs are empty, without failing the operators.
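A sketch of the previously-failing case (blob names are made up for illustration):
```python
import numpy as np
from caffe2.python import core, workspace

# A 0-batch-size first input, as produced by the ads filtering described above
workspace.FeedBlob("X0", np.zeros((0, 4), dtype=np.float32))
workspace.FeedBlob("w0", np.array([1.0], dtype=np.float32))  # scalar weight
op = core.CreateOperator("WeightedSum", ["X0", "w0"], ["Y"])
workspace.RunOperatorOnce(op)  # now returns directly instead of raising
print(workspace.FetchBlob("Y").shape)
```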

Test Plan:
We tested the code change by building a Dper3 backend canary package. All the jobs for AF and AI OC succeeded with the modified Caffe2 operators:

f251058001
f251058142
f251058332

To compare, all the jobs with identical model configs but with the canary package built from master failed:

f250993908
f250994106
f250994174

Reviewed By: chenshouyuan, huayuli00

Differential Revision: D26444645

fbshipit-source-id: 1c2f81a078810e3ef3c17c133a715090dee2c0ff
2021-02-14 17:49:34 -08:00
eaddadd4f7 Revert D26403094: ns for fx - stubs of the three APIs (compare weights, activations, activations with shadow)
Test Plan: revert-hammer

Differential Revision:
D26403094 (37622db76a)

Original commit changeset: 9752331d4ae0

fbshipit-source-id: f0a32d443a29b25af33d90420dfd1bada40c917c
2021-02-14 15:09:16 -08:00
4949eea0ff [StaticRuntime] Clean up output references and remove dead code (#52237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237

Redo D26331506 (4c58be4573). Get rid of `nodiscard` which broke OSS CI.

- Clean up references to outputs, including Tuples/Lists, by using move semantics
- Clean up references to elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for the corner case of a Tuple/List element being an input.
- Modify unit tests to check the use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.

This diff does not try to fix the alias problem with the MemoryPlanner.

Reviewed By: swolchok

Differential Revision: D26432539

fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
2021-02-13 20:05:28 -08:00
73de98204d [JIT] Add static method support for TorchBind (#51177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51177

**Summary**
This commit adds support for static methods to TorchBind. Just like
pybind, the API for declaring a static method is `def_static(...)`. A
static method must be called on the class directly, and can be called
both in Python as well as TorchScript.

Support for static methods is implemented in a manner similar to that of
instance methods. Registered static functions are wrapped in a layer of
unboxing logic, their schemas are inferred using templates and
metaprogramming, and they are added to the `ClassType` object
corresponding to the TorchBind class on which they are registered.
ScriptClass has been extended to support a `__getattr__` function so
that static methods of TorchBind classes can be invoked in Python. The
implementation of `__getattr__` returns `ScriptClassFunctionPtr`, a
version of `StrongFunctionPtr` without a compilation unit (since the
functions of a TorchBind class live inside the TorchBind registry).
Within TorchScript, TorchBind static functions are desugared in
`PythonClassValue::attr` by looking them up on the class type of the
`PythonClassValue` instance.

**Test Plan**
This commit adds a unit test that tests a simple static method on a
TorchBind class.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26356942

Pulled By: SplitInfinity

fbshipit-source-id: 1b6a9bc2e5f3e22071ad78e331a0201fbbf7ab30
2021-02-13 19:41:27 -08:00
de4c9ecc35 Fix libnvrtc discoverability in package patched by auditwheel (#52184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52184

`auditwheel` inserts the first 8 characters of the library's sha256 checksum into its name before relocating it into the wheel package. This change adds logic for computing the same short sha sum and embedding it into LazyNVRTC as an alternative name for libnvrtc.so.

Fixes https://github.com/pytorch/pytorch/issues/52075

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D26417403

Pulled By: malfet

fbshipit-source-id: e366dd22e95e219979f6c2fa39acb11585b34c72
2021-02-13 19:38:27 -08:00
357e5baf7e Extend DynamicLibrary constructor to support alternative library names (#52183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52183

This allows one to load a library that can exist on the system under different names.
Currently, this functionality is Linux-only, as shared libraries on Windows are not renamed by `auditwheel`.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26417405

Pulled By: malfet

fbshipit-source-id: d327e2565b26cf5b7214e7978862f56e02cad7c6
2021-02-13 19:38:23 -08:00
b8f3a658f9 Do not include "DynamicLibrary.h" into a top-level header (#52182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52182

DynamicLibrary provides very specific functionality, so there is no need to expose it to every project depending on `ATen.h`.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26417404

Pulled By: malfet

fbshipit-source-id: f8318cacb07dcc8b2f95984f88ea1df4e5369b8b
2021-02-13 19:34:46 -08:00
52e6ef8b53 [TensorExpr] Add another test for ExternalCalls. (#52162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52162

This test demonstrates how external calls can interoperate with other
tensor computations and between themselves.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26410813

Pulled By: ZolotukhinM

fbshipit-source-id: 8180164013b43f613d53620d1b249e0af769ae8e
2021-02-13 18:38:17 -08:00
bf841b25e4 [cmake] Add explicit cublas->cudart dependency (#52243)
Summary:
Necessary to ensure correct link order, especially if libraries are
linked statically. Otherwise, one might run into:
```
/usr/bin/ld: /usr/local/cuda/lib64/libcublasLt_static.a(libcublasLt_static.a.o): undefined reference to symbol 'cudaStreamWaitEvent@libcudart.so.11.0'
/usr/local/cuda/lib64/libcudart.so: error adding symbols: DSO missing from command line
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52243

Reviewed By: seemethere, ngimel

Differential Revision: D26437159

Pulled By: malfet

fbshipit-source-id: 33b8bb5040bda10537833f3ad737f535488452ea
2021-02-13 18:21:33 -08:00
490eb3e735 Add 3D depthwise separable convolution (#51027)
Summary:
This pull request (https://github.com/pytorch/pytorch/issues/40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has been open for a while, so I decided to resolve the previous review comments and modify it a bit so that it can be merged into the latest version of PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51027

Reviewed By: albanD

Differential Revision: D26414116

Pulled By: ngimel

fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
2021-02-13 18:14:09 -08:00
846755af2f Remove unused include in TensorIteratorDynamicCasting.h (#51824)
Summary:
In the past, this file included `thrust/complex.h` because the `thrust::complex` --> `c10::complex` migration was not done. That migration has been complete for a while now, but it seems this include was never deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51824

Reviewed By: albanD

Differential Revision: D26417144

Pulled By: ngimel

fbshipit-source-id: 1fff5b8d50f0b34c963a7893cbb0599895823105
2021-02-13 18:02:23 -08:00
8ff5a46c32 [RPC] waitForThreadLocalRRefs returns jitFuture (#51696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51696

Modify this API to use JitFuture.
ghstack-source-id: 121695707

Test Plan: Ci

Reviewed By: mrshenli

Differential Revision: D26239132

fbshipit-source-id: 15c0c349a79e660fe4862e1d99176989f8159bf4
2021-02-13 17:43:16 -08:00
87c0b6bffc [RPC] Move confirmation future in rref context to jit future (#51695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51695

As part of the plan to completely eliminate torch/csrc/utils/future.h,
we are converting this to JitFuture (c10::ivalue::Future).
ghstack-source-id: 121695708

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26238873

fbshipit-source-id: 92bad1a349964ce8a9a80e2d1cf68f293cbe411c
2021-02-13 17:40:55 -08:00
96fd5d87f7 Add dict() constructor (#51934)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51934
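A sketch of the newly accepted form in TorchScript (inferred from the commit title; the exact supported constructor variants may differ):
```python
from typing import Dict

import torch

@torch.jit.script
def make_dict(x: int) -> Dict[str, int]:
    d: Dict[str, int] = dict()  # dict() constructor now compiles
    d["a"] = x
    return d

print(make_dict(3))  # {'a': 3}
```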

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26418199

Pulled By: ansley

fbshipit-source-id: 524f6d9d29ee1fa1b7c5e80ada82e577f47089dc
2021-02-13 15:23:22 -08:00
a184ef8df5 [TorchScript] C++ interface of to_<backend> (#51797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51797

The C++ API `codegen_backend_module` is added for `to_<backend>`. Python-related code is decoupled from this function, so it can be used from both C++ and Python.

* Tests
Python: The existing `test_backends.py`, which calls the C++ API under the hood.
C++: The end-to-end test `jit.BackendTest.ToBackend` is added in `test_backend.cpp`. The original class definitions in this file are moved to `test_backend_lib.cpp`.

ghstack-source-id: 121687464

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: raziel

Differential Revision: D26280518

fbshipit-source-id: fd466e4b448847ce64010a3297fff0b5760c5280
2021-02-13 15:15:45 -08:00
4ab86c87a2 [caffe2 and pytorch] replace temp name of new sparse adagrad JIT'ed function in fbgemm (#52193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52193

In this step, we replace the temporary name and use the old interface name with the new behavior.

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D26232170

fbshipit-source-id: 60233f98fe91a15c3c834bf6fde1b185269dd2b6
2021-02-13 10:23:36 -08:00
a86027ded3 Use side-stream in CPU to GPU copies in DDP (#50180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180

Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding a copy over a background
stream, similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.

Test Plan: CI

Reviewed By: mrshenli, ngimel

Differential Revision: D25818170

fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
2021-02-13 00:57:32 -08:00
71d0b5632b Add SqueezeNet to PyTorch Playground
Summary: Add support for SqueezeNet in the PyTorch Playground test app

Test Plan:
```
arc focus2 pp-ios
```

Reviewed By: xta0

Differential Revision: D26083960

fbshipit-source-id: a0d753eefa431f2f9e377f082c564370d6774c0b
2021-02-12 18:43:51 -08:00
388c38505c [Metal] Add concat op for metal
Summary: Add concat op to enable models such as SqueezeNet.

Test Plan:
Test on device:
```
arc focus2 pp-ios
```
Test on mac
```
buck test pp-macos
```

Reviewed By: xta0

Differential Revision: D26029029

fbshipit-source-id: b0d621f2069a722f0770218c435b22feac4fb873
2021-02-12 18:40:58 -08:00
4cc10563e7 Customize traceback for calls to symbolically-traced code (#51648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51648

The following code will throw during the call to `traced(5)`:
```python
class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.W = torch.nn.Parameter(torch.randn(5))

    def forward(self, x):
        return torch.dot(self.W, x)

traced = fx.symbolic_trace(M())
traced(5)
```

Traceback before:
```
Traceback (most recent call last):
  File "test/tinytest.py", line 26, in <module>
    traced(5)
  File "/home/ansley/local/pytorch/torch/fx/graph_module.py", line 338, in wrapped_call
    return self._cls_call(self, *args, **kwargs)
  File "/home/ansley/local/pytorch/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<eval_with_key_0>", line 4, in forward
TypeError: dot(): argument 'tensor' (position 2) must be Tensor, not int
```

Traceback after:
```
Traceback (most recent call last):
  File "/home/ansley/local/pytorch/torch/fx/graph_module.py", line 338, in wrapped_call
    return torch.nn.Module.__call__(self, *args, **kwargs)
  File "/home/ansley/local/pytorch/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<eval_with_key_1>", line 4, in forward
    dot_1 = torch.dot(w, x);  w = x = None
TypeError: dot(): argument 'tensor' (position 2) must be Tensor, not int

Call using an FX-traced Module, line 4 of the traced Module’s generated forward function:
    w = self.W
    dot_1 = torch.dot(w, x);  w = x = None

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    relu_1 = dot_1.relu();  dot_1 = None

    return relu_1
```

(Note that the same `TypeError` is thrown despite modifying the traceback.)

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26424005

Pulled By: ansley

fbshipit-source-id: 368f46ba81fb3111bd09654825bb2ac5595207d1
2021-02-12 18:31:23 -08:00
1657d59641 Walk Python AST to check for unsupported attribute type annotations (#51805)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51805

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26418589

Pulled By: ansley

fbshipit-source-id: c13e9096dcfa242d158ebf1ae4f86ef6c46ff0ec
2021-02-12 18:18:01 -08:00
37622db76a ns for fx - stubs of the three APIs (compare weights, activations, activations with shadow) (#51669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51669

Adds the basic functionality for the three Numeric Suite core APIs to work on FX models:
1. comparing weights
2. comparing activations, with the same input fed to both models
3. comparing activations, with nodes of A shadowing nodes of B

Note: there are a lot of TODOs in the code, and some/most of the APIs and implementation details may change as we iterate.  This is just the first PR.

Test Plan:
We have unit test coverage for all of the APIs, for now this is with toy models:

```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D26403094

fbshipit-source-id: 9752331d4ae0105346d3da309b13c895b593b450
2021-02-12 17:52:21 -08:00
bfe6e23209 Early version of fx graph matcher for NS (#51588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51588

Early version of a utility to match nodes between graph A and graph B, for Numeric Suite for FX graph mode quantization.

The main goal of this utility is to reliably match the nodes of graph A to the nodes of graph B, and to throw an easy-to-read error message. This will be used in future PRs to create the APIs for matching activations. It could also potentially be used to match weights.

Test Plan:
For now, we have bare bones test coverage on some toy models, and a single torchvision model.

```
python test/test_quantization.py TestFXGraphMatcher
```

Future PRs will add more testing.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26403093

fbshipit-source-id: 60e318d51e6fefe65265488c4967629d946048ef
2021-02-12 17:50:13 -08:00
2900cf2b94 Refactor autograd discovery code (#52057)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34067 using the approach from https://github.com/pytorch/pytorch/issues/34426 by hczhu.
In addition to removing the unnecessary any(), we also:
 - Get rid of the outer loop, since graph_root also needs to be checked
 - Update the pseudo-code description so it matches what the code does
 - Add some comments explaining the difference between assigning `info.needed_` and `info.captures_` in terms of how that affects discovery
 - [edit: another benefit is that exec_info entries are no longer created for all reachable nodes]

This PR is on top of https://github.com/pytorch/pytorch/issues/51940, so once that lands rebasing on top of master should get rid of the extra commits and changes

I'm not sure if this change will bring a lot of performance gains, but the main benefit is that the code is easier to read.

Trivial graph:
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

Timer before:
  15.45 us
Time after:
  14.33 us
1 measurement, 10000 runs , 1 thread

Instructions after:
                           All          Noisy symbols removed
    Instructions:      8271213                    8193169
    Baseline:             4244                       3838
Instructions before:
                           All          Noisy symbols removed
    Instructions:      8142843                    8054463
    Baseline:             4280                       3838
100 runs per measurement, 1 thread
```
Small graph:
```
torch.autograd.grad((b*a.exp()+a*b.exp()).sum(), (a, b))
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)

Time before:
  52.25 us
Time after:
  50.80 us
1 measurement, 10000 runs , 1 thread

Instruction count before:
                           All          Noisy symbols removed
    Instructions:     25601257                   25518229
    Baseline:             4228                       3838
Instruction count after:
                           All          Noisy symbols removed
    Instructions:     25606533                   25522797
    Baseline:             4228
100 runs per measurement, 1 thread
```
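
The timings above follow the output format of `torch.utils.benchmark.Timer`; a minimal sketch of reproducing the trivial-graph wall-time measurement (the instruction counts come from callgrind-based collection, which requires Valgrind):

```
import torch
from torch.utils.benchmark import Timer

# `torch` is available inside stmt/setup by default with this Timer.
t = Timer(
    stmt="torch.autograd.grad(a * b, [a, b], gradient)",
    setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
""",
)
print(t.timeit(10000))             # wall time: 1 measurement, 10000 runs, 1 thread
# stats = t.collect_callgrind(100) # instruction counts (requires Valgrind)
```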

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52057

Reviewed By: ngimel

Differential Revision: D26432207

Pulled By: soulitzer

fbshipit-source-id: beef68344d66e9e286378e31e3311ba43c25c749
2021-02-12 16:22:35 -08:00
b2d8f0a431 [pytorch][bot] update mobile op deps (#52110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52110

LLVM_DIR=/usr ANALYZE_TORCH=1 tools/code_analyzer/build.sh
cp build_code_analyzer/work/torch_result.yaml tools/code_analyzer/default_op_deps.yaml

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26419138

Pulled By: ljk53

fbshipit-source-id: 26bf00036b19ad18a9cf06111df4d9fe32e5feab
2021-02-12 14:50:29 -08:00
a8321855ad Check kernel launches in caffe2/aten/src/THC (#52174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52174

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D26408837

fbshipit-source-id: deecd2e856946d1adbc985c13db110c06de6f3df
2021-02-12 14:16:10 -08:00
7b21c6be67 [Dist Profiling] Enable profiling for gloo send/recv (#52004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52004

Enables profiling of p2p collectives for Gloo. Modified/added relevant unittests.
ghstack-source-id: 121507511

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26347164

fbshipit-source-id: f4d1c474fccf40d5776fc13c4add7a053ea08960
2021-02-12 13:46:51 -08:00
49c8be516e Add ARM64 cross-compilation build on OS X (#49751)
Summary:
Tests cross-compilation for the ARM64 architecture in macOS CI.

This should be merged after PR https://github.com/pytorch/pytorch/issues/50243 and https://github.com/pytorch/pytorch/issues/50922 (adding a fix).

We pin the wheel package to version 0.36.2 because lower versions cannot handle cp38 as a tag for the wheel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49751

Reviewed By: albanD

Differential Revision: D26411133

Pulled By: janeyx99

fbshipit-source-id: 00a5cf597aee10adea1547579270cb3b38732563
2021-02-12 13:08:30 -08:00
83fa713f2b Fix test to use proper condition. (#52216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52216

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26427506

Pulled By: ailzhang

fbshipit-source-id: ba4f2f66794cb2843926e5566eb4d25582f7fb2b
2021-02-12 12:59:35 -08:00
0dc0cb1d8d Enable FP16 sparse regularizer
Summary: Previously, there was no regularizer implemented for fp16 sparse features. Add regularizer support here using the Float16SparseNormalize operator implemented in this stack.

Test Plan:
buck test //caffe2/caffe2/python:regularizer_test

In f248648705, we can see there is the operator `Float16SparseNormalize`.

{F356635445}

Reviewed By: bigrabithong

Differential Revision: D24042567

fbshipit-source-id: 5e0065f8c10b8748daffa8a54a6bf8f461460b18
2021-02-12 12:29:32 -08:00
fa0a049d4e Add a make_tempdir() utility function to the TestCase base class (#51762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51762

Update test_util.py to add a `make_tempdir()` function to the `TestCase`
class.  The main advantage of this function is that the temporary
directory will be automatically cleaned up when the test case finishes,
so that the test case does not need to worry about manually cleaning up this
directory.

This also prefixes the directory name with `caffe2_test.` so that it is
more obvious where the temporary directories came from if they are ever
left behind after a crashed or killed test process.

This updates the tests in `operator_test/load_save_test.py` to use this
new function, so they no longer have to perform their own manual cleanup
in each test.
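
A minimal sketch of what such a helper can look like, assuming a standard `unittest`-style `TestCase` (the actual implementation may differ):

```
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):
    def make_tempdir(self) -> str:
        # The prefix makes leftover directories easy to attribute; addCleanup
        # removes the directory when the test finishes, pass or fail.
        path = tempfile.mkdtemp(prefix="caffe2_test.")
        self.addCleanup(shutil.rmtree, path, ignore_errors=True)
        return path
```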

Test Plan: python caffe2/python/operator_test/load_save_test.py

Reviewed By: mraway

Differential Revision: D26271178

Pulled By: simpkins

fbshipit-source-id: 51175eefed39d65c03484482e84923e5f39a4768
2021-02-12 10:56:01 -08:00
05b60921ae [iOS][PyTorch][OSS] fix iOS nightly build (#52197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52197

D26187854 (6045663f39) added `from typing_extensions import Literal` to `tools/codegen/gen.py`, but `typing_extensions` was not installed when building the iOS binary

Was reproduced here: https://app.circleci.com/pipelines/github/pytorch/pytorch/273256/workflows/a1c66866-87ad-4ace-a0f7-f8c17524091c/jobs/10882828

ghstack-source-id: 121621817

Test Plan:
Created a PR to trigger the nightly build which also includes the fix.
https://github.com/pytorch/pytorch/pull/52195
The nightly build was successful: https://app.circleci.com/pipelines/github/pytorch/pytorch/273262/workflows/ed7a0f14-2b48-4599-877f-45271473dd86/jobs/10883042

{F372504913}

Reviewed By: linbinyu

Differential Revision: D26420298

fbshipit-source-id: d88c9203473def936aaf1c1756c3c926d087a959
2021-02-12 10:41:11 -08:00
de54510f15 Check kernel launches in caffe2/caffe2/image (#52173)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52173

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D26408885

fbshipit-source-id: f90a00199a73487cb9134f20c58975b134a0117b
2021-02-12 10:11:41 -08:00
1795398c24 Updates rounding_mode documentation to remove "true" (#52202)
Summary:
In design review, the use of the word "true" for a "rounding mode" that actually performed no rounding was, understandably, considered confusing. This PR updates the documentation to remove references to "true." The signatures for torch.div and torch.divide are updated to reflect the future behavior, where rounding_mode=None will be the default.

This is slightly inaccurate: today, when rounding_mode is not specified, it is effectively None, but users cannot actually pass rounding_mode=None yet. That change was considered too disruptive to the 1.8 branch cut process.
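
For reference, the rounding modes the updated documentation describes behave as follows:

```
import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor([2.0, 2.0])

print(torch.div(a, b))                         # tensor([ 3.5000, -3.5000]), true division
print(torch.div(a, b, rounding_mode="trunc"))  # tensor([ 3., -3.]), rounds toward zero
print(torch.div(a, b, rounding_mode="floor"))  # tensor([ 3., -4.]), rounds toward -inf
```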

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52202

Reviewed By: gchanan

Differential Revision: D26424979

Pulled By: mruberry

fbshipit-source-id: db3cc769c0d9c6d7e42bfad294073c99fa9168d9
2021-02-12 09:19:39 -08:00
e8ab58bfc7 [reland] Early terminate CUDA on common_utils TestCases (#52126)
Summary:
Take 2 of https://github.com/pytorch/pytorch/issues/50914
This change moves the early termination logic into common_utils.TestCase class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52126

Test Plan: CI with ci-all tag

Reviewed By: malfet

Differential Revision: D26391762

Pulled By: walterddr

fbshipit-source-id: a149ecc47ccda7f2795e107fb95915506ae060b4
2021-02-12 07:32:42 -08:00
d22f700f9e Link torch_global_deps to libtbb.so if USE_TBB is enabled (#51741)
Summary:
Some distributions of MKL, such as the one in the Conda default channel, have an implicit dependency on TBB even though they do not list it explicitly in their ELF dynamic section (DT_NEEDED). Pre-loading torch_global_deps into a process that uses such an MKL distribution fails with an unresolved symbol error due to missing libtbb.so. This code change forces torch_global_deps to load libtbb.so into the process to avoid such issues.

Moreover, although we distribute our own TBB build, it is a widely used third-party library, and the same global-namespace treatment rules should apply to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51741

Reviewed By: malfet

Differential Revision: D26261214

Pulled By: cbalioglu

fbshipit-source-id: 94491275f8ec82d5917695e57dd766a10da92726
2021-02-12 07:13:34 -08:00
992d251c39 Revert D26333953: [StaticRuntime] Clean up output references and remove dead code
Test Plan: revert-hammer

Differential Revision:
D26333953 (0c9d72b5e1)

Original commit changeset: cadc0595ad6a

fbshipit-source-id: 75d0b33099342653cd8867b129139325789aee6c
2021-02-12 02:12:31 -08:00
0c9d72b5e1 [StaticRuntime] Clean up output references and remove dead code (#51991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991

- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. A bit overlap with D25592967, but shouldn't be a problem.

This diff does not try to fix the alias problem with the MemoryPlanner.

(Note: this ignores all push blocking failures!)

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```

Reviewed By: bwasti

Differential Revision: D26333953

fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
2021-02-12 01:11:08 -08:00
e4203c4306 Automated submodule update: FBGEMM (#52129)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4d203256ba

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52129

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D26393870

Pulled By: jspark1105

fbshipit-source-id: 6cf01c45c8768f453c9fac5f8af6813db0549083
2021-02-11 22:01:48 -08:00
db6e0c7c0e Replace a platform.system() check with sys.platform (#51766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51766

Check if we are on Windows using `sys.platform` rather than
`platform.system()`.  Even though `platform.system()` is more modern, it
has a few downsides: it performs a runtime check of the platform type,
which has non-zero overhead.  On Linux it actually executes the separate
`/bin/uname` process.  On the other hand `sys.platform` is determined
when the Python interpreter is compiled, so this is a simple hard-coded
string.

Because it is a runtime check, `platform.system()` checks also cannot be
analyzed by static type checkers like Pyre and Mypy.  These type
checkers do understand `sys.platform` checks, and can correctly avoid
complaining about code paths that use platform-specific modules and
functions.  e.g., they can avoid complaining about `ctypes.WinDLL` not
existing on Linux if its use is guarded by a `sys.platform` check.
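
A small illustration of the pattern this change standardizes on:

```
import sys

if sys.platform == "win32":
    # Only evaluated on Windows; Mypy and Pyre understand this guard,
    # so ctypes.WinDLL is not flagged on Linux.
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
```
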
ghstack-source-id: 121107705

Test Plan: Ran tests on Linux, and will check CI test results.

Reviewed By: mraway

Differential Revision: D26271724

Pulled By: simpkins

fbshipit-source-id: b86e427e4ceec0324464ba4bc88b95d5813172d0
2021-02-11 20:09:14 -08:00
dc25c90cfc Check kernel launches in caffe2/aten/src/THCUNN (#52172)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52172

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D26408802

fbshipit-source-id: 4470203087bfedaf5825e5d63f3b9de25dd50161
2021-02-11 19:46:13 -08:00
578f0a04c7 fix torch.nn.parallel.scatter_gather.gather to handle NamedTuples and handle moving output to CPU (#51104)
Summary:
Fixes [#50510](https://github.com/pytorch/pytorch/issues/50510)

Allows `torch.nn.parallel.scatter_gather.gather` to accept a list of NamedTuples as input and return a NamedTuple whose elements are tensors. I added the author's fix using the `is_namedtuple` function.

While testing this fix, I encountered a deprecation warning instructing me to use `'cpu'` instead of `-1` to move the outputs to the CPU. However, doing this causes an assertion error in the `_get_device_index` function. I solved this by handling the CPU case in the affected `forward` function.
rohan-varma
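
For context, a common way such an `is_namedtuple` check is written (a sketch, not necessarily the exact helper used in the fix):

```
def is_namedtuple(obj) -> bool:
    # NamedTuple instances are tuples that also carry the generated
    # _fields/_asdict attributes; plain tuples do not.
    return (
        isinstance(obj, tuple)
        and hasattr(obj, "_asdict")
        and hasattr(obj, "_fields")
    )
```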

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51104

Reviewed By: albanD

Differential Revision: D26395578

Pulled By: rohan-varma

fbshipit-source-id: 6e98c9ce1d9f1725973c18d24a6554c1bceae465
2021-02-11 15:50:28 -08:00
ba7a2f6513 Add debug helper function to check target property (#52093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52093

# Summary
The previous debug if statement only prints the file list, but it's not clear whether the target includes the file list correctly. This function can examine the target, so it's more accurate. This PR includes the following changes:
1. Add a file `DebugHelper.cmake` with `print_target_properties` function.
2. Replace the debug if statement `if(FALSE)` by adding the variable `PRINT_CMAKE_DEBUG_INFO` and applying it accordingly.

Note: previous debug if statement output example:
```
-- CPU sources:
--   /Users/chenlai/pytorch/aten/src/ATen/BatchedFallback.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/BatchedTensorImpl.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/BatchingRegistrations.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/CPUGeneratorImpl.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/Context.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/DLConvertor.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/DynamicLibrary.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/ExpandUtils.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/LegacyTHFunctionsCPU.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/MemoryOverlap.cpp
...
-- GPU sources:
-- CPU include:
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
--   /Users/chenlai/pytorch/aten/src
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src
...
-- GPU include:
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
```

# Test plan
Set `PRINT_CMAKE_DEBUG_INFO` to true by adding `-DPRINT_CMAKE_DEBUG_INFO` in `./scripts/build_pytorch_android.sh`, then run `./scripts/build_pytorch_android.sh x86`

`print_target_properties(torch)` shows
```
torch ANDROID_ARCH = x86
torch ANDROID_STL_TYPE = c++_static
torch ARCHIVE_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/lib
torch AUTOGEN_ORIGIN_DEPENDS = ON
torch AUTOMOC_COMPILER_PREDEFINES = ON
torch AUTOMOC_MACRO_NAMES = Q_OBJECT;Q_GADGET;Q_NAMESPACE;Q_NAMESPACE_EXPORT
torch AUTOMOC_PATH_PREFIX = OFF
torch BINARY_DIR = /Users/chenlai/pytorch/build_android_x86/caffe2
torch BINARY_DIR = /Users/chenlai/pytorch/build_android_x86/caffe2
torch BUILD_WITH_INSTALL_RPATH = FALSE
torch CXX_STANDARD = 14
torch C_STANDARD = 11
torch IMPORTED = FALSE
torch IMPORTED_GLOBAL = FALSE
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INSTALL_RPATH = $ORIGIN
torch INSTALL_RPATH_USE_LINK_PATH = TRUE
torch INTERFACE_LINK_LIBRARIES = torch_cpu_library
torch ISPC_HEADER_SUFFIX = _ispc.h
torch LIBRARY_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/lib
torch LINK_LIBRARIES = torch_cpu_library
torch NAME = torch
torch PCH_INSTANTIATE_TEMPLATES = ON
torch PCH_WARN_INVALID = ON
torch POSITION_INDEPENDENT_CODE = TRUE
torch RUNTIME_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/bin
torch SKIP_BUILD_RPATH = FALSE
torch SOURCES = /Users/chenlai/pytorch/build_android_x86/empty.cpp
torch SOURCE_DIR = /Users/chenlai/pytorch/caffe2
torch SOURCE_DIR = /Users/chenlai/pytorch/caffe2
torch TYPE = STATIC_LIBRARY
torch TYPE = STATIC_LIBRARY
torch UNITY_BUILD_BATCH_SIZE = 8
torch UNITY_BUILD_MODE = BATCH
```

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26377725

Pulled By: cccclai

fbshipit-source-id: dbe21ad533759f33711a0ce5328205bbcd5cf0f3
2021-02-11 15:37:14 -08:00
22b12179db [PyTorch] Make TORCH_INTERNAL_ASSERT use torchCheckFail too (#52086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52086

I previously fixed TORCH_CHECK in D25481308 (7d406b4a07), but didn't cover TORCH_INTERNAL_ASSERT. No reason not to fix it too.
ghstack-source-id: 121456574

Test Plan:
Run framework overhead benchmarks.
Run build size check for igios.

Adindexer benchmark looks encouraging.

Before:
```
I0210 11:10:59.974778 2570617 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0548625. Iters per second: 18227.4
I0210 11:11:07.591706 2570617 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0677804. Iters per second: 14753.5
I0210 11:11:07.637014 2570617 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.35653. Iters per second: 157.319
I0210 11:11:14.592409 2572700 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0543933. Iters per second: 18384.6
I0210 11:11:22.158799 2572700 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0673752. Iters per second: 14842.3
I0210 11:11:22.204160 2572700 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.37655. Iters per second: 156.825
I0210 11:11:29.233793 2573079 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0541586. Iters per second: 18464.3
I0210 11:11:36.726284 2573079 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0666658. Iters per second: 15000.2
I0210 11:11:36.774489 2573079 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.36777. Iters per second: 157.041
I0210 11:11:43.799113 2573238 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0535797. Iters per second: 18663.8
I0210 11:11:51.433924 2573238 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0679261. Iters per second: 14721.9
I0210 11:11:51.479207 2573238 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.34747. Iters per second: 157.543
I0210 11:11:58.492782 2573599 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0548257. Iters per second: 18239.6
I0210 11:12:06.072979 2573599 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0674848. Iters per second: 14818.2
I0210 11:12:06.118813 2573599 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.34473. Iters per second: 157.611

```

After:
```
I0210 11:13:00.267062 2577288 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0531031. Iters per second: 18831.3
I0210 11:13:07.591711 2577288 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0651389. Iters per second: 15351.8
I0210 11:13:07.636951 2577288 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.25168. Iters per second: 159.957
I0210 11:13:14.497283 2580005 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0524907. Iters per second: 19051
I0210 11:13:21.814965 2580005 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0650504. Iters per second: 15372.7
I0210 11:13:21.861150 2580005 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.32074. Iters per second: 158.209
I0210 11:13:28.775005 2580166 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0528345. Iters per second: 18927
I0210 11:13:36.041087 2580166 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0646226. Iters per second: 15474.5
I0210 11:13:36.087904 2580166 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.38721. Iters per second: 156.563
I0210 11:13:43.223469 2580706 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0534523. Iters per second: 18708.3
I0210 11:13:50.603958 2580706 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.065639. Iters per second: 15234.8
I0210 11:13:50.649281 2580706 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.24524. Iters per second: 160.122
I0210 11:13:57.490873 2580904 BlackBoxPredictorBenchLib.cpp:384] C2 run finished. Milliseconds per iter: 0.0529411. Iters per second: 18888.9
I0210 11:14:04.745435 2580904 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0644963. Iters per second: 15504.8
I0210 11:14:04.790006 2580904 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 6.22258. Iters per second: 160.705
```

Looks like a pretty clear win (though it seems to have helped C2 as well). I checked with perf stat as well and it looks like a 1.9% CPU cycles win:

before:
```
    35,313,858,645      cycles                    #    1.989 GHz                      ( +-  0.32% )  (99.98%)
         17,750.69 msec task-clock                #    0.999 CPUs utilized            ( +-  0.33% )
    70,524,321,763      instructions              #    2.00  insn per cycle           ( +-  0.52% )  (99.98%)
```

after:
```
    34,628,390,377      cycles                    #    1.988 GHz                      ( +-  0.41% )  (99.98%)
         17,416.59 msec task-clock                #    0.999 CPUs utilized            ( +-  0.41% )
    70,800,211,396      instructions              #    2.04  insn per cycle           ( +-  0.11% )  (99.98%)
```

Reviewed By: ezyang

Differential Revision: D26372806

fbshipit-source-id: 817c7e61741334bb3ac33b617f9628309959b9c3
2021-02-11 15:23:01 -08:00
f2b43ddbf4 Update api doc for enabling TcpStore on Windows (#51847)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51847

Reviewed By: albanD

Differential Revision: D26405678

Pulled By: malfet

fbshipit-source-id: 073b675225b48d1732771583f8f2473e0fdcf35c
2021-02-11 14:44:03 -08:00
ac2bdf553e update build_host_protoc command for macos cross compilation (#50922)
Summary:
Currently, adding a cross-compile build fails on CI because a CMake built-in compiler check does not pass when cross-compiling the host protoc library.

Setting the CMAKE_TRY_COMPILE_TARGET_TYPE flag should fix it. (Based on this [Stack Overflow answer](https://stackoverflow.com/questions/53633705/cmake-the-c-compiler-is-not-able-to-compile-a-simple-test-program).)

To test that this works, please run: `CMAKE_OSX_ARCHITECTURES=arm64 USE_MKLDNN=OFF USE_NNPACK=OFF USE_QNNPACK=OFF USE_PYTORCH_QNNPACK=OFF BUILD_TEST=OFF python setup.py install` from a Mac x86_64 machine with Xcode 12.3 (anything with the macOS 11 SDK).

Then, you can check that things were compiled for arm by running `lipo -info <file>` for any file in the `build/lib` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50922

Reviewed By: malfet

Differential Revision: D26355054

Pulled By: janeyx99

fbshipit-source-id: 919f3f9bd95d7c7bba6ab3a95428d3ca309f8ead
2021-02-11 14:36:51 -08:00
6385c13630 [vulkan] Efficient gemm implementation (#49609)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49609

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26209677

Pulled By: SS-JIA

fbshipit-source-id: 773a944559bf0deb3cf3e233d833220a12f9f2ab
2021-02-11 14:10:05 -08:00
70a805a286 [ROCm] skip one more magma test that is flaky (#52064)
Summary:
Skipped hipMAGMA tests are tracked in https://github.com/pytorch/pytorch/issues/51303.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52064

Reviewed By: albanD

Differential Revision: D26406745

Pulled By: walterddr

fbshipit-source-id: 2405ea06e03450eb22177c2c8b12a366cfbdaa93
2021-02-11 14:02:52 -08:00
4c58be4573 [StaticRuntime] Clean up input references (#51952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952

StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.

Will clean up output tensors in separate diffs.

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```

Reviewed By: bwasti

Differential Revision: D26331506

fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
2021-02-11 13:46:19 -08:00
deb74edb28 Add script to display history for a single test across multiple jobs over time (#52000)
Summary:
Adapted from this gist: https://gist.github.com/malfet/1c34f261a28ae7af61210174394eaece

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52000

Test Plan: Example shell session here: https://pastebin.com/HYgWZBFB

Reviewed By: walterddr

Differential Revision: D26372191

Pulled By: samestep

fbshipit-source-id: cdc9a27e1b4a0b3123a70e693b17d524e7c6cb95
2021-02-11 13:27:49 -08:00
8908874003 Gh/taylorrobie/import timer fbcode (#52124)
Summary:
`torch.__config__._cxx_flags` gets called on import, which means that Timer can't be used at all if that call fails, even just the wall-time parts. This is needlessly restrictive.
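
A sketch of the defensive pattern (the exact guard and exception type here are assumptions; the actual diff may differ):

```
import torch

try:
    CXX_FLAGS = torch.__config__._cxx_flags().strip().split()
except RuntimeError:
    # Wall-time-only Timer usage does not need compiler flags.
    CXX_FLAGS = None
```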

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52124

Reviewed By: albanD

Differential Revision: D26395917

Pulled By: robieta

fbshipit-source-id: 4336a77dba131f80d386368ef715eed63c1cbcb4
2021-02-11 13:16:50 -08:00
ea8aadf4b6 Use self-hosted runner for nightly docker build CI. (#52148)
Summary:
The GitHub-hosted runner has a maximum of 14 GB of disk space, which is not enough to host the nightly Docker build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52148

Test Plan: CI workflow

Reviewed By: samestep

Differential Revision: D26406295

Pulled By: xuzhao9

fbshipit-source-id: 18a0dff45613649d6c15b8e1e9ca85042f648afd
2021-02-11 13:14:01 -08:00
4c93a79a04 [Dist Profiling] Support shape recording for profiling collectives (#51822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51822

Adds support for shape recording when profiling distributed collectives, for the NCCL/Gloo backends. Added both C++ and Python tests to ensure that shapes are recorded properly. Note that we don't add `ProcessGroupNCCLTest`s since they need to be modified to support a single process per device and a world size > 1.
ghstack-source-id: 121507509

Test Plan: CI

Reviewed By: mrzzd

Differential Revision: D26291739

fbshipit-source-id: 5f7bd54d8c36d17a4a29e172b25266ca3dbd8fbd
2021-02-11 12:42:26 -08:00
76c6e12a5c Minor spelling updates (#52149)
Summary:
Add space between 'e.g.' and 'build'
'pacakge'->'package'

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52149

Reviewed By: osalpekar

Differential Revision: D26405824

Pulled By: malfet

fbshipit-source-id: 386390d3f31a9fc268b05902b9dca1deeaf626f9
2021-02-11 12:36:27 -08:00
3d77529f5b enable autocast for xla (#48570)
Summary:
For enabling amp in torch/xla, see [this](https://github.com/pytorch/xla/pull/2654).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48570

Reviewed By: ezyang

Differential Revision: D26120627

Pulled By: ailzhang

fbshipit-source-id: 32627b17c04bfdad128624676ea9bf6f117bc97d
2021-02-11 12:06:13 -08:00
b6806308ac typo in docs ddp_comm_hooks.rst (#51986)
Summary:
Fixes a typo in ddp_comm_hooks.rst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51986

Reviewed By: SciPioneer

Differential Revision: D26360314

Pulled By: mrshenli

fbshipit-source-id: 50349501c53823cbcbad0f72be7c6ac9d51a4120
2021-02-11 12:02:37 -08:00
517185f946 test_lc_1d: Increase deadline to 5 seconds (#52013)
Summary:
Increasing the deadline so as to avoid flakiness of the test on ROCm.

Signed-off-by: Roy, Arindam <rarindam@gmail.com>


Pull Request resolved: https://github.com/pytorch/pytorch/pull/52013

Reviewed By: albanD

Differential Revision: D26360209

Pulled By: mrshenli

fbshipit-source-id: 1ddc7062c5ff7c980233d22844073de9fb7dcbb3
2021-02-11 11:59:56 -08:00
497b772547 Add custom implementation for csqrt if libc++ is used (#52018)
Summary:
libc++ implements csqrt using the polar form of the number, which results in higher numerical error if `arg` is close to 0, pi/2, pi, or 3pi/4

Fixes https://github.com/pytorch/pytorch/issues/47500
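
A quick way to see the effect, sketching the polar-form formula in Python and comparing it against an accurate `cmath.sqrt`:

```
import cmath
import math

def csqrt_polar(z: complex) -> complex:
    # Polar-form square root: sqrt(r) * exp(i * theta / 2).
    r, theta = cmath.polar(z)
    return cmath.rect(math.sqrt(r), theta / 2.0)

z = complex(-1.0, 1e-10)  # arg(z) is very close to pi
print(cmath.sqrt(z))      # accurate: ~5e-11 + 1j
print(csqrt_polar(z))     # real part loses several significant digits
```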

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52018

Reviewed By: walterddr

Differential Revision: D26359947

Pulled By: malfet

fbshipit-source-id: 8c9f4dc45948cb29c43230dcee9b030c2642d981
2021-02-11 11:53:52 -08:00
0bc7b9843b use sccache 2.15 over the outdated sccache (#52095)
Summary:
Change the macOS build job on CI to use a newer sccache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52095

Reviewed By: walterddr

Differential Revision: D26406024

Pulled By: janeyx99

fbshipit-source-id: a40da4acd4c01af16d30269e67c7015aff54503a
2021-02-11 11:35:42 -08:00
81b9aa743b [pytorch] Update caffe2/python to eliminate Pyre errors (#52083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52083

This makes minor fixes in `caffe2/python` to address all errors currently
reported by Pyre.

I update the code to fix errors when doing so looked simple and safe,
and added `pyre-fixme` comments in other places.
ghstack-source-id: 121109695

Test Plan: Confirmed that Pyre no longer reports errors under `caffe2/python`

Differential Revision: D26272279

fbshipit-source-id: b1eb19d323b613f23280ce9c71e800e874ca1162
2021-02-11 11:04:59 -08:00
c4eb22009e Drop some Python 2 compatibility code (#51769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51769

Remove some Python 2 compatibility code that otherwise causes errors to
be reported from static type checkers.

Static type checkers complain that the old Python 2 modules and
functions referenced by this code do not exist.  Given that Python 2
support is entirely deprecated now, we can simply remove the
compatibility code.
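
A hypothetical example of the kind of shim being removed; on Python 3 the first branch is dead, but type checkers still flag the undefined `unicode` name:

```
import sys

if sys.version_info[0] == 2:
    string_types = (str, unicode)  # noqa: F821 -- Python 2 only
else:
    string_types = (str,)

# After the cleanup, only the Python 3 form remains:
string_types = (str,)
```
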
ghstack-source-id: 121313191

Test Plan:
Was able to get Pyre to successfully type check the `caffe2/python`
directory with this and some other changes.

Reviewed By: Tianshu-Bao

Differential Revision: D26271723

Pulled By: simpkins

fbshipit-source-id: fec8a09466be6867388832380480aafd36616aa1
2021-02-11 11:02:33 -08:00
c931c29120 [PyTorch][easy] Fix TODOs in CppFunction constructors (#51315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51315

The TODOs said to remove this wrapper, and it seems that it can be removed easily.
ghstack-source-id: 121363465

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D26137147

fbshipit-source-id: f1e5971dca071f37400d77cc823214527e4231bc
2021-02-11 10:39:04 -08:00
10d407647f [PyTorch] Reduce template expansion in call_functor_with_args_from_stack (#51313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51313

The problem here is similar to the one described in
https://devblogs.microsoft.com/cppblog/build-throughput-series-more-efficient-template-metaprogramming/
in that we are iterating over an integer sequence of length N, where N
is the number of argument types to our function, and specializing
`TypeListAt` (which we call `element_t`) for each Ith element of the
typelist, which instantiates O(I) template specializations, for a
total of O(N^2).

The solution is also similar: we iterate over the typelist
directly. Unlike in the blog post, we do also need the index in the
sequence, so we retain the index_sequence.
ghstack-source-id: 121363464

Test Plan:
Inspect -ftime-trace output for RegisterCPU.cpp.

Before: P168220187
After: P168220294

we can see that we spend less time instantiating
call_functor_with_args_from_stack and spend a similar amount of time
compiling it. The win is modest, but it's a win and I've already
written it so I'm sending it out. (I was hoping it would reduce
compilation time for make_boxed_from_unboxed_functor.)

Reviewed By: bhosmer

Differential Revision: D26136784

fbshipit-source-id: c91a523486e3019bd21dcd03e51a58aa25aa0981
2021-02-11 10:36:40 -08:00
425a5dc3f7 [DataLoader] Modify SamplerIDP signature (#52104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52104

Make the API of `SamplerIterDataPipe` more reasonable with `sampler_args` and `sampler_kwargs`.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26401494

Pulled By: ejguan

fbshipit-source-id: ee5b5c414782d0880b12968bc9c8aa470b753f6a
2021-02-11 09:29:52 -08:00
aa2fede201 Fix autograd when inputs contains tensors without materialized grad_fn (#51940)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39784
At the time the issue was filed, there was only issue (1) below.

There are actually now two issues here:
1. We always set all inputs passed in through `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`.
2. Before the commit that enabled the engine skipped the dummy node, we knew that root node is always needed, i.e., we hardcode `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was updated to skip the graph root.

To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both python and cpp entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here).

For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940

Reviewed By: VitalyFedyunin

Differential Revision: D26369529

Pulled By: soulitzer

fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e
2021-02-11 09:22:15 -08:00
0de7a4582e Fix Pytorch docker image name by adding the registry prefix (#52089)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52089

Test Plan:
Manually trigger the CI

Fixes the [nightly docker pipeline failure](https://github.com/pytorch/pytorch/actions?query=workflow%3A%22Build+PyTorch+nightly+Docker+image+and+push+to+GitHub+Container+Registry%22)

Reviewed By: albanD

Differential Revision: D26390660

Pulled By: xuzhao9

fbshipit-source-id: 5259fe35ffd154fc6684753f358ec5a63f31428f
2021-02-11 09:12:45 -08:00
fb2693a632 Use bool/float instead of np.bool/np.float (#52103)
Summary:
This is causing type hint test errors on the latest numpy:

```
torch/testing/_internal/common_quantized.py:38: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float64"?  [attr-defined]
torch/testing/_internal/common_methods_invocations.py:758: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?  [attr-defined]
```

Runtime-wise, there's also a deprecation warning:

```
__main__:1: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```
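
The fix itself is mechanical, since the deprecated aliases were just the builtins (a representative sketch):

```
import numpy as np

mask = np.zeros(4, dtype=bool)        # instead of dtype=np.bool
vals = np.zeros(4, dtype=np.float64)  # or dtype=float, instead of np.float
```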


Pull Request resolved: https://github.com/pytorch/pytorch/pull/52103

Reviewed By: suo

Differential Revision: D26401210

Pulled By: albanD

fbshipit-source-id: a7cc12ca402c6645473c98cfc82caccf161160c9
2021-02-11 08:29:54 -08:00
7763c127cd [PyTorch] move aten::dict to lite interpreter (#52032)
Summary:
As title, this operator is needed by [DeepLabV3 model](https://pytorch.org/tutorials/beginner/deeplabv3_on_android.html) used in Image Segmentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52032

Test Plan:
Imported from OSS

1. CI
2. Get the pr (https://github.com/pytorch/pytorch/pull/51419), build pytorch_android (`BUILD_LITE_INTERPRETER=1  ./scripts/build_pytorch_android.sh x86`), run ImageSegmentation app on emulator.
{F371671630}

Reviewed By: dhruvbird

Differential Revision: D26365389

Pulled By: cccclai

fbshipit-source-id: bd4c2bd2be83ed6bd3a4cd35eddb98c11a20e245
2021-02-11 01:52:58 -08:00
bc856b49d4 Add support for constants to fx_glow (#52094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52094

Pull Request resolved: https://github.com/pytorch/glow/pull/5329

Nested constants are created as placeholders by the graph_splitter used in the partitioner. So we change them back to get_attr nodes before serializing the graph.

Reviewed By: jfix71

Differential Revision: D26375577

fbshipit-source-id: 66631aadd6f5b8826ffa0a1e70176fbcaa7431d5
2021-02-11 01:42:59 -08:00
fd41ed1cce Fix flaky TestTrainingLoop - TestE2ETensorPipe (#51939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51939

TestTrainingLoop - TestE2ETensorPipe was flaky since there would still
be inflight background RPCs running as we performed the assertions. This
resulted in these assertions failing since we didn't wait for all RPCs on the
agent to finish.

To resolve this issue, in this PR we join() and shutdown() the RPC agent to
ensure no further RPCs are done. Then we assert the map sizes to ensure no
leaks occurred.

In addition to this, added messageIdToTimeout map to lookup the appropriate
timeout for a messageId. This ensures we remove the appropriate entry from the
map. The previous solution was passing the expirationTime through the lambda,
but it is not guaranteed the lambda would read the response of the request we
just sent out.
ghstack-source-id: 121412604

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26331585

fbshipit-source-id: a41e0534d7d4dfd240446e661e5541311931c7d7
2021-02-10 22:14:06 -08:00
4ab0ef36a4 change back to multiple_outputs_gpu_kernel for learnable fake per-channel quantization (#52017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52017

Change back to multiple_outputs_gpu_kernel for per-channel quantization backward c++/cuda implementations (for diff D24479735 (0c60922fb0))
ghstack-source-id: 121409281

Test Plan:
## Unit Test:
`buck test mode/dev-nosan -c fbcode.platform=platform009 //caffe2/test:quantization -- -v TestFakeQuantize`

## Benchmark Test: (checkout f3980d1d678e)
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerTensorOpBenchmark`

`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerChannelOpBenchmark`

### In **microseconds** (`1e-6` second),
input size: [1, 3, 256, 256]
|                            | C++ Kernel | Non-backprop C++ Kernel |
|----------------------------|------------|-------------------------|
| Per Tensor CPU Forward     | 1372.123   | 1365.981                |
| Per Tensor Cuda Forward    | 84.586     | 27.205                  |
| Per Channel CPU Forward    | 2306.668   | 2299.991                |
| Per Channel Cuda Forward   | 154.742    | 135.219                 |
| Per Tensor CPU Backward    | 2544.617   | 581.268                 |
| Per Tensor Cuda Backward   | 304.529    | 137.335                 |
| Per Channel CPU Backward   | 2582.783   | 582.088                 |
| Per Channel Cuda Backward  | 474.265    | 134.082                 |

input size: [1, 3, 512, 512]

|                            | C++ Kernel | Non-backprop C++ Kernel |
|----------------------------|------------|-------------------------|
| Per Tensor CPU Forward     | 5426.244   | 5726.440                |
| Per Tensor Cuda Forward    | 85.834     | 26.871                  |
| Per Channel CPU Forward    | 9125.913   | 9118.152                |
| Per Channel Cuda Forward   | 159.599    | 145.117                 |
| Per Tensor CPU Backward    | 14020.830  | 2214.864                |
| Per Tensor Cuda Backward   | 285.525    | 131.302                 |
| Per Channel CPU Backward   | 14801.976  | 2104.345                |
| Per Channel Cuda Backward  | 513.025    | 120.222                 |

Reviewed By: raghuramank100

Differential Revision: D26357325

fbshipit-source-id: f42e3803258b0f6b418eab1301b5e5a466671859
2021-02-10 21:46:05 -08:00
cyy
39aa3db62b use make_shared and make_unique and clean unneeded code (#51829)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51829

Reviewed By: izdeby

Differential Revision: D26306098

Pulled By: smessmer

fbshipit-source-id: 4f6c0469c68f044c0bfe0925fcf7b030a25d15e2
2021-02-10 21:38:43 -08:00
9653161fb4 bump nightlies to 1.9.0 (#51891)
Summary:
similar to https://github.com/pytorch/pytorch/pull/45696

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51891

Reviewed By: izdeby

Differential Revision: D26318646

Pulled By: seemethere

fbshipit-source-id: 757194845c758a24eed2d0550866ba890e7a0b58
2021-02-10 20:30:57 -08:00
faaff0cd9b [caffe2 and pytorch] use new sparse adagrad JIT'ed function in fbgemm
Summary: To consider small delay between fbgemm and caffe2/pytorch repo, we are taking multiple steps. In this diff, we use new interface with temp name.

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D26231909

fbshipit-source-id: 83ceb3e12026d459532ef54459ac125b5625d644
2021-02-10 19:52:54 -08:00
d7ea0fe75a [testing] Add OpInfo for rad2deg and deg2rad (#51283)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

We should probably add aliases for these operators to be consistent with NumPy names i.e. `np.degrees` and `np.radians`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51283

Reviewed By: ngimel

Differential Revision: D26171163

Pulled By: mruberry

fbshipit-source-id: 1869604ed400820d95f6ff50a0e3cba1de1ffa84
2021-02-10 19:45:10 -08:00
de334e6a2f fast-path is_complex() in the dispatcher (#50054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50054

Test Plan: Imported from OSS

Reviewed By: swolchok

Differential Revision: D25760987

Pulled By: bdhirsh

fbshipit-source-id: 24666d3b86df6799ebbc478fdcdcaa445daff439
2021-02-10 19:13:33 -08:00
705fa7e964 [Usability] Capture argument names for traced functions and modules (#51775)
Summary:
Previously, `torch.jit.trace` relied on autograd hooks to infer the names of tensors in a computation, including those of function/method arguments. This often doesn't work out because:

- These names often do not exist
- The tracer uses the argument name of the first tensor operation on each tensor as the inferred argument name. These tensor operations have programmatically generated names like `argument_1`.

This PR extracts argument names directly from Python functions and passes them down to the tracer, which then assigns them to the correct graph inputs. This way, we always have the correct argument names captured in the IR.

This is useful both for debugging and for supporting the use of `InterfaceType` to represent traced modules.
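
A sketch of the idea; the comment about the printed graph reflects this PR's intended behavior:

```
import inspect
import torch

def f(x, y):
    return x * y

# Argument names can be read straight from the Python signature...
print(list(inspect.signature(f).parameters))  # ['x', 'y']

traced = torch.jit.trace(f, (torch.rand(2), torch.rand(2)))
# ...and with this PR they are attached to the graph inputs,
# e.g. %x and %y instead of autogenerated names.
print(traced.graph)
```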

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51775

Reviewed By: izdeby

Differential Revision: D26273105

Pulled By: gmagogsfm

fbshipit-source-id: 934a385041137dc3731bb6fa8657b11532fed9e5
2021-02-10 18:28:08 -08:00
4add8502c3 inlining a function that I noticed was hot during previous benchmarking (#50848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50848

I noticed that the call overhead from `Tensor::device()` accounts for ~1-2% of instruction counts, depending on the microbenchmark

Some nice looking instruction count wins https://www.internalfb.com/intern/paste/P164529004/

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25984136

Pulled By: bdhirsh

fbshipit-source-id: 0e54f2afe78caeb5a03abbb15e9197556acfeca1
2021-02-10 18:12:47 -08:00
fa325d7c9f Use sum_integers and multiply_integers (#51146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51146

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25903430

fbshipit-source-id: 329c14018c9e5192864eed88a8ed0a5068ff1c69
2021-02-10 18:05:45 -08:00
bff8194522 Replace 11.1 with 11.2 on CI for Windows (#51598)
Summary:
Adding CUDA 11.2 to Windows CI.

Disabled tests:

The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py

The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64
test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598

Reviewed By: mrshenli

Differential Revision: D26344965

Pulled By: janeyx99

fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
2021-02-10 17:59:11 -08:00
5431d87c3e [JIT] Use is_buffer in BufferPolicy::valid (#49588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49588

**Summary**
`BufferPolicy::valid` uses `!typ->is_parameter(i)` to check if an
attribute is a buffer or not; it should use `type->is_buffer(i)` instead.
It also removes a forward compatibility gate in `python_print.cpp` that
has prevented the preservation of buffer metadata during serialization
in fbcode. Without this, the first change (to `BufferPolicy`) does not
work correctly in fbcode.

**Test Plan**
It is difficult to write an additional test that would have failed before this
commit because the two booleans `is_parameter` and `is_buffer` are never set
to `true` at the same time.

**Fixes**
This commit fixes #48746.

Test Plan: Imported from OSS

Reviewed By: xw285cornell

Differential Revision: D25633250

Pulled By: SplitInfinity

fbshipit-source-id: e727f8506f16d2e2b28f3d76a655f6528e7ac6cb
2021-02-10 17:50:14 -08:00
410ef1335a [JIT] Add buffer/parameter metadata test to test_save_load.py (#49594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49594

**Summary**
This commit adds a unit test to `test_save_load.py` that checks that
saving and loading a module preserves metadata about which module
attributes are parameters and buffers. The hooks that are currently used
to automatically check serialization of every function and module in the
unit tests check that the archive produced by saving and loading and
saving again are the same and that the type tags for the actual IValues
representing the module match before saving and after loading. However,
these tests do not check that buffer and parameter metadata was not
lost or destroyed during serialization.

**Test Plan**
Ran the new unit test.

Test Plan: Imported from OSS

Reviewed By: xw285cornell

Differential Revision: D25730603

Pulled By: SplitInfinity

fbshipit-source-id: 06a202935d9e0654cb1966c34f54707f0a28a331
2021-02-10 17:46:35 -08:00
9f1f5636d7 Revert D26019289: [pytorch][PR] Early terminate CUDA on common_utils TestCases
Test Plan: revert-hammer

Differential Revision:
D26019289 (c1b7ca8062)

Original commit changeset: ddc7c1c0d00d

fbshipit-source-id: 6902d03fa06cda5d03191846bc4dd98af501b594
2021-02-10 17:29:10 -08:00
d0fd41dcfe Add size op in nnapi serializer (#52026)
Summary:
serializer didn't support aten::size

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52026

Test Plan: The Torchvision MobileNetV2 [script](https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html) works. The [test](ecfed07cc5) will be merged after [this PR](https://github.com/pytorch/pytorch/pull/47521/files) is merged.

Reviewed By: dreiss

Differential Revision: D26363133

Pulled By: axitkhurana

fbshipit-source-id: 772a6bea62bca69f8bba19c25c582a1734a70eb1
2021-02-10 15:57:01 -08:00
a1b8f3d4b6 Replace CUDA 11.1 Linux CI with CUDA 11.2 (#51905)
Summary:
Adding 11.2 to CI with BUILD_SPLIT_CUDA enabled.

Disabled the following tests as they were failing in test_optim.py:
test_adadelta
test_adam
test_adamw
test_multi_tensor_optimizers
test_rmsprop

(Issue tracking that is here: https://github.com/pytorch/pytorch/issues/51992)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51905

Reviewed By: VitalyFedyunin

Differential Revision: D26368575

Pulled By: janeyx99

fbshipit-source-id: 31612c7d04d51afb3f18956e43dc7f7db8a91749
2021-02-10 11:43:50 -08:00
9b8d414a9c update sccache wrapper to accommodate new sccache for macos build (#51357)
Summary:
Before I change sccache to point to the newer version in the S3 bucket, this PR makes sure the new sccache wrapper works.

This PR previously tested a newer version of sccache for macOS build jobs. The sccache used until now is over a year old. The results of using the two are different, but the speed isn't much impacted; see below.

With newer sccache and alternate wrapper script from this PR: https://app.circleci.com/pipelines/github/pytorch/pytorch/271777/workflows/b5c6a75e-781a-4c0f-8c99-ff2cbe1e877c/jobs/10808567

With old sccache: https://app.circleci.com/pipelines/github/pytorch/pytorch/271875/workflows/962079ce-e146-482e-b493-c99004f8d89a/jobs/10805680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51357

Reviewed By: walterddr

Differential Revision: D26373266

Pulled By: janeyx99

fbshipit-source-id: ac5ccc512039379af6111b92a5ce37c5268dfdfe
2021-02-10 11:27:55 -08:00
bd6248106b Keep alive graph when creating iterators from it (#51951)
Summary:
Previously, the graph might have been deleted while Python still had iterators into it, leading to segfaults.

This does not fully work for iterators from Nodes and Blocks as they may be invalidated when the owning graph goes out of scope. I will look into these separately.

Fixes https://github.com/pytorch/pytorch/issues/50454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51951

Reviewed By: mrshenli

Differential Revision: D26352629

Pulled By: SplitInfinity

fbshipit-source-id: 67299b6cbf1ac7ab77f8703a0ca8f1162e03fcd4
2021-02-10 11:09:51 -08:00
ce8ba5f3bc Fix test time history report if no ancestor report (#52054)
Summary:
This fixes an issue (currently blocking https://github.com/pytorch/pytorch/issues/51905) where the test time regression reporting step will fail if none of the most recent `master` ancestors have any reports in S3 (e.g. if a new job is added).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52054

Test Plan:
```
python test/test_testing.py
```

Reviewed By: walterddr

Differential Revision: D26369507

Pulled By: samestep

fbshipit-source-id: 4c4e1e290cb943ce8fcdadacbf51d66b31c3262a
2021-02-10 11:02:46 -08:00
a1c67b0763 Silence harmless error logs of TensorPipe agent during shutdown (#51785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51785

The TensorPipe pipes do not really support a "graceful" shutdown: if one side is expecting data (i.e., it has scheduled a readDescriptor call) and the other side closes, the former will receive an error. Such an error will not even be predictable, as it depends on the backend: some may detect this and report it "well" (through an EOFError), others may not be able to tell this apart from a failure and report it as such.

This meant that during shutdown some of these errors would fire and thus the agent would log them as warning. We did add a note that these were expected under some conditions, so that users wouldn't be alarmed, but it was still a far-from-ideal experience.

In principle we could build a "protocol" on top of these pipes to "agree" on a graceful shutdown, and this was the plan to solve this. However, it was rather complicated to implement.

Here I am proposing a quicker, but perhaps hackier, solution, which re-uses the already existing graceful shutdown "protocol" of the agent (i.e., the `join` method) to put the agent in a special state in which it will silence all errors due to a remote shutting down.

Such a check cannot happen in the `shutdown` method, because that's also used in case of ungraceful shutdown (in which case I believe we'd still want to display errors). Since it needs to make sure that all participants have transitioned to this new state before any of them can continue (as otherwise one of them may close its pipes before another one has realized that this is now expected), we need to perform a barrier. Hence the ideal place for it is the `join` method, where we're already doing a lot of gang-wide synchronization. Since the `join` method isn't only called during shutdown, we need to make sure we only switch the agent to this state when it's the last call to join, and we do so by adding a new optional argument to it (which will be ignored by all agents except the TensorPipe one).

I realize this isn't the prettiest solution, and since it changes the agent's API it's worth discussing it carefully. Let me know what you think!
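
From the user's point of view this stays behind the existing graceful-shutdown API (a sketch; the worker name and env setup are illustrative):

```
import torch.distributed.rpc as rpc

# Assumes MASTER_ADDR/MASTER_PORT are set in the environment.
rpc.init_rpc("worker0", rank=0, world_size=2)
# ... issue RPCs ...
# Graceful shutdown performs the gang-wide join described above, after
# which errors caused by remote peers closing their pipes are silenced.
rpc.shutdown(graceful=True)
```
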
ghstack-source-id: 121131940

Test Plan: Run on CircleCI, where this occurred quite a bit, and check the logs.

Reviewed By: mrshenli

Differential Revision: D26276137

fbshipit-source-id: 69ef14fe10908e80e627d9b4505352e482089cc8
2021-02-10 10:58:22 -08:00
b7b944a319 Avoid TensorPipe agent spamming logs when unable to guess IP address (#51784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51784

The TensorPipe agent mimics Gloo when trying to guess the most reasonable IP address to bind to. When that fails, it prints a warning to inform the user. It turns out, we were attempting to guess the address a lot of times (I counted at least 18: 1 for the UV transport, 1 for the IBV transport, 16 for the multiplexed UV channel) and thus they might all print that same identical warning message. That's noisy. Since the outcome of all these guesses will be the same (unless the system config changes underneath, which is unlikely) we can just do it once, print the warning (at most) once, cache the result and reuse it over and over.

Also, we used to have two identical but distinct ways of doing this, one provided by the UV transport and one by the IBV one. TensorPipe offers both methods because backends are modular and independent. However PyTorch always requires the UV one to be present, hence we can always rely on the UV helpers, and avoid using the IBV ones.
ghstack-source-id: 121121275

Test Plan: Look at the CircleCI logs, I think I saw this situation happening there.

Reviewed By: mrshenli

Differential Revision: D26275838

fbshipit-source-id: 8a2ffc40d80388bdca32dbcfed16f28a0a6177a3
2021-02-10 10:54:50 -08:00
03e82f7944 Use CUDA 11.2 for nightly docker build. (#51990)
Summary:
Set CUDA_VERSION to 11.2.0, since Nvidia names its Docker image on Ubuntu 18.04 nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04.

Note that cudatoolkit 11.2.0 is not yet on [conda](https://repo.anaconda.com/pkgs/main/linux-64/), and we need to wait for that before merging this PR.

- https://hub.docker.com/r/nvidia/cuda/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51990

Reviewed By: samestep

Differential Revision: D26371193

Pulled By: xuzhao9

fbshipit-source-id: 76915490dc30ddb03ceeeadb3c45a6c02b60401e
2021-02-10 10:46:20 -08:00
c4a8f0ceaa [torch script] Add pure list producing ops to alias analysis (#51999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51999

as in title

Test Plan: waiting on CI for now

Reviewed By: eellison

Differential Revision: D26349297

fbshipit-source-id: bd5574ed1f8448ba18a6fda4bdc45f45d8b158e9
2021-02-10 09:00:39 -08:00
50e6f0fdb6 Add benchmark for torch.nn.functional.interpolate
Summary:
This diff adds a new microbenchmark for the `torch.nn.functional.interpolate` operator, using OpBench; a rough sketch of such an entry follows.
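A rough sketch, assuming the usual `operator_benchmark` conventions (`config_list`, `TorchBenchmarkBase`, `generate_pt_test`); the real benchmark's configs and details may differ:

```
import operator_benchmark as op_bench
import torch

configs = op_bench.config_list(
    attr_names=["input_size", "output_size", "channels_last"],
    attrs=[
        [(1, 3, 60, 40), (24, 24), True],
        [(1, 3, 60, 40), (24, 24), False],
    ],
    tags=["short"],
)

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last):
        x = torch.rand(*input_size)
        if channels_last:
            x = x.contiguous(memory_format=torch.channels_last)
        self.inputs = {"x": x}
        self.output_size = output_size
        self.set_module_name("interpolate")

    def forward(self, x):
        return torch.nn.functional.interpolate(x, size=self.output_size)

op_bench.generate_pt_test(configs, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```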

Test Plan:
```
[nicolashug@59262.od ~/fbsource/fbcode/caffe2/benchmarks/operator_benchmark/pt (39207820)]$ buck run //caffe2/benchmarks/operator_benchmark/pt:interpolate_test -- --tag_filter short
Starting new Buck daemon...
Buck daemon started.
Parsing buck files: finished in 06:30.7 min
Creating action graph: finished in 33.9 sec
Building: finished in 02:53.4 min (100%) 24224/24224 jobs, 24224 updated
  Total time: 09:58.2 min
/data/sandcastle/boxes/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/interpolate_test#link-tree/torch/utils/cpp_extension.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastTrue
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: True
Forward Execution Time (us) : 510.818

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,60,40)_output_size(24,24)_channels_lastFalse
# Input: input_size: (1, 3, 60, 40), output_size: (24, 24), channels_last: False
Forward Execution Time (us) : 684.324

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastTrue
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: True
Forward Execution Time (us) : 33791.970

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,600,400)_output_size(240,240)_channels_lastFalse
# Input: input_size: (1, 3, 600, 400), output_size: (240, 240), channels_last: False
Forward Execution Time (us) : 50120.585

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastTrue
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: True
Forward Execution Time (us) : 37668.089

# Benchmarking PyTorch: interpolate
# Mode: Eager
# Name: interpolate_input_size(1,3,320,320)_output_size(256,256)_channels_lastFalse
# Input: input_size: (1, 3, 320, 320), output_size: (256, 256), channels_last: False
Forward Execution Time (us) : 56869.472
```

Reviewed By: fmassa

Differential Revision: D26225318

fbshipit-source-id: 7757296192e630c42a6e4913c5c1d93af11d286d
2021-02-10 08:28:16 -08:00
c1b7ca8062 Early terminate CUDA on common_utils TestCases (#50914)
Summary:
This is a follow up on https://github.com/pytorch/pytorch/issues/49869.

Previously, CUDA early termination only happened for generic test classes that extend from `DeviceTypeTestBase`. However, JIT test cases, which extend from common_utils.TestCase, could not benefit from the early termination.

This change moves the early termination logic into the common_utils.TestCase class (see the sketch after this list):
- All tests extending common_utils.TestCase should now terminate early if a CUDA assert occurs.
- For TestCases that extend common_device_type.DeviceTypeTestBase, still only call torch.cuda.synchronize() when a RuntimeError is thrown.
- For TestCases that extend common_utils.TestCase, always synchronize CUDA as long as `torch.cuda.is_initialized()` returns true, regardless of whether the test case uses the GPU.
- This behavior is disabled in common_distributed.py.
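A minimal sketch of the idea on top of a plain `unittest.TestCase` (the real logic lives in common_utils.TestCase and is more involved):

```
import unittest

import torch

class CudaAwareTestCase(unittest.TestCase):
    def run(self, result=None):
        result = super().run(result)
        # Synchronize whenever CUDA was touched, so a sticky device-side
        # assert surfaces here instead of poisoning every later test.
        if torch.cuda.is_initialized():
            try:
                torch.cuda.synchronize()
            except RuntimeError:
                result.stop()  # terminate the remaining suite early
        return result
```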

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50914

Reviewed By: malfet

Differential Revision: D26019289

Pulled By: walterddr

fbshipit-source-id: ddc7c1c0d00db4d073a6c8bc5b7733637a7e77d1
2021-02-10 07:15:40 -08:00
8b0cb5ede3 OpInfo: Added clamp and trunc tests with aliases (#51167)
Summary:
Description:
- Added clamp, trunc tests with aliases
- Added tests for aliases for asin(h), acos(h), etc
- fixed 'fix' alias implementation
- fixed annotations in test_jit_alias_remapping
- updated native_functions.yaml aliases guidelines

Blocked by https://github.com/pytorch/pytorch/issues/50368

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51167

Reviewed By: gchanan

Differential Revision: D26245753

Pulled By: mruberry

fbshipit-source-id: e17b657f0515139735a8a677b1ae284904f98aef
2021-02-10 05:36:18 -08:00
3cf78395cb [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26364039

fbshipit-source-id: 750eb64b22cd84cf99d6595970c10f3aa6037f0b
2021-02-10 04:18:50 -08:00
594a66d778 Warn about floor_divide performing incorrect rounding (#50281) (#50281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51745

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: mruberry

Differential Revision: D26257855

fbshipit-source-id: e5d497cf07b0c746838ed081c5d0e82fb4cb701b
2021-02-10 03:13:34 -08:00
9c0caf0384 Adding support for comparing two bool variables (#51844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51844

Fixes issue #48174

=========

Adds support to compare two bool variables

Test:
======
python test/test_jit.py -k test_compare_two_bool_inputs
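For illustration, a tiny example of the pattern this enables (a sketch; the real test lives in test_jit.py):

```
import torch

@torch.jit.script
def same_flag(a: bool, b: bool) -> bool:
    # Comparing two bool values now compiles under TorchScript.
    return a == b

print(same_flag(True, False))  # False
```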

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26353694

Pulled By: nikithamalgifb

fbshipit-source-id: 41af5ba3e4075ed7a21595b10e388a7302aa1fce
2021-02-10 02:13:25 -08:00
602434bcbe [te] Benchmark vml-based logit (#51771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771

This benchmarks an NNC implementation of logit based on VML's log
implementation.

It's a modest improvement over the sleef algorithm, but seems to be a bit
slower than aten (at larger sizes), and I'm not totally sure why, since you'd
think a fused logit kernel would be better than doing clamp/sub/div, followed
by log.  And yet...

Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2
machine; I suspect that it's needed to give the scheduler enough freedom to
fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.
ghstack-source-id: 121392349

Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                      Time           CPU Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64           483 ns        483 ns    1452336 logit/s=132.469M/s
logit_nnc_sleef/512         3019 ns       3019 ns     228059 logit/s=169.577M/s
logit_nnc_sleef/8192       71427 ns      71424 ns       9662 logit/s=114.695M/s
logit_nnc_sleef/32768     307062 ns     306722 ns       2406 logit/s=106.833M/s

logit_nnc_fast/64            147 ns        147 ns    4408910 logit/s=434.908M/s
logit_nnc_fast/512           781 ns        781 ns     881230 logit/s=655.53M/s
logit_nnc_fast/8192        12519 ns      12518 ns      55626 logit/s=654.421M/s
logit_nnc_fast/32768       50530 ns      50526 ns      10000 logit/s=648.536M/s

logit_nnc_vml/64             125 ns        125 ns    5551460 logit/s=511.603M/s
logit_nnc_vml/512            733 ns        733 ns     938444 logit/s=698.955M/s
logit_nnc_vml/8192         11282 ns      11280 ns      61610 logit/s=726.23M/s
logit_nnc_vml/32768        45051 ns      44991 ns      15473 logit/s=728.325M/s

logit_aten/64                450 ns        449 ns    1599269 logit/s=142.429M/s
logit_aten/512              1055 ns       1054 ns     665538 logit/s=485.595M/s
logit_aten/8192            10865 ns      10864 ns      64152 logit/s=754.032M/s
logit_aten/32768           42106 ns      42103 ns      16477 logit/s=778.287M/s

logit_caffe2/64              233 ns        233 ns    2952127 logit/s=274.761M/s
logit_caffe2/512            1795 ns       1795 ns     393354 logit/s=285.177M/s
logit_caffe2/8192          29924 ns      29923 ns      23225 logit/s=273.77M/s
logit_caffe2/32768        123899 ns     123893 ns       5642 logit/s=264.487M/s
```

Reviewed By: bwasti

Differential Revision: D26272325

fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688
2021-02-10 02:09:14 -08:00
2e35fe9535 [te] Implement log approximation using the VML approach (#51752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51752

Using a straight power series approximation with enough terms gives
precision down to the denormal range, and avoids the fp division used in the
sleef approach.  This is nice because recent CPUs have dual pipelined fma units,
so we can compute 16 logarithms in parallel; whereas there's usually only one
FP divider and it has a fairly high latency/low throughput.
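The same idea as a rough numerical illustration in Python/NumPy (not the NNC kernel itself): split off the exponent, then evaluate a truncated series for the mantissa in Horner form, one multiply-add per term and no division anywhere.

```
import numpy as np

def log_poly(x, terms=20):
    # x = m * 2**e with m in [0.5, 1), so log(x) = log1p(m - 1) + e*log(2).
    m, e = np.frexp(x)
    t = m - 1.0                       # t in [-0.5, 0)
    acc = np.zeros_like(t)
    for k in range(terms, 0, -1):     # Horner: acc = acc*t + c_k, an FMA on HW
        acc = acc * t + (-1.0) ** (k + 1) / k
    return acc * t + e * np.log(2.0)

x = np.array([0.1, 1.0, 2.5, 1e6])
print(np.abs(log_poly(x) - np.log(x)).max())  # small residual from truncation
```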
ghstack-source-id: 121392347

Test Plan:
On my avx2+fma broadwell:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           178 ns        178 ns    3933565 log/s=358.993M/s
log_nnc_sleef/512         1286 ns       1285 ns     559459 log/s=398.354M/s
log_nnc_sleef/8192       19366 ns      19364 ns      36619 log/s=423.053M/s
log_nnc_sleef/32768      79288 ns      79286 ns       8718 log/s=413.287M/s

log_nnc_fast/64             92 ns         92 ns    7644990 log/s=696.939M/s
log_nnc_fast/512           483 ns        483 ns    1426802 log/s=1059.49M/s
log_nnc_fast/8192         7519 ns       7514 ns      95319 log/s=1090.23M/s
log_nnc_fast/32768       31344 ns      31338 ns      22397 log/s=1045.62M/s

log_nnc_vml/64              88 ns         88 ns    7923812 log/s=728.469M/s
log_nnc_vml/512            454 ns        454 ns    1521437 log/s=1.12739G/s
log_nnc_vml/8192          6763 ns       6763 ns     103264 log/s=1.21136G/s
log_nnc_vml/32768        26565 ns      26564 ns      23609 log/s=1.23354G/s

log_aten/64                418 ns        418 ns    1651401 log/s=153.117M/s
log_aten/512               801 ns        801 ns     875857 log/s=638.923M/s
log_aten/8192             6877 ns       6872 ns     100840 log/s=1.19208G/s
log_aten/32768           26989 ns      26988 ns      26268 log/s=1.21416G/s
```

Reviewed By: bwasti, zheng-xq

Differential Revision: D26246400

fbshipit-source-id: dae47ee6baeab1a813ec4d4440748164051aed3d
2021-02-10 02:09:10 -08:00
ff73be7e45 [te] Introduce likely/unlikely CompareSelect hint (#51751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51751

Similar in spirit to the `__builtin_expect` C intrinsic, it's useful
to be able to hint the expected branch direction in a tensor expression.  Using
this flag has a few effects on codegen:

- The CompareSelect is generated using conditional branches, rather than selects
- The conditional branches are strongly hinted (like, 100000:1) in the indicated direction
- A vectorized hinted CompareSelect computes its condition in parallel with a
  mask "reduction" (e.g. a bitcast from `<i1 x 8>` to `<i*>`).  In AVX terms
  this sequence might look like:
```
vpcmpgtd %ymm0, %ymm1, %ymm2
vmovmskps %ymm2, %eax
```

The motivating case for this addition is an attempt I'm making to replicate
fast transcendentals using tensor expressions.  Floating-point numbers have
lots of special cases (denormals, inf, nan) that need special handling, and
it's convenient to be able to punt that handling off to a slow path while
keeping the fast path nice and tight.
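The fast-path/slow-path split, sketched in NumPy terms (illustrative only; the real mechanism is NNC/LLVM codegen, not Python):

```
import numpy as np

def guarded_log(x):
    # "Unlikely" lanes: non-finite, non-positive, or denormal inputs.
    special = ~np.isfinite(x) | (x <= 0.0) | (np.abs(x) < np.finfo(x.dtype).tiny)
    # Fast path runs unconditionally on sanitized inputs.
    out = np.log(np.where(special, 1.0, x))    # stand-in for the tight kernel
    if special.any():                          # the hinted, rarely-taken branch
        with np.errstate(divide="ignore", invalid="ignore"):
            out[special] = np.log(x[special])  # slow path handles the oddballs
    return out

print(guarded_log(np.array([2.0, 0.0, -1.0, np.inf])))
```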
ghstack-source-id: 121366315

Test Plan:
I'm not sure how to test this (except I can tell you it works for
the `log` implementation I'm working on right now).  It would be nice to plumb
the LLIR/ASM output through programmatically so it can be used in FileCheck.
Maybe I'll do that in another diff?

Reviewed By: asuhan

Differential Revision: D26246401

fbshipit-source-id: 900f7fa0520010fb9931d6e3efc8680a51f8d844
2021-02-10 02:09:07 -08:00
74082f0d6f [te][llvm] Generate arithmetic vs logical right shift as appropriate (#51749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51749

Following in the mode of C++, we probably want to distinguish when
it's appropriate to do arithmetic vs. logical right shift.

> For negative a, the value of a >> b is implementation-defined (in most
> implementations, this performs arithmetic right shift, so that the result
> remains negative).

If you look at what clang does, if `a` is unsigned, a logical shift is
generated; if signed, an arithmetic shift.  Let's do the same here.  This turns
out to be useful for, e.g., implementing transcendental function
approximations.
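The distinction in two lines of Python (Python's `>>` on ints is arithmetic; masking to the type's width emulates a logical shift):

```
x = -8                   # 0b...11111000 in two's complement
print(x >> 2)            # arithmetic shift keeps the sign: -2
print((x & 0xFF) >> 2)   # 8-bit logical shift zero-fills: 62
```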
ghstack-source-id: 121366317

Test Plan:
Added Byte (unsigned) and Char (signed) right-shift tests to
test_llvm.

Reviewed By: asuhan

Differential Revision: D26245856

fbshipit-source-id: 260ee9bf4b032b9ce216f89acbc273cde0ed688c
2021-02-10 02:05:39 -08:00
0620c96fd6 Back out "Revert D26009829: Optimize relu on cpu using clamp_min" (#51819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51819

Original commit changeset: 3e945b438fb8

One does not simply change the patterns of aten op calls
ghstack-source-id: 121379333

Test Plan: CI

Reviewed By: nikithamalgifb

Differential Revision: D26291736

fbshipit-source-id: b819ac013c0438cc2f70daed7d7f2ef8fdc12ab7
2021-02-09 23:42:29 -08:00
33afb5f19f fake_quant cachemask: remove Python bindings (#51878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51878

`fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` are implementation
details of `fake_quantize_per_tensor_affine` and
`fake_quantize_per_channel_affine`, removing the
Python bindings for them since there is no need to
expose them.
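Callers keep using the public op, for example:

```
import torch

x = torch.randn(4)
# Public API; the *_cachemask variants are internal implementation details.
y = torch.fake_quantize_per_tensor_affine(x, scale=0.1, zero_point=0,
                                          quant_min=0, quant_max=255)
print(y)
```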

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

Imported from OSS

Reviewed By: albanD, bugra

Differential Revision: D26314173

fbshipit-source-id: 733c93a3951453e739b6ed46b72fbad2244f6e97
2021-02-09 23:27:53 -08:00
5f9fb93c14 [model loading] Add max_batch_size override for batch size exploration
Summary: Currently batch_size is determined on the modeling side. Add a flag caffe2_predictor_disagg_acc_max_batch_size_override to explore different batch sizes during inference.

Test Plan:
replayer test
set caffe2_predictor_disagg_acc_max_batch_size_override=32 on both server and client side.

Reviewed By: khabinov

Differential Revision: D26318568

fbshipit-source-id: 4fa79e2087a5f7f7670988aec7e5b41e63f9980b
2021-02-09 23:05:15 -08:00
768662913a Migrate masked_fill__cuda to ATen (#51404)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51404

Reviewed By: mrshenli

Differential Revision: D26329833

Pulled By: ngimel

fbshipit-source-id: 510988888fad015239ab4766eb391a89b742130b
2021-02-09 22:57:03 -08:00
929b91a24d ns_eager: rename Logger I/O var names to logger_cls (#51359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51359

`Logger` is the name of the base Logger class.  It's confusing that
it is also used as a variable name, which can represent this class
or its subclasses.  Renaming to `logger_cls` to make it clearer.

Test Plan:
```
python test/test_quantization.py TestEagerModeNumericSuite
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D26149577

fbshipit-source-id: a9c12f9446f66e5c683ab054b2a94aeb0cf9cc8a
2021-02-09 22:30:44 -08:00
5a9bac58be Automated submodule update: FBGEMM (#52014)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 884fb257ab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52014

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D26357567

fbshipit-source-id: a9f239c9d3273d04ee15fb052b2bf4f25477814b
2021-02-09 22:19:44 -08:00
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
d23cb94098 [FX] Generalize dict key check in create-arg (#51927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51927

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26329655

Pulled By: jamesr66a

fbshipit-source-id: a15e7d9564551521af12a8fde1c7524856f0cbc2
2021-02-09 21:52:22 -08:00
256f93fb0f [FX][EZ] Fix tuple type annotations (#52010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52010

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D26355481

Pulled By: jamesr66a

fbshipit-source-id: 27bbc5d8949beb68663f2e1e7963bec9afbef0cc
2021-02-09 20:32:30 -08:00
d4e84b0c07 [FX] Fix leaf modules in Transformer (#51998)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51998

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26352087

Pulled By: jamesr66a

fbshipit-source-id: ad8abc6507d4ea95fd3c99b226d1b40c3e9e64cf
2021-02-09 20:29:17 -08:00
d5a9627f10 [PyTorch] Re-order TensorImpl fields to save a word (#50920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50920

There was a hole left after previous changes.
ghstack-source-id: 120714378

Test Plan: static_assert still passes.

Reviewed By: ezyang

Differential Revision: D26008763

fbshipit-source-id: c3830328835e28a0d06c833172ac60457049824b
2021-02-09 20:18:26 -08:00
475278f1c0 [FX] Make some modifications to limitation section (#51928)
Summary:
![](https://i.imgur.com/P0Tq4xR.jpg)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51928

Reviewed By: jamesr66a

Differential Revision: D26329664

Pulled By: Chillee

fbshipit-source-id: 94fd7b03ca53f48b1e4633a462c6e02bb0fd2f3c
2021-02-09 18:32:28 -08:00
3af7b673ef Let child CUDAFuture wait for parent CUDAFuture's CUDAEvents (#51820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51820

If the child cannot extract tensors from the returned IValue, the
current child CUDAFuture won't wait for anything. In this case,
if `wait()` wasn't called on the parent Future, the streams are
not synchronized, and it is possible that the parent Future's CUDA
ops have not been added to streams yet.

This commit adds a `markCompletedWithDataPtrs()` to `ivalue::Future`,
and RPC uses this API to pass Message tensor dataPtrs to the
`PyObject` Future when marking it as completed.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D26324068

Pulled By: mrshenli

fbshipit-source-id: 3d838754f6daabad5cd9fb8953e4360196d110bb
2021-02-09 18:02:07 -08:00
c6b4fc8a90 [ROCm] add 4.0.1 docker image (#51507)
Summary:
Add a ROCm 4.0.1 docker image for CI. Keep the 3.10 image.
Keep the 3.9 image until it is no longer needed.
The plan is to keep two ROCm versions at a time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51507

Reviewed By: seemethere

Differential Revision: D26350348

Pulled By: malfet

fbshipit-source-id: 6230278343ee48f19e96067180590beab96b17cc
2021-02-09 17:51:16 -08:00
1921b244f6 [DataLoader] Rename files of functional datapipes (#51880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51880

Using [more-itertools](https://more-itertools.readthedocs.io/en/stable/api.html) as a reference, rename the files based on their functionality.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26314776

Pulled By: ejguan

fbshipit-source-id: e97bac047a0fa808676cd6f3a9202109d17f81ca
2021-02-09 17:09:10 -08:00
9eb70c3c78 [DataLoader] Rename Callable to Map IterDataPipe (#51879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51879

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26314775

Pulled By: ejguan

fbshipit-source-id: ee77909eae97092155ed6a6c794540e68a04d754
2021-02-09 17:09:06 -08:00
104371e1dc [DataLoader] Implement FilterIterDataPipe (#51783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51783

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26277688

Pulled By: ejguan

fbshipit-source-id: 25ed7da9da88c030b29627142c2f04fed26cdcda
2021-02-09 17:06:06 -08:00
e964d77fca [pytorch] recast infer_type error and amend with name and item that failed inferring
Summary:
When type inference fails while JITing a TorchScript module, the error message gives no indication of where the failure occurred. For example: "Cannot create dict for key type 'int?', only int, float, complex, Tensor and string keys are supported".

This adds the variable name and item to the error message.

Reviewed By: ajaech

Differential Revision: D26327483

fbshipit-source-id: d8c85e7550258d7c56530f5826ff9683ca8b2b94
2021-02-09 16:07:16 -08:00
12d85b536e Fixing Softmax bench. (#51898)
Summary:
Fixes and enables the microbenchmark for Softmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51898

Reviewed By: gmagogsfm

Differential Revision: D26333189

Pulled By: navahgar

fbshipit-source-id: be0934e413c4f6728593f896e53a0b31f1657e52
2021-02-09 15:03:49 -08:00
7e54a64828 [C2] Add shape inference logic for ColwiseMax operator. (#51914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51914

As desc.

Test Plan: Unit-test.

Reviewed By: intermilan

Differential Revision: D26299115

fbshipit-source-id: 9c80236f843e907476da1747dcd623c85147fa90
2021-02-09 14:12:07 -08:00
0410cba23e [FX] make map_arg require a callable (#51907)
Summary:
This makes something like: `map_arg(lambda x: x, [Node(), Node()])` throw an error (before it would silently return `lambda x: x`)
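A hedged illustration of the new behavior (plain values, which `map_arg` passes through untouched):

```
from torch.fx.node import map_arg

args = (1, 2, 3)
print(map_arg(args, lambda n: n))  # fine: fn is applied to any Nodes inside

# Swapping the arguments used to silently return the lambda itself;
# now map_arg errors out because the second argument is not callable.
map_arg(lambda n: n, args)
```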

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51907

Reviewed By: jamesr66a

Differential Revision: D26323916

Pulled By: jansel

fbshipit-source-id: f56ebcf9a3af47546d75603567025163f1fb8454
2021-02-09 13:36:27 -08:00
2f2b170068 [Pytorch Mobile] Only preserve bundled input helpers for forward if they exist (#51884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51884

It is now possible to bundle inputs for some functions without bundling them for forward. This is OK, so we need to account for it.
ghstack-source-id: 121266667

Test Plan: Manually bundle inputs for a function not named forward. Call optimize_for_mobile and make sure the functions are still there. {P173289878}

Reviewed By: iseeyuan

Differential Revision: D26304558

fbshipit-source-id: 79f82d9de59c70b76f34e01f3d691107bf40e7bc
2021-02-09 13:31:42 -08:00
8fab33f942 Fix the lifetime of PyTensorType (#51649)
Summary:
Make sure that `PyTensorType` objects are always available during the shutdown.
See https://github.com/pytorch/pytorch/issues/42125#issuecomment-772397319 for a more in-depth explanation of why it's needed.

Fixes https://github.com/pytorch/pytorch/issues/42125.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51649

Reviewed By: zhangguanheng66

Differential Revision: D26256843

Pulled By: ezyang

fbshipit-source-id: 4d1dac75e063f0bdc65e0784140641fc4beb8616
2021-02-09 13:28:12 -08:00
0ec00c1292 [docs] Add docs for storage and tensors for quantized Tensor (#51817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51817

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26292464

Pulled By: jerryzh168

fbshipit-source-id: c5992deda4af949de4ea2e40edee8f22bd59b9e1
2021-02-09 13:20:56 -08:00
fc314350ad Make RebatchingBuffer compatible with auto shape inference
Summary: No-op with respect to operator behavior; resolves https://fburl.com/wte0v7tf

Test Plan: buck test

Reviewed By: huangyi1979

Differential Revision: D26333212

fbshipit-source-id: d237e8caf5977bc19fcced6aeedc6464fc905457
2021-02-09 12:37:26 -08:00
1e171f024b Fix warnings in TensorShape (#51642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51642

Compiling currently gives:
```
Jan 13 16:46:39 In file included from ../aten/src/ATen/native/TensorShape.cpp:12:
Jan 13 16:46:39 ../aten/src/ATen/native/Resize.h:37:24: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     if (new_size_bytes > self->storage().nbytes()) {
Jan 13 16:46:39         ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:32:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
Jan 13 16:46:39   for (size_t i = 0; i < shape_tensor.numel(); ++i) {
Jan 13 16:46:39                      ~ ^ ~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:122:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < tensors.size(); i++) {
Jan 13 16:46:39                       ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:162:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int i = 0; i < tensors.size(); i++) {
Jan 13 16:46:39                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:300:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < s1.size(); ++i) {
Jan 13 16:46:39                       ~ ^ ~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:807:21: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     TORCH_CHECK(dim < self_sizes.size());
Jan 13 16:46:39                 ~~~ ^ ~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../c10/util/Exception.h:361:31: note: expanded from macro 'TORCH_CHECK'
Jan 13 16:46:39   if (C10_UNLIKELY_OR_CONST(!(cond))) {                                 \
Jan 13 16:46:39                               ^~~~
Jan 13 16:46:39 ../c10/util/Exception.h:244:47: note: expanded from macro 'C10_UNLIKELY_OR_CONST'
Jan 13 16:46:39 #define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
Jan 13 16:46:39                                               ^
Jan 13 16:46:39 ../c10/macros/Macros.h:173:65: note: expanded from macro 'C10_UNLIKELY'
Jan 13 16:46:39 #define C10_UNLIKELY(expr)  (__builtin_expect(static_cast<bool>(expr), 0))
Jan 13 16:46:39                                                                 ^~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:855:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const int64_t' (aka 'const long long') [-Wsign-compare]
Jan 13 16:46:39   for (size_t i = 0; i < num_blocks; ++i) {
Jan 13 16:46:39                      ~ ^ ~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:2055:23: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     for (int i = 0; i < vec.size(); i++) {
Jan 13 16:46:39                     ~ ^ ~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:2100:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < src.size(); ++i) {
```
This fixes issues with loop iteration variable types

Test Plan: Sandcastle tests

Reviewed By: ngimel

Differential Revision: D25935136

fbshipit-source-id: a5da4af16bb8045cc16ab1c78b8e0f2bb3ae64bd
2021-02-09 11:58:45 -08:00
141f615161 Support torch.type (#51904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51904

Fixes issue: #25433

=========
Makes tensor.type(dtype) scriptable

Test:
======
python test/test_jit.py -v TestJit.test_script_tensor_type

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26331503

Pulled By: nikithamalgifb

fbshipit-source-id: d9188999fee601a8402fdc4d9052dee4e0d529d5
2021-02-09 11:39:57 -08:00
b3fda95fe7 Add LazyBatchNormXd (#51862)
Summary:
Same diff with https://github.com/pytorch/pytorch/issues/51548 (cc. albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51862

Reviewed By: izdeby

Differential Revision: D26312289

Pulled By: albanD

fbshipit-source-id: 9cdec0e0c9021c33d10d85010978c7fa5cb4dc60
2021-02-09 10:29:03 -08:00
5dd1568aa3 [ROCm] skip more magma tests (#51915)
Summary:
Additional magma tests have been identified as failing after integrating hipMAGMA into the ROCm builds.  Skipping is necessary until they can be fixed properly.  This is blocking migration of ROCm CI to 4.0.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51915

Reviewed By: izdeby

Differential Revision: D26326404

Pulled By: malfet

fbshipit-source-id: 558cce66f216f404c0316ab036e2e5637fc99798
2021-02-09 09:14:42 -08:00
8c09cc6475 Remove android toolchain in Windows CircleCI image (#51405)
Summary:
It can spare nearly 10 GB of disk space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51405

Reviewed By: izdeby

Differential Revision: D26325768

Pulled By: janeyx99

fbshipit-source-id: d9208c59dfd17d7bb529291821c5f1779666ac6f
2021-02-09 08:46:23 -08:00
20fe2e12d6 typo (#48887)
Summary:
a small grammar fix

jspisak - thank you!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48887

Reviewed By: malfet, zhangguanheng66

Differential Revision: D25358638

Pulled By: brianjo

fbshipit-source-id: 3b805b54df3410f8770e1c6ddc569b26661cece4
2021-02-09 07:50:29 -08:00
c357f8b826 [package] make torch.package produce unified format (#51826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51826

Looks like this:
```
resnet.pt
├── .data  # Data folder named so it can't clash with torch.package code modules.
│   │      # Names/extensions automatically added to avoid naming conflicts.
│   ├── 94286146172688.storage   # tensor data
│   ├── 94286146172784.storage
│   ├── extern_modules           # torch.package metadata
│   ├── version                  # version metadata
│   └── ...
├── model  # package pickled model created w/
│   │      # exporter.save_pickle('model', 'model.pkl', resnet_model)
│   └── model.pkl
└── torchvision  # all code dependencies for the packaged pickle
    └── models   # models are captured as source files
        ├── resnet.py
        └── utils.py
```

Since `version` is hardcoded in our zip reader/writer implementation,
add it as an option that defaults to "version" but accepts other
locations for putting the version metadata.

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26295649

Pulled By: suo

fbshipit-source-id: 2d75feeb7de0f78196b4d0b6e2b814a7d58bd1dd
2021-02-09 07:45:59 -08:00
85b25257ff [package] Use custom persistent_load in PackageImporter (#51595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51595

Right now `PackageExporter` defines its own `persistent_id` but
`PackageImporter` uses the one defined in `torch.serialization`. I have
some downstream plans to customize this so this PR just splits it out.

Not to fear! I know this introduces some duplication and potential for
different behavior between `torch.save` and `torch.package`, but I have
plans to re-unify them soon.
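For readers unfamiliar with the hooks involved: `persistent_id`/`persistent_load` are the standard `pickle` extension points for routing objects (such as storages) out-of-band. A generic sketch, not the actual PackageImporter internals:

```
import io
import pickle

storages = []  # out-of-band store, e.g. files inside the package zip

class Writer(pickle.Pickler):
    def persistent_id(self, obj):
        if isinstance(obj, bytes):           # pretend bytes are "storages"
            storages.append(obj)
            return ("storage", len(storages) - 1)
        return None                          # everything else pickles normally

class Reader(pickle.Unpickler):
    def persistent_load(self, pid):
        kind, idx = pid
        assert kind == "storage"
        return storages[idx]

buf = io.BytesIO()
Writer(buf).dump({"weights": b"\x00\x01", "n": 3})
buf.seek(0)
print(Reader(buf).load())
```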

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26211578

Pulled By: suo

fbshipit-source-id: 48a2ccaefb2525e1498ad68b75c46d9de3d479b7
2021-02-09 07:45:55 -08:00
285e69a9cd [package] more reliable method for determining standard library-ness (#51694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51694

We implicitly extern standard library modules. Our method of determining
whether a module is in the standard library is a little unreliable. In
particular, I'm seeing lots of flaky errors on windows/mac CI when I
start doing more complicated packaging tests.

I looked into the best ways to do this, turns out there's no reliable
way, so tools that need to do this generally just parse the Python docs
for a listing and save it. I took `isort`'s lists and called it a day.
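Roughly the shape of the list-based check (the module set below is abbreviated and illustrative; Python 3.10 later added `sys.stdlib_module_names` for exactly this purpose):

```
STDLIB_MODULES = {"abc", "collections", "io", "json", "os", "sys", "types"}

def is_stdlib_module(name: str) -> bool:
    # Only the top-level package matters: "os.path" -> "os".
    return name.partition(".")[0] in STDLIB_MODULES

print(is_stdlib_module("os.path"))  # True
print(is_stdlib_module("torch"))    # False
```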

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D26243751

Pulled By: suo

fbshipit-source-id: 48c685cd45ae847fe986bcb9f39106e0c3361cdc
2021-02-09 07:42:41 -08:00
42635c3e59 Fix regex in collect_env.py for CUDA 11 (#51852)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51840

Manually tested with both CUDA 10.2.89 & 11.2.67.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51852

Reviewed By: izdeby

Differential Revision: D26326105

Pulled By: mrshenli

fbshipit-source-id: 46fbe5f20c02bca982ce2ec6e62f7cc3a14fcc97
2021-02-09 07:31:08 -08:00
35b3e16091 [pytorch] Fix torch.nn.functional.normalize to be properly scriptable (#51909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51909

Several scenarios don't work when trying to script `F.normalize`, notably when you symbolically trace through it using the default arguments:

```
import torch.nn.functional as F
import torch
from torch.fx import symbolic_trace

def f(x):
    return F.normalize(x)

gm = symbolic_trace(f)
torch.jit.script(gm)
```
which leads to the error
```
RuntimeError:

normalize(Tensor input, float p=2., int dim=1, float eps=9.9999999999999998e-13, Tensor? out=None) -> (Tensor):
Expected a value of type 'float' for argument 'p' but instead found type 'int'.
:
def forward(self, x):
    normalize_1 = torch.nn.functional.normalize(x, p = 2, dim = 1, eps = 1e-12, out = None);  x = None
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return normalize_1
```

Reviewed By: jamesr66a

Differential Revision: D26324308

fbshipit-source-id: 30dd944a6011795d17164f2c746068daac570cea
2021-02-09 07:26:57 -08:00
d61d8d886b correct value argument name for Tensor.index_fill_ docs (#51763)
Summary:
The name of "val" is inconsistent with the rest of the API and also
inconsistent with the underlying C++ implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51763

Test Plan:
Used the following command to demonstrate incorrect docs before and
correct docs after:
  python -c 'import torch; print(torch.Tensor.index_fill_.__doc__)'

Fixes https://github.com/pytorch/pytorch/issues/51250

Reviewed By: zhangguanheng66

Differential Revision: D26271273

Pulled By: dagitses

fbshipit-source-id: 4897da80b639c54ca652d2111e13f26efe2646a0
2021-02-09 07:15:52 -08:00
d5a2429c24 Fix flake8 failures (#51963)
Summary:
Fixes flake8 failures in test_autograd.py by using `gradcheck` from `torch.testing._internal.common_utils` rather than directly from `torch.autograd.gradcheck`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51963

Reviewed By: albanD

Differential Revision: D26339107

Pulled By: malfet

fbshipit-source-id: 63e0f12df16b70e394097ad88852984c1848a9e6
2021-02-09 07:02:01 -08:00
a1bfa5eed7 Do not print warning if CUDA driver not found (#51806)
Summary:
This frequently happens when PyTorch compiled with CUDA support is installed on a machine that has no NVIDIA GPUs.

Fixes https://github.com/pytorch/pytorch/issues/47038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51806

Reviewed By: ezyang

Differential Revision: D26285827

Pulled By: malfet

fbshipit-source-id: 9fd5e690d0135a2b219c1afa803fb69de9729f5e
2021-02-09 06:45:35 -08:00
56034636b9 Workaround arm64 gcc error in std::copysign (#51900)
Summary:
Move the definition of the copysign template and its specializations for
bfloat16/half types before the first use of copysign in that file.

Add a comment explaining why this is necessary.

Fixes https://github.com/pytorch/pytorch/issues/51889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51900

Reviewed By: walterddr

Differential Revision: D26321741

Pulled By: malfet

fbshipit-source-id: 888858b11d9708fa140fe9c0570cc5a24599205b
2021-02-09 04:54:29 -08:00
015cabf82a move GroupByFilename Dataset to DataPipe (#51709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51709

Move GroupByFilename Dataset to DataPipe

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26263585

Pulled By: glaringlee

fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
2021-02-09 03:34:56 -08:00
482b94ae51 move RoutedDecoder Dataset to DataPipe (#51704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51704

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26245910

Pulled By: glaringlee

fbshipit-source-id: 91e3c9f8a6c1209c1a1a752ba29a80dbd9bf4119
2021-02-09 03:31:07 -08:00
8ab22a080b Build pytorch_android using Gradle wrapper. (#51067)
Summary:
[Here](https://docs.gradle.org/current/userguide/gradle_wrapper.html), the Gradle documentation states:
`The recommended way to execute any Gradle build is with the help of the Gradle Wrapper`

I took a little time to set up the Gradle wrapper (version etc.) for the `pytorch_android` build.

I think using the Gradle wrapper will make the `pytorch_android` build more seamless.

Gradle wrapper version: 4.10.3

250c71121b/.circleci/scripts/build_android_gradle.sh (L13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51067

Reviewed By: izdeby

Differential Revision: D26315718

Pulled By: IvanKobzarev

fbshipit-source-id: f8077d7b28dc0b03ee48bcdac2f5e47d9c1f04d9
2021-02-09 03:09:08 -08:00
034a007ad8 Remind about AutoNonVariableTypeMode in error message. (#51655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51655

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26228508

Pulled By: ailzhang

fbshipit-source-id: f5f48fde3611c84cc6473b77824ebf9dffbb4453
2021-02-08 19:22:38 -08:00
2303c244fc skip a second call to shouldUseRecordFunction for BackendSelect ops (#50891)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50891

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25999514

Pulled By: bdhirsh

fbshipit-source-id: 8a6c17ab502fe463cf3fb38a1e555c64bc5556f0
2021-02-08 18:32:40 -08:00
7b9ca54ecf Reset checkpoint_valid flag when error happens during function execution (#51746)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37874, https://github.com/pytorch/pytorch/issues/51743

Uses RAII to manage the flag so that it gets reset properly on exception
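For intuition, the Python analogue of the RAII guard (the actual fix is a C++ guard object; names here are illustrative):

```
import contextlib

checkpoint_valid = True  # module-level flag, as in the autograd internals

@contextlib.contextmanager
def inside_checkpoint():
    global checkpoint_valid
    checkpoint_valid = False
    try:
        yield
    finally:
        checkpoint_valid = True  # restored even when the body raises

with inside_checkpoint():
    pass  # run the checkpointed function here
```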

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51746

Reviewed By: izdeby

Differential Revision: D26319619

Pulled By: soulitzer

fbshipit-source-id: ea1235438ba516f99195c83fa23d5880f9977c93
2021-02-08 17:48:25 -08:00
dac730af11 Warn if mypy version doesn't match CI (#51799)
Summary:
This PR adds a local [`mypy` plugin](https://mypy.readthedocs.io/en/stable/extending_mypy.html#extending-mypy-using-plugins) that warns if you accidentally run `mypy` using a version that doesn't match [the version we install for CI](6045663f39/.circleci/docker/common/install_conda.sh (L117)), since this trips people up sometimes when `mypy` gives errors in some versions (see https://github.com/pytorch/pytorch/issues/51513) but not others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51799

Test Plan:
To check that this doesn't break our `mypy` test(s) when you have the correct version installed:
```
python test/test_type_hints.py
```
To check that this does indeed warn when you have an incorrect `mypy` version installed, switch to a different version (e.g. 0.782), and run the above command or either of these:
```
mypy
mypy --config-file=mypy-strict.ini
```
You should get the following message on stderr:
```
You are using mypy version 0.782, which is not supported
in the PyTorch repo. Please switch to mypy version 0.770.

For example, if you installed mypy via pip, run this:

    pip install mypy==0.770

Or if you installed mypy via conda, run this:

    conda install -c conda-forge mypy=0.770
```

Reviewed By: janeyx99

Differential Revision: D26282010

Pulled By: samestep

fbshipit-source-id: 7b423020d0529700dea8972b27afa2d7068e1b12
2021-02-08 15:43:18 -08:00
21ef248fb8 [reland] Report test time regressions (#50171)
Summary:
This is a followup to https://github.com/pytorch/pytorch/issues/49190. Vaguely speaking, the goals are to make it easy to identify test time regressions introduced by PRs. Eventually the hope is to use this information to edit Dr CI comments, but this particular PR just does the analysis and prints it to stdout, so a followup PR would be needed to edit the actual comments on GitHub.

**Important:** for uninteresting reasons, this PR moves the `print_test_stats.py` file.

- *Before:* `test/print_test_stats.py`
- *After:* `torch/testing/_internal/print_test_stats.py`

Notes on the approach:

- Just getting the mean and stdev for the total job time of the last _N_ commits isn't sufficient, because e.g. if `master` was broken 5 commits ago, then a lot of those job times will be much shorter, breaking the statistics.
- We use the commit history to make better estimates for the mean and stdev of individual test (and suite) times, but only when the test in that historical commit is present and its status matches that of the base commit.
- We list all the tests that were removed or added, or whose status changed (e.g. skipped to not skipped, or vice versa), along with time (estimate) info for that test case and its containing suite.
- We don't list tests whose time changed a lot if their status didn't change, because there's a lot of noise and it's unclear how to do that well without too many false positives.
- We show a human-readable commit graph that indicates exactly how many commits are in the pool of commits that could be causing regressions (e.g. if a PR has multiple commits in it, or if the base commit on `master` doesn't have a report in S3).
- We don't show an overall estimate of whether the PR increased or decreased the total test job time, because it's noisy and it's a bit tricky to aggregate stdevs up from individual tests to the whole job level. This might change in a followup PR.
- Instead, we simply show a summary at the bottom which says how many tests were removed/added/modified (where "modified" means that the status changed), and our best estimates of the mean times (and stdevs) of those changes.
- Importantly, the summary at the bottom is only for the test cases that were already shown in the more verbose diff report, and does not include any information about tests whose status didn't change but whose running time got much longer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50171

Test Plan:
To run the unit tests:
```
$ python test/test_testing.py
$ python test/print_test_stats.py
```

To verify that this works, check the [CircleCI logs](https://app.circleci.com/pipelines/github/pytorch/pytorch/258628/workflows/9cfadc34-e042-485e-b3b3-dc251f160307) for a test job run on this PR; for example:
- pytorch_linux_bionic_py3_6_clang9_test

To test locally, use the following steps.

First run an arbitrary test suite (you need to have some XML reports so that `test/print_test_stats.py` runs, but we'll be ignoring them here via the `--use-json` CLI option):
```
$ DATA_DIR=/tmp
$ ARBITRARY_TEST=testing
$ python test/test_$ARBITRARY_TEST.py --save-xml=$DATA_DIR/test/test_$ARBITRARY_TEST
```
Now choose a commit and a test job (it has to be on `master` since we're going to grab the test time data from S3, and [we only upload test times to S3 on the `master`, `nightly`, and `release` branches](https://github.com/pytorch/pytorch/pull/49645)):
```
$ export CIRCLE_SHA1=c39fb9771d89632c5c3a163d3c00af3bef1bd489
$ export CIRCLE_JOB=pytorch_linux_bionic_py3_6_clang9_test
```
Download the `*.json.bz2` file(s) for that commit/job pair:
```
$ aws s3 cp s3://ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/ $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB --recursive
```
And feed everything into `test/print_test_stats.py`:
```
$ bzip2 -kdc $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/*Z.json.bz2 | torch/testing/_internal/print_test_stats.py --compare-with-s3 --use-json=/dev/stdin $DATA_DIR/test/test_$ARBITRARY_TEST
```
The first part of the output should be the same as before this PR; here is the new part, at the end of the output:

- https://pastebin.com/Jj1svhAn

Reviewed By: malfet, izdeby

Differential Revision: D26317769

Pulled By: samestep

fbshipit-source-id: 1ba06cec0fafac77f9e7341d57079543052d73db
2021-02-08 15:35:21 -08:00
9e4f3b89c4 [Gradient Compression] Add register_comm_hook API to DDP communication hooks documentation page (#51846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51846

`register_comm_hook` method is defined in DistributedDataParallel module, but it is not covered in `distributed.rst`. Since it's closely related to DDP communication hook, add the docstrings to `ddp_comm_hooks.rst` instead of a reference.

Screenshot:

{F370425625}
ghstack-source-id: 121278173

Test Plan:
view locally

python_doc_test:
https://app.circleci.com/pipelines/github/pytorch/pytorch/271234/workflows/dc0b443d-8a62-4334-9b42-800c33a68553/jobs/10770636

Reviewed By: rohan-varma

Differential Revision: D26298191

fbshipit-source-id: 32e0685fd3c935cf9a2d129e6c520a94aa3e3817
2021-02-08 15:12:39 -08:00
1e70b4bb73 Add GH Actions CI to build nightly Docker and push to GitHub Container Registry (#51755)
Summary:
Currently PyTorch repository provides Dockerfile to build Docker with nightly builds, but it doesn't have CI to actually build those Dockers.
This PR adds a GitHub action workflow to create PyTorch nightly build Docker and publish them to GitHub Container Registry.
Also, add "--always" option to the `git describe --tags` command that generates the Docker image tag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51755

Test Plan: Manually trigger the workflow build in the GitHub Actions web UI.

Reviewed By: seemethere

Differential Revision: D26320180

Pulled By: xuzhao9

fbshipit-source-id: e00b472df14f5913cab9b06a41e837014e87f1c7
2021-02-08 14:59:30 -08:00
58eb23378f Clean up usage of torch._six partially (#49785)
Summary:
See https://github.com/pytorch/pytorch/issues/42919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785

Reviewed By: mruberry

Differential Revision: D25963833

Pulled By: bugra

fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2
2021-02-08 13:58:34 -08:00
97e35858ec [Resubmit] Add compare_set operation and test to TCPStore (#51815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51815

This is resubmission of #51593, already approved.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D26316875

Pulled By: H-Huang

fbshipit-source-id: d81cb131ef6b9e2ebaee32bb505dfc11235bc29d
2021-02-08 13:44:31 -08:00
7363da7c57 onnx export of per channel fake quantize functions (#42835)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39502

This PR adds support for exporting **fake_quantize_per_channel_affine** to a pair of QuantizeLinear and DequantizeLinear. Per tensor support was added by PR https://github.com/pytorch/pytorch/pull/39738.

The `axis` attribute of QuantizeLinear and DequantizeLinear, which is required for per-channel support, was added in opset 13 by https://github.com/onnx/onnx/pull/2772.

[update 1/20/2021]: opset 13 is now supported on master, and the added function is properly tested. The code has also been rebased onto the new master.

The function is also tested offline with the following code
```python
import torch
from torch import quantization

from torchvision import models
qat_resnet18 = models.resnet18(pretrained=True).eval().cuda()

qat_resnet18.qconfig = quantization.QConfig(
    activation=quantization.default_fake_quant, weight=quantization.default_per_channel_weight_fake_quant)
quantization.prepare_qat(qat_resnet18, inplace=True)
qat_resnet18.apply(quantization.enable_observer)
qat_resnet18.apply(quantization.enable_fake_quant)

dummy_input = torch.randn(16, 3, 224, 224).cuda()
_ = qat_resnet18(dummy_input)
for module in qat_resnet18.modules():
    if isinstance(module, quantization.FakeQuantize):
        module.calculate_qparams()
qat_resnet18.apply(quantization.disable_observer)

qat_resnet18.cuda()

input_names = [ "actual_input_1" ]
output_names = [ "output1" ]

torch.onnx.export(qat_resnet18, dummy_input, "quant_model.onnx", verbose=True, opset_version=13)
```
It can generate the desired graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42835

Reviewed By: houseroad

Differential Revision: D26293823

Pulled By: SplitInfinity

fbshipit-source-id: 300498a2e24b7731b12fa2fbdea4e73dde80e7ea
2021-02-08 13:09:50 -08:00
159c48b19b Fix triplet margin loss and reciprocal docs (#51650)
Summary:
Reciprocal: the note should be placed after the formula

Triplet-margin-loss (before):
![image](https://user-images.githubusercontent.com/13428986/106784863-cb3eb780-661a-11eb-8372-07b51e4cb2d4.png)
After:
![image](https://user-images.githubusercontent.com/13428986/106784948-e5789580-661a-11eb-890c-6185aab96e54.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51650

Reviewed By: izdeby

Differential Revision: D26314151

Pulled By: soulitzer

fbshipit-source-id: d7574e64e96a41a515231ba7e1008de8b2f292aa
2021-02-08 12:15:11 -08:00
d90911adf9 fix AdaptiveAveragePooling crash problem for non support input (#51443)
Summary:
For unsupported input types, we should not do the check inside a parallel region; this PR first does the dtype check and then runs the parallel for.
Fixes https://github.com/pytorch/pytorch/issues/51352.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443

Reviewed By: izdeby

Differential Revision: D26305584

Pulled By: ngimel

fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
2021-02-08 11:43:25 -08:00
b9acfcddeb Support mypy ignore annotation with particular rule specified (#51675)
Summary:
Previously, TorchScript allowed an ignore-all type check suppression rule that looks like
```
code code code  # type: ignore
```

But a more common use case is
```
code code code  # type: ignore[specific-rule]
```
This PR allows the more common use case

Fixes https://github.com/pytorch/pytorch/issues/48643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51675

Reviewed By: ansley

Differential Revision: D26304870

Pulled By: gmagogsfm

fbshipit-source-id: 0ac9ee34f0219c86e428318a69484d5aa3ec433f
2021-02-08 11:21:47 -08:00
41bab9a4b6 Plumbing dispatch keys through the dispatcher (#49354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49354

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25614042

Pulled By: bdhirsh

fbshipit-source-id: 269a75e9a3ac518aa63bff2cafbd47ed2c4ff780
2021-02-08 11:09:51 -08:00
6fa5e96f2e remove unnecessary BoxedKernelWrapper specialization now that ops are all c10-full (#50963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50963

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26026665

Pulled By: bdhirsh

fbshipit-source-id: ef6e515f7dae5052538789e5b75dc551b4ce3b11
2021-02-08 11:06:51 -08:00
d9e6750759 fix multi_output_kernel (#51827)
Summary:
With zasdfgbnm's help and with his small TensorIterator kernel repro https://github.com/zasdfgbnm/tensoriterator we've found a workaround for what looks like a compiler bug in multi_output_kernel that manifests itself with cuda 10.2 and cuda 11 when there is a non-trivial OffsetCalculator.
It looks like those nvcc versions cannot handle inheritance in device structs, so instead of inheriting `multi_outputs_unroll` from `unroll` we make it independent.
cc vkuzo, haichuan-fb I verified that reverting https://github.com/pytorch/pytorch/issues/49315 to bring back multi_output_kernel and running `test_learnable_backward_per_channel_cuda` test passes, but I didn't do it in this PR - can you take it up as a follow-up?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51827

Reviewed By: izdeby

Differential Revision: D26305559

Pulled By: ngimel

fbshipit-source-id: 1168e7c894d237a954abfd1998eaad54f0ce40a7
2021-02-08 10:42:50 -08:00
21dccbca62 Revert D26232345: [pytorch][PR] Report test time regressions
Test Plan: revert-hammer

Differential Revision:
D26232345 (7467f90b13)

Original commit changeset: b687b1737519

fbshipit-source-id: 10a031c5500b083f7c82f2ae2743b671c5a07bff
2021-02-08 10:15:07 -08:00
1aaddd83a5 don't set the same C++ and C standards twice (#51832)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51832

Reviewed By: izdeby

Differential Revision: D26312660

Pulled By: ezyang

fbshipit-source-id: 7d646cd106397e70bca0050d0aa30eb62b085cee
2021-02-08 08:53:26 -08:00
649e683255 Fix torch.nonzero type annotation (#51635)
Summary:
The overloads are a little tricky here. It's important that the overloads are such that it's unambiguous what
`torch.nonzero(x)` will resolve to - so just specify defaults for one of the overloads. Also, `out` is left out of the second overload
because a non-None value for `out` is not valid in combination with `as_tuple=True`.
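Roughly the shape of the fixed stubs (a sketch, not the verbatim .pyi content):

```
from typing import Optional, Tuple, overload

from torch import Tensor

@overload
def nonzero(input: Tensor, *, out: Optional[Tensor] = None) -> Tensor: ...
@overload
def nonzero(input: Tensor, *, as_tuple: bool) -> Tuple[Tensor, ...]: ...
```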

Closes gh-51434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51635

Reviewed By: zhangguanheng66

Differential Revision: D26279203

Pulled By: walterddr

fbshipit-source-id: 8459c04fc9fbf7fc5f31b3f631aaac2f98b17ea6
2021-02-08 08:45:44 -08:00
0dd1d60d54 [JIT] Remove Dropout during Frozon Optimization (#51589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51589

Dropout operators are only needed in training. Remove them for frozen models.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D26214259

fbshipit-source-id: 3ab05869e1e1f6c57498ba62bf40944f7c2189aa
2021-02-08 08:38:08 -08:00
9cbefad83f concantenate LICENSE files when building a wheel (#51634)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50695

I checked locally that the concatenated license file appears at `torch-<version>.dist-info/LICENSE` in the wheel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51634

Reviewed By: zhangguanheng66

Differential Revision: D26225550

Pulled By: walterddr

fbshipit-source-id: 830c59fb7aea0eb50b99e295edddad9edab6ba3a
2021-02-08 08:28:46 -08:00
b97a040f71 ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560)
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624

~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~

Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560

Reviewed By: ngimel

Differential Revision: D26171175

Pulled By: mruberry

fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
2021-02-08 08:21:19 -08:00
d454a84bab derivatives.yaml cleanup + restore codegen code forgotten in refactor (#51721)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51721

Reviewed By: zhangguanheng66

Differential Revision: D26285908

Pulled By: albanD

fbshipit-source-id: 3130736be9146eaee3a8e80be59a66eb2180d536
2021-02-08 08:03:40 -08:00
7467f90b13 Report test time regressions (#50171)
Summary:
This is a followup to https://github.com/pytorch/pytorch/issues/49190. Vaguely speaking, the goals are to make it easy to identify test time regressions introduced by PRs. Eventually the hope is to use this information to edit Dr CI comments, but this particular PR just does the analysis and prints it to stdout, so a followup PR would be needed to edit the actual comments on GitHub.

**Important:** for uninteresting reasons, this PR moves the `print_test_stats.py` file.

- *Before:* `test/print_test_stats.py`
- *After:* `torch/testing/_internal/print_test_stats.py`

Notes on the approach:

- Just getting the mean and stdev for the total job time of the last _N_ commits isn't sufficient, because e.g. if `master` was broken 5 commits ago, then a lot of those job times will be much shorter, breaking the statistics.
- We use the commit history to make better estimates for the mean and stdev of individual test (and suite) times, but only when the test in that historical commit is present and its status matches that of the base commit (see the toy sketch after this list).
- We list all the tests that were removed or added, or whose status changed (e.g. skipped to not skipped, or vice versa), along with time (estimate) info for that test case and its containing suite.
- We don't list tests whose time changed a lot if their status didn't change, because there's a lot of noise and it's unclear how to do that well without too many false positives.
- We show a human-readable commit graph that indicates exactly how many commits are in the pool of commits that could be causing regressions (e.g. if a PR has multiple commits in it, or if the base commit on `master` doesn't have a report in S3).
- We don't show an overall estimate of whether the PR increased or decreased the total test job time, because it's noisy and it's a bit tricky to aggregate stdevs up from individual tests to the whole job level. This might change in a followup PR.
- Instead, we simply show a summary at the bottom which says how many tests were removed/added/modified (where "modified" means that the status changed), and our best estimates of the mean times (and stdevs) of those changes.
- Importantly, the summary at the bottom is only for the test cases that were already shown in the more verbose diff report, and does not include any information about tests whose status didn't change but whose running time got much longer.
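
A toy sketch of that estimation rule (a hypothetical helper, not the actual `print_test_stats.py` code): pool times only from historical commits where the test exists with the same status.

```
from statistics import mean, stdev

def estimate_case_time(name, base_status, history):
    """history: list of per-commit dicts mapping test name -> (status, seconds)."""
    times = [
        commit[name][1]
        for commit in history
        if name in commit and commit[name][0] == base_status
    ]
    if len(times) < 2:
        return None  # not enough matching history for a stable estimate
    return mean(times), stdev(times)
```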

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50171

Test Plan:
To run the unit tests:
```
$ python test/test_testing.py
$ python test/print_test_stats.py
```

To verify that this works, check the [CircleCI logs](https://app.circleci.com/pipelines/github/pytorch/pytorch/258628/workflows/9cfadc34-e042-485e-b3b3-dc251f160307) for a test job run on this PR; for example:
- pytorch_linux_bionic_py3_6_clang9_test

To test locally, use the following steps.

First run an arbitrary test suite (you need to have some XML reports so that `test/print_test_stats.py` runs, but we'll be ignoring them here via the `--use-json` CLI option):
```
$ DATA_DIR=/tmp
$ ARBITRARY_TEST=testing
$ python test/test_$ARBITRARY_TEST.py --save-xml=$DATA_DIR/test/test_$ARBITRARY_TEST
```
Now choose a commit and a test job (it has to be on `master` since we're going to grab the test time data from S3, and [we only upload test times to S3 on the `master`, `nightly`, and `release` branches](https://github.com/pytorch/pytorch/pull/49645)):
```
$ export CIRCLE_SHA1=c39fb9771d89632c5c3a163d3c00af3bef1bd489
$ export CIRCLE_JOB=pytorch_linux_bionic_py3_6_clang9_test
```
Download the `*.json.bz2` file(s) for that commit/job pair:
```
$ aws s3 cp s3://ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/ $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB --recursive
```
And feed everything into `test/print_test_stats.py`:
```
$ bzip2 -kdc $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/*Z.json.bz2 | torch/testing/_internal/print_test_stats.py --compare-with-s3 --use-json=/dev/stdin $DATA_DIR/test/test_$ARBITRARY_TEST
```
The first part of the output should be the same as before this PR; here is the new part, at the end of the output:

- https://pastebin.com/Jj1svhAn

Reviewed By: walterddr

Differential Revision: D26232345

Pulled By: samestep

fbshipit-source-id: b687b1737519d2eed68fbd591a667e4e029de509
2021-02-08 07:54:34 -08:00
c89f15ec6d Reland nightlies 11.2 (#51874)
Summary:
Cherry-picked commits from https://github.com/pytorch/pytorch/issues/51611.

Relanding after https://github.com/pytorch/pytorch/issues/51864 should fix failing CUDA tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51874

Reviewed By: malfet

Differential Revision: D26313173

Pulled By: janeyx99

fbshipit-source-id: 02250abb526cc7400bc2d9bbb146e8210ccd4b40
2021-02-08 07:41:45 -08:00
79832f3d77 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26309565

fbshipit-source-id: b20d37ea90304052cef9b4dc359a5bd726d7fda7
2021-02-08 04:17:41 -08:00
bce4c82f0d [C2] Add TypeAndShape Inference logic for ReduceMean (#51828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51828

As described.

Test Plan: Unit-tests.

Differential Revision: D26293844

fbshipit-source-id: 2eb2a694c439b794ad7c134409e2b8926aabc91f
2021-02-08 00:57:47 -08:00
fcf8b71234 Disable unaligned-access test from TestVectorizedMemoryAccess.CopyKernel (#51864)
Summary:
The test began to fail after the driver update

See https://github.com/pytorch/pytorch/issues/51863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51864

Reviewed By: bertmaher

Differential Revision: D26304018

Pulled By: malfet

fbshipit-source-id: bb7ade2f28d8cf8f847159d4ce92391f0794c258
2021-02-07 16:20:31 -08:00
0c313564af Backward through sparse_coo_tensor (#50361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49683

This PR fixes the backward-through-`sparse_coo_tensor` bug by implementing a `sparse_mask_helper` function for n-dimensional sparse tensors on CPU and CUDA, which is used to reimplement the `sparse_constructor_values_backward` function.

A `sparse_mask` function was implemented before for the backward of sparse-sparse matmul. However, the algorithm here is a little different because it must be applicable not only to matrices but to n-dimensional tensors. Thankfully it was not hard to extend, and now both share the same code base.

Note that no new tests are required because the backward for sparse-sparse matmul now uses the new `sparse_mask_helper`.
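
A minimal sketch of the now-working path (gradients flow back to the `values` used to construct a sparse COO tensor):

```
import torch

indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0], requires_grad=True)
sparse = torch.sparse_coo_tensor(indices, values, (2, 3))

torch.sparse.sum(sparse).backward()
print(values.grad)  # tensor([1., 1., 1.])
```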

ngimel, mruberry - kindly review this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50361

Reviewed By: zhangguanheng66

Differential Revision: D26270483

Pulled By: ngimel

fbshipit-source-id: ee4bda49ff86e769342674b64d3c4bc34eae38ef
2021-02-06 23:15:54 -08:00
4b3c99ce4a [Resubmission] Add a documentation page for DDP communication hooks (#51773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773

Resubmission of #51715.

Minor changes:
1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py.

2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst.

Also updated the doc based on comments from PowerSGD paper author Thijs Vogels.

It seems that `python_doc_test` was flaky. The previous error message was not informative:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch.

Rebasing to a new master branch seems to get this fixed:
https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps

Screenshot:

{F369899792}
ghstack-source-id: 121199613

Test Plan: View locally

Reviewed By: mingzhe09088

Differential Revision: D26272687

fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db
2021-02-06 21:22:04 -08:00
4968227058 add shape inference for Int8GenQuantParamsMinMax
Summary: As titled.

Test Plan: successful test flow with A* setup: f245569242

Reviewed By: anurag16

Differential Revision: D25966283

fbshipit-source-id: ef9945d5039933df44c2c3c26ca149f47538ff31
2021-02-06 17:50:43 -08:00
6c0bf28da6 [wip] doc_fix (#51825)
Summary:
tries to fix doc_test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51825

Reviewed By: bertmaher

Differential Revision: D26295583

Pulled By: ngimel

fbshipit-source-id: 13f6e7f1675d810adfd4abd2d579e2812fe54c80
2021-02-06 11:36:36 -08:00
6488b2bc3a Revert D26282829: [pytorch][PR] Adding support for CUDA 11.2 in our nightly build matrix
Test Plan: revert-hammer

Differential Revision:
D26282829 (fb07aca7b0)

Original commit changeset: b15380e5c44a

fbshipit-source-id: 18f86e766ed9ec58da32167584bb5e4e2c87a639
2021-02-06 11:22:23 -08:00
fa70168804 Add metacompile of Ternary if (#51789)
Summary:
Fixes issue: https://github.com/pytorch/pytorch/issues/49728
========
A ternary if operation fails in TorchScript when the condition variable is annotated as Final.
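
A small sketch of the pattern this enables (module and attribute names are illustrative):

```
import torch
from typing import Final

class M(torch.nn.Module):
    use_add: Final[bool]

    def __init__(self):
        super().__init__()
        self.use_add = True

    def forward(self, x):
        # The condition is Final, so the dead branch can be dropped
        # at compile time, just like a static `if` statement.
        return x + 1 if self.use_add else x - 1

m = torch.jit.script(M())  # failed to compile before this change
print(m(torch.zeros(1)))   # tensor([1.])
```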

Tests:
=======
pytest -k test_ternary_static_if test/test_jit.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51789

Reviewed By: gmagogsfm

Differential Revision: D26278969

Pulled By: nikithamalgifb

fbshipit-source-id: 27d1383290211503188428fb2e8b7749f59ba16e
2021-02-06 10:14:30 -08:00
8a9090219e [pyper] register aten::index_out (#51742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51742

Register aten::index_out with StaticRuntime

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_weights=/data/users/ansha/tmp/adfinder/models/c2_local_ro_weight_data.pb --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_local_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_local_ro_net.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models2/210494966_0.predictor.disagg.local_ro.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/local_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=true --compare_results=0 --iters=30000 --warmup_iters=10000 --num_threads=1 --do_profile=1 --benchmark_c2_predictor=0
```

Before total ms/iter: 0.699626
Before: 0.0277974 ms.    4.03198%. aten::index (5 nodes)

After total ms/iter: 0.696739
After: 0.0254255 ms.    3.67315%. aten::index (5 nodes)

Reviewed By: hlu1

Differential Revision: D26261215

fbshipit-source-id: b59ebd5ccd33478a9fbc4629a0075fec597a05cb
2021-02-06 04:26:15 -08:00
9a964ce89b Enables backend preprocessing to take place outside of the backend interface (#51757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51757

Enables backend preprocessing to take place outside of the backend interface.

What's new:
* A new definition for backend preprocessing (i.e. BackendPreprocessFunction).
* Registration of the backend's PyTorchBackendInterface interface implementation is augmented to take the BackendPreprocessFunction.
* A new registry is created to handle the BackendPreprocessFunction functions, using the backend's name as key.
* When a BackendPreprocessFunction is used, the PyTorchBackendInterface's "preprocess" method is not added to the LoweredModule. Instead, the BackendPreprocessFunction is called and its output used to set the LoweredModule's __processed_module.

Why?:
These changes are needed to avoid forcing backend preprocessing to be part of the LoweredModule, and in the future be able to eliminate "preprocess" from the PyTorchBackendInterface.
This is important for Mobile use cases where "preprocess" can take the bulk of the compilation process, and thus contain code dependencies that we do not want to bring (or cannot bring) to the Mobile binary.

What didn't change:
* Everything is backwards compatible:
** The existing "preprocess" method in PyTorchBackendInterface is still there.
** When backend registration is done without the BackendPreprocessFunction, as before, things work the same way: "preprocess" is added to LoweredModule, and invoked through the module's instance of the backend interface.

Longer term, the plan is to refactor existing users to move to the new backend registration.
ghstack-source-id: 121190883

Test Plan:
Updated existing tests (test_backend.py) to use the new registration mechanism.
Verified test ran and passed (in my OSS build).

Reviewed By: iseeyuan

Differential Revision: D26261042

fbshipit-source-id: 0dc378acd5f2ab60fcdc01f7373616d1db961e61
2021-02-06 01:07:17 -08:00
215d9daceb Refactor internal methods into debugging utilities (#51737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51737

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26288613

Pulled By: ansley

fbshipit-source-id: 4504b1af5be7a200c1a6a376d432d7224eb8a796
2021-02-05 21:42:18 -08:00
19753af6ea [QNNPACK Sparsity] Add aarch64 kernel of 8x1 sparsity (#51120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51120

Adds asm kernels for aarch64 using 8x1 sparsity. Also removes the aarch32 8x4 prepacked kernels and the 8x4 inline-packed SSE2 kernels.

Test Plan:
q8gemm-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26077766

fbshipit-source-id: 29793d30a47b8f4084daf8950d925dc804d3dc59
2021-02-05 18:47:38 -08:00
6b2811f288 [QNNPACK, Sparsity] Add 8x1 block sparse kernels for aarch32. (#51119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51119

Adds an asm kernel for 8x1 block sparsity. Since the ukernel still
produces 4x8 output blocks, as with the 1x4 sparsity pattern, we can use
the same prepacking kernel for the activation. It gets a tiny bit hacky
but allows us to reuse the kernel.

Test Plan:
q8gemm-sparse-test
fully-connected-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26077765

fbshipit-source-id: cc087b0ff717a613906d442ea73680e785e0ecc2
2021-02-05 18:47:33 -08:00
c034e0750c [QNNPACK, Sparsity] Code refactoring to allow for more generic block (#51118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51118

sparsity

Modify BCSR to pack a generic block sparsity pattern.
Modify the rest of the code to accommodate the change.
This is in preparation for supporting 8x1 sparsity (see the illustrative sketch below).
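
An illustrative Python sketch of the generic block-packing idea (QNNPACK's real packing is C and also handles quantized values and row padding; the 8x1 block shape mirrors this stack):

```
import numpy as np

def pack_bcsr(w, block_h, block_w):
    """Pack `w` into BCSR (row_ptr, col_idx, values), keeping nonzero blocks."""
    rows, cols = w.shape
    row_ptr, col_idx, values = [0], [], []
    for br in range(0, rows, block_h):
        for bc in range(0, cols, block_w):
            block = w[br:br + block_h, bc:bc + block_w]
            if np.any(block):
                col_idx.append(bc // block_w)
                values.append(block.copy())
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values

w = np.zeros((16, 4), dtype=np.float32)
w[0:8, 1] = 1.0  # a single dense 8x1 block
row_ptr, col_idx, _ = pack_bcsr(w, 8, 1)
print(row_ptr, col_idx)  # [0, 1, 1] [1]
```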

Test Plan:
q8gemm-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26077767

fbshipit-source-id: 7179975b07a1cb76ef26896701d782fb04638743
2021-02-05 18:44:25 -08:00
bc1b1e8253 fixing mkldnn_linear & backward with silent error (#51713)
Summary:
mkldnn_linear & mkldnn_linear_backward_input give wrong results when the weight is non-contiguous.

Issue exposed in PR https://github.com/pytorch/pytorch/issues/51613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51713

Reviewed By: zhangguanheng66

Differential Revision: D26282319

Pulled By: ngimel

fbshipit-source-id: 96516e10c9dc72c30dac278fce09b746aa5f51b2
2021-02-05 18:36:30 -08:00
4395 changed files with 279322 additions and 105893 deletions


@@ -0,0 +1,63 @@
# PyTorch CI Builds Pipeline on Azure DevOps
#
# This pipeline:
# 1) builds PyTorch on select configurations
# 2) runs only TestTorch unit tests.
stages:
- stage: 'Build'
displayName: 'Build PyTorch'
jobs:
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: 'PyTorch-Linux-CPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_ci_build: True
os: ubuntu
cuda: cpu
customMatrixes:
Py_38:
configuration: ubuntu_1804_py_38_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cpu_dev
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: 'PyTorch-Linux-GPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_ci_build: True
os: ubuntu
cuda: gpu
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: ubuntu_1804_py_39_cuda_112_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_39_cuda_112_cudnn_8_dev
CUDA_VERSION: 112
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_CPU
pool: 'PyTorch-Win-CPU'
build_stage: True
is_ci_build: True
os: windows
cuda: cpu
customMatrixes:
Py_37:
configuration: windows_2019_py_37_cpu
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_GPU
pool: 'PyTorch-Win-GPU'
build_stage: True
is_ci_build: True
os: windows
cuda: gpu
customMatrixes:
Py_38_CUDA_102_cuDNN_765:
configuration: windows_2019_py_38_cuda_102_cudnn_765
CUDA_VERSION: 102


@@ -0,0 +1,82 @@
# PyTorch Daily Builds Pipeline on Azure DevOps
#
# This pipeline:
# 1) builds PyTorch on all available configurations
# 2) runs all PyTorch unit tests
stages:
- stage: 'BuildTest'
displayName: 'Build and Test PyTorch'
jobs:
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: 'PyTorch-Linux-CPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_daily_build: True
os: ubuntu
cuda: cpu
customMatrixes:
Py_38:
configuration: ubuntu_1804_py_38_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cpu_dev
Py_37:
configuration: ubuntu_1804_py_37_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cpu_dev
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: 'PyTorch-Linux-GPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_daily_build: True
os: ubuntu
cuda: gpu
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: ubuntu_1804_py_39_cuda_112_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_39_cuda_112_cudnn_8_dev
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_810:
configuration: ubuntu_1804_py_38_cuda_102_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cuda_102_cudnn_8_dev
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_765:
configuration: ubuntu_1804_py_37_cuda_101_cudnn_765
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cuda_101_cudnn_7_dev
CUDA_VERSION: 101
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_CPU
pool: 'PyTorch-Win-CPU'
build_stage: True
is_daily_build: True
os: windows
cuda: cpu
customMatrixes:
Py_38:
configuration: windows_2019_py_38_cpu
Py_37:
configuration: windows_2019_py_37_cpu
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_GPU
pool: 'PyTorch-Win-GPU'
build_stage: True
is_daily_build: True
os: windows
cuda: gpu
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: windows_2019_py_39_cuda_112_cudnn_810
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_765:
configuration: windows_2019_py_38_cuda_102_cudnn_765
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_764:
configuration: windows_2019_py_37_cuda_101_cudnn_764
CUDA_VERSION: 101


@@ -0,0 +1,134 @@
# PyTorch build steps template with Unix images Azure DevOps Instances
#
# This build depends on 3 parameters set as environment variables in the pipeline:
# - AZURE_DEVOPS_CLI_PAT: Secret var for authenticating to Azure DevOps
# - AZURE_DEVOPS_ARTIFACTS_ORGANIZATION: Azure Artifacts Organization name to publish artifacts
# - AZURE_DEVOPS_ARTIFACTS_PROJECT: Azure Artifacts Project name to publish artifacts
parameters:
name: ''
pool: ''
container_endpoint: ''
os: ''
cuda: ''
is_ci_build: False
is_official_build: False
is_daily_build: False
build_stage: False
verify_stage: False
publish_stage: False
customMatrixes: ''
jobs:
- job: ${{parameters.name}}
timeoutInMinutes: 300
strategy:
matrix:
${{ insert }}: ${{parameters.customMatrixes}}
pool:
name: ${{ parameters.pool}}
variables:
DECODE_PERCENTS: false
container:
image: $[variables['container_image']]
endpoint: ${{parameters.container_endpoint}}
steps:
# Build stage
- ${{ if eq(parameters.build_stage, 'True') }}:
# Set up environment variables for specific pipeline build
- template: set-environment-variables.yml
parameters:
os: ${{ parameters.os}}
cuda: ${{ parameters.cuda}}
is_official_build: ${{ parameters.is_official_build}}
# Sync and update PyTorch submodules
- bash: git submodule update --init --recursive
displayName: Update PyTorch submodules
# Build PyTorch and run unit tests - no packaging
- ${{ if or(eq(parameters.is_ci_build, 'True'), eq(parameters.is_daily_build, 'True')) }}:
# Build PyTorch from source in develop mode
- bash: python setup.py develop
displayName: Build PyTorch from source
- ${{ if eq(parameters.is_ci_build, 'True') }}:
# Run TestTorch unit tests to demonstrate successful PyTorch build
- bash: python test/test_torch.py TestTorch
displayName: Run TestTorch unit tests
- ${{ if eq(parameters.is_daily_build, 'True') }}:
# Run all unit tests to demonstrate successful PyTorch build
- bash: python test/run_test.py --continue-through-error --exclude-jit-executor --verbose
displayName: Run all unit tests
# Run ComponentGovernance
- task: ComponentGovernanceComponentDetection@0
inputs:
scanType: 'Register'
verbosity: 'Verbose'
alertWarningLevel: 'High'
# Build PyTorch and produce artifacts for verification stage
- ${{ if eq(parameters.is_official_build, 'True') }}:
# Build PyTorch from source in install mode and exclude test binaries
- bash: python setup.py install
displayName: Build PyTorch from source without test binaries
# Package PyTorch Wheel
- bash: python setup.py bdist_wheel
displayName: Package PyTorch Wheel
# Publish PyTorch Wheel
- task: PublishPipelineArtifact@1
inputs:
targetPath: $(Build.SourcesDirectory)/dist/
artifactName: Build_$(Build.BuildNumber)_$(configuration)
displayName: Publish PyTorch Wheel to Pipeline Artifacts
# Verification stage
- ${{ if eq(parameters.verify_stage, 'True') }}:
# Download PyTorch Wheel
- task: DownloadPipelineArtifact@2
inputs:
artifact: Build_$(Build.BuildNumber)_$(configuration)
path: $(Build.SourcesDirectory)/verify
displayName: Download PyTorch Wheel
# Install PyTorch Wheel on Linux
- bash: python -m pip install $(Build.SourcesDirectory)/verify/torch*linux*.whl
displayName: Install PyTorch Wheel
# Ensure PyTorch installed correctly from produced wheel
- bash: |
cd $(Build.SourcesDirectory)/verify
python -c "import torch; print('Installed Torch version: ' + torch.__version__)"
displayName: Check PyTorch correctly installed from wheel
# Publishing stage
- ${{ if eq(parameters.publish_stage, 'True') }}:
# Download PyTorch Wheel
- task: DownloadPipelineArtifact@2
inputs:
artifact: Build_$(Build.BuildNumber)_$(configuration)
path: $(Build.SourcesDirectory)/publish
displayName: Download PyTorch Wheel
# Publish wheel to Azure Artifacts
# The flag continueOnError=true is needed as the artifact to be published
# may already exist, because the artifact is differentiated based on the
# last commit date.
- bash: |
export TORCH_VERSION=$(head -c 5 ./version.txt)
export LAST_COMMIT=$(git rev-parse --short HEAD)
export LAST_COMMIT_DATE=$(git log -1 --pretty=%ad --date=format:%Y%m%d)
cd $(Build.SourcesDirectory)/publish
export TORCH_WHEEL=$(echo torch*linux*whl)
az extension add -n azure-devops
echo $ADOTOKEN | az devops login
az artifacts universal publish --organization $AZURE_DEVOPS_ARTIFACTS_ORGANIZATION --project $AZURE_DEVOPS_ARTIFACTS_PROJECT --scope project --feed "PyTorch" --name $TORCH_WHEEL --description "PyTorch Official Build Artifact" --version $TORCH_VERSION-$LAST_COMMIT_DATE-$LAST_COMMIT --path .
env:
ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
continueOnError: true
displayName: Upload PyTorch Official Build package to Azure Artifacts


@@ -0,0 +1,150 @@
# PyTorch build steps template with Windows images Azure DevOps Instances
#
# This build depends on 3 parameters set as environment variables in the pipeline:
# - AZURE_DEVOPS_CLI_PAT: Secret var for authenticating to Azure DevOps
# - AZURE_DEVOPS_ARTIFACTS_ORGANIZATION: Azure Artifacts Organization name to publish artifacts
# - AZURE_DEVOPS_ARTIFACTS_PROJECT: Azure Artifacts Project name to publish artifacts
parameters:
name: ''
pool: ''
os: ''
cuda: ''
is_ci_build: False
is_official_build: False
is_daily_build: False
build_stage: False
verify_stage: False
publish_stage: False
customMatrixes: ''
jobs:
- job: ${{parameters.name}}
timeoutInMinutes: 300
strategy:
matrix:
${{ insert }}: ${{parameters.customMatrixes}}
pool:
name: ${{ parameters.pool}}
variables:
CMAKE_GENERATOR: Ninja
PACKAGE_PDBS: 0
steps:
# Prepare for PyTorch build on Windows
- template: prepare-build-template.yml
parameters:
configuration: $(configuration)
build_stage: ${{ parameters.build_stage}}
# Build Stage
- ${{ if eq(parameters.build_stage, 'True') }}:
# Set up environment variables for specific pipeline build
- template: set-environment-variables.yml
parameters:
os: ${{ parameters.os}}
cuda: ${{ parameters.cuda}}
is_official_build: ${{ parameters.is_official_build}}
# Sync and update PyTorch submodules
- script: git submodule update --init --recursive
displayName: Update PyTorch submodules
# Build PyTorch and run unit tests - no packaging
- ${{ if or(eq(parameters.is_ci_build, 'True'), eq(parameters.is_daily_build, 'True')) }}:
# Build PyTorch from source in develop mode with Ninja
- script: call activate $(configuration) && python setup.py develop
displayName: Build PyTorch from source
- ${{ if eq(parameters.is_ci_build, 'True') }}:
# Run TestTorch unit tests to demonstrate successful PyTorch build
- script: call activate $(configuration) && python test\test_torch.py TestTorch
displayName: Run TestTorch unit tests
- ${{ if eq(parameters.is_daily_build, 'True') }}:
# Run all unit tests to demonstrate successful PyTorch build
- script: call activate $(configuration) && python test/run_test.py --continue-through-error --exclude-jit-executor --verbose
displayName: Run all unit tests
# Run ComponentGovernance
- task: ComponentGovernanceComponentDetection@0
inputs:
scanType: 'Register'
verbosity: 'Verbose'
alertWarningLevel: 'High'
# Build PyTorch and produce artifacts for verification stage
- ${{ if eq(parameters.is_official_build, 'True') }}:
# Build PyTorch from source in install mode with Ninja and exclude test binaries
- script: call activate $(configuration) && python setup.py install
displayName: Build PyTorch from source without test binaries
# Package PyTorch Wheel
- script: call activate $(configuration) && python setup.py bdist_wheel
displayName: Package PyTorch Wheel
# Publish PyTorch Wheel
- task: PublishPipelineArtifact@1
inputs:
targetPath: $(Build.SourcesDirectory)\dist\
artifactName: Build_$(Build.BuildNumber)_$(configuration)
displayName: Publish PyTorch Wheel to Pipeline Artifacts
# Verification Stage
- ${{ if eq(parameters.verify_stage, 'True') }}:
# Download PyTorch Wheel
- task: DownloadPipelineArtifact@2
inputs:
artifact: Build_$(Build.BuildNumber)_$(configuration)
path: $(Build.SourcesDirectory)\verify
displayName: Download PyTorch Wheel
# Install PyTorch Wheel on Windows
- script: |
call activate $(configuration)
cd $(Build.SourcesDirectory)\verify
dir torch*win*.whl /b > whl.txt
set /p whl= < whl.txt
python -m pip install %whl%
displayName: Install PyTorch Wheel
# Ensure PyTorch installed correctly from produced wheel
- script: |
call activate $(configuration)
cd $(Build.SourcesDirectory)\verify
python -c "import torch; print('Installed Torch version: ' + torch.__version__)"
displayName: Check PyTorch correctly installed from wheel
# Publishing stage
- ${{ if eq(parameters.publish_stage, 'True') }}:
# Download PyTorch Wheel
- task: DownloadPipelineArtifact@2
inputs:
artifact: Build_$(Build.BuildNumber)_$(configuration)
path: $(Build.SourcesDirectory)\publish
displayName: Download PyTorch Wheel
# Set up Azure Artifacts for Windows
# The pip install --upgrade command is a bug fix for Azure CLI on Windows
# More info: https://github.com/Azure/azure-cli/issues/16858
- script: |
pip install --upgrade pip --target \opt\az\lib\python3.6\site-packages\
az extension add -n azure-devops
displayName: Set up Azure Artifacts download on Windows
# Publish wheel to Azure Artifacts
# The flag continueOnError=true is needed as the artifact to be published
# may already exist, because the artifact is differentiated based on the
# last commit date.
- script: |
set /p TORCH_VERSION= < version.txt
cd $(Build.SourcesDirectory)\publish
git rev-parse --short HEAD > last_commit.txt && set /p LAST_COMMIT= < last_commit.txt
git log -1 --pretty=%ad --date=format:%Y%m%d > last_commit_date.txt && set /p LAST_COMMIT_DATE= < last_commit_date.txt
dir torch*win*.whl /b > whl.txt && set /p TORCH_WHEEL= < whl.txt
echo %ADOTOKEN% | az devops login
az artifacts universal publish --organization %AZURE_DEVOPS_ARTIFACTS_ORGANIZATION% --project %AZURE_DEVOPS_ARTIFACTS_PROJECT% --scope project --feed "PyTorch" --name %TORCH_WHEEL% --description "PyTorch Official Build Artifact" --version %TORCH_VERSION:~0,5%-%LAST_COMMIT_DATE%-%LAST_COMMIT% --path .
env:
ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
continueOnError: true
displayName: Upload PyTorch nightly package to Azure Artifacts


@@ -0,0 +1,17 @@
dependencies:
- python=PYTHON_VERSION
- numpy
- ninja
- pyyaml
- mkl
- mkl-include
- setuptools
- cmake
- cffi
- typing_extensions
- future
- six
- requests
- dataclasses
- pip:
- -r ../../requirements.txt


@@ -0,0 +1,62 @@
# Build prepare steps for PyTorch on Azure DevOps to build from source.
# These steps are shared between the normal build process and Semmle security scan tasks
parameters:
build_stage: False
configuration: ''
steps:
# End Python tasks that may be lingering from previous runs
# Note: If python.exe isn't currently running, the exit code becomes 128,
# which fails the run. Here the exit code is set to 0 to avoid a failed run.
- script: |
taskkill /f /im python.exe
IF %ERRORLEVEL% EQU 128 exit 0
displayName: End previous Python processes
# Clean up env directory in conda for fresh builds and set up conda environment YAML
- powershell: |
Remove-Item 'C:\Miniconda\envs' -Recurse -ErrorAction Ignore
$env:PYTHON_VERSION = $env:SYSTEM_JOBNAME.Substring(3,1) + '.' + $env:SYSTEM_JOBNAME.Substring(4,1)
(Get-Content .azure_pipelines\job_templates\common-packages.yml) -replace 'PYTHON_VERSION', $env:PYTHON_VERSION | Out-File -encoding ASCII .azure_pipelines\job_templates\common-packages.yml
displayName: Clean up previous environments and Set up conda environment YAML
# Make conda environment and install required packages
- script: |
call conda clean --all -y
call conda env create -n $(configuration) --file .azure_pipelines\job_templates\common-packages.yml
call activate $(configuration)
call conda install -c conda-forge libuv=1.39
displayName: Set up conda environment for building from source
- ${{ if eq(parameters.build_stage, 'True') }}:
# Install MKL
- script: |
rmdir /s /q mkl
del mkl_2020.2.254.7z
curl https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z -k -O
7z x -aoa mkl_2020.2.254.7z -omkl
displayName: Install MKL
# Install sccache and randomtemp
# Related PyTorch GitHub issue: https://github.com/pytorch/pytorch/issues/25393
# Related fix: https://github.com/pytorch/builder/pull/448/
- script: |
mkdir .\tmp_bin
curl -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output .\tmp_bin\sccache.exe
curl -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output .\tmp_bin\sccache-cl.exe
copy .\tmp_bin\sccache.exe .\tmp_bin\nvcc.exe
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.3/randomtemp.exe --output .\tmp_bin\randomtemp.exe
displayName: Install sccache and randomtemp
condition: not(eq(variables.CUDA_VERSION, ''))
# CUDA 11.2's CUB directory conflicts with CUDA 10.2 and 10.1
# builds, where CUDA 11.2's CUB is injected into non-CUDA
# 11.2 builds.
- powershell: Remove-Item "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include\cub" -Recurse -ErrorAction Ignore
displayName: Remove conflicting CUB from CUDA installation
condition: not(eq(variables.CUDA_VERSION, ''))
- powershell: Copy-Item -Path "F:\cuda_11_2\cub\" -Destination "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" -Recurse
displayName: Copy CUDA CUB for CUDA 11.2 build
condition: eq(variables.CUDA_VERSION, '112')


@@ -0,0 +1,51 @@
# PyTorch build steps template with Unix images Azure DevOps Instances
#
# This build depends on 5 parameters set as environment variables in the pipeline:
# - AZURE_DEVOPS_CLI_PAT: Secret var for authenticating to Azure DevOps
# - AZURE_STORAGE_KEY: Secret var for authenticating to Azure Storage
# - _TS_CLONE_P, _TS_P, _TS_SM_P: Secret vars for specific unit tests
parameters:
name: ''
pool: ''
container_endpoint: ''
customMatrixes: ''
jobs:
- job: ${{parameters.name}}
timeoutInMinutes: 600
strategy:
matrix:
${{ insert }}: ${{parameters.customMatrixes}}
pool:
name: ${{ parameters.pool}}
variables:
DECODE_PERCENTS: false
steps:
# Don't checkout repo contents to save time and CPU compute. Environment variables
# related to checkout branch such as $(BUILD_SOURCEBRANCH) are still available.
- checkout: none
# Delete pytorch_tests repo from previous builds if exists
- bash: rm -rf pytorch_tests/
displayName: Delete pytorch_tests repo from previous builds if exists
# Clone PyTorch Tests repository
- bash: |
B64_PAT=$(printf "%s"":$_ADOTOKEN" | base64)
git -c http.extraHeader="Authorization: Basic ${B64_PAT}" clone $(AZURE_DEVOPS_PYTORCH_TESTS_REPO_URL)
cd pytorch_tests
git checkout $(PYTORCH_TESTS_CHECKOUT_BRANCH)
env:
_ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
displayName: Clone PyTorch Tests repo
# Run PyTorch Unit Tests
- bash: bash $(Build.SourcesDirectory)/pytorch_tests/scripts/linux/run.sh
env:
_AZURE_STORAGE_KEY: $(AZURE_STORAGE_KEY)
_TS_CLONE_P: $(TS_CLONE_PASSWORD)
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
displayName: Run PyTorch Unit Tests


@@ -0,0 +1,49 @@
# PyTorch build steps template with Windows images Azure DevOps Instances
#
# This build depends on 5 parameters set as environment variables in the pipeline:
# - AZURE_DEVOPS_CLI_PAT: Secret var for authenticating to Azure DevOps
# - AZURE_STORAGE_KEY: Secret var for authenticating to Azure Storage
# - _TS_CLONE_P, _TS_P, _TS_SM_P: Secret vars for specific unit tests
parameters:
name: ''
pool: ''
customMatrixes: ''
jobs:
- job: ${{parameters.name}}
timeoutInMinutes: 600
strategy:
matrix:
${{ insert }}: ${{parameters.customMatrixes}}
pool:
name: ${{ parameters.pool}}
steps:
# Don't checkout repo contents to save time and CPU compute. Environment variables
# related to checkout branch such as $(BUILD_SOURCEBRANCH) are still available.
- checkout: none
# Delete pytorch_tests repo from previous builds if exists
- script: if exist "pytorch_tests/" rmdir "pytorch_tests/" /q /s
displayName: Delete pytorch_tests repo from previous builds if exists
# Clone PyTorch Tests repository
- powershell: |
$env:B64Pat = [Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes(":$env:_ADOTOKEN"))
git -c http.extraHeader="Authorization: Basic $env:B64Pat" clone $env:AZURE_DEVOPS_pytorch_tests_REPO_URL
cd pytorch_tests
git checkout $(PYTORCH_TESTS_CHECKOUT_BRANCH)
env:
_ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
displayName: Clone PyTorch Tests repo
# Run PyTorch Unit Tests
- script: call $(Build.SourcesDirectory)\pytorch_tests\scripts\windows\run.bat
env:
_ADOTOKEN: $(AZURE_DEVOPS_CLI_PAT)
_AZURE_STORAGE_KEY: $(AZURE_STORAGE_KEY)
_TS_CLONE_P: $(TS_CLONE_PASSWORD)
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
displayName: Run PyTorch Unit Tests


@@ -0,0 +1,131 @@
# Set environment variables for specific configurations
parameters:
is_official_build: False
os: ''
cuda: ''
steps:
# Environment configuration steps for Ubuntu builds
- ${{ if contains(parameters.os, 'ubuntu') }}:
# Set configuration specific build flags
- ${{ if eq(parameters.is_official_build, True) }}:
- bash: |
echo "##vso[task.setvariable variable=INSTALL_TEST;]0"
echo "##vso[task.setvariable variable=PYTORCH_BUILD_NUMBER;]1"
export PYTORCH_VERSION=$(head -c 5 ./version.txt)
echo "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$PYTORCH_VERSION.dev"
displayName: Set configuration-specific build flags
# Set PyTorch CPU/GPU build flags.
- ${{ if contains(parameters.cuda, 'cpu') }}:
- bash: |
echo "##vso[task.setvariable variable=USE_CUDA;]0"
echo "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$(PYTORCH_BUILD_VERSION).cpu"
displayName: Set CUDA-specific build flag for CPU builds
- ${{ if contains(parameters.cuda, 'gpu') }}:
- bash: |
echo "##vso[task.setvariable variable=USE_CUDA;]1"
echo "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$(PYTORCH_BUILD_VERSION).cu$(CUDA_VERSION)"
displayName: Set CUDA-specific build flag for GPU builds
# Set MKL environment variables
- bash: |
echo "##vso[task.setvariable variable=CMAKE_LIBRARY_PATH;]/opt/intel/lib:$CMAKE_LIBRARY_PATH"
echo "##vso[task.setvariable variable=CMAKE_INCLUDE_PATH;]/opt/intel/include:$CMAKE_INCLUDE_PATH"
displayName: Set MKL paths
# View current environment variables
- bash:
printenv
displayName: Show environment variables
# Environment configuration steps for Windows builds
- ${{ if contains(parameters.os, 'windows') }}:
# Set Conda Lib Path
- powershell: Write-Host "##vso[task.setvariable variable=CONDA_LIB_PATH;]C:\Miniconda\envs\$(configuration)\Library\bin"
displayName: Set Conda Lib Path
# Set configuration specific build flags
- ${{ if eq(parameters.is_official_build, True) }}:
- powershell: |
Write-Host "##vso[task.setvariable variable=INSTALL_TEST;]0"
Write-Host "##vso[task.setvariable variable=PYTORCH_BUILD_NUMBER;]1"
Set-Variable -Name PYTORCH_VERSION -Value (Get-Content .\version.txt).Substring(0,5)
Write-Host "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$PYTORCH_VERSION.dev"
displayName: Set configuration-specific build flags
# Set PyTorch CPU/GPU build flags.
- ${{ if contains(parameters.cuda, 'cpu') }}:
- powershell: |
Write-Host "##vso[task.setvariable variable=USE_CUDA;]0"
Write-Host "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$(PYTORCH_BUILD_VERSION).cpu"
displayName: Set CUDA-specific build flag for CPU build
- ${{ if contains(parameters.cuda, 'gpu') }}:
- powershell: |
Write-Host "##vso[task.setvariable variable=USE_CUDA;]1"
Write-Host "##vso[task.setvariable variable=PYTORCH_BUILD_VERSION;]$(PYTORCH_BUILD_VERSION).cu$(CUDA_VERSION)"
displayName: Set CUDA-specific build flag for GPU build
# Set CUDA 11.2, 10.2 or 10.1 specific build flags
- ${{ if eq(parameters.cuda, 'gpu') }}:
- powershell: |
Write-Host "##vso[task.setvariable variable=TORCH_CUDA_ARCH_LIST;]3.7+PTX;5.0;6.0;6.1;7.0;7.5;8.0;8.6"
Write-Host "##vso[task.setvariable variable=CUDA_PATH;]C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\"
displayName: Set CUDA 11.2 specific build flags
condition: eq(variables.CUDA_VERSION, '112')
- powershell: |
Write-Host "##vso[task.setvariable variable=TORCH_CUDA_ARCH_LIST;]3.7+PTX;5.0;6.0;6.1;7.0;7.5"
Write-Host "##vso[task.setvariable variable=CUDA_PATH;]C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\"
displayName: Set CUDA 10.2 specific build flags
condition: eq(variables.CUDA_VERSION, '102')
- powershell: |
Write-Host "##vso[task.setvariable variable=TORCH_CUDA_ARCH_LIST;]3.7+PTX;5.0;6.0;6.1;7.0;7.5"
Write-Host "##vso[task.setvariable variable=CUDA_PATH;]C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\"
displayName: Set CUDA 10.1 specific build flags
condition: eq(variables.CUDA_VERSION, '101')
- powershell: |
Write-Host "##vso[task.setvariable variable=CUDA_BIN_PATH;]$env:CUDA_PATH\bin\"
Write-Host "##vso[task.setvariable variable=CUDNN_ROOT;]$env:CUDA_PATH"
Write-Host "##vso[task.setvariable variable=CUDNN_INCLUDE_DIR;]$env:CUDA_PATH\include\"
Write-Host "##vso[task.setvariable variable=CUDNN_LIBRARY;]$env:CUDA_PATH\lib\x64\"
Write-Host "##vso[task.prependpath]$env:CUDA_PATH\bin"
Write-Host "##vso[task.setvariable variable=TORCH_NVCC_FLAGS;]-Xfatbin -compress-all --no-host-device-move-forward"
Write-Host "##vso[task.setvariable variable=THRUST_IGNORE_CUB_VERSION_CHECK;]1"
Write-Host "##vso[task.setvariable variable=NVTOOLSEXT_PATH;]C:\Program Files\NVIDIA Corporation\NvToolsExt\"
displayName: Set CUDA environment variables
- powershell: |
copy "$(CUDA_BIN_PATH)\cusparse*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\cublas*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\cudart*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\curand*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\cufft*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\cusolver*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\cudnn*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CUDA_BIN_PATH)\nvrtc*64_*.dll*" $(Build.SourcesDirectory)\torch\lib
copy "C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64\nvToolsExt64_1.dll*" $(Build.SourcesDirectory)\torch\lib
copy "$(CONDA_LIB_PATH)\libiomp*5md.dll" $(Build.SourcesDirectory)\torch\lib
copy "$(CONDA_LIB_PATH)\uv.dll" $(Build.SourcesDirectory)\torch\lib
displayName: Copy CUDA/cuDNN/libomp/libuv dlls to torch\lib
# Set MKL, sccache and randomtemp environment variables
- powershell: |
Write-Host "##vso[task.setvariable variable=CMAKE_INCLUDE_PATH;]$(Build.SourcesDirectory)\mkl\include"
Write-Host "##vso[task.setvariable variable=CMAKE_LIBRARY_PATH;]$(Build.SourcesDirectory)\mkl\lib;$env:CMAKE_LIBRARY_PATH"
Write-Host "##vso[task.setvariable variable=ADDITIONAL_PATH;]$(Build.SourcesDirectory)\tmp_bin"
Write-Host "##vso[task.setvariable variable=SCCACHE_IDLE_TIMEOUT;]1500"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\nvcc.exe"
Write-Host "##vso[task.setvariable variable=CUDA_NVCC_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\randomtemp.exe"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_BASEDIR;]$(Build.SourcesDirectory)\tmp_bin"
displayName: Set MKL, sccache and randomtemp environment variables
# View current environment variables
- script:
set
displayName: Show environment variables


@@ -0,0 +1,14 @@
# Main logic to initiate wait for PR artifact to be ready
steps:
- task: InvokeRESTAPI@1
displayName: 'Wait for job success and wheel ready'
timeoutInMinutes: 60
inputs:
connectionType: 'connectedServiceName'
serviceConnection: circleciconn
method: 'POST'
headers: '{"Content-Type":"application/json", "BranchName":"$(TARGET_BRANCH_TO_CHECK_PR)", "JobName":"$(TARGET_CIRCLECI_PR)", "PlanUrl":"$(System.CollectionUri)", "ProjectId":"$(System.TeamProjectId)", "HubName":"$(System.HostType)", "PlanId":"$(System.PlanId)", "JobId":"$(System.JobId)", "TimelineId":"$(System.TimelineId)", "TaskInstanceId":"$(System.TaskInstanceId)", "AuthToken":"$(System.AccessToken)"}'
body: ''
urlSuffix: 'api/JobStatus'
waitForCompletion: true


@@ -0,0 +1,49 @@
# Initiate 5 agentless-server waiting jobs to check on the
# status of PR artifact builds, for a maximum wait time of
# 5 * 60 min = 300 minutes. These jobs will pass immediately
# once targeted CircleCI build is ready.
jobs:
- job: checkjob1
pool: server
timeoutInMinutes: 60
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob2
pool: server
timeoutInMinutes: 60
dependsOn: checkjob1
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob3
pool: server
timeoutInMinutes: 60
dependsOn: checkjob2
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob4
pool: server
timeoutInMinutes: 60
dependsOn: checkjob3
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob5
pool: server
timeoutInMinutes: 60
dependsOn: checkjob4
continueOnError: true
steps:
- template: wheel-wait-job-template.yml


@@ -0,0 +1,50 @@
# PyTorch Nightly PyTorch Tests Builds Pipeline on Azure DevOps
#
# This pipeline runs custom PyTorch unit-tests on nightly
# PyTorch wheels.
stages:
- stage: 'NightlyCustomTests'
displayName: 'Run custom unit tests on PyTorch wheels'
jobs:
- template: job_templates/pytorch-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: $(BUILD_POOL_LIN_1)
customMatrixes:
Nightly_Custom_Tests:
_DOCKER_IMAGE: $(DOCKER_IMAGE_LIN_1)
_PYTHON_VERSION: $(PYTHON_VERSION_LIN_1)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_LIN_1)
_RUN_TESTS: $(RUN_TESTS_LIN)
- template: job_templates/pytorch-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: $(BUILD_POOL_LIN_2)
customMatrixes:
Nightly_Custom_Tests:
_DOCKER_IMAGE: $(DOCKER_IMAGE_LIN_2)
_PYTHON_VERSION: $(PYTHON_VERSION_LIN_2)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_LIN_2)
_RUN_TESTS: $(RUN_TESTS_LIN)
- template: job_templates/pytorch-template-win.yml
parameters:
name: windows_2019_CPU
pool: $(BUILD_POOL_WIN_1)
customMatrixes:
Nightly_Custom_Tests:
_PYTHON_VERSION: $(PYTHON_VERSION_WIN_1)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_WIN_1)
_RUN_TESTS: $(RUN_TESTS_WIN)
- template: job_templates/pytorch-template-win.yml
parameters:
name: windows_2019_GPU
pool: $(BUILD_POOL_WIN_2)
customMatrixes:
Nightly_Custom_Tests:
_PYTHON_VERSION: $(PYTHON_VERSION_WIN_2)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_WIN_2)
_RUN_TESTS: $(RUN_TESTS_WIN)


@@ -0,0 +1,30 @@
# PyTorch PR PyTorch Tests Builds Pipeline on Azure DevOps
#
# This pipeline:
# 1) ensures that CircleCI builds for a given PR
# have finished, and that its artifacts are
# ready for download
# 2) runs custom PyTorch unit-tests on PyTorch
# wheels generated during PR builds.
stages:
- stage: 'EnsureArtifactsReady'
displayName: 'Ensure PyTorch PR Artifacts are ready'
jobs:
- template: job_templates/wheel-wait-template.yml
- stage: 'PRCustomTests'
displayName: 'Run custom unit tests on PyTorch wheels'
jobs:
- template: job_templates/pytorch-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: $(BUILD_POOL_PR)
customMatrixes:
PR_Custom_Tests:
_PYTHON_VERSION: $(PYTHON_VERSION_PR)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_PR)
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_PR)
_TARGET_BRANCH_TO_CHECK: $(TARGET_BRANCH_TO_CHECK_PR)
_DOCKER_IMAGE: $(DOCKER_IMAGE_PR)
_RUN_TESTS: $(RUN_TESTS_PR)


@@ -0,0 +1,224 @@
# PyTorch Official Builds Pipeline on Azure DevOps
#
# This pipeline:
# 1) builds PyTorch on all available configurations
# 2) verifies PyTorch artifacts by installing them in a clean environment
# and checking torch.__version__
# 3) publishes official PyTorch artifacts to Azure DevOps Artifacts for consumption
stages:
- stage: 'Build'
displayName: 'Build PyTorch'
jobs:
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: 'PyTorch-Linux-CPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_official_build: True
os: ubuntu
cuda: cpu
customMatrixes:
Py_38:
configuration: ubuntu_1804_py_38_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cpu_dev
Py_37:
configuration: ubuntu_1804_py_37_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cpu_dev
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: 'PyTorch-Linux-GPU'
container_endpoint: pytorchms.azurecr.io
build_stage: True
is_official_build: True
os: ubuntu
cuda: gpu
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: ubuntu_1804_py_39_cuda_112_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_39_cuda_112_cudnn_8_dev
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_810:
configuration: ubuntu_1804_py_38_cuda_102_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cuda_102_cudnn_8_dev
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_765:
configuration: ubuntu_1804_py_37_cuda_101_cudnn_765
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cuda_101_cudnn_7_dev
CUDA_VERSION: 101
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_CPU
pool: 'PyTorch-Win-CPU'
build_stage: True
is_official_build: True
os: windows
cuda: cpu
customMatrixes:
Py_38:
configuration: windows_2019_py_38_cpu
Py_37:
configuration: windows_2019_py_37_cpu
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_GPU
pool: 'PyTorch-Win-GPU'
build_stage: True
is_official_build: True
os: windows
cuda: gpu
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: windows_2019_py_39_cuda_112_cudnn_810
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_765:
configuration: windows_2019_py_38_cuda_102_cudnn_765
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_764:
configuration: windows_2019_py_37_cuda_101_cudnn_764
CUDA_VERSION: 101
- stage: 'Verify'
displayName: 'Verify PyTorch wheels'
dependsOn: Build
condition: succeeded()
jobs:
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: 'PyTorch-Linux-CPU'
container_endpoint: pytorchms.azurecr.io
verify_stage: True
is_official_build: True
customMatrixes:
Py_38:
configuration: ubuntu_1804_py_38_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cpu_dev
Py_37:
configuration: ubuntu_1804_py_37_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cpu_dev
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: 'PyTorch-Linux-GPU'
container_endpoint: pytorchms.azurecr.io
verify_stage: True
is_official_build: True
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: ubuntu_1804_py_39_cuda_112_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_39_cuda_112_cudnn_8_dev
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_810:
configuration: ubuntu_1804_py_38_cuda_102_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cuda_102_cudnn_8_dev
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_765:
configuration: ubuntu_1804_py_37_cuda_101_cudnn_765
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cuda_101_cudnn_7_dev
CUDA_VERSION: 101
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_CPU
pool: 'PyTorch-Win-CPU'
verify_stage: True
is_official_build: True
customMatrixes:
Py_38:
configuration: windows_2019_py_38_cpu
Py_37:
configuration: windows_2019_py_37_cpu
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_GPU
pool: 'PyTorch-Win-GPU'
verify_stage: True
is_official_build: True
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: windows_2019_py_39_cuda_112_cudnn_810
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_765:
configuration: windows_2019_py_38_cuda_102_cudnn_765
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_764:
configuration: windows_2019_py_37_cuda_101_cudnn_764
CUDA_VERSION: 101
- stage: 'Publish'
displayName: 'Publish PyTorch wheels'
dependsOn: Verify
condition: succeeded()
jobs:
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_CPU_docker
pool: 'PyTorch-Linux-CPU'
container_endpoint: pytorchms.azurecr.io
publish_stage: True
is_official_build: True
customMatrixes:
Py_38:
configuration: ubuntu_1804_py_38_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cpu_dev
Py_37:
configuration: ubuntu_1804_py_37_cpu
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cpu_dev
- template: job_templates/build-verify-publish-template-unix.yml
parameters:
name: ubuntu_1804_GPU_docker
pool: 'PyTorch-Linux-GPU'
container_endpoint: pytorchms.azurecr.io
publish_stage: True
is_official_build: True
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: ubuntu_1804_py_39_cuda_112_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_39_cuda_112_cudnn_8_dev
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_810:
configuration: ubuntu_1804_py_38_cuda_102_cudnn_810
container_image: pytorchms.azurecr.io/ubuntu_1804_py_38_cuda_102_cudnn_8_dev
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_765:
configuration: ubuntu_1804_py_37_cuda_101_cudnn_765
container_image: pytorchms.azurecr.io/ubuntu_1804_py_37_cuda_101_cudnn_7_dev
CUDA_VERSION: 101
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_CPU
pool: 'PyTorch-Win-CPU'
publish_stage: True
is_official_build: True
customMatrixes:
Py_38:
configuration: windows_2019_py_38_cpu
Py_37:
configuration: windows_2019_py_37_cpu
- template: job_templates/build-verify-publish-template-win.yml
parameters:
name: windows_2019_GPU
pool: 'PyTorch-Win-GPU'
publish_stage: True
is_official_build: True
customMatrixes:
Py_39_CUDA_112_cuDNN_810:
configuration: windows_2019_py_39_cuda_112_cudnn_810
CUDA_VERSION: 112
Py_38_CUDA_102_cuDNN_765:
configuration: windows_2019_py_38_cuda_102_cudnn_765
CUDA_VERSION: 102
Py_37_CUDA_101_cuDNN_764:
configuration: windows_2019_py_37_cuda_101_cudnn_764
CUDA_VERSION: 101

@@ -52,9 +52,18 @@ CONFIG_TREE_DATA = OrderedDict(
             "3.7",
         ],
     )),
-    # Skip CUDA-9.2 builds on Windows
+    macos_arm64=([None], OrderedDict(
+        wheel=[
+            "3.8",
+            "3.9",
+        ],
+        conda=[
+            "3.8",
+            "3.9",
+        ],
+    )),
     windows=(
-        [v for v in dimensions.GPU_VERSIONS if v not in ['cuda92'] + dimensions.ROCM_VERSION_LABELS],
+        [v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS],
         OrderedDict(
             wheel=dimensions.STANDARD_PYTHON_VERSIONS,
             conda=dimensions.STANDARD_PYTHON_VERSIONS,

@@ -27,7 +27,19 @@ class Conf(object):
     def gen_docker_image(self):
         if self.gcc_config_variant == 'gcc5.4_cxx11-abi':
-            return miniutils.quote("pytorch/pytorch-binary-docker-image-ubuntu16.04:latest")
+            if self.gpu_version is None:
+                return miniutils.quote("pytorch/libtorch-cxx11-builder:cpu")
+            else:
+                return miniutils.quote(
+                    f"pytorch/libtorch-cxx11-builder:{self.gpu_version}"
+                )
+
+        if self.pydistro == "conda":
+            if self.gpu_version is None:
+                return miniutils.quote("pytorch/conda-builder:cpu")
+            else:
+                return miniutils.quote(
+                    f"pytorch/conda-builder:{self.gpu_version}"
+                )
 
         docker_word_substitution = {
@@ -164,7 +176,7 @@ def gen_build_env_list(smoke):
         c.find_prop("gpu"),
         c.find_prop("package_format"),
         [c.find_prop("pyver")],
-        c.find_prop("smoke"),
+        c.find_prop("smoke") and not (c.find_prop("os_name") == "macos_arm64"),  # don't test arm64
         c.find_prop("libtorch_variant"),
         c.find_prop("gcc_config_variant"),
         c.find_prop("libtorch_config_variant"),
@@ -216,7 +228,9 @@ def get_jobs(toplevel_key, smoke):
     configs = gen_build_env_list(smoke)
     phase = "build" if toplevel_key == "binarybuilds" else "test"
     for build_config in configs:
-        jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
+        # don't test for macos_arm64 as it's cross compiled
+        if phase != "test" or build_config.os != "macos_arm64":
+            jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
 
     return jobs_list

@@ -1,14 +1,14 @@
 PHASES = ["build", "test"]
 
 CUDA_VERSIONS = [
-    "101",
     "102",
-    "112",
+    "111",
 ]
 
 ROCM_VERSIONS = [
-    "3.10",
     "4.0.1",
+    "4.1",
+    "4.2",
 ]
 
 ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]

@@ -32,24 +32,9 @@ CONFIG_TREE_DATA = [
             ]),
         ]),
         ("cuda", [
-            ("9.2", [
-                ("3.6", [
-                    X(True),
-                    ("cuda_gcc_override", [
-                        ("gcc5.4", [
-                            ('build_only', [XImportant(True)]),
-                        ]),
-                    ]),
-                ])
-            ]),
-            ("10.1", [
-                ("3.6", [
-                    ('build_only', [X(True)]),
-                ]),
-            ]),
             ("10.2", [
                 ("3.6", [
-                    ("shard_test", [XImportant(True)]),
+                    ("shard_test", [X(True)]),
                     ("libtorch", [
                         (True, [
                             ('build_only', [X(True)]),
@@ -59,10 +44,10 @@ CONFIG_TREE_DATA = [
             ]),
             ("11.1", [
                 ("3.8", [
-                    X(True),
+                    ("shard_test", [XImportant(True)]),
                     ("libtorch", [
                         (True, [
-                            ('build_only', [XImportant(True)]),
+                            ('build_only', [X(True)]),
                         ]),
                     ]),
                 ]),
@@ -72,7 +57,9 @@ CONFIG_TREE_DATA = [
     ("bionic", [
         ("clang", [
             ("9", [
-                XImportant("3.6"),
+                ("3.6", [
+                    ("noarch", [XImportant(True)]),
+                ]),
             ]),
             ("9", [
                 ("3.6", [
@@ -81,6 +68,13 @@ CONFIG_TREE_DATA = [
                 ]),
             ]),
         ]),
+        ("cuda", [
+            ("10.2", [
+                ("3.9", [
+                    ("shard_test", [XImportant(True)]),
+                ]),
+            ]),
+        ]),
         ("gcc", [
             ("9", [
                 ("3.8", [
@@ -151,6 +145,8 @@ class PyVerConfigNode(TreeConfigNode):
     def init2(self, node_name):
         self.props["pyver"] = node_name
         self.props["abbreviated_pyver"] = get_major_pyver(node_name)
+        if node_name == "3.9":
+            self.props["abbreviated_pyver"] = "py3.9"
 
     # noinspection PyMethodMayBeStatic
     def child_constructor(self):
@@ -167,8 +163,10 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
         next_nodes = {
             "asan": AsanConfigNode,
             "xla": XlaConfigNode,
+            "mlc": MLCConfigNode,
             "vulkan": VulkanConfigNode,
             "parallel_tbb": ParallelTBBConfigNode,
+            "noarch": NoarchConfigNode,
             "parallel_native": ParallelNativeConfigNode,
             "onnx": ONNXConfigNode,
             "libtorch": LibTorchConfigNode,
@@ -203,6 +201,16 @@ class XlaConfigNode(TreeConfigNode):
     def child_constructor(self):
         return ImportantConfigNode
 
+
+class MLCConfigNode(TreeConfigNode):
+    def modify_label(self, label):
+        return "MLC=" + str(label)
+
+    def init2(self, node_name):
+        self.props["is_mlc"] = node_name
+
+    def child_constructor(self):
+        return ImportantConfigNode
+
 
 class AsanConfigNode(TreeConfigNode):
     def modify_label(self, label):
@@ -248,6 +256,14 @@ class ParallelTBBConfigNode(TreeConfigNode):
         return ImportantConfigNode
 
 
+class NoarchConfigNode(TreeConfigNode):
+    def init2(self, node_name):
+        self.props["is_noarch"] = node_name
+
+    def child_constructor(self):
+        return ImportantConfigNode
+
+
 class ParallelNativeConfigNode(TreeConfigNode):
     def modify_label(self, label):
         return "PARALLELNATIVE=" + str(label)

View File
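
These config nodes only set props; leaves later read them back through `find_prop`, which searches a node's ancestors. A hypothetical mini version of that lookup (the names here are illustrative, not the actual `TreeConfigNode` API):

```python
class Node:
    def __init__(self, parent=None, props=None):
        self.parent = parent
        self.props = props or {}

    def find_prop(self, key):
        # Walk from this node up to the root and return the first value
        # found, so a leaf inherits settings such as "is_noarch".
        node = self
        while node is not None:
            if key in node.props:
                return node.props[key]
            node = node.parent
        return None

root = Node(props={"is_noarch": True})
leaf = Node(parent=Node(parent=root))
assert leaf.find_prop("is_noarch") is True
assert leaf.find_prop("is_mlc") is None
```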

@@ -273,6 +273,7 @@ def instantiate_configs():
         is_xla = fc.find_prop("is_xla") or False
         is_asan = fc.find_prop("is_asan") or False
         is_coverage = fc.find_prop("is_coverage") or False
+        is_noarch = fc.find_prop("is_noarch") or False
         is_onnx = fc.find_prop("is_onnx") or False
         is_pure_torch = fc.find_prop("is_pure_torch") or False
         is_vulkan = fc.find_prop("is_vulkan") or False
@@ -316,6 +317,9 @@ def instantiate_configs():
             parms_list_ignored_for_docker_image.append("coverage")
             python_version = fc.find_prop("pyver")

+        if is_noarch:
+            parms_list_ignored_for_docker_image.append("noarch")
+
         if is_onnx:
             parms_list.append("onnx")
             python_version = fc.find_prop("pyver")

@@ -2,6 +2,7 @@ import cimodel.data.simple.util.branch_filters as branch_filters
 from cimodel.data.simple.util.docker_constants import (
     DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK
 )
+import cimodel.lib.miniutils as miniutils


 class AndroidJob:
@@ -51,13 +52,15 @@ class AndroidGradleJob:
                  template_name,
                  dependencies,
                  is_master_only=True,
-                 is_pr_only=False):
+                 is_pr_only=False,
+                 extra_props=tuple()):

         self.job_name = job_name
         self.template_name = template_name
         self.dependencies = dependencies
         self.is_master_only = is_master_only
         self.is_pr_only = is_pr_only
+        self.extra_props = dict(extra_props)

     def gen_tree(self):
@@ -70,6 +73,8 @@ class AndroidGradleJob:
             props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
         elif self.is_pr_only:
             props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.PR_BRANCH_LIST)
+        if self.extra_props:
+            props_dict.update(self.extra_props)

         return [{self.template_name: props_dict}]
@@ -91,6 +96,15 @@ WORKFLOW_DATA = [
         [DOCKER_REQUIREMENT_NDK],
         is_master_only=False,
         is_pr_only=True),
+    AndroidGradleJob(
+        "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
+        "pytorch_android_gradle_custom_build_single",
+        [DOCKER_REQUIREMENT_NDK],
+        is_master_only=False,
+        is_pr_only=True,
+        extra_props=tuple({
+            "lite_interpreter": miniutils.quote(str(int(False)))
+        }.items())),
     AndroidGradleJob(
         "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build",
         "pytorch_android_gradle_build",

@@ -77,7 +77,7 @@ WORKFLOW_DATA = [
         ["libtorch", "3.7m", "cpu", "devtoolset7"],
         "pytorch/manylinux-cuda102",
         "binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build",
-        is_master_only=False,
+        is_master_only=True,
        has_libtorch_variant=True,
     ),
     SmoketestJob(
@@ -109,14 +109,14 @@ WORKFLOW_DATA = [
         ["libtorch", "3.7", "cpu", "debug"],
         None,
         "binary_windows_libtorch_3_7_cpu_debug_build",
-        is_master_only=False,
+        is_master_only=True,
     ),
     SmoketestJob(
         "binary_windows_build",
         ["libtorch", "3.7", "cpu", "release"],
         None,
         "binary_windows_libtorch_3_7_cpu_release_build",
-        is_master_only=False,
+        is_master_only=True,
     ),
     SmoketestJob(
         "binary_windows_build",
@@ -131,7 +131,7 @@ WORKFLOW_DATA = [
         ["libtorch", "3.7", "cpu", "debug"],
         None,
         "binary_windows_libtorch_3_7_cpu_debug_test",
-        is_master_only=False,
+        is_master_only=True,
         requires=["binary_windows_libtorch_3_7_cpu_debug_build"],
     ),
     SmoketestJob(
@@ -173,7 +173,7 @@ WORKFLOW_DATA = [
         ["libtorch", "3.7m", "cpu", "devtoolset7"],
         "pytorch/manylinux-cuda102",
         "binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test",
-        is_master_only=False,
+        is_master_only=True,
         requires=["binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build"],
         has_libtorch_variant=True,
     ),
@@ -182,7 +182,7 @@ WORKFLOW_DATA = [
         ["libtorch", "3.7m", "cpu", "gcc5.4_cxx11-abi"],
         "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest",
         "binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test",
-        is_master_only=False,
+        is_master_only=True,
         requires=["binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build"],
         has_libtorch_variant=True,
     ),

@@ -6,21 +6,16 @@ from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN

 # TODO: make this generated from a matrix rather than just a static list
 IMAGE_NAMES = [
-    "pytorch-linux-bionic-cuda11.1-cudnn8-py3.6-gcc9",
-    "pytorch-linux-bionic-cuda11.1-cudnn8-py3.8-gcc9",
-    "pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9",
-    "pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9",
     "pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9",
+    "pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7",
     "pytorch-linux-bionic-py3.6-clang9",
     "pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9",
     "pytorch-linux-bionic-py3.8-gcc9",
     "pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7",
     "pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7",
     "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
-    "pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7",
     "pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
-    "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4",
-    "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7",
+    "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7",
     "pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
     "pytorch-linux-xenial-py3-clang5-asan",
     "pytorch-linux-xenial-py3-clang7-onnx",
@@ -30,7 +25,9 @@ IMAGE_NAMES = [
     "pytorch-linux-xenial-py3.6-gcc7.2",
     "pytorch-linux-xenial-py3.6-gcc7",
     "pytorch-linux-bionic-rocm3.9-py3.6",
-    "pytorch-linux-bionic-rocm3.10-py3.6",
+    "pytorch-linux-bionic-rocm4.0.1-py3.6",
+    "pytorch-linux-bionic-rocm4.1-py3.6",
+    "pytorch-linux-bionic-rocm4.2-py3.6",
 ]

@@ -61,10 +61,20 @@ class IOSJob:

 WORKFLOW_DATA = [
-    IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False),
-    IOSJob(XCODE_VERSION, ArchVariant("arm64")),
-    IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={"use_metal": miniutils.quote(str(int(True)))}),
-    IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={"op_list": "mobilenetv2.yaml"}),
+    IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False, extra_props={
+        "lite_interpreter": miniutils.quote(str(int(True)))}),
+    IOSJob(XCODE_VERSION, ArchVariant("x86_64", "full_jit"), is_org_member_context=False, extra_props={
+        "lite_interpreter": miniutils.quote(str(int(False)))}),
+    IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={
+        "lite_interpreter": miniutils.quote(str(int(True)))}),
+    IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={
+        "use_metal": miniutils.quote(str(int(True))),
+        "lite_interpreter": miniutils.quote(str(int(True)))}),
+    IOSJob(XCODE_VERSION, ArchVariant("arm64", "full_jit"), extra_props={
+        "lite_interpreter": miniutils.quote(str(int(False)))}),
+    IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={
+        "op_list": "mobilenetv2.yaml",
+        "lite_interpreter": miniutils.quote(str(int(True)))}),
 ]

@@ -1,14 +1,22 @@
 class MacOsJob:
-    def __init__(self, os_version, is_test=False):
+    def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()):
+        # extra_props is a tuple type, because mutable data structures for
+        # argument defaults are not recommended.
         self.os_version = os_version
+        self.is_build = is_build
         self.is_test = is_test
+        self.extra_props = dict(extra_props)

     def gen_tree(self):
         non_phase_parts = ["pytorch", "macos", self.os_version, "py3"]

-        phase_name = "test" if self.is_test else "build"
+        extra_name_list = [name for name, exist in self.extra_props.items() if exist]
+        full_job_name_list = non_phase_parts + extra_name_list + [
+            'build' if self.is_build else None,
+            'test' if self.is_test else None,
+        ]

-        full_job_name = "_".join(non_phase_parts + [phase_name])
+        full_job_name = "_".join(list(filter(None, full_job_name_list)))

         test_build_dependency = "_".join(non_phase_parts + ["build"])
         extra_dependencies = [test_build_dependency] if self.is_test else []
@@ -21,7 +29,23 @@ class MacOsJob:
         return [{full_job_name: props_dict}]

-WORKFLOW_DATA = [MacOsJob("10_13"), MacOsJob("10_13", True)]
+WORKFLOW_DATA = [
+    MacOsJob("10_15", is_build=True),
+    MacOsJob("10_13", is_build=True),
+    MacOsJob(
+        "10_13",
+        is_build=False,
+        is_test=True,
+    ),
+    MacOsJob(
+        "10_13",
+        is_build=True,
+        is_test=True,
+        extra_props=tuple({
+            "lite_interpreter": True
+        }.items()),
+    )
+]

 def get_workflow_jobs():
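
The comment in `__init__` is worth unpacking: a mutable default such as `extra_props={}` is evaluated once at function-definition time and shared across every call, so passing an immutable tuple of items and converting with `dict(extra_props)` sidesteps the classic pitfall. A quick illustration:

```python
def bad(job_name, extra_props={}):        # one shared dict for every call
    extra_props[job_name] = True
    return extra_props

assert bad("a") == {"a": True}
assert bad("b") == {"a": True, "b": True}   # "a" leaked from the first call!

def good(job_name, extra_props=tuple()):  # immutable default, fresh dict per call
    props = dict(extra_props)
    props[job_name] = True
    return props

assert good("a") == {"a": True}
assert good("b") == {"b": True}
```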

@@ -65,6 +65,12 @@ WORKFLOW_DATA = [
         ["custom", "build", "dynamic"]
     ),
+    MobileJob(
+        DOCKER_IMAGE_NDK,
+        [DOCKER_REQUIREMENT_NDK],
+        ["custom", "build", "static"]
+    ),
     # Use LLVM-DEV toolchain in android-ndk-r19c docker image
     # Most of this CI is already covered by "mobile-custom-build-dynamic" job
     MobileJob(

@@ -1,5 +1,5 @@
-import cimodel.data.simple.util.branch_filters
 import cimodel.lib.miniutils as miniutils
+from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN, NON_PR_BRANCH_LIST
 from cimodel.data.simple.util.versions import CudaVersion
@@ -10,13 +10,19 @@ class WindowsJob:
         vscode_spec,
         cuda_version,
         force_on_cpu=False,
-        master_only_pred=lambda job: job.vscode_spec.year != 2019,
+        multi_gpu=False,
+        master_only=False,
+        nightly_only=False,
+        master_and_nightly=False
     ):
         self.test_index = test_index
         self.vscode_spec = vscode_spec
         self.cuda_version = cuda_version
         self.force_on_cpu = force_on_cpu
-        self.master_only_pred = master_only_pred
+        self.multi_gpu = multi_gpu
+        self.master_only = master_only
+        self.nightly_only = nightly_only
+        self.master_and_nightly = master_and_nightly

     def gen_tree(self):
@@ -25,7 +31,10 @@ class WindowsJob:
             base_phase if self.test_index is None else base_phase + str(self.test_index)
         )

-        key_name = "_".join(["pytorch", "windows", base_phase])
+        key_parts = ["pytorch", "windows", base_phase]
+        if self.multi_gpu:
+            key_parts.append('multigpu')
+        key_name = "_".join(key_parts)

         cpu_forcing_name_parts = ["on", "cpu"] if self.force_on_cpu else []
@@ -61,35 +70,47 @@ class WindowsJob:

         is_running_on_cuda = bool(self.cuda_version) and not self.force_on_cpu

-        props_dict = {
-            "build_environment": build_environment_string,
-            "python_version": miniutils.quote("3.6"),
-            "vc_version": miniutils.quote(self.vscode_spec.dotted_version()),
-            "vc_year": miniutils.quote(str(self.vscode_spec.year)),
-            "vc_product": self.vscode_spec.get_product(),
-            "use_cuda": miniutils.quote(str(int(is_running_on_cuda))),
-            "requires": prerequisite_jobs,
-        }
+        if self.multi_gpu:
+            props_dict = {"requires": prerequisite_jobs}
+        else:
+            props_dict = {
+                "build_environment": build_environment_string,
+                "python_version": miniutils.quote("3.6"),
+                "vc_version": miniutils.quote(self.vscode_spec.dotted_version()),
+                "vc_year": miniutils.quote(str(self.vscode_spec.year)),
+                "vc_product": self.vscode_spec.get_product(),
+                "use_cuda": miniutils.quote(str(int(is_running_on_cuda))),
+                "requires": prerequisite_jobs,
+            }

-        if self.master_only_pred(self):
-            props_dict[
-                "filters"
-            ] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
+        if self.master_only:
+            props_dict[
+                "filters"
+            ] = gen_filter_dict()
+        elif self.nightly_only:
+            props_dict[
+                "filters"
+            ] = gen_filter_dict(branches_list=["nightly"], tags_list=RC_PATTERN)
+        elif self.master_and_nightly:
+            props_dict[
+                "filters"
+            ] = gen_filter_dict(branches_list=NON_PR_BRANCH_LIST + ["nightly"], tags_list=RC_PATTERN)

         name_parts = base_name_parts + cpu_forcing_name_parts + [numbered_phase]

-        if base_phase == "test":
-            test_name = "-".join(["pytorch", "windows", numbered_phase])
-            props_dict["test_name"] = test_name
+        if not self.multi_gpu:
+            if base_phase == "test":
+                test_name = "-".join(["pytorch", "windows", numbered_phase])
+                props_dict["test_name"] = test_name

-            if is_running_on_cuda:
-                props_dict["executor"] = "windows-with-nvidia-gpu"
+            if is_running_on_cuda:
+                props_dict["executor"] = "windows-with-nvidia-gpu"

-        props_dict["cuda_version"] = (
-            miniutils.quote(str(self.cuda_version))
-            if self.cuda_version
-            else "cpu"
-        )
+            props_dict["cuda_version"] = (
+                miniutils.quote(str(self.cuda_version))
+                if self.cuda_version
+                else "cpu"
+            )

         props_dict["name"] = "_".join(name_parts)
@@ -108,7 +129,7 @@ class VcSpec:
         return [self.prefixed_year()] + self.version_elements

     def get_product(self):
-        return "Community" if self.year == 2019 else "BuildTools"
+        return "BuildTools"

     def dotted_version(self):
         return ".".join(self.version_elements)
@@ -119,28 +140,23 @@ class VcSpec:
     def render(self):
         return "_".join(self.get_elements())

-def FalsePred(_):
-    return False
-
-def TruePred(_):
-    return True
-
 _VC2019 = VcSpec(2019)

 WORKFLOW_DATA = [
     # VS2019 CUDA-10.1
-    WindowsJob(None, _VC2019, CudaVersion(10, 1)),
-    WindowsJob(1, _VC2019, CudaVersion(10, 1)),
-    WindowsJob(2, _VC2019, CudaVersion(10, 1)),
+    WindowsJob(None, _VC2019, CudaVersion(10, 1), master_only=True),
+    WindowsJob(1, _VC2019, CudaVersion(10, 1), master_only=True),
+    WindowsJob(2, _VC2019, CudaVersion(10, 1), master_only=True),
     # VS2019 CUDA-11.1
     WindowsJob(None, _VC2019, CudaVersion(11, 1)),
-    WindowsJob(1, _VC2019, CudaVersion(11, 1), master_only_pred=TruePred),
-    WindowsJob(2, _VC2019, CudaVersion(11, 1), master_only_pred=TruePred),
+    WindowsJob(1, _VC2019, CudaVersion(11, 1), master_only=True),
+    WindowsJob(2, _VC2019, CudaVersion(11, 1), master_only=True),
+    WindowsJob('_azure_multi_gpu', _VC2019, CudaVersion(11, 1), multi_gpu=True, nightly_only=True),
     # VS2019 CPU-only
     WindowsJob(None, _VC2019, None),
-    WindowsJob(1, _VC2019, None, master_only_pred=TruePred),
-    WindowsJob(2, _VC2019, None, master_only_pred=TruePred),
-    WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
+    WindowsJob(1, _VC2019, None),
+    WindowsJob(2, _VC2019, None),
+    WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only=True),
 ]
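
The three boolean knobs replace the old predicate with explicit branch/tag filters. A sketch of the filter shapes they produce, assuming `gen_filter_dict` builds the usual CircleCI `branches`/`tags` mapping (the `RC_PATTERN` value here is illustrative):

```python
def gen_filter_dict(branches_list=("master",), tags_list=None):
    # Hypothetical mirror of cimodel.data.simple.util.branch_filters.gen_filter_dict
    filters = {"branches": {"only": list(branches_list)}}
    if tags_list is not None:
        filters["tags"] = {"only": tags_list}
    return filters

RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"  # illustrative

# master_only jobs:
assert gen_filter_dict() == {"branches": {"only": ["master"]}}
# nightly_only jobs (e.g. the multi-GPU Azure job):
assert gen_filter_dict(["nightly"], RC_PATTERN) == {
    "branches": {"only": ["nightly"]},
    "tags": {"only": RC_PATTERN},
}
```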

(diff suppressed: file too large)

@@ -12,8 +12,20 @@ each image as the `BUILD_ENVIRONMENT` environment variable.

 See `build.sh` for valid build environments (it's the giant switch).

+Docker builds are now defined with `.circleci/cimodel/data/simple/docker_definitions.py`
+
 ## Contents

 * `build.sh` -- dispatch script to launch all builds
 * `common` -- scripts used to execute individual Docker build stages
 * `ubuntu-cuda` -- Dockerfile for Ubuntu image with CUDA support for nvidia-docker
+
+## Usage
+
+```bash
+# Build a specific image
+./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
+
+# Set flags (see build.sh) and build image
+sudo bash -c 'BREAKPAD=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
+```

@@ -20,10 +20,8 @@ buildscript {
     }

     dependencies {
-        classpath 'com.android.tools.build:gradle:3.3.2'
-        classpath "com.jfrog.bintray.gradle:gradle-bintray-plugin:1.8.0"
-        classpath "com.github.dcendents:android-maven-gradle-plugin:2.1"
-        classpath "org.jfrog.buildinfo:build-info-extractor-gradle:4.9.8"
+        classpath 'com.android.tools.build:gradle:4.1.2'
+        classpath 'com.vanniktech:gradle-maven-publish-plugin:0.14.2'
     }
 }

@ -88,6 +88,7 @@ case "$image" in
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes KATEX=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-py3.6-gcc7.2) pytorch-linux-xenial-py3.6-gcc7.2)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
@ -100,24 +101,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
;; BREAKPAD=yes
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4)
CUDA_VERSION=9.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=5
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7)
CUDA_VERSION=9.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;; ;;
pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7) pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7)
CUDA_VERSION=10.0 CUDA_VERSION=10.0
@ -127,6 +111,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7) pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7)
CUDA_VERSION=10.1 CUDA_VERSION=10.1
@ -137,6 +122,7 @@ case "$image" in
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes KATEX=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7) pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)
CUDA_VERSION=10.2 CUDA_VERSION=10.2
@ -147,16 +133,7 @@ case "$image" in
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes KATEX=yes
;; BREAKPAD=yes
pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7)
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;; ;;
pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7) pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7)
CUDA_VERSION=11.1 CUDA_VERSION=11.1
@ -167,6 +144,18 @@ case "$image" in
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes KATEX=yes
BREAKPAD=yes
;;
pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)
CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-py3-clang5-asan) pytorch-linux-xenial-py3-clang5-asan)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
@ -174,6 +163,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-py3-clang7-onnx) pytorch-linux-xenial-py3-clang7-onnx)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
@ -181,6 +171,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-xenial-py3-clang5-android-ndk-r19c) pytorch-linux-xenial-py3-clang5-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
@ -189,7 +180,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
ANDROID=yes ANDROID=yes
ANDROID_NDK_VERSION=r19c ANDROID_NDK_VERSION=r19c
GRADLE_VERSION=4.10.3 GRADLE_VERSION=6.8.3
CMAKE_VERSION=3.7.0 CMAKE_VERSION=3.7.0
NINJA_VERSION=1.9.0 NINJA_VERSION=1.9.0
;; ;;
@ -199,6 +190,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-bionic-py3.6-clang9) pytorch-linux-bionic-py3.6-clang9)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
@ -206,7 +198,8 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
VULKAN_SDK_VERSION=1.2.148.0 BREAKPAD=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes SWIFTSHADER=yes
;; ;;
pytorch-linux-bionic-py3.8-gcc9) pytorch-linux-bionic-py3.8-gcc9)
@ -215,6 +208,8 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
BREAKPAD=yes
;; ;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9) pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9)
CUDA_VERSION=10.2 CUDA_VERSION=10.2
@ -224,6 +219,7 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9) pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9)
CUDA_VERSION=10.2 CUDA_VERSION=10.2
@ -233,6 +229,17 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
;; ;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9) pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9)
CUDA_VERSION=11.0 CUDA_VERSION=11.0
@ -242,57 +249,42 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes BREAKPAD=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9)
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.1-cudnn8-py3.6-gcc9)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.1-cudnn8-py3.8-gcc9)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-rocm3.9-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.9 ROCM_VERSION=3.9
;; ;;
pytorch-linux-bionic-rocm3.10-py3.6) pytorch-linux-bionic-rocm4.0.1-py3.6)
ANACONDA_PYTHON_VERSION=3.6 ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
ROCM_VERSION=3.10 BREAKPAD=yes
ROCM_VERSION=4.0.1
;;
pytorch-linux-bionic-rocm4.1-py3.6)
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=4.1
;;
pytorch-linux-bionic-rocm4.2-py3.6)
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
BREAKPAD=yes
ROCM_VERSION=4.2
;; ;;
*) *)
# Catch-all for builds that are not hardcoded. # Catch-all for builds that are not hardcoded.
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
BREAKPAD=yes
echo "image '$image' did not match an existing build configuration" echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION extract_version_from_image_name py ANACONDA_PYTHON_VERSION
@ -328,7 +320,7 @@ if [ -n "${JENKINS:-}" ]; then
JENKINS_GID=$(id -g jenkins) JENKINS_GID=$(id -g jenkins)
fi fi
tmp_tag="tmp-$(cat /dev/urandom | tr -dc 'a-z' | fold -w 32 | head -n 1)" tmp_tag="tmp-$(cat /dev/urandom | tr -dc 'a-z' | head -c 32)"
# Build image # Build image
# TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm # TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm
@ -356,6 +348,7 @@ docker build \
--build-arg "GCC_VERSION=${GCC_VERSION}" \ --build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \ --build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \ --build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "BREAKPAD=${BREAKPAD}" \
--build-arg "ANDROID=${ANDROID}" \ --build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \ --build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \ --build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \

View File
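
The catch-all arm leans on `extract_version_from_image_name` to parse versions straight out of the image tag. A hypothetical Python mirror of that shell helper, just to show the parsing rule:

```python
import re

def extract_version_from_image_name(image: str, pkg: str):
    # e.g. ("pytorch-linux-bionic-rocm4.2-py3.6", "rocm") -> "4.2"
    match = re.search(rf"{pkg}([0-9.]+)", image)
    return match.group(1) if match else None

assert extract_version_from_image_name("pytorch-linux-bionic-rocm4.2-py3.6", "rocm") == "4.2"
assert extract_version_from_image_name("pytorch-linux-bionic-rocm4.2-py3.6", "py") == "3.6"
```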

@@ -46,4 +46,7 @@ trap "docker logout ${registry}" EXIT
 docker push "${image}:${tag}"

 docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
-aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read
+if [ -z "${DOCKER_SKIP_S3_UPLOAD:-}" ]; then
+  aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read
+fi

@@ -64,6 +64,7 @@ ENV PATH /opt/rocm/hcc/bin:$PATH
 ENV PATH /opt/rocm/hip/bin:$PATH
 ENV PATH /opt/rocm/opencl/bin:$PATH
 ENV PATH /opt/rocm/llvm/bin:$PATH
+ENV MAGMA_HOME /opt/rocm/magma

 ENV LANG en_US.utf8
 ENV LC_ALL en_US.utf8

@@ -99,7 +99,7 @@ echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES

 chown -R jenkins /var/lib/jenkins/gradledeps
 chgrp -R jenkins /var/lib/jenkins/gradledeps

-sudo -H -u jenkins $GRADLE_HOME/bin/gradle -p /var/lib/jenkins/gradledeps -g /var/lib/jenkins/.gradle --refresh-dependencies --debug --stacktrace assemble
+sudo -H -u jenkins $GRADLE_HOME/bin/gradle -Pandroid.useAndroidX=true -p /var/lib/jenkins/gradledeps -g /var/lib/jenkins/.gradle --refresh-dependencies --debug --stacktrace assemble

 chown -R jenkins /var/lib/jenkins/.gradle
 chgrp -R jenkins /var/lib/jenkins/.gradle

@@ -77,6 +77,7 @@ install_centos() {
       glog-devel \
       hiredis-devel \
       libstdc++-devel \
+      libsndfile-devel \
       make \
       opencv-devel \
       sudo \

@@ -0,0 +1,25 @@
+#!/bin/bash
+
+set -ex
+
+git clone https://github.com/driazati/breakpad.git
+pushd breakpad
+
+# breakpad has no actual releases, so this is pinned to the top commit from
+# main when this was forked (including the one patch commit). This uses a fork
+# of the breakpad mainline that automatically daisy-chains out to any previously
+# installed signal handlers (instead of overwriting them).
+git checkout 5485e473ed46d065e05489e50dfc59d90dfd7e22
+
+git clone https://chromium.googlesource.com/linux-syscall-support src/third_party/lss
+pushd src/third_party/lss
+# same as with breakpad, there are no real releases for this repo so use a
+# commit as the pin
+git checkout e1e7b0ad8ee99a875b272c8e33e308472e897660
+popd
+
+./configure
+make
+make install
+popd
+rm -rf breakpad

@@ -71,18 +71,22 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
   # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
   # DO NOT install cmake here as it would install a version newer than 3.5, but
   # we want to pin to version 3.5.
-  if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
+  SCIPY_VERSION=1.1.0
+  if [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
     # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
-    conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
+    conda_install numpy=1.19.2 astunparse pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0 -c conda-forge
+    SCIPY_VERSION=1.6.0
+  elif [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
+    # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
+    conda_install numpy=1.18.5 astunparse pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
   elif [ "$ANACONDA_PYTHON_VERSION" = "3.7" ]; then
     # DO NOT install dataclasses if installing python-3.7, since its part of python-3.7 core packages
-    conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six typing_extensions
+    conda_install numpy=1.18.5 astunparse pyyaml mkl mkl-include setuptools cffi future six typing_extensions
   else
-    conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six dataclasses typing_extensions
+    conda_install numpy=1.18.5 astunparse pyyaml mkl mkl-include setuptools cffi future six dataclasses typing_extensions
   fi
-  if [[ "$CUDA_VERSION" == 9.2* ]]; then
-    conda_install magma-cuda92 -c pytorch
-  elif [[ "$CUDA_VERSION" == 10.0* ]]; then
+
+  if [[ "$CUDA_VERSION" == 10.0* ]]; then
     conda_install magma-cuda100 -c pytorch
   elif [[ "$CUDA_VERSION" == 10.1* ]]; then
     conda_install magma-cuda101 -c pytorch
@@ -92,8 +96,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
     conda_install magma-cuda110 -c pytorch
   elif [[ "$CUDA_VERSION" == 11.1* ]]; then
     conda_install magma-cuda111 -c pytorch
-  elif [[ "$CUDA_VERSION" == 11.2* ]]; then
-    conda_install magma-cuda112 -c pytorch
+  elif [[ "$CUDA_VERSION" == 11.3* ]]; then
+    conda_install magma-cuda113 -c pytorch
   fi

   # TODO: This isn't working atm
@@ -103,20 +107,26 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
   # TODO: Why is scipy pinned
   # Pin MyPy version because new errors are likely to appear with each release
   # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
+  # Pin coverage so we can use COVERAGE_RCFILE
   as_jenkins pip install --progress-bar off pytest \
-    scipy==1.1.0 \
+    scipy==$SCIPY_VERSION \
     scikit-image \
-    librosa>=0.6.2 \
     psutil \
-    numba \
-    llvmlite \
     unittest-xml-reporting \
     boto3==1.16.34 \
-    coverage \
+    coverage==5.5 \
     hypothesis==4.53.2 \
-    mypy==0.770 \
+    mypy==0.812 \
     tb-nightly

+  # Install numba only on python-3.8 or below
+  # For numba issue see https://github.com/pytorch/pytorch/issues/51511
+  if [[ $(python -c "import sys; print(int(sys.version_info < (3, 9)))") == "1" ]]; then
+    as_jenkins pip install --progress-bar off numba librosa>=0.6.2
+  else
+    as_jenkins pip install --progress-bar off numba==0.49.0 librosa>=0.6.2
+  fi
+
   # Update scikit-learn to a python-3.8 compatible version
   if [[ $(python -c "import sys; print(int(sys.version_info >= (3, 8)))") == "1" ]]; then
     as_jenkins pip install --progress-bar off -U scikit-learn
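
The interpreter-version gates in this script all follow the same one-liner pattern; evaluated from Python it reads:

```python
import sys

# Mirrors: python -c "import sys; print(int(sys.version_info < (3, 9)))"
print(int(sys.version_info < (3, 9)))  # prints 1 on Python 3.8 and older, 0 on 3.9+
```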

@@ -0,0 +1,14 @@
+#!/bin/bash
+
+set -ex
+
+OPENSSL=openssl-1.1.1k
+
+wget -q -O "${OPENSSL}.tar.gz" "https://www.openssl.org/source/${OPENSSL}.tar.gz"
+tar xf "${OPENSSL}.tar.gz"
+cd "${OPENSSL}"
+./config --prefix=/opt/openssl -d '-Wl,--enable-new-dtags,-rpath,$(LIBRPATH)'
+# NOTE: openssl errors out when built with the -j option
+make install_sw
+cd ..
+rm -rf "${OPENSSL}"

@@ -4,20 +4,27 @@ set -ex

 install_magma() {
     # "install" hipMAGMA into /opt/rocm/magma by copying after build
-    git clone https://bitbucket.org/icl/magma.git -b hipMAGMA
+    git clone https://bitbucket.org/icl/magma.git
     pushd magma
-    cp make.inc-examples/make.inc.hip-mkl-gcc make.inc
+    git checkout 878b1ce02e9cfe4a829be22c8f911e9c0b6bd88f
+    cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
     echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
     echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
-    echo 'DEVCCFLAGS += --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908' >> make.inc
+    echo 'DEVCCFLAGS += --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --gpu-max-threads-per-block=256' >> make.inc
+    # hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
+    sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
     export PATH="${PATH}:/opt/rocm/bin"
     make -f make.gen.hipMAGMA -j $(nproc)
-    make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
+    LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
     make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
     popd
     mv magma /opt/rocm
 }

+ver() {
+    printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
+}
+
 install_ubuntu() {
     apt-get update
     if [[ $UBUNTU_VERSION == 18.04 ]]; then
@@ -31,9 +38,14 @@ install_ubuntu() {
     apt-get install -y libc++1
     apt-get install -y libc++abi1

+    ROCM_REPO="ubuntu"
+    if [[ $(ver $ROCM_VERSION) -lt $(ver 4.2) ]]; then
+        ROCM_REPO="xenial"
+    fi
+
     # Add rocm repository
     wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
-    echo "deb [arch=amd64] http://repo.radeon.com/rocm/apt/${ROCM_VERSION} xenial main" > /etc/apt/sources.list.d/rocm.list
+    echo "deb [arch=amd64] http://repo.radeon.com/rocm/apt/${ROCM_VERSION} ${ROCM_REPO} main" > /etc/apt/sources.list.d/rocm.list
     apt-get update --allow-insecure-repositories
     DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
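
The `ver` helper makes shell version comparisons safe by padding each dotted component into a fixed-width integer, so `-lt` orders 4.0.1 before 4.2 correctly. The same idea in Python, for clarity (an illustrative mirror, not code from the repo):

```python
def ver(v: str) -> int:
    # Pad up to four dotted components to three digits each, similar to
    # printf "%3d%03d%03d%03d" in the shell helper above.
    parts = (v.split(".") + ["0", "0", "0"])[:4]
    return int("".join(f"{int(p):03d}" for p in parts))

assert ver("4.0.1") < ver("4.2")   # 4000001000 < 4002000000
assert ver("4.10") > ver("4.2")    # naive string comparison would get this wrong
```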

@@ -8,16 +8,17 @@ retry () {
   $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
 }

-_https_amazon_aws=https://ossci-android.s3.amazonaws.com
 _vulkansdk_dir=/var/lib/jenkins/vulkansdk
-mkdir -p $_vulkansdk_dir
 _tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz

-curl --silent --show-error --location --fail --retry 3 \
-  --output "$_tmp_vulkansdk_targz" "$_https_amazon_aws/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"
-tar -C "$_vulkansdk_dir" -xzf "$_tmp_vulkansdk_targz" --strip-components 1
+curl \
+  --silent \
+  --show-error \
+  --location \
+  --fail \
+  --retry 3 \
+  --output "${_tmp_vulkansdk_targz}" "https://ossci-android.s3.amazonaws.com/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"

-export VULKAN_SDK="$_vulkansdk_dir/"
+mkdir -p "${_vulkansdk_dir}"
+tar -C "${_vulkansdk_dir}" -xzf "${_tmp_vulkansdk_targz}" --strip-components 1

-rm "$_tmp_vulkansdk_targz"
+rm -rf "${_tmp_vulkansdk_targz}"

@@ -93,5 +93,9 @@ ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
 # Install LLVM dev version (Defined in the pytorch/builder github repository)
 COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

+ADD ./common/install_openssl.sh install_openssl.sh
+ENV OPENSSL_ROOT_DIR /opt/openssl
+RUN bash ./install_openssl.sh
+
 USER jenkins
 CMD ["bash"]

@@ -27,6 +27,11 @@ ARG ANACONDA_PYTHON_VERSION
 ADD ./common/install_conda.sh install_conda.sh
 RUN bash ./install_conda.sh && rm install_conda.sh

+# Install gcc
+ARG GCC_VERSION
+ADD ./common/install_gcc.sh install_gcc.sh
+RUN bash ./install_gcc.sh && rm install_gcc.sh
+
 # (optional) Install protobuf for ONNX
 ARG PROTOBUF
 ADD ./common/install_protobuf.sh install_protobuf.sh

@@ -82,6 +82,13 @@ RUN rm AndroidManifest.xml
 RUN rm build.gradle
 ENV INSTALLED_ANDROID ${ANDROID}

+# (optional) Install breakpad
+ARG BREAKPAD
+ADD ./common/install_breakpad.sh install_breakpad.sh
+RUN if [ -n "${BREAKPAD}" ]; then bash ./install_breakpad.sh; fi
+RUN rm install_breakpad.sh
+ENV INSTALLED_BREAKPAD ${BREAKPAD}
+
 # (optional) Install Vulkan SDK
 ARG VULKAN_SDK_VERSION
 ADD ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
@@ -123,5 +130,9 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
 # Install LLVM dev version (Defined in the pytorch/builder github repository)
 COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

+ADD ./common/install_openssl.sh install_openssl.sh
+RUN bash ./install_openssl.sh
+ENV OPENSSL_ROOT_DIR /opt/openssl
+
 USER jenkins
 CMD ["bash"]

@@ -1,10 +1,10 @@
-FROM ubuntu:16.04
+FROM ubuntu:18.04

-RUN apt-get update && apt-get install -y python-pip git && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log
+RUN apt-get update && apt-get install -y python3-pip git && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log

 ADD requirements.txt /requirements.txt

-RUN pip install -r /requirements.txt
+RUN pip3 install -r /requirements.txt

 ADD gc.py /usr/bin/gc.py

@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3

 from collections import namedtuple

@@ -1,11 +1,11 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3

 import argparse
+import datetime

 import boto3
-import datetime
 import pytz
-import sys
 import re
+import sys


 def save_to_s3(project, data):
@@ -148,9 +148,12 @@ def chunks(chunkable, n):
     """ Yield successive n-sized chunks from l.
     """
     for i in range(0, len(chunkable), n):
-        yield chunkable[i : i + n]
+        yield chunkable[i: i + n]
+

 SHA_PATTERN = re.compile(r'^[0-9a-f]{40}$')

+
 def looks_like_git_sha(tag):
     """Returns a boolean to check if a tag looks like a git sha
@@ -159,6 +162,7 @@ def looks_like_git_sha(tag):
     """
     return re.match(SHA_PATTERN, tag) is not None

+
 stable_window_tags = []
 for repo in repos(client):
     repositoryName = repo["repositoryName"]
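
For reference, the `chunks` generator touched above batches the garbage collector's bulk deletions; a worked expansion:

```python
def chunks(chunkable, n):
    """Yield successive n-sized chunks from the input sequence."""
    for i in range(0, len(chunkable), n):
        yield chunkable[i: i + n]

assert list(chunks(["a", "b", "c", "d", "e"], 2)) == [["a", "b"], ["c", "d"], ["e"]]
```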

@@ -80,6 +80,52 @@ class Header(object):
         for line in filter(None, lines):
             output_filehandle.write(line + "\n")

+
+def filter_master_only_jobs(items):
+    def _for_all_items(items, functor) -> None:
+        if isinstance(items, list):
+            for item in items:
+                _for_all_items(item, functor)
+        if isinstance(items, dict) and len(items) == 1:
+            item_type, item = next(iter(items.items()))
+            functor(item_type, item)
+
+    def _is_master_item(item):
+        filters = item.get('filters', None)
+        branches = filters.get('branches', None) if filters is not None else None
+        branches_only = branches.get('only', None) if branches is not None else None
+        return 'master' in branches_only if branches_only is not None else False
+
+    master_deps = set()
+
+    def _save_requires_if_master(item_type, item):
+        requires = item.get('requires', None)
+        item_name = item.get("name", None)
+        if not isinstance(requires, list):
+            return
+        if _is_master_item(item) or item_name in master_deps:
+            master_deps.update([n.strip('"') for n in requires])
+
+    def _do_filtering(items):
+        if isinstance(items, list):
+            rc = [_do_filtering(item) for item in items]
+            return [item for item in rc if len(item if item is not None else []) > 0]
+        assert isinstance(items, dict) and len(items) == 1
+        item_type, item = next(iter(items.items()))
+        item_name = item.get("name", None)
+        item_name = item_name.strip('"') if item_name is not None else None
+        if not _is_master_item(item) and item_name not in master_deps:
+            return None
+        if 'filters' in item:
+            item = item.copy()
+            item.pop('filters')
+        return {item_type: item}
+
+    # Scan the dependencies twice to pick up nested required jobs,
+    # i.e. jobs depending on jobs that a master-only job depends on
+    _for_all_items(items, _save_requires_if_master)
+    _for_all_items(items, _save_requires_if_master)
+    return _do_filtering(items)
+

 def gen_build_workflows_tree():
     build_workflows_functions = [
@@ -105,7 +151,8 @@ def gen_build_workflows_tree():
         binary_build_definitions.get_nightly_tests,
         binary_build_definitions.get_nightly_uploads,
     ]
-
+    build_jobs = [f() for f in build_workflows_functions]
+    master_build_jobs = filter_master_only_jobs(build_jobs)
     return {
         "workflows": {
             "binary_builds": {
@@ -114,7 +161,11 @@ def gen_build_workflows_tree():
             },
             "build": {
                 "when": r"<< pipeline.parameters.run_build >>",
-                "jobs": [f() for f in build_workflows_functions]
+                "jobs": build_jobs,
+            },
+            "master_build": {
+                "when": r"<< pipeline.parameters.run_master_build >>",
+                "jobs": master_build_jobs,
             },
         }
     }
@@ -139,6 +190,7 @@ YAML_SOURCES = [
     File("job-specs/docker_jobs.yml"),
     Header("Workflows"),
     Treegen(gen_build_workflows_tree, 0),
+    File("workflows/workflows-scheduled-ci.yml"),
     File("workflows/workflows-ecr-gc.yml"),
     File("workflows/workflows-promote.yml"),
 ]
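
To see what `filter_master_only_jobs` keeps, consider a toy jobs list in the same shape as the generated workflow entries (the names here are illustrative):

```python
jobs = [
    {"binary_linux_build": {"name": '"build_a"'}},
    {"binary_linux_test": {
        "name": '"test_a"',
        "requires": ['"build_a"'],
        "filters": {"branches": {"only": ["master"]}},
    }},
    {"binary_linux_build": {"name": '"build_b"'}},  # required by nothing master-only
]

filtered = filter_master_only_jobs(jobs)
# test_a survives because it is master-only (with its "filters" key stripped,
# since the master_build workflow already runs only on master); build_a survives
# because a master-only job requires it; build_b is dropped.
```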

.circleci/regenerate.ps1 (new file)

@@ -0,0 +1,5 @@
+cd $PSScriptRoot;
+$NewFile = New-TemporaryFile;
+python generate_config_yml.py > $NewFile.name
+(Get-Content $NewFile.name -Raw).TrimEnd().Replace("`r`n","`n") | Set-Content config.yml -Force
+Remove-Item $NewFile.name

@@ -1,8 +1,17 @@
-#!/bin/bash -xe
+#!/bin/bash -e

 # Allows this script to be invoked from any directory:
-cd $(dirname "$0")
+cd "$(dirname "$0")"

+UNCOMMIT_CHANGE=$(git status -s | grep " config.yml" | wc -l | xargs)
+if [[ $UNCOMMIT_CHANGE != 0 ]]; then
+    OLD_FILE=$(mktemp)
+    cp config.yml "$OLD_FILE"
+    echo "Uncommitted change detected in .circleci/config.yml"
+    echo "It has been backed up to $OLD_FILE"
+fi
+
 NEW_FILE=$(mktemp)
-./generate_config_yml.py > $NEW_FILE
-cp $NEW_FILE config.yml
+./generate_config_yml.py > "$NEW_FILE"
+cp "$NEW_FILE" config.yml
+echo "New config generated in .circleci/config.yml"

@@ -62,6 +62,7 @@ popd

 # Clone the Builder master repo
 retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
+git checkout release/1.9
 pushd "$BUILDER_ROOT"
 echo "Using builder from "
 git --no-pager log --max-count 1

@@ -15,7 +15,7 @@ export PATH="~/anaconda/bin:${PATH}"
 source ~/anaconda/bin/activate

 # Install dependencies
-conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi requests --yes
+conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi requests typing_extensions --yes
 conda install -c conda-forge valgrind --yes

 export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

@@ -24,6 +24,6 @@ rm cert.txt
 if ! [ -x "$(command -v xcodebuild)" ]; then
     echo 'Error: xcodebuild is not installed.'
     exit 1
 fi
 PROFILE=PyTorch_CI_2021
 ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}

@@ -7,6 +7,10 @@ source /env
 # Defaults here so they can be changed in one place
 export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}

+if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
+  export BUILD_SPLIT_CUDA="ON"
+fi
+
 # Parse the parameters
 if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
   build_script='conda/build_pytorch.sh'

@@ -38,6 +38,10 @@ if [[ "$DESIRED_CUDA" == "cu112" ]]; then
   EXTRA_CONDA_FLAGS="-c=conda-forge"
 fi

+# Move debug wheels out of the package dir so they don't get installed
+mkdir -p /tmp/debug_final_pkgs
+mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"
+
 # Install the package
 # These network calls should not have 'retry's because they are installing
 # locally and aren't actually network calls

@@ -68,12 +68,24 @@ if [[ -z "$DOCKER_IMAGE" ]]; then
   fi
 fi

+USE_GOLD_LINKER="OFF"
+# GOLD linker can not be used if CUPTI is statically linked into PyTorch, see https://github.com/pytorch/pytorch/issues/57744
+if [[ ${DESIRED_CUDA} == "cpu" ]]; then
+  USE_GOLD_LINKER="ON"
+fi
+
+USE_WHOLE_CUDNN="OFF"
+# Link whole cuDNN for CUDA-11.1 to include fp16 fast kernels
+if [[ "$(uname)" == "Linux" && "${DESIRED_CUDA}" == "cu111" ]]; then
+  USE_WHOLE_CUDNN="ON"
+fi
+
 # Default to nightly, since that's where this normally uploads to
 PIP_UPLOAD_FOLDER='nightly/'
 # We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
 export DATE="$(date -u +%Y%m%d)"
 #TODO: We should be pulling semver version from the base version.txt
-BASE_BUILD_VERSION="1.8.0.dev$DATE"
+BASE_BUILD_VERSION="1.9.0.dev$DATE"
 # Change BASE_BUILD_VERSION to git tag when on a git tag
 # Use 'git -C' to make doubly sure we're in the correct directory for checking
 # the git tag
@@ -85,7 +97,7 @@ if tagged_version >/dev/null; then
   # Turns tag v1.6.0-rc1 -> v1.6.0
   BASE_BUILD_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
 fi
-if [[ "$(uname)" == 'Darwin' ]] || [[ "$DESIRED_CUDA" == "cu102" ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
+if [[ "$(uname)" == 'Darwin' ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
   export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}"
 else
   export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"
@@ -136,7 +148,7 @@ if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
 fi

 export DATE="$DATE"
-export NIGHTLIES_DATE_PREAMBLE=1.8.0.dev
+export NIGHTLIES_DATE_PREAMBLE=1.9.0.dev
 export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
 export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
 export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
@@ -168,6 +180,10 @@ export CIRCLE_SHA1="$CIRCLE_SHA1"
 export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
 export CIRCLE_BRANCH="$CIRCLE_BRANCH"
 export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
+export USE_GOLD_LINKER="${USE_GOLD_LINKER}"
+export USE_GLOO_WITH_OPENSSL="ON"
+export USE_WHOLE_CUDNN="${USE_WHOLE_CUDNN}"

 # =================== The above code will be executed inside Docker container ===================
 EOL
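
Putting the version pieces together, the nightly stamping in this script works out as follows (a sketch; the date is whatever UTC day the build runs, and the `+cu111` suffix is one possible `DESIRED_CUDA` value):

```python
from datetime import datetime, timezone

date = datetime.now(timezone.utc).strftime("%Y%m%d")  # date -u +%Y%m%d
base_build_version = f"1.9.0.dev{date}"               # e.g. "1.9.0.dev20210611"

# macOS and conda packages use the base version as-is; other Linux/Windows
# wheels append the CUDA/ROCm qualifier:
pytorch_build_version = f"{base_build_version}+cu111"
```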

@@ -15,6 +15,10 @@ else
   export VC_YEAR=2019
 fi

+if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
+  export BUILD_SPLIT_CUDA="ON"
+fi
+
 set +x
 export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
 export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
@@ -27,6 +31,10 @@ if [[ "$CIRCLECI" == 'true' && -d "C:\\ProgramData\\Microsoft\\VisualStudio\\Pac
   mv _Instances "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
 fi

+if [[ "$CIRCLECI" == 'true' && -d "C:\\Microsoft" ]]; then
+  rm -rf "C:\\Microsoft\\Android*"
+fi
+
 echo "Free space on filesystem before build:"
 df -h

@@ -10,7 +10,7 @@ export ANDROID_HOME=/opt/android/sdk

 # Must be in sync with GRADLE_VERSION in docker image for android
 # https://github.com/pietern/pytorch-dockerfiles/blob/master/build.sh#L155
-export GRADLE_VERSION=4.10.3
+export GRADLE_VERSION=6.8.3
 export GRADLE_HOME=/opt/gradle/gradle-$GRADLE_VERSION
 export GRADLE_PATH=$GRADLE_HOME/bin/gradle

@@ -5,7 +5,7 @@ set -eu -o pipefail
 export ANDROID_NDK_HOME=/opt/ndk
 export ANDROID_HOME=/opt/android/sdk

-export GRADLE_VERSION=4.10.3
+export GRADLE_VERSION=6.8.3
 export GRADLE_HOME=/opt/gradle/gradle-$GRADLE_VERSION
 export GRADLE_PATH=$GRADLE_HOME/bin/gradle
@@ -35,7 +35,9 @@ else
 echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES

 echo "SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" >> $GRADLE_PROPERTIES
+echo "mavenCentralRepositoryUsername=${SONATYPE_NEXUS_USERNAME}" >> $GRADLE_PROPERTIES
 echo "SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" >> $GRADLE_PROPERTIES
+echo "mavenCentralRepositoryPassword=${SONATYPE_NEXUS_PASSWORD}" >> $GRADLE_PROPERTIES
 echo "signing.keyId=${ANDROID_SIGN_KEY}" >> $GRADLE_PROPERTIES
 echo "signing.password=${ANDROID_SIGN_PASS}" >> $GRADLE_PROPERTIES

@@ -111,14 +111,6 @@ popd
git rm -rf "$install_path" || true
mv "$pt_checkout/docs/build/html" "$install_path"
-# Add the version handler by search and replace.
-# XXX: Consider moving this to the docs Makefile or site build
-if [ "$is_master_doc" = true ]; then
-  find "$install_path" -name "*.html" -print0 | xargs -0 perl -pi -w -e "s@master\s+\((\d\.\d\.[A-Fa-f0-9]+\+[A-Fa-f0-9]+)\s+\)@<a href='http://pytorch.org/docs/versions.html'>\1 \&#x25BC</a>@g"
-else
-  find "$install_path" -name "*.html" -print0 | xargs -0 perl -pi -w -e "s@master\s+\((\d\.\d\.[A-Fa-f0-9]+\+[A-Fa-f0-9]+)\s+\)@<a href='http://pytorch.org/docs/versions.html'>$version \&#x25BC</a>@g"
-fi
# Prevent Google from indexing $install_path/_modules. This folder contains
# generated source files.
# NB: the following only works on gnu sed. The sed shipped with mac os is different.


@@ -24,7 +24,9 @@ retry sudo apt-get -y install \
echo "== DOCKER VERSION =="
docker version
-retry sudo pip -q install awscli==1.16.35
+if ! command -v aws >/dev/null; then
+  retry sudo pip3 -q install awscli==1.19.64
+fi
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
  DRIVER_FN="NVIDIA-Linux-x86_64-460.39.run"
@@ -48,43 +50,50 @@ else
fi
add_to_env_file() {
-  local content
-  content=$1
-  # BASH_ENV should be set by CircleCI
-  echo "${content}" >> "${BASH_ENV:-/tmp/env}"
+  local name=$1
+  local value=$2
+  case "$value" in
+    *\ *)
+      # BASH_ENV should be set by CircleCI
+      echo "${name}='${value}'" >> "${BASH_ENV:-/tmp/env}"
+      ;;
+    *)
+      echo "${name}=${value}" >> "${BASH_ENV:-/tmp/env}"
+      ;;
+  esac
}
-add_to_env_file "IN_CI=1"
-add_to_env_file "COMMIT_SOURCE=${CIRCLE_BRANCH:-}"
-add_to_env_file "BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}"
-add_to_env_file "CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}"
+add_to_env_file IN_CI 1
+add_to_env_file COMMIT_SOURCE "${CIRCLE_BRANCH:-}"
+add_to_env_file BUILD_ENVIRONMENT "${BUILD_ENVIRONMENT}"
+add_to_env_file CIRCLE_PULL_REQUEST "${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
-  add_to_env_file "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2"
+  add_to_env_file SCCACHE_BUCKET ossci-compiler-cache-circleci-v2
  SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
  MEMORY_LIMIT_MAX_JOBS=8  # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
  MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
-  add_to_env_file "MAX_JOBS=${MAX_JOBS}"
+  add_to_env_file MAX_JOBS "${MAX_JOBS}"
  if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
-    add_to_env_file "TORCH_CUDA_ARCH_LIST=5.2"
+    add_to_env_file TORCH_CUDA_ARCH_LIST 5.2
  fi
  if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
    # This IAM user allows write access to S3 bucket for sccache & bazels3cache
    set +x
-    add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
-    add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
-    add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
+    add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
+    add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
+    add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
    set -x
  else
    # This IAM user allows write access to S3 bucket for sccache
    set +x
-    add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
-    add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
-    add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
+    add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
+    add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
+    add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
    set -x
  fi
fi
@@ -93,5 +102,7 @@ fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
-eval "$(aws ecr get-login --region us-east-1 --no-include-email)"
+export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
+export AWS_REGION=us-east-1
+aws ecr get-login-password --region $AWS_REGION|docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x
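To make the new quoting logic concrete, a minimal sketch with hypothetical values (not part of the diff):

# illustration of the case statement in add_to_env_file above
add_to_env_file TORCH_CUDA_ARCH_LIST "5.2 7.5"   # value contains a space, so it gets single-quoted
add_to_env_file MAX_JOBS 8
# ${BASH_ENV:-/tmp/env} would then contain:
#   TORCH_CUDA_ARCH_LIST='5.2 7.5'
#   MAX_JOBS=8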


@@ -0,0 +1,140 @@
# Documentation: https://docs.microsoft.com/en-us/rest/api/azure/devops/build/?view=azure-devops-rest-6.0
import re
import json
import os
import sys
import requests
import time

AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"
PROJECT_ID = "0628bce4-2d33-499e-bac5-530e12db160f"
TARGET_BRANCH = os.environ.get("CIRCLE_BRANCH", "master")
TARGET_COMMIT = os.environ.get("CIRCLE_SHA1", "")

build_base_url = AZURE_PIPELINE_BASE_URL + "_apis/build/builds?api-version=6.0"

s = requests.Session()
s.headers.update({"Authorization": "Basic " + AZURE_DEVOPS_PAT_BASE64})

def submit_build(pipeline_id, project_id, source_branch, source_version):
    print("Submitting build for branch: " + source_branch)
    print("Commit SHA1: ", source_version)
    run_build_raw = s.post(build_base_url, json={
        "definition": {"id": pipeline_id},
        "project": {"id": project_id},
        "sourceBranch": source_branch,
        "sourceVersion": source_version
    })
    try:
        run_build_json = run_build_raw.json()
    except json.decoder.JSONDecodeError as e:
        print(e)
        print("Failed to parse the response. Check if the Azure DevOps PAT is incorrect or expired.")
        sys.exit(-1)
    build_id = run_build_json['id']
    print("Submitted build: " + str(build_id))
    print("Build URL: " + run_build_json['url'])
    return build_id

def get_build(_id):
    get_build_url = AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}?api-version=6.0"
    get_build_raw = s.get(get_build_url)
    return get_build_raw.json()

def get_build_logs(_id):
    get_build_logs_url = AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}/logs?api-version=6.0"
    get_build_logs_raw = s.get(get_build_logs_url)
    return get_build_logs_raw.json()

def get_log_content(url):
    resp = s.get(url)
    return resp.text

def wait_for_build(_id):
    build_detail = get_build(_id)
    build_status = build_detail['status']

    while build_status == 'notStarted':
        print('Waiting for run to start: ' + str(_id))
        sys.stdout.flush()
        try:
            build_detail = get_build(_id)
            build_status = build_detail['status']
        except Exception as e:
            print("Error getting build")
            print(e)
        time.sleep(30)
    print("Build started: ", str(_id))

    handled_logs = set()
    while build_status == 'inProgress':
        try:
            print("Waiting for log: " + str(_id))
            logs = get_build_logs(_id)
        except Exception as e:
            print("Error fetching logs")
            print(e)
            time.sleep(30)
            continue

        for log in logs['value']:
            log_id = log['id']
            if log_id in handled_logs:
                continue
            handled_logs.add(log_id)
            print('Fetching log: \n' + log['url'])
            try:
                log_content = get_log_content(log['url'])
                print(log_content)
            except Exception as e:
                print("Error getting log content")
                print(e)
            sys.stdout.flush()
        build_detail = get_build(_id)
        build_status = build_detail['status']
        time.sleep(30)

    build_result = build_detail['result']
    print("Build status: " + build_status)
    print("Build result: " + build_result)
    return build_status, build_result

if __name__ == '__main__':
    # Convert the branch name for Azure DevOps
    match = re.search(r'pull/(\d+)', TARGET_BRANCH)
    if match is not None:
        pr_num = match.group(1)
        SOURCE_BRANCH = f'refs/pull/{pr_num}/head'
    else:
        SOURCE_BRANCH = f'refs/heads/{TARGET_BRANCH}'

    MAX_RETRY = 2
    retry = MAX_RETRY

    while retry > 0:
        build_id = submit_build(PIPELINE_ID, PROJECT_ID, SOURCE_BRANCH, TARGET_COMMIT)
        build_status, build_result = wait_for_build(build_id)

        if build_result != 'succeeded':
            retry = retry - 1
            if retry > 0:
                print("Retrying... remaining attempt: " + str(retry))
                # Wait a bit before retrying
                time.sleep((MAX_RETRY - retry) * 120)
                continue
            else:
                print("No more chance to retry. Giving up.")
                sys.exit(-1)
        else:
            break
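A minimal sketch of running the trigger script by hand (placeholder values; the usual Azure DevOps convention of base64-encoding ":<PAT>" for basic auth is an assumption here, not stated in the script):

export AZURE_DEVOPS_PAT_BASE64_SECRET="..."   # placeholder PAT, assumed base64-encoded
export CIRCLE_BRANCH="master"
export CIRCLE_SHA1="..."                      # placeholder commit SHA
python3 ./.circleci/scripts/trigger_azure_pipeline.py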


@@ -17,7 +17,7 @@ def get_size(file_dir):
        # we should only expect one file, if no, something is wrong
        file_name = glob.glob(os.path.join(file_dir, "*"))[0]
        return os.stat(file_name).st_size
-    except:
+    except Exception:
        logging.exception(f"error getting file from: {file_dir}")
        return 0
@@ -145,5 +145,5 @@ if __name__ == "__main__":
    if size != 0:
        try:
            send_message([build_message(size)])
-        except:
+        except Exception:
            logging.exception("can't send message")


@@ -1,7 +1,10 @@
-$VS_DOWNLOAD_LINK = "https://aka.ms/vs/15/release/vs_buildtools.exe"
+# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
+# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
+# 16.8.5 BuildTools
+$VS_DOWNLOAD_LINK = "https://download.visualstudio.microsoft.com/download/pr/20130c62-1bc8-43d6-b4f0-c20bb7c79113/145a319d79a83376915d8f855605e152ef5f6fa2b2f1d2dca411fb03722eea72/vs_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
-                    "--add Microsoft.VisualStudio.Component.VC.Tools.14.13",
                     "--add Microsoft.Component.MSBuild",
                     "--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
                     "--add Microsoft.VisualStudio.Component.TextTemplating",
@@ -13,10 +16,25 @@ $VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStud
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
-  echo "Download of the VS 2017 installer failed"
+  echo "Download of the VS 2019 Version 16.8.5 installer failed"
  exit 1
}
+if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
+  $existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath
+  if ($existingPath -ne $null) {
+    echo "Found existing BuildTools installation in $existingPath"
+    $VS_UNINSTALL_ARGS = @("uninstall", "--installPath", "`"$existingPath`"", "--quiet","--wait")
+    $process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_UNINSTALL_ARGS -NoNewWindow -Wait -PassThru
+    $exitCode = $process.ExitCode
+    if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
+      echo "Original BuildTools uninstall failed with code $exitCode"
+      exit 1
+    }
+    echo "Original BuildTools uninstalled"
+  }
+}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
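For context (a standard Windows-installer convention, stated here as an assumption rather than documented vs_installer.exe behavior): exit code 3010 conventionally means "success, reboot required", which is why the uninstall step above treats it as non-fatal.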


@@ -0,0 +1,5 @@
$CMATH_DOWNLOAD_LINK = "https://raw.githubusercontent.com/microsoft/STL/12c684bba78f9b032050526abdebf14f58ca26a3/stl/inc/cmath"
$VC14_28_INSTALL_PATH="C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\include"
curl.exe --retry 3 -kL $CMATH_DOWNLOAD_LINK --output "$home\cmath"
Move-Item -Path "$home\cmath" -Destination "$VC14_28_INSTALL_PATH" -Force


@@ -8,9 +8,18 @@ if [[ "$cuda_major_version" == "10" ]]; then
  msbuild_project_dir="CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
  cuda_install_packages="nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1"
elif [[ "$cuda_major_version" == "11" ]]; then
-  cuda_installer_name="cuda_11.1.0_456.43_win10"
-  msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
-  cuda_install_packages="nvcc_11.1 cuobjdump_11.1 nvprune_11.1 nvprof_11.1 cupti_11.1 cublas_11.1 cublas_dev_11.1 cudart_11.1 cufft_11.1 cufft_dev_11.1 curand_11.1 curand_dev_11.1 cusolver_11.1 cusolver_dev_11.1 cusparse_11.1 cusparse_dev_11.1 npp_11.1 npp_dev_11.1 nvrtc_11.1 nvrtc_dev_11.1 nvml_dev_11.1"
+  if [[ "${CUDA_VERSION}" == "11.1" ]]; then
+    cuda_installer_name="cuda_11.1.0_456.43_win10"
+    msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
+    cuda_install_packages="nvcc_11.1 cuobjdump_11.1 nvprune_11.1 nvprof_11.1 cupti_11.1 cublas_11.1 cublas_dev_11.1 cudart_11.1 cufft_11.1 cufft_dev_11.1 curand_11.1 curand_dev_11.1 cusolver_11.1 cusolver_dev_11.1 cusparse_11.1 cusparse_dev_11.1 npp_11.1 npp_dev_11.1 nvrtc_11.1 nvrtc_dev_11.1 nvml_dev_11.1"
+  elif [[ "${CUDA_VERSION}" == "11.3" ]]; then
+    cuda_installer_name="cuda_11.3.0_465.89_win10"
+    msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
+    cuda_install_packages="thrust_11.3 nvcc_11.3 cuobjdump_11.3 nvprune_11.3 nvprof_11.3 cupti_11.3 cublas_11.3 cublas_dev_11.3 cudart_11.3 cufft_11.3 cufft_dev_11.3 curand_11.3 curand_dev_11.3 cusolver_11.3 cusolver_dev_11.3 cusparse_11.3 cusparse_dev_11.3 npp_11.3 npp_dev_11.3 nvrtc_11.3 nvrtc_dev_11.3 nvml_dev_11.3"
+  else
+    echo "This should not happen! ABORT."
+    exit 1
+  fi
else
  echo "CUDA_VERSION $CUDA_VERSION is not supported yet"
  exit 1


@@ -6,7 +6,14 @@ cuda_major_version=${CUDA_VERSION%.*}
if [[ "$cuda_major_version" == "10" ]]; then
  cudnn_installer_name="cudnn-${CUDA_VERSION}-windows10-x64-v7.6.4.38"
elif [[ "$cuda_major_version" == "11" ]]; then
-  cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.0.5.39"
+  if [[ "${CUDA_VERSION}" == "11.1" ]]; then
+    cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.0.5.39"
+  elif [[ "${CUDA_VERSION}" == "11.3" ]]; then
+    cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.2.0.53"
+  else
+    echo "This should not happen! ABORT."
+    exit 1
+  fi
else
  echo "CUDNN for CUDA_VERSION $CUDA_VERSION is not supported yet"
  exit 1


@@ -22,6 +22,24 @@ pytorch_params: &pytorch_params
    BUILD_ONLY: << parameters.build_only >>
  resource_class: << parameters.resource_class >>

+pytorch_android_params: &pytorch_android_params
+  parameters:
+    build_environment:
+      type: string
+      default: ""
+    op_list:
+      type: string
+      default: ""
+    lite_interpreter:
+      type: string
+      default: "1"
+  environment:
+    BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single
+    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
+    PYTHON_VERSION: "3.6"
+    SELECTED_OP_LIST: << parameters.op_list >>
+    BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
+
pytorch_ios_params: &pytorch_ios_params
  parameters:
    build_environment:
@@ -39,12 +57,16 @@ pytorch_ios_params: &pytorch_ios_params
    use_metal:
      type: string
      default: "0"
+    lite_interpreter:
+      type: string
+      default: "1"
  environment:
    BUILD_ENVIRONMENT: << parameters.build_environment >>
    IOS_ARCH: << parameters.ios_arch >>
    IOS_PLATFORM: << parameters.ios_platform >>
    SELECTED_OP_LIST: << parameters.op_list >>
    USE_PYTORCH_METAL: << parameters.use_metal >>
+    BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>

pytorch_windows_params: &pytorch_windows_params
  parameters:
@@ -84,6 +106,6 @@ pytorch_windows_params: &pytorch_windows_params
    VC_YEAR: <<parameters.vc_year>>
    VC_PRODUCT: <<parameters.vc_product>>
    USE_CUDA: <<parameters.use_cuda>>
-    TORCH_CUDA_ARCH_LIST: "7.5"
+    TORCH_CUDA_ARCH_LIST: "5.2 7.5"
    JOB_BASE_NAME: <<parameters.test_name>>
    JOB_EXECUTOR: <<parameters.executor>>


@@ -14,19 +14,15 @@ parameters:
  run_build:
    type: boolean
    default: true
+  run_master_build:
+    type: boolean
+    default: false

-docker_config_defaults: &docker_config_defaults
-  user: jenkins
-  aws_auth:
-    # This IAM user only allows read-write access to ECR
-    aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4}
-    aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4}
executors:
  windows-with-nvidia-gpu:
    machine:
      resource_class: windows.gpu.nvidia.medium
-      image: windows-server-2019-nvidia:stable
+      image: windows-server-2019-nvidia:previous
      shell: bash.exe
  windows-xlarge-cpu-with-nvidia-cuda:

@@ -45,7 +45,7 @@
  binary_linux_test:
    <<: *binary_linux_test_upload_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
      - checkout
@@ -108,7 +108,7 @@
  smoke_linux_test:
    <<: *binary_linux_test_upload_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -198,6 +198,44 @@
          root: /Users/distiller/project
          paths: final_pkgs
+      - store_artifacts:
+          path: /Users/distiller/project/final_pkgs
+
+  binary_macos_arm64_build:
+    <<: *binary_mac_params
+    macos:
+      xcode: "12.3.0"
+    steps:
+      # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
+      - checkout
+      - run:
+          <<: *binary_checkout
+      - run:
+          <<: *binary_populate_env
+      - brew_update
+      - run:
+          <<: *binary_install_miniconda
+      - run:
+          name: Build
+          no_output_timeout: "90m"
+          command: |
+            # Do not set -u here; there is some problem with CircleCI
+            # variable expansion with PROMPT_COMMAND
+            set -ex -o pipefail
+            export CROSS_COMPILE_ARM64=1
+            script="/Users/distiller/project/pytorch/.circleci/scripts/binary_macos_build.sh"
+            cat "$script"
+            source "$script"
+      - persist_to_workspace:
+          root: /Users/distiller/project
+          paths: final_pkgs
+      - store_artifacts:
+          path: /Users/distiller/project/final_pkgs
+
  binary_ios_build:
    <<: *pytorch_ios_params
    macos:
@@ -270,6 +308,8 @@
      - persist_to_workspace:
          root: "C:/w"
          paths: final_pkgs
+      - store_artifacts:
+          path: C:/w/final_pkgs

  binary_windows_test:
    <<: *binary_windows_params
@@ -352,4 +392,3 @@
          command: |
            ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
            scripts/release/anaconda-prune/run.sh


@@ -8,7 +8,7 @@
# then install the one with the most recent version.
update_s3_htmls: &update_s3_htmls
  machine:
-    image: ubuntu-1604:202007-01
+    image: ubuntu-2004:202104-01
  resource_class: medium
  steps:
    - checkout


@@ -4,7 +4,7 @@
        type: string
        default: ""
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    resource_class: large
    environment:
      IMAGE_NAME: << parameters.image_name >>
@@ -20,7 +20,10 @@
            set +x
            export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
-            eval $(aws ecr get-login --no-include-email --region us-east-1)
+            export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
+            export AWS_REGION=us-east-1
+            aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
+                --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
            set -x
            # Check if image already exists, if it does then skip building it
            if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
@@ -53,7 +56,7 @@
            cd .circleci/docker && ./build_docker.sh
  docker_for_ecr_gc_build_job:
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - run:
@@ -65,9 +68,12 @@
            set +x
            export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
-            eval $(aws ecr get-login --no-include-email --region us-east-1)
+            export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
+            export AWS_REGION=us-east-1
+            aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
+                --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
            set -x
-            docker push 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
+            docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/gc/ecr
  ecr_gc_job:
    parameters:
      project:
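For background (not part of the diff): `aws ecr get-login` was dropped in AWS CLI v2, hence the switch to `get-login-password` piped into `docker login`. The account ID could also be fetched without grep/cut; a sketch assuming a reasonably recent CLI:

AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com"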


@@ -1,7 +1,7 @@
  pytorch_doc_push:
    resource_class: medium
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    parameters:
      branch:
        type: string
@@ -30,7 +30,7 @@
      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -75,7 +75,7 @@
      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -111,6 +111,43 @@
        paths:
          - .

+  pytorch_macos_10_15_py3_build:
+    environment:
+      BUILD_ENVIRONMENT: pytorch-macos-10.15-py3-arm64-build
+    macos:
+      xcode: "12.3.0"
+    steps:
+      - checkout
+      - run_brew_for_macos_build
+      - run:
+          name: Build
+          no_output_timeout: "1h"
+          command: |
+            set -e
+            export IN_CI=1
+            export CROSS_COMPILE_ARM64=1
+            # Install sccache
+            sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
+            sudo chmod +x /usr/local/bin/sccache
+            export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
+            # This IAM user allows write access to S3 bucket for sccache
+            set +x
+            export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
+            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
+            set -x
+            chmod a+x .jenkins/pytorch/macos-build.sh
+            unbuffer .jenkins/pytorch/macos-build.sh 2>&1 | ts
+      - persist_to_workspace:
+          root: /Users/distiller/workspace/
+          paths:
+            - miniconda3
+      - store_artifacts:
+          path: /Users/distiller/project/dist
+
  pytorch_macos_10_13_py3_build:
    environment:
      BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
@@ -127,7 +164,7 @@
            export IN_CI=1
            # Install sccache
-            sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
+            sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
            sudo chmod +x /usr/local/bin/sccache
            export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
@@ -164,6 +201,42 @@
            chmod a+x .jenkins/pytorch/macos-test.sh
            unbuffer .jenkins/pytorch/macos-test.sh 2>&1 | ts
+      - run:
+          name: Report results
+          no_output_timeout: "5m"
+          command: |
+            set -ex
+            source /Users/distiller/workspace/miniconda3/bin/activate
+            pip install boto3
+            export PYTHONPATH="$PWD"
+            # Using the same IAM user to write stats to our OSS bucket
+            export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
+            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
+            python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
+          when: always
+      - store_test_results:
+          path: test/test-reports
+
+  pytorch_macos_10_13_py3_lite_interpreter_build_test:
+    environment:
+      BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
+    macos:
+      xcode: "12.0"
+    steps:
+      - checkout
+      - attach_workspace:
+          at: ~/workspace
+      - run_brew_for_macos_build
+      - run:
+          name: Test
+          no_output_timeout: "1h"
+          command: |
+            set -e
+            export IN_CI=1
+            export BUILD_LITE_INTERPRETER=1
+            chmod a+x ${HOME}/project/.jenkins/pytorch/macos-lite-interpreter-build-test.sh
+            unbuffer ${HOME}/project/.jenkins/pytorch/macos-lite-interpreter-build-test.sh 2>&1 | ts
      - store_test_results:
          path: test/test-reports
@@ -174,7 +247,7 @@
      PYTHON_VERSION: "3.6"
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -263,7 +336,7 @@
      PYTHON_VERSION: "3.6"
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -299,7 +372,7 @@
      PYTHON_VERSION: "3.6"
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -335,13 +408,10 @@
          destination: artifacts.tgz

  pytorch_android_gradle_custom_build_single:
-    environment:
-      BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single
-      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
-      PYTHON_VERSION: "3.6"
+    <<: *pytorch_android_params
    resource_class: large
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -361,11 +431,11 @@
            echo "DOCKER_IMAGE: ${DOCKER_IMAGE}:${DOCKER_TAG}"
            time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
-            git submodule sync && git submodule update -q --init --recursive
+            git submodule sync && git submodule update -q --init --recursive --depth 1
            VOLUME_MOUNTS="-v /home/circleci/project/:/var/lib/jenkins/workspace"
            export id=$(docker run --env-file "${BASH_ENV}" ${VOLUME_MOUNTS} --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
-            export COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
+            export COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "export BUILD_LITE_INTERPRETER=${BUILD_LITE_INTERPRETER}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
            echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
            # Skip docker push as this job is purely for size analysis purpose.
@@ -430,7 +500,7 @@
            # sync submodules
            cd ${PROJ_ROOT}
            git submodule sync
-            git submodule update --init --recursive
+            git submodule update --init --recursive --depth 1
            # export
            export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
@@ -440,6 +510,7 @@
            echo "IOS_ARCH: ${IOS_ARCH}"
            echo "IOS_PLATFORM: ${IOS_PLATFORM}"
            echo "USE_PYTORCH_METAL": "${USE_METAL}"
+            echo "BUILD_LITE_INTERPRETER": "${BUILD_LITE_INTERPRETER}"
            #check the custom build flag
            echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
@@ -457,6 +528,10 @@
          no_output_timeout: "30m"
          command: |
            set -e
+            if [ ${BUILD_LITE_INTERPRETER} == 0 ]; then
+              echo "Run Build Test is not for full jit, skipping."
+              exit 0
+            fi
            PROJ_ROOT=/Users/distiller/project
            PROFILE=PyTorch_CI_2021
            # run the ruby build script
@@ -482,6 +557,9 @@
            if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
              echo "not SIMULATOR build, skip it."
              exit 0
+            elif [ ${BUILD_LITE_INTERPRETER} == 0 ]; then
+              echo "Run Simulator Tests is not for full jit, skipping."
+              exit 0
            fi
            WORKSPACE=/Users/distiller/workspace
            PROJ_ROOT=/Users/distiller/project
@@ -497,7 +575,7 @@
  pytorch_linux_bazel_build:
    <<: *pytorch_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -515,7 +593,7 @@
            echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
-            git submodule sync && git submodule update -q --init --recursive
+            git submodule sync && git submodule update -q --init --recursive --depth 1
            docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
@@ -535,7 +613,7 @@
  pytorch_linux_bazel_test:
    <<: *pytorch_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag
@@ -576,13 +654,26 @@
      - store_test_results:
          path: bazel-testlogs

+  pytorch_windows_test_multigpu:
+    machine:
+      image: ubuntu-2004:202104-01
+    steps:
+      - checkout
+      - run:
+          name: Test
+          no_output_timeout: "90m"
+          command: |
+            set -e
+            python3 -m pip install requests
+            python3 ./.circleci/scripts/trigger_azure_pipeline.py
+
  pytorch_doc_test:
    environment:
      BUILD_ENVIRONMENT: pytorch-doc-test
      DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
    resource_class: medium
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      - checkout
      - calculate_docker_image_tag


@@ -2,7 +2,7 @@ jobs:
  pytorch_linux_build:
    <<: *pytorch_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
      - checkout
@@ -15,9 +15,6 @@ jobs:
          no_output_timeout: "1h"
          command: |
            set -e
-            if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then
-              export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6"
-            fi
            if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then
              echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}"
            fi
@@ -33,11 +30,11 @@ jobs:
            time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
            export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
-            git submodule sync && git submodule update -q --init --recursive
+            git submodule sync && git submodule update -q --init --recursive --depth 1
            docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
-            export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1'
+            export COMMAND='((echo "sudo chown -R jenkins workspace && export CIRCLE_JOB="$CIRCLE_JOB" && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1'
            echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
@@ -83,7 +80,7 @@ jobs:
  pytorch_linux_test:
    <<: *pytorch_params
    machine:
-      image: ubuntu-1604:202007-01
+      image: ubuntu-2004:202104-01
    steps:
      # See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
      - checkout
@@ -168,6 +165,7 @@ jobs:
            # =================== The following code will be executed inside Docker container ===================
            set -ex
            export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
+            export CIRCLE_JOB="$CIRCLE_JOB"
            ${PARALLEL_FLAGS}
            cd workspace
            EOL
@@ -182,11 +180,27 @@ jobs:
            fi
            echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
            unbuffer bash command.sh | ts
+            if [[ ${BUILD_ENVIRONMENT} == *"coverage"* ]]; then
+              echo "Retrieving C++ coverage report"
+              docker cp $id:/var/lib/jenkins/workspace/build/coverage.info ./test
+            fi
+            if [[ ${BUILD_ENVIRONMENT} == *"coverage"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then
+              echo "Retrieving Python coverage report"
+              docker cp $id:/var/lib/jenkins/workspace/test/.coverage ./test
+              docker cp $id:/var/lib/jenkins/workspace/test/coverage.xml ./test
+              python3 -mpip install codecov
+              python3 -mcodecov
+            fi
      - run:
          name: Report results
          no_output_timeout: "5m"
          command: |
            set -e
+            # Retrieving test results should be done as very first step as command never fails
+            # But is always executed if previous step fails for some reason
+            echo "Retrieving test reports"
+            docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
            docker stats --all --no-stream
            cat >docker_commands.sh \<<EOL
@@ -201,27 +215,18 @@ jobs:
            export CIRCLE_JOB="$CIRCLE_JOB"
            export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
            cd workspace
-            python test/print_test_stats.py --upload-to-s3 test
+            export PYTHONPATH="\${PWD}"
+            python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
            EOL
-            echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
+            echo "(cat docker_commands.sh | docker exec -u jenkins -e LANG=C.UTF-8 -i "$id" bash) 2>&1" > command.sh
            unbuffer bash command.sh | ts
-            echo "Retrieving test reports"
-            docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
-            if [[ ${BUILD_ENVIRONMENT} == *"coverage"* ]]; then
-              echo "Retrieving C++ coverage report"
-              docker cp $id:/var/lib/jenkins/workspace/build/coverage.info ./test
-            fi
-            if [[ ${BUILD_ENVIRONMENT} == *"coverage"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then
-              echo "Retrieving Python coverage report"
-              docker cp $id:/var/lib/jenkins/workspace/test/.coverage ./test
-              docker cp $id:/var/lib/jenkins/workspace/test/coverage.xml ./test
-              python3 -mpip install codecov
-              python3 -mcodecov
-            fi
          when: always
      - store_test_results:
          path: test-reports
+      - store_artifacts:
+          path: test/.coverage
+      - store_artifacts:
+          path: test/coverage.xml

  pytorch_windows_build:
    <<: *pytorch_windows_params
@@ -256,6 +261,11 @@ jobs:
    executor: <<parameters.executor>>
    steps:
      - checkout
+      - run:
+          name: Install VS2019 toolchain
+          no_output_timeout: 10m
+          command: |
+            powershell .circleci/scripts/vs_install.ps1
      - run:
          name: Install Cuda
          no_output_timeout: 30m
@@ -320,6 +330,11 @@ jobs:
      - checkout
      - attach_workspace:
          at: c:/users/circleci/workspace
+      - run:
+          name: Install VS2019 toolchain
+          no_output_timeout: 10m
+          command: |
+            powershell .circleci/scripts/vs_install.ps1
      - run:
          name: Install Cuda
          no_output_timeout: 30m
@@ -346,5 +361,18 @@ jobs:
            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
            set -x
            .jenkins/pytorch/win-test.sh
+      - run:
+          name: Report results
+          no_output_timeout: "5m"
+          command: |
+            set -ex
+            export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
+            export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
+            export PYTHONPATH="$PWD"
+            pip install typing_extensions boto3
+            python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
+          when: always
      - store_test_results:
          path: test/test-reports
+      - store_artifacts:
+          path: test/coverage.xml


@@ -0,0 +1,195 @@
  scheduled-ci:
    triggers:
      - schedule:
          # runs every 4 hours on the 45th minute
          cron: "45 0,4,8,12,16,20 * * *"
          filters:
            branches:
              only:
                - master
    jobs:
      - docker_build_job:
          name: "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          image_name: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
      - pytorch_linux_build:
          name: periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_build
          requires:
            - "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
      - pytorch_linux_test:
          name: periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_test
          requires:
            - periodic_pytorch_xenial_cuda11_3_cudnn8_gcc7_build
          build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-test"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          use_cuda_docker_runtime: "1"
          resource_class: gpu.medium
      - pytorch_linux_build:
          name: periodic_libtorch_xenial_cuda11_3_cudnn8_gcc7_build
          requires:
            - "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          build_environment: "pytorch-libtorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
      - pytorch_windows_build:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          name: periodic_pytorch_windows_cuda11.3_build
          python_version: "3.6"
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"
      - pytorch_windows_test:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          executor: windows-with-nvidia-gpu
          name: periodic_pytorch_windows_cuda11.3_test1
          python_version: "3.6"
          requires:
            - periodic_pytorch_windows_cuda11.3_build
          test_name: pytorch-windows-test1
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"
      - pytorch_windows_test:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          executor: windows-with-nvidia-gpu
          name: periodic_pytorch_windows_cuda11.3_test2
          python_version: "3.6"
          requires:
            - periodic_pytorch_windows_cuda11.3_build
          test_name: pytorch-windows-test2
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"

  # The following allows these jobs to run on ci-all and release branches
  debuggable-scheduled-ci:
    jobs:
      - docker_build_job:
          name: "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          image_name: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_linux_build:
          name: pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
          requires:
            - "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_linux_test:
          name: pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_test
          requires:
            - pytorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
          build_environment: "pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-test"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          use_cuda_docker_runtime: "1"
          resource_class: gpu.medium
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_linux_build:
          name: pytorch_libtorch_linux_xenial_cuda11_3_cudnn8_py3_gcc7_build
          requires:
            - "docker-pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          build_environment: "pytorch-libtorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7-build"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_windows_build:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          name: pytorch_windows_vs2019_py36_cuda11.3_build
          python_version: "3.6"
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_windows_test:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          executor: windows-with-nvidia-gpu
          name: pytorch_windows_vs2019_py36_cuda11.3_test1
          python_version: "3.6"
          requires:
            - pytorch_windows_vs2019_py36_cuda11.3_build
          test_name: pytorch-windows-test1
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/
      - pytorch_windows_test:
          build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
          cuda_version: "11.3"
          executor: windows-with-nvidia-gpu
          name: pytorch_windows_vs2019_py36_cuda11.3_test2
          python_version: "3.6"
          requires:
            - pytorch_windows_vs2019_py36_cuda11.3_build
          test_name: pytorch-windows-test2
          use_cuda: "1"
          vc_product: BuildTools
          vc_version: "14.28.29333"
          vc_year: "2019"
          filters:
            branches:
              only:
                - /ci-all\/.*/
                - /release\/.*/

  # the following clones pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7's tests but enables
  # slow tests and sets an environment variable so gradcheck runs with fast_mode=False
  slow-gradcheck-scheduled-ci:
    triggers:
      - schedule:
          # runs every 8 hours on the 45th minute
          cron: "45 0,8,16 * * *"
          filters:
            branches:
              only:
                - master
    jobs:
      - docker_build_job:
          name: "docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
          image_name: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
      - pytorch_linux_build:
          name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
          requires:
            - "docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
          build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-build"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
      - pytorch_linux_test:
          name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_old_gradcheck_tests
          requires:
            - periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
          build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-old-gradcheck-tests"
          docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
          use_cuda_docker_runtime: "1"
          resource_class: gpu.medium


@@ -1129,4 +1129,3 @@ JNIEXPORT void JNI_OnUnload(JavaVM* vm, void* reserved);
#define JNI_ABORT 2               /* free buffer w/o copying back */

#endif  /* JNI_H_ */


@@ -6,8 +6,11 @@ bugprone-*,
-bugprone-forward-declaration-namespace,
-bugprone-macro-parentheses,
-bugprone-lambda-function-name,
+-bugprone-reserved-identifier,
cppcoreguidelines-*,
+-cppcoreguidelines-avoid-magic-numbers,
-cppcoreguidelines-interfaces-global-init,
+-cppcoreguidelines-macro-usage,
-cppcoreguidelines-owning-memory,
-cppcoreguidelines-pro-bounds-array-to-pointer-decay,
-cppcoreguidelines-pro-bounds-constant-array-index,
@@ -30,6 +33,7 @@ modernize-*,
-modernize-use-trailing-return-type,
performance-*,
-performance-noexcept-move-constructor,
+-performance-unnecessary-value-param,
'
HeaderFilterRegex: 'torch/csrc/.*'
AnalyzeTemporaryDtors: false

.coveragerc (new file)

@@ -0,0 +1,15 @@
[run]
plugins =
    coverage_plugins.jit_plugin
omit =
    */tmp*
    */Temp/*
    */usr/local/lib*
    *test/*

[report]
omit =
    */tmp*
    */Temp/*
    */usr/local/lib*
    *test/*
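A minimal sketch of how this config gets picked up (coverage.py reads .coveragerc from the working directory by default; the test entry point below is illustrative):

pip install coverage
coverage run test/run_test.py   # illustrative entry point
coverage report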

.flake8

@@ -4,7 +4,7 @@ max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
-    E203,E305,E402,E501,E721,E741,F403,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
+    E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
    # shebang has extra meaning in fbcode lints, so I think it's not worth trying
    # to line this up with executable bit
    EXE001,
@@ -13,21 +13,20 @@ ignore =
    # these ignores are from flake8-comprehensions; please fix!
    C400,C401,C402,C403,C404,C405,C407,C411,C413,C414,C415
per-file-ignores = __init__.py: F401 torch/utils/cpp_extension.py: B950
+optional-ascii-coding = True
exclude =
-    docs/src,
-    docs/cpp/src,
-    venv,
-    third_party,
-    caffe2,
-    scripts,
-    docs/caffe2,
-    torch/lib/include,
-    torch/lib/tmp_install,
-    build,
-    torch/include,
-    *.pyi,
-    .git,
-    build,
-    build_test_custom_build,
-    build_code_analyzer,
-    test/generated_type_hints_smoketest.py
+    ./.git,
+    ./build_code_analyzer,
+    ./build_test_custom_build,
+    ./build,
+    ./caffe2,
+    ./docs/caffe2,
+    ./docs/cpp/src,
+    ./docs/src,
+    ./scripts,
+    ./test/generated_type_hints_smoketest.py,
+    ./third_party,
+    ./torch/include,
+    ./torch/lib,
+    ./venv,
+    *.pyi
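A quick way to exercise this config locally (flake8 discovers .flake8 in the working directory; the comprehension plugin is implied by the C4xx ignores above):

pip install flake8 flake8-comprehensions
flake8   # run from the repo root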

.gdbinit (new file)

@@ -0,0 +1,14 @@
# automatically load the pytorch-gdb extension.
#
# gdb automatically tries to load this file whenever it is executed from the
# root of the pytorch repo, but by default it is not allowed to do so due to
# security reasons. If you want to use pytorch-gdb, please add the following
# line to your ~/.gdbinit (i.e., the .gdbinit file which is in your home
# directory, NOT this file):
# add-auto-load-safe-path /path/to/pytorch/.gdbinit
#
# Alternatively, you can manually load the pytorch-gdb commands into your
# existing gdb session by doing the following:
# (gdb) source /path/to/pytorch/tools/gdb/pytorch-gdb.py
source tools/gdb/pytorch-gdb.py


@@ -11,3 +11,5 @@ labels_to_circle_params:
      - v[0-9]+(\.[0-9]+)*-rc[0-9]+
    set_to_false:
      - run_build
+  ci/master:
+    parameter: run_master_build

.github/scale-config.yml (new file, vendored)

@@ -0,0 +1,34 @@
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# NOTE (Apr 5, 2021): Linux runners are currently all amazonlinux2
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
runner_types:
  linux.2xlarge:
    instance_type: c5.2xlarge
    os: linux
    max_available: 500
    disk_size: 150
  linux.8xlarge.nvidia.gpu:
    instance_type: g3.8xlarge
    os: linux
    max_available: 50
    disk_size: 150
  windows.4xlarge:
    instance_type: c5.4xlarge
    os: windows
    max_available: 200
    disk_size: 256


@@ -0,0 +1,42 @@
#!/bin/sh
set -xeuo pipefail
PYTORCH_DOCKER_TAG=$(git describe --tags --always)-devel
CUDA_VERSION=11.1
# Build PyTorch nightly docker
make -f docker.Makefile \
DOCKER_REGISTRY=ghcr.io \
DOCKER_ORG=pytorch \
CUDA_VERSION=${CUDA_VERSION} \
DOCKER_IMAGE=pytorch-nightly \
DOCKER_TAG=${PYTORCH_DOCKER_TAG} \
INSTALL_CHANNEL=pytorch-nightly BUILD_TYPE=official devel-image
# Get the PYTORCH_NIGHTLY_COMMIT from the docker image
PYTORCH_NIGHTLY_COMMIT=$(docker run \
ghcr.io/pytorch/pytorch-nightly:${PYTORCH_DOCKER_TAG} \
python -c 'import torch; print(torch.version.git_version)' | head -c 7)
docker tag ghcr.io/pytorch/pytorch-nightly:${PYTORCH_DOCKER_TAG} \
ghcr.io/pytorch/pytorch-nightly:${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}
docker tag ghcr.io/pytorch/pytorch-nightly:${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION} \
ghcr.io/pytorch/pytorch-nightly:latest
# Push the nightly docker to GitHub Container Registry
echo $GHCR_PAT | docker login ghcr.io -u pytorch --password-stdin
make -f docker.Makefile \
DOCKER_REGISTRY=ghcr.io \
DOCKER_ORG=pytorch \
DOCKER_IMAGE=pytorch-nightly \
DOCKER_TAG=${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION} \
devel-push
make -f docker.Makefile \
DOCKER_REGISTRY=ghcr.io \
DOCKER_ORG=pytorch \
DOCKER_IMAGE=pytorch-nightly \
DOCKER_TAG=latest \
devel-push
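Assuming the pushes above succeed, the resulting image should be pullable by the tags the script creates, e.g.:

docker pull ghcr.io/pytorch/pytorch-nightly:latest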


@ -10,14 +10,13 @@ architectures:
 * Latest ROCM
 """
+import argparse
 import json
-import os
-import itertools
+from typing import Dict, List
 CUDA_ARCHES = [
-    "10.1",
     "10.2",
-    "11.0"
+    "11.1"
 ]
 ROCM_ARCHES = [
@ -25,13 +24,17 @@ ROCM_ARCHES = [
     "4.0"
 ]
-FULL_ARCHES = [
-    "cpu",
-    *CUDA_ARCHES,
-    *ROCM_ARCHES
-]
-CONTAINER_IMAGES = {
+def arch_type(arch_version: str) -> str:
+    if arch_version in CUDA_ARCHES:
+        return "cuda"
+    elif arch_version in ROCM_ARCHES:
+        return "rocm"
+    else:  # arch_version should always be "cpu" in this case
+        return "cpu"
+
+WHEEL_CONTAINER_IMAGES = {
     **{
         # TODO: Re-do manylinux CUDA image tagging scheme to be similar to
         # ROCM so we don't have to do this replacement
@ -45,6 +48,29 @@ CONTAINER_IMAGES = {
     "cpu": "pytorch/manylinux-cpu"
 }
+CONDA_CONTAINER_IMAGES = {
+    **{
+        gpu_arch: f"pytorch/conda-builder:cuda{gpu_arch}"
+        for gpu_arch in CUDA_ARCHES
+    },
+    "cpu": "pytorch/conda-builder:cpu"
+}
+
+LIBTORCH_CONTAINER_IMAGES = {
+    **{
+        # TODO: Re-do manylinux CUDA image tagging scheme to be similar to
+        # ROCM so we don't have to do this replacement
+        (gpu_arch, "pre-cxx11"): f"pytorch/manylinux-cuda{gpu_arch.replace('.', '')}"
+        for gpu_arch in CUDA_ARCHES
+    },
+    **{
+        (gpu_arch, "cxx11-abi"): f"pytorch/libtorch-cxx11-builder:cuda{gpu_arch}"
+        for gpu_arch in CUDA_ARCHES
+    },
+    ("cpu", "pre-cxx11"): "pytorch/manylinux-cpu",
+    ("cpu", "cxx11-abi"): "pytorch/libtorch-cxx11-builder:cpu",
+}
 FULL_PYTHON_VERSIONS = [
     "3.6",
     "3.7",
@ -53,34 +79,89 @@ FULL_PYTHON_VERSIONS = [
 ]
-def is_pull_request():
-    return os.environ.get("GITHUB_HEAD_REF")
+def is_pull_request() -> bool:
+    return False
+    # return os.environ.get("GITHUB_HEAD_REF")
-def generate_matrix():
-    python_versions = FULL_PYTHON_VERSIONS
-    arches = FULL_ARCHES
-    if is_pull_request():
-        python_versions = [python_versions[-1]]
-        arches = ["cpu", CUDA_ARCHES[-1], ROCM_ARCHES[-1]]
-    matrix = []
-    for item in itertools.product(python_versions, arches):
-        python_version, arch_version = item
-        # Not my favorite code here
-        gpu_arch_type = "cuda"
-        if "rocm" in CONTAINER_IMAGES[arch_version]:
-            gpu_arch_type = "rocm"
-        elif "cpu" in CONTAINER_IMAGES[arch_version]:
-            gpu_arch_type = "cpu"
-        matrix.append({
-            "python_version": python_version,
-            "gpu_arch_type": gpu_arch_type,
-            "gpu_arch_version": arch_version,
-            "container_image": CONTAINER_IMAGES[arch_version]
-        })
-    return json.dumps({"include": matrix})
+def snip_if(is_pr: bool, versions: List[str]) -> List[str]:
+    """
+    Return the full list of versions, or just the latest if on a PR.
+    """
+    return [versions[-1]] if is_pr else versions
+
+def generate_conda_matrix(is_pr: bool) -> List[Dict[str, str]]:
+    return [
+        {
+            "python_version": python_version,
+            "gpu_arch_type": arch_type(arch_version),
+            "gpu_arch_version": arch_version,
+            "container_image": CONDA_CONTAINER_IMAGES[arch_version],
+        }
+        for python_version in snip_if(is_pr, FULL_PYTHON_VERSIONS)
+        # We don't currently build conda packages for rocm
+        for arch_version in ["cpu"] + snip_if(is_pr, CUDA_ARCHES)
+    ]
+
+def generate_libtorch_matrix(is_pr: bool) -> List[Dict[str, str]]:
+    libtorch_variants = [
+        "shared-with-deps",
+        "shared-without-deps",
+        "static-with-deps",
+        "static-without-deps",
+    ]
+    return [
+        {
+            "gpu_arch_type": arch_type(arch_version),
+            "gpu_arch_version": arch_version,
+            "libtorch_variant": libtorch_variant,
+            "devtoolset": abi_version,
+            "container_image": LIBTORCH_CONTAINER_IMAGES[(arch_version, abi_version)],
+        }
+        # We don't currently build libtorch for rocm
+        for arch_version in ["cpu"] + snip_if(is_pr, CUDA_ARCHES)
+        for libtorch_variant in libtorch_variants
+        # one of the values in the following list must be exactly
+        # "cxx11-abi", but the precise value of the other one doesn't
+        # matter
+        for abi_version in ["cxx11-abi", "pre-cxx11"]
+    ]
+
+def generate_wheels_matrix(is_pr: bool) -> List[Dict[str, str]]:
+    arches = ["cpu"]
+    arches += snip_if(is_pr, CUDA_ARCHES)
+    arches += snip_if(is_pr, ROCM_ARCHES)
+    return [
+        {
+            "python_version": python_version,
+            "gpu_arch_type": arch_type(arch_version),
+            "gpu_arch_version": arch_version,
+            "container_image": WHEEL_CONTAINER_IMAGES[arch_version],
+        }
+        for python_version in snip_if(is_pr, FULL_PYTHON_VERSIONS)
+        for arch_version in arches
+    ]
+
+def from_includes(includes: List[Dict[str, str]]) -> str:
+    return json.dumps({"include": includes})
-def main():
-    print(generate_matrix())
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument('mode', choices=['conda', 'libtorch', 'wheels'])
+    args = parser.parse_args()
+    is_pr = is_pull_request()
+    print(from_includes({
+        'conda': generate_conda_matrix,
+        'libtorch': generate_libtorch_matrix,
+        'wheels': generate_wheels_matrix,
+    }[args.mode](is_pr)))
 if __name__ == "__main__":
     main()
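The binary-build workflows below consume this script's output through `fromJson()` as a build matrix. A sketch of the JSON shape it prints for `wheels` mode; the entries are illustrative, not the full matrix:

```python
# GitHub Actions expects a {"include": [...]} object, one dict per build.
import json

matrix = {"include": [
    {"python_version": "3.9", "gpu_arch_type": "cpu",
     "gpu_arch_version": "cpu", "container_image": "pytorch/manylinux-cpu"},
    {"python_version": "3.9", "gpu_arch_type": "cuda",
     "gpu_arch_version": "11.1", "container_image": "pytorch/manylinux-cuda111"},
]}
print(json.dumps(matrix))
```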

.github/scripts/generate_linux_ci_workflows.py vendored Executable file

@ -0,0 +1,161 @@
#!/usr/bin/env python3
from pathlib import Path
import jinja2
DOCKER_REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
GITHUB_DIR = Path(__file__).parent.parent
CPU_TEST_RUNNER = "linux.2xlarge"
CUDA_TEST_RUNNER = "linux.8xlarge.nvidia.gpu"
class PyTorchLinuxWorkflow:
def __init__(
self,
build_environment: str,
docker_image_base: str,
on_pull_request: bool = False,
enable_doc_jobs: bool = False,
):
self.build_environment = build_environment
self.docker_image_base = docker_image_base
self.test_runner_type = CPU_TEST_RUNNER
self.on_pull_request = on_pull_request
self.enable_doc_jobs = enable_doc_jobs
if "cuda" in build_environment:
self.test_runner_type = CUDA_TEST_RUNNER
def generate_workflow_file(
self, workflow_template: jinja2.Template, jinja_env: jinja2.Environment
) -> Path:
output_file_path = GITHUB_DIR.joinpath(
f"workflows/{self.build_environment}.yml"
)
with open(output_file_path, "w") as output_file:
output_file.writelines(["# @generated DO NOT EDIT MANUALLY\n"])
output_file.write(
workflow_template.render(
build_environment=self.build_environment,
docker_image_base=self.docker_image_base,
test_runner_type=self.test_runner_type,
enable_doc_jobs=self.enable_doc_jobs,
on_pull_request=self.on_pull_request,
)
)
output_file.write('\n')
return output_file_path
WORKFLOWS = [
PyTorchLinuxWorkflow(
build_environment="pytorch-linux-xenial-py3.6-gcc5.4",
docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
on_pull_request=True,
enable_doc_jobs=True,
),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-parallelnative-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-pure_torch-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc7",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-asan",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-asan",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang7-onnx",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang7-onnx",
# ),
PyTorchLinuxWorkflow(
build_environment="pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7",
docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-cuda11.1-cudnn8-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-py3.6-clang9-noarch",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-xla-linux-bionic-py3.6-clang9",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-vulkan-linux-bionic-py3.6-clang9",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-py3.8-gcc9-coverage",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.8-gcc9",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-rocm3.9-py3.6",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-rocm3.9-py3.6",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-x86_32",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-x86_64",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-arm-v7a",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-arm-v8a",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-asan",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-custom-dynamic",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-custom-static",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-code-analysis",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# ),
]
if __name__ == "__main__":
jinja_env = jinja2.Environment(
variable_start_string="!{{",
loader=jinja2.FileSystemLoader(str(GITHUB_DIR.joinpath("templates"))),
)
workflow_template = jinja_env.get_template("linux_ci_workflow.yml.in")
for workflow in WORKFLOWS:
print(
workflow.generate_workflow_file(
workflow_template=workflow_template,
jinja_env=jinja_env
)
)
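The non-standard `variable_start_string` above matters because GitHub Actions workflows already use `${{ ... }}` for their own expressions. A small sketch of the effect; the template string is illustrative:

```python
# Jinja2 only expands !{{ ... }} here and leaves GitHub Actions'
# native ${{ ... }} expressions untouched in the rendered output.
import jinja2

env = jinja2.Environment(variable_start_string="!{{")
template = env.from_string(
    "name: Linux CI (!{{ build_environment }})\n"
    "sha: ${{ github.sha }}\n"
)
print(template.render(build_environment="pytorch-linux-xenial-py3.6-gcc5.4"))
```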


@ -60,11 +60,6 @@ class PytorchVersion:
         self.no_build_suffix = no_build_suffix

     def get_post_build_suffix(self):
-        # CUDA 10.2 is the version to be uploaded to PyPI so it doesn't have a
-        # version suffix
-        if ((self.gpu_arch_type == "cuda" and self.gpu_arch_version == "10.2")
-                or self.no_build_suffix):
-            return ""
         if self.gpu_arch_type == "cuda":
             return f"+cu{self.gpu_arch_version.replace('.', '')}"
         return f"+{self.gpu_arch_type}{self.gpu_arch_version}"

.github/scripts/install_nvidia_utils_linux.sh vendored Executable file

@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -eou pipefail
DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) \
DRIVER_FN="NVIDIA-Linux-x86_64-460.39.run"
YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
install_nvidia_docker2_amzn2() {
(
set -x
# Needed for yum-config-manager
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
sudo yum install -y nvidia-docker2
sudo systemctl restart docker
)
}
install_nvidia_driver_amzn2() {
(
set -x
sudo yum groupinstall -y "Development Tools"
# ensure our kernel install is the same as our underlying kernel,
# groupinstall "Development Tools" has a habit of mismatching kernel headers
sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash /tmp/nvidia_driver -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
sudo rm -fv /tmp/nvidia_driver
nvidia-smi
)
}
# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
amzn*)
install_nvidia_docker2_amzn2
;;
*)
echo "ERROR: Unknown distribution ${DISTRIBUTION}"
exit 1
;;
esac
echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
amzn*)
install_nvidia_driver_amzn2
;;
*)
echo "ERROR: Unknown distribution ${DISTRIBUTION}"
exit 1
;;
esac

.github/scripts/lint_native_functions.py vendored Executable file

@ -0,0 +1,51 @@
#!/usr/bin/env python3
'''
Verify that it is possible to round-trip native_functions.yaml via ruamel under some
configuration. Keeping native_functions.yaml consistent in this way allows us to
run codemods on the file using ruamel without introducing line noise. Note that we don't
want to normalize the YAML file, as that would lead to lots of spurious lint failures. Anything
that ruamel understands how to roundtrip, e.g., whitespace and comments, is OK!
ruamel is a bit picky about inconsistent indentation, so you will have to indent your
file properly. Also, if you are working on changing the syntax of native_functions.yaml,
you may find that you want to use some format that is not what ruamel prefers. If so,
it is OK to modify this script (instead of reformatting native_functions.yaml)--the point
is simply to make sure that there is *some* configuration of ruamel that can round trip
the YAML, not to be prescriptive about it.
'''
import ruamel.yaml
import difflib
import sys
from pathlib import Path
from io import StringIO
def fn(base):
return str(base / Path("aten/src/ATen/native/native_functions.yaml"))
with open(Path(__file__).parent.parent.parent / fn('.'), "r") as f:
contents = f.read()
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)
# Cuz ruamel's author intentionally didn't include conversion to string
# https://stackoverflow.com/questions/47614862/best-way-to-use-ruamel-yaml-to-dump-to-string-not-to-stream
string_stream = StringIO()
yaml.dump(r, string_stream)
new_contents = string_stream.getvalue()
string_stream.close()
if contents != new_contents:
print("""\
## LINT FAILURE: native_functions.yaml ##
native_functions.yaml failed lint; please apply the diff below to fix lint.
If you think this is in error, please see .github/scripts/lint_native_functions.py
""", file=sys.stderr)
sys.stdout.writelines(difflib.unified_diff(contents.splitlines(True), new_contents.splitlines(True), fn('a'), fn('b')))
sys.exit(1)
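A sketch of the round-trip check this linter performs, applied to a small inline document instead of native_functions.yaml; the sample line is illustrative:

```python
# ruamel's round-trip mode preserves comments and formatting, so a file
# that it can load and dump unchanged is safe to codemod with ruamel.
from io import StringIO
import ruamel.yaml  # assumes ruamel.yaml is installed

source = "- func: add(Tensor a, Tensor b) -> Tensor  # a comment\n"

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
loaded = yaml.load(source)

out = StringIO()
yaml.dump(loaded, out)
print(out.getvalue() == source)  # True if ruamel can round-trip it cleanly
```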

.github/scripts/parse_ref.py vendored Executable file

@ -0,0 +1,21 @@
#!/usr/bin/env python3
import os
import re
def main() -> None:
ref = os.environ['GITHUB_REF']
m = re.match(r'^refs/(\w+)/(.*)$', ref)
if m:
category, stripped = m.groups()
if category == 'heads':
print(f'::set-output name=branch::{stripped}')
elif category == 'pull':
print(f'::set-output name=branch::pull/{stripped.split("/")[0]}')
elif category == 'tags':
print(f'::set-output name=tag::{stripped}')
if __name__ == '__main__':
main()
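A sketch of what the script emits for typical GITHUB_REF values; it assumes it is run from the repository root:

```python
# Expected outputs, consumed by later workflow steps:
#   refs/heads/master   -> ::set-output name=branch::master
#   refs/pull/123/merge -> ::set-output name=branch::pull/123
#   refs/tags/v1.9.0    -> ::set-output name=tag::v1.9.0
import os
import subprocess

env = {**os.environ, "GITHUB_REF": "refs/pull/123/merge"}
print(subprocess.check_output(
    ["python3", ".github/scripts/parse_ref.py"], env=env, text=True))
```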


@ -0,0 +1,37 @@
#!/usr/bin/env python3
'''
This file verifies that the workflows that are potentially canceled in our cancel_redundant_workflow.yml
match the workflows we have running on pull requests (found in .github/workflows). This way, anytime a
workflow is added or removed, people can be reminded to modify the cancel_redundant_workflow.yml accordingly.
'''
import ruamel.yaml
from pathlib import Path
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.boolean_representation = ['False', 'True']
yaml.default_flow_style = False
if __name__ == '__main__':
workflow_paths = (Path(__file__).parent.parent / 'workflows').rglob('*')
workflows = []
for path in workflow_paths:
if path.suffix in {'.yml', '.yaml'}:
with open(path) as f:
data = yaml.load(f)
assert 'name' in data, 'Every GHA workflow must have a name.'
if 'pull_request' in data['on']:
workflows.append(data['name'])
with open('.github/workflows/cancel_redundant_workflows.yml', 'r') as f:
data = yaml.load(f)
# Replace workflows to cancel
data['on']['workflow_run']['workflows'] = sorted(workflows)
with open('.github/workflows/cancel_redundant_workflows.yml', 'w') as f:
yaml.dump(data, f)

.github/scripts/report_git_status.sh vendored Executable file

@ -0,0 +1,5 @@
#!/usr/bin/env bash
CHANGES=$(git status --porcelain)
echo "$CHANGES"
git diff
[ -z "$CHANGES" ]

.github/scripts/run_torchbench.py vendored Normal file

@ -0,0 +1,103 @@
"""
Generate a torchbench test report from a file containing the PR body.
Currently, only supports running tests on specified model names
Testing environment:
- Intel Xeon 8259CL @ 2.50 GHz, 24 Cores with disabled Turbo and HT
- Nvidia Tesla T4
- Nvidia Driver 450.51.06
- Python 3.7
- CUDA 10.2
"""
# Known issues:
# 1. Does not reuse the build artifact in other CI workflows
# 2. CI jobs are serialized because there is only one worker
import os
import pathlib
import argparse
import subprocess
from typing import List
CUDA_VERSION = "cu102"
PYTHON_VERSION = "3.7"
TORCHBENCH_CONFIG_NAME = "config.yaml"
MAGIC_PREFIX = "RUN_TORCHBENCH:"
ABTEST_CONFIG_TEMPLATE = """# This config is automatically generated by run_torchbench.py
start: {control}
end: {treatment}
threshold: 100
direction: decrease
timeout: 720
tests:"""
def gen_abtest_config(control: str, treatment: str, models: List[str]):
d = {}
d["control"] = control
d["treatment"] = treatment
config = ABTEST_CONFIG_TEMPLATE.format(**d)
if models == ["ALL"]:
return config + "\n"
for model in models:
config = f"{config}\n - {model}"
config = config + "\n"
return config
def deploy_torchbench_config(output_dir: str, config: str):
# Create test dir if needed
pathlib.Path(output_dir).mkdir(exist_ok=True)
# TorchBench config file name
config_path = os.path.join(output_dir, TORCHBENCH_CONFIG_NAME)
with open(config_path, "w") as fp:
fp.write(config)
def extract_models_from_pr(torchbench_path: str, prbody_file: str) -> List[str]:
model_list = []
with open(prbody_file, "r") as pf:
lines = map(lambda x: x.strip(), pf.read().splitlines())
magic_lines = list(filter(lambda x: x.startswith(MAGIC_PREFIX), lines))
if magic_lines:
# Only the first magic line will be respected.
model_list = list(map(lambda x: x.strip(), magic_lines[0][len(MAGIC_PREFIX):].split(",")))
# Shortcut: if model_list is ["ALL"], run all the tests
if model_list == ["ALL"]:
return model_list
# Sanity check: make sure all the user specified models exist in torchbench repository
benchmark_path = os.path.join(torchbench_path, "torchbenchmark", "models")
full_model_list = [model for model in os.listdir(benchmark_path) if os.path.isdir(os.path.join(benchmark_path, model))]
for m in model_list:
if m not in full_model_list:
print(f"The model {m} you specified does not exist in TorchBench suite. Please double check.")
return []
return model_list
def run_torchbench(pytorch_path: str, torchbench_path: str, output_dir: str):
# Copy system environment so that we will not override
env = dict(os.environ)
command = ["python", "bisection.py", "--work-dir", output_dir,
"--pytorch-src", pytorch_path, "--torchbench-src", torchbench_path,
"--config", os.path.join(output_dir, "config.yaml"),
"--output", os.path.join(output_dir, "result.txt")]
subprocess.check_call(command, cwd=torchbench_path, env=env)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Run TorchBench tests based on PR')
parser.add_argument('--pr-num', required=True, type=str, help="The Pull Request number")
parser.add_argument('--pr-base-sha', required=True, type=str, help="The Pull Request base hash")
parser.add_argument('--pr-head-sha', required=True, type=str, help="The Pull Request head hash")
parser.add_argument('--pr-body', required=True, help="The file that contains body of a Pull Request")
parser.add_argument('--pytorch-path', required=True, type=str, help="Path to pytorch repository")
parser.add_argument('--torchbench-path', required=True, type=str, help="Path to TorchBench repository")
args = parser.parse_args()
output_dir: str = os.path.join(os.environ["HOME"], ".torchbench", "bisection", f"pr{args.pr_num}")
# Identify the specified models and verify the input
models = extract_models_from_pr(args.torchbench_path, args.pr_body)
if not models:
print("Can't parse the model filter from the pr body. Currently we only support allow-list.")
exit(1)
print(f"Ready to run TorchBench with benchmark. Result will be saved in the directory: {output_dir}.")
# Run TorchBench with the generated config
torchbench_config = gen_abtest_config(args.pr_base_sha, args.pr_head_sha, models)
deploy_torchbench_config(output_dir, torchbench_config)
run_torchbench(pytorch_path=args.pytorch_path, torchbench_path=args.torchbench_path, output_dir=output_dir)
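A sketch of the PR-body convention the script parses: only the first line starting with the magic prefix is honored, with comma-separated model names. The model names here are illustrative:

```python
# Mirrors extract_models_from_pr()'s magic-line handling on an inline body.
prbody = """Some PR description.

RUN_TORCHBENCH: yolov3, resnet50
RUN_TORCHBENCH: ignored_second_line
"""

MAGIC_PREFIX = "RUN_TORCHBENCH:"
magic = [l.strip() for l in prbody.splitlines() if l.strip().startswith(MAGIC_PREFIX)]
models = [m.strip() for m in magic[0][len(MAGIC_PREFIX):].split(",")] if magic else []
print(models)  # ['yolov3', 'resnet50']
```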


@ -0,0 +1,368 @@
# Template is at: .github/templates/linux_ci_workflow.yml
# Generation script: .github/scripts/generate_linux_ci_workflows.py
name: Linux CI (!{{ build_environment }})
on:
# TODO: Enable pull_request builds when we can verify capacity can be met by auto-scalers
{%- if on_pull_request %}
pull_request:
{%- endif %}
push:
branches:
- master
- release/*
workflow_dispatch:
env:
BUILD_ENVIRONMENT: !{{ build_environment }}
DOCKER_IMAGE_BASE: !{{ docker_image_base }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
TORCH_CUDA_ARCH_LIST: 5.2
IN_CI: 1
# Used for custom_operator, jit_hooks, custom_backend, see .jenkins/pytorch/build.sh
CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
jobs:
calculate-docker-image:
runs-on: linux.2xlarge
env:
DOCKER_BUILDKIT: 1
timeout-minutes: 90
outputs:
docker_image: ${{ steps.calculate-tag.outputs.docker_image }}
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow use of git merge-base
fetch-depth: 0
- name: Calculate docker image tag
id: calculate-tag
run: |
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
echo "::set-output name=docker_tag::${DOCKER_TAG}"
echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"
- name: Check if image should be built
id: check
env:
DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker_tag }}
BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
eval "$(aws ecr get-login --no-include-email --region us-east-1)"
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then
exit 0
fi
if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then
# if we're on the base branch then use the parent commit
MERGE_BASE=$(git rev-parse HEAD~)
else
# otherwise we're on a PR, so use the most recent base commit
MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION")
fi
# Covers the case where a previous tag doesn't exist for the tree
# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
fi
echo ::set-output name=rebuild::yes
- name: Build and push docker image
if: steps.check.outputs.rebuild
env:
DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker_tag }}
DOCKER_SKIP_S3_UPLOAD: 1
run: |
export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/}
cd .circleci/docker && ./build_docker.sh
build:
runs-on: linux.2xlarge
needs: calculate-docker-image
env:
DOCKER_IMAGE: ${{ needs.calculate-docker-image.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
fetch-depth: 0 # deep clone, to allow sharding to use git rev-list
submodules: recursive
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
- name: Preserve github env variables for use in docker
run: |
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Build PyTorch
run: |
docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh'
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Archive artifacts into zip
run: |
zip -r artifacts.zip dist/ build/
- uses: actions/upload-artifact@v2
name: Store PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 30
if-no-files-found: error
path:
artifacts.zip
- name: Clean up docker images
if: always()
run: |
# Prune all of the docker images
docker system prune -af
test:
runs-on: !{{ test_runner_type }}
needs:
- calculate-docker-image
- build
env:
DOCKER_IMAGE: ${{ needs.calculate-docker-image.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)/../":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
if: ${{ contains(env.BUILD_ENVIRONMENT, 'cuda') }}
run: |
bash .github/scripts/install_nvidia_utils_linux.sh
echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}"
- name: Determine shm-size
run: |
shm_size="1g"
case "${BUILD_ENVIRONMENT}" in
*cuda*)
shm_size="2g"
;;
*rocm*)
shm_size="8g"
;;
esac
echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}"
- uses: actions/download-artifact@v2
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
- name: Unzip artifacts
run: |
unzip -o artifacts.zip
- name: Output disk space left
run: |
sudo df -H
- name: Preserve github env variables for use in docker
run: |
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test PyTorch
run: |
# TODO: Stop building test binaries as part of the build phase
# Used for GPU_FLAG since that doesn't play nice
# shellcheck disable=SC2086
docker run \
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e IN_CI \
-e MAX_JOBS="$(nproc --ignore=2)" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--shm-size="${SHM_SIZE}" \
--tty \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
sh -c 'sudo chown -R jenkins . && pip install dist/*.whl && .jenkins/pytorch/test.sh'
- name: Chown workspace
if: always()
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- uses: actions/upload-artifact@v2
name: Store PyTorch Test Reports
if: always()
with:
name: test-reports
retention-days: 30
if-no-files-found: error
path:
test/**/*.xml
- name: Clean up docker images
if: always()
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
# Prune all of the docker images
docker system prune -af
render_test_results:
if: always()
needs:
- test
runs-on: ubuntu-18.04
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow tools/print_test_stats.py to use Git commands
fetch-depth: 0
- uses: actions/download-artifact@v2
name: Download PyTorch Test Reports
with:
name: test-reports
path: test/test-reports
- uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
# boto3 version copied from .circleci/docker/common/install_conda.sh
run: |
pip install -r requirements.txt
pip install boto3==1.16.34 junitparser rich
- name: Output Test Results (Click Me)
run: |
python tools/render_junit.py test
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Display and upload test statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_JOB: !{{ build_environment }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
{%- if enable_doc_jobs %}
pytorch_python_doc_build:
runs-on: linux.2xlarge
needs:
- calculate-docker-image
- build
env:
DOCKER_IMAGE: ${{ needs.calculate-docker-image.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
fetch-depth: 0 # deep clone, to allow sharding to use git rev-list
submodules: recursive
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
- name: Preserve github env variables for use in docker
run: |
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
- uses: actions/download-artifact@v2
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
- name: Unzip artifacts
run: |
unzip -o artifacts.zip
- name: Build Python Doc in Docker
run: |
set -ex
time docker pull "${DOCKER_IMAGE}" > /dev/null
echo "${GITHUB_REF}"
ref=${GITHUB_REF##*/}
target=${ref//v}
docker run \
-e BUILD_ENVIRONMENT \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e IN_CI \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e CIRCLE_SHA1="$GITHUB_SHA" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--name="$GITHUB_SHA" \
--tty \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
bash -c "sudo chown -R jenkins . && pip install dist/*.whl && ./.circleci/scripts/python_doc_push_script.sh docs/$target $target site"
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
- name: Archive artifacts into zip
run: |
zip -r pytorch_github_io.zip "${GITHUB_WORKSPACE}/pytorch.github.io"
- uses: actions/upload-artifact@v2
name: Store PyTorch Build Artifacts
with:
name: pytorch_github_io
if-no-files-found: error
path: pytorch_github_io.zip
- name: Clean up docker images
if: always()
run: |
# Prune all of the docker images
docker system prune -af
{%- endif -%}
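A plain-Python sketch of the rebuild decision encoded in the "Check if image should be built" step above. It assumes a full clone, and "origin/master" stands in for the PR base revision:

```python
import subprocess

def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True).strip()

docker_tag = git("rev-parse", "HEAD:.circleci/docker")   # tree hash of the docker dir
merge_base = git("merge-base", "HEAD", "origin/master")  # base commit of the change
previous_tag = git("rev-parse", f"{merge_base}:.circleci/docker")

# Rebuild only when the docker directory changed relative to the merge base;
# if the tags match, the prebuilt image is expected to exist already.
rebuild = previous_tag != docker_tag
print(f"docker_tag={docker_tag} rebuild={rebuild}")
```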

.github/workflows/add_annotations.yml vendored Normal file

@ -0,0 +1,66 @@
name: Add annotations
on:
workflow_run:
types:
- completed
workflows:
- Lint
jobs:
annotate:
strategy:
fail-fast: false
matrix:
name:
- flake8-py3
- clang-tidy
runs-on: ubuntu-18.04
steps:
- name: Download artifact
uses: actions/github-script@v3
env:
RUN_ID: ${{ github.event.workflow_run.id }}
LINT_NAME: ${{ matrix.name }}
with:
# https://securitylab.github.com/research/github-actions-preventing-pwn-requests/
script: |
const artifacts = await github.actions.listWorkflowRunArtifacts({
owner: context.repo.owner,
repo: context.repo.repo,
run_id: process.env.RUN_ID,
});
const filteredArtifacts = artifacts.data.artifacts.filter(artifact => {
return artifact.name == process.env.LINT_NAME;
});
if (filteredArtifacts.length > 0) {
const matchArtifact = filteredArtifacts[0];
const download = await github.actions.downloadArtifact({
owner: context.repo.owner,
repo: context.repo.repo,
artifact_id: matchArtifact.id,
archive_format: 'zip',
});
const fs = require('fs');
fs.writeFileSync(
`${process.env.GITHUB_WORKSPACE}/linter-output.zip`,
Buffer.from(download.data),
);
}
- name: Unzip artifact
id: unzip
run: |
if unzip linter-output.zip annotations.json commit-sha.txt; then
echo ::set-output \
name=sha::"$(grep -Em1 '^[[:xdigit:]]{40}$' commit-sha.txt)"
fi
- if: steps.unzip.outputs.sha
name: Add annotations
uses: pytorch/add-annotations-github-action@master
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
check_name: ${{ matrix.name }}
linter_output_path: annotations.json
commit_sha: ${{ steps.unzip.outputs.sha }}
mode: json

.github/workflows/auto_label.yml vendored Normal file

@ -0,0 +1,47 @@
name: Label PRs & Issues
on:
issues:
types: [opened, edited]
pull_request_target:
types: [edited, opened, synchronize, reopened]
jobs:
auto-label-rocm:
runs-on: ubuntu-18.04
steps:
- name: Retrieve information
id: vars
env:
EVENT_NAME: ${{ github.event_name }}
PR_TITLE: ${{ github.event.pull_request.title }}
PR_NUMBER: ${{ github.event.pull_request.number }}
ISSUE_TITLE: ${{ github.event.issue.title }}
ISSUE_NUMBER: ${{ github.event.issue.number }}
run: |
set -eux
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
TITLE="${PR_TITLE}"
ISSUE_NUMBER="${PR_NUMBER}"
else
TITLE="${ISSUE_TITLE}"
# ISSUE_NUMBER is already set
fi
echo ::set-output name=TITLE::"${TITLE}"
echo ::set-output name=ISSUE_NUMBER::"${ISSUE_NUMBER}"
- name: Auto-label ROCm
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
TITLE: ${{ steps.vars.outputs.TITLE }}
ISSUE_NUMBER: ${{ steps.vars.outputs.ISSUE_NUMBER }}
OWNER: ${{ github.repository_owner }}
REPO: ${{ github.event.repository.name }}
run: |
set -eux
if [[ "${TITLE,,}" == *rocm* ]]; then
curl \
-X POST \
-H "Authorization: token ${GITHUB_TOKEN}" \
"https://api.github.com/repos/${OWNER}/${REPO}/issues/${ISSUE_NUMBER}/labels" \
-d '{"labels":["module: rocm"]}'
fi
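The same labeling call as a Python sketch; the token and issue number are placeholders:

```python
# POST to the standard issues/labels endpoint, matching the curl call above.
import requests

token, owner, repo, issue_number = "<token>", "pytorch", "pytorch", 1
resp = requests.post(
    f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/labels",
    headers={"Authorization": f"token {token}"},
    json={"labels": ["module: rocm"]},
)
resp.raise_for_status()
```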

.github/workflows/build_linux_conda.yml vendored Normal file

@ -0,0 +1,95 @@
name: Build Linux Conda Packages
on:
# TODO: These are only runnable from workflow_dispatch, we need to eventually add
# a cron
# TODO: Add an on_release trigger to build on tags
workflow_dispatch:
jobs:
generate-build-matrix:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: ubuntu-18.04
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
container:
image: python:3.9
steps:
- name: Clone pytorch/pytorch
uses: actions/checkout@v2
- name: Generating build matrix
id: set-matrix
run: |
# outputting for debugging purposes
MATRIX=$(python .github/scripts/generate_binary_build_matrix.py conda)
echo "${MATRIX}"
echo "::set-output name=matrix::${MATRIX}"
build-conda:
if: ${{ github.repository_owner == 'pytorch' }}
needs: generate-build-matrix
runs-on: linux.2xlarge
strategy:
matrix:
${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
fail-fast: false
container:
image: ${{ matrix.container_image }}
env:
DESIRED_PYTHON: ${{ matrix.python_version }}
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: ${{ matrix.gpu_arch_version }}
GPU_ARCH_VERSION: ${{ matrix.GPU_ARCH_VERSION }}
GPU_ARCH_TYPE: ${{ matrix.gpu_arch_type }}
NO_BUILD_SUFFIX: True
# TODO: This is a legacy variable, we should just default all build to use
# this folder within the conda/build_pytorch.sh script
TORCH_CONDA_BUILD_FOLDER: pytorch-nightly
# TODO: Another legacy env variable that isn't useful anymore, should default
# to pytorch within the scripts directly
ANACONDA_USER: pytorch
PYTORCH_FINAL_PACKAGE_DIR: /remote
# We specify the CONDA_BLD_PATH here since conda creates extremely long paths
# for its default build path
CONDA_BLD_PATH: /build
PYTORCH_BUILD_NUMBER: 1
SKIP_ALL_TESTS: 1
steps:
- name: Clean runner workspace
run: rm -rf "$GITHUB_WORKSPACE"
- name: Clone pytorch/pytorch
uses: actions/checkout@v2
with:
path: pytorch
submodules: recursive
- name: Clone pytorch/builder
uses: actions/checkout@v2
with:
repository: pytorch/builder
path: builder
- name: Generate version string
working-directory: pytorch/
run: |
version=$(.github/scripts/generate_pytorch_version.py)
echo "Generated version: ${version}"
echo "PYTORCH_BUILD_VERSION=${version}" >> "$GITHUB_ENV"
- name: Set BUILD_SPLIT_CUDA
if: ${{ matrix.gpu_arch_type == 'cuda' && matrix.gpu_arch_version == '11.1' }}
run: |
echo "BUILD_SPLIT_CUDA=1" >> "$GITHUB_ENV"
# TODO: Remove this once we remove the need for the directories to be
# in specific locations
- name: Symlink repositories to root directory (for legacy scripts purposes)
run: |
mv "$PWD"/pytorch /pytorch
mv "$PWD"/builder /builder
# TODO: Bundle the correct build script in the base container image so
# that we don't have to do this type of specification
- name: Build PyTorch binary
run: |
/builder/conda/build_pytorch.sh
- uses: actions/upload-artifact@v2
with:
name: pytorch-conda-py${{ matrix.python_version }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
path: /remote/**/*.bz2
# TODO: Add a step here for uploading binaries


@ -0,0 +1,94 @@
name: Build Linux libtorch
on:
# TODO: These are only runnable from workflow_dispatch, we need to eventually add
# a cron
# TODO: Add an on_release trigger to build on tags
workflow_dispatch:
jobs:
generate-build-matrix:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: ubuntu-18.04
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
container:
image: python:3.9
steps:
- name: Clone pytorch/pytorch
uses: actions/checkout@v2
- name: Generating build matrix
id: set-matrix
run: |
# outputting for debugging purposes
MATRIX=$(python .github/scripts/generate_binary_build_matrix.py libtorch)
echo "${MATRIX}"
echo "::set-output name=matrix::${MATRIX}"
build-libtorch:
if: ${{ github.repository_owner == 'pytorch' }}
needs: generate-build-matrix
runs-on: linux.2xlarge
strategy:
matrix:
${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
fail-fast: false
container:
image: ${{ matrix.container_image }}
env:
# TODO: remove this var from the libtorch builder script(s)
DESIRED_PYTHON: '3.7'
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: ${{ matrix.gpu_arch_version }}
GPU_ARCH_VERSION: ${{ matrix.GPU_ARCH_VERSION }}
GPU_ARCH_TYPE: ${{ matrix.gpu_arch_type }}
BUILD_PYTHONLESS: 1
LIBTORCH_VARIANT: ${{ matrix.libtorch_variant }}
# TODO: remove this and bake env var into the Docker image
DESIRED_DEVTOOLSET: ${{ matrix.devtoolset }}
PYTORCH_BUILD_NUMBER: 1
SKIP_ALL_TESTS: 1
steps:
- name: Clean runner workspace
run: rm -rf "$GITHUB_WORKSPACE"
- name: Clone pytorch/pytorch
uses: actions/checkout@v2
with:
path: pytorch
submodules: recursive
- name: Clone pytorch/builder
uses: actions/checkout@v2
with:
repository: pytorch/builder
path: builder
- name: Generate version string
working-directory: pytorch/
run: |
version=$(.github/scripts/generate_pytorch_version.py)
echo "Generated version: ${version}"
echo "PYTORCH_BUILD_VERSION=${version}" >> "$GITHUB_ENV"
- name: Set BUILD_SPLIT_CUDA
if: ${{ matrix.gpu_arch_type == 'cuda' && matrix.gpu_arch_version == '11.1' }}
run: |
echo "BUILD_SPLIT_CUDA=1" >> "$GITHUB_ENV"
# TODO: Remove this once we remove the need for the directories to be
# in specific locations
- name: Symlink repositories to root directory (for legacy scripts purposes)
run: |
ln -s "$PWD"/pytorch /pytorch
ln -s "$PWD"/builder /builder
# TODO: Bundle the correct build script in the base container image so
# that we don't have to do this type of specification
- name: Build PyTorch binary (CUDA specific)
if: ${{ matrix.gpu_arch_type == 'cuda' }}
run: |
/builder/manywheel/build.sh
- name: Build PyTorch binary (CPU specific)
if: ${{ matrix.gpu_arch_type == 'cpu' }}
run: |
/builder/manywheel/build_cpu.sh
- uses: actions/upload-artifact@v2
with:
name: pytorch-libtorch-${{ matrix.libtorch_variant }}-${{ matrix.devtoolset }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
path: /remote/**/*.zip
# TODO: Add a step here for uploading binaries


@ -21,8 +21,8 @@ jobs:
       id: set-matrix
       run: |
         # outputting for debugging purposes
-        python .github/scripts/generate_binary_build_matrix.py
-        MATRIX=$(python .github/scripts/generate_binary_build_matrix.py)
+        MATRIX=$(python .github/scripts/generate_binary_build_matrix.py wheels)
+        echo "${MATRIX}"
         echo "::set-output name=matrix::${MATRIX}"
   build-wheel:
     if: ${{ github.repository_owner == 'pytorch' }}
@ -31,6 +31,7 @@ jobs:
     strategy:
       matrix:
         ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
+      fail-fast: false
     container:
       image: ${{ matrix.container_image }}
     env:
@ -43,6 +44,8 @@ jobs:
       PYTORCH_BUILD_NUMBER: 1
       SKIP_ALL_TESTS: 1
     steps:
+      - name: Clean runner workspace
+        run: rm -rf "$GITHUB_WORKSPACE"
       - name: Clone pytorch/pytorch
         uses: actions/checkout@v2
         with:
@ -58,13 +61,17 @@ jobs:
         run: |
           version=$(.github/scripts/generate_pytorch_version.py)
           echo "Generated version: ${version}"
-          echo "PYTORCH_BUILD_VERSION=${version}" >> $GITHUB_ENV
+          echo "PYTORCH_BUILD_VERSION=${version}" >> "$GITHUB_ENV"
+      - name: Set BUILD_SPLIT_CUDA
+        if: ${{ matrix.gpu_arch_type == 'cuda' && matrix.gpu_arch_version == '11.1' }}
+        run: |
+          echo "BUILD_SPLIT_CUDA=1" >> "$GITHUB_ENV"
       # TODO: Remove this once we remove the need for the directories to be
       # in specific locations
       - name: Symlink repositories to root directory (for legacy scripts purposes)
         run: |
-          ln -s $(pwd)/pytorch /pytorch
-          ln -s $(pwd)/builder /builder
+          ln -s "$PWD"/pytorch /pytorch
+          ln -s "$PWD"/builder /builder
       # TODO: Bundle the correct build script in the base container image so
       # that we don't have to do this type of specification
       - name: Build PyTorch binary (CUDA specific)

@ -0,0 +1,24 @@
name: Cancel redundant workflows
on:
workflow_run:
types:
- requested
# NOTE: Make sure to add to this list as you add more workflows running on 'pull_request'
workflows:
- Lint
- Linux CI (pytorch-linux-xenial-py3.6-gcc5.4)
- Test tools
- TorchBench CI (pytorch-linux-py3.7-cu102)
- clang-format
jobs:
cancel:
# We do not want to cancel reruns on master
if: github.event.workflow_run.head_branch != 'master'
runs-on: ubuntu-18.04
steps:
- name: Cancel duplicate workflow runs
uses: potiuk/cancel-workflow-runs@a81b3c4d59c61e27484cfacdc13897dd908419c9
with:
cancelMode: duplicates
token: ${{ secrets.GITHUB_TOKEN }}
sourceRunId: ${{ github.event.workflow_run.id }}


@ -8,46 +8,37 @@ jobs:
     runs-on: ubuntu-18.04
     steps:
       - name: Setup Python
-        uses: actions/setup-python@v1
+        uses: actions/setup-python@v2
         with:
           python-version: 3.x
           architecture: x64
       - name: Fetch PyTorch
-        uses: actions/checkout@v1
-      - name: Checkout PR tip
-        run: |
-          set -eux
-          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
-            # We are on a PR, so actions/checkout leaves us on a merge commit.
-            # Check out the actual tip of the branch.
-            git checkout ${{ github.event.pull_request.head.sha }}
-          fi
-          echo ::set-output name=commit_sha::$(git rev-parse HEAD)
-        id: get_pr_tip
+        uses: actions/checkout@v2
+        with:
+          fetch-depth: 0 # deep clone, to allow us to use git merge-base
       - name: Run clang-format
+        env:
+          BASE_SHA: ${{ github.event.pull_request.base.sha }}
         run: |
           set -eu
           # This is necessary to get the same results regardless of whether the
           # PR was opened directly or from a forked repo. See: `9f890a92` for more info.
           git remote add upstream https://github.com/pytorch/pytorch
           git fetch upstream "$GITHUB_BASE_REF"
-          BASE_SHA=${{ github.event.pull_request.base.sha }}
-          HEAD_SHA=${{ github.event.pull_request.head.sha }}
-          MERGE_BASE=$(git merge-base $BASE_SHA $HEAD_SHA)
           # only run clang-format on allowlisted files
           echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
           echo "| clang-format failures found! Run: "
-          echo "| tools/clang_format_ci.sh ${MERGE_BASE} "
+          echo "| tools/clang_format_ci.sh ${BASE_SHA} "
           echo "| to fix this error. "
           echo "| For more info, see: https://github.com/pytorch/pytorch/wiki/clang-format "
           echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
-          tools/clang_format_ci.sh ${MERGE_BASE}
+          tools/clang_format_ci.sh "${BASE_SHA}"
           GIT_DIFF=$(git diff)
           if [[ -z $GIT_DIFF ]]; then
             exit 0
           fi
-          echo $GIT_DIFF
+          echo "$GIT_DIFF"
           exit 1

@ -11,113 +11,234 @@ jobs:
runs-on: ubuntu-18.04 runs-on: ubuntu-18.04
steps: steps:
- name: Setup Python - name: Setup Python
uses: actions/setup-python@v1 uses: actions/setup-python@v2
with: with:
python-version: 3.x python-version: 3.x
architecture: x64 architecture: x64
- name: Checkout PyTorch - name: Checkout PyTorch
uses: actions/checkout@v1 uses: actions/checkout@v2
- name: Checkout PR tip - name: Install requirements
run: | id: requirements
set -eux run: pip install -r requirements.txt
if [[ "${{ github.event_name }}" == "pull_request" ]]; then
# We are on a PR, so actions/checkout leaves us on a merge commit.
# Check out the actual tip of the branch.
git checkout ${{ github.event.pull_request.head.sha }}
fi
echo ::set-output name=commit_sha::$(git rev-parse HEAD)
id: get_pr_tip
- name: Ensure consistent CircleCI YAML config - name: Ensure consistent CircleCI YAML config
if: always() && steps.requirements.outcome == 'success'
run: cd .circleci && ./ensure-consistency.py
- name: Ensure consistent GHA workflows in cancel_redundant_workflows.yml
if: always() && steps.requirements.outcome == 'success'
run: | run: |
pip install -r requirements.txt pip install ruamel.yaml==0.17.4
cd .circleci && ./ensure-consistency.py echo "Please locally run .github/scripts/regenerate_cancel_redundant_workflow.py and commit if this step fails."
- name: Shellcheck Jenkins scripts .github/scripts/regenerate_cancel_redundant_workflow.py
# https://github.com/koalaman/shellcheck#installing-a-pre-compiled-binary git diff --exit-code .github/workflows/cancel_redundant_workflows.yml
- name: Lint native_functions.yaml
if: always() && steps.requirements.outcome == 'success'
run: | run: |
scversion="stable" pip install ruamel.yaml==0.17.4
.github/scripts/lint_native_functions.py
- name: Extract scripts from GitHub Actions workflows
if: always() && steps.requirements.outcome == 'success'
run: |
# For local lints, remove the .extracted_scripts folder if it was already there
rm -rf .extracted_scripts
tools/extract_scripts.py --out=.extracted_scripts
- name: Install ShellCheck
id: install_shellcheck
if: always()
# https://github.com/koalaman/shellcheck/tree/v0.7.2#installing-a-pre-compiled-binary
run: |
set -x
scversion="v0.7.2"
wget -qO- "https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz" | tar -xJv wget -qO- "https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz" | tar -xJv
sudo cp "shellcheck-${scversion}/shellcheck" /usr/bin/ sudo cp "shellcheck-${scversion}/shellcheck" /usr/bin/
rm -r "shellcheck-${scversion}" rm -r "shellcheck-${scversion}"
shellcheck --version shellcheck --version
.jenkins/run-shellcheck.sh - name: Run ShellCheck
if: always() && steps.install_shellcheck.outcome == 'success'
run: |
tools/run_shellcheck.sh .jenkins/pytorch .extracted_scripts
- name: Ensure correct trailing newlines
if: always() && steps.requirements.outcome == 'success'
run: |
(! git --no-pager grep -Il '' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude)**.expect' ':(exclude)tools/clang_format_hash' | tools/trailing_newlines.py || (echo "The above files do not have correct trailing newlines; please normalize them"; false))
- name: Ensure no trailing spaces
if: always()
run: |
(! git --no-pager grep -In '[[:blank:]]$' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above lines have trailing spaces; please remove them"; false))
- name: Ensure no tabs - name: Ensure no tabs
if: always()
run: | run: |
(! git grep -I -l $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above files have tabs; please convert them to spaces"; false)) (! git --no-pager grep -In $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above lines have tabs; please convert them to spaces"; false))
- name: Ensure no non-breaking spaces
if: always()
run: |
# NB: We use 'printf' below rather than '\u000a' since bash pre-4.2
# does not support the '\u000a' syntax (which is relevant for local linters)
(! git --no-pager grep -In "$(printf '\xC2\xA0')" -- . || (echo "The above lines have non-breaking spaces (U+00A0); please convert them to spaces (U+0020)"; false))
- name: Ensure canonical include - name: Ensure canonical include
if: always()
run: | run: |
(! git grep -I -l $'#include "' -- ./c10 ./aten ./torch/csrc ':(exclude)aten/src/ATen/native/quantized/cpu/qnnpack/**' || (echo "The above files have include with quotes; please convert them to #include <xxxx>"; false)) (! git --no-pager grep -In $'#include "' -- ./c10 ./aten ./torch/csrc ':(exclude)aten/src/ATen/native/quantized/cpu/qnnpack/**' || (echo "The above lines have include with quotes; please convert them to #include <xxxx>"; false))
# note that this next step depends on a clean heckout; - name: Ensure no versionless Python shebangs
if: always()
run: |
(! git --no-pager grep -In '#!.*python$' -- . || (echo "The above lines have versionless Python shebangs; please specify either python2 or python3"; false))
- name: Ensure no unqualified noqa
if: always()
run: |
# shellcheck disable=SC2016
(! git --no-pager grep -InP '# noqa(?!: [A-Z]+\d{3})' -- '**.py' '**.pyi' ':(exclude)caffe2' || (echo 'The above lines have unqualified `noqa`; please convert them to `noqa: XXXX`'; false))
- name: Ensure no unqualified type ignore
if: always()
run: |
# shellcheck disable=SC2016
(! git --no-pager grep -InP '# type:\s*ignore(?!\[)' -- '**.py' '**.pyi' ':(exclude)test/test_jit.py' || (echo 'The above lines have unqualified `type: ignore`; please convert them to `type: ignore[xxxx]`'; false))
# note that this next step depends on a clean checkout;
# if you run it locally then it will likely to complain # if you run it locally then it will likely to complain
# about all the generated files in torch/test # about all the generated files in torch/test
- name: Ensure C++ source files are not executable - name: Ensure C++ source files are not executable
if: always()
run: | run: |
# shellcheck disable=SC2016
(! find . \( -path ./third_party -o -path ./.git -o -path ./torch/bin -o -path ./build \) -prune -o -type f -executable -regextype posix-egrep -not -regex '.+(\.(bash|sh|py|so)|git-pre-commit|git-clang-format|gradlew)$' -print | grep . || (echo 'The above files have executable permission; please remove their executable permission by using `chmod -x`'; false)) (! find . \( -path ./third_party -o -path ./.git -o -path ./torch/bin -o -path ./build \) -prune -o -type f -executable -regextype posix-egrep -not -regex '.+(\.(bash|sh|py|so)|git-pre-commit|git-clang-format|gradlew)$' -print | grep . || (echo 'The above files have executable permission; please remove their executable permission by using `chmod -x`'; false))
- name: C++ docs check - name: C++ docs check
if: always() && steps.requirements.outcome == 'success'
run: | run: |
sudo apt-get install -y doxygen && pip install -r requirements.txt sudo apt-get install -y doxygen
cd docs/cpp/source && ./check-doxygen.sh cd docs/cpp/source && ./check-doxygen.sh
- name: CUDA kernel launch check - name: CUDA kernel launch check
if: always() && steps.requirements.outcome == 'success'
run: | run: |
set -eux set -eux
python torch/testing/check_kernel_launches.py |& tee ${GITHUB_WORKSPACE}/cuda_kernel_launch_checks.txt python torch/testing/check_kernel_launches.py |& tee "${GITHUB_WORKSPACE}"/cuda_kernel_launch_checks.txt
- name: Ensure no direct cub include
if: always()
run: |
(! git --no-pager grep -I -no $'#include <cub/' -- ./aten ':(exclude)aten/src/ATen/cuda/cub.cuh' || (echo "The above files have direct cub include; please include ATen/cuda/cub.cuh instead and wrap your cub calls in at::native namespace if necessary"; false))
py2-setup-validate-errormsg:
runs-on: ubuntu-18.04
steps:
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: 2.x
architecture: x64
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Attempt to run setup.py
run: |
python2 setup.py | grep "Python 2 has reached end-of-life and is no longer supported by PyTorch."
templates:
runs-on: ubuntu-18.04
steps:
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: 3.x
architecture: x64
- name: Install Jinja2
run: pip install Jinja2
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Regenerate workflows
run: .github/scripts/generate_linux_ci_workflows.py
- name: Assert that regenerating the workflows didn't change them
run: .github/scripts/report_git_status.sh
toc:
runs-on: ubuntu-18.04
# https://github.com/actions/virtual-environments/issues/599#issuecomment-602754687
env:
NPM_CONFIG_PREFIX: ~/.npm-global
steps:
- name: Setup Node
uses: actions/setup-node@v2
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Install markdown-toc
run: npm install -g markdown-toc
- name: Regenerate ToCs and check that they didn't change
run: |
set -eux
export PATH=~/.npm-global/bin:"$PATH"
for FILE in $(git grep -Il '<!-- toc -->' -- '**.md'); do
markdown-toc --bullets='-' -i "$FILE"
done
.github/scripts/report_git_status.sh
   flake8-py3:
     runs-on: ubuntu-18.04
     steps:
     - name: Setup Python
-      uses: actions/setup-python@v1
+      uses: actions/setup-python@v2
       with:
         python-version: 3.x
         architecture: x64
     - name: Fetch PyTorch
-      uses: actions/checkout@v1
-    - name: Checkout PR tip
-      run: |
-        set -eux
-        if [[ "${{ github.event_name }}" == "pull_request" ]]; then
-          # We are on a PR, so actions/checkout leaves us on a merge commit.
-          # Check out the actual tip of the branch.
-          git checkout ${{ github.event.pull_request.head.sha }}
-        fi
-        echo ::set-output name=commit_sha::$(git rev-parse HEAD)
-      id: get_pr_tip
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 2 # to allow us to use github.event.pull_request.head.sha
+    - name: Prepare output dir with HEAD commit SHA
+      env:
+        HEAD_SHA: ${{ github.event.pull_request.head.sha }}
+      run: |
+        mkdir flake8-output
+        cd flake8-output
+        echo "$HEAD_SHA" > commit-sha.txt
+    - name: Install dependencies
+      run: |
+        set -eux
+        pip install typing-extensions # for tools/translate_annotations.py
+        pip install -r requirements-flake8.txt
+        flake8 --version
     - name: Run flake8
       run: |
         set -eux
-        pip install -r requirements-flake8.txt
-        flake8 --version
-        flake8 | tee ${GITHUB_WORKSPACE}/flake8-output.txt
-    - name: Add annotations
-      uses: pytorch/add-annotations-github-action@master
-      with:
-        check_name: 'flake8-py3'
-        linter_output_path: 'flake8-output.txt'
-        commit_sha: ${{ steps.get_pr_tip.outputs.commit_sha }}
-        regex: '^(?<filename>.*?):(?<lineNumber>\d+):(?<columnNumber>\d+): (?<errorCode>\w\d+) (?<errorDesc>.*)'
-      env:
-        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        flake8 | tee "${GITHUB_WORKSPACE}"/flake8-output.txt
+    - name: Translate annotations
+      if: github.event_name == 'pull_request'
+      env:
+        HEAD_SHA: ${{ github.event.pull_request.head.sha }}
+      run: |
+        tools/translate_annotations.py \
+          --file="${GITHUB_WORKSPACE}"/flake8-output.txt \
+          --regex='^(?P<filename>.*?):(?P<lineNumber>\d+):(?P<columnNumber>\d+): (?P<errorCode>\w+\d+) (?P<errorDesc>.*)' \
+          --commit="$HEAD_SHA" \
+          > flake8-output/annotations.json
+    - name: Upload artifact
+      uses: actions/upload-artifact@v2
+      with:
+        name: flake8-py3
+        path: flake8-output/
+    - name: Fail if there were any warnings
+      run: |
+        set -eux
+        # Re-output flake8 status so GitHub logs show it on the step that actually failed
+        cat "${GITHUB_WORKSPACE}"/flake8-output.txt
+        [ ! -s "${GITHUB_WORKSPACE}"/flake8-output.txt ]
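Two details of the rewritten flake8 job are worth spelling out: the Translate annotations step parses flake8's `file:line:col: CODE message` output with the named-group regex above, and the final `[ ! -s ... ]` test succeeds only when the output file is empty, so any lint output fails the job. A quick illustration against a made-up flake8 line (assumes GNU grep for -P):

    # A fabricated flake8 line in the format the regex above expects.
    echo "torch/example.py:10:1: F401 'os' imported but unused" \
      | grep -P '^(.*?):(\d+):(\d+): (\w+\d+) (.*)$'
    # [ ! -s FILE ] is true iff FILE is empty, which is how the
    # "Fail if there were any warnings" step turns output into failure.
    touch empty.txt && [ ! -s empty.txt ] && echo "no warnings"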
   clang-tidy:
     if: github.event_name == 'pull_request'
     runs-on: ubuntu-18.04
     steps:
     - name: Setup Python
-      uses: actions/setup-python@v1
+      uses: actions/setup-python@v2
       with:
         python-version: 3.x
         architecture: x64
     - name: Checkout PyTorch
-      uses: actions/checkout@v1
-    - name: Checkout PR tip
-      run: |
-        set -eux
-        if [[ "${{ github.event_name }}" == "pull_request" ]]; then
-          # We are on a PR, so actions/checkout leaves us on a merge commit.
-          # Check out the actual tip of the branch.
-          git checkout ${{ github.event.pull_request.head.sha }}
-        fi
-        echo ::set-output name=commit_sha::$(git rev-parse HEAD)
-      id: get_pr_tip
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 0 # to allow tools/clang_tidy.py to do its thing
+    - name: Prepare output dir with HEAD commit SHA
+      env:
+        HEAD_SHA: ${{ github.event.pull_request.head.sha }}
+      run: |
+        mkdir clang-tidy-output
+        cd clang-tidy-output
+        echo "$HEAD_SHA" > commit-sha.txt
     - name: Install dependencies
       run: |
         set -eux
@@ -135,19 +256,17 @@ jobs:
         sudo apt-get update
         sudo apt-get install -y clang-tidy-11
         sudo update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-11 1000
-    - name: Run clang-tidy
+    - name: Generate build files
       run: |
         set -eux
         git remote add upstream https://github.com/pytorch/pytorch
         git fetch upstream "$GITHUB_BASE_REF"
-        BASE_SHA=${{ github.event.pull_request.base.sha }}
-        HEAD_SHA=${{ github.event.pull_request.head.sha }}
-        MERGE_BASE=$(git merge-base $BASE_SHA $HEAD_SHA)
         if [[ ! -d build ]]; then
           git submodule update --init --recursive
           export USE_NCCL=0
+          export USE_DEPLOY=1
           # We really only need compile_commands.json, so no need to build!
           time python setup.py --cmake-only build
@@ -162,6 +281,12 @@ jobs:
             --native-functions-path aten/src/ATen/native/native_functions.yaml \
             --nn-path aten/src
         fi
+    - name: Run clang-tidy
+      env:
+        BASE_SHA: ${{ github.event.pull_request.base.sha }}
+        HEAD_SHA: ${{ github.event.pull_request.head.sha }}
+      run: |
+        set -eux
         # Run Clang-Tidy
         # The negative filters below are to exclude files that include onnx_pb.h or
@@ -174,7 +299,7 @@ jobs:
         python tools/clang_tidy.py \
           --verbose \
           --paths torch/csrc/ \
-          --diff "$MERGE_BASE" \
+          --diff "$BASE_SHA" \
          -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
          -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp"\
          -g"-torch/csrc/jit/serialization/onnx.cpp" \
@@ -191,44 +316,67 @@ jobs:
          -g"-torch/csrc/deploy/interpreter/interpreter.h" \
          -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
          -g"-torch/csrc/deploy/interpreter/test_main.cpp" \
-          "$@" > ${GITHUB_WORKSPACE}/clang-tidy-output.txt
-        cat ${GITHUB_WORKSPACE}/clang-tidy-output.txt
-    - name: Add annotations
-      uses: suo/add-annotations-github-action@master
-      with:
-        check_name: 'clang-tidy'
-        linter_output_path: 'clang-tidy-output.txt'
-        commit_sha: ${{ steps.get_pr_tip.outputs.commit_sha }}
-        regex: '^(?<filename>.*?):(?<lineNumber>\d+):(?<columnNumber>\d+): (?<errorDesc>.*?) \[(?<errorCode>.*)\]'
-      env:
-        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          "$@" > "${GITHUB_WORKSPACE}"/clang-tidy-output.txt
+        cat "${GITHUB_WORKSPACE}"/clang-tidy-output.txt
+        tools/translate_annotations.py \
+          --file=clang-tidy-output.txt \
+          --regex='^(?P<filename>.*?):(?P<lineNumber>\d+):(?P<columnNumber>\d+): (?P<errorDesc>.*?) \[(?P<errorCode>.*)\]' \
+          --commit="$HEAD_SHA" \
+          > clang-tidy-output/annotations.json
+    - name: Upload artifact
+      uses: actions/upload-artifact@v2
+      with:
+        name: clang-tidy
+        path: clang-tidy-output/
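One behavioral note on the clang-tidy changes: the old step computed a merge base between the PR head and base and diffed against that, while the new step passes the base SHA straight to tools/clang_tidy.py (presumably viable because the fetch-depth: 0 checkout makes the full history available). The difference, illustrated with plain git and the same variables the workflow exports:

    # Old approach: diff against the merge base of the PR's head and base.
    MERGE_BASE=$(git merge-base "$BASE_SHA" "$HEAD_SHA")
    git diff --name-only "$MERGE_BASE"
    # New approach: diff directly against the PR base commit.
    git diff --name-only "$BASE_SHA"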
   cmakelint:
     runs-on: ubuntu-18.04
     steps:
     - name: Setup Python
-      uses: actions/setup-python@v1
+      uses: actions/setup-python@v2
       with:
         python-version: 3.x
         architecture: x64
     - name: Fetch PyTorch
-      uses: actions/checkout@v1
-    - name: Checkout PR tip
+      uses: actions/checkout@v2
+    - name: Install dependencies
       run: |
         set -eux
-        if [[ "${{ github.event_name }}" == "pull_request" ]]; then
-          # We are on a PR, so actions/checkout leaves us on a merge commit.
-          # Check out the actual tip of the branch.
-          git checkout ${{ github.event.pull_request.head.sha }}
-        fi
-        echo ::set-output name=commit_sha::$(git rev-parse HEAD)
-      id: get_pr_tip
-    - name: Run cmakelint
-      run: |
-        set -eux
         pip install cmakelint
         cmakelint --version
+    - name: Run cmakelint
+      run: |
+        set -eux
         git ls-files -z -- bootstrap '*.cmake' '*.cmake.in' '*CMakeLists.txt' | \
-          grep -E -z -v '^(cmake/Modules/|cmake/Modules_CUDA_fix/)' | \
+          grep -E -z -v '^(cmake/Modules/|cmake/Modules_CUDA_fix/|cmake/Caffe2Config.cmake.in|aten/src/ATen/ATenConfig.cmake.in|cmake/Caffe2ConfigVersion.cmake.in|cmake/TorchConfig.cmake.in|cmake/TorchConfigVersion.cmake.in|cmake/cmake_uninstall.cmake.in)' | \
           xargs -0 cmakelint --config=.cmakelintrc --spaces=2 --quiet
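The cmakelint pipeline above is NUL-delimited end to end (`-z` on git ls-files and grep, `-0` on xargs), which keeps filenames containing spaces or other unusual characters intact. A stripped-down standalone version, with illustrative file patterns:

    # List tracked CMake files NUL-separated, drop vendored modules,
    # and hand the survivors to cmakelint without shell word-splitting.
    git ls-files -z -- '*.cmake' '*CMakeLists.txt' \
      | grep -E -z -v '^cmake/Modules/' \
      | xargs -0 cmakelint --quiet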
+  mypy:
+    runs-on: ubuntu-18.04
+    steps:
+    - name: Setup Python
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+        architecture: x64
+    - name: Fetch PyTorch
+      uses: actions/checkout@v2
+    - name: Install dependencies
+      run: |
+        set -eux
+        pip install -r requirements.txt
+        pip install mypy==0.812
+        # Needed to check tools/render_junit.py
+        pip install junitparser rich
+    - name: Run autogen
+      run: |
+        set -eux
+        time python -mtools.generate_torch_version --is_debug=false
+        time python -mtools.codegen.gen -s aten/src/ATen -d build/aten/src/ATen
+        time python -mtools.pyi.gen_pyi --native-functions-path aten/src/ATen/native/native_functions.yaml --deprecated-functions-path "tools/autograd/deprecated.yaml"
+    - name: Run mypy
+      run: |
+        set -eux
+        for CONFIG in mypy*.ini; do mypy --config="$CONFIG"; done
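The mypy job runs the autogen step first, presumably because some of the checked files are generated, then runs one mypy pass per config file; with `set -eux` in effect, the first failing config aborts the step. The same loop works locally from the repo root, assuming the configs are present and a matching mypy is installed:

    # Run every mypy configuration the repo ships; set -e stops at the
    # first config that reports errors.
    set -e
    pip install mypy==0.812
    for CONFIG in mypy*.ini; do
      mypy --config="$CONFIG"
    done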

@@ -0,0 +1,22 @@
+name: Build PyTorch nightly Docker image and push to GitHub Container Registry
+on:
+  schedule:
+    # Push the nightly docker daily at 1 PM UTC
+    - cron: '0 13 * * *'
+  # Have the ability to trigger this job manually using the API as well
+  workflow_dispatch:
+jobs:
+  build-publish-docker:
+    if: ${{ github.repository_owner == 'pytorch' }}
+    runs-on: linux.2xlarge
+    env:
+      GHCR_PAT: ${{ secrets.GHCR_PAT }}
+    steps:
+    - name: Checkout
+      uses: actions/checkout@v2
+      with:
+        ref: master
+    - name: Build and upload nightly docker
+      run: |
+        bash .github/scripts/build_publish_nightly_docker.sh
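The actual build logic lives in .github/scripts/build_publish_nightly_docker.sh, which this diff does not show. A hypothetical sketch of what such a script typically does; the image name and tag are purely illustrative, and only GHCR_PAT is taken from the workflow above:

    #!/usr/bin/env bash
    # Hypothetical sketch only: the real script is not shown in this diff.
    set -eux
    # Image coordinates are made up for illustration.
    IMAGE=ghcr.io/pytorch/pytorch-nightly:latest
    docker build -t "$IMAGE" .
    # Authenticate to GitHub Container Registry with the PAT the job exports.
    echo "$GHCR_PAT" | docker login ghcr.io -u pytorch --password-stdin
    docker push "$IMAGE"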

Some files were not shown because too many files have changed in this diff.